homepage Welcome to WebmasterWorld Guest from
register, free tools, login, search, pro membership, help, library, announcements, recent posts, open posts,
Become a Pro Member

Home / Forums Index / Code, Content, and Presentation / Databases
Forum Library, Charter, Moderator: open

Databases Forum

sql server huge collection of ids - how to query?
trying to build a 'customers who bought this also bought this'

 10:00 am on Mar 29, 2009 (gmt 0)

I have a table in which I've stored ever product id that every session id has hit over the past year or so. There are about 50+ million records in it now - so much so that it won't do a simple count(*) query without timing out.
I'd like to do an intensive query whereby for each product I'd check which other products that session visited and find the top one, to produce a 'customers who bought this also bought this' list.

I know how to build the query, the only trouble is the data set is to big it always times out and uses up too much database resources, so I'm finding it impossible to use.

Does anyone have any suggestions what I can do?



 5:26 pm on Mar 29, 2009 (gmt 0)

Break it down into smaller queries eg by date or by ID number. Use this series of queries to populate another table, eg ProdID, OtherProdID, Score. Use this new table to pull the info for the product pages. Run a batch update overnight to keep the summary table up to date.


 4:54 pm on Mar 30, 2009 (gmt 0)

I agree with mattur. And make sure your tables are properly indexed.


 5:01 pm on Mar 30, 2009 (gmt 0)

Thanks. The trouble I'm having is that I can't pull out one product id. To do so I would have to use the query in question - in other words, I have to find all the sessions that have looked at product a, then select all the other products that those sessions have looked at. That in itself makes for a very hefty query that in most cases times out. Is there a standard way of going about building a 'customers who bought this also bought this' database that will scale beyond a few thousand products/records.


 11:38 pm on Apr 2, 2009 (gmt 0)

I've always found these queries to never work right, because someone might buy things in the same session that have no relation to each other.

For a true upsell, create a new table with mappings of products that you feel would be a good fit.


 8:38 am on Apr 3, 2009 (gmt 0)

probably a good idea thanks, but as I have many thousands of products I wanted to find a way to automate it.


 12:34 pm on Apr 3, 2009 (gmt 0)

I've always found these queries to never work right, because someone might buy things in the same session that have no relation to each other.

As you get more data, more useful patterns emerge outweighing the unrelated data.

musicales, what are you using to run the query and what is the timeout error? Perhaps if you post your table structure, indexes and current query someone will be able to suggest a solution.


 12:48 pm on Apr 3, 2009 (gmt 0)

OK the sql table is as follows:

CREATE TABLE [dbo].[related_products] (
[related_id] [int] IDENTITY (1, 1) NOT NULL ,
[product_Id] [int] NULL ,
[session_id] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[member_id] [int] NULL

CREATE INDEX [index_session_id] ON [dbo].[related_products]([session_id]) ON [PRIMARY]

CREATE INDEX [index_product] ON [dbo].[related_products]([product_Id]) ON [PRIMARY]

the query I use, for one individual prod_id, is:

select top 3 a.product_id,count(1) from related_products a where a.product_Id<> " & prod_id & " and a.session_id in (select session_id from related_products where product_Id= " & prod_id & " ) group by a.product_id order by count(1) desc

the idea of the query is to pull out every session_id for a particular product, then re-query finding the three most common other products viewed in those sessions.


 12:25 pm on Apr 4, 2009 (gmt 0)

OK, so you could try adding a new composite index using product_Id and session_id onto the table.

You could also try limiting the query by related_id, to do the calculation in batches eg:

COUNT(*) AS CountofProduct
related_products AS a
a.product_Id<>" & prod_id & "
AND a.session_id IN (SELECT session_id FROM related_products WHERE product_Id=" & prod_id & ")
[b]AND related_id <500000[/b]
GROUP BY a.product_Id

You would repeat the query for each product for related_id <500000, >499999 AND <1000000, etc. So you may want to do it for all related products instead of just the top 3.

Or try eliminating the subquery by using a join, something like:

Count(*) AS CountofProduct
FROM related_products
INNER JOIN related_products AS a
ON related_products.session_id = a.session_id
WHERE related_products.product_Id=" & prod_id & "
AND a.product_Id<>" & prod_id & "
GROUP BY a.product_Id

You could also limit this query by related_id.

With the new index, it looks like the inner join version of the query has the simplest execution plan. Both queries have fewer logical reads with the new index. If you limit by related_id you could add this to the index too. Try running the query in SQL Query Analyzer/Management Studio to optimise it further.

Once you've processed the data to date, archive it and then track sessions in a new i.e. smaller table, updating and archiving on a daily/monthly/whatever basis so you don't have to run big queries in future.

Global Options:
 top home search open messages active posts  

Home / Forums Index / Code, Content, and Presentation / Databases
rss feed

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
Home ¦ Free Tools ¦ Terms of Service ¦ Privacy Policy ¦ Report Problem ¦ About ¦ Library ¦ Newsletter
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved