Forum Moderators: phranque
I am working on a price comparison engine for a niche, let's say iPods.
The product feeds (CSV) provided by some of the affiliate programs are either outdated or don't contain all the product data that I need.
So I decided to use Ruby's Hpricot library to scrape the product pages of the affiliate programs that provide unsuitable feeds. This way, I can gather all the data I require.
I export the scraped data as CSV, for instance
iPod Video 20GB, $150
I have one scraper script for each merchant. Now, the problem is, the product title on the merchants' sites don't always match. Instead of 'iPod' (merchant 1), they might write 'IPod' (merchant 2), 'i Pod' (merchant 3), 'ipod' (merchant 4)...
Anyway, let's say this are two sample CSV files:
merchant1.csv
iPod Video 20GB, $150
iPod Video 80GB, $400
iPod nano 8GB, $100
iPod touch 60 GB, $300
merchant2.csv
I pod Video 20GB, $140
I pod Video 80GB, $390
I pod nano 8GB, $80
I pod touch 60 GB, $275
What I want to do is: create a price comparison for each model, e.g. one for Video 20GB, one for Video 80 GB, one for nano 4GB, one for nano 8GB and so on... As I said, this is just an example, and in reality, there are a whole lot more different combinations of product names and more merchants.
My questions:
1. Affiliate Links
I will probably save all affiliate links and an identifier in a MySQL table, then use a PHP jump script. This will be some manual work I suppose, generating all the aff links, writing them to a database, adding the IDs... But I only have to do it once, and it's very flexible if aff links change at a later date.
Example: www.example.com/links/jump.php?id=amazon-ipodvideo20
MySQL database: (ID:amazon-ipodvideo20 ¦ LINK:www.here-goes-the-aff-link.com)
Is that a good idea?
2. CSV to DATABASE, Combining data, unique IDs
Now this is where my head aches start. I was thinking about using one database table for each product. For instance, one table for the Video 20GB, one for the Video 60GB, one for the nano 4GB, one for the nano 8GB and so on.
I though about using an identifier like 'ipodvideo20' as a table name, then adding an extra field to all the CSV rows with 'iPod Video 20GB', 'IPOD video 20 GB' and so on for all the different merchants...
This way, at least I could write a script to add the correct product to the correct table.
What I haven't figured out is how to add "id"s: I thought about putting them to lowercase and deleting whitespaces, which in this example might actually work (e.g. 'iPod Video 20GB' -> ipodvideo20gb, 'IPOD video 20 GB' -> ipodvideo20gb), but not if the product name is 'iPod Video blue 20 Gigs'.
Any ideas or suggestions how I can combine scraped data from different sources into the correct database tables?
Cheers, Chris
I would also use numeric auto-increment IDs in almost every single table--it makes referencing rows much, MUCH easier, and can drastically improve retrieval speed.
As for getting the product name...
I would work around this by first of all creating a "product" entry in a products table. Have a "name" column that contains a lowercase string that you know will be in every identifier in all the titles--in your examples, "ipod". If multiple words must be in the product name, separate them by spaces.
Next, get the identifier of the product from the vendor. Remove all spaces from it, and convert it to lowercase. Explode (or split, depending on your language) your "name" column into an array (which is where the space separator came in--you can explode/split by this space character). Then, loop through this array and make sure each element in it is present in the identifier presented by the vendor.