Combining scraped data from different sources

Hi there,

I am working on a price comparison engine for a niche, let's say iPods.

The product feeds (CSV) provided by some of the affiliate programs are either outdated or don't contain all the product data that I need.

So I decided to use Ruby's Hpricot library to scrape the product pages of the affiliate programs that provide unsuitable feeds. This way, I can gather all the data I require.

I export the scraped data as CSV, for instance

iPod Video 20GB, $150

I have one scraper script for each merchant. Now, the problem is, the product title on the merchants' sites don't always match. Instead of 'iPod' (merchant 1), they might write 'IPod' (merchant 2), 'i Pod' (merchant 3), 'ipod' (merchant 4)...

Anyway, let's say this are two sample CSV files:

merchant1.csv
iPod Video 20GB, $150
iPod Video 80GB, $400
iPod nano 8GB, $100
iPod touch 60 GB, $300

merchant2.csv
I pod Video 20GB, $140
I pod Video 80GB, $390
I pod nano 8GB, $80
I pod touch 60 GB, $275

What I want to do is: create a price comparison for each model, e.g. one for Video 20GB, one for Video 80 GB, one for nano 4GB, one for nano 8GB and so on... As I said, this is just an example, and in reality, there are a whole lot more different combinations of product names and more merchants.

My questions:

1. Affiliate Links
I will probably save all affiliate links and an identifier in a MySQL table, then use a PHP jump script. This will be some manual work I suppose, generating all the aff links, writing them to a database, adding the IDs... But I only have to do it once, and it's very flexible if aff links change at a later date.
Example: www.example.com/links/jump.php?id=amazon-ipodvideo20
MySQL database: (ID:amazon-ipodvideo20 Ś LINK:www.here-goes-the-aff-link.com)

Is that a good idea?

2. CSV to DATABASE, Combining data, unique IDs
Now this is where my head aches start. I was thinking about using one database table for each product. For instance, one table for the Video 20GB, one for the Video 60GB, one for the nano 4GB, one for the nano 8GB and so on.

I though about using an identifier like 'ipodvideo20' as a table name, then adding an extra field to all the CSV rows with 'iPod Video 20GB', 'IPOD video 20 GB' and so on for all the different merchants...

This way, at least I could write a script to add the correct product to the correct table.

What I haven't figured out is how to add "id"s: I thought about putting them to lowercase and deleting whitespaces, which in this example might actually work (e.g. 'iPod Video 20GB' -> ipodvideo20gb, 'IPOD video 20 GB' -> ipodvideo20gb), but not if the product name is 'iPod Video blue 20 Gigs'.

Any ideas or suggestions how I can combine scraped data from different sources into the correct database tables?

Cheers, Chris

Combining scraped data from different sources

Lupi

WesleyC

Join The Conversation

Moderators and Top Contributors

Hot Threads This Week