Updates to Data Feeds

How to not overwrite previous changes?


rmjvol

12:40 am on Dec 23, 2003 (gmt 0)

10+ Year Member



I've been dabbling with datafeeds from some merchants recently. Some are very clean and ready to use immediately, some are not.

In reviewing previous comments about data feeds, I've seen several people encourage tweaking the data before publishing it to your affiliate site. Spellchecking and removing extra characters/whitespace were mentioned. It can also be a benefit to edit the content or create additional content to go along with the feed. That's where I'm kind of stuck.

My plan for one particular feed is to edit the product names, edit the product descriptions, and edit the organization (it comes with some preset categories that aren't quite adequate). The generated pages use the product names as file names. Since that's one of the fields I'm editing, my concern is how best to keep these changes when the feed needs to be updated without 1) creating duplicate product pages, 2) trashing my old pages, or 3) requiring a lot of manual intervention.

Any suggestions?

Thanks, rmjvol

jomaxx

5:49 pm on Dec 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is one of the hardest parts of my job. What I do is compare the current datafeed with my master file, using the product ID:

1. If the product is in the master file but not the datafeed, I follow up for (possible) deletion from the master file.

2. I don't need to use variable fields such as price, so if the product is in both files, I take no action and continue to use the data in the master file.

3. If the product is in the datafeed but not the master file, I standardize the record and add it to the master file. Usually the number of records in this category is manageably small.
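
For anyone who'd rather script that compare than do it in a spreadsheet, here's a minimal sketch in Python. The file names and the tab-delimited, ID-in-first-column layout are my assumptions; adjust them to your feed:

import csv

def load_ids(path):
    # Return the set of product IDs in a tab-delimited file.
    with open(path, newline="") as f:
        return {row[0] for row in csv.reader(f, delimiter="\t") if row}

master_ids = load_ids("master.txt")
feed_ids = load_ids("datafeed.txt")

review_for_deletion = master_ids - feed_ids   # step 1: in master only
unchanged = master_ids & feed_ids             # step 2: keep master data
standardize_and_add = feed_ids - master_ids   # step 3: new in the feed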

If you want a couple of suggestions for software tools to help you accomplish the above, let me know.

rmjvol

6:27 pm on Dec 25, 2003 (gmt 0)

10+ Year Member



Thanks, jomaxx.

I'd appreciate anything you can pass along. Here or sticky, whatever's appropriate.

Tiebreaker

6:34 pm on Dec 25, 2003 (gmt 0)

10+ Year Member



I'd be interested in this too - I have been trying to figure out the best way of handling feeds that I have modified.

Can you post or sticky?

jomaxx

10:35 pm on Dec 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yeah, I got a sticky mail about this as well. The most important thing is to have a toolbox of software agents that you can use to attack different aspects of a problem...

PC Software required:
MS Excel: Sorting, extracting specific fields, data conversion, data calculations.
TextPad (shareware professional-quality text editor): Editing huge data files, sorting, changing text case, extracting all records matching certain search criteria, macros, etc. Probably several similar editors are just as good.
uniq.exe (Freeware DOS utility): Removes duplicate lines from a file and optionally counts the number of occurrences. Do a Google search on "uniq dos utility" and you should be able to find it.

Rather than being too specific, I'll just describe how I approach various problems. Taken together, they are sufficient for me to maintain my master file. It requires quite a bit of practice to streamline things perfectly, but the effort is worth it.

-> Extract the product ID's from the datafeed and from your master file:
You can probably get the datafeed into Excel at least, so just highlight and copy the product ID column, paste it into a text editor, and you're done.
Rather than use Excel, I prefer to use the great TextPad, and program a macro to do the same thing with a plain text file. Something like:
(a) jump to the beginning of the next product ID,
(b) highlight the line from that point back to the beginning,
(c) delete highlighted text,
(d) jump to the end of that product ID,
(e) highlight the line from that point to the end,
(f) delete highlighted text,
(g) repeat to end of file.
It really takes half a minute to set this up once you have memorized the relevant commands and shortcuts.
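
The same extraction is easy to script, too. A rough Python equivalent of that macro, assuming a tab-delimited file with the product ID in the first column:

import csv

# Copy just the product ID column into its own file.
with open("datafeed.txt", newline="") as src, open("ids.txt", "w") as dst:
    for row in csv.reader(src, delimiter="\t"):
        if row:
            dst.write(row[0] + "\n")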

-> Sort files:
Excel can do this by field.
TextPad can do this for text files by column. Works well up to a couple of hundred thousand lines.
I also have a DOS utility for even larger files, but it has a bug that causes lines to be dropped occasionally from very large files, so I don't really recommend it.

-> Eliminate duplicate entries:
Can Excel sort a column and remove duplicate entries? I wouldn't be surprised, but I don't know how.
I do this in TextPad, sorting a file containing only the product ID's and using the "delete duplicate lines" option.
You can also use the uniq.exe DOS utility.

-> Merge 2 files:
The DOS copy command can copy 2 files into 1.
I tend to open the files in TextPad and cut-and-paste into a third file.
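
All three of those steps (sort, de-duplicate, merge) also collapse into a few lines of Python, if that's more your style. The file names here are placeholders:

def read_lines(path):
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

# Merge the two ID files, then de-duplicate and sort in one go.
merged = read_lines("master_ids.txt") + read_lines("feed_ids.txt")
with open("merged_ids.txt", "w") as out:
    out.write("\n".join(sorted(set(merged))) + "\n")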

-> Find whether product ID's are in both files or not:
(Excel might be able to do this as well, but again I don't know how.)
Merge and sort two files consisting only of non-duplicated product ID's.
Then I use uniq.exe with the -c option in a DOS box like this:
uniq -c <infile.txt >outfile.txt
outfile.txt will contain one line per UNIQUE product ID, preceded by a field containing the number of times that ID occurred in the input file. In this situation, that number should be either 00001 or 00002.

Personally I use TextPad to sort outfile.txt and then delete all records preceded by 00002 (i.e., the product ID is in both files). The remaining ID's are in the master file OR the current datafeed, but not both.
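
A Python stand-in for that merge/sort/uniq -c step, counting how many of the two files each ID appears in (it assumes each file was de-duplicated first, as above):

from collections import Counter

def read_ids(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

counts = Counter(read_ids("master_ids.txt") + read_ids("feed_ids.txt"))
# IDs with a count of 1 are in the master file OR the datafeed, not both.
leftovers = sorted(pid for pid, n in counts.items() if n == 1)
with open("leftover_ids.txt", "w") as out:
    out.write("\n".join(leftovers) + "\n")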

You can then use a macro to extract these remaining product ID's from the master file records and datafeed records. For example, use TextPad to merge the two big files. Then copy the product ID's from outfile.txt to the TOP of the merged file. Then create a macro:
(a) go to top of file,
(b) right-arrow cursor to start of product ID field (if not in column 1),
(c) highlight from there to end of line,
(d) search on highlighted text (ctrl-F),
(e) move cursor to beginning of line,
(f) cut line,
(g) go to bottom of file,
(h) paste line,
(i) repeat N times.
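
And the scripted version of that extraction, for comparison. It pulls the full records whose IDs appear in the leftover list; the file names and the tab-delimited, ID-in-first-column layout are again assumptions:

import csv

with open("leftover_ids.txt") as f:
    wanted = {line.strip() for line in f if line.strip()}

# Keep only the records whose product ID is in the leftover list.
with open("merged_records.txt", newline="") as src, open("extracted.txt", "w") as dst:
    for row in csv.reader(src, delimiter="\t"):
        if row and row[0] in wanted:
            dst.write("\t".join(row) + "\n")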

Wow, that's long. Hope some of that is of use to someone. Seriously, I do all kinds of data manipulation every day -- updating database files, extracting info from log files, massaging reports -- and if I had to do all that manually I'd need an army of Agent Smiths to handle the workload.

Trisha

7:13 pm on Dec 26, 2003 (gmt 0)

10+ Year Member



Thanks for the information, jomaxx! I'll have to read through that again; a lot of it was over my head the first time!

I'm not really good at manipulating Excel-type files. My husband is actually better at that than me; I may have to get him to help me!

I've considered creating some sort of PHP script to do this type of stuff, but I'm really not that good at any sort of programming and this may be way over my head.

What I would ideally like is something that would check each line in the database by the product field and then do one of three things:

1 - If the product is new, add it to the database. Then list all new products added so I can edit information about them if needed.

2 - If a product is no longer available, give me the option to delete or edit it. That way, if I'm creating an individual page for each product and it has already been indexed by a SE, it won't just disappear. I could edit the information, add a link to a similar product, and meanwhile add a 'noindex, nofollow' line as well.

3 - Automatically update (or notify me of) certain fields for existing products when the information in one or more fields changes. For example, a price change could be applied automatically, but for other fields I may want to edit the information first.

I have no idea if this is possible or practical, or if I am capable of doing it, or how long it would take me if I am capable of it. I may try fiddling around with it sometime in the next few months.
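
For what it's worth, the three cases described above fit in a fairly short script. Here's a rough sketch in Python (the "id" and "price" field names, the tab-delimited layout with a header row, and the file names are all assumptions, so treat it as an outline rather than working code for any particular feed):

import csv

def load(path):
    # Index each record by its product ID.
    with open(path, newline="") as f:
        return {row["id"]: row for row in csv.DictReader(f, delimiter="\t")}

master = load("master.txt")
feed = load("feed.txt")

new_products = [pid for pid in feed if pid not in master]   # case 1: add, then list for editing
discontinued = [pid for pid in master if pid not in feed]   # case 2: review, don't auto-delete
for pid in master.keys() & feed.keys():                     # case 3: existing products
    if master[pid]["price"] != feed[pid]["price"]:
        # Price updates automatically; other fields could be flagged for review instead.
        master[pid]["price"] = feed[pid]["price"]

print("New products to edit:", new_products)
print("Products to review for deletion:", discontinued)
# Writing the updated master back out is the remaining step.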

jomaxx

7:25 pm on Dec 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Custom programming is also a great idea if you have one specific task to accomplish frequently, though I don't know enough PHP to comment on that part.

The advantages of the kind of tools listed above are:
(1) They may be more time-efficient for tasks that are performed once a month or less.
(2) You can mix and match them to accomplish zillions of ad hoc activities.
(3) You tend to learn quite a bit about the data, and identify a lot of data integrity problems, when doing these tasks manually. The poor data quality in most large databases is and always has been my number 1 bugaboo (or is it bugbear?).