


XML Datafeeds to Database

Speed, Integrity, and Such



5:40 am on Jan 15, 2009 (gmt 0)

5+ Year Member

I'm getting a huge product XML file (3 GB) from Commission Junction. My current issues are the following, and I would love any insight or suggestions on any of them:

1. The file doesn't validate to its DTD.

I'm using "xmllint" to validate, and the file is indeed broken. Not the end of the world, as I'm running a couple of "tr" commands to remove all kinds of funky characters and get the file into working order.

2. Getting the data into my database.

Currently, I'm using libxml's SAX parser to read the file via a stream, which keeps the memory footprint low. However, for each node, I have to do a read in my database to see if that product exists, then update that record or create a new one if it doesn't. This approach is unfortunately going to take hours (it's been running for over 5 hours already).
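
Roughly, the loop looks like this. This is only a sketch; the element names (product, sku, price) and the table/columns are placeholders, not the actual CJ schema:

    import java.io.FileInputStream;
    import java.sql.*;
    import javax.xml.stream.*;

    // Sketch of the current approach: stream the XML and, for every product,
    // SELECT to see if it exists, then UPDATE or INSERT. Element names and
    // table/column names are placeholders.
    public class FeedLoader {
        public static void main(String[] args) throws Exception {
            Connection db = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/feeds", "user", "pass");
            PreparedStatement find = db.prepareStatement("SELECT 1 FROM products WHERE sku = ?");
            PreparedStatement update = db.prepareStatement("UPDATE products SET price = ? WHERE sku = ?");
            PreparedStatement insert = db.prepareStatement("INSERT INTO products (sku, price) VALUES (?, ?)");

            XMLStreamReader xml = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream(args[0]));
            String sku = null, price = null, current = null;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    current = xml.getLocalName();
                } else if (event == XMLStreamConstants.CHARACTERS && current != null) {
                    if (current.equals("sku")) sku = xml.getText().trim();
                    if (current.equals("price")) price = xml.getText().trim();
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (xml.getLocalName().equals("product") && sku != null) {
                        // one read plus one write per product -- this is the slow part
                        find.setString(1, sku);
                        try (ResultSet rs = find.executeQuery()) {
                            if (rs.next()) {
                                update.setString(1, price);
                                update.setString(2, sku);
                                update.executeUpdate();
                            } else {
                                insert.setString(1, sku);
                                insert.setString(2, price);
                                insert.executeUpdate();
                            }
                        }
                        sku = null;
                        price = null;
                    }
                    current = null;
                }
            }
            db.close();
        }
    }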

Any suggestions? Could I do the process differently, or speed it up in any way?

Any feedback will be appreciated! Thanks!


7:14 am on Jan 15, 2009 (gmt 0)

5+ Year Member

Thinking out loud... but you could probably spawn a bunch of threads with their own parsers (they're cheap) and have each one parse N+x nodes. That way there's no overlap, and it gets done faster.



7:55 pm on Jan 15, 2009 (gmt 0)

WebmasterWorld Administrator httpwebwitch is a WebmasterWorld Top Contributor of All Time 10+ Year Member

oy, that could take a long long time.

funny you should mention this. I'm doing something similar right now, and I've been running it sporadically, about 8 hours at a time, for almost two weeks.

I wonder how I could speed that up too.

The slowest part is the read->exists?update:insert loop


10:51 pm on Jan 15, 2009 (gmt 0)

5+ Year Member

yeah, it's not necessarily a "speedy" process :)

the consensus is to use a SAX parser, and there really isn't a good way to multi-thread it as far as I can tell. splitting the file into multiple files would be great, but even with a snazzy lexer, you'd still need to load the whole file into memory, which isn't really an option.

so... I think I'm going to stay with the slow single-threaded approach (which it seems most people in this scenario are doing) and try to optimize my code as best I can.

oh, and httpwebwitch, if you have any extremely fast CPUs in any idle machines lying around, let me know :P


1:18 am on Jan 16, 2009 (gmt 0)

10+ Year Member

For speed, you need specialization. Break this project into two distinct tasks.

First, use a specialized XML-to-SQL parser to just upload the data into a new set of tables in your DB. Second, use a SQL stored procedure to go through the newly uploaded data and update your live data.

There are several inexpensive XML-to-DB products out there that are very good at processing large amounts of XML into databases very quickly.
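
For the second step, the reconciliation can be done with a couple of set-based statements rather than row by row. A minimal sketch, run here as plain SQL from JDBC rather than as an actual stored procedure; the table and column names (products_staging, products, sku, price) are made up:

    import java.sql.*;

    // Phase two of the staging approach: after bulk-loading the feed into a
    // staging table, reconcile it with the live table in two set-based statements.
    // All table and column names are assumptions.
    public class StagingMerge {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/feeds", "user", "pass");
                 Statement st = db.createStatement()) {
                db.setAutoCommit(false);
                // Update rows that already exist in the live table.
                st.executeUpdate(
                    "UPDATE products p SET price = s.price " +
                    "FROM products_staging s WHERE p.sku = s.sku");
                // Insert rows that don't exist yet.
                st.executeUpdate(
                    "INSERT INTO products (sku, price) " +
                    "SELECT s.sku, s.price FROM products_staging s " +
                    "WHERE NOT EXISTS (SELECT 1 FROM products p WHERE p.sku = s.sku)");
                db.commit();
            }
        }
    }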


2:21 am on Jan 16, 2009 (gmt 0)

5+ Year Member

Good point stajer... there may indeed be contention on the parser due to the database entries, or vice versa. I'll separate the tasks and also create a stored procedure (I'm using Postgres, FWIW) to see if I can optimize the entries.

There is unfortunately quite a lot of application logic hidden behind these entries, but I'll try.

Could you point me to any of the inexpensive XML -> DB products?


5:09 pm on Jan 16, 2009 (gmt 0)

10+ Year Member

Google "xml to sql" and you will see several products. I have used the N***x**s product with good results, but that is SQL Server based. There has got to be a postgres solution out there.


7:29 pm on Jan 16, 2009 (gmt 0)

5+ Year Member

Thanks stajer. Yes, Postgres has had some XML initiatives in the works for a while now, but they're not completed yet. They did just roll out native text search, which is outstandingly fast and customizable. I'm a fan :)


8:19 pm on Jan 16, 2009 (gmt 0)

10+ Year Member

I think the way the xml2sql programs work is to turn the XML into a series of T-SQL statements that are then executed on your DB. So they should be DB-agnostic as long as they can connect.


4:31 am on Jan 20, 2009 (gmt 0)

10+ Year Member

Maybe the first thing you want to do is ask CJ to provide a CSV feed instead of an XML feed :)


4:56 am on Jan 20, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

ya, if there's no need for structured data, CSV should be way faster than running it through an XML parser.

Instead of doing a read plus an update or insert, just do an insert, trap any primary key violations, and do an update. That way you're doing one, sometimes two, calls to the DB instead of always doing two.
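
A minimal sketch of that insert-first idea against Postgres (SQLState 23505 is its unique-violation code; the table and column names are made up):

    import java.sql.*;

    // "Insert first, trap the key violation, then update."
    // Assumes autocommit; inside a transaction Postgres would need a savepoint
    // before the INSERT. Table and column names are placeholders.
    public class InsertFirst {
        static void upsert(Connection db, String sku, String price) throws SQLException {
            try (PreparedStatement ins = db.prepareStatement(
                    "INSERT INTO products (sku, price) VALUES (?, ?)")) {
                ins.setString(1, sku);
                ins.setString(2, price);
                ins.executeUpdate();
                return;  // one round trip in the common case
            } catch (SQLException e) {
                if (!"23505".equals(e.getSQLState())) throw e;  // not a duplicate key
            }
            try (PreparedStatement upd = db.prepareStatement(
                    "UPDATE products SET price = ? WHERE sku = ?")) {
                upd.setString(1, price);
                upd.setString(2, sku);
                upd.executeUpdate();
            }
        }
    }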


5:02 am on Jan 20, 2009 (gmt 0)

10+ Year Member

My 2 cents:

Choose CSV instead of XML.

Use a C or Java library to parse the large CSV; it's much faster than a scripting language.

Try multiple threads (see the sketch below).
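
A rough sketch combining the CSV and multiple-threads suggestions: one thread reads the file, a few workers (each with its own connection) do the DB writes. The column layout, table name, and "EOF" stop signal are assumptions, not the actual CJ format, and the line split is naive (no quoted commas):

    import java.io.*;
    import java.sql.*;
    import java.util.concurrent.*;

    // One reader thread feeding a queue, several DB-writer threads draining it.
    public class CsvThreads {
        public static void main(String[] args) throws Exception {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(10000);
            int workers = 4;
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            for (int i = 0; i < workers; i++) {
                pool.submit(() -> {
                    try (Connection db = DriverManager.getConnection(
                            "jdbc:postgresql://localhost/feeds", "user", "pass");
                         PreparedStatement ins = db.prepareStatement(
                            "INSERT INTO products_staging (sku, price) VALUES (?, ?)")) {
                        String line;
                        while (!(line = queue.take()).equals("EOF")) {
                            String[] f = line.split(",", -1);  // naive CSV split
                            ins.setString(1, f[0]);
                            ins.setString(2, f[1]);
                            ins.executeUpdate();
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    return null;
                });
            }
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                for (String line; (line = in.readLine()) != null; ) {
                    queue.put(line);
                }
            }
            for (int i = 0; i < workers; i++) queue.put("EOF");  // one stop signal per worker
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }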


7:54 am on Jan 20, 2009 (gmt 0)

WebmasterWorld Senior Member swa66 is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Most XML can be parsed in e.g. PHP, Perl, ... without using XML libraries at all.
A regexp or two to extract the data you need, one entry at a time, typically works better and faster than trying to parse it properly with libraries.
Basically you can almost treat XML as if it were CSV.
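
A sketch of that regex approach: read the feed one <product>...</product> chunk at a time and pull out only the fields you care about. The element names are placeholders, and a real feed would also need entity decoding (&amp; and friends):

    import java.io.*;
    import java.util.regex.*;

    // Regex extraction with no XML library at all.
    public class RegexFeed {
        static final Pattern SKU = Pattern.compile("<sku>(.*?)</sku>", Pattern.DOTALL);
        static final Pattern PRICE = Pattern.compile("<price>(.*?)</price>", Pattern.DOTALL);

        public static void main(String[] args) throws IOException {
            StringBuilder buf = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                for (String line; (line = in.readLine()) != null; ) {
                    buf.append(line).append('\n');
                    // Process every complete record currently in the buffer.
                    for (int end; (end = buf.indexOf("</product>")) >= 0; ) {
                        String record = buf.substring(0, end);
                        Matcher s = SKU.matcher(record);
                        Matcher p = PRICE.matcher(record);
                        if (s.find() && p.find()) {
                            System.out.println(s.group(1).trim() + " -> " + p.group(1).trim());
                        }
                        buf.delete(0, end + "</product>".length());
                    }
                }
            }
        }
    }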


7:58 am on Jan 20, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Write your own parser. You probably don't need all of the data from the feed, so why extract all of it? Even if you do need all of it, using a handful of regular expressions (if you're creative you rarely even need more than one) to parse a feed is far faster than a full-blown XML parser.

By faster I mean there is no comparison.


7:59 am on Jan 20, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

What swa said... we were posting at the same time it seems :-)


8:27 am on Jan 20, 2009 (gmt 0)

5+ Year Member

CSVs are the daddy.


11:21 am on Jan 20, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

What about truncating the table and just overwriting everything? That saves you a DB-check for every item and allows you to just go ahead and push everything in there. Just make sure you're consistent in your product-IDs though (get them from the XML feed) or your additional stuff (reviews, stats, yadayada) might break.

Also, don't even attempt to read all the data you don't need. It just takes up resources for parsing and temporary storage.
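
A minimal sketch of the truncate-and-reload idea, done in one transaction so a failed import can simply roll back; the product ID comes from the feed itself so related tables keep pointing at the right rows. Table and column names are made up:

    import java.sql.*;

    // Wipe the table and push everything back in.
    public class Reload {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/feeds", "user", "pass")) {
                db.setAutoCommit(false);
                try (Statement st = db.createStatement()) {
                    st.executeUpdate("TRUNCATE products");
                }
                try (PreparedStatement ins = db.prepareStatement(
                        "INSERT INTO products (sku, price) VALUES (?, ?)")) {
                    // ... loop over the parsed feed here: ins.setString(...), ins.addBatch(),
                    // and ins.executeBatch() every few thousand rows ...
                }
                db.commit();
            }
        }
    }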


1:52 pm on Jan 20, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Agree with everyone. XML is wasted space. CSV is the way to go.


8:51 pm on Jan 22, 2009 (gmt 0)

5+ Year Member

Thanks for the great suggestions everyone, very informative.

A caveat to my approach is that I'm versioning product records. There is logic to see if prices, sale prices, descriptions, etc. have changed, and then to do the necessary versioning if so. That definitely adds some extra logic, but some of the approaches above would still fit the bill.
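
For what it's worth, here's a rough sketch of that versioning step on top of a staging table: before the merge, copy any row whose price or description is about to change into a history table. Every table and column name here is an assumption:

    import java.sql.*;

    // Version rows that are about to change, in one set-based statement.
    public class VersionStep {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/feeds", "user", "pass");
                 Statement st = db.createStatement()) {
                st.executeUpdate(
                    "INSERT INTO product_versions (sku, price, description, changed_at) " +
                    "SELECT p.sku, p.price, p.description, now() " +
                    "FROM products p JOIN products_staging s ON p.sku = s.sku " +
                    "WHERE p.price IS DISTINCT FROM s.price " +
                    "   OR p.description IS DISTINCT FROM s.description");
            }
        }
    }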

Hah... XML. Coming from a Java background, you start to think XML is the only format to handle structured data ;)

XML does have some perks, however, and I'm still going to use it, as the libraries and code used to parse the data end up being a lot cleaner than with CSV/TAB/etc. Breaking out generic, reusable bits of processing code for XML seems to be a little more elegant. Also, you'd assume that most third parties would have XML as an option, so I wouldn't have to switch contexts if they didn't provide CSV/TAB.

In the end, I need to run my processing code in production, as the machines are a lot beefier than my local laptop. It's a touchy process, and paying close attention to the operations, memory allocation, etc. is crucial. I already found one small allocation that was adding about 10 minutes to the processing :)

Thanks again everyone :)


11:56 am on Jan 26, 2009 (gmt 0)

10+ Year Member

I cannot see any benefit of XML over CSV for the CJ data feed.

All the info you need (price change, etc) can be easily obtained from CSV, too.

Anyway, it's up to you. :)


6:05 pm on Jan 26, 2009 (gmt 0)

5+ Year Member

You said you have a Java background, but if you're using xmllint it doesn't seem like you're using Java for this.

But explaining how I would do this in Java might help you figure it out.

First, make sure you're using connection pooling. If you're not, adding it can probably save you a lot of time.

As others have said, you should be able to do an insert-or-update type statement. If you're using Oracle or pgsql you can easily create a function to handle this. Even better, send the whole (or chunked) data into a function to do the processing all at once.

One other way I've been able to dramatically increase performance on these types of tasks is to use prepared statements and batch updates/inserts.

See this link.
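
For illustration, a minimal batch-insert sketch; the batch size, CSV layout, and table/column names are all assumptions:

    import java.io.*;
    import java.sql.*;

    // Reuse one prepared statement and send rows in batches instead of one
    // round trip per product.
    public class BatchLoad {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/feeds", "user", "pass");
                 PreparedStatement ins = db.prepareStatement(
                     "INSERT INTO products_staging (sku, price) VALUES (?, ?)");
                 BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                db.setAutoCommit(false);
                int n = 0;
                for (String line; (line = in.readLine()) != null; ) {
                    String[] f = line.split(",", -1);
                    ins.setString(1, f[0]);
                    ins.setString(2, f[1]);
                    ins.addBatch();
                    if (++n % 1000 == 0) ins.executeBatch();  // flush every 1000 rows
                }
                ins.executeBatch();
                db.commit();
            }
        }
    }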


7:44 pm on Jan 26, 2009 (gmt 0)

10+ Year Member

tntpower - XML is self-describing. You can write a custom function once that can import any XML into a DB without knowing any table or column names beforehand. It is also relational: within the XML itself it carries the relationships that are normally described with multiple tables in a DB. As a result it can handle very complex data much more easily than CSV can.
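
As a sketch of that self-describing point: a generic importer can discover the column names from the element names as it streams. This one just prints the INSERT it would build; the record element name "product" is an assumption:

    import java.io.FileInputStream;
    import java.util.*;
    import javax.xml.stream.*;

    // Discover columns from the XML itself, one record element at a time.
    public class GenericImport {
        public static void main(String[] args) throws Exception {
            XMLStreamReader xml = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream(args[0]));
            Map<String, String> row = new LinkedHashMap<>();
            String field = null;
            while (xml.hasNext()) {
                switch (xml.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        field = xml.getLocalName();
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (field != null && !xml.isWhiteSpace()) row.put(field, xml.getText().trim());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (xml.getLocalName().equals("product") && !row.isEmpty()) {
                            // The column list comes straight from the XML element names.
                            System.out.println("INSERT INTO product ("
                                    + String.join(", ", row.keySet()) + ") VALUES (...)");
                            row.clear();
                        }
                        field = null;
                        break;
                }
            }
        }
    }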


9:15 pm on Jan 26, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Staj, features like that are very useful for situations where you don't know anything about the data you're grabbing. I can't think of a practical web programming situation where that would be the case, but I have no doubt one exists :-)

In the majority of cases, however, all XML does is add extra bytes (which means extra bandwidth and memory) and, by encouraging people to use necessarily bloated parsing libraries, uses far more CPU time.

You can solve the latter by writing your own parser but there's no way to work around the extra bytes.


12:18 am on Jan 27, 2009 (gmt 0)

10+ Year Member

It all depends on the real-world app. Yes, XML contains valuable info that CSV doesn't (at least not directly). But back to this case: CSV is sufficient. It's like how you don't need to deploy JBoss and complex Java EE patterns to write a simple guest book; PHP can do that very decently with much less work.

I think it's time for me to stop now. This is starting to go off-topic.

