


XML Datafeeds to Database

Speed, Integrity, and Such



5:40 am on Jan 15, 2009 (gmt 0)

5+ Year Member

I'm getting a huge product XML file (3 GB) from Commission Junction. My current issues are the following, and I would love any insight or suggestions on any of them:

1. The file doesn't validate to its DTD.

I'm using "xmllint" to validate, and the file is indeed broken. Not the end of the world, as I'm running a couple of "tr" commands to remove all kinds of funky characters and get the file into working order.

2. Getting the data into my database.

Currently, I'm using libxml's SAX parser to read the file via a stream, which keeps the memory footprint low. However, for each node, I have to do a read in my database to see if that product exists, then update that record or create a new one if it doesn't. This approach is unfortunately going to take hours (it's been running for over 5 hours already).
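
Roughly, the loop looks like this. This is only a sketch; the element names (product, sku, price) and the table/columns are placeholders, not the actual CJ schema:

    import java.io.FileInputStream;
    import java.sql.*;
    import javax.xml.stream.*;

    // Sketch of the current approach: stream the XML and, for every product,
    // SELECT to see if it exists, then UPDATE or INSERT. Element names and
    // table/column names are placeholders.
    public class FeedLoader {
        public static void main(String[] args) throws Exception {
            Connection db = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/feeds", "user", "pass");
            PreparedStatement find = db.prepareStatement("SELECT 1 FROM products WHERE sku = ?");
            PreparedStatement update = db.prepareStatement("UPDATE products SET price = ? WHERE sku = ?");
            PreparedStatement insert = db.prepareStatement("INSERT INTO products (sku, price) VALUES (?, ?)");

            XMLStreamReader xml = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream(args[0]));
            String sku = null, price = null, current = null;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    current = xml.getLocalName();
                } else if (event == XMLStreamConstants.CHARACTERS && current != null) {
                    if (current.equals("sku")) sku = xml.getText().trim();
                    if (current.equals("price")) price = xml.getText().trim();
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (xml.getLocalName().equals("product") && sku != null) {
                        // one read plus one write per product -- this is the slow part
                        find.setString(1, sku);
                        try (ResultSet rs = find.executeQuery()) {
                            if (rs.next()) {
                                update.setString(1, price);
                                update.setString(2, sku);
                                update.executeUpdate();
                            } else {
                                insert.setString(1, sku);
                                insert.setString(2, price);
                                insert.executeUpdate();
                            }
                        }
                        sku = null;
                        price = null;
                    }
                    current = null;
                }
            }
            db.close();
        }
    }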

Any suggestions? Could I do the process differently, or speed it up in any way?

Any feedback will be appreciated! Thanks!


7:14 am on Jan 15, 2009 (gmt 0)

5+ Year Member

Thinking out loud... but you could probably spawn a bunch of threads with their own parsers (they're cheap) and have each one parse N+x nodes. That way there's no overlap, and it gets done faster.



7:55 pm on Jan 15, 2009 (gmt 0)

WebmasterWorld Administrator httpwebwitch is a WebmasterWorld Top Contributor of All Time 10+ Year Member

oy, that could take a long long time.

funny you should mention this. I'm doing something similar right now, and I've been running it sporadically, about 8 hours at a time, for almost two weeks.

I wonder how I could speed that up too.

The slowest part is the read->exists?update:insert loop


10:51 pm on Jan 15, 2009 (gmt 0)

5+ Year Member

yeah, it's not necessarily a "speedy" process :)

the consensus is to use a SAX parser, and there really isn't a good way to multi-thread it as far as I can tell. splitting the file into multiple files would be great, but even with a snazzy lexer, you'd still need to load the whole file into memory, which isn't really an option.

so... I think I'm going to stay with the slow single-threaded approach (which it seems most people in this scenario are doing) and try to optimize my code as best I can.

oh, and httpwebwitch, if you have any extremely fast CPUs in any idle machines lying around, let me know :P


1:18 am on Jan 16, 2009 (gmt 0)

10+ Year Member

For speed, you need specialization. Break this project into two distinct tasks.

First, use a specialized XML-to-SQL parser to just upload the data into a new set of tables in your DB. Second, use a SQL stored procedure to go through the newly uploaded data and update your live data.

There are several inexpensive XML-to-DB products out there that are very good at processing large amounts of XML into databases very quickly.
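
For the second step, the reconciliation can be done with a couple of set-based statements rather than row by row. A minimal sketch, run here as plain SQL from JDBC rather than as an actual stored procedure; the table and column names (products_staging, products, sku, price) are made up:

    import java.sql.*;

    // Phase two of the staging approach: after bulk-loading the feed into a
    // staging table, reconcile it with the live table in two set-based statements.
    // All table and column names are assumptions.
    public class StagingMerge {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/feeds", "user", "pass");
                 Statement st = db.createStatement()) {
                db.setAutoCommit(false);
                // Update rows that already exist in the live table.
                st.executeUpdate(
                    "UPDATE products p SET price = s.price " +
                    "FROM products_staging s WHERE p.sku = s.sku");
                // Insert rows that don't exist yet.
                st.executeUpdate(
                    "INSERT INTO products (sku, price) " +
                    "SELECT s.sku, s.price FROM products_staging s " +
                    "WHERE NOT EXISTS (SELECT 1 FROM products p WHERE p.sku = s.sku)");
                db.commit();
            }
        }
    }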


2:21 am on Jan 16, 2009 (gmt 0)

5+ Year Member

Good point stajer... there may indeed be contention on the parser due to the database entries, or vice versa. I'll separate the tasks and also create a stored procedure (I'm using Postgres, FWIW) to see if I can optimize the entries.

There is unfortunately quite a lot of application logic hidden behind these entries, but I'll try.

Could you point me to any of the inexpensive XML -> DB products?


5:09 pm on Jan 16, 2009 (gmt 0)

10+ Year Member

Google "xml to sql" and you will see several products. I have used the N***x**s product with good results, but that is SQL Server based. There has got to be a postgres solution out there.


7:29 pm on Jan 16, 2009 (gmt 0)

5+ Year Member

Thanks stajer. Yes, Postgres has had some XML initiatives in the works for a while now, but they're not completed yet. They did just roll out native text search, which is outstandingly fast and customizable. I'm a fan :)


8:19 pm on Jan 16, 2009 (gmt 0)

10+ Year Member

I think the way the xml2sql programs work is to turn the XML into a series of T-SQL statements that are then executed on your DB. So they should be DB-agnostic as long as they can connect.


4:31 am on Jan 20, 2009 (gmt 0)

10+ Year Member

Maybe the first thing you want to do is ask CJ to provide a CSV feed instead of an XML feed :)


4:56 am on Jan 20, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

ya, if there's no need for structured data, CSV should be way faster than running it through an XML parser.

Instead of doing a read plus an update or insert, just do an insert, trap any primary key violations, and do an update. That way you're doing one, sometimes two, calls to the DB instead of always doing two.
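
A minimal sketch of that insert-first idea against Postgres (SQLState 23505 is its unique-violation code; the table and column names are made up):

    import java.sql.*;

    // "Insert first, trap the key violation, then update."
    // Assumes autocommit; inside a transaction Postgres would need a savepoint
    // before the INSERT. Table and column names are placeholders.
    public class InsertFirst {
        static void upsert(Connection db, String sku, String price) throws SQLException {
            try (PreparedStatement ins = db.prepareStatement(
                    "INSERT INTO products (sku, price) VALUES (?, ?)")) {
                ins.setString(1, sku);
                ins.setString(2, price);
                ins.executeUpdate();
                return;  // one round trip in the common case
            } catch (SQLException e) {
                if (!"23505".equals(e.getSQLState())) throw e;  // not a duplicate key
            }
            try (PreparedStatement upd = db.prepareStatement(
                    "UPDATE products SET price = ? WHERE sku = ?")) {
                upd.setString(1, price);
                upd.setString(2, sku);
                upd.executeUpdate();
            }
        }
    }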


5:02 am on Jan 20, 2009 (gmt 0)

10+ Year Member

My 2 cents:

Choose CSV instead of XML.

Use a C or Java library to parse the large CSV; it's much faster than a scripting language.

Try multiple threads (see the sketch below).
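
A rough sketch combining the CSV and multiple-threads suggestions: one thread reads the file, a few workers (each with its own connection) do the DB writes. The column layout, table name, and "EOF" stop signal are assumptions, not the actual CJ format, and the line split is naive (no quoted commas):

    import java.io.*;
    import java.sql.*;
    import java.util.concurrent.*;

    // One reader thread feeding a queue, several DB-writer threads draining it.
    public class CsvThreads {
        public static void main(String[] args) throws Exception {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(10000);
            int workers = 4;
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            for (int i = 0; i < workers; i++) {
                pool.submit(() -> {
                    try (Connection db = DriverManager.getConnection(
                            "jdbc:postgresql://localhost/feeds", "user", "pass");
                         PreparedStatement ins = db.prepareStatement(
                            "INSERT INTO products_staging (sku, price) VALUES (?, ?)")) {
                        String line;
                        while (!(line = queue.take()).equals("EOF")) {
                            String[] f = line.split(",", -1);  // naive CSV split
                            ins.setString(1, f[0]);
                            ins.setString(2, f[1]);
                            ins.executeUpdate();
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    return null;
                });
            }
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                for (String line; (line = in.readLine()) != null; ) {
                    queue.put(line);
                }
            }
            for (int i = 0; i < workers; i++) queue.put("EOF");  // one stop signal per worker
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }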


7:54 am on Jan 20, 2009 (gmt 0)

WebmasterWorld Senior Member swa66 is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Most XML can be parsed in e.g. PHP, Perl, ... without using XML libraries at all.
A regexp or two to extract the data you need, one entry at a time, typically works better and faster than trying to parse it properly with libraries.
Basically you can almost treat XML as if it were CSV.
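
A sketch of that regex approach: read the feed one <product>...</product> chunk at a time and pull out only the fields you care about. The element names are placeholders, and a real feed would also need entity decoding (&amp; and friends):

    import java.io.*;
    import java.util.regex.*;

    // Regex extraction with no XML library at all.
    public class RegexFeed {
        static final Pattern SKU = Pattern.compile("<sku>(.*?)</sku>", Pattern.DOTALL);
        static final Pattern PRICE = Pattern.compile("<price>(.*?)</price>", Pattern.DOTALL);

        public static void main(String[] args) throws IOException {
            StringBuilder buf = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                for (String line; (line = in.readLine()) != null; ) {
                    buf.append(line).append('\n');
                    // Process every complete record currently in the buffer.
                    for (int end; (end = buf.indexOf("</product>")) >= 0; ) {
                        String record = buf.substring(0, end);
                        Matcher s = SKU.matcher(record);
                        Matcher p = PRICE.matcher(record);
                        if (s.find() && p.find()) {
                            System.out.println(s.group(1).trim() + " -> " + p.group(1).trim());
                        }
                        buf.delete(0, end + "</product>".length());
                    }
                }
            }
        }
    }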


7:58 am on Jan 20, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Write your own parser. You probably don't need all of the data from the feed, so why extract all of it? Even if you do need all of it, using a handful of regular expressions (if you're creative you rarely even need more than one) to parse a feed is far faster than a full-blown XML parser.

By faster I mean there is no comparison.


7:59 am on Jan 20, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

What swa said... we were posting at the same time it seems :-)


8:27 am on Jan 20, 2009 (gmt 0)

5+ Year Member

CSVs are the daddy.


11:21 am on Jan 20, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

What about truncating the table and just overwriting everything? That saves you a DB-check for every item and allows you to just go ahead and push everything in there. Just make sure you're consistent in your product-IDs though (get them from the XML feed) or your additional stuff (reviews, stats, yadayada) might break.

Also, don't even attempt to read all the data you don't need. It just takes up resources for parsing and temporary storage.
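
A minimal sketch of the truncate-and-reload idea, done in one transaction so a failed import can simply roll back; the product ID comes from the feed itself so related tables keep pointing at the right rows. Table and column names are made up:

    import java.sql.*;

    // Wipe the table and push everything back in.
    public class Reload {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/feeds", "user", "pass")) {
                db.setAutoCommit(false);
                try (Statement st = db.createStatement()) {
                    st.executeUpdate("TRUNCATE products");
                }
                try (PreparedStatement ins = db.prepareStatement(
                        "INSERT INTO products (sku, price) VALUES (?, ?)")) {
                    // ... loop over the parsed feed here: ins.setString(...), ins.addBatch(),
                    // and ins.executeBatch() every few thousand rows ...
                }
                db.commit();
            }
        }
    }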


1:52 pm on Jan 20, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Agree with everyone. XML is wasted space. CSV is the way to go.


8:51 pm on Jan 22, 2009 (gmt 0)

5+ Year Member

Thanks for the great suggestions everyone, very informative.

A caveat to my approach is that I'm versioning product records. There is logic to see if prices, sale prices, descriptions, etc. have changed, and then to do the necessary versioning if so. That definitely adds some extra logic, but some of the approaches above would still fit the bill.
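
For what it's worth, here's a rough sketch of that versioning step on top of a staging table: before the merge, copy any row whose price or description is about to change into a history table. Every table and column name here is an assumption:

    import java.sql.*;

    // Version rows that are about to change, in one set-based statement.
    public class VersionStep {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/feeds", "user", "pass");
                 Statement st = db.createStatement()) {
                st.executeUpdate(
                    "INSERT INTO product_versions (sku, price, description, changed_at) " +
                    "SELECT p.sku, p.price, p.description, now() " +
                    "FROM products p JOIN products_staging s ON p.sku = s.sku " +
                    "WHERE p.price IS DISTINCT FROM s.price " +
                    "   OR p.description IS DISTINCT FROM s.description");
            }
        }
    }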

Hah... XML. Coming from a Java background, you start to think XML is the only format to handle structured data ;)

XML does have some perks, however, and I'm still going to use it, as the libraries and code used to parse the data end up being a lot cleaner than with CSV/TAB/etc. Breaking out generic, reusable bits of processing code for XML seems to be a little more elegant. Also, you'd assume that most third parties would have XML as an option, so I wouldn't have to switch contexts if they didn't provide CSV/TAB.

In the end, I need to run my processing code in production, as the machines are a lot beefier than my local laptop. It's a touchy process, and paying close attention to the operations, memory allocation, etc. is crucial. I already found one small allocation that was adding about 10 minutes to the processing :)

Thanks again everyone :)


11:56 am on Jan 26, 2009 (gmt 0)

10+ Year Member

I cannot see any benefit of XML over CSV for the CJ data feed.

All the info you need (price change, etc) can be easily obtained from CSV, too.

Anyway, it's up to you. :)


6:05 pm on Jan 26, 2009 (gmt 0)

5+ Year Member

You said you have a Java background, but if you're using xmllint it doesn't seem like you're using Java for this.

But explaining how I would do this in Java might help you figure it out.

First, make sure you're using connection pooling. If you're not, adding it can probably save you a lot of time.

As others have said, you should be able to do an insert-or-update type statement. If you're using Oracle or pgsql you can easily create a function to handle this. Even better, send the whole (or chunked) data into a function to do the processing all at once.

One other way I've been able to dramatically increase performance on these types of tasks is to use prepared statements and batch updates/inserts.

See this link.
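
For illustration, a minimal batch-insert sketch; the batch size, CSV layout, and table/column names are all assumptions:

    import java.io.*;
    import java.sql.*;

    // Reuse one prepared statement and send rows in batches instead of one
    // round trip per product.
    public class BatchLoad {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/feeds", "user", "pass");
                 PreparedStatement ins = db.prepareStatement(
                     "INSERT INTO products_staging (sku, price) VALUES (?, ?)");
                 BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                db.setAutoCommit(false);
                int n = 0;
                for (String line; (line = in.readLine()) != null; ) {
                    String[] f = line.split(",", -1);
                    ins.setString(1, f[0]);
                    ins.setString(2, f[1]);
                    ins.addBatch();
                    if (++n % 1000 == 0) ins.executeBatch();  // flush every 1000 rows
                }
                ins.executeBatch();
                db.commit();
            }
        }
    }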


7:44 pm on Jan 26, 2009 (gmt 0)

10+ Year Member

tntpower - XML is self-describing. You can write a custom function once that can import any XML into a DB without knowing any table or column names beforehand. It is also relational: within the XML itself it carries the relationships that are normally described with multiple tables in a DB. As a result it can handle very complex data much more easily than CSV can.
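
As a sketch of that self-describing point: a generic importer can discover the column names from the element names as it streams. This one just prints the INSERT it would build; the record element name "product" is an assumption:

    import java.io.FileInputStream;
    import java.util.*;
    import javax.xml.stream.*;

    // Discover columns from the XML itself, one record element at a time.
    public class GenericImport {
        public static void main(String[] args) throws Exception {
            XMLStreamReader xml = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream(args[0]));
            Map<String, String> row = new LinkedHashMap<>();
            String field = null;
            while (xml.hasNext()) {
                switch (xml.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        field = xml.getLocalName();
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (field != null && !xml.isWhiteSpace()) row.put(field, xml.getText().trim());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (xml.getLocalName().equals("product") && !row.isEmpty()) {
                            // The column list comes straight from the XML element names.
                            System.out.println("INSERT INTO product ("
                                    + String.join(", ", row.keySet()) + ") VALUES (...)");
                            row.clear();
                        }
                        field = null;
                        break;
                }
            }
        }
    }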


9:15 pm on Jan 26, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Staj, features like that are very useful for situations where you don't know anything about the data you're grabbing. I can't think of a practical web programming situation where that would be the case, but I have no doubt one exists :-)

In the majority of cases, however, all XML does is add extra bytes (which means extra bandwidth and memory) and, by encouraging people to use necessarily bloated parsing libraries, uses far more CPU time.

You can solve the latter by writing your own parser but there's no way to work around the extra bytes.


12:18 am on Jan 27, 2009 (gmt 0)

10+ Year Member

It all depends on the real-world app. Yes, XML contains valuable info that CSV doesn't (at least not directly). But back to this case: CSV is sufficient. It's like how you don't need to deploy JBoss and complex Java EE patterns to write a simple guest book; PHP can do that very decently with much less work.

I think it's time for me to stop now. This is starting to go off-topic.

