Forum Moderators: open
1. The file doesn't validate to its DTD.
I'm using "xmllint" to verify, and it is indeed broken. Not the end of the world, as I'm running a couple "tr" commands to remove all kinds of funky characters and get the file in working order.
2. Getting the data into my database.
Currently, I'm using libxml's Sax parser to read the file via a stream, which keeps the memory footprint low. However, for each node, I'm having to do a read in my database to see if that product exits, then update that record or create a new one if it doesn't exist. This approach is unfortunately going to take hours (it's running 5 hours plus already.
Any suggestions? Could I do the process differently, or speed it up in any way?
Any feedback will be appreciated! Thanks!
the consensus is to use a Sax parser, and there really isn't a good way to multi-thread it as far as I can tell. splitting up the file into multiple files would be great, but even with a snazzy lexer, you'd still need to load up the file into memory, which isn't really an option.
so... I think I'm going to stay with the slow single-threaded approach (which it seems most people in this scenario are doing), and try to optimize my code as best as I can.
oh, and httpwebwitch, if you have any extremely fast cpu's in any idle machines laying around, let me know :P
First, use a specialized xml to sql parser to just upload the data into a new set of tables in your DB. Second, use a sql stored procedure to go through the newly uploaded data and update your live data.
There are several inexpensive XML to DB products out there that are very good at processing large amounts of XML into DBs very quickly.
There is unfortunately quite a lot of application logic hidden behind these entries, but I'll try.
Could you point me to any of the inexpensive XML -> DB products?
Instead of doing a READ + Update or Insert, just do an Insert, trap any primary key violations and do an update. that way you're doing 1, sometimes 2 calls to the DB instead of always doing 2 calls.
By faster I mean there is no comparison.
Also, don't even attempt to read all the data you don't need. It just takes up resources for parsing and temporary storage.
A caveat to my approach is that I'm versioning product records. There is logic to see if prices, sale prices, descriptions, etc have changed, and then doing the necessary versioning if so. That definitely adds some extra logic, but, some of the approaches above would still fit the bill.
Hah... XML. Coming from a Java background, you start to think XML is the only format to handle structured data ;)
XML does have some perks however, and I'm still going to use it, as the libraries and code used to parse the data ends up being a lot cleaner than using CSV/TAB/etc. Breaking out generic, re-usable bits of processing code in XML seems to be a little more elegant. Also, you'd assume that most third parties would have XML as an option, and I wouldn't have to switch contexts if they didn't provide CSV/TAB.
In the end, I need to run my processing code in production, as the machines are a lot beefier than my local laptop. It's a touchy process, and making sure you pay attention to the operations/memory allocation/etc routines is crucial. I already found one small allocation that was adding about 10 minutes to the processing :)
Thanks again everyone :)
But explaining how I would do this in Java might help you figure out.
First, make sure you're using connection pooling. If not, that can probably save you a lot of time.
As others have said, you should be able to do an insert or update type statement. If you're using oracle or pgsql you can easily create a function to handle this. Even better, send the whole, or chunked data into a function to do the processing all at once.
One other way I've been able to dramatically increase performance on these types of tasks is to use prepared statemtns and batchUpdate/inserts.
See this link.
[java2s.com...]
In the majority of cases, however, all XML does is add extra bytes (which means extra bandwidth and memory) and, by encouraging people to use necessarily bloated parsing libraries, uses far more CPU time.
You can solve the latter by writing your own parser but there's no way to work around the extra bytes.
I think it's time for me to stop now. It starts to offtopic.