Processing a great big XML doc

Robber

4:31 pm on Jan 9, 2003 (gmt 0)

The story so far: I have knocked up a Perl script that processes an XML doc and creates a whole site of static HTML pages from it.

I have tested it on an XML doc of 300K and on one of 1.5 MB. But the live XML doc is going to be about 16 MB, and that could take a while to process.

I am using the XPath module as well, so the whole XML doc needs to be stored in memory so it can be traversed using XPath. It therefore slows down quite a bit as the file size increases.

To speed things up I figured it might be worth breaking it down into chunks, which would mean less stored in memory. The trouble is, though, the XML wouldn't be well formed if I didn't process the whole lot at once.

Anyone got any suggestions on how I can speed things up? This is going to chew up a load of processing power.

Thanks

andreasfriedrich

4:41 pm on Jan 9, 2003 (gmt 0)

I believe there are some XML "twig" modules at CPAN that only keep a little twig of the whole tree in memory.

Andreas

Robber

4:51 pm on Jan 9, 2003 (gmt 0)

Thanks for the tip, I will check it out.

andreasfriedrich

5:25 pm on Jan 9, 2003 (gmt 0)

Please let me know what you find and how it works for you. I was interested in using them as well but couldn't be bothered to rewrite the software. I bought 128 MB additional RAM instead ;) But it would still be nice to know for future reference.

Andreas

Robber

5:40 pm on Jan 9, 2003 (gmt 0)

I'll let you know what I find.

Just did another run on the 1.5 MB XML doc with the original script. Going on the fact that I had time to have my dinner while it processed it (and it hadn't finished when I got back), I think I might need to use twig when I go on to the 16 MB document. Either that or another 512 MB of RAM and an extra couple of days to sit and watch it!

Robber

5:51 pm on Jan 9, 2003 (gmt 0)

After a quick look, here is what it seems to be about:

If you only need to process some of the XML doc then twig should be handy, as it means you only have to load that chunk into memory. I think the authors are trying to make it so you can navigate this "twig" of the XML doc using an XPath subset.
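To make the "one twig at a time" idea concrete, here is a rough sketch. It is in Python with the standard library's `xml.etree.ElementTree.iterparse`, not the Perl twig module itself, so treat it as an analogy and not as that module's API. The element names (`catalogue`, `product`, `name`, `price`) and the helper `iter_products` are made up for illustration, since the thread never describes the real document's structure.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: the real document's element names are not known.
SAMPLE = b"""<catalogue>
  <product id="1"><name>Widget</name><price>9.99</price></product>
  <product id="2"><name>Gadget</name><price>19.50</price></product>
  <product id="3"><name>Gizmo</name><price>4.25</price></product>
</catalogue>"""


def iter_products(source):
    """Yield (id, name, price) for each <product>, one element at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "product":
            # The finished <product> is a small subtree ("twig"); ElementTree's
            # limited XPath-style lookups work on it.
            record = (elem.get("id"), elem.findtext("name"),
                      float(elem.findtext("./price")))
            yield record
            # Drop the processed subtree so memory use stays roughly flat.
            root.clear()


rows = list(iter_products(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

For a real file you would pass an open file or a path in place of the `io.BytesIO` wrapper. Whether this helps depends on whether each page can be built from its own subtree alone. Lookups that jump across the whole document would still need the full tree.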

However, if you need to process the whole lot then I don't think it will help, which I think makes sense, because unless you have everything stored in memory the processor wouldn't know how to navigate it.

I guess another option would be to effectively break one XML doc down into smaller valid XML docs. That shouldn't be hard given the structured nature of XML.
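A minimal sketch of that splitting idea, again in Python as a stand-in for Perl. It assumes the big document is a root element holding a flat run of repeated record elements, which the thread does not confirm. The names `split_document`, `catalogue`, and `product` are hypothetical. Each chunk is wrapped in its own copy of the root element so that it stays well formed by itself.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: five records directly under one root element.
SAMPLE = (
    "<catalogue>"
    + "".join(f'<product id="{i}"><name>Item {i}</name></product>'
              for i in range(1, 6))
    + "</catalogue>"
).encode()


def split_document(source, record_tag, per_chunk):
    """Yield well-formed XML strings, each holding up to per_chunk records."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    # Remember the root's tag and attributes; root.clear() below wipes them.
    root_tag, root_attrib = root.tag, dict(root.attrib)
    chunk = ET.Element(root_tag, root_attrib)
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            chunk.append(elem)   # move the finished record into the chunk
            root.clear()         # release it from the original tree
            if len(chunk) == per_chunk:
                yield ET.tostring(chunk, encoding="unicode")
                chunk = ET.Element(root_tag, root_attrib)
    if len(chunk):               # leftover records that did not fill a chunk
        yield ET.tostring(chunk, encoding="unicode")


parts = list(split_document(io.BytesIO(SAMPLE), "product", 2))
for part in parts:
    print(part)
```

Each string in `parts` could be written to its own file and fed to the existing script unchanged, provided no page needs data from a record that ended up in a different chunk.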

I'll take a closer look at twig later, so I'll let you know if first impressions were wrong.