What is 'best practice' when parsing large, complex XML files?

SimpleXML or another method?

         

max4

6:15 am on Aug 26, 2010 (gmt 0)

10+ Year Member



Hi,

I am using an API on my site which relies on the REST method for supplying XML. Currently I attempt to retrieve the output using the following:


$API = 'the REST url'; // placeholder: put the actual REST URL here
$xml = simplexml_load_file($API, 'SimpleXMLElement', LIBXML_NOCDATA);


The output generally looks something like this for a single request; multiple requests add another node:


SimpleXMLElement Object ( [@attributes] => Array ( [stat] => ok ) [current] => SimpleXMLElement Object ( [region] => SimpleXMLElement Object ( [id] => usa [name] => United States ) [category] => SimpleXMLElement Object ( [id] => vehicle/powersports/atv [name] => ATVs [abbrev] => ATVs ) [start] => 1 [num] => 1 ) [listings] => SimpleXMLElement Object ( [element] => SimpleXMLElement Object ( [id] => 2113403615 [title] => 2007 Yamaha Rhino Atv 660 4x4 Sport [body] => The sport edition is the Rhino to have, it has a beautiful smoke silver hard coat finish, performance piggy back shocks, aluminium wheels, roof cover, special edition silver seats, sport steering wheels and special edition graphics. It has only 510 miles. Everything works like new and looks like new. The atv is listed nationally.The shipping is FREE within US. Please don't use "Respond" button because I can't check that email right now. If you want to buy it, to contact me please CLICK HERE Thank you! [url] => http://example.com/u_a2xx_/2113403615-P1u546,812-97FD41FAC27E/example.com___Azn_vnZgkb3pBte_XGZA_GoyfVEGcSl_WQxRT4W4Jetc-z2rp3NrYfAAMcibLlyptrCu5C5xOQU4kKrRk1RSDsIDRLlJK54EAauqmvfQtms, [category] => SimpleXMLElement Object ( [id] => vehicle/powersports/atv [name] => ATVs ) [source] => SimpleXMLElement Object ( [id] => www [name] => Oodle ) [location] => SimpleXMLElement Object ( ) [images] => SimpleXMLElement Object ( [element] => Array ( [0] => SimpleXMLElement Object ( [src] => http://example.com/item/2113403615u_0s?1282796183 [width] => 100 [height] => 75 [alt] => 2007 Yamaha Rhino Atv 660 4x4 [num] => 0 [size] => s ) [1] => SimpleXMLElement Object ( [src] => http://example.com/item/2113403615u_0m?1282796183 [width] => 144 [height] => 108 [alt] => 2007 Yamaha Rhino Atv 660 4x4 [num] => 0 [size] => m ) [2] => SimpleXMLElement Object ( [src] => http://example.com/item/2113403615u_0l?1282796183 [width] => 208 [height] => 156 [alt] => 2007 Yamaha Rhino Atv 660 4x4 [num] => 0 [size] => l ) [3] => 
SimpleXMLElement Object ( [src] => http://example.com/item/2113403615u_0x?1282796183 [width] => 400 [height] => 300 [alt] => 2007 Yamaha Rhino Atv 660 4x4 [num] => 0 [size] => x ) [4] => SimpleXMLElement Object ( [src] => http://example.com/item/2113403615u_1s?1282796184 [width] => 100 [height] => 75 [alt] => 2007 Yamaha Rhino Atv 660 4x4 [num] => 1 [size] => s ) [5] => SimpleXMLElement Object ( [src] => http://example.com/item/2113403615u_1m?1282796184 [width] => 144 [height] => 108 [alt] => 2007 Yamaha Rhino Atv 660 4x4 [num] => 1 [size] => m ) [6] => SimpleXMLElement Object ( [src] => http://example.com/item/2113403615u_1l?1282796184 [width] => 208 [height] => 156 [alt] => 2007 Yamaha Rhino Atv 660 4x4 [num] => 1 [size] => l ) [7] => SimpleXMLElement Object ( [src] => http://example.com/item/2113403615u_1x?1282796183 [width] => 400 [height] => 300 [alt] => 2007 Yamaha Rhino Atv 660 4x4 [num] => 1 [size] => x ) [8] => SimpleXMLElement Object ( [src] => http://example.com/item/2113403615u_2s?1282796184 [width] => 100 [height] => 75 [alt] => 2007 Yamaha Rhino Atv 660 4x4 [num] => 2 [size] => s ) [9] => SimpleXMLElement Object ( [src] => http://example.com/item/2113403615u_2m?1282796184 [width] => 144 [height] => 108 [alt] => 2007 Yamaha Rhino Atv 660 4x4 [num] => 2 [size] => m ) [10] => SimpleXMLElement Object ( [src] => http://example.com/item/2113403615u_2l?1282796184 [width] => 208 [height] => 156 [alt] => 2007 Yamaha Rhino Atv 660 4x4 [num] => 2 [size] => l ) [11] => SimpleXMLElement Object ( [src] => http://example.com/item/2113403615u_2x?1282796184 [width] => 400 [height] => 300 [alt] => 2007 Yamaha Rhino Atv 660 4x4 [num] => 2 [size] => x ) ) ) [ctime] => 1282796183 [paid] => No [revenue_score] => 2 [user] => SimpleXMLElement Object ( [id] => 43618431 [url] => http://example.com/seller/43618431/ [name] => K M. 
[photo] => http://example.com/user/43618431_1282796543.jpg?nc=1 ) [similar_url] => http://example.com/2006_2008-yamaha/for-sale/atvs/price_2300_3100/mileage_0_20000/?inbs=1 [attributes] => SimpleXMLElement Object ( [condition] => Used [delivery] => Local Delivery [fee] => No [has_photo] => Thumbnail [make] => Yamaha [mileage] => 510 [price] => 2700 [price_display] => $2,700 [private_party] => Yes [seller_type] => Private Party [user_id] => 43618431 [year] => 2007 ) ) ) [meta] => SimpleXMLElement Object ( [total] => 22328 [returned] => 1 [first] => 1 [last] => 1 [search_time] => 0.00000 [search] => SimpleXMLElement Object ( [title] => ATVs For Sale [url] => http://example.com/used-vehicles/for-sale/atvs/ ) [post] => SimpleXMLElement Object ( [generic] => SimpleXMLElement Object ( [url] => http://example.com/post/? ) [category] => SimpleXMLElement Object ( [url] => http://example.com/post/?category=vehicle/powersports/atv ) ) [current_time] => 1282802244 ) )
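For reference, fields in a structure like the dump above can be read with plain property access and string casts. This is only a minimal sketch: the XML string is a hand-trimmed reconstruction from the print_r output, and the root element name `response` is an assumption, since the dump does not show it.

```php
<?php
// Hand-trimmed sample rebuilt from the dump above. The root tag name
// "response" is an assumption: print_r does not show the root element name.
$sample = <<<XML
<response stat="ok">
  <listings>
    <element>
      <id>2113403615</id>
      <title>2007 Yamaha Rhino Atv 660 4x4 Sport</title>
      <images>
        <element><src>http://example.com/item/2113403615u_0s?1282796183</src><size>s</size></element>
        <element><src>http://example.com/item/2113403615u_0m?1282796183</src><size>m</size></element>
      </images>
      <attributes>
        <price>2700</price>
        <year>2007</year>
      </attributes>
    </element>
  </listings>
  <meta><total>22328</total></meta>
</response>
XML;

$xml = simplexml_load_string($sample, 'SimpleXMLElement', LIBXML_NOCDATA);

// Attributes of the root element use array syntax; child elements use ->.
$status = (string) $xml['stat'];
$total  = (int) $xml->meta->total;

// Collect only the fields the page needs, cast to plain PHP types.
$rows = [];
foreach ($xml->listings->element as $listing) {
    $thumb = null;
    foreach ($listing->images->element as $image) {
        if ((string) $image->size === 's') { // first small thumbnail
            $thumb = (string) $image->src;
            break;
        }
    }
    $rows[] = [
        'id'    => (string) $listing->id,
        'title' => (string) $listing->title,
        'price' => (int) $listing->attributes->price,
        'thumb' => $thumb,
    ];
}
```

Iterating with foreach works whether a request returns one listing or several, so the same loop covers the "multiple requests add another node" case.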


What I've noticed, and what the problem is, is that this method is very, very slow. Each use of my application requests between 10 and 50 of these, and load times range from 7 seconds to more than 45 seconds. I have tried other methods of reading the output of the REST URL, including cURL and the file_get_contents() function, and the load time remains very slow. Parsing other, less complex XML documents is a piece of cake with SimpleXML, but not so with this REST document. I assume I must be using an inefficient method, as I've seen other sites running the same API with favorable results. If that is the case, what would be 'best practice' for gathering and parsing data in the form you've seen above? If not, what could I be doing wrong that is leading to these slow load times?

Please note that the issue is not with my application: when the API is removed, load times are in the milliseconds. Further, if I run the following:


$API = 'the REST url'; // placeholder: put the actual REST URL here
$xml = simplexml_load_file($API, 'SimpleXMLElement', LIBXML_NOCDATA);
print_r($xml);
exit;


to ensure that the rest of my application doesn't factor into the slow load times, I get the same results (a load time of 7 to 45+ seconds before the $xml variable is set and printed). Any help, guidance or suggestions on this matter are much appreciated. Thank you.

Sincerely,
Max

httpwebwitch

6:58 am on Aug 31, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The solution depends on what you are trying to get out of the XML file. Forget about best practices... the answer is Do Whatever Works.

In situations where I have been repeatedly parsing large piles of XML data, I've found it worthwhile to get the data once and drop the parsed results into a denormalized SQL database. From there, queries can be orders of magnitude faster because you don't need to drag 300 MB of crap into memory just to get 1 KB of data out of it, and you don't have the additional overhead of waiting for someone else's API to respond.
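A rough sketch of that idea, assuming PDO with the SQLite driver is available. The table layout, column names, function names and the 15-minute freshness window are all made up for illustration, and `INSERT OR REPLACE` is SQLite syntax; another database would need its own upsert statement.

```php
<?php
// Hypothetical cache: table, columns and function names are illustrative only.
function openCache(string $dsn = 'sqlite::memory:'): PDO
{
    $db = new PDO($dsn);
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec('CREATE TABLE IF NOT EXISTS listings (
        id TEXT PRIMARY KEY,
        title TEXT,
        price INTEGER,
        fetched_at INTEGER
    )');
    return $db;
}

// Store one flat row per listing found in an already-parsed API response.
function storeListings(PDO $db, SimpleXMLElement $xml, int $now): int
{
    $stmt = $db->prepare(
        'INSERT OR REPLACE INTO listings (id, title, price, fetched_at)
         VALUES (:id, :title, :price, :fetched_at)'
    );
    $count = 0;
    foreach ($xml->listings->element as $listing) {
        $stmt->execute([
            ':id'         => (string) $listing->id,
            ':title'      => (string) $listing->title,
            ':price'      => (int) $listing->attributes->price,
            ':fetched_at' => $now,
        ]);
        $count++;
    }
    return $count;
}

// Return cached rows that are no older than $maxAge seconds.
function freshListings(PDO $db, int $now, int $maxAge = 900): array
{
    $stmt = $db->prepare(
        'SELECT id, title, price FROM listings
         WHERE fetched_at >= :cutoff ORDER BY id'
    );
    $stmt->execute([':cutoff' => $now - $maxAge]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

With something like this in place, a page view reads from freshListings() and only calls the API when nothing fresh comes back, so most visitors never wait on the remote service.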

simplexml_load_file does two things at once. It loads the file from the API, then parses it. Convenient.

Just for clarity, you should find out whether it's parsing the XML that is taking a long time, or the response from the API that is slow. The amount of data you quoted above... should not take more than a few milliseconds to parse. I suspect you're just querying a slow service.

You can find out easily by getting the data first using cURL, then using curl_getinfo() [php.net] to find out how long it took to load. Then pass the string into SimpleXML. Throw some benchmarks in along the way to see how long each part takes to execute.
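Something along these lines would separate the two timings. It is a sketch, not tested against the API in question: the function names timedFetch() and timedParse() and the 60-second timeout are assumptions.

```php
<?php
// Download a URL with cURL and report how long the transfer took.
function timedFetch(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);          // assumed upper bound, adjust as needed
    $body    = curl_exec($ch);
    $seconds = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
    $error   = curl_errno($ch) ? curl_error($ch) : null;
    // The handle is released when $ch goes out of scope.
    return [
        'body'    => $body === false ? null : $body,
        'seconds' => $seconds,
        'error'   => $error,
    ];
}

// Parse an XML string with SimpleXML and report how long parsing took.
function timedParse(string $xmlString): array
{
    $start   = microtime(true);
    $xml     = simplexml_load_string($xmlString, 'SimpleXMLElement', LIBXML_NOCDATA);
    $seconds = microtime(true) - $start;
    return ['xml' => $xml === false ? null : $xml, 'seconds' => $seconds];
}

// Usage (with $API holding the REST URL):
// $fetch = timedFetch($API);
// $parse = timedParse((string) $fetch['body']);
// printf("fetch: %.3fs, parse: %.3fs\n", $fetch['seconds'], $parse['seconds']);
```

If the fetch number dominates, the parser is not the problem and switching away from SimpleXML will not help.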

max4

11:12 pm on Aug 31, 2010 (gmt 0)

10+ Year Member



Hi httpwebwitch,

I used the curl_getinfo() example from php.net:


<?php
// Create a curl handle
$ch = curl_init('http://www.example.com/');

// Return the response as a string instead of printing it to the page
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// Execute
curl_exec($ch);

// Check if any error occurred
if (!curl_errno($ch)) {
    $info = curl_getinfo($ch);

    echo 'Took ' . $info['total_time'] . ' seconds to send a request to ' . $info['url'];
}

// Close handle
curl_close($ch);
?>


The result was 9 seconds. I can only conclude that the API response is slow. I'm still puzzled by how other sites are able to accomplish quick (1 - 2 sec) results using this API. Perhaps the API provider selectively offers enhanced service to today's more popular websites.
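Nothing in this thread confirms how those other sites do it. One thing that can be tried on the client side, though, is sending the 10 to 50 requests concurrently instead of one after another, so the total wait is closer to the slowest single response than to the sum of all of them. A minimal sketch using PHP's curl_multi functions follows; the function name fetchAll() and the 15-second timeout are assumptions.

```php
<?php
// Fetch several URLs concurrently. Returns body and transfer time per URL,
// keyed the same way as the input array.
function fetchAll(array $urls, int $timeout = 15): array
{
    if ($urls === []) {
        return [];
    }

    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep bodies as strings
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);    // assumed per-request limit
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running && curl_multi_select($mh, 1.0) === -1) {
            usleep(100000); // select failed; avoid a busy loop
        }
    } while ($running && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = [
            'body'       => curl_multi_getcontent($ch),
            'total_time' => curl_getinfo($ch, CURLINFO_TOTAL_TIME),
        ];
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);

    return $results;
}

// Usage: each body can then be passed to simplexml_load_string().
// $responses = fetchAll(['atv' => $API1, 'boats' => $API2]);
```

Whether the API provider permits that many simultaneous connections is something to check in their terms first.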