|fixing bad XML using PHP|
| 1:42 pm on Aug 28, 2008 (gmt 0)|
Say you are using data from a public XML feed. Naturally you'd expect the XML to be well-formed, valid even. But what if it's not? Many times when you're doing XSLT transformation on XML, the slightest glitch in the XML can send your scripts spiralling into an abyss of Exception Agony.
Of course, you can put your XML routines inside a try/catch block. But then you end up with a page of nothingness. Is there a way to iron out XML problems using PHP, or some other server language?
Here's a simplified example. This is what the XML is supposed to look like:
<name>Roger Ramjet</name>
<address>123 Main St.</address>
The XML I'm actually getting looks like this:
<name>Roger Ramjet    [note the missing end tag here]
<address>123 Main St.</address>
This isn't XML! I can't use any normal XML parsing tools for this. It's got to be done with string manipulation.
Any code I create to fix this problem will be a messy, ad-hoc bandaid. For instance, I could do a simple string replace to turn "<address>" into "</name><address>". A lovely, quick fix for this one instance, assuming that the people delivering this XML never actually get around to fixing the problem (and when they do, my code will cause problems).
Couldn't there be a more general, perhaps regex-based, solution for this?
The code would have to make some assumptions, for instance that <address> isn't really a child of <name>; ergo <name> needs to be closed before <address> is added to the stack. I imagine a robust markup fixer might be able to use a Schema-defined hierarchy to figure these things out. Otherwise, the safest assumption when a non-closed element is found would be to close it immediately after any character data inside it, before any other elements are parsed. Closed elements tend not to cause well-formedness errors. (Validation is another matter.)
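The "close it right after its character data" heuristic can be sketched with a single regex. This is only a rough illustration under loud assumptions: the feed is flat, record-style XML with no mixed content and no CDATA sections, and the function name closeDanglingTextElements is made up for this example. It is not a general markup fixer.

```php
<?php
// Hypothetical helper: closes any element that contains text but is followed
// by something other than its own end tag. Assumes flat, record-style XML
// (no mixed content, no CDATA sections); anything else may be mangled.
function closeDanglingTextElements(string $xml): string
{
    $pattern = '~<([A-Za-z_][\w.\-]*)'   // 1: element name
             . '((?:\s[^<>]*)?)(?<!/)>'  // 2: optional attributes, not self-closing
             . '([^<]*[^<\s])'           // 3: text ending in a non-space character
             . '(\s*)'                   // 4: trailing whitespace, kept after the new end tag
             . '(?=<(?!/\1\s*>))~';      // next "<" is NOT this element's own end tag

    // preg_replace() returns null on a regex engine error; fall back to the input.
    return preg_replace($pattern, '<$1$2>$3</$1>$4', $xml) ?? $xml;
}

$broken = "<root>\n<name>Roger Ramjet\n<address>123 Main St.</address>\n</root>";
echo closeDanglingTextElements($broken), "\n";
```

Elements that are already closed are left alone, so running the function on a feed the provider has since fixed should not double up the end tags.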
Here's another example from a while ago which I had to hack:
<name>Roger Ramjet [red]&[/red] Sons</name>
<address>Portage [red]&[/red] Main St.</address>
This (fake example) was a public XML feed provided by a major data service. They encoded ampersands appearing in some of their nodes, but not others. Blooper!
This one was easier to fix. I'd look for any strings that matched this pseudo-regex (non-greedy), and wrap the whole string in <![CDATA[ ]]>:
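The pseudo-regex itself isn't shown above, so the following is not the poster's pattern. It is a swapped-in alternative that escapes bare ampersands instead of wrapping text in CDATA, which has the advantage of leaving already-encoded entities alone. The function name escapeBareAmpersands is hypothetical, and the sketch assumes the feed contains no CDATA sections (an ampersand inside one would be escaped wrongly).

```php
<?php
// Hypothetical helper: turns "&" into "&amp;" unless it already starts a
// named entity (&amp;), a decimal reference (&#38;) or a hex reference (&#x26;).
// Assumes there are no CDATA sections in the input.
function escapeBareAmpersands(string $xml): string
{
    $pattern = '/&(?!(?:[A-Za-z_][\w.\-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/';

    // Fall back to the untouched input if the regex engine reports an error.
    return preg_replace($pattern, '&amp;', $xml) ?? $xml;
}

$feed = '<name>Roger Ramjet &amp; Sons</name><address>Portage & Main St.</address>';
echo escapeBareAmpersands($feed), "\n";
```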
Common wisdom dictates that when you have broken XML, the solution is to fix the scripts that generate the XML, not try to hack up some scripts that massage the data into well-formedness where well-formedness doesn't exist. But there are situations where having an XML masseuse would be useful. For instance:
- bad XML provided by 3rd parties who don't answer their phones (YOU KNOW WHO YOU ARE!)
- bad static XML provided on disk (i.e. not from a web service)
- XML provided via user upload or form entry
- corrupt or incomplete XML files, recovered from damaged disks
| 1:50 pm on Aug 28, 2008 (gmt 0)|
My 'average rate' of getting external XML to parse has been so low that I rarely bother now. Each application gets a custom REGEX written which pulls out just the information I need, from that precise script as it gets sent.
That kind of game.
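For what it's worth, the "custom regex per feed" approach might look something like the sketch below. The function name extractField and the sample feed are assumptions for illustration; the pattern grabs the text after an opening tag up to the next "<", so it doesn't care whether the end tag is there or not.

```php
<?php
// Hypothetical helper: returns the text that follows <tag>, up to the next
// "<" (or the end of the string), without requiring a matching end tag.
function extractField(string $feed, string $tag): ?string
{
    $pattern = '~<' . preg_quote($tag, '~') . '>\s*([^<]*?)\s*(?=<|$)~';

    return preg_match($pattern, $feed, $m) === 1 ? $m[1] : null;
}

$feed = "<name>Roger Ramjet\n<address>123 Main St.</address>";
var_dump(extractField($feed, 'name'));
var_dump(extractField($feed, 'address'));
```

The obvious trade-off is the one already mentioned in the first post: the pattern is tied to this feed's quirks and has to be rechecked whenever the provider changes anything.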
Overall I don't like XML and far prefer more usable formats such as pipe or comma separated values:
Roger Ramjet & Sons¦Portage & Main St¦123-456-7890
Granted; 3D data suits XML better... but with good design that can be easily avoided in most cases...
| 2:15 pm on Aug 28, 2008 (gmt 0)|
Why not use DOMDocument [us.php.net]? It's used to parse fugly HTML. I've not used it for this, but I have been told that this is exactly what it's good for.
Or maybe it only works on the XHTML variant?
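On the XML side, DOMDocument has a libxml-specific `recover` property that asks the parser to build whatever tree it can from a document that isn't well-formed. A minimal sketch, untested against any real feed; the function name loadLeniently and the sample string are assumptions. Note that the parser has no way of knowing the intended hierarchy, so the recovered tree may well nest <address> inside the unclosed <name>.

```php
<?php
// Hypothetical helper: tries to load XML that is not well-formed by using
// libxml's recovery mode. Returns null if nothing usable could be parsed.
function loadLeniently(string $xml): ?DOMDocument
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->recover = true;          // libxml extension, not part of the DOM spec
    $loaded = $dom->loadXML($xml);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return ($loaded !== false && $dom->documentElement !== null) ? $dom : null;
}

$dom = loadLeniently('<root><name>Roger Ramjet<address>123 Main St.</address></root>');
if ($dom !== null) {
    echo $dom->getElementsByTagName('address')->item(0)->textContent, "\n";
}
```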
| 3:00 pm on Aug 28, 2008 (gmt 0)|
I'm not certain DOMDocument is going to work on the XML. If it can be used, I need to be educated here as well. I've never found an appropriate variant of the DOM methods to get me where I want to be.
I've found the best method so far is to check for an exact FALSE return value from simplexml_load_string() and handle it accordingly. If you still need to parse that code no matter what, handling the string via regex is about your only option. There are some error checking routines you can write if you want to further analyze the invalid incoming XML. Have a look at libxml_get_errors [php.net].
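A minimal sketch of that check, assuming you want the libxml messages back as plain strings. The function name parseOrExplain and its return shape are made up for this example. The strict comparison against false matters because a valid SimpleXMLElement with an empty root also evaluates as false in a loose boolean test.

```php
<?php
// Hypothetical helper: returns [SimpleXMLElement|null, list of error strings].
function parseOrExplain(string $xml): array
{
    // Keep libxml quiet and collect its errors for inspection instead.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = simplexml_load_string($xml);

    $messages = [];
    if ($doc === false) {          // strict check: exact FALSE means parse failure
        foreach (libxml_get_errors() as $error) {
            $messages[] = sprintf(
                'line %d, column %d: %s',
                $error->line,
                $error->column,
                trim($error->message)
            );
        }
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$doc === false ? null : $doc, $messages];
}

[$doc, $errors] = parseOrExplain('<root><name>Roger Ramjet<address>123 Main St.</address></root>');
if ($doc === null) {
    echo implode("\n", $errors), "\n";
}
```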
| 2:50 pm on Sep 23, 2008 (gmt 0)|
A post by Sekka in PHP today led me to another option/attempt to validate XML [webmasterworld.com]. I've not yet tested and won't have time for a bit yet, but thought I would throw it out here for discussion. Has anybody attempted using the XMLReader class for validation [php.net]?
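To get the discussion started, here is a rough sketch of how XMLReader could be used as a well-formedness check: stream through the whole document and see whether libxml logged any errors. This only checks well-formedness, not validity against a DTD or schema, and the function name isWellFormed is an assumption for this example.

```php
<?php
// Hypothetical helper: streams through the document with XMLReader and
// reports whether libxml complained. Checks well-formedness only.
function isWellFormed(string $xml): bool
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $reader = new XMLReader();
    $opened = $reader->XML($xml);

    if ($opened) {
        // read() returns false both at the end of the document and on a
        // parse error, so the collected libxml errors decide the outcome.
        while ($reader->read()) {
            // nothing to do; we only care whether parsing succeeds
        }
        $reader->close();
    }

    $ok = $opened && count(libxml_get_errors()) === 0;

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok;
}

var_dump(isWellFormed('<root><name>Roger Ramjet</name></root>'));
var_dump(isWellFormed('<root><name>Roger Ramjet<address>123 Main St.</address></root>'));
```

Because XMLReader is a streaming parser, this should cope with large feeds without loading the whole tree into memory, which is one reason to consider it over DOMDocument for a pure yes/no check.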