Of course, you can put your XML routines inside a try/catch block. But then you end up with a page of nothingness. Is there a way to iron out XML problems using PHP, or some other server language?
Here's a simplified example. This is what the XML is supposed to look like:
<xml>
<person>
<name>Roger Ramjet</name>
<address>123 Main St.</address>
<phone>123-456-7890</phone>
</person>
</xml>
The XML I'm actually getting looks like this:
<xml>
<person>
<name>Roger Ramjet [red][note the missing end tag here][/red]
<address>123 Main St.</address>
<phone>123-456-7890</phone>
</person>
</xml>
This isn't XML! I can't use any normal XML parsing tools for this. It's got to be done with string manipulation.
Any code I create to fix this problem will be a messy, ad-hoc bandaid. For instance, I could do a simple string replace to turn "<address>" into "</name><address>". A lovely, quick fix for this one instance, assuming that the people delivering this XML never actually get around to fixing the problem (and when they do, my code will cause problems).
Couldn't there be a more general, perhaps regex-based, solution for this?
The code would have to make some assumptions, for instance that <address> isn't really a child of <name>; ergo <name> needs to be closed before <address> is added to the stack. I imagine a robust markup fixer might be able to use a Schema-defined hierarchy to figure these things out; otherwise the safest assumption when a non-closed element is found would be to close the element immediately after any character data inside it, before any other elements are parsed. Closed elements tend not to cause well-formedness errors (validation is another matter).
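A rough sketch of that "close it right after the text" heuristic in PHP might look like the following. The function name close_unclosed_leaves() and the $leafTags list are made up for illustration; the list stands in for the Schema-defined hierarchy by naming the elements that are assumed to hold only text. It will get comments or CDATA sections inside those elements wrong, so treat it as a starting point rather than a robust fixer.

```php
<?php
// Hypothetical helper: closes text-only ("leaf") elements whose end tag is missing.
// $leafTags is an assumption supplied by the caller, not something read from a schema.
function close_unclosed_leaves(string $xml, array $leafTags): string
{
    foreach ($leafTags as $tag) {
        $name = preg_quote($tag, '~');
        // Group 1: the opening tag (not self-closing).
        // Group 2: the character data that follows it.
        // Lookahead: the next tag is anything other than this element's own end tag.
        $pattern = '~(<' . $name . '(?:\s[^>]*)?(?<!/)>)([^<]*)(?=<(?!/' . $name . '\s*>))~';
        $xml = preg_replace_callback($pattern, function (array $m) use ($tag) {
            // Keep trailing whitespace outside the element so the layout survives.
            $text = rtrim($m[2]);
            $tail = substr($m[2], strlen($text));
            return $m[1] . $text . '</' . $tag . '>' . $tail;
        }, $xml);
    }
    return $xml;
}
```

Called as close_unclosed_leaves($xml, ['name', 'address', 'phone']), it leaves elements that are already closed untouched, so it should keep working if the supplier ever fixes the feed.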
Here's another example from a while ago which I had to hack:
<xml>
<person>
<name>Roger Ramjet [red]&[/red] Sons</name>
<address>Portage [red]&[/red] Main St.</address>
<phone>123-456-7890</phone>
</person>
</xml>
This (fake example) was a public XML feed provided by a major data service. They encoded ampersands appearing in some of their nodes, but not others. Blooper!
This one was easier to fix. I'd look for any strings that matched this (non-greedy) pseudo-regex, and wrap the matched element content in <![CDATA[ ]]>:
<([^>]+)>.*?&.*?</\1>
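For what it's worth, here is one way the pseudo-regex above could be turned into working PHP. It is only a sketch: wrap_bare_ampersands() is a made-up name, and it only touches elements that contain plain text with at least one ampersand that doesn't already start an entity, so nodes the supplier encoded correctly are left alone.

```php
<?php
// Hypothetical helper: wraps text-only elements that contain a bare "&" in CDATA.
// Limitations: a node mixing "&amp;" with a bare "&" ends up with a literal "&amp;",
// and text containing "]]>" would break the CDATA section.
function wrap_bare_ampersands(string $xml): string
{
    // A bare ampersand is one that does not start an entity such as &amp; or &#38;.
    $bare = '&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)';
    // <tag ...>text</tag>, where text has no child elements and at least one bare "&".
    $pattern = '~<([A-Za-z_][\w:.-]*)((?:\s[^>]*)?)>([^<]*' . $bare . '[^<]*)</\1\s*>~';
    return preg_replace_callback($pattern, function (array $m) {
        return '<' . $m[1] . $m[2] . '><![CDATA[' . $m[3] . ']]></' . $m[1] . '>';
    }, $xml);
}
```

Run over the feed before handing it to the parser, this turns <name>Roger Ramjet & Sons</name> into <name><![CDATA[Roger Ramjet & Sons]]></name>.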
Common wisdom dictates that when you have broken XML, the solution is to fix the scripts that generate the XML, not try to hack up some scripts that massage the data into well-formedness where well-formedness doesn't exist. But there are situations where having an XML masseuse would be useful. For instance:
- bad XML provided by 3rd parties who don't answer their phones (YOU KNOW WHO YOU ARE!)
- bad static XML provided on disk (i.e. not from a web service)
- XML provided via user upload or form entry
- corrupt or incomplete XML files, recovered from damaged disks
$matches=preg_match_all("/\<name\>([^\<]*).*?\<address\>([^\<]*).*?......./ism",$xml,$data);
Overall I don't like XML and far prefer more usable formats such as pipe or comma separated values:
Roger Ramjet & Sons|Portage & Main St|123-456-7890
Next Guy|Address|Telephone
etc. Granted, 3D data suits XML better, but with good design that can easily be avoided in most cases.
Or maybe it only works on the XHTML variant?
I've found the best method so far is to check for an exact FALSE (using ===) in the return value of simplexml_load_string() and handle it accordingly. If you still need to parse that code no matter what, handling the string via regex is about your only option. There are some error checking routines you can write if you want to further analyze the invalid incoming XML. Have a look at libxml_get_errors [php.net].
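Something along these lines is what I mean, using PHP's simplexml_load_string() together with libxml_use_internal_errors() and libxml_get_errors(). The wrapper name load_xml_or_errors() and the shape of its return value are just my own choices for the sketch.

```php
<?php
// Hypothetical wrapper: returns [SimpleXMLElement or null, list of error strings].
function load_xml_or_errors(string $xml): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = simplexml_load_string($xml);

    $messages = [];
    if ($doc === false) {
        foreach (libxml_get_errors() as $error) {
            // Each LibXMLError carries a message plus the line and column it refers to.
            $messages[] = sprintf('line %d, column %d: %s', $error->line, $error->column, trim($error->message));
        }
    }

    // Restore the previous error-handling mode for the rest of the script.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$doc === false ? null : $doc, $messages];
}
```

With that in place you can do [$doc, $errors] = load_xml_or_errors($xml); and only fall back to the regex massaging when $doc comes back null, logging $errors so you know what was wrong with the feed.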