|& breaks XML|
| 4:01 pm on Nov 20, 2008 (gmt 0)|
Hello, I'm new to XML so be nice :)
I'm pulling new stories from a database and putting the content into an XML document in order to download as a CSV or RTF file.
But the download breaks if I have an ampersand (&). To fix this I have to encode it as &amp;
But this then displays in my downloads, which is undesirable.
First of all, am I correct that ampersands break XML?
Secondly, are there any other characters that I have to watch out for and encode? It's just a shame that I have to display the HTML entities instead of the actual symbols in the downloads.
| 4:07 pm on Nov 20, 2008 (gmt 0)|
You can enclose text with CDATA if you are concerned about such characters - that works well
| 4:09 pm on Nov 20, 2008 (gmt 0)|
PS: I'm using PHP5, creating the XML file in a variable then pushing it to an XSLT file for the layout with some PHP headers to download the file.
Quite a few Word characters (e.g. ` ) also convert to entities. I'm currently find/replacing them with plain characters like '
This is OK, right?
| 4:15 pm on Nov 20, 2008 (gmt 0)|
Thanks Vince! That sounds like it could be what I need. Tell me, have you ever experienced any unparsed characters breaking/causing problems for the XML document? If not, I think you've just saved my bacon.
| 6:54 pm on Nov 20, 2008 (gmt 0)|
My preferred technique is to use a CDATA section to enclose anything that's potentially unfriendly to XML syntax, which really means anything that is user-generated data or human-readable text.
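To make that concrete, here is a minimal sketch of the CDATA approach. It's in Python only because the idea is language-independent; the cdata() helper and the sample story text are made up for illustration. The one thing to watch is the sequence ]]> inside the text itself, which would close the section early, so the helper splits it across two sections.

```python
import xml.etree.ElementTree as ET

def cdata(text):
    # "]]>" would end the CDATA section early, so split it across two sections.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# Hypothetical story text containing characters that normally upset XML.
story = "Fish & chips <b>now</b> cost > £5 ]]> really"
xml_doc = "<story>" + cdata(story) + "</story>"

# The parser hands back the original text, with no entities in sight.
parsed = ET.fromstring(xml_doc).text
print(parsed == story)
```

CDATA only switches off markup parsing. As far as I know it does nothing for characters that XML forbids outright, such as most control characters, so those would still need stripping beforehand.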
You need to be careful using HTML entities in XML. Here's why.
Say you use PHP's htmlentities() function to turn all your "£" into "&pound;", "©" into "&copy;", and so on.
<element>My Business &copy;</element>
They will live comfortably in the XML, and if you're outputting the XML node values on a web page, the browser will display them as £ and ©. Browsers are generally pretty nice about rendering HTML entities.
But the fragment above will not validate, because XML doesn't know what a "&copy;" is. That's an HTML entity, not an XML entity.
So if you're parsing the XML document with PHP's xml_parse(), it'll choke. The reason is the DTD. XML itself predefines only five entities, the basic furious five: &amp;, &gt;, &lt;, &quot;, and &apos;. You can include those five entities in any XML document, without worrying about extending the DTD.
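If you'd rather escape than use CDATA, those five are all you need to encode. A rough sketch in Python (xml_escape() and the sample headline are my own inventions; in PHP I'd expect htmlspecialchars() to fill the same role):

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

def xml_escape(text):
    # escape() handles &, < and >; the extra map covers both quote marks.
    return escape(text, {'"': "&quot;", "'": "&apos;"})

# Hypothetical headline using all five special characters.
headline = 'Marks & Spencer: "2 < 3" isn\'t > news'
encoded = xml_escape(headline)

# A parser turns the five entities back into plain characters when it reads the text.
decoded = ET.fromstring("<headline>" + encoded + "</headline>").text
print(encoded)
print(decoded == headline)
```

Since the parser undoes the escaping, a literal &amp; turning up in the final download may mean the text is being encoded twice somewhere along the way.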
But a generic XML document almost certainly won't have a declaration for &uuml; or &thorn;. Those are HTML entities, not XML entities. Just as HTML defines a <table> tag, it also defines what &pound; is. That is, HTML knows what a &pound; is, but XML does not.
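You can see the failure, and one way around it, in a few lines. This is a Python sketch with made-up sample text; I believe PHP's html_entity_decode() followed by htmlspecialchars() would be the equivalent pair. The idea is to turn HTML entities back into real characters first, then escape only what XML itself cares about.

```python
import html
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

fragment = "<element>My Business &copy;</element>"
try:
    ET.fromstring(fragment)
    parsed_ok = True
except ET.ParseError:
    parsed_ok = False  # the parser rejects the undeclared &copy; entity

# Decode HTML entities to real characters, then escape only &, < and >.
raw = "Caf&eacute; &amp; Bar &copy; 2008, &pound;5"  # hypothetical input
text = escape(html.unescape(raw))
fixed = ET.fromstring("<element>" + text + "</element>").text
print(parsed_ok, fixed)
```

The real characters then travel as ordinary text, so the document's encoding has to be able to represent them (UTF-8 is the usual choice).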
XML is an open language and you can declare your own entities, like &google; or &myvariable; - they're all legitimate, but they do have to be in the DTD or the parser will judge that your document is invalid.
If you're going to use HTML entities in an XML document, you need to declare all the ones you use in the DOCTYPE, right at the beginning, like so:
<!DOCTYPE element [
<!ENTITY pound "&#163;">
<!ENTITY copy "&#169;">
]>
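For what it's worth, here is a small Python sketch showing declared entities being accepted. The element name and text are hypothetical. Note that in an XML DTD the declaration is just the entity name followed by a quoted value, placed in the DOCTYPE's internal subset.

```python
import xml.etree.ElementTree as ET

# Entities declared in the internal subset, then used in the document body.
doc = """<!DOCTYPE element [
  <!ENTITY pound "&#163;">
  <!ENTITY copy "&#169;">
]>
<element>My Business &copy; from &pound;5</element>"""

# With the declarations present, the parser expands both entities.
text = ET.fromstring(doc).text
print(text)
```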
If that sounds like an unnecessary inconvenience, that's because it is. You can do it with a custom PHP function [ca3.php.net], but IMHO it's much easier just to use a CDATA container as it's meant to be used.
If you're the only one using your XML output, and you get it doing something useful without the XML being valid (i.e., with undeclared entities in it), then you may ignore all of this. But someday, someone else may want to use your data, and if it's not valid they may have problems parsing it, or transforming it with XSLT, or whatever else. I'm sure the guilt will haunt you relentlessly.