I've been trying to resolve a weird problem with how PHP (v4.3 and v5.2.5) handles the encoding of Unicode data. Here is the situation: the web page encoding is set to UTF-8 in the meta tags (and I know that works fine per se), and inside the PHP script I've got a variable that contains some Unicode characters and some umlauts.
If I use htmlentities($var, ENT_NOQUOTES, 'UTF-8'); then I see the Unicode text just fine, but the umlauts are replaced with ? (in preview on this forum the data is garbled further, so I can't really post an actual example).
If I use utf8_encode($var) instead, then I see the umlauts just fine, but the Unicode part of the string is garbled, seemingly UTF-8 encoded twice.
I really need both to be shown correctly, as these characters could be mixed. The data was pulled from the database correctly and put into XML that was read by the PHP script. The encoding on that XML data is set to UTF-8 and it seems to work, just not for both kinds of character at the same time in the example above!
Has anyone come across this weird behavior?
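One reading of the two symptoms above, offered as a guess rather than a diagnosis, is that $var holds umlauts as single ISO-8859-1 bytes alongside multi-byte UTF-8 text. The sketch below builds such a string by hand ($mixed and its contents are made up for illustration, and the mbstring extension is assumed to be available). mb_convert_encoding() from ISO-8859-1 performs the same conversion as utf8_encode(), which is deprecated as of PHP 8.2.

```php
<?php
// Assumed sample data, not the poster's actual value.
$latin1Umlaut = "\xF6";        // "ö" as one ISO-8859-1 byte
$utf8Cyrillic = "\xD0\xBF";    // Cyrillic "п" as two UTF-8 bytes
$mixed = $latin1Umlaut . '-' . $utf8Cyrillic;

// The lone 0xF6 byte makes the whole string invalid UTF-8.
// preg_match() with the /u modifier returns false on invalid UTF-8 subjects.
$isValidUtf8 = (bool) preg_match('//u', $mixed);

// Treating every byte as ISO-8859-1 repairs the umlaut (0xF6 becomes 0xC3 0xB6)
// but re-encodes each byte of the Cyrillic letter separately,
// which is the "encoded twice" effect.
$converted = mb_convert_encoding($mixed, 'UTF-8', 'ISO-8859-1');

// Hex dump of the result, for inspecting the bytes.
$convertedHex = bin2hex($converted);
```

If a real $var fails the same validity check, that would support the mixed-encoding explanation; if it passes, the cause lies elsewhere.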
I believe what you are stating is that the $var string is populated by fetching from a database field, and that field contains Unicode characters ... and umlauts. I am confused by this part. Are you referring to the umlaut/diaeresis characters? They are in the Unicode character set:
¨ Ä Ë Ï Ö Ü ä ë ï ö ü
<?xml version="1.0" encoding="utf-8"?>
<Rows>
<Row>&#246;-&#223;-&#8364;-привет-</Row>
</Rows>
This is a simplified version. Note that the numeric entities are followed by UTF-8 encoded letters; this was basically the data I was trying to display in PHP. The data came from the database (SQL Server), which was 100% handling it correctly, and was then put into generated XML as above.
I fixed my problem by changing my server code to avoid numeric entities and UTF-8 encode those characters instead. This made the PHP parser happy and I got what I wanted.
What I found very odd is that I was getting either the entities right OR the UTF-8 encoded data, but not both. I still have no explanation of exactly why, just the solution: UTF-8 encode the whole lot, thereby avoiding a mix of non-ASCII character entities and UTF-8 encoded characters. Hope it helps someone else; I was baffled by this behavior for too long :)
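The fix described above was made in the server code that generates the XML, which is not shown in the thread. For anyone who cannot change the generator, a rough PHP-side equivalent might look like the sketch below. The function name numericEntitiesToUtf8 is hypothetical, and the type declarations assume PHP 7 or later. It rewrites only non-ASCII numeric character references, so references such as &#60; stay encoded and the markup remains well-formed.

```php
<?php
// Hypothetical helper, not from the thread: turn decimal and hexadecimal
// numeric character references above ASCII into literal UTF-8 bytes.
function numericEntitiesToUtf8(string $xml): string
{
    return preg_replace_callback(
        '/&#(x[0-9A-Fa-f]+|[0-9]+);/',
        function (array $m): string {
            // $m[1] is either "x20AC" (hex) or "8364" (decimal).
            $codePoint = ($m[1][0] === 'x')
                ? hexdec(substr($m[1], 1))
                : (int) $m[1];
            // Leave ASCII references alone so that "<" and "&" stay escaped.
            if ($codePoint < 128) {
                return $m[0];
            }
            return html_entity_decode($m[0], ENT_NOQUOTES, 'UTF-8');
        },
        $xml
    );
}

// Assumed sample input modelled on the simplified XML in this thread.
$clean = numericEntitiesToUtf8('<Row>&#246;-&#223;-&#8364;-&#60;</Row>');
```

The cleaned string could then be handed to the XML parser in place of the original document.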
The parsing of that XML was done in PHP like this:
--------------
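// Create the parser without naming an encoding explicitly.
$xml_parser = xml_parser_create("");
// Keep element names as written and do not skip whitespace-only character data.
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($xml_parser, XML_OPTION_SKIP_WHITE, 0);
// Ask for UTF-8 strings to be passed to the handlers.
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
// startTag, endTag and contents are handler functions defined elsewhere in the script.
xml_set_element_handler($xml_parser, "startTag", "endTag");
xml_set_character_data_handler($xml_parser, "contents");
// $XML holds the complete document, so it is parsed as the final chunk.
if (!xml_parse($xml_parser, $XML, true)) {
    // error
}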
--------------
I had handlers to pick up the data in Rows, and this works just fine. The problem was that when it came to displaying the parsed data in HTML, I was getting the problem I described above. The HTML page had the correct UTF-8 encoding set via a META tag, and the browser was picking it up just fine.
I have settled on using htmlentities($var, ENT_NOQUOTES, 'UTF-8'); and it works just fine now. My conclusion is that somehow numeric entities were not parsed correctly by the PHP XML parser when additional UTF-8 encoded Unicode data was present. Very odd, but it was not just my dev PHP; it also happened on my hosting provider's PHP, version 5.2.5, which is current I believe.
P.S. I used an entity for the euro sign (€), and I could not get it working at all until I changed it to be UTF-8 encoded.
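The handlers themselves are not shown in the thread, so the following is only a guess at their shape. It uses closures (PHP 5.3 or later) and made-up names ($rows, $buffer), with sample data modelled on the simplified XML above. One detail worth knowing is that the character data handler can be called several times for a single text node, so the sketch appends to a buffer rather than assigning.

```php
<?php
// Assumed sample document: numeric entities followed by UTF-8 encoded Cyrillic.
$xml = '<?xml version="1.0" encoding="utf-8"?>'
     . "<Rows><Row>&#246;-&#223;-&#8364;-"
     . "\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82</Row></Rows>";

$rows = [];      // collected text of each <Row>
$buffer = null;  // null while outside a <Row>, a string while inside one

$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

xml_set_element_handler(
    $parser,
    // Start tag: open a fresh buffer for each Row.
    function ($p, $name, $attrs) use (&$buffer) {
        if ($name === 'Row') {
            $buffer = '';
        }
    },
    // End tag: store the accumulated text and close the buffer.
    function ($p, $name) use (&$buffer, &$rows) {
        if ($name === 'Row') {
            $rows[] = $buffer;
            $buffer = null;
        }
    }
);

// Character data may arrive in several pieces, so append instead of assigning.
xml_set_character_data_handler(
    $parser,
    function ($p, $data) use (&$buffer) {
        if ($buffer !== null) {
            $buffer .= $data;
        }
    }
);

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}
```

Dumping bin2hex($rows[0]) right after parsing is a quick way to see whether the umlauts came out as one byte or two, before any HTML output is involved.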
[edited by: Lord_Majestic at 12:56 am (utc) on April 13, 2008]
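As a footnote to the htmlentities() call mentioned above, here is a small sketch of the output step once the string is consistently UTF-8. The helper name escapeUtf8ForHtml is hypothetical, and the type declarations assume PHP 7 or later. It refuses input that is not valid UTF-8, which makes a mixed-encoding string fail loudly instead of showing up as ? in the browser.

```php
<?php
// Hypothetical helper, not from the thread: escape a value for HTML output,
// rejecting anything that is not valid UTF-8.
function escapeUtf8ForHtml(string $value): string
{
    // preg_match() with the /u modifier does not return 1 for invalid UTF-8.
    if (preg_match('//u', $value) !== 1) {
        throw new InvalidArgumentException('Value is not valid UTF-8');
    }
    return htmlentities($value, ENT_NOQUOTES, 'UTF-8');
}

// Assumed sample value: "ö-ß-€-п" written out as UTF-8 bytes.
$var = "\xC3\xB6-\xC3\x9F-\xE2\x82\xAC-\xD0\xBF";
$html = escapeUtf8ForHtml($var);
```

Characters with an HTML 4.01 named entity (the umlaut, sharp s and euro sign here) are replaced by that entity, while the Cyrillic letter has none and passes through as UTF-8 bytes.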