PHP character encoding problem

In a mix of umlauts and unicode text I can't get both to work with UTF-8

         

Lord Majestic

6:09 pm on Apr 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi,

I've been trying to resolve a weird problem with how PHP (v4.3 and v5.2.5) handles the encoding of Unicode data. Here is the situation: the web page encoding is set to UTF-8 in meta tags (and I know that works fine per se), and inside the PHP script I've got a variable that contains some Unicode characters and some umlauts.

If I use htmlentities($var, ENT_NOQUOTES, 'UTF-8'); then I see the Unicode text just fine, but the umlauts are replaced with ? (in preview on this forum the data is garbled further, so I can't really post an actual example).

If I use utf8_encode($var) instead, then I see the umlauts just fine; however, the Unicode part of the string is garbled, seemingly because it has been UTF-8 encoded twice.
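The "encoded twice" symptom is easy to reproduce in isolation. Below is a minimal sketch, assuming only core PHP: `latin1_to_utf8()` is a hypothetical hand-rolled stand-in for `utf8_encode()` (which treats every input byte as an ISO-8859-1 character), written out so the byte arithmetic is visible and so it does not depend on that function being available.

```php
<?php
// Hypothetical helper: same idea as utf8_encode(), i.e. every input byte is
// assumed to be one ISO-8859-1 character and is re-emitted as UTF-8.
function latin1_to_utf8($bytes)
{
    $out = '';
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            $out .= chr($b); // ASCII is identical in both encodings
        } else {
            // 0x80-0xFF becomes a two-byte UTF-8 sequence
            $out .= chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
        }
    }
    return $out;
}

$latin1_umlaut = "\xF6";     // "ö" as a single ISO-8859-1 byte
$utf8_umlaut   = "\xC3\xB6"; // "ö" already encoded as UTF-8

echo bin2hex(latin1_to_utf8($latin1_umlaut)), "\n"; // c3b6: now valid UTF-8
echo bin2hex(latin1_to_utf8($utf8_umlaut)), "\n";   // c383c2b6: encoded twice
```

The second result is what a browser shows as two junk characters in place of one letter, which matches the garbling described for the part of the string that was already UTF-8.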

I really need both to be shown correctly, as these characters could be mixed. The data was pulled from the database correctly and put into XML that was read by the PHP script. The encoding on that XML data is set to UTF-8 and it seems to work, just not for both kinds of characters at the same time!

Has anyone come across this weird behavior?

Lord Majestic

5:31 pm on Apr 12, 2008 (gmt 0)

OK, this is resolved now. For the benefit of others: it turned out the PHP XML parser that I used was not too happy about a mix of numeric character entities and UTF-8 encoded words, which led to the situation I described above. The solution was to change the XML generator to always UTF-8 encode non-ASCII characters rather than emit numeric entities for those that had them defined (like the euro sign).

coopster

8:44 pm on Apr 12, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I looked at this yesterday and was thinking perhaps it was the version of libxml that you have installed; I've run into that before on some older servers when using DOM and XSLT. However, it didn't seem like that would be your issue, so I hesitated to post. After reading today I'm glad I didn't! However, I'm still trying to track what happened here in my head and I'm having a little difficulty wrapping my mind around it all. I realize you have a resolution, but if you get a moment would you mind explaining some things?

I believe what you are stating is that the $var string is being populated by fetching from a database field and that field contains Unicode characters ... and umlauts. I am confused by this part. Are you referring to the umlaut/diaeresis characters? They are in the Unicode character set:

¨ Ä Ë Ï Ö Ü ä ë ï ö ü

Lord Majestic

10:57 pm on Apr 12, 2008 (gmt 0)

I am sure now it was some issue inside the XML parser or something like that. I had the following XML:

<?xml version="1.0" encoding="utf-8"?>
<Rows>
<Row>&#246;-&#223;-&#8364;-привет-</Row>
</Rows>

This is a simplified version. Note that the numeric entities are followed by UTF-8 encoded letters; this was basically the data I was trying to display in PHP. The data came from the database (SQL Server), which was 100% handling it correctly, and it was then put into the generated XML as above.

I fixed my problem by changing my server code to avoid using numeric entities and UTF-8 encode those characters instead. This made the PHP parser happy and I got what I wanted.

What I found very odd is that I was getting either the entities right OR the UTF-8 encoded data, but not both. I still have no explanation exactly why, just the solution: UTF-8 encode the whole lot, thereby avoiding a mix of non-ASCII character entities and UTF-8 encoded characters. Hope it helps someone else - I was baffled by this behavior for too long :)
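The fix described here was made in the C# generator, whose code is not shown in this thread. As a rough PHP-side illustration of the same idea (not the poster's actual code), the sketch below rewrites numeric entities for non-ASCII characters as literal UTF-8 before the text goes any further. The function name `numeric_entities_to_utf8()` is made up for this example, and the closure syntax assumes PHP 5.3 or later.

```php
<?php
// Hypothetical helper: replace numeric character references such as &#246;
// or &#x20AC; with the UTF-8 bytes of the character they stand for.
function numeric_entities_to_utf8($text)
{
    return preg_replace_callback(
        '/&#(?:[xX][0-9A-Fa-f]+|[0-9]+);/',
        function ($m) {
            $decoded = html_entity_decode($m[0], ENT_NOQUOTES, 'UTF-8');
            // A one-byte result is an ASCII character such as & or <.
            // Those must stay escaped or the XML would no longer be well-formed.
            return strlen($decoded) === 1 ? $m[0] : $decoded;
        },
        $text
    );
}

$row = '&#246;-&#223;-&#8364;-&#38;';
echo numeric_entities_to_utf8($row), "\n"; // ö-ß-€-&#38;
```

After this step the row contains a single kind of non-ASCII representation, which is the state the workaround aims for.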

coopster

11:54 pm on Apr 12, 2008 (gmt 0)

So you were htmlentities() encoding the data before you generated the XML? What XML processing were you using?

Lord Majestic

12:08 am on Apr 13, 2008 (gmt 0)

The XML data was generated by my C# code, which used an equivalent of htmlentities() there, yes.

The parsing of that XML was done in PHP like this:

--------------

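// Create the parser without naming an encoding explicitly.
$xml_parser = xml_parser_create("");
// Keep element names exactly as written (no upper-casing).
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0);
// Do not skip values that consist only of whitespace.
xml_parser_set_option($xml_parser, XML_OPTION_SKIP_WHITE, 0);
// Hand character data to the handlers as UTF-8.
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
// startTag, endTag and contents are handler functions defined elsewhere in the script.
xml_set_element_handler($xml_parser, "startTag", "endTag");
xml_set_character_data_handler($xml_parser, "contents");

// $XML holds the whole document; true marks it as the final chunk.
if (!xml_parse($xml_parser, $XML, true)) {
    // error handling goes here
}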

--------------

I had handlers to pick up the data in Rows, and that was working just fine. The problem was that when it came to displaying the parsed data in HTML, I was getting the problem I described above. The HTML page had the correct UTF-8 encoding set via a META tag, and the browser was picking it up just fine.

I have settled on using htmlentities($var, ENT_NOQUOTES, 'UTF-8'); and it works just fine now. My conclusion is that somehow numeric entities were not parsed correctly by the PHP XML parser when additional UTF-8 encoded Unicode data was present. Very odd, but it happened not just on my dev PHP but on the hosting PHP too - their version of PHP was 5.2.5, which is current I believe.
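For anyone who wants to test the parse step on their own server, here is a self-contained sketch, assuming the xml extension is loaded and PHP 5.3 or later for closures. The function name `parse_rows()`, the closures, and the sample row (including the Cyrillic word) are illustrative choices, not the handlers used in the post above. One detail worth knowing: the character data handler may be called several times for a single text node, so the pieces have to be appended rather than overwritten.

```php
<?php
// Hypothetical helper: return the text content of every <Row> element.
function parse_rows($xml)
{
    $rows = array();
    $current = null; // text of the <Row> being read, or null outside a row

    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$current) {
            if ($name === 'Row') {
                $current = '';
            }
        },
        function ($p, $name) use (&$rows, &$current) {
            if ($name === 'Row') {
                $rows[] = $current;
                $current = null;
            }
        }
    );
    xml_set_character_data_handler($parser, function ($p, $data) use (&$current) {
        if ($current !== null) {
            $current .= $data; // append: one text node may arrive in pieces
        }
    });

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $rows;
}

// Numeric entities followed by literal UTF-8 text in the same element.
$xml = '<?xml version="1.0" encoding="utf-8"?>'
     . '<Rows><Row>&#246;-&#223;-&#8364;-привет-</Row></Rows>';

$rows = parse_rows($xml);
// Escape for HTML output without touching the non-ASCII characters.
echo htmlspecialchars($rows[0], ENT_NOQUOTES, 'UTF-8'), "\n";
```

If the row does not come back as one clean UTF-8 string on a given installation, comparing `bin2hex($rows[0])` against the expected bytes is a quick way to see which part was altered.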

coopster

12:19 am on Apr 13, 2008 (gmt 0)

I have to run for now, but I hope I can revisit this again. It's very interesting. On a side note, htmlentities() has a fourth parameter, double_encode, that may be of interest for future reference. Thanks for sharing your process here, I appreciate it very much.
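A short sketch of what double_encode changes, for reference. The sample string is invented for this example; it contains an already-escaped ampersand, a numeric euro entity, and a literal UTF-8 "ö".

```php
<?php
// Already-escaped "&", numeric euro entity, and a literal UTF-8 "ö".
$text = "&amp; &#8364; \xC3\xB6";

// Default (double_encode = true): existing entities are escaped again.
echo htmlentities($text, ENT_NOQUOTES, 'UTF-8'), "\n";
// &amp;amp; &amp;#8364; &ouml;

// double_encode = false: existing entities are left alone.
echo htmlentities($text, ENT_NOQUOTES, 'UTF-8', false), "\n";
// &amp; &#8364; &ouml;
```

With the fourth argument set to false, text that already contains entities can be passed through htmlentities() without turning `&#8364;` into visible `&amp;#8364;` on the page.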

Lord Majestic

12:54 am on Apr 13, 2008 (gmt 0)

The only half-logical explanation I have is that when the parser was dealing with entities, it was decoding them into an internal string held in one encoding (whatever is native to PHP), but then, on meeting UTF-8 data, it was appending to those decoded characters and somehow creating a string with mixed encodings, even though logically it should have kept strings internally in a single encoding (in C# it is UTF-16). Hopefully this might help someone else who comes across this bizarre behavior. We might never know what the true reason is, but at least there seems to be a workaround and I am very pleased to have finally found it :)

P.S. I used an entity for the euro sign, &#8364;, and I could not get it working at all until I changed it to be UTF-8 encoded.
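Whatever the real cause was, a string with mixed encodings is something that can be detected directly. The sketch below is one way to check, assuming only the PCRE functions that ship with PHP: matching an empty pattern with the /u modifier fails when the subject is not valid UTF-8. The helper name `is_valid_utf8()` and both sample strings are made up for illustration.

```php
<?php
// Hypothetical helper: true when $s is entirely valid UTF-8.
// preg_match() returns false (not 0) when /u meets a malformed sequence.
function is_valid_utf8($s)
{
    return preg_match('//u', $s) === 1;
}

$utf8_only = "\xC3\xB6-\xE2\x82\xAC"; // "ö-€", both as UTF-8
$mixed     = "\xF6-\xE2\x82\xAC";     // "ö" as an ISO-8859-1 byte, "€" as UTF-8

var_dump(is_valid_utf8($utf8_only)); // bool(true)
var_dump(is_valid_utf8($mixed));     // bool(false)
```

Running a check like this on the parsed value before calling htmlentities() or utf8_encode() shows whether the string is clean UTF-8, and so which of the two calls (if either) is appropriate.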

[edited by: Lord_Majestic at 12:56 am (utc) on April 13, 2008]