Character encoding.

fm86

2:22 pm on Apr 18, 2011 (gmt 0)

Hi everybody!

I can't solve this apparently easy problem.

I have a string $str = "üß" and I'd like to convert it to "&uuml;&szlig;" or even better to "&#252;&#223;".

I tried
$str = htmlentities($str);

but all I get is: Ã¼Ã¼

What am I doing wrong?

lucy24

4:18 pm on Apr 18, 2011 (gmt 0)

I was going to say:

Your text was entered in UTF-8, with the two letters ü = C3BC and ß = C39F. It is being interpreted as ISO-Latin-1, giving the four letters C3 = Ã, BC = ¼, C3 = Ã, 9F = ... whoops! where's that second ¼ coming from? You'd expect Ÿ.

¼ and &frac14; are different names for the same character. Are you quoting your actual output?

Somewhere in the bowels of your software there has to be a setting that lets you tell it the encoding of the original text. What happens if your original text includes characters that aren't in ISO-Latin-1?
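
The byte arithmetic in this reply can be checked directly. Below is a minimal sketch, assuming the mbstring extension is available; the variable names are only illustrative. It writes "üß" as explicit UTF-8 bytes and then deliberately misreads each byte as one ISO-8859-1 character.

```php
<?php
// "üß" spelled out as UTF-8 bytes, so the result does not depend on how
// this file itself is saved.
$utf8 = "\xC3\xBC\xC3\x9F";

echo bin2hex($utf8), "\n";              // c3bcc39f
echo mb_strlen($utf8, 'UTF-8'), "\n";   // 2 (two characters, four bytes)

// Misread: treat every byte as one ISO-8859-1 character, then re-encode
// the result as UTF-8 so the individual characters can be counted.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo mb_strlen($misread, 'UTF-8'), "\n"; // 4
echo bin2hex($misread), "\n";            // c383c2bcc383c29f
```

Two characters become four, which is the doubled-up output described above. In strict ISO-8859-1 the byte 9F is a control character; browsers usually treat pages labelled Latin-1 as Windows-1252, where 9F displays as Ÿ.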

fm86

5:40 pm on Apr 19, 2011 (gmt 0)

Hi and thanks for the reply!

You're right, the problem seems to be the encoding of the page. I was sending a header to say the output was going to be XML, and that was causing trouble. Now I've tried modifying the code like this:

header ("Content-Type:text/xml");
print "<?xml version=\"1.0\" encoding=\"utf-8\"?>";
$text = "üß";
die("<tag>$text</tag>");

But the characters are now shown as �

Do you have any further suggestion?

lucy24

6:15 pm on Apr 19, 2011 (gmt 0)

Urk! Those are hex FFFF and FFFD, where the latter is the Unicode "replacement character" meaning "I can't deal with this". As it happens, in Latin-1 the characters ü and ß are the single bytes FC and DF, and neither of those is a valid sequence on its own in UTF-8. So it sounds as if you have managed to turn the original problem on its head :-) That is, first you had UTF-8 characters being interpreted as Latin-1, and now you have Latin-1 characters being interpreted as UTF-8.

What is the encoding of your original file-- the one on your computer that you're looking at right now? If the file itself is in Latin-1, changing the HTML header to say UTF-8 (or vice versa, or any other permutation of encodings) will not change the text, it will simply make it display incorrectly. See what happens if you leave everything exactly the way it is, but change the "UTF-8" piece to "ISO-Latin-1" (or 8859-1 if that's what the software expects).
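
One way to act on this is to check the bytes before they go into the XML. The sketch below assumes mbstring is available; `ensure_utf8()` is a hypothetical helper, not a built-in, and it assumes that anything which is not valid UTF-8 must be Latin-1, which is only a heuristic.

```php
<?php
// Hypothetical helper: return $bytes as UTF-8, assuming that input which is
// not already valid UTF-8 was saved as ISO-8859-1 (Latin-1).
function ensure_utf8(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$fromLatin1File = "\xFC\xDF";         // "üß" as saved in a Latin-1 file
$fromUtf8File   = "\xC3\xBC\xC3\x9F"; // "üß" as saved in a UTF-8 file

var_dump(mb_check_encoding($fromLatin1File, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($fromUtf8File, 'UTF-8'));   // bool(true)

// Either way, the bytes placed in the document now match its declaration.
$xml = '<' . '?xml version="1.0" encoding="utf-8"?' . '>'
     . '<tag>' . ensure_utf8($fromLatin1File) . '</tag>';

echo bin2hex(ensure_utf8($fromLatin1File)), "\n"; // c3bcc39f
```

The point is that the declared encoding and the actual bytes have to agree; changing only the declaration leaves the bytes as they were.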

Disclaimer: I do not speak php, though I do know German ;-)

fm86

7:04 pm on Apr 19, 2011 (gmt 0)

Servus! :)

Sooo, it's very frustrating... Just to check, I changed the encoding of my file to UTF-8 and the characters no longer displayed properly. No wonder: that means the file was originally Latin-1. I tried changing the encoding declared in the XML, but that didn't work.

I somehow solved the issue using utf8_encode(), but if I then run htmlentities() on the resulting string it gives me the &Atilde; again. Maybe it only works for Latin-1? I guess I'm missing something very important about character encoding.
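
A likely explanation, offered as an assumption about this setup: htmlentities() takes the input encoding as its third argument, and PHP versions before 5.4 defaulted it to ISO-8859-1, so a UTF-8 string is read byte by byte unless 'UTF-8' is passed explicitly. A minimal sketch, using mb_convert_encoding() in place of utf8_encode() (which is deprecated in recent PHP versions):

```php
<?php
$latin1 = "\xFC\xDF"; // "üß" as Latin-1 bytes

// Same conversion utf8_encode() performs: Latin-1 in, UTF-8 out.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// Tell htmlentities() that the input is UTF-8.
$right = htmlentities($utf8, ENT_QUOTES, 'UTF-8');
echo $right, "\n"; // &uuml;&szlig;

// Declaring the wrong input encoding reproduces the problem.
$wrong = htmlentities($utf8, ENT_QUOTES, 'ISO-8859-1');
echo $wrong, "\n"; // begins with &Atilde;&frac14;&Atilde;
```

So htmlentities() is not limited to Latin-1; it just needs to be told which encoding the string is in.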

lucy24

7:50 pm on Apr 19, 2011 (gmt 0)

Can you take utf8_decode and either put that inside of the htmlentities command, or feed its result to htmlentities?

:: grasping at straws ::
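
For what it's worth, the suggestion above can be sketched as follows, assuming the string is UTF-8 at that point and mbstring is available. mb_convert_encoding() stands in for utf8_decode() here, since utf8_decode() is deprecated in recent PHP versions; for these two characters the result is the same.

```php
<?php
$utf8 = "\xC3\xBC\xC3\x9F"; // "üß" as UTF-8 bytes

// UTF-8 in, Latin-1 out: the conversion utf8_decode() performs.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
echo bin2hex($latin1), "\n"; // fcdf

// Feed the Latin-1 result to htmlentities(), declared as Latin-1.
$entities = htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1');
echo $entities, "\n"; // &uuml;&szlig;
```

This only works for characters that exist in Latin-1; anything outside that range is lost in the first step, so passing 'UTF-8' to htmlentities() directly is the more general route.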