Parsing accented/umlauted characters

iceman22

9:16 am on Dec 29, 2004 (gmt 0)

10+ Year Member



I have a PHP script that will get a string from a website and compare it to one in a text database.

I'm having problems with text encoding.

An accented é appears as é on the page and in the HTML source. Once the string has been stored, the script outputs it as "é". The script outputs the same character from the database as "ˆ©".

Looking at the database text file, it appears as "?©". I have tried saving the text file using all the available text encodings; they all produce different results, but none is what I'm looking for.

The script uses shell commands, which could be affecting the encoding; when I run the same commands from my shell I get the accented é. In the script, once the text has been stored in variables via shell commands, I'm unable to modify the "é" or "ˆ©" parts of a string using str_replace or preg_replace.
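
A minimal sketch of why a replace can silently fail in a case like this, assuming the data is UTF-8 while the search string is typed into a script saved in another encoding. The sample strings are hypothetical, not taken from the actual database. Dumping the bytes with bin2hex shows what each string really contains, and spelling the search bytes out as escapes makes the match independent of how the script file was saved:

```php
<?php
// Hypothetical strings: the same "é" encoded two different ways.
$utf8   = "caf\xC3\xA9"; // UTF-8: é is the two bytes C3 A9
$latin1 = "caf\xE9";     // ISO-8859-1: é is the single byte E9

// The strings may look alike on screen but differ byte for byte.
echo bin2hex($utf8), "\n";   // 636166c3a9
echo bin2hex($latin1), "\n"; // 636166e9
var_dump($utf8 === $latin1); // bool(false)

// str_replace compares bytes, so the search string must use the same
// encoding as the data. Writing the bytes as escapes avoids depending
// on the encoding of the script file itself.
$fixed = str_replace("\xC3\xA9", "e", $utf8);
echo $fixed, "\n"; // cafe
```

Running bin2hex on both the string fetched from the page and the string read from the text file should show quickly whether the two sources use the same bytes for é.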

I'm looking for a way to get a match between the string in the database and the one from the external page.

mcibor

4:42 pm on Dec 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Intriguing!
I use Polish encoding (iso-8859-2) in HTML, Apache and MySQL and don't have any such problems. Maybe you should check what encoding you have in your Apache.

I tried it: I placed é in <input type="text" name="a"> and then wrote this in PHP:
<?php
// Read the submitted form field "a" and echo it back unchanged.
$a = $_POST["a"];
print("a: $a");
?>

and it did write é.
So I don't know what is causing your problem.
Sorry.

jatar_k

5:47 pm on Dec 29, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



what db are you using and what encoding are you using in the db?

if it is mysql take a read through this
MySQL Character Set Support [dev.mysql.com]

ergophobe

9:50 pm on Dec 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Are you declaring an encoding on all your HTML pages? If so, even if it looks funny when browsing your DB with whatever client, it should always look right on the actual pages output by the script.
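
A rough sketch of declaring the encoding from PHP itself, assuming the stored text is UTF-8 (swap in whatever encoding the data really uses). The sample value is hypothetical:

```php
<?php
// Assumption: the stored text is UTF-8. Use the real encoding if it differs.
$charset = 'UTF-8';

// Send the declaration before any other output.
header('Content-Type: text/html; charset=' . $charset);

// Hypothetical value read from the text database ("é" as UTF-8 bytes C3 A9).
$value = "caf\xC3\xA9";

// Give htmlspecialchars() the same charset so multibyte characters
// pass through intact.
$safe = htmlspecialchars($value, ENT_QUOTES, $charset);

echo '<p>', $safe, '</p>', "\n";
```

The header only tells the browser how to interpret the bytes; it does not convert anything, so it has to match what the script actually outputs.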

iceman22

5:26 am on Dec 30, 2004 (gmt 0)

10+ Year Member



Like I said, it's just a plain text database. I've tried importing it into MySQL, but I have to experiment with some parsing first to get it to import properly.

Using a simple form returns the same accented é character.

I do not have any encoding set on this page. I've tried different page encodings combined with different encodings of the text files, and none have worked.

Using the same commands and database through my shell, the characters appear properly, but that's the bash shell of OS X so I think it's applying Mac OS Roman encoding.

iceman22

7:07 am on Dec 30, 2004 (gmt 0)

10+ Year Member



Ok I replaced the "?" and the "?©" with ö and é respectively.

This enabled me to save them as latin encodings. I tried ISO Latin 1, ISO Latin 9, Windows Latin 1, Mac OS Roman.

Even with the characters as é instead of ?©, they still wouldn't appear correctly; the only file that did was the one saved with Mac OS Roman encoding. I use BBEdit to do the encodings, btw. Using "iso-8859-1" encoding in the page gave the same result as having no encoding defined. This could just be how the browser (Safari) is handling the encoding; I do not know much about encoding.

How it displays isn't that important really. What matters is how the script differentiates between the "é" and the "ˆ©" (the character appears in the text file as "?©"). It's also important that I can use PHP functions like str_replace and preg_replace to modify these characters; at the moment PHP will not replace them.

ergophobe

6:12 pm on Dec 30, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I wish I could be more help.

There was a thread here once RE encodings entitled something like "There's no such thing as 'plain text'"... it sounds like your situation is a case study in that!

I'm surprised that the Mac OS Roman encoding is the one that works. I thought the Mac version of ISO-8859-1 was very similar to Windows-1252 and caused problems with lots of punctuation (curly quotes for example).

Any ideas?

Tom

iceman22

1:35 pm on Jan 1, 2005 (gmt 0)

10+ Year Member



I found a solution to the problem.

I only handled the database locally via the shell to make sure I wasn't changing the text encodings with anything, then uploaded the database using 'raw data' mode in Fetch. Then setting the page to use UTF-8 text encoding made it work.
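
For cases where the files cannot all be kept in one encoding, a possible alternative is to normalise both strings to UTF-8 before comparing them. This is only a sketch: toUtf8 is a hypothetical helper, the ISO-8859-1 fallback is an assumption, and the validity check is a heuristic (a short string in another encoding can occasionally happen to be valid UTF-8).

```php
<?php
// Hypothetical helper: return $text as UTF-8, converting from $fallback
// (an assumed source encoding) when the bytes are not already valid UTF-8.
function toUtf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    // With the /u modifier, preg_match() fails on invalid UTF-8 input.
    if (preg_match('//u', $text) === 1) {
        return $text;
    }
    return iconv($fallback, 'UTF-8', $text);
}

$fromPage = "caf\xC3\xA9"; // already UTF-8
$fromFile = "caf\xE9";     // ISO-8859-1 bytes for the same word

var_dump(toUtf8($fromPage) === toUtf8($fromFile)); // bool(true)
```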

Thanks for all the replies.

ergophobe

5:11 pm on Jan 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Were you uploading it from one machine to another via FTP or something?

Glad your problem is fixed, but I'm still not understanding. Oh well, as long as it works.