This problem of charset support seems to annoy many people, and I haven't found a solution so far.
Our php.ini as well as page content type charset are iso-8859-15 (for support of Euro sign).
In a form, users sometimes paste text written first in MS Word (although I was not aware of this at first).
Each input is html_entitized before INSERT in the MySQL database.
It seems however that some browsers overwrite the charset encoding of the page, so that I'm not sure what the charset encoding of the input really is.
But since I was not aware of the problem, I never checked the input for it. In any case, the input passed validation and error checking, and was written to the database.
Here again, I have no idea with which encoding. At least, I've seen that the charset index of our MySQL version has neither cp1252 nor utf-8, though I'm not sure whether that's an issue.
Also, when looking at the content in MySQL, some of the "odd" MS Word characters are not entities, but just glyphs.
Anyhow, when retrieved from database and echoed in a webpage, these are replaced by a "?", meaning at least that they are outside the iso-8859-15 repertoire.
Trying to be smart, I did a
utf8_decode($string);
on a sample, but still got the "?" instead of curly quotes, em dashes, ellipses etc. utf8_encode made it even worse!
Even a
mb_convert_encoding($string, "iso-8859-15");
on the same sample did not cure the problem, and also returned "?" as substitute character.
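A note on why both calls end in "?": curly quotes, em dashes and ellipses have no slot in iso-8859-15, so any conversion to that charset has to substitute something. One alternative is to turn those bytes into entities instead. The following is only a minimal sketch, assuming the pasted Word text arrives as Windows-1252 (cp1252) bytes, which the thread has not confirmed; the function name word_text_to_entities is hypothetical.

```php
<?php
// Hypothetical helper, not from the original thread.
// Assumption: $string holds Windows-1252 bytes (typical for text pasted from MS Word).
function word_text_to_entities($string) {
    // Naming 'cp1252' as the charset tells htmlentities() how to read bytes 0x80-0x9F,
    // so curly quotes, dashes and the ellipsis become named entities instead of "?".
    return htmlentities($string, ENT_QUOTES, 'cp1252');
}

// Sample with cp1252 curly quotes (0x93, 0x94), an em dash (0x97) and an ellipsis (0x85).
$sample = "\x93Quoted\x94 \x97 and so on\x85";
echo word_text_to_entities($sample), "\n";
```

The result contains only ASCII, so it should survive an iso-8859-15 table and page unchanged. If the input is really in some other encoding, the entities will be wrong, so this only helps once the source encoding is known or guessed.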
As for pasting the "odd" sample in an input field and checking the result, this will not teach me much, since my configuration has nothing in common with those of our users around the world. I even know of someone dealing with Russia who has an old French version of MS Office with Cyrillic encoding enabled, all of which is running on a Mac! God only knows what charset that might send through the browser when pasting text in a textarea.
I also tested on just a sample, because the "oddity" can be nested into a very big text, and not always near the beginning of it.
The problem has two sides:
- first, I should find a way to echo correctly what's already stored in database;
- second, I should find an efficient way to identify and convert 100% of the "odd" characters, so as to keep ALL of the input inside the iso-8859-15 repertoire before sending it to the database, to avoid further pollution.
Even the very extensive explanations of ergophobe and some of his code samples did not help to solve this.
Thank you for your help.
Notawiz
Does your MySQL have the iso-8859-15 character set installed?
Is it the character set for the given table?
But this does not tell me the encoding of the "odd" glyphs of MS Word code page, so that I still don't have a clue how to make these display correctly after a DB query.
I believe however that they are not in the repertoire of latin1 (= iso-8859-1), but in the recordset they are readable. It is only when retrieved from DB and echoed on a page that everything gets messed up.
what version of MySQL are you running?
I am not allowed to install, reconfigure, compile or whatever for MySQL on the server, so I will have to live with those settings, I fear.
So I would like to find a solution for:
- making whatever odd character already in database render correctly;
- identifying and altering, filtering or converting any future input, to keep such garbage out of the database.
Any hint would be greatly appreciated.
Notawiz
but in the recordset they are readable. It is only when retrieved from DB and echoed on a page that everything gets messed up.
So the characters appear correct in the database, but not on the web page? Is that right?
If so, maybe all you have to do is set the character set of your output web page to match the database table.
Unfortunately, I don't know MySQL 3 at all, so I can't help you figure that out.
Have you tried using the htmlentities on the text you fetch from the database?
all you have to do is set the character set on your output web page match the database table
That's just the problem. The charset is consistent through the whole chain, namely iso-8859-1 (or -15) for apache, mysql, php and the individual web pages.
All user input is "entitized" before INSERT in tables, and again entity_decoded before output to user browser.
The functions to entitize/decode are of no help since those glyphs are not in the repertoire of the iso-8859-15 encoding. Php thus leaves them "as is" on their way in, and does not know how to handle them on their way out.
mbstring functions to re-encode into a "supported" charset could cure this, but for that I would need to know the "native" encoding of the content of the input field.
So of course I tried to detect that encoding, but the function always returns "iso-8859-1" (or whatever encoding I place first in the mb_detect_order), which we know is not true.
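That result is probably to be expected: every byte sequence is valid iso-8859-1, so a detector has no way to reject it. One heuristic that may help is that bytes 0x80 to 0x9F are control codes in iso-8859-1 and -15 and should not occur in real text, while Windows-1252 uses that range for curly quotes, dashes and so on. A rough sketch of that check follows; looks_like_cp1252 is a hypothetical name, and the test cannot tell cp1252 from other code pages that also use that range.

```php
<?php
// Hypothetical helper, not from the original thread.
// Guesses whether a byte string is Windows-1252 rather than ISO-8859-1/-15.
function looks_like_cp1252($string) {
    // Rule out UTF-8 first: valid UTF-8 with multi-byte characters also
    // contains bytes in the 0x80-0xBF range. preg_match('//u') returns 1
    // only when the subject is valid UTF-8.
    if (preg_match('//u', $string) === 1 && preg_match('/[\x80-\xFF]/', $string) === 1) {
        return false;
    }
    // Any byte in 0x80-0x9F is a control code in ISO-8859-x, so its presence
    // suggests the text was produced as Windows-1252.
    return preg_match('/[\x80-\x9F]/', $string) === 1;
}

var_dump(looks_like_cp1252("\x93Hi\x94")); // cp1252 curly quotes
var_dump(looks_like_cp1252("caf\xE9"));    // plain latin1 text
```

This is a guess, not a detection: it says nothing about input such as the Cyrillic-on-a-Mac case mentioned earlier.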
I was wondering if the "Accept-charset" of the request header could help me, but since my computer is set to handle latin1, I am not able to reproduce the setting of all our users around the world, which can sometimes be very weird, as I said in the original post (old French Office extended to Cyrillic, on a Mac computer, for instance...).
As a result, when trying to reproduce the "odd" input, my test environment "entitizes" it correctly, because it is latin1 (iso-8859-1) from start to end.
Does that shed some more light on the matter?
Notawiz
In an old thread of [thelist] forum I found a piece of ereg_replace statements.
function ms_characters($ms_string) {
    // The forum software decoded the entities in the original listing; they are restored here.
    // Assumption: values below 256 are Windows-1252 byte values (matched via chr()), and
    // values above 255 are numeric character references such as "&#8216;".
    // ereg_replace() was removed in PHP 7; strtr() does all replacements in one pass.
    $map = array(
        chr(38)   => "&amp;",    // ampersand (double-encodes entities already present)
        chr(133)  => "&hellip;", // ellipsis
        "&#8243;" => "&Prime;",  // double prime (the listing had 8226, which is the bullet)
        "&#8216;" => "'",        // left single quote
        chr(145)  => "'",        // left single quote
        "&#8217;" => "'",        // right single quote
        chr(146)  => "'",        // right single quote
        "&#8220;" => "&quot;",   // left double quote
        chr(147)  => "&quot;",   // left double quote
        "&#8221;" => "&quot;",   // right double quote
        chr(148)  => "&quot;",   // right double quote
        "&#8226;" => "&bull;",   // bullet
        chr(149)  => "&bull;",   // bullet
        "&#8211;" => "&ndash;",  // en dash
        chr(150)  => "&ndash;",  // en dash
        "&#8212;" => "&mdash;",  // em dash
        chr(151)  => "&mdash;",  // em dash
        "&#8482;" => "&trade;",  // trademark
        chr(153)  => "&trade;",  // trademark
        chr(169)  => "&copy;",   // copyright mark
        chr(174)  => "&reg;",    // registration mark
    );
    return strtr($ms_string, $map);
}
Not very elegant, since it would have been better to prevent them from being written in DB, but I will test if the same function can help me also on the input side.
By the way, since I encountered only 2 such "odd" MS characters, and it worked for those, I'm not sure whether this handles every odd MS code page inconsistency.
Feel free to let me know if you experience problems with this list.
To round this off, could someone see if it is possible to write this into a single regex range?
Notawiz
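On the single-regex question: one possible approach is to match the whole Windows-1252 range 0x80 to 0x9F with one character class and look each byte up in a table. This is only a sketch under assumptions: it covers the byte values from the list above (133, 145 to 151, 153), not the four-digit ones, it needs PHP 5.3 or later for the closure, and ms_characters_regex is a hypothetical name.

```php
<?php
// Hypothetical variant of ms_characters(), not from the original thread.
// Assumption: the "odd" characters are Windows-1252 bytes in the range 0x80-0x9F.
function ms_characters_regex($ms_string) {
    $entities = array(
        "\x85" => '&hellip;', // 133 ellipsis
        "\x91" => "'",        // 145 left single quote
        "\x92" => "'",        // 146 right single quote
        "\x93" => '&quot;',   // 147 left double quote
        "\x94" => '&quot;',   // 148 right double quote
        "\x95" => '&bull;',   // 149 bullet
        "\x96" => '&ndash;',  // 150 en dash
        "\x97" => '&mdash;',  // 151 em dash
        "\x99" => '&trade;',  // 153 trademark
    );
    // One character class covers the whole range; bytes without a table
    // entry are returned unchanged.
    return preg_replace_callback(
        '/[\x80-\x9F]/',
        function ($match) use ($entities) {
            $byte = $match[0];
            return isset($entities[$byte]) ? $entities[$byte] : $byte;
        },
        $ms_string
    );
}

echo ms_characters_regex("\x93Hi\x94 \x97 wait\x85"), "\n";
```

Bytes 169 and 174 (copyright and registration marks) are left alone here because they are ordinary iso-8859-15 characters and display correctly without conversion.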