Unknown charset in database

Hello again,

This problem of charset support seems to annoy many people, and I haven't found a solution so far.

Our php.ini as well as page content type charset are iso-8859-15 (for support of Euro sign).

In a form, users sometimes paste text that was first written in MS Word (although I was not aware of this at first).

Each input is passed through htmlentities() before the INSERT into the MySQL database.

It seems, however, that some browsers override the charset declared by the page, so I'm not sure what the encoding of the input really is.

But since I was not aware of the problem, I never checked the input for this. In any case, the input passed validation and error checking, and was written to the database.

Here again, I have no idea in which encoding. At least, I've seen that the charset index of our MySQL version lists neither cp1252 nor utf-8, although I'm not sure whether that's an issue.

Also, when looking at the content in MySQL, some of the "odd" MS Word characters are not stored as entities, but as raw glyphs.

Anyhow, when retrieved from the database and echoed in a webpage, these are replaced by a "?", which at least means that they are outside the iso-8859-15 repertoire.

Trying to be smart, I did a
utf8_decode($string);
on a sample, but still got the "?" instead of curly quotes, em dashes, ellipses etc. utf8_encode made it even worse!

Even a
mb_convert_encoding($string, "iso-8859-15");
on the same sample did not cure the problem, and also returned "?" as substitute character.
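
One detail about that call: no source encoding is named, so mbstring assumes its internal encoding as the source. Below is a minimal sketch of the variant with the source named explicitly. It assumes, and this is only a guess, that the stored bytes are Windows-1252 (cp1252), and that the installed mbstring knows that encoding. The $sample value is made up for illustration. Note that even then curly quotes and dashes have no slot in iso-8859-15, so they still come out as the substitute character; only the Euro sign survives.

```php
<?php
// Assumption: the sample holds Windows-1252 bytes, as text pasted from MS Word often does.
// \x93 and \x94 are curly double quotes, \x96 is an en dash, \x80 is the Euro sign in cp1252.
$sample = "\x93quoted\x94 \x96 price 5\x80";

// Name both the target and the source encoding.
// Without the third argument, the internal encoding is taken as the source.
$converted = mb_convert_encoding($sample, 'ISO-8859-15', 'Windows-1252');

// The Euro sign maps to byte 0xA4 in iso-8859-15; the quotes and the dash
// do not exist there and are replaced by the substitute character.
echo bin2hex($converted), "\n";
```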

As for pasting the "odd" sample into an input field and checking that, it would not teach me much, since my configuration has nothing in common with those of our users around the world. I even know of someone dealing with Russia who has an old French version of MS Office with Cyrillic encoding enabled, all of which is running on a Mac! God only knows what charset that might send through the browser when pasting text into a textarea.

I also tested on just a sample, because the "oddity" can be buried in a very long text, and not always near the beginning of it.

The problem has two sides:
- first, I should find a way to echo correctly what's already stored in the database;
- second, I should find an efficient way to identify and convert 100% of the "odd" characters, so as to keep ALL of the input inside the iso-8859-15 repertoire before sending it to the database, to avoid further pollution.
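
To make the second point concrete, here is a rough sketch of the kind of filter I have in mind, not a proven solution. The function name to_iso_8859_15 is hypothetical. It assumes PHP 7 or later with mbstring, that input which is not valid UTF-8 is Windows-1252 (a guess that will be wrong for some users), and that characters missing from iso-8859-15 may be stored as numeric entities.

```php
<?php
// Hypothetical helper, not part of any library: force input into iso-8859-15.
function to_iso_8859_15(string $input): string
{
    // Guess the source: valid UTF-8 is taken as UTF-8,
    // anything else is ASSUMED to be Windows-1252.
    $source = mb_check_encoding($input, 'UTF-8') ? 'UTF-8' : 'Windows-1252';

    // Ask mbstring to emit characters that iso-8859-15 cannot hold
    // as numeric entities instead of "?", then restore the old setting.
    $previous = mb_substitute_character();
    mb_substitute_character('entity');
    $output = mb_convert_encoding($input, 'ISO-8859-15', $source);
    mb_substitute_character($previous);

    return $output;
}

// Usage: cp1252 curly quotes become entities, the Euro sign becomes byte 0xA4.
echo bin2hex(to_iso_8859_15("\x93hi\x94 5\x80")), "\n";
```

Two caveats: the generated entities contain "&", so this step would have to run after any htmlentities() or htmlspecialchars() call, otherwise they get escaped a second time; and some cp1252 byte sequences happen to be valid UTF-8, so the guess in the first line can misfire.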

Even the very extensive explanations of ergophobe and some of his code samples did not help to solve this.

Thank you for your help.

Notawiz