They post through a contenteditable area, so if they copy from Word or something then it comes through exactly how they copy it.
This may actually depend on their operating system. But I suppose they’d notice if something in their input area didn’t match what they pasted in. (Long ago, I met a website whose text input only worked correctly if you intentionally told it to use a different language than the one you were actually using. I have mercifully blocked the details.)
MySQL stores data as "cp1252 West European (latin1)", but PHP sets it to UTF-8 again
Oh, criminy. If that’s the setup, no solution will be perfect. There are at least three stages where text is moved from Point A to Point B (for example, from the input window to the place where it gets screened, and from there to the database, and from there to the visible html, and I’ve probably missed a few). In each of those stages, non-ASCII text has to be stored as some kind of numerical entity *, and then the next stage is faced with a numerical entity that may have a different meaning.
For example:
Original text contains the character é (e-acute).
If it is stored in Latin-1 (either 8859-1 or 1252) it becomes E9
(We will not talk about what happens if it is stored in some other one-byte encoding: in Mac Roman, for example, it becomes 8E, which 8859-1 treats as a forbidden control character.)
If it is stored in UTF-8 it becomes C3A9
If that E9 is opened in something that expects UTF-8, it will either disappear or it will merge with the following one or two letters, depending on what they are, because E9 by itself has no meaning.
If, contrariwise, that C3A9 is opened in something that expects Latin-1, it will be read as é
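If you want to see the example above for yourself, here is a minimal sketch in Python (my choice of language, nothing to do with the PHP/MySQL setup in question). The codec names are Python's own; the sample phrase "café au lait" is just something I made up for illustration.

```python
# The character from the example above.
text = "é"

latin1_bytes = text.encode("latin-1")    # ISO 8859-1: one byte, E9
cp1252_bytes = text.encode("cp1252")     # Windows-1252: also E9
mac_bytes = text.encode("mac_roman")     # Mac Roman: 8E
utf8_bytes = text.encode("utf-8")        # UTF-8: two bytes, C3 A9

print(latin1_bytes.hex(), cp1252_bytes.hex(), mac_bytes.hex(), utf8_bytes.hex())

# UTF-8 bytes opened by something that expects Latin-1:
# each byte is read as its own character.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)

# A Latin-1 byte opened by something that expects UTF-8.
# Made-up sample phrase with E9 in the middle.
stored = b"caf\xe9 au lait"
try:
    stored.decode("utf-8")               # strict decoding refuses the lone E9
except UnicodeDecodeError as err:
    print("strict decode failed:", err.reason)

# A lenient decoder substitutes U+FFFD (the replacement character) instead.
lenient = stored.decode("utf-8", errors="replace")
print(lenient)
```

Python's decoder is a strict one, so it never merges the E9 with the letters after it. Other software may behave differently.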
I don’t think an existing database can be converted, as such. It would have to be downloaded, converted into a new encoding, and then re-uploaded. That’s the kind of thing you save for when the site is due for major revisions anyway--at which point you probably decide instead that it is not really necessary to preserve discussion threads from 2007 ;)
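For what the "converted into a new encoding" step might look like on a single text value, here is a rough sketch in Python. It assumes the damage is exactly one round of "UTF-8 bytes read as cp1252", which may or may not be what this particular database holds. The function name repair_mojibake is made up for illustration, and I have left out the download and re-upload parts entirely.

```python
def repair_mojibake(value: str) -> str:
    """Undo one round of 'UTF-8 bytes read as cp1252'.

    Hypothetical helper: if the value does not fit that pattern,
    it is returned unchanged rather than guessed at.
    """
    try:
        # Turn the characters back into the bytes they came from,
        # then read those bytes as the UTF-8 they really were.
        return value.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value


print(repair_mojibake("Ã©"))        # damaged value
print(repair_mojibake("café"))      # already correct, left alone
print(repair_mojibake("plain"))     # ASCII is unaffected either way
```

One caveat: Python's cp1252 codec leaves a handful of byte values undefined, so some damaged characters will not round-trip and will simply be left as they are. Test on a copy first.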
Oh yes and ... In Apache, contrary to ordinary usage, a charset declaration in the config file will override a charset declaration in an individual html document. Most of the time, this will not cause problems in modern browsers, but it's worth remembering.
* Yes, technically ASCII is also stored as numerical entities, but it doesn’t matter because those numbers are the same in all encodings. At least the ones that assume Roman script.
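A quick way to check the footnote, again as a Python sketch: encode a made-up ASCII-only sample in a few of the encodings mentioned in this post and compare the bytes. The list of codecs is my own selection, not an exhaustive one.

```python
# Made-up sample containing only ASCII characters.
sample = "Plain ASCII text, 0-9!"

# Encodings mentioned in this post, by their Python codec names.
codecs_to_check = ["ascii", "latin-1", "cp1252", "mac_roman", "utf-8"]

# Encode the same text in each one.
encoded = {name: sample.encode(name) for name in codecs_to_check}

# If every encoding produced the same bytes, the set has exactly one member.
all_same = len(set(encoded.values())) == 1
print(all_same)
```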