httpwebwitch - 3:33 pm on Jul 14, 2010 (gmt 0)
The (now obvious to me) fact is, if you store Chinese characters in a database that is configured as UTF8, what you should see when you SELECT from it (in a client that displays the raw bytes as Latin1) is "è„‚è‚ªéš".
If you SELECT from the database and you see Chinese characters, your data is wrong. That was the crux of my gotcha. I was seeing Chinese characters in the database, thinking that was correct, and that my display methods were wrong. It was the other way around!
Typically, if a page that has a <form> on it is encoded as UTF8, then the $_POST that comes back will be in UTF8. Assuming your database is also configured with UTF8 encoding, you don't need to do anything special to the data before an INSERT or UPDATE - just put it in there exactly as it arrived. (Of course you should still run it through mysql_real_escape_string() on the way in and htmlspecialchars() on the way out, to avoid SQL injection and XSS respectively!)
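A minimal sketch of that round trip, written in Python rather than PHP so it can be run on its own. The in-memory SQLite table `notes` and the sample value are assumptions for illustration, not part of the original setup; the placeholder binding stands in for mysql_real_escape_string() and html.escape() stands in for htmlspecialchars().

```python
import html
import sqlite3

# Hypothetical stand-in for a $_POST value that arrived as UTF-8 bytes.
posted_bytes = "<b>脂肪魚</b>".encode("utf-8")
posted_text = posted_bytes.decode("utf-8")

# In-memory SQLite is only an illustration; the post itself uses MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")

# Bind the value through a placeholder instead of building the SQL string
# by hand; this is what protects against SQL injection on the way in.
conn.execute("INSERT INTO notes (body) VALUES (?)", (posted_text,))

# On the way out, escape HTML metacharacters to avoid XSS.
stored = conn.execute("SELECT body FROM notes").fetchone()[0]
safe_html = html.escape(stored)
print(safe_html)
```

No encoding conversion happens anywhere in between: the text goes in as it arrived and comes out unchanged, with escaping applied only at the two boundaries.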
And when you want to display something UTF8 encoded from your database onto an HTML page with UTF8 encoding, it's simple: SELECT it, and print() or echo() it as is. The bytes sent to your browser are "è„‚è‚ªéš", but what you see rendered is "脂肪魚".
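To see where "è„‚è‚ªéš" comes from, here is a small Python sketch (Python is used only as a convenient way to inspect bytes; the same holds in PHP). It assumes the "unreadable" view is the UTF8 bytes interpreted as Windows-1252, which is what most tools mean by Latin1.

```python
text = "脂肪魚"

# The nine bytes that are actually stored and sent to the browser.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.hex(" "))

# The same bytes read as Windows-1252, one character per byte.
# Byte 0xAD becomes an invisible soft hyphen, which is why the string
# quoted in the post appears to have only eight characters.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)

# The same bytes read as UTF-8, which is what a UTF8 page does.
rendered = utf8_bytes.decode("utf-8")
print(rendered)
```

The bytes never change; only the interpretation does.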
When I first started this project, those pages - with the form on them - were not using UTF8 encoding. So the data coming into $_POST from the <form> was not UTF8 encoded, and it was being stored in a Latin1 db column.
Later, when I started this whole i18n endeavour, I switched the database to UTF8 and put the UTF8 charset headers on the pages too. But the data already in my database was still not UTF8 encoded.
An HTML page using UTF8 expects the source to have "è„‚è‚ªéš" in it. It will render that as "脂肪魚". If a UTF8 HTML page receives text that is not UTF8 encoded, that's when you see "?????".
BTW it's not just Chinese that is affected, it's dashes and apostrophes and copyright signs and all the other rich text characters. Users will paste them into the form straight from Word. That's how it is.
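A quick Python sketch of what happens to those Word characters (the sample string is made up for illustration). It assumes the pasted text reached the server as Windows-1252 bytes and was then served on a UTF8 page.

```python
# Curly apostrophe, en dash and copyright sign, as pasted from Word.
pasted = "It\u2019s \u2013 \u00a9"

# What ends up stored if the form page was not UTF8: one byte per character.
cp1252_bytes = pasted.encode("cp1252")

# A UTF8 page cannot make sense of those lone high bytes, so each one is
# shown as a replacement character (the "?" seen in the browser).
shown = cp1252_bytes.decode("utf-8", errors="replace")
print(shown)

# Encoded as UTF8 instead, the same text survives the round trip.
round_trip = pasted.encode("utf-8").decode("utf-8")
print(round_trip)
```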
Another gotcha was that I'd used other methods to insert data - importing from CSV, importing from TXT files, etc - and these were not UTF8 encoded, so they were stored wrong. Looking in my database, I see a word like "Sugestión", but what I SHOULD see is "SugestiÃ³n". The smart thing would have been to read the source data into PHP, and INSERT it after applying utf8_encode() (which converts Latin1 to UTF8) to it.
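The conversion that utf8_encode() performs can be sketched in Python as a decode from Latin1 followed by an encode to UTF8. The helper name `latin1_to_utf8` and the sample bytes are hypothetical, standing in for a line read from a legacy CSV or TXT file.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode Latin1 bytes as UTF8 (roughly what PHP's utf8_encode() does)."""
    return raw.decode("latin-1").encode("utf-8")

# Hypothetical line from a legacy import file, stored as Latin1.
legacy_bytes = "Sugestión".encode("latin-1")

utf8_bytes = latin1_to_utf8(legacy_bytes)

# Viewed as Latin1, the converted bytes look "wrong", which is the
# appearance the post says to expect in the database.
as_seen_in_db = utf8_bytes.decode("latin-1")
print(as_seen_in_db)

# Served on a UTF8 page, they render correctly.
print(utf8_bytes.decode("utf-8"))
```

This is only safe when the input really is Latin1; running it on data that is already UTF8 would double-encode it.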
The moral is: start EVERY project with everything configured in UTF8, from the start, whether you think it'll be required or not. And realize that the data in your database should not look readable, it should look like "è„‚è‚ªéš".
Now, I also learned that MOST browsers will return UTF8 encoded stuff in the $_POST when the <form> is on a page that uses UTF8 encoding. This is *mostly* true, for modern compliant browsers. But many bloggers point out that you can't count on it with 100% certainty. Some browsers might send their $_POST to your server with "脂肪魚" instead of "è„‚è‚ªéš". And as I know now, if I INSERT "脂肪魚" into my database, later on I'll see it on the web site looking like "?????".
The workaround (emphasis on "work") is to put a hidden field in your <form> that contains some extended characters, like "óóó". Then when your server receives the $_POST, you check to see if that field looks like "Ã³Ã³Ã³". If it does, then you know the $_POST is UTF8 encoded, and you may INSERT everything into your UTF8 database as-is. If it doesn't, there are checks you can do to figure out what encoding it is in, and apply appropriate conversion techniques to all the other data.
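The hidden-field check can be sketched like this, in Python rather than PHP so it runs standalone. The field name `charset_check`, the function `normalize_post` and the single Latin1 fallback are all assumptions for illustration; a real handler would need to cover whatever other encodings it expects.

```python
# Byte patterns the hidden "óóó" field can arrive as.
UTF8_MARKER = "óóó".encode("utf-8")      # c3 b3 c3 b3 c3 b3, looks like "Ã³Ã³Ã³"
LATIN1_MARKER = "óóó".encode("latin-1")  # f3 f3 f3

def normalize_post(raw_fields, marker_field="charset_check"):
    """Decode raw form bytes to text, using the hidden marker to pick the encoding."""
    marker = raw_fields.get(marker_field, b"")
    if marker == UTF8_MARKER:
        encoding = "utf-8"
    elif marker == LATIN1_MARKER:
        # Assumption: a browser that ignored UTF8 sent Windows-1252 instead.
        encoding = "cp1252"
    else:
        raise ValueError("unrecognised form encoding")
    return {
        name: value.decode(encoding)
        for name, value in raw_fields.items()
        if name != marker_field
    }

# The same word submitted in each encoding ends up as the same text.
utf8_post = {"charset_check": UTF8_MARKER, "title": b"Sugesti\xc3\xb3n"}
latin1_post = {"charset_check": LATIN1_MARKER, "title": b"Sugesti\xf3n"}
print(normalize_post(utf8_post))
print(normalize_post(latin1_post))
```

Once every field is normalized to text this way, it can be written to the UTF8 database through one code path.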
So, problem solved. I finally understand where I went wrong and how to fix it. I hope this helps you if you encounter similar problems with your own i18n efforts. Finally, I can't emphasize strongly enough: it's WORTH THE TIME to learn and grasp all this encoding stuff, and start every project with everything in UTF8.
Now that domains and URLs and email addresses can contain Arabic characters, etc., you really have no excuse to store anything in Latin1. The only things I still keep in Latin1 are MD5 hashes, private ENUMs and other such things that will never require an extended character set.