We recently localized some websites for Eastern European audiences (namely Polish, Czech, and Russian).
Displaying the web pages in the appropriate character sets is not the problem, but handling the feedback from native users is.
We are using several forms to generate feedback mails from our servers. The input should be converted into the most commonly used e-mail encodings for these languages.
1. Are there any statistics showing which encodings are predominant in which language? (e.g. is the typical Czech user more likely to use ISO Latin 2 (ISO-8859-2) or Windows CP 1250 for input? And furthermore, will he be able to display an e-mail in those character sets?)
2. Is there a way to analyse which encoding users have used when filling out the form? (Or are there even prefab modules for this? :-))
3. If the encoding can be detected correctly, how do we convert the text into the appropriate encoding for the generated e-mail? There is the convert_cyr_string function for Russian (Cyrillic), but what about the other charsets?
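On point 3, the actual re-encoding step is just a decode/encode round trip once you know (or assume) the source charset. Here's a small sketch in Python for illustration; in PHP the equivalent tools would be iconv() or mb_convert_encoding(), since convert_cyr_string() only handles Cyrillic charsets. The sample string is just made-up Czech feedback text.

```python
# Illustration: re-encoding Czech form input between the two common
# encodings. Assumes the source charset is already known or guessed.

def cp1250_to_latin2(data: bytes) -> bytes:
    """Re-encode Windows-1250 bytes as ISO-8859-2 (ISO Latin 2)."""
    return data.decode("cp1250").encode("iso-8859-2")

raw = "Děkujeme za Váš názor".encode("cp1250")  # simulated form input
converted = cp1250_to_latin2(raw)
print(converted.decode("iso-8859-2"))
```

The Czech repertoire exists in both charsets, so the round trip is lossless here; a decode error would be the signal that your charset guess was wrong.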
I've been playing with this myself. What I'm doing is using Unicode (UTF-8) so I don't have to worry about all the different character sets. It looks like most browsers post form data back in the character set the page was encoded in. When you send out the e-mail, just specify the content type as UTF-8.
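The whole approach is just two declarations: serve the form page as UTF-8 (so the browser posts back UTF-8), then stamp the outgoing mail as UTF-8 too. A minimal sketch of the mail side, using Python's standard library for illustration (in PHP you'd set the same Content-Type header by hand when calling mail()); the address and body text are placeholders:

```python
# Sketch: building a feedback e-mail with an explicit UTF-8 charset,
# assuming the submitted form data already arrived as UTF-8.
from email.mime.text import MIMEText

body = "Zpětná vazba: Děkujeme!"        # placeholder UTF-8 feedback text
msg = MIMEText(body, "plain", "utf-8")  # sets Content-Type charset=utf-8
msg["Subject"] = "Website feedback"
msg["To"] = "webmaster@example.com"     # placeholder address
print(msg["Content-Type"])
```

On the form side, serving the page with `Content-Type: text/html; charset=utf-8` (or adding accept-charset="utf-8" to the form tag) is what makes the browser-posts-back-in-the-page's-charset behaviour work in your favour.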
I looked around for a Unicode support chart but couldn't find one, so I'm not sure where the breaking point is in terms of browser support. It's always tough deciding how far back you're going to support something and whether it's worth it for that 2-4% of your users. Maybe it's more in your case, but I get a lot of international traffic and those ancient browsers account for about 4%. I often wonder how those poor souls still using Netscape 3/4 get around.
It's interesting to see how other sites are doing it. A good example is Google: everything there is UTF-8 as far as I can tell. Maybe if I were using an older browser it would be different.