agents:
IE from which version?
and how about Mozilla/Opera?
servers: how does MySQL support UTF-8 now?
editors: your favorite HTML/code editors, including WYSIWYG and text editors. How do they support UTF-8?
languages: PHP with mbstring? libiconv? mb regex?
This Usenet thread [groups.google.com] is an interesting discussion on the topic. That is a serious problem, and it multiplies in an interactive environment.
What I have decided about the whole thing is that UTF-8 creates more problems than it fixes at this point. If you do not have an interactive site such as a chat room or some type of forum, I would be wary of using it at this point. Default language problems, character sets, forms, and browser support make it a mess.
To even begin to make sense of it, I feel it would take a month to get close to feeling confident about using it on a site full time. The problem is that once you were confident it was working the way you thought it should, there's a very high probability that you'd be wrong. What the random user was seeing in the browser would not be what you'd intended, and you'd have no way of knowing. Why risk it?
I tried for several weeks to implement UTF-8 here. The problem is with older browsers and editors. The number of browsers returning high-ASCII characters (as single 8-bit bytes) when UTF-8 was requested was astounding. The problem is in deciding just what the user's browser was intending to send back.
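To give an idea of the kind of guessing a server-side script ends up doing, here is a rough sketch in Python (my choice for illustration, not something anyone in this thread is running). The function name decode_submitted is hypothetical, and the assumption that a non-UTF-8 submission is most likely Windows-1252 is exactly that, an assumption.

```python
def decode_submitted(raw: bytes) -> tuple[str, str]:
    """Guess the encoding of submitted form bytes.

    Returns (text, encoding_used). Hypothetical helper, not from any library.
    """
    try:
        # Strict decoding: bytes that are not well-formed UTF-8 raise an error.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    try:
        # Assumption: a non-UTF-8 submission probably came from a browser
        # using Windows-1252 (a common Western default).
        return raw.decode("cp1252"), "cp1252"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails.
        return raw.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    for sample in ("café".encode("utf-8"), "café".encode("cp1252")):
        print(sample, decode_submitted(sample))
```

Note that this is only a heuristic. Some byte sequences are valid in more than one encoding, so the script can decode "successfully" and still not recover what the user typed.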
Here is an excellent article on the complexities of dealing with UTF-8 forms:
[ppeph.gla.ac.uk...]
According to the HTML4.01 specification, the only characters that you are entitled to rely on in this situation are those of us-ascii, i.e. the 7-bit repertoire. Realistically, however, browsers and other client agents do not enforce this restriction, and will typically handle characters outside of that repertoire by applying the same %xx hex coding that they apply to unsafe characters of the us-ascii repertoire. But this is not unproblematical, as we will see. Nevertheless, as an author, this isn't under your control: readers can and will submit extended characters - there's nothing you can do to stop them - so your server-side scripts need to be able to do something with them.
That's for GET, but the same question remains with POST: what do you do with them? What data did the user really intend to send?
Throw that question into an ecommerce equation where credit card or personal data is requested. Hello!?
Whether that stems from your editor interpreting a page as non-backward compatible (and forcing an encoding), an older browser baulking at your charset choice, or forms that return in a different encoding than you sent, it's still a mess.
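To make the %xx ambiguity concrete, here is a small Python sketch using the standard urllib.parse module. The sample word "café" is just an illustration. It shows the same text arriving as two different query strings depending on the encoding the browser used, and what happens when the server guesses the wrong one.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

text = "café"

# The same word, percent-encoded by two hypothetical browsers.
as_utf8 = quote(text, encoding="utf-8")      # 'caf%C3%A9'
as_latin1 = quote(text, encoding="latin-1")  # 'caf%E9'

# The query string carries only bytes; nothing in it names the encoding.
raw = unquote_to_bytes(as_latin1)            # b'caf\xe9'

# Server guesses Latin-1 for data that was really UTF-8: classic mojibake.
wrong_a = unquote(as_utf8, encoding="latin-1")

# Server guesses UTF-8 for data that was really Latin-1: the lone byte
# 0xE9 is not valid UTF-8, so it becomes U+FFFD (replacement character).
wrong_b = unquote(as_latin1, encoding="utf-8", errors="replace")

if __name__ == "__main__":
    print(as_utf8, as_latin1, raw, wrong_a, wrong_b)
```

In the first wrong guess nothing even signals an error, which is the worrying case: the script gets a perfectly legal string that is not what the user typed.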
So, until the fog clears a bit more and we have tools for working with Unicode that don't require a degree in languages, the safest thing to do is stick with pure ASCII and pages with no charset specified.
There comes a time to learn a technology and a time to wait. Remember in '98 when all the W3C guys were running around screaming their fool heads off that CSS would take over the web in a year's time? It's only marginally worth learning even today.
I think the same is true for Unicode. If we started on a long discussion and help thread about Unicode right now, it would be a running conversation for the next several years.
But I agree it is a mess! I spent quite a bit of time trying to understand the whole thing, and it's still pretty confusing.
Any idea when the major problems will be worked out?
Chris - What was that one problem that you had?