Forum Moderators: coopster


UTF-8, ISO-8859-1, PHP and XHTML

and how do you make sure

         

ergophobe

8:04 pm on Dec 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I know this comes up from time to time, but I'm still trying to figure some stuff out. I use entirely Western European languages, so usually this all works out in my case, but I still want to understand it better.

Joel Spolsky says [joelonsoftware.com]


When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

Unfortunately, he never really gets back to PHP and mostly just rehashes a history of Unicode and a few elementary comments about declaring your character encoding in an HTML document.

I'm curious about two issues:

1. Interaction of PHP/MySQL (which default to Latin-1) and XHTML (which defaults to UTF-8 or UTF-16).

2. Handling of form data.

According to the PHP manual on Multi-Byte String functions and XML Parser functions

PHP is basically designed for ISO-8859-1...
The default source encoding used by PHP is ISO-8859-1.

The same is true for MySQL. According to
the manual [mysql.com]:

By default, MySQL uses the ISO-8859-1 (Latin1) character set with sorting according to Swedish/Finnish.

Of course, the XHTML spec [w3.org] is incompatible with this, in the sense that


Remember, however, that when the XML declaration is not included in a document, the document can only use the default character encodings UTF-8 or UTF-16.

Of course, the XML declaration will trigger quirks mode in IE, since it pushes the DOCTYPE to line 2, so my preference is usually to omit it. In theory that would mean that my documents must be in UTF-8 or UTF-16, while my database defaults to ISO 8859-1.

I wonder how many of you are serving up pages in ISO-8859-1 and how many in UTF-8. Is anyone using UTF-16? I assume anyone doing Asian languages must be using it, but they probably have an appropriate operating system.

I'm especially wondering with respect to user-input data from a form as I'm not really sure what happens there - if someone uses a word processor with a given encoding and pastes that text into a form in a page that specifies the encoding as UTF-8, won't it send this text as UTF-8, perhaps corrupting it? That's what I understand from a recent thread on WebmasterWorld [webmasterworld.com] and from Scott Reynen's article [randomchaos.com], which offers one solution. It's essentially the same solution that DrDoc settled on in this thread regarding use of Cyrillic characters [webmasterworld.com], that is, to convert everything to Unicode character entities before putting it in the database.
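In PHP, I imagine that entity-conversion step would look something like this (just a sketch, assuming the mbstring extension is available; the function name is my own invention):

```php
<?php
// One way to implement the "convert everything to Unicode entities" idea:
// turn every non-ASCII character into a &#NNNN; numeric entity before the
// text goes into the database. Requires the mbstring extension.
function entities_for_storage($text, $from_encoding = "UTF-8") {
    // Map everything above US-ASCII (0x80 and up) to numeric entities.
    $convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($text, $convmap, $from_encoding);
}

echo entities_for_storage("café");  // caf&#233;
```

The stored text is then pure ASCII, so the database's own charset no longer matters - at the cost of the file-size and readability problems below.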

Still it seems to me that the "convert everything to Unicode" approach poses a couple of problems.

One is file size and code readability as mentioned in Michael Glaesemann's comments on Jonathon Delacour's blog entry [weblog.delacour.net] on the subject, as well as other follow-up comments [weblog.delacour.net].

The other is that it raises the question - how do I know what I'm starting from? Do people test for encoding with iconv or the multi-byte functions? Otherwise it seems that you would have carefully encoded gibberish into UTF-8. That's wonderful - your gibberish will be perfectly preserved in your database and output exactly as you read it - as gibberish. Isn't that right?

Anyway, I'm not looking for a solution to a specific problem, just whatever comments/insights people have out there. How many people worry about it? How many people care?

Tom

jatar_k

7:44 pm on Dec 8, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Sorry ergophobe, I read this a few times and didn't really get around to responding.

I have no extensive wisdom on the subject really but ...

We are presently launching our software in China and it is PHP/Oracle. I can't really respond to the MySQL portion of the question, but this system will be separate from our North American English system, so it will be customized for Chinese.

For Oracle we changed the character set and all was well.

For PHP we have only had problems with form input testing, so far. The two functions that like to cause problems are str_replace and stripslashes. This problem hasn't been fully worked out as we are still deploying the system, but that is the only real problem.

Deploying all of them within the same system would start causing more problems though. I would think that form data testing would have to be customized to each charset. Essentially we need to understand the data we are error checking and the various charsets will be different.

We also found that the HTML pages had to be saved as UTF-8 on our English systems to keep them working, regardless of charsets or methods of entering the chars.

ergophobe

10:39 pm on Dec 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Interesting.

Have you tried using the Multi-Byte string functions (which have no equivalent to str_replace, but do have mb_ereg_replace)?
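I mean something like this (an untested sketch; assumes mbstring is compiled in):

```php
<?php
// Sketch: doing a replacement in a multi-byte-safe way with the mbstring
// regex functions instead of str_replace.
mb_regex_encoding("UTF-8");      // tell the regex engine what encoding the text is in
mb_internal_encoding("UTF-8");

$input = "foo bar foo";
echo mb_ereg_replace("foo", "baz", $input);  // baz bar baz
```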

Tom

jatar_k

11:42 pm on Dec 8, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Actually, I think the lads over there removed it altogether for now. I have not completely spec'ed out possible replacements for it yet. It is really difficult when one Cantonese non-programmer types it, I test it, and then he reads it back to me.

It seems to be very difficult to debug output you can't read or recognize. The problem is that when you get it wrong, it often changes the right words into wrong words - but still words.

It's all Chinese to me. ;)

ergophobe

4:38 pm on Dec 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It seems to me that the crux of the problem is, as you say, form input. Once I have the text, I can always try to use one of the functions in PHP to sniff the encoding, but by the time PHP gets it, it's almost certainly encoded as whatever the web page is encoded in (as you said earlier). It seems like it's the step of getting the data from the user's word processor to PHP (or my database or word processor or whatever) that makes everything complicated and iffy.

The only problem I've had really is with punctuation marks and special characters, but it's annoying. For example, sometimes this <<style of quote mark>> will get converted to +this* and things like that. I have only a few characters that are not in ISO-8859-1 (or at least that don't map the same in multiple character sets) that crop up in my stuff. It's kind of weird. The same person will paste text into the database and sometimes it will be all weird and sometimes it looks fine. I guess it probably depends on the encoding that a particular piece of software uses.

Of course, if I do something myself, I can use unicode decimal entities, which should work pretty well in browsers at least. The problem is, as I guess you've seen, that if you can't control the encoding being used to send you stuff, it's pretty hard to know how to convert to entities. Like I said, you just end up with cross-platform garbage instead of encoding-specific garbage!

As for the Chinese, everything I "know" how to say is met with gales of laughter from my in-laws. I take satisfaction in knowing that I can brighten their day without being able to actually communicate with the older generation :-)

Tom

davidpbrown

6:34 pm on Dec 9, 2003 (gmt 0)

10+ Year Member



control the encoding being used to send you stuff

I think the accept-charset [w3.org] attribute of the <form> provides this control. As I understand it, the form doesn't have to use the page's charset.
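For example (a sketch - the action URL and field name are made up, and how well browsers of this era honour the attribute varies):

```html
<!-- Ask the browser to submit this form's data as UTF-8, regardless of
     the charset of the page the form sits in. -->
<form action="save.php" method="post" accept-charset="UTF-8">
  <textarea name="bio"></textarea>
  <input type="submit" />
</form>
```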

Of course, the XML declaration will trigger quirks mode in IE, since it pushes the DOCTYPE to line 2, so my preference is usually to omit it. In theory that would mean that my documents must be in UTF-8 or UTF-16, while my database defaults to ISO 8859-1.

Could you not put the charset in the document's header instead? Indeed, this way seems almost preferred..
From end of 3.1 [w3.org ]
"When it is difficult to specify an explicit charset parameter through a higher-level protocol, authors SHOULD include the XML declaration.."

Although I take your point re the XHTML spec, I'd always taken it that XHTML isn't treated as XHTML unless the application/xhtml+xml mime header [xml.com] is present.

I do see [w3.org ] suggesting docs with text/html are still XHTML, but maybe only in a theoretical sense. Maybe when HTML browsers are involved, the documents are for all intents and purposes HTML, even if they do carry an odd-looking header.. therein lies a clear distinction for browsers to use?

Certainly I've had no trouble serving bland XHTML, or even XHTML 1.1 with MIME headers, as ISO 8859-1, though I can't suggest where in the XHTML specs this is allowed. Reading the references you suggest has me confused as to when (if ever) the document is not XHTML and is therefore being interpreted as HTML, since it works.

For the record, as it might help, I use this for most of my documents, although increasingly with utf-8. Currently I've found only Opera to effectively handle the mime type + XHTML <?xml-stylesheet>
<?php
// Serve XHTML 1.1 as application/xhtml+xml to browsers that say they
// accept it; fall back to XHTML 1.0 Strict as text/html for the rest.
if (stristr($_SERVER["HTTP_ACCEPT"], "application/xhtml+xml")) {
    $x = "XML";
    header("Content-Type: application/xhtml+xml; charset=iso-8859-1");
    echo '<?xml version="1.0" encoding="iso-8859-1"?>';
    echo "\n";
    echo '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">';
    echo "\n";
} else {
    $x = "normal";
    header("Content-Type: text/html; charset=iso-8859-1");
    echo '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">';
    echo "\n";
}
?>

My impression is that, if it's not XHTML 1.1, then there is no clear benefit to including the application/xhtml+xml MIME header.

I'm going to post while this is ~clear as I'm re-reading your post and getting confused again.. :)

ergophobe

7:24 pm on Dec 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



David, Great info!


I'm going to post while this is ~clear as I'm re-reading your post and getting confused again.. :)

Sorry to be so confusing. I'm mixing a few different issues and I am confused about how they really interrelate, so it's a bit hard for me to be more clear.

I didn't know about the accept-charset attribute, but I'm perhaps even more confused now.

If I set the attribute to
accept-charset="ISO-8859-1, UTF-8"

What happens when someone pastes UTF-8 text into a form and I want to put it in MySQL (which defaults to ISO-8859-1)? In other words, how long will the UTF-8 encoding be preserved?

If inserted straight into a MySQL DB and pulled back out and put on a page that sets the encoding to UTF-8, what will it look like? In other words will it be ISO-8859-1 encoded or still UTF-8 that is preserved, but just looks funny when viewed with a MySQL client?

Alternatively, what happens if someone pastes in text that is in EUC-JP or some other charset not in your accept-charset list?


Certainly I've had no trouble serving bland XHTML, or even XHTML 1.1 with MIME headers, as ISO 8859-1,

Sure, but you probably have control over your character encoding and are accepting form data from users who are using characters that are entirely within the ISO-8859-1 charset (if not US-ASCII).

I'm thinking about cases where you are building a collaborative resource (in this particular case, a biographical database that could have users from many languages). User 1 puts info into a form using EUC-JP and user 2 puts info in using ISO-8859-7. This info is pulled from a DB (whose storage system defaults to ISO-8859-1) and I serve it up on a page whose declared charset is UTF-8. So when user 1 looks at his EUC-JP text, doesn't he just see garbage?

Tom

davidpbrown

7:41 pm on Dec 9, 2003 (gmt 0)

10+ Year Member



Ah, I didn't think your post confusing.. quite the contrary just overloaded my brain with new info. :)

What happens when someone pastes UTF-8 text into a form and I want to put it in MySQL (which defaults to ISO-8859-1)? In other words, how long will the UTF-8 encoding be preserved?

If you want ISO-8859-1 then wouldn't it be better to suggest accept-charset="ISO-8859-1"?
Then the user, I expect, would see his UTF-8 get mashed through the filter that is ISO-8859-1.. the ISO-8859-1 character repertoire being a subset of Unicode's.

Alternatively, what happens if someone pastes in text that is in EUC-JP or some other charset not in your accept-charset list?

#4 in another thread [webmasterworld.com] has more of my understanding about what happens when you dump encodings into things which accept others.
(Post #2 has a link to MySQL Unicode support [mysql.com])

Maybe not relevant to you, but the question I don't have an answer to is how POST declares which of the encodings has been used when they are distinct but potentially confusable.. no ideas on that, but I would like to know..

I don't know how the Japanese, for instance, switch easily between encodings.. there may be more on unicode.org re how similar other character sets are. It may be that Japanese is naturally 16-bit and can therefore be another subset of Unicode in the same way as ASCII.

brain over..

davidpbrown

8:00 pm on Dec 9, 2003 (gmt 0)

10+ Year Member



Reading some of this may also help.. (I haven't)
Problems on Interoperativity between Unicode and CJK Local Encodings [debian.or.jp]
To use Unicode for daily life, there are three major problems for Japanese users. One is Han Unification. The second is mapping problem. The third is width problem...

I should think if MySQL can handle Unicode it could be configured for other 16-bit language encodings.. if that is what Japanese encodings are.. I'm guessing.

ergophobe

8:17 pm on Dec 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the link to the thread and the editor - I've spent a fair bit of time at Alan Wood's site already trying to straighten this stuff out in my head. I still have to follow up on the Unicode/MySQL link, but I thought Unicode support had to be compiled in and may not be available if you aren't compiling your own MySQL. Anyway, that's another topic.


If you want ISO-8859-1 then wouldn't it be better to suggest accept-charset="ISO-8859-1"?
Then the user, I expect would see his UTF-8 get mashed through the filter that is ISO-8859-1.. ISO-8859-1 being a subset of UTF-8.

Actually, I originally wrote my post with an ISO-8859-1 to UTF conversion example, and realized that should never be a problem and reversed it. The point is that I don't necessarily want ISO-8859-1, but certain pieces of software in the chain may.

Let's say the UTF-8 text includes characters not in the ISO-8859-1 set and it has been "mashed through the filter that is ISO-8859-1." I then use a hexadecimal editor to look at the actual numeric representation of the text (bits and bytes, not how those get mapped to characters by the encoding). In binary (or hex) will it look just like it does before that event?

From what you say, I gather that it will, and that one half of my question is answered. I get something that's in EUC-JP, dump it into the DB. It looks like gibberish, but when I pull it out again and serve it as EUC-JP to someone with Japanese fonts installed, it looks okay. If it goes on a page that's declared as UTF-8, I will have to check for the encoding and, if necessary, convert from EUC-JP to UTF-8 and hope for the best. If I don't do that and just serve up the EUC-JP as UTF-8, it can look like gibberish or, as I understand, look fine but have a different meaning since some codes are valid in both encodings, but map to different characters/words. If I serve it up as ISO-8859-1, I mostly serve a page of boxes and weird glyphs, unless the client end overrides my setting and figures out that it has Japanese on there, but of course I don't want to count on that.
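The conversion step I have in mind would be something like this (a sketch; whether iconv actually knows EUC-JP depends on the iconv library PHP was built against, and the function name is my own):

```php
<?php
// Sketch: convert text we believe to be EUC-JP into UTF-8 before putting
// it on a page declared as UTF-8.
function euc_jp_to_utf8($text) {
    $converted = iconv("EUC-JP", "UTF-8", $text);
    // iconv() returns false if the input contains byte sequences that are
    // not valid EUC-JP - in that case keep the original rather than lose it.
    return ($converted === false) ? $text : $converted;
}

// Round trip as a sanity check: UTF-8 -> EUC-JP -> UTF-8.
$utf8 = "日本語";
echo euc_jp_to_utf8(iconv("UTF-8", "EUC-JP", $utf8));  // 日本語
```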


Maybe not relevant to you, but the question I don't have an answer to is how POST works to declare which of the encodings have been used if they are distinct but potentially confusing..

Definitely relevant - that's the other half of my question :-) Perhaps I should just try some testing....

Incidentally, and a little OT, a friend who does a blog in Japanese has defaulted to the failsafe option - he builds his page as a single Photoshop image and saves it as one big JPEG.

Tom

ergophobe

8:20 pm on Dec 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks also for the CJK reference. I'm just using Japanese as an example because

1. it's one of the more difficult cases - if you can solve Chinese and Japanese, you know your stuff!

2. it seems to be the one that people know about because you really can't just let these issues slide.

In fact, I'm mostly concerned with European characters not in the ISO-8859-1 set.

As for MySQL, I appreciate that heads-up as well. I see that the support is *way* better in version 4.1. I'm still running 3.x something, since it still seems to be the most commonly available version. I know you can compile support for Unicode into it, but that also is not that common, and I usually don't run my own server, so I can't count on it.

Tom

davidpbrown

8:37 pm on Dec 9, 2003 (gmt 0)

10+ Year Member



Let's say the UTF-8 text includes characters not in the ISO-8859-1 set and it has been "mashed through the filter that is ISO-8859-1." I then use a hexadecimal editor to look at the actual numeric representation of the text (bits and bytes, not how those get mapped to characters by the encoding). In binary (or hex) will it look just like it does before that event?

From what you say, I gather that it will..

I'm not sure about that.

If you can mash data up and be rough with the likes of EUC-JP, that would be a great help, I guess. It may be that MySQL etc. are robust, but my take on it is that the information can easily be compromised.. certainly, in my own simple way playing on Win98, editors often replace characters they don't understand with ? or similar, but I suppose there's no need for that and maybe software can retain the information.

I've never been able to reverse a confused text.. it would be interesting to know it's possible. Certainly I've spotted wrongly encoded or wrongly declared pages/emails and corrected the encoding interpretation, but that may be different from manipulating the text itself.

ergophobe

12:41 am on Dec 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Win98 editors often replace characters they don't understand with ?

True enough, but is that because the underlying data has been replaced with an ASCII question mark, or because the editor is not smart enough to render data in that encoding? Normally it's the latter, but perhaps once you save a file, it's lost.

Next time I come across this, I'll try to do some tests and report back.

Tom

scott reynen

11:08 pm on Jan 7, 2004 (gmt 0)

10+ Year Member



I realize this topic's a bit old, but I just noticed it and thought I might help.

You're certainly right that Unicode character entities take up more space. I didn't mean to suggest the character entities would be ideal for database storage, but rather for output. I use UTF-8 encoding for database storage, which makes the text as small as it can possibly get. The only problem with this is that I need to remember which tables/records are UTF-8 and which are ASCII, so I can display them properly when I output the text.

Is anyone using UTF-16? I assume anyone doing Asian languages must be using it, but they probably have an appropriate operating system.

UTF-8 and UTF-16 can both represent any language, even the more complex Asian languages. I use UTF-8 to deal with Japanese text.

how do I know what I'm starting from?

You must have control over the input form(s), and specify an encoding there. If you don't know what encodings you're being sent, it's not going to work. There's no way to detect this, because a given set of bytes could translate to different and valid characters in different encodings.
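For example, the same two bytes are perfectly valid in more than one encoding, so no test can tell you which was meant (a sketch using mbstring):

```php
<?php
// The two bytes 0xC3 0xA9 are "é" if read as UTF-8, but "Ã©" if read as
// ISO-8859-1 - and both readings are valid, so no validity check can
// tell you which one the sender intended.
$bytes = "\xC3\xA9";
var_dump(mb_check_encoding($bytes, "UTF-8"));       // bool(true)
var_dump(mb_check_encoding($bytes, "ISO-8859-1"));  // bool(true)
```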

ergophobe

12:32 am on Jan 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Scott,

I didn't realize you were a reader here. Can't remember how I found your article.

I didn't mean to suggest the character entities would be ideal for database storage, but rather for output.

Sorry, I meant that was the solution proposed in the WebmasterWorld thread I mentioned. I should have said the solution was "related" or something rather than "essentially the same".

UTF-8 and UTF-16 can both represent any language

Thanks. I should have known better.

You must have control over the input form(s), and specify an encoding

That's the real sticking point. One can, of course, set the encoding the forms *expect*, but one can't be certain that every user will be pasting in text using that same character encoding. This only leads to the odd character here and there in my case, but it must be a nightmare in Japanese when someone pastes text from a word processor using one encoding into a form that expects text in a different encoding.

Thanks for the input.

[edited by: ergophobe at 3:58 pm (utc) on Sep. 24, 2004]

scott reynen

11:42 pm on Jan 28, 2004 (gmt 0)

10+ Year Member



I didn't realize you were a reader here.

I've dropped by from time to time when it comes up in search results, but I only discovered this thread because of my referrer logs.

One can, of course, set the encoding the forms *expect*, but one can't be certain that every user will be pasting text in using that same character encoding.

When you set a character encoding for an HTML page, that tells the browser to encode *all* input characters that way *before* sending them on to the server. If you set an input page to UTF-8 encoding, the text will be UTF-8 encoded when it gets to the server. If it's not, something is broken.
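If you want to verify that on the server anyway (a sketch; the field name is made up, and it assumes mbstring):

```php
<?php
// Sketch: verify that what arrived really is valid UTF-8 before trusting
// it, since "something is broken" does happen in the wild.
$bio = isset($_POST["bio"]) ? $_POST["bio"] : "";

if (!mb_check_encoding($bio, "UTF-8")) {
    // Not valid UTF-8 - reject it (or attempt a best-effort conversion).
    die("Form data was not valid UTF-8.");
}
```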

ergophobe

5:12 pm on Jan 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Check.

I guess one problem I have is client side - someone pastes text into an input area and can see that it is a bit screwy, but just leaves it as such. I suppose there's nothing I could ever do about that except send as UTF-8 and hope for the best.

Tom