Forum Moderators: coopster
Joel Spolsky says [joelonsoftware.com]
When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.
Unfortunately, he never really gets back to PHP and mostly just rehashes a history of Unicode and a few elementary comments about declaring your character encoding in an HTML document.
I'm curious about two issues:
1. Interaction of PHP/MySQL (which default to Latin-1) and XHTML (which defaults to UTF-8 or UTF-16).
2. Handling of form data.
According to the PHP manual on Multi-Byte String functions and XML Parser functions
PHP is basically designed for ISO-8859-1...
The default source encoding used by PHP is ISO-8859-1.
The same is true for MySQL. According to
the manual [mysql.com]:
By default, MySQL uses the ISO-8859-1 (Latin1) character set with sorting according to Swedish/Finnish.
Of course, the XHTML spec [w3.org] is incompatible with this, in the sense that
Remember, however, that when the XML declaration is not included in a document, the document can only use the default character encodings UTF-8 or UTF-16.
Of course, the XML declaration will trigger quirks mode in IE, since it pushes the DOCTYPE to line 2, so my preference is usually to omit it. In theory that would mean that my documents must be in UTF-8 or UTF-16, while my database defaults to ISO 8859-1.
I wonder how many of you are serving up pages in ISO-8859-1 and how many in UTF-8. Is anyone using UTF-16? I assume anyone doing Asian languages must be using it, but they probably have an appropriate operating system.
I'm especially wondering with respect to user-input data from a form as I'm not really sure what happens there - if someone uses a word processor with a given encoding and pastes that text into a form in a page that specifies the encoding as UTF-8, won't it send this text as UTF-8, perhaps corrupting it? That's what I understand from a recent thread on WebmasterWorld [webmasterworld.com] and from Scott Reynen's article [randomchaos.com], which offers one solution. It's essentially the same solution that DrDoc settled on in this thread regarding use of cyrillic characters [webmasterworld.com], that is to convert everything to unicode character entities before putting it in the database.
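The entity-conversion approach described there can be sketched in a few lines, assuming the input is already known to be valid UTF-8 (the function name is just illustrative):

```php
<?php
// Map every code point above U+007F to a &#NNNN; decimal entity,
// leaving plain ASCII untouched. Note: if the input is *not* valid
// UTF-8, this preserves the wrong bytes just as faithfully.
function utf8_to_entities($text)
{
    // convmap: range start, range end, offset, mask
    $convmap = array(0x80, 0x10FFFF, 0, 0xFFFFFF);
    return mb_encode_numericentity($text, $convmap, 'UTF-8');
}

echo utf8_to_entities("café"); // caf&#233;
```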
Still it seems to me that the "convert everything to unicode" poses a couple of problems.
One is file size and code readability as mentioned in Michael Glaesemann's comments on Jonathon Delacour's blog entry [weblog.delacour.net] on the subject, as well as other follow-up comments [weblog.delacour.net].
The other is that it raises the question - how do I know what I'm starting from? Do people test for encoding with iconv or the multi-byte functions? Otherwise it seems that you would have carefully encoded gibberish into UTF-8. That's wonderful - your gibberish will be perfectly preserved in your database and output exactly as you read it - as gibberish. Isn't that right?
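For what it's worth, the multi-byte extension can make a guess - though it is only a guess, since many byte sequences are valid in several encodings at once. A sketch:

```php
<?php
// mb_detect_encoding() checks the candidate encodings in order (strict
// mode on) and returns the first one the bytes are valid in - which is
// a heuristic, not a guarantee.
$input = "Füße"; // stands in for text arriving in an unknown encoding

$guess = mb_detect_encoding($input, array('UTF-8', 'ISO-8859-1', 'EUC-JP'), true);

if ($guess !== false && $guess !== 'UTF-8') {
    // Normalize to UTF-8 so the database holds one consistent encoding.
    $input = mb_convert_encoding($input, 'UTF-8', $guess);
}
```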
Anyway, I'm not looking for a solution to a specific problem, just whatever comments/insights people out there might have. How many people worry about it? How many people care?
Tom
I have no extensive wisdom on the subject really but ...
We are presently launching our software in China and it is PHP/Oracle. I can't really respond to the MySQL portion of the question, but this system will be separate from our North American English system, so it will be customized for Chinese.
For Oracle we changed the character set and all was well.
For PHP we have only had problems with form input testing so far. The two functions that like to cause problems are str_replace and stripslashes. This problem hasn't been fully worked out, as we are still deploying the system, but that is the only real problem.
Deploying all of them within the same system would start causing more problems though. I would think that form data testing would have to be customized to each charset. Essentially we need to understand the data we are error checking and the various charsets will be different.
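One concrete reason byte-oriented string functions misbehave here: they count bytes, not characters. A small sketch of the difference, assuming UTF-8 input:

```php
<?php
// strlen() counts bytes; mb_strlen() counts characters in the stated
// encoding. Functions like str_replace() can likewise split a
// multi-byte character when a needle matches part of its byte sequence.
$name = "李小龙"; // three CJK characters, three bytes each in UTF-8

echo strlen($name);             // 9 (bytes)
echo "\n";
echo mb_strlen($name, 'UTF-8'); // 3 (characters)
```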
We also found that the html pages had to be saved as utf-8 on our english systems to keep them working regardless of charsets or methods of entering the chars.
It seems to be very difficult to debug output you can't read or recognize. The problem is that when you get it wrong, it often changes the right words into wrong words - but still words.
It's all Chinese to me. ;)
The only problem I've had really is with punctuation marks and special characters, but it's annoying. For example, sometimes this <<style of quote mark>> will get converted to +this* and things like that. I have only a few characters that are not in ISO-8859-1 (or at least that don't map the same in multiple character sets) that crop up in my stuff. It's kind of weird. The same person will paste text into the database and sometimes it will be all weird and sometimes it looks fine. I guess it probably depends on the encoding that a particular piece of software uses.
Of course, if I do something myself, I can use unicode decimal entities, which should work pretty well in browsers at least. The problem is, as I guess you've seen, that if you can't control the encoding being used to send you stuff, it's pretty hard to know how to convert to entities. Like I said, you just end up with cross-platform garbage instead of encoding-specific garbage!
As for the Chinese, everything I "know" how to say is met with gales of laughter from my in-laws. I take satisfaction in knowing that I can brighten their day without being able to actually communicate with the older generation :-)
Tom
control the encoding being used to send you stuff
Of course, the XML declaration will trigger quirks mode in IE, since it pushes the DOCTYPE to line 2, so my preference is usually to omit it. In theory that would mean that my documents must be in UTF-8 or UTF-16, while my database defaults to ISO 8859-1.
Although I take your point re the XHTML spec, I'd always taken it that XHTML isn't treated as XHTML unless the application/xhtml+xml mime header [xml.com] is present.
I do see [w3.org] suggesting docs with text/html are still XHTML, but maybe only in a theoretical sense. Maybe when HTML browsers are involved, the documents are for all intents and purposes HTML, even if they do carry an odd-looking header.. therein lies a clear distinction for browsers to use?
Certainly I've had no trouble serving plain XHTML, or even XHTML 1.1 with MIME headers, as ISO 8859-1, though I can't suggest where in the XHTML specs this is allowed. Reading those references you suggest has me confused as to when/if the document is not XHTML and is therefore being interpreted as HTML - since it works.
For the record, as it might help, I use this for most of my documents, although increasingly with utf-8. Currently I've found only Opera to effectively handle the mime type + XHTML <?xml-stylesheet>
<?php
if (stristr($_SERVER["HTTP_ACCEPT"], "application/xhtml+xml")) {
    $x = "XML";
    header("Content-Type: application/xhtml+xml; charset=iso-8859-1");
    echo '<?xml version="1.0" encoding="iso-8859-1"?>' . "\n";
    echo '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">' . "\n";
} else {
    $x = "normal";
    header("Content-Type: text/html; charset=iso-8859-1");
    echo '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">' . "\n";
}
?>
My impression is that, if it's not XHTML1.1 then there is no clear benefit to including an xml-xhtml mime header.
I'm going to post while this is ~clear as I'm re-reading your post and getting confused again.. :)
I'm going to post while this is ~clear as I'm re-reading your post and getting confused again.. :)
Sorry to be so confusing. I'm mixing a few different issues and I am confused about how they really interrelate, so it's a bit hard for me to be more clear.
I didn't know about the accept-charset attribute, but I'm perhaps even more confused now.
If I set the attribute to
accept-charset="ISO-8859-1, UTF-8"
What happens when someone pastes UTF-8 text into a form and I want to put it in MySQL (which defaults to ISO-8859-1)? In other words, how long will the UTF-8 encoding be preserved?
If inserted straight into a MySQL DB and pulled back out and put on a page that sets the encoding to UTF-8, what will it look like? In other words will it be ISO-8859-1 encoded or still UTF-8 that is preserved, but just looks funny when viewed with a MySQL client?
Alternatively, what happens if someone pastes in text that is in EUC-JP or some other charset not in your accept-charset list?
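One defensive possibility - a sketch only, and $_POST['bio'] is a made-up field - is to force whatever arrived through iconv() into the encoding the database expects, rather than storing it blind:

```php
<?php
// //TRANSLIT approximates characters the target set lacks;
// //IGNORE would silently drop them instead.
$raw = isset($_POST['bio']) ? $_POST['bio'] : '';

$latin1 = @iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $raw);

if ($latin1 === false) {
    // iconv() returns false on an illegal input sequence - i.e. the
    // data was not the UTF-8 the form asked for - so we can reject it
    // here instead of storing mojibake.
    $latin1 = '';
}
```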
Certainly I've had no trouble serving bland XHTML, or even XHTML1.1 with mime headers, as ISO 8859-1,
Sure, but you probably have control over your character encoding and are accepting form data from users who are using characters that are entirely within the ISO-8859-1 charset (if not US-ASCII).
I'm thinking about cases where you are building a collaborative resource (in this particular case, a biographical database that could have users from many languages). User 1 puts info into a form using EUC-JP and user 2 puts info in using ISO-8859-7. This info is pulled from a DB (whose storage system defaults to ISO-8859-1) and I serve it up on a page whose declared charset is UTF-8. So when user 1 looks at his EUC-JP text, doesn't he just see garbage?
Tom
What happens when someone pastes UTF-8 text into a form and I want to put it in MySQL (which defaults to ISO-8859-1)? In other words, how long will the UTF-8 encoding be preserved?
If you want ISO-8859-1 then wouldn't it be better to suggest accept-charset="ISO-8859-1"?
Then the user, I expect, would see his UTF-8 get mashed through the filter that is ISO-8859-1.. ISO-8859-1's characters all being in Unicode (though only the ASCII range shares the same bytes with UTF-8).
Alternatively, what happens if someone pastes in text that is in EUC-JP or some other charset not in your accept-charset list?
#4 in another thread [webmasterworld.com] has more of my understanding about what happens when you dump encodings into things which accept others.
(Post #2 has a link to MySQL Unicode support [mysql.com])
Maybe not relevant to you, but the question I don't have an answer to is how POST works to declare which of the encodings have been used if they are distinct but potentially confusing.. no ideas on that but would like to know..
I don't know how the Japanese for instance switch easily between encodings.. there may be more on unicode.org re how similar other character sets are. It may be that Japanese is naturally 16 bit and can therefore be another subset of Unicode in the same way as ASCII.
brain over..
I should think if MySQL can handle Unicode it could be configured for other 16-bit language encodings.. if that is what Japanese encodings are.. I'm guessing.
If you want ISO-8859-1 then wouldn't it be better to suggest accept-charset="ISO-8859-1"?
Then the user, I expect, would see his UTF-8 get mashed through the filter that is ISO-8859-1.. ISO-8859-1's characters all being in Unicode (though only the ASCII range shares the same bytes with UTF-8).
Actually, I originally wrote my post with an ISO-8859-1 to UTF conversion example, and realized that should never be a problem and reversed it. The point is that I don't necessarily want ISO-8859-1, but certain pieces of software in the chain may.
Let's say the UTF-8 text includes characters not in the ISO-8859-1 set and it has been "mashed through the filter that is ISO-8859-1." I then use a hexadecimal editor to look at the actual numeric representation of the text (bits and bytes, not how those get mapped to characters by the encoding). In binary (or hex) will it look just like it does before that event?
From what you say, I gather that it will, and that one half of my question is answered. I get something that's in EUC-JP, dump it into the DB. It looks like gibberish, but when I pull it out again and serve it as EUC-JP to someone with Japanese fonts installed, it looks okay. If it goes on a page that's declared as UTF-8, I will have to check for the encoding and, if necessary, convert from EUC-JP to UTF-8 and hope for the best. If I don't do that and just serve up the EUC-JP as UTF-8, it can look like gibberish or, as I understand, look fine but have a different meaning since some codes are valid in both encodings, but map to different characters/words. If I serve it up as ISO-8859-1, I mostly serve a page of boxes and weird glyphs, unless the client end overrides my setting and figures out that it has Japanese on there, but of course I don't want to count on that.
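The conversion step in that pipeline, sketched with made-up data (the EUC-JP bytes below encode the single character 日):

```php
<?php
// Text stored as EUC-JP has to be re-encoded before it can appear on a
// page declared as UTF-8; $stored stands in for a database column value
// whose encoding we have recorded (or guessed) to be EUC-JP.
$stored = "\xC6\xFC"; // EUC-JP bytes for "日"

header('Content-Type: text/html; charset=utf-8');
echo mb_convert_encoding($stored, 'UTF-8', 'EUC-JP'); // 日
```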
Maybe not relevant to you, but the question I don't have an answer to is how POST works to declare which of the encodings have been used if they are distinct but potentially confusing..
Definitely relevant - that's the other half of my question :-) Perhaps I should just try some testing....
Incidentally, and a little OT, a friend of mine who does a blog in Japanese has defaulted to the failsafe option - he builds his page as a single Photoshop image and saves it as one big JPEG.
Tom
1. it's one of the more difficult cases - if you can solve Chinese and Japanese, you know your stuff!
2. it seems to be the one that people know about because you really can't just let these issues slide.
In fact, I'm mostly concerned with European characters not in the ISO-8859-1 set.
As for MySQL, I appreciate that heads up as well. I see that the support is *way* better in version 4.1. I'm still running 3.x something since it still seems to be the most commonly available version. I know you can compile support for Unicode into it, but that also is not that common and I usually don't run my own server, so I can't count on it.
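For the record, on MySQL 4.1+ the connection encoding can be declared explicitly so the server converts between the column charset and the client. A sketch with made-up credentials, using the mysql_* extension current at the time (the later mysqli equivalent is mysqli_set_charset()):

```php
<?php
// Tell MySQL 4.1+ what encoding this connection sends and expects back.
$conn = mysql_connect('localhost', 'user', 'secret');
mysql_query("SET NAMES 'utf8'", $conn);
```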
Tom
Let's say the UTF-8 text includes characters not in the ISO-8859-1 set and it has been "mashed through the filter that is ISO-8859-1." I then use a hexadecimal editor to look at the actual numeric representation of the text (bits and bytes, not how those get mapped to characters by the encoding). In binary (or hex) will it look just like it does before that event? From what you say, I gather that it will..
I'm not sure about that.
If you can mash data up, and be rough with the likes of EUC-JP, that would be a great help I guess. It may be that MySQL etc. are robust, but my take on it is that the information can be easily compromised.. certainly in my own simple way of playing on Win98, editors often replace characters they don't understand with '?' or similar, but I suppose there's no need for that and maybe software can retain the information.
I've never been able to reverse a confused text.. it would be interesting to know if it's possible. Certainly I've spotted unencoded/wrongly encoded pages/emails and corrected the encoding interpretation, but that may be different from manipulating the text itself.
Win98 editors often replace characters they don't understand with '?'
True enough, but is that because the underlying data has been replaced with an ASCII question mark, or because the editor is not smart enough to render data in that encoding? Normally it's the latter, but perhaps once you save a file, it's lost.
Next time I come across this, I'll try to do some tests and report back.
Tom
You're certainly right that Unicode character entities take up more space. I didn't mean to suggest the character entities would be ideal for database storage, but rather for output. I use UTF-8 encoding for database storage, which makes the text as small as it can possibly get. The only problem with this is that I need to remember which tables/records are UTF-8 and which are ASCII, so I can display them properly when I output the text.
Is anyone using UTF-16? I assume anyone doing Asian languages must be using it, but they probably have an appropriate operating system.
UTF-8 and UTF-16 can both represent any language, even the more complex Asian languages. I use UTF-8 to deal with Japanese text.
how do I know what I'm starting from?
You must have control over the input form(s), and specify an encoding there. If you don't know what encodings you're being sent, it's not going to work. There's no way to detect this, because a given set of bytes could translate to different and valid characters in different encodings.
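Declaring it in both the response header and the form itself might look like this (a sketch; save.php and the field name are made up):

```php
<?php
// The charset the page is served with is what the browser will use to
// encode the fields it posts back; accept-charset restates it on the
// form element itself.
header('Content-Type: text/html; charset=utf-8');
?>
<form method="post" action="save.php" accept-charset="UTF-8">
  <textarea name="bio"></textarea>
  <input type="submit" />
</form>
```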
I didn't realize you were a reader here. Can't remember how I found your article.
I didn't mean to suggest the character entities would be ideal for database storage, but rather for output.
Sorry, I meant that was the solution proposed in the WebmasterWorld thread I mentioned. I should have said the solution was "related" or something rather than "essentially the same".
UTF-8 and UTF-16 can both represent any language
Thanks. I should have known better.
You must have control over the input form(s), and specify an encoding
That's the real sticking point. One can, of course, set the encoding the forms *expect*, but one can't be certain that every user will be pasting text in using that same character encoding. This only leads to the odd character here and there in my case, but it must be a nightmare in Japanese when someone pastes text from a word processor using one encoding into a form that expects text in a different encoding.
Thanks for the input.
[edited by: ergophobe at 3:58 pm (utc) on Sep. 24, 2004]
I didn't realize you were a reader here.
I've dropped by from time to time when it comes up in search results, but I only discovered this thread because of my referrer logs.
One can, of course, set the encoding the forms *expect*, but one can't be certain that every user will be pasting text in using that same character encoding.
When you set a character encoding for an HTML page, that tells the browser to encode *all* input characters that way *before* sending them on to the server. If you set an input page to UTF-8 encoding, the text will be UTF-8 encoded when it gets to the server. If it's not, something is broken.
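Even so, it costs little to verify on arrival rather than trust it - a sketch, with a hypothetical field name:

```php
<?php
// If the form page was served as UTF-8, the POSTed text should already
// be UTF-8; mb_check_encoding() catches the case where something
// upstream is broken.
$text = isset($_POST['comment']) ? $_POST['comment'] : '';

if (!mb_check_encoding($text, 'UTF-8')) {
    // Reject rather than store mojibake.
    die('Form data was not valid UTF-8.');
}
```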