Forum Moderators: coopster


UTF-8, ISO-8859-1, PHP and XHTML

and how do you make sure

         

ergophobe

8:04 pm on Dec 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I know this comes up from time to time, but I'm still trying to figure some stuff out. I use entirely Western European languages, so usually this all works out in my case, but I still want to understand it better.

Joel Spolsky says [joelonsoftware.com]


When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

Unfortunately, he never really gets back to PHP and mostly just rehashes a history of Unicode and a few elementary comments about declaring your character encoding in an HTML document.

I'm curious about two issues:

1. Interaction of PHP/MySQL (which default to Latin-1) and XHTML (which defaults to UTF-8 or UTF-16).

2. Handling of form data.

According to the PHP manual on Multi-Byte String functions and XML Parser functions

PHP is basically designed for ISO-8859-1...
The default source encoding used by PHP is ISO-8859-1.

The same is true for MySQL. According to
the manual [mysql.com]:

By default, MySQL uses the ISO-8859-1 (Latin1) character set with sorting according to Swedish/Finnish.

Of course, the XHTML spec [w3.org] is incompatible with this, in the sense that


Remember, however, that when the XML declaration is not included in a document, the document can only use the default character encodings UTF-8 or UTF-16.

Of course, the XML declaration will trigger quirks mode in IE, since it pushes the DOCTYPE to line 2, so my preference is usually to omit it. In theory that would mean that my documents must be in UTF-8 or UTF-16, while my database defaults to ISO 8859-1.

I wonder how many of you are serving up pages in ISO-8859-1 and how many in UTF-8. Is anyone using UTF-16? I assume anyone doing Asian languages must be using it, but they probably have an appropriate operating system.

I'm especially wondering with respect to user-input data from a form as I'm not really sure what happens there - if someone uses a word processor with a given encoding and pastes that text into a form in a page that specifies the encoding as UTF-8, won't it send this text as UTF-8, perhaps corrupting it? That's what I understand from a recent thread on WebmasterWorld [webmasterworld.com] and from Scott Reynen's article [randomchaos.com], which offers one solution. It's essentially the same solution that DrDoc settled on in this thread regarding use of Cyrillic characters [webmasterworld.com], that is, to convert everything to Unicode character entities before putting it in the database.
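In PHP, I imagine that entity-conversion step would look something like this (just a sketch, assuming the mbstring extension is available; the function name is my own invention):

```php
<?php
// One way to implement the "convert everything to Unicode entities" idea:
// turn every non-ASCII character into a &#NNNN; numeric entity before the
// text goes into the database. Requires the mbstring extension.
function entities_for_storage($text, $from_encoding = "UTF-8") {
    // Map everything above US-ASCII (0x80 and up) to numeric entities.
    $convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($text, $convmap, $from_encoding);
}

echo entities_for_storage("café");  // caf&#233;
```

The stored text is then pure ASCII, so the database's own charset no longer matters - at the cost of the file-size and readability problems below.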

Still it seems to me that the "convert everything to Unicode" approach poses a couple of problems.

One is file size and code readability as mentioned in Michael Glaesemann's comments on Jonathon Delacour's blog entry [weblog.delacour.net] on the subject, as well as other follow-up comments [weblog.delacour.net].

The other is that it raises the question - how do I know what I'm starting from? Do people test for encoding with iconv or the multi-byte functions? Otherwise it seems that you would have carefully encoded gibberish into UTF-8. That's wonderful - your gibberish will be perfectly preserved in your database and output exactly as you read it - as gibberish. Isn't that right?

Anyway, I'm not looking for a solution to a specific problem, just whatever comments/insights people have out there. How many people worry about it? How many people care?

Tom

jatar_k

7:44 pm on Dec 8, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Sorry ergophobe, I read this a few times and didn't really get around to responding.

I have no extensive wisdom on the subject really but ...

We are presently launching our software in China and it is PHP/Oracle. I can't really respond to the MySQL portion of the question, but this system will be separate from our North American English system, so it will be customized for Chinese.

For Oracle we changed the character set and all was well.

For PHP we have only had problems with form input testing, so far. The two functions that like to cause problems are str_replace and stripslashes. This problem hasn't been fully worked out as we are still deploying the system, but that is the only real problem.

Deploying all of them within the same system would start causing more problems though. I would think that form data testing would have to be customized to each charset. Essentially we need to understand the data we are error checking and the various charsets will be different.

We also found that the HTML pages had to be saved as UTF-8 on our English systems to keep them working, regardless of charsets or methods of entering the chars.

ergophobe

10:39 pm on Dec 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Interesting.

Have you tried using the Multi-Byte string functions (which have no equivalent to str_replace, but do have mb_ereg_replace)?
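I mean something like this (an untested sketch; assumes mbstring is compiled in):

```php
<?php
// Sketch: doing a replacement in a multi-byte-safe way with the mbstring
// regex functions instead of str_replace.
mb_regex_encoding("UTF-8");      // tell the regex engine what encoding the text is in
mb_internal_encoding("UTF-8");

$input = "foo bar foo";
echo mb_ereg_replace("foo", "baz", $input);  // baz bar baz
```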

Tom

jatar_k

11:42 pm on Dec 8, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Actually, I think the lads over there removed it altogether for now. I have not completely spec'ed out possible replacements for it yet. It is really difficult when one Cantonese non-programmer types it, I test it, and then he reads it back to me.

It seems to be very difficult to debug output you can't read or recognize. The problem is that when you get it wrong, it often changes the right words into wrong words - but still words.

It's all Chinese to me. ;)

ergophobe

4:38 pm on Dec 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It seems to me that the crux of the problem is, as you say, form input. Once I have the text, I can always try to use one of the functions in PHP to sniff the encoding, but by the time PHP gets it, it's almost certainly encoded as whatever the web page is encoded in (as you said earlier). It seems like it's the step of getting the data from the user's word processor to PHP (or my database or word processor or whatever) that makes everything complicated and iffy.

The only problem I've had really is with punctuation marks and special characters, but it's annoying. For example, sometimes this <<style of quote mark>> will get converted to +this* and things like that. I have only a few characters that are not in ISO-8859-1 (or at least that don't map the same in multiple character sets) that crop up in my stuff. It's kind of weird. The same person will paste text into the database and sometimes it will be all weird and sometimes it looks fine. I guess it probably depends on the encoding that a particular piece of software uses.

Of course, if I do something myself, I can use unicode decimal entities, which should work pretty well in browsers at least. The problem is, as I guess you've seen, that if you can't control the encoding being used to send you stuff, it's pretty hard to know how to convert to entities. Like I said, you just end up with cross-platform garbage instead of encoding-specific garbage!

As for the Chinese, everything I "know" how to say is met with gales of laughter from my in-laws. I take satisfaction in knowing that I can brighten their day without being able to actually communicate with the older generation :-)

Tom

davidpbrown

6:34 pm on Dec 9, 2003 (gmt 0)

10+ Year Member



control the encoding being used to send you stuff

I think the accept-charset [w3.org] attribute of the <form> provides this control. As I understand it, the form doesn't have to use the page's charset.
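For example (a sketch - the action URL and field name are made up, and how well browsers of this era honour the attribute varies):

```html
<!-- Ask the browser to submit this form's data as UTF-8, regardless of
     the charset of the page the form sits in. -->
<form action="save.php" method="post" accept-charset="UTF-8">
  <textarea name="bio"></textarea>
  <input type="submit" />
</form>
```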

Of course, the XML declaration will trigger quirks mode in IE, since it pushes the DOCTYPE to line 2, so my preference is usually to omit it. In theory that would mean that my documents must be in UTF-8 or UTF-16, while my database defaults to ISO 8859-1.

Could you not put the charset in the document's header instead? Indeed, this way seems almost preferred..
From end of 3.1 [w3.org ]
"When it is difficult to specify an explicit charset parameter through a higher-level protocol, authors SHOULD include the XML declaration.."

Although I take your point re the XHTML spec, I'd always taken it that XHTML isn't treated as XHTML unless the application/xhtml+xml mime header [xml.com] is present.

I do see [w3.org ] suggesting docs with text/html are still XHTML, but maybe only in a theoretical sense. Maybe when HTML browsers are involved, the documents are for all intents and purposes HTML, even if they do carry an odd-looking header.. therein lies a clear distinction for browsers to use?

Certainly I've had no trouble serving bland XHTML, or even XHTML 1.1 with MIME headers, as ISO 8859-1, though I can't suggest where in the XHTML specs this is allowed. Reading the references you suggest has me confused as to when (if ever) the document is not XHTML and is therefore being interpreted as HTML, since it works.

For the record, as it might help, I use this for most of my documents, although increasingly with utf-8. Currently I've found only Opera to effectively handle the mime type + XHTML <?xml-stylesheet>
<?php
// Serve XHTML 1.1 as application/xhtml+xml to browsers that say they
// accept it; fall back to XHTML 1.0 Strict as text/html for the rest.
if (stristr($_SERVER["HTTP_ACCEPT"], "application/xhtml+xml")) {
    $x = "XML";
    header("Content-Type: application/xhtml+xml; charset=iso-8859-1");
    echo '<?xml version="1.0" encoding="iso-8859-1"?>';
    echo "\n";
    echo '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">';
    echo "\n";
} else {
    $x = "normal";
    header("Content-Type: text/html; charset=iso-8859-1");
    echo '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">';
    echo "\n";
}
?>

My impression is that, if it's not XHTML 1.1, then there is no clear benefit to including the application/xhtml+xml MIME header.

I'm going to post while this is ~clear as I'm re-reading your post and getting confused again.. :)

ergophobe

7:24 pm on Dec 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



David, Great info!


I'm going to post while this is ~clear as I'm re-reading your post and getting confused again.. :)

Sorry to be so confusing. I'm mixing a few different issues and I am confused about how they really interrelate, so it's a bit hard for me to be more clear.

I didn't know about the accept-charset attribute, but I'm perhaps even more confused now.

If I set the attribute to
accept-charset="ISO-8859-1, UTF-8"

What happens when someone pastes UTF-8 text into a form and I want to put it in MySQL (which defaults to ISO-8859-1)? In other words, how long will the UTF-8 encoding be preserved?

If inserted straight into a MySQL DB and pulled back out and put on a page that sets the encoding to UTF-8, what will it look like? In other words will it be ISO-8859-1 encoded or still UTF-8 that is preserved, but just looks funny when viewed with a MySQL client?

Alternatively, what happens if someone pastes in text that is in EUC-JP or some other charset not in your accept-charset list?


Certainly I've had no trouble serving bland XHTML, or even XHTML 1.1 with MIME headers, as ISO 8859-1,

Sure, but you probably have control over your character encoding and are accepting form data from users who are using characters that are entirely within the ISO-8859-1 charset (if not US-ASCII).

I'm thinking about cases where you are building a collaborative resource (in this particular case, a biographical database that could have users from many languages). User 1 puts info into a form using EUC-JP and user 2 puts info in using ISO-8859-7. This info is pulled from a DB (whose storage system defaults to ISO-8859-1) and I serve it up on a page whose declared charset is UTF-8. So when user 1 looks at his EUC-JP text, doesn't he just see garbage?

Tom

davidpbrown

7:41 pm on Dec 9, 2003 (gmt 0)

10+ Year Member



Ah, I didn't think your post confusing.. quite the contrary just overloaded my brain with new info. :)

What happens when someone pastes UTF-8 text into a form and I want to put it in MySQL (which defaults to ISO-8859-1)? In other words, how long will the UTF-8 encoding be preserved?

If you want ISO-8859-1 then wouldn't it be better to suggest accept-charset="ISO-8859-1"?
Then the user, I expect, would see his UTF-8 get mashed through the filter that is ISO-8859-1.. the ISO-8859-1 character repertoire being a subset of Unicode's.

Alternatively, what happens if someone pastes in text that is in EUC-JP or some other charset not in your accept-charset list?

#4 in another thread [webmasterworld.com] has more of my understanding about what happens when you dump encodings into things which accept others.
(Post #2 has a link to MySQL Unicode support [mysql.com])

Maybe not relevant to you, but the question I don't have an answer to is how POST declares which of the encodings has been used when they are distinct but potentially confusable.. no ideas on that, but I would like to know..

I don't know how the Japanese, for instance, switch easily between encodings.. there may be more on unicode.org re how similar other character sets are. It may be that Japanese is naturally 16-bit and can therefore be another subset of Unicode in the same way as ASCII.

brain over..

davidpbrown

8:00 pm on Dec 9, 2003 (gmt 0)

10+ Year Member



Reading some of this may also help.. (I haven't)
Problems on Interoperativity between Unicode and CJK Local Encodings [debian.or.jp]
To use Unicode for daily life, there are three major problems for Japanese users. One is Han Unification. The second is mapping problem. The third is width problem...

I should think if MySQL can handle Unicode it could be configured for other 16-bit language encodings.. if that is what Japanese encodings are.. I'm guessing.

ergophobe

8:17 pm on Dec 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the link to the thread and the editor - I've spent a fair bit of time at Alan Wood's site already trying to straighten this stuff out in my head. I still have to follow up on the Unicode/MySQL link, but I thought Unicode support had to be compiled in and may not be available if you aren't compiling your own MySQL. Anyway, that's another topic.


If you want ISO-8859-1 then wouldn't it be better to suggest accept-charset="ISO-8859-1"?
Then the user, I expect would see his UTF-8 get mashed through the filter that is ISO-8859-1.. ISO-8859-1 being a subset of UTF-8.

Actually, I originally wrote my post with an ISO-8859-1 to UTF conversion example, and realized that should never be a problem and reversed it. The point is that I don't necessarily want ISO-8859-1, but certain pieces of software in the chain may.

Let's say the UTF-8 text includes characters not in the ISO-8859-1 set and it has been "mashed through the filter that is ISO-8859-1." I then use a hexadecimal editor to look at the actual numeric representation of the text (bits and bytes, not how those get mapped to characters by the encoding). In binary (or hex) will it look just like it does before that event?

From what you say, I gather that it will, and that one half of my question is answered. I get something that's in EUC-JP, dump it into the DB. It looks like gibberish, but when I pull it out again and serve it as EUC-JP to someone with Japanese fonts installed, it looks okay. If it goes on a page that's declared as UTF-8, I will have to check for the encoding and, if necessary, convert from EUC-JP to UTF-8 and hope for the best. If I don't do that and just serve up the EUC-JP as UTF-8, it can look like gibberish or, as I understand, look fine but have a different meaning since some codes are valid in both encodings, but map to different characters/words. If I serve it up as ISO-8859-1, I mostly serve a page of boxes and weird glyphs, unless the client end overrides my setting and figures out that it has Japanese on there, but of course I don't want to count on that.
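The conversion step I have in mind would be something like this (a sketch; whether iconv actually knows EUC-JP depends on the iconv library PHP was built against, and the function name is my own):

```php
<?php
// Sketch: convert text we believe to be EUC-JP into UTF-8 before putting
// it on a page declared as UTF-8.
function euc_jp_to_utf8($text) {
    $converted = iconv("EUC-JP", "UTF-8", $text);
    // iconv() returns false if the input contains byte sequences that are
    // not valid EUC-JP - in that case keep the original rather than lose it.
    return ($converted === false) ? $text : $converted;
}

// Round trip as a sanity check: UTF-8 -> EUC-JP -> UTF-8.
$utf8 = "日本語";
echo euc_jp_to_utf8(iconv("UTF-8", "EUC-JP", $utf8));  // 日本語
```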


Maybe not relevant to you, but the question I don't have an answer to is how POST works to declare which of the encodings have been used if they are distinct but potentially confusing..

Definitely relevant - that's the other half of my question :-) Perhaps I should just try some testing....

Incidentally, and a little OT, a friend who does a blog in Japanese has defaulted to the failsafe option - he builds his page as a single Photoshop image and saves it as one big JPEG.

Tom

ergophobe

8:20 pm on Dec 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks also for the CJK reference. I'm just using Japanese as an example because

1. it's one of the more difficult cases - if you can solve Chinese and Japanese, you know your stuff!

2. it seems to be the one that people know about because you really can't just let these issues slide.

In fact, I'm mostly concerned with European characters not in the ISO-8859-1 set.

As for MySQL, I appreciate that heads-up as well. I see that the support is *way* better in version 4.1. I'm still running 3.x something, since it still seems to be the most commonly available version. I know you can compile support for Unicode into it, but that also is not that common, and I usually don't run my own server, so I can't count on it.

Tom

davidpbrown

8:37 pm on Dec 9, 2003 (gmt 0)

10+ Year Member



Let's say the UTF-8 text includes characters not in the ISO-8859-1 set and it has been "mashed through the filter that is ISO-8859-1." I then use a hexadecimal editor to look at the actual numeric representation of the text (bits and bytes, not how those get mapped to characters by the encoding). In binary (or hex) will it look just like it does before that event?

From what you say, I gather that it will..

I'm not sure about that.

If you can mash data up and be rough with the likes of EUC-JP, that would be a great help, I guess. It may be that MySQL etc. are robust, but my take on it is that the information can easily be compromised.. certainly, in my own simple way playing on Win98, editors often replace characters they don't understand with ? or similar, but I suppose there's no need for that and maybe software can retain the information.

I've never been able to reverse a confused text.. it would be interesting to know it's possible. Certainly I've spotted wrongly encoded or wrongly declared pages/emails and corrected the encoding interpretation, but that may be different from manipulating the text itself.

ergophobe

12:41 am on Dec 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Win98 editors often replace characters they don't understand with ?

True enough, but is that because the underlying data has been replaced with an ASCII question mark, or because the editor is not smart enough to render data in that encoding? Normally it's the latter, but perhaps once you save a file, it's lost.

Next time I come across this, I'll try to do some tests and report back.

Tom

scott reynen

11:08 pm on Jan 7, 2004 (gmt 0)

10+ Year Member



I realize this topic's a bit old, but I just noticed it and thought I might help.

You're certainly right that Unicode character entities take up more space. I didn't mean to suggest the character entities would be ideal for database storage, but rather for output. I use UTF-8 encoding for database storage, which makes the text as small as it can possibly get. The only problem with this is that I need to remember which tables/records are UTF-8 and which are ASCII, so I can display them properly when I output the text.

Is anyone using UTF-16? I assume anyone doing Asian languages must be using it, but they probably have an appropriate operating system.

UTF-8 and UTF-16 can both represent any language, even the more complex Asian languages. I use UTF-8 to deal with Japanese text.

how do I know what I'm starting from?

You must have control over the input form(s), and specify an encoding there. If you don't know what encodings you're being sent, it's not going to work. There's no way to detect this, because a given set of bytes could translate to different and valid characters in different encodings.
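For example, the same two bytes are perfectly valid in more than one encoding, so no test can tell you which was meant (a sketch using mbstring):

```php
<?php
// The two bytes 0xC3 0xA9 are "é" if read as UTF-8, but "Ã©" if read as
// ISO-8859-1 - and both readings are valid, so no validity check can
// tell you which one the sender intended.
$bytes = "\xC3\xA9";
var_dump(mb_check_encoding($bytes, "UTF-8"));       // bool(true)
var_dump(mb_check_encoding($bytes, "ISO-8859-1"));  // bool(true)
```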

ergophobe

12:32 am on Jan 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Scott,

I didn't realize you were a reader here. Can't remember how I found your article.

I didn't mean to suggest the character entities would be ideal for database storage, but rather for output.

Sorry, I meant that was the solution proposed in the WebmasterWorld thread I mentioned. I should have said the solution was "related" or something rather than "essentially the same".

UTF-8 and UTF-16 can both represent any language

Thanks. I should have known better.

You must have control over the input form(s), and specify an encoding

That's the real sticking point. One can, of course, set the encoding the forms *expect*, but one can't be certain that every user will be pasting in text using that same character encoding. This only leads to the odd character here and there in my case, but it must be a nightmare in Japanese when someone pastes text from a word processor using one encoding into a form that expects text in a different encoding.

Thanks for the input.

[edited by: ergophobe at 3:58 pm (utc) on Sep. 24, 2004]

scott reynen

11:42 pm on Jan 28, 2004 (gmt 0)

10+ Year Member



I didn't realize you were a reader here.

I've dropped by from time to time when it comes up in search results, but I only discovered this thread because of my referrer logs.

One can, of course, set the encoding the forms *expect*, but one can't be certain that every user will be pasting text in using that same character encoding.

When you set a character encoding for an HTML page, that tells the browser to encode *all* input characters that way *before* sending them on to the server. If you set an input page to UTF-8 encoding, the text will be UTF-8 encoded when it gets to the server. If it's not, something is broken.
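If you want to verify that on the server anyway (a sketch; the field name is made up, and it assumes mbstring):

```php
<?php
// Sketch: verify that what arrived really is valid UTF-8 before trusting
// it, since "something is broken" does happen in the wild.
$bio = isset($_POST["bio"]) ? $_POST["bio"] : "";

if (!mb_check_encoding($bio, "UTF-8")) {
    // Not valid UTF-8 - reject it (or attempt a best-effort conversion).
    die("Form data was not valid UTF-8.");
}
```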

ergophobe

5:12 pm on Jan 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Check.

I guess one problem I have is client side - someone pastes text into an input area and can see that it is a bit screwy, but just leaves it as such. I suppose there's nothing I could ever do about that except send as UTF-8 and hope for the best.

Tom