Home / Forums Index / Code, Content, and Presentation / PHP Server Side Scripting
Forum Library, Charter, Moderators: coopster & jatar k

PHP Server Side Scripting Forum

    
Unicode Support
About php and UTF-8 encoded strings
gaouzief




msg:1281117
 5:43 pm on Apr 18, 2004 (gmt 0)

Hello,

I'm about to develop a little web app that should later be localised into other languages.

While doing some research on using Unicode, I was told that PHP (which I usually use for web stuff) does not support Unicode natively, but I still see some functions related to UTF-8 strings:

utf8_decode() and utf8_encode()

Now I'm not an expert on charsets, and I have some "straight" questions for anyone with more experience with this issue:

- Will I be able to manipulate UTF-8 encoded strings with PHP's string manipulation functions?

- Is it possible to insert UTF-8 strings into a database and read them back using PHP (i.e. without transforming the string)?

And generally, is there anything I should be cautious about when using PHP in a localised environment?

Thank you in advance

 

gethan




msg:1281118
 6:16 pm on Apr 18, 2004 (gmt 0)

I did some research into this recently and this is where I believe things stand. (so much confusing data out there that I'm not entirely sure).

As of PHP 4.x

- You can not program in UTF8 - therefore no localised variable names - (I think) this is what is meant by PHP not supporting UTF-8 natively.

- (I think that) this also applies to pattern matching, string manipulation etc - basically processing UTF8 strings! - not entirely sure here.

- You can open files of UTF-8 content and pipe them to browsers; the same applies to databases.
Caveat - the database must support UTF-8. MySQL 4.1 does; versions prior to that seem to convert to HTML-encoded characters.

- utf8_decode/encode: utf8_decode() "Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1" and utf8_encode() does the reverse - so very rudimentary support for converting between ISO-8859-1 and UTF-8 strings - not really useful IMO, might come in handy when converting an old site to UTF-8.

So you can in most situations work around this for a content site, but something dynamic, and full featured is going to be difficult.
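
As a rough illustration of what that pair of functions does, here is a minimal sketch. It uses mb_convert_encoding() instead of utf8_encode()/utf8_decode() themselves (those two are deprecated as of PHP 8.2) and assumes the mbstring extension is loaded. The sample string and variable names are made up for the example.

```php
<?php
// "café" in ISO-8859-1: the é is the single byte 0xE9.
$latin1 = "caf\xE9";

// Same job as utf8_encode(): ISO-8859-1 bytes in, UTF-8 bytes out.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// Same job as utf8_decode(): UTF-8 bytes in, ISO-8859-1 bytes out.
$roundTrip = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($latin1), "\n";    // 636166e9
echo bin2hex($utf8), "\n";      // 636166c3a9
echo bin2hex($roundTrip), "\n"; // 636166e9
```

The single byte E9 becomes the two-byte sequence C3 A9 in UTF-8, and the round trip restores the original byte.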

ergophobe




msg:1281119
 8:49 pm on Apr 18, 2004 (gmt 0)

You might find this thread useful

[webmasterworld.com...]

Also, I have since played around with the multi-byte string functions and find that you can do a lot with them, so it's quite reasonable to have one encoding for PHP code and others for output.
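
To make the difference concrete, here is a small sketch comparing the byte-oriented string functions with their multi-byte counterparts on a UTF-8 string. It assumes the mbstring extension is loaded; the sample word is arbitrary and is written with byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// "naïve" encoded as UTF-8: the ï takes two bytes (C3 AF).
$word = "na\xC3\xAFve";

// Byte-oriented functions count and cut bytes.
echo strlen($word), "\n";                 // 6
$brokenCut = substr($word, 0, 3);         // cuts the ï in half

// Multi-byte functions count and cut characters.
echo mb_strlen($word, 'UTF-8'), "\n";     // 5
$cleanCut = mb_substr($word, 0, 3, 'UTF-8');

// Case conversion also needs to know the encoding.
$upper = mb_strtoupper($word, 'UTF-8');

var_dump(mb_check_encoding($brokenCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($cleanCut, 'UTF-8'));  // bool(true)
```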

One strange quirk, if you put ISO-8859-anything in the mb_detect_order() (or leave it as default) it will always try to detect as ISO-8859-whatever even if
- the doc is encoded as UTF-8
- UTF-8 comes first in the mb_detect_order() list.

Since ISO-8859-1 assigns a character to every possible byte value, any UTF-8 byte sequence is also a valid ISO-8859-1 string. So if you encode as UTF-8 and then try to detect the encoding, it will always detect as ISO-8859-1 unless you specify a detect order that doesn't include ISO-8859-1 at all.
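
One way to see why a single-byte encoding like ISO-8859-1 is so hard to rule out is sketched below. This is only an illustration with made-up sample strings, assuming mbstring is loaded: validation can reject a string as UTF-8, but reading UTF-8 bytes as ISO-8859-1 never fails, it just produces the wrong characters.

```php
<?php
$utf8   = "caf\xC3\xA9"; // "café" as UTF-8 (é = C3 A9)
$latin1 = "caf\xE9";     // "café" as ISO-8859-1 (é = E9)

// Validation only helps in one direction.
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// Reading the UTF-8 bytes as ISO-8859-1 raises no error;
// each byte simply becomes its own character.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo $misread, "\n"; // cafÃ©
```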

ergophobe




msg:1281120
 8:53 pm on Apr 18, 2004 (gmt 0)

By the way, have you read the current manual page for multi-byte strings [us3.php.net]? It's been significantly updated in the past 6 months or so and gives some pretty good guidance on the issue.

Tom

gaouzief




msg:1281121
 12:08 pm on Apr 19, 2004 (gmt 0)

Thank you all for the highly useful links that answered most of my questions.

I still have a couple though:

- Is there a reliable way of detecting the charset of a string with PHP and on the client side?

- How safe is it to convert from UTF-8 to another charset and vice versa using mbstring functions?

And finally (a little off-topic), I read that JavaScript is Unicode friendly. Are there any ways to pre-validate and maybe convert inputs to the correct charset on the client side using JavaScript?

ergophobe




msg:1281122
 10:03 pm on Apr 19, 2004 (gmt 0)

1. I played around a bit and PHP is reliable at detecting an encoding within limits. Those limits are

- it will generally detect as ISO-8859-* unless you take measures to prevent it. That's rarely a problem, because ASCII is a perfect subset of ISO-8859-1 which is a perfect subset of Unicode.

- it will not necessarily tell you what encoding the text was in before the user pasted it into the textarea. In that case, it should get translated to the encoding for the page (or element, if different). The text will not, however, look right unless redisplayed in the original encoding, and that is probably lost to you at that point.

2. In what testing I have done, the mbstring functions worked without problem, though I was only using euro languages and was mostly concerned with the incompatibilities between Windows-1252 and Unicode.

3. I don't know about javascript conversion. As I mentioned in passing in #1, there will be an automatic (and generally unwanted) conversion if a user with an application that defaults to Windows-1252 pastes text into your web page that is in UTF-8. Of course, for overlapping code points (most of them actually in this particular case), there is no problem. For non-overlapping code points, you'll get some weird results.

Whether or not it's possible to use JS to capture the paste and detect the encoding being used is beyond me. It sure would be interesting to know.

If the user is *not* pasting from another application, there should be no problem. Everything should work in the encoding that is set for the web page AFAIK (again, I have only used a few western european-based fonts, so it's not exactly the toughest test case).

Tom


gaouzief




msg:1281123
 10:56 am on Apr 20, 2004 (gmt 0)

well this is exactly the issue i'm facing

3. I don't know about javascript conversion. As I mentioned in passing in #1, there will be an automatic (and generally unwanted) conversion if a user with an application that defaults to Windows-1252 pastes text into your web page that is in UTF-8. Of course, for overlapping code points (most of them actually in this particular case), there is no problem. For non-overlapping code points, you'll get some weird results.

My idea was to convert whatever charset is used to UTF-8 for processing and storage, but that presupposes reliable detection and conversion mechanisms. If, as you seem to say, those mechanisms are not reliable in PHP, I'm left with two solutions:

- either look for some other processing language (Python, for example)
- or trade off standards for usability, i.e. use whatever charset the users use as the processing and storage charset. Again, this raises the question of application reusability and localisation, as you would have to alter code for any new language, etc.

Hmm, I think I will dig further into JavaScript's charset detection and conversion abilities.

Just a few thoughts

ergophobe




msg:1281124
 5:56 pm on Apr 20, 2004 (gmt 0)


well this is exactly the issue i'm facing

Oh, on this particular topic I've looked into a fair bit. I was thinking you wanted Asian encodings and all that. That's beyond my knowledge. I also think I was unclear or actually wrong on a couple of points of detail. I looked through my notes on this and I think you can do everything you need.

This is what I tried and what seems to work for me...

Basically, I have had problems with the character encoding after converting a Microsoft Access database to MySQL.

The problem with the data.

Everything usually looks fine as ISO-8859-1, but sometimes it's weird and it won't validate (non-SGML characters in the 128-159 range). These code points are defined in Windows-1252, so in my case the problem results from user input that was pasted into an MS Access form by a variety of users with different native character encodings on their machines, some of them using Windows-1252. Any characters in this range need to be converted, or the document as a whole must be presented in Windows-1252. Things such as em dashes fall in this range, so they will not be compatible. Whereas every ISO-8859-1 character keeps the same code point number in Unicode, this is not true for the Windows-1252 characters with code points between 128 and 159 (80-9F in hex).
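
The 128-159 problem can be shown with a single byte. The following is a minimal sketch, assuming mbstring is loaded and using a made-up sample string: the same byte 0x92 is converted once as Windows-1252 and once as ISO-8859-1.

```php
<?php
// "I’m" as saved by a Windows-1252 application:
// the curly apostrophe is the single byte 0x92.
$cp1252 = "I\x92m";

// Interpreted as Windows-1252, 0x92 is U+2019 (right single quotation mark).
$asCp1252 = mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252');

// Interpreted as ISO-8859-1, 0x92 is U+0092, an invisible control character.
$asLatin1 = mb_convert_encoding($cp1252, 'UTF-8', 'ISO-8859-1');

echo bin2hex($asCp1252), "\n"; // 49e280996d
echo bin2hex($asLatin1), "\n"; // 49c2926d
```

Only the first conversion yields a printable apostrophe, which is why text with these bytes looks wrong or fails validation when the page is declared as ISO-8859-1.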

* Far and away the best discussion of the problem in general and particularly as it relates to HTML presentation is offered on Jukka Korpela's page On the use of some MS Windows characters in HTML [cs.tut.fi].

* See also the Windows-1252 code point table near the bottom of the Wikipedia article on ISO-8859-1 [en.wikipedia.org].

* And Chris Wendt's comments [lists.w3.org] from way back in 1998.

* The quick converter from codeside seems to be an easy way to convert Windows-1252 to numeric entities [code.cside.com].
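
The same kind of conversion to numeric entities can be sketched in PHP itself. This is an assumption-laden example, not a description of how the linked converter works: the function name cp1252_to_entities() is hypothetical, and it relies on mbstring's mb_convert_encoding() and mb_encode_numericentity().

```php
<?php
// Hypothetical helper: Windows-1252 bytes in, ASCII with numeric entities out.
function cp1252_to_entities(string $bytes): string
{
    // Step 1: reinterpret the Windows-1252 bytes as UTF-8.
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

    // Step 2: replace every code point from U+0080 upwards with a
    // decimal reference. Map format: [start, end, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

// 0x92 is a curly apostrophe and 0x97 an em dash in Windows-1252.
echo cp1252_to_entities("I\x92m \x97 ok"), "\n"; // I&#8217;m &#8212; ok
```

Note that this leaves ASCII characters such as & and < untouched, so it is not a substitute for htmlspecialchars().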

The problem with PHP

Beginning with PHP 4.3, multi-byte functions are enabled by default. So why not just test for the encoding, like so:

mb_detect_order("ASCII, UTF-8, Windows-1252, ISO-8859-1");
mb_detect_encoding($string);

Why not? Because that will always return "ISO-8859-1"


For ISO-8859-*, mbstring always detects as ISO-8859-*.

You would think that since the Windows-1252 encoding was valid, it would stop there, but it doesn't. It will always return ISO-8859-* if it is in the detect order list. Stupid, but true. Fortunately, Windows-1252 matches ISO-8859-1 everywhere outside the 128-159 range, so ISO-8859-1 can be dropped from the list without losing anything. If instead we have

mb_detect_order("UTF-8, Windows-1252");

it will return UTF-8 for a valid ASCII or UTF-8 string, but Windows-1252 for a string containing bytes that are not valid UTF-8. From there, we can simply convert like so.

$utf_8_string = mb_convert_encoding("This is a ‘windows-1252’ string", "UTF-8", "Windows-1252");

So putting it all together, this *seems* to work, though I haven't tested very much.

mb_detect_order("utf-8, windows-1252");
if (mb_detect_encoding($string) == "Windows-1252") {
    // The encoding names must be quoted strings, target first, source second.
    $string = mb_convert_encoding($string, "UTF-8", "Windows-1252");
}

I think this solves my particular problem I was having with character sets. Of course, there are limitations. You would have to check for each available character set and convert. I haven't tried it, but I imagine you could generalize it somewhat. You would need to carefully construct your detect order list and then you could do something like

// Detect once and reuse the result; mb_detect_encoding() returns false on failure.
$detected = mb_detect_encoding($string);
if ($detected && $detected != "UTF-8") {
    $string = mb_convert_encoding($string, "UTF-8", $detected);
}

I would want to test that extensively and be very careful with the detect order list. If you are serving up the page in UTF-8, you would want to check only for character encodings that do not map to UTF-8 and make sure to encode those. Above all, you can't have anything from the ISO-8859 family in your list.
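
A slightly different way to get the same result is sketched below. Note the swap: instead of relying on mb_detect_encoding() and the detect order, it validates the input as UTF-8 with mb_check_encoding() and treats everything else as Windows-1252. The helper name to_utf8() is hypothetical, and the Windows-1252 fallback is an assumption that only makes sense when the data is known to come from Western European Windows sources.

```php
<?php
/**
 * Hypothetical helper: return $input as UTF-8, treating anything that is
 * not already valid UTF-8 as Windows-1252.
 */
function to_utf8(string $input): string
{
    // Plain ASCII and well-formed UTF-8 pass through untouched.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }

    // Assumed fallback: everything else is taken to be Windows-1252,
    // which also covers ISO-8859-1 text outside the 128-159 range.
    return mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
}

echo bin2hex(to_utf8("I\x92m")), "\n";       // 49e280996d
echo bin2hex(to_utf8("caf\xC3\xA9")), "\n";  // 636166c3a9 (unchanged)
```

The remaining limitation is the one discussed in this thread: a Windows-1252 string whose bytes happen to form valid UTF-8 will be passed through as UTF-8, so this is a heuristic and not a guarantee.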

Big Caveat

The reason I have Windows-1252 characters in my actual documents in MySQL, however, is that these were imported straight from Access to MySQL without going through a web page. I am 99% certain that it will not function this way through a form (did some tests, but can't find my notes). This is not a limitation of PHP, but inherent in the way that the web and character encoding work. Switching to another server-side scripting language will not help you.

First, the character encoding side.

Question: If I have data that is generally being encoded as UTF-8, and the sample includes only a-zA-Z, what encoding will be detected for that sample? Answer: ASCII, not UTF-8, and that is going to be true whatever you use for detection - Perl, PHP, Python or whatever. If I am detecting, I want to differentiate between ASCII and UTF-8, but since the code points for a-zA-Z are completely overlapping in ASCII and UTF-8, there is no way to distinguish them based on my sample. By preference, you want your detection routine to be conservative (otherwise you have no way to detect ASCII vs UTF-8) and therefore detect this as ASCII. Then in your programming logic, you can make decisions about whether that's okay.

If you want to detect as UTF-8 everything that could be represented as UTF-8 (e.g. ASCII, ISO-8859-1), then you can not give your detector (PHP or otherwise) the choice of picking ASCII or, if it does, you must tell it to treat it as UTF-8.
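
The overlap is easy to confirm with a short sketch (sample string made up, mbstring assumed): a string containing only a-zA-Z has exactly the same bytes in all of these encodings, so no detector can tell them apart from that sample alone.

```php
<?php
$sample = "Hello";

foreach (['ASCII', 'UTF-8', 'ISO-8859-1', 'Windows-1252'] as $encoding) {
    // Converting from any of these to UTF-8 leaves the bytes untouched.
    $converted = mb_convert_encoding($sample, 'UTF-8', $encoding);
    $verdict = ($converted === $sample) ? 'identical' : 'different';
    printf("%-12s %s\n", $encoding, $verdict);
}
```

Every line of output reports "identical", which is why the choice between ASCII and UTF-8 for such a string has to be made by your own program logic.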

Now when it comes to distinguishing the Unicode/ASCII/ISO-8859-* family from Windows-1252, PHP will be flawless if you understand one thing. As in the previous example, if the Windows-1252 text includes only a-zA-Z, it will be detected as ASCII (or ISO-8859-1, or UTF-8) because these code points are fully overlapping and will never present a problem. It will only detect as Windows-1252 if the user has used a character that is not valid in your other set of choices. Certain punctuation marks in the Windows-1252 encoding, for example, are stored as single bytes in the 128-159 range, which are not valid on their own in UTF-8. If the user uses one of these, PHP can and will detect the encoding as Windows-1252 and the mbstring functions can do the proper conversion.

Web side

The problem comes in when you have a web page with the encoding set to UTF-8. Someone working on a system that has a word processor using Windows-1252 pastes in text with a curly apostrophe (‘). When pasted, this is automatically translated to UTF-8, so 'I'm' becomes 'I=m'. The user can see this happen, but may not notice and may not do anything about it. If this form is sent, it will become a completely valid UTF-8 sequence, namely 'I=m', and I'm reasonably certain that the character will become a Unicode = and not a Unicode or ASCII single quote (I'll have to test this again though). It's too late to do anything about it with PHP as far as I know.

Whether or not you can catch this with JS I don't know, but I would really like to see what you find.

jatar_k




msg:1281125
 7:03 pm on Apr 20, 2004 (gmt 0)

very nice post ergophobe, added to library.

ergophobe




msg:1281126
 7:50 pm on Apr 20, 2004 (gmt 0)

Thanks, when the whole thing came up in preview I thought "Wow! That's so long nobody will ever read it!"

I hope it saves some people some time on these issues. They seem to be coming up a lot lately. I guess people are starting to take the World in WWW seriously.

Tom

coopster




msg:1281127
 8:56 pm on Apr 21, 2004 (gmt 0)

Thanks for adding this to the Library, jatar_k. And I would like to echo your accolades for ergophobe's contribution to this forum, particularly the enlightening insight and research regarding Unicode support. I've watched and read the struggles and successes over the conversion of the MS Access to MySQL database and have learned from the sidelines. Cheers to you, ergophobe.

I've collected some of the other threads here regarding Unicode support and put them in a list for easy reference...

Related Links:
UTF-8, ISO-8859-1, PHP and XHTML [webmasterworld.com]
Big5 to Unicode [webmasterworld.com]
Need character entity for Trademark symbol [webmasterworld.com]
XML XSLT Sablotron Strangeness... [webmasterworld.com]
Writing English [webmasterworld.com]
Unicode for multi-lingual website [webmasterworld.com]
Handling Character Encodings [webmasterworld.com]

[edited by: ergophobe at 6:58 pm (utc) on Nov. 10, 2004]
[edit reason] added url [/edit]
