utf-8, php and mysql

It would be useful to have a thread to exchange and discuss experiences.

         

richard

5:44 pm on Feb 24, 2003 (gmt 0)




We're constructing a multi-lingual site using php, mysql and utf-8.

In our html application, up to 5 different languages could be displayed on the same page at the same time.
We've chosen utf-8 because it seems to be the future, and it's already ripe for use now.
We haven't encountered any problems so far, as long as we stay in utf-8 the whole time. That means for inserts/updates as well as for output, using <meta http-equiv="content-type" content="text/html; charset=UTF-8">.
(We did have to do some utf8_encode()ing on text we imported from existing tables into the new structure.)
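
A minimal sketch of why that import step has to happen exactly once. It assumes the existing tables hold ISO-8859-1 (Latin-1) text, which the post does not actually state, and it uses iconv() rather than utf8_encode() only because the latter is deprecated in recent PHP versions. The variable names are made up for illustration.

```php
<?php
// Sketch only: assumes the legacy text is ISO-8859-1 (Latin-1).
$legacy = "M\xFCnchen";                            // "München" as Latin-1 bytes

// Converting once gives proper UTF-8: the single byte FC becomes C3 BC.
$once = iconv('ISO-8859-1', 'UTF-8', $legacy);     // "M\xC3\xBCnchen"

// Converting text that is already UTF-8 treats each of its bytes as a
// Latin-1 character, so the two bytes C3 BC turn into C3 83 C2 BC.
$twice = iconv('ISO-8859-1', 'UTF-8', $once);      // "M\xC3\x83\xC2\xBCnchen"
```

The second result is the familiar "MÃ¼nchen" garbage, which is what "staying in utf-8 the whole time" avoids: text that is already utf-8 must not go through the conversion again.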

Up to now we've only tested it with run of the mill characters like üÜöÖäÄß. We will be testing using Japanese, Chinese, Korean and Arabic characters in about 1-2 months.

We don't expect any problems with data storage (inserting/updating), but we are sure we will see odd results with data retrieval: order by and the like.

We have mysql 3.23.47 and, at least for the foreseeable future, aren't able to adjust its configuration (that depends on our server provider).

Perhaps there are others out there who'd like to share their experience(s) with us.

Brett_Tabke

5:40 pm on Feb 25, 2003 (gmt 0)




It's cake on the back end. The problems arise when you go interactive. Even though you specify utf8 to the browser, what comes back can often be _anything_. That data will need scrubbing.
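
One way to spot such data, as a rough sketch rather than a complete scrubber: check each submitted field for valid UTF-8 before it goes anywhere near the database. findNonUtf8Fields() is a hypothetical helper name, and $input stands in for whatever the form handler receives (for example $_POST).

```php
<?php
// Sketch only: findNonUtf8Fields() and $input are illustrative names.
function findNonUtf8Fields(array $input)
{
    $bad = array();
    foreach ($input as $name => $value) {
        // With the /u modifier, preg_match() fails on byte sequences that are
        // not valid UTF-8, so the empty pattern works as a validity check.
        if (is_string($value) && preg_match('//u', $value) !== 1) {
            $bad[] = $name;
        }
    }
    return $bad;
}

$input = array(
    'title' => "Gr\xC3\xBC\xC3\x9Fe",   // "Grüße" as UTF-8
    'body'  => "Gr\xFC\xDFe",           // the same word as Latin-1 bytes
);
$bad = findNonUtf8Fields($input);       // array('body')
```

What to do with the flagged fields (reject them, or convert them from a guessed charset) is a separate decision; the check only tells you that the browser did not send what the page asked for.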

richard

3:08 pm on Feb 26, 2003 (gmt 0)




Thanks, Brett_Tabke.
Here's the function we use to scrub all text before it is written into the database.
For an explanation of the native php functions, see the php manual [php.net].
 
function textIn( $text, $noNewLine = false, $specialChars = true )
{
    // strip backslash escapes from the submitted text
    $text = stripslashes($text);

    // $prePend holds any additional replacements
    $prePend = array(
        "<script>" => "", "</script>" => "",            // for security
        "\r\n" => "\n", "\n\r" => "\n", "\r" => "\n"    // only \n as newline
    );
    // $trans translates HTML entities into their decoded values
    $trans = get_html_translation_table(HTML_ENTITIES);
    $trans = array_flip($trans);
    $trans = array_merge($prePend, $trans);
    // replace every key found in $text with its value
    $text = strtr($text, $trans);

    // utf-8 encode the text (utf8_encode() assumes the input is ISO-8859-1)
    $text = utf8_encode($text);

    if ($noNewLine) {
        // optional, if new lines are not wanted:
        // collapse any run of whitespace, newlines included, into a single space
        $text = preg_replace("/\s+/", " ", $text);
    }
    if ($specialChars) {
        // optional, replace <>&" with their HTML entities
        // (the charset name has to be a quoted string)
        $text = htmlspecialchars($text, ENT_COMPAT, 'UTF-8');
        // Does anybody know exactly what specifying UTF-8 here does?
    }

    return $text;
}
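
On the question in that last comment: as far as current PHP versions go, the third argument tells htmlspecialchars() how to interpret the bytes it is given. The sketch below compares the same Latin-1 input under two declared charsets. The variable names are illustrative, and behaviour on the PHP 4 releases of the time may have differed.

```php
<?php
// Sketch only: shows how the declared charset changes htmlspecialchars()
// when the input is not valid in that charset.
$latin1 = "caf\xE9 <b>";   // "café <b>" as ISO-8859-1; a lone \xE9 is not valid UTF-8

// Declared as ISO-8859-1: every byte is a valid character, only < and > change.
$asLatin1 = htmlspecialchars($latin1, ENT_COMPAT, 'ISO-8859-1');   // "caf\xE9 &lt;b&gt;"

// Declared as UTF-8: the stray \xE9 is an invalid sequence, so the result
// is an empty string (unless ENT_IGNORE or ENT_SUBSTITUTE is set).
$asUtf8 = htmlspecialchars($latin1, ENT_COMPAT, 'UTF-8');          // ""

// Valid UTF-8 passes through with only the special characters escaped.
$valid = htmlspecialchars("caf\xC3\xA9 <b>", ENT_COMPAT, 'UTF-8'); // "caf\xC3\xA9 &lt;b&gt;"
```

So for text that really is utf-8 the argument changes nothing visible, since <, >, & and " are plain ASCII; it matters when the input is not what it claims to be.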

It's anticipated that much of the text will be copied and pasted from emails, rtf and doc files, etc. It will need scrubbing; how much, we will probably only know after the application is launched in its beta version.

We don't want to give too much away about the intent or nature of the project until it's launched. Suffice to say, it's inspired by the spirit of "copyleft" as used in open source [opensource.org] and the GNU General Public License [gnu.org]. You could think of it as a form of "living" archive.

Where we expect to have problems is with sorting, i.e. mysql's "order by". For example, how will mysql order utf-8 Chinese characters? Certainly not A-Z.
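
A rough way to preview one possible outcome: mysql 3.23 has no utf-8 character set of its own (that came with 4.1), so if the column ends up being compared as raw bytes, the order matches a plain byte-wise sort. The sketch below does that sort in PHP. It is an assumption about how the column is compared, not a reproduction of mysql's collation rules, and the word list is made up.

```php
<?php
// Sketch only: previews byte-order sorting of UTF-8 strings in PHP.
$words = array(
    "zebra",
    "\xC3\x84pfel",                // "Äpfel"
    "apple",
    "\xE4\xB8\xAD\xE6\x96\x87",    // "中文"
);

// strcmp() compares raw bytes; for UTF-8 that is the same as code point order.
usort($words, 'strcmp');
// Result: apple, zebra, Äpfel, 中文
```

Everything outside ASCII sorts after "z", and Chinese characters come out in Unicode code point order, which is stable but not an order a reader would necessarily expect.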

As yet we have not found anything about utf-8 and mysql. Perhaps somebody has a few tips or links on this, or on utf-8 in general.
Here's a list of utf-8 and charset links we've found useful:
[w3.org...]
[unicode.org...]
[hclrss.demon.co.uk...]
[lcweb.loc.gov...]
[zsigri.tripod.com...]
[geocities.com...]
[czyborra.com...]