struggling with chinese characters, in MySQL/PHP

Forum Moderators: phranque

Message Too Old, No Replies

struggling with chinese characters, in MySQL/PHP

httpwebwitch

2:47 am on Jul 13, 2010 (gmt 0)

This seems like such a newb question, but I'm really struggling with this.

I have the Traditional Chinese characters "漢字測試" in my MySQL database. The encoding is UTF-8, for the field, and for the database as a whole. I'm sending the proper Character set header. I even have the <meta> tag in there specifying that the page is UTF-8.

But on my page, it looks like "?"

When I SELECT the characters in MySQL, they are Chinese. If I type those chinese characters directly into the HTML, they look fine. But when I SELECT them from the database, and print() them on the page, they look like "?"

Try as I might, I can't get those characters to render on my web page.
I've tried umpteen variations using utf_encode() and iconv()

I should add that other international characters are showing up OK... Spanish and French are looking fine.

What am I doing wrong?
...rephrase that.
How do I do this right?

phranque

5:00 am on Jul 13, 2010 (gmt 0)

i would assume you need to configure PHP with the mbstring [php.net] module to make that work.
that's only a guess on my part - i only marginally have the equivalent perl knowledge.

httpwebwitch

6:15 pm on Jul 13, 2010 (gmt 0)

here's an interesting illustration

<?php
header('Content-Type:text/html; charset=UTF-8');

$r = mysql_fetch_array(mysql_query("SELECT name FROM items WHERE itemid = 1234"));
$n = $r['name'];
echo $n;

// shows ? ?

$r = mysql_fetch_array(mysql_query("SELECT HEX(name) as name FROM items WHERE itemid = 1234"));
$n = $r['name'];
echo hex2str($n);

// shows chinese characters, properly

// hex2str is a nifty little function I found online:

function hex2str($hex){
 $string='';
 for ($i=0; $i < strlen($hex)-1; $i+=2){
 $string .= chr(hexdec($hex[$i].$hex[$i+1]));
 }
 return $string;
}
?>

this method shows everything properly, regardless if it's chinese, spanish (with accents), or english (latin), in every browser I've tried.

It does seem like excessive effort, just to show some characters exactly as they're stored, from a UTF8 database onto a UTF8 page. Why does it work when I convert text into HEX and back again?

httpwebwitch

6:22 pm on Jul 13, 2010 (gmt 0)

Does this mean that whenever I want to show text from the database (um, a LOT), I'll have to SELECT it as HEX?

Holy frijoles, there must be a better way than this

httpwebwitch

1:17 am on Jul 14, 2010 (gmt 0)

correction to the code above. It works in Chrome, Firefox, and Safari on a Mac.

It does not work in Firefox or IE on Windows.

Someone put me out of this character encoding misery.

youfoundjake

1:55 am on Jul 14, 2010 (gmt 0)

What happens if you change it to latin1_swedish_ci?

phranque

7:48 am on Jul 14, 2010 (gmt 0)

the windows problem might be a missing font.

httpwebwitch

2:41 pm on Jul 14, 2010 (gmt 0)

I've discovered what my "problem" is. Turns out I was displaying things right all along, but my data was bad because the method I'd provided users to enter it was not posting a <form> with the right encoding.

Things *looked* fine in the database because I could SELECT stuff and it looked like Chinese, but the data was in fact not in UTF8, despite being stored in a database/table/column that were all configured as UTF8.

Now that I've dicovered the problem, I realize it was an easy trap to get stuck in. I'll certainly write up a how-to and post it here.

httpwebwitch

2:45 pm on Jul 14, 2010 (gmt 0)

oh @phranque, the mb- functions are great, they're replacements for string functions like substr(), that perform properly on multibyte characters. PHP's native string functions only work properly on single-byte characters. ttyl

httpwebwitch

3:33 pm on Jul 14, 2010 (gmt 0)

The (now obvious to me) fact is, if you store Chinese characters in a database that is configured as UTF8, what you should see when you SELECT from it is "è„‚è‚ªéš".

If you SELECT from the database and you see Chinese characters, your data is wrong. That was the crux of my gotcha. I was seeing Chinese characters in the database, thinking that was correct, and that my display methods were wrong. It was the other way around!

Typically, if a page that has a <form> on it is encoded as UTF8, then the $_POST that comes back will be in UTF8. Assuming your database is also configured with UTF8 encoding, you don't need to do anything special to the data before an INSERT or UPDATE - just put it in there exactly as it arrived. (Of course you should still run it through mysql_real_escape_string() on the way in and htmlspecialchars() on the way out, to avoid SQL injection and XSS respectively!)

And when you want to display something UTF8 encoded from your database onto an HTML page with UTF8 encoding, it's simple: SELECT it, and print() or echo() it as is. The bytes sent to your browser are "è„‚è‚ªéš", but what you see rendered is "脂肪魚"

When I first started this project, those pages - with the form on them - were not using UTF8 encoding. So the data coming into $_POST from the <form> was not UTF encoded, and it was being stored in a Latin1 db column.

Later, when I started this whole i18n endeavour, I switched the database to UTF8 and put the UTF8 character headers on the pages too. But the stuff in my database was not multi-byte encoded.

An HTML page using UTF8 expects the source to have "è„‚è‚ªéš" in it. It will render that as "脂肪魚". IF a UTF8 HTML page receives "脂肪魚", that's when you see "?????".

BTW it's not just Chinese that is affected, it's dashes and apostrophes and copyright signs and all the other rich text characters. Users will paste them into the form straight from Word. That's how it is.

Another gotcha was that I'd used other methods to insert data - importing from CSV, importing from TXT files, etc - these were not UTF8 encoded, so they were stored wrong. Looking in my database, I see a word like "Sugestión", but what I SHOULD see is "SugestiÃ³n". The smart thing would have been to read the source data into PHP, and INSERT it after applying utf8_encode() to it.

The moral is: start EVERY project with everything configured in UTF8, from the start, whether you think it'll be required or not. And realize that the data in your database should not look readable, it should look like "è„‚è‚ªéš".

Now, I also learned that MOST browsers will return UTF8 encoded stuff in the $_POST when the <form> is on a page that uses UTF8 encoding. This is *mostly* true, for modern compliant browsers. But many bloggers cite that you can't count on it with 100% certainty. Some browsers might send their $_POST to your server with "脂肪魚" instead of "è„‚è‚ªéš". And as I know now, if I INSERT "脂肪魚" into my database, later on I'll see it on the web site looking like "?????"

The workaround (emphasis on "work") is to put a hidden field in your <form> that contains some extended characters, like "óóó". Then when your server receives the $_POST, you check to see if that field looks like "Ã³Ã³Ã³". If it does, then you know the $_POST is UTF8 encoded, and you may INSERT everything into your UTF8 database as-is. If it's not, there are checks you can do to figure out what encoding it is in, and apply appropriate conversion techniques to all the other data.

So, problem solved, I finally understand where I went wrong and how to fix it. I hope this helps you if you encounter similar problems with your own i18n efforts. Finally I can't emphasize strongly enough, it's WORTH THE TIME to learn and grasp all this encoding stuff, and start every project with everything in UTF8.

Now that domains and URLs and email addresses can contain Arabic characters etc, you really have no excuse to store anything in Latin1. The only things I still keep in Latin1 are MD5 hashes, private ENUMs and other such things that will never require an extended character set.

httpwebwitch

3:46 pm on Jul 14, 2010 (gmt 0)

Final note:
The program I use to browse my MySQL database shows everything looking like "Ã³Ã³Ã³", and there's no option (that I can find in the menus or help) to display it in a readable way.

But, a similar program on my Macbook does allow you to "view as UTF8", so even though I know the stored data is multi-byte encoded, you get to see the readable characters.

It is prudent to be sure what is really happening when you use such an app for editing data. When you type "��" directly into the field in your GUI, is it being stored properly? You'll know it's not if your web page shows "????" or that funny diamond-shaped or empty rectangle character...

best wishes, hww

JAB Creations

10:13 pm on Jul 14, 2010 (gmt 0)

If you know me well enough you know I'm always pushing standards to their brink (XHTML as application/xhtml+xml, WAI AAA compliance, and of course UTF-8) so lucky for you I figured this out a long time ago... ;)

GET and POST

$query1 = "SET CHARACTER SET 'utf8';";

POST Only

$query2 = "SET NAMES 'utf8'";

Also make sure that when you create a database that you set the collation to utf8_general_ci which I think offhand automatically makes all the tables you create automatically UTF8 (at least through phpMyAdmin).

- John

Ujang

11:44 pm on Jul 14, 2010 (gmt 0)

I have similar problem with japanese charset before. What I did was - in Notepad++ - there's an option to place a signature on your files which lets the browser (user agent) know they're dealing with UTF-8 pages.

Menu >> Format >> 'Encode in UTF-8 Without BOM'.

**BOM info: [w3.org ]

JAB Creations

12:05 am on Jul 15, 2010 (gmt 0)

Ujang remembered something I did not earlier and they're right; Microsoft thought up of an ingenious (sarcasm) solution in order to further aggravate web designers worldwide. If you serve your pages as application/xhtml+xml and you've got a BOM on the page it should break your XML in fully compliant browsers (Gecko and Presto though KHTML and WebKit are not fully XHTML compliant and it's too early to call on IE9).

I use a freeware program called SuperEdi which allows you to remove the BOM when saving a file. Most editors in my experience do not display the byte order mark option (which I've never found any justifiable reason to use let alone enable by default in any scenarios).

- John

UserFriendly

12:59 am on Jul 15, 2010 (gmt 0)

httpwebwitch, have you made sure to tell PHP which encoding it should be using on your database connection?

You need to make a call to the set_charset function for your MySQL driver. For instance, if you're using the MySQLi (improved) driver, then you would issue a call like this immediately after you've opened the connection resource (referred to by a variable called $db):

mysqli_set_charset('utf8', $db);

If you are using the standard MySQL driver instead, then the call would be:

mysql_set_charset('utf8', $db);

Note that you can also call these methods in the object-oriented notation:

$db->set_charset('utf8');

See the PHP documentation for this:

[php.net...]

[edited by: UserFriendly at 1:04 am (utc) on Jul 15, 2010]

UserFriendly

1:00 am on Jul 15, 2010 (gmt 0)

Oh, and note that the PHP docs state that the set_charset functions are preferred to simply issuing a "SET NAMES" query (because that query does not tell PHP what encoding to use and expect).

kazakh

6:30 am on Jul 15, 2010 (gmt 0)

mysql_query('set names utf8')

httpwebwitch

1:20 pm on Jul 15, 2010 (gmt 0)

@UserFriendly,
using mysql_set_charset() caused my Chinese characters to be displayed as "偉大的產品好價錢"

Without mysql_set_charset(), everything works fine. I can accept a <form>, read the $_POST, do a SQL INSERT, SELECT, print()... all the pieces fit and the Chinese characters look good on the page without any fiddling with encodings or conversions.

In what situation would mysql_set_charset() be needed? Is that for when your DB is Latin1, but the page is UTF8? Or the other way around? Alls I know is if both are UTF8, mysql_set_charset() causes problems.

kapow

1:43 pm on Jul 15, 2010 (gmt 0)

...and apply appropriate conversion techniques to all the other data

We manage an old japanese site where all the content is in hex (static html not db). Is there a quick way to convert all the old content from hex to utf8 encoding?

phranque

3:20 pm on Jul 15, 2010 (gmt 0)

use the iconv utility or API.

UserFriendly

3:53 pm on Jul 15, 2010 (gmt 0)

httpwebwitch, that is strange. (And I take it that "偉大的產品好價錢" is not correct. I'm completely clueless about eastern script.)

To my understanding, set_charset should be used every time a MySQL connection is opened, unless you're certain that the defaults (for client and server) are exactly as you need them already.

There's no chance that the data was inserted into the database over a connection that had another encoding, is there? Such that the database is storing a non-utf8 byte stream and issuing it under the banner of a UTF-8 encoding? The only way I can think of to check for this is to set_charset('utf8') and then store a new record in the database, then retrieve only that record and see whether everything is correct. If that does produce a correct retrieval, it would suggest that the data already in the database is in the wrong encoding. If instead you see the same problems, then I don't know what's going on.

httpwebwitch

3:57 pm on Jul 15, 2010 (gmt 0)

@UserFriendly, OK I'll try that.

Oh, I found a nifty trick for finding rows that have multibyte characters in MySQL:

SELECT mycolumn FROM mytable WHERE LENGTH(mycolumn) != CHAR_LENGTH(mycolumn);

neat.

httpwebwitch

4:13 pm on Jul 15, 2010 (gmt 0)

@UFriendly,
I added mysql_set_charset('utf8'), then viewed the page. The Chinese looked like 大的� etc.

Then I edited that item and resaved it, and now it looks OK again on the page.

So it was working before, and now it's working after. Status quo. However now I have a bunch of data that needs to be edited & re-encoded, especially all the Spanish stuff with their funny accents.

So now I wonder, what is the real state of things in my database? are things being encoded, double-encoded, anti-encoded, or what. If I export the database into Excel, what will I see?

headdesk.
Is this job done yet? :)

UserFriendly

3:54 am on Jul 16, 2010 (gmt 0)

So you think it is simply the case that the data was inserted over a connection declaring the wrong encoding?

If so, at least you've uncovered the cause of the strangeness.

As for re-encoding the existing data, there's probably a way of doing that automatically if you work out in which encoding it was sent to the database. But that's a whole other can of worms, and probably only worth the hassle if you have hundreds or thousands of records to fix.

httpwebwitch

3:22 am on Jul 17, 2010 (gmt 0)

Yes, it is the case that my data was not properly encoded. Now I'm certain that everything is UTF8 - going in, coming out, top to bottom. I've tested on 6 different browsers on 3 operating systems, entering Chinese, Cree, Hebrew, Arabic, and French. Everything is A-OK.

I do still have some old data with bad encoding, but it's only a few thousand rows; I can find & replace the naughty characters and it shouldn't take too long to get everything smoothed out. In fact it's half done already; most of the naughty bits were dashes, apostrophes, copyright symbols, and the occasional accented vowel. Oh, and there is some Spanish in there with that funny little squiggle thing over the "n". Got those.

UserFriendly

12:20 am on Jul 18, 2010 (gmt 0)

Glad you're on top of it now.