htmlentities and Foreign Characters

mdurrant

7:17 pm on May 26, 2008 (gmt 0)

Hello,
When someone posts content on my site in a foreign language (in this case, Chinese), it is saved in the DB as a numeric entity like &#35831;. When I then run it through htmlentities(), the ampersand gets converted and I end up with &amp;#35831; in the page source, so the browser displays the numeric code instead of the character.

Is there a better approach to this? I'm stuck and would appreciate some guidance, Thanks! :)

httpwebwitch

8:01 pm on May 26, 2008 (gmt 0)

Ah, double-encoding. A bane of existence. The obtuse answer is: don't use htmlentities()! But that's probably not the useful advice you were looking for; it's akin to "if it hurts to breathe, stop breathing".
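
To make the double-encoding concrete, here is a minimal sketch. It assumes the stored value is the numeric entity &#35831; (the decimal reference for 请) and that the page is served as UTF-8; the variable names are made up for illustration.

```php
<?php
// What the question describes: the database already holds an HTML entity.
$stored = '&#35831;';

// htmlentities() escapes the ampersand a second time, so the browser
// shows the literal code instead of the character.
$doubled = htmlentities($stored, ENT_QUOTES, 'UTF-8');
echo $doubled, "\n"; // &amp;#35831;

// Stopgap: the fourth argument (double_encode, available since PHP 5.2.3)
// tells PHP to leave entities that are already there alone.
$once = htmlentities($stored, ENT_QUOTES, 'UTF-8', false);
echo $once, "\n"; // &#35831;
```

Passing false for double_encode only hides the symptom. The data in the database is still pre-escaped, which is the real problem.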

There are some really great tutorials out there about handling extended characters. It's a complicated subject, and it'll take a bit of reading, but you should learn about UTF-8: how it works and how to design a system that uses it effectively. Do some searches for "UTF-8" and learn the basics.

Then once you have absorbed what UTF-8 is, there is an entire chapter devoted to using it effectively in the O'Reilly book "Building Scalable Web Sites" by Cal Henderson.

A rule of thumb is: don't store escaped stuff in your database. Unless you're deliberately "denormalizing" or "pre-processing" data for performance reasons, the data should be stored in its rawest, unencoded, unescaped, nakedest form possible. That means, you look in the database and you should see Chinese characters, not &#35831; stuff.

Then when you're preparing/rendering data for output, that's when you do htmlencoding, escaping, etc., as required. For instance, if your data is being output in XML, there's a lot of encoding and escaping that needs to be done.
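
As a rough sketch of "store raw, encode on output" for plain HTML: this assumes the script file and the page are both UTF-8, and $comment stands in for whatever came back from the database.

```php
<?php
// Raw UTF-8 text as it would sit in the database: no entities, no escaping.
$comment = '请 <b>sit</b> & "relax"';

// Encode only at output time. htmlspecialchars() touches just & < > and
// quotes, so the Chinese character passes through unchanged.
$html = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

echo $html, "\n"; // 请 &lt;b&gt;sit&lt;/b&gt; &amp; &quot;relax&quot;
```

For the character to display correctly the page also has to declare its encoding, for example with header('Content-Type: text/html; charset=utf-8') before any output.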

Getting user-entered data into the database "raw" is tricky, and it's where built-in functions like PHP's mysql_real_escape_string() come in handy.
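
mysql_real_escape_string() needs a live MySQL connection and was removed in PHP 7, so the sketch below swaps in a different technique: a PDO prepared statement against an in-memory SQLite database. It assumes the pdo_sqlite extension is installed, and the comments table is invented for the demo.

```php
<?php
// Assumption: pdo_sqlite is available; the table exists only for this demo.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)');

// Raw user input: Chinese text plus characters that would break a
// hand-built query string.
$input = "请 it's <raw>";

// The placeholder keeps the value out of the SQL text, so nothing has to
// be escaped by hand and nothing is HTML-encoded before storage.
$stmt = $db->prepare('INSERT INTO comments (body) VALUES (:body)');
$stmt->execute([':body' => $input]);

// What comes back is exactly what went in.
$saved = $db->query('SELECT body FROM comments')->fetchColumn();
var_dump($saved === $input); // bool(true)
```

With MySQL the same pattern should apply; you would also want the connection charset set to UTF-8 (for example charset=utf8mb4 in the PDO DSN) so the bytes are not mangled on the way in.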