UTF-8 or GB / Big5 encodings?

Do the engines understand?

         

shri

2:05 am on Mar 4, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I can only have one page per topic for the Greater China area.

How bad would it be if I encoded it in UTF-8? I see Hong Kong pages encoded in Big5, and I'm curious whether Google and the other engines can figure out that UTF-8 content is equivalent to Big5 content.

bill

3:28 am on Mar 4, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you're targeting the mainland then GB2312 is the safest encoding to use. Big5 is for the Hong Kong and Taiwan markets. You're looking at different languages as well: Simplified Chinese on the mainland and Traditional Chinese for HK & TW. The characters used are different. (You probably knew all that.)

You can use UTF-8 but be aware that there are issues with older browser support. Also keep in mind that your content can be accessed by a lot of non-PC hardware nowadays. Your pages can be pulled up by phones, TVs, PDAs, etc. and some of them may have display issues. In China you can't go wrong with GB2312. UTF-8 is more of a luxury for the webmaster.

The SEs can handle the character sets as long as you code the page properly. Just declare the charset with an HTTP header, add a meta charset tag just before your title element, and declare the content language on the <html> tag.
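To make those three declarations concrete, here is a minimal Python sketch of a response that carries them. The function name `build_response`, the `zh-HK` language tag and the sample text are all made up for illustration; swap in whatever your server and pages actually use.

```python
import html

# Assumptions for illustration only: the charset and language tag below
# are examples, not a recommendation for any particular market.
CHARSET = "utf-8"   # could equally be "big5" or "gb2312"
LANG = "zh-HK"      # hypothetical language tag for a Hong Kong page


def build_response(title, body_text, charset=CHARSET, lang=LANG):
    """Return (headers, body_bytes) with the charset declared in the HTTP
    header and in a meta tag placed before <title>, plus the content
    language on the <html> tag."""
    page = (
        "<!DOCTYPE html>\n"
        f'<html lang="{lang}">\n'
        "<head>\n"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n"
        f"<body><p>{html.escape(body_text)}</p></body>\n"
        "</html>\n"
    )
    # The bytes on the wire must really be in the declared charset.
    body = page.encode(charset)
    headers = {
        "Content-Type": f"text/html; charset={charset}",
        "Content-Language": lang,
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_response("香港", "繁體中文")
```

The point of the sketch is only that the header, the meta tag and the actual byte encoding all agree; a mismatch between them is the usual cause of display problems.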

shri

5:44 am on Mar 4, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, I'm targeting Hong Kong for now and just want to figure out whether I'm safe doing it in UTF-8 (for further expansion) or whether I should muck around with country-specific character set targeting in what might turn into a multi-country site.

shri

5:51 am on Mar 4, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just did a search with the lr=lang_zh-TW parameter and the pages show up. I assume this means that Google did sort of figure out the UTF-8 encoding?

leunga

4:40 am on Mar 5, 2006 (gmt 0)

10+ Year Member



Hi Shri,
Yes, I think UTF-8 encoding will not harm your site. It should be visible to Google and the others. The SEs will also examine your site to determine whether it is in Traditional or Simplified Chinese; I guess this is done by looking at the range of characters the site uses. I also tried the Google search box with the "show HK website only" option toggled, and I can see that some sites are in fact encoded in UTF-8. :)
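Nobody outside the search engines knows how they really tell the two scripts apart, so the following is only a guess at the idea, sketched in Python. It counts characters that fit in Big5 but not GB2312 (treated as Traditional) against those that fit in GB2312 but not Big5 (treated as Simplified). The function names are made up, and characters shared by both sets are simply ignored.

```python
def _fits(ch, codec):
    """True if the single character can be encoded in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def guess_script(text):
    """Crude heuristic, not any search engine's real method."""
    trad_only = 0
    simp_only = 0
    for ch in text:
        if ch.isascii():
            continue  # ASCII tells us nothing about the script
        in_big5 = _fits(ch, "big5")
        in_gb = _fits(ch, "gb2312")
        if in_big5 and not in_gb:
            trad_only += 1
        elif in_gb and not in_big5:
            simp_only += 1
        # characters present in both (or neither) are not counted
    if trad_only > simp_only:
        return "traditional"
    if simp_only > trad_only:
        return "simplified"
    return "unknown"


print(guess_script("漢語"))
print(guess_script("汉语"))
```

Note that this works on the decoded text, so it gives the same answer whether the page was delivered as UTF-8, Big5 or GB2312.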
leunga

guoqi

1:48 pm on Mar 6, 2006 (gmt 0)

10+ Year Member



If you are targeting mainland China, I suppose you had better use GB2312.

UTF-8 uses a different encoding scheme from GB2312. People who use browsers or devices with a predefined GB2312 encoding might have difficulty visiting your site; they will only see a lot of strange characters. Most browsers are supposed to select the correct encoding according to the charset declared on your page, but in practice I have seen this fail pretty often.
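The "strange characters" effect is easy to reproduce. The short Python sketch below (sample text chosen arbitrarily) encodes a string as UTF-8 and then decodes the same bytes the way a client stuck on GB2312 would, next to a client that honours the declared charset.

```python
text = "繁體中文"
raw = text.encode("utf-8")  # the bytes the server actually sends

# A client that ignores the declaration and assumes GB2312.
# errors="replace" stands in for whatever the device does with bad bytes.
garbled = raw.decode("gb2312", errors="replace")

# A client that honours the declared charset.
correct = raw.decode("utf-8")

print("assumed GB2312:", garbled)
print("honoured UTF-8:", correct)
```

The bytes are identical in both cases; only the decoder's assumption differs, which is why a wrong or ignored declaration shows up as garbage rather than as a missing page.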

My question is: if you can use GB2312, why take the risk of using UTF-8?

It is just my opinion.

leunga

1:08 pm on Mar 8, 2006 (gmt 0)

10+ Year Member



If characters are to be encoded in GB for display, do we also need to store them in the database in GB? If so, that involves additional considerations. In HK, as far as I know, most Chinese character input methods produce Big5. If GB is preferred, I guess some sort of conversion will be needed, e.g. using iconv with PHP.
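As a rough illustration of that conversion step, here is the same idea in Python rather than PHP's iconv (the helper names `transcode` and `can_encode` are made up). One caveat worth showing: changing the encoding does not turn Traditional characters into Simplified ones, so a Traditional character that GB2312 lacks will not convert at all.

```python
def transcode(data, src, dst):
    """Re-encode bytes from one charset to another via Unicode."""
    return data.decode(src).encode(dst)


def can_encode(text, codec):
    """True if every character in text exists in the target charset."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Characters common to both sets convert cleanly.
big5_bytes = "中文".encode("big5")
gb_bytes = transcode(big5_bytes, "big5", "gb2312")
print(gb_bytes.decode("gb2312"))

# A Traditional-only character has no GB2312 code point.
print(can_encode("漢", "gb2312"))
print(can_encode("汉", "gb2312"))
```

So if the input arrives in Big5 and the output has to be GB2312, a plain charset conversion is only half the job; the Traditional-to-Simplified mapping is a separate step that this sketch does not attempt.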