Encoding pages for a Chinese audience

Newbie needs advice.

         

HarryM

8:22 pm on Feb 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I am a newbie at this so would appreciate any advice. I have trawled previous posts but could not find anything that answered my specific question, and of course the situation may have changed.

I have a non-commercial .com site hosted in the UK. It uses XHTML 1.0 Transitional, with charset ISO-8859-1. I am proposing to make a duplicate of part of the site which should be readable to the majority of Chinese users and also Asian search engines. The pages contain mainly images and very little text, so translation is not an issue. The new pages will be written in Mandarin using Simplified Characters.

My question is, what encoding is best? And how do I implement it?

The &#xxxxx; codes are the easiest because I can create them with Word 2000. They seem to render correctly in all the browsers I have tried, no matter what encoding I set on the page. However, I am based in the UK, so I don't know if this would be the case in Asia.

If this is the way to go, what do I set my encoding to?

The obvious alternative is BG2312. But if this is the way to go, is there an easy method of conversion from characters produced on Word 2000? Do I also have to save the pages in any special way? (They are all .php)

If BG2312 is the way to go, what do I declare in my pages? I have tried various combinations, but nothing seems to automatically set my browser to the correct encoding. My current header is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"../DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

I realise I will have to change the charset, but I am unclear about the "en". In the HTML statement, do they both get replaced by zh?

Also is it necessary to include an XML declaration, such as:

<?xml version="1.0" encoding="gb2312"?>

Thanks in advance

Harry

takagi

9:10 am on Feb 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"The &#xxxxx; codes are the easiest because I can create them with Word 2000."

I don't know much about Chinese encodings, but if you limit yourself to those codes, there is no longer any need to worry about the encoding.
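
As a rough illustration of why the page encoding stops mattering, here is a minimal Python sketch that turns Chinese text into decimal numeric character references, leaving pure ASCII behind. The function name `to_numeric_refs` and the sample text are made up for this example; Word 2000 or any other converter would do the same job.

```python
def to_numeric_refs(text: str) -> str:
    """Replace each non-ASCII character with a decimal reference such as &#20013;."""
    # "xmlcharrefreplace" is a standard Python codec error handler;
    # plain ASCII characters pass through unchanged.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "中文"  # illustrative sample text, not taken from the site
print(to_numeric_refs(sample))
```

Because the output contains only ASCII characters, it reads the same whether the page is declared as iso-8859-1 or gb2312.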

HarryM

1:53 pm on Feb 16, 2004 (gmt 0)

takagi,

Thanks for the reply. This is the first time I have ever had to become involved in encoding, so I am still lost.

I understand that &#xxxxx; codes are ASCII codes. They render OK if I have my charset=iso-8859-1. They also look OK if I have charset=bg2312, but mine is all Western software and I don't know if these fonts are normally available on Chinese PCs. As I am only just starting on writing the Chinese pages, I want to make sure I get all encoding issues sorted out before I get too far.

From what you say they should render OK on Japanese PCs. As a side issue can Japanese speakers read Chinese simplified characters as Kanji, or only the traditional Chinese versions? In fact do many people read Kanji these days?

Harry

bill

4:01 am on Feb 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just to clarify here...you want GB2312, not BG2312

My Chinese site isn't XHTML yet, but here's what I've used:

<html lang="zh">
<head>
<meta http-equiv="content-type" content="text/html;charset=gb2312">
<meta http-equiv="content-language" content="zh">

Your current header declares the page as English with Western European encoding. If you're going to use Chinese text on the page, you're going to have to change that. Maybe one of our Chinese members could tell us the proper way to do this for XHTML.

takagi

9:24 am on Feb 17, 2004 (gmt 0)

Hello Harry,

Lots of questions, but I will do my best.

"I understand that &#xxxxx; codes are ASCII codes. They render OK if I have my charset=iso-8859-1. They also look OK if I have charset=bg2312, but mine is all Western software and I don't know if these fonts are normally available on Chinese PCs. As I am only just starting on writing the Chinese pages, I want to make sure I get all encoding issues sorted out before I get too far."

As a matter of fact, these codes are closer to Unicode/UTF-8 than to ASCII (which is a 7-bit code with only 95 printable characters). You could try this Japanese page [kanzaki.com] to get Unicode converted into these codes. But please have somebody who can read Chinese double-check the result before you post it on the Internet. BTW, according to this Chinese Website Encoding [webmasterworld.com] thread, 'GB' stands for 'Guo (=Country) Biao (=Standard)'.
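
To make the Unicode connection concrete, the following Python sketch decodes one numeric reference and compares the number in it with the character's Unicode code point and with its GB2312 byte values. The helper name `describe_reference` is hypothetical, and the sample character is only an illustration.

```python
import html


def describe_reference(ref: str) -> tuple[str, int, str]:
    """Return the character, its Unicode code point, and its GB2312 bytes in hex."""
    char = html.unescape(ref)  # resolves decimal (&#20013;) and hex (&#x4E2D;) forms
    return char, ord(char), char.encode("gb2312").hex()


char, code_point, gb_hex = describe_reference("&#20013;")
# The number inside the reference is the Unicode code point,
# not the two-byte value that GB2312 assigns to the same character.
print(char, f"U+{code_point:04X}", gb_hex)
```

The number in the reference matches the code point and differs from the GB2312 bytes, which is why the same reference works under any declared charset.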

"From what you say they should render OK on Japanese PCs."

Yes, they should render OK if the PC being used has the correct font installed. That shouldn't be a problem for the targeted users.

"As a side issue can Japanese speakers read Chinese simplified characters as Kanji, or only the traditional Chinese versions?"

Japanese speakers can read Japanese kanji, which are sometimes identical to the traditional Chinese characters and sometimes closer to the simplified ones. But since the Japanese started to use them (over 1000 years ago), they have changed some kanji and added others. I guess the traditional Chinese version is also not the same as it was around the year 900. Most Japanese can get an idea of what a Chinese text (simplified or traditional) is about, but cannot really read it. It's like someone who studied Latin seeing a Spanish text for the first time. The Latin skills will help in making an educated guess about the contents, but to really read it, this person needs to learn Spanish.

"In fact do many people read Kanji these days?"

Sure they do. Without being able to read kanji, life is hard in Japan.

Looking at the source of this page [linguaitaliana.com] can help you understand how these codes would work in a web page made with a simple ASCII editor.

HarryM

4:02 pm on Feb 17, 2004 (gmt 0)

Many thanks for all the replies. I have now sorted out the technical issues. The difficulty I had in specifying GB encoding was due to my using an internal server which was probably incorrectly set up. I now have both GB2312 and HTML numeric &#xxxx; encoded test pages on my live server, and they switch browser encoding and render correctly.

The only question remaining is should I go with GB2312 encoding or HTML numeric? From just looking at what's out there, it seems GB2312 would be the way to go, but I would welcome any comments before I start the hard stuff - getting to grips with all the pinyin I learnt years ago, and which is now very, very rusty.

Incidentally, does anyone know of a good on-line Chinese glossary or source of internet terms?

Harry

takagi

3:45 am on Feb 18, 2004 (gmt 0)

"The only question remaining is should I go with GB2312 encoding or HTML numeric?"

The good thing about using the numeric codes is that it is easier to maintain the code in an ASCII editor. If you happen to save a GB2312 encoded file in the wrong format, some automatic conversion could cause problems you can only see if you can read Chinese.
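
The failure mode described here can be reproduced in a few lines of Python. This is only a sketch with made-up sample text: it shows that GB2312 bytes read as ISO-8859-1 decode without any error, so nothing warns the editor, and it adds a hypothetical helper `is_valid_gb2312` that at least catches files that are no longer decodable as GB2312.

```python
def is_valid_gb2312(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as GB2312."""
    try:
        data.decode("gb2312")
        return True
    except UnicodeDecodeError:
        return False


original = "中文"  # illustrative sample text
gb_bytes = original.encode("gb2312")

# Reading the same bytes as Western European text raises no error;
# it silently yields accented Latin letters instead of Chinese.
misread = gb_bytes.decode("iso-8859-1")

print(repr(misread))
print(is_valid_gb2312(gb_bytes))
```

A decode check like this only catches outright corruption. It cannot tell whether the characters that come out are the ones intended, so a reader of Chinese is still needed for that.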

OTOH, some special browsers (e.g. those embedded in a mobile phone or specially made for blind people) could have problems with something as unusual as text in these numeric codes. The same potential problem applies to niche search engines. The numeric codes also use more bandwidth, but that's no problem in your case since you wrote "the pages contain mainly images and very little text".
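
For a feel of the size difference, this Python sketch counts the bytes needed for the same text in GB2312 and as decimal numeric references. The function name `encoded_sizes` and the sample text are assumptions made for the example.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Count the bytes needed for text as GB2312 and as decimal numeric references."""
    as_gb2312 = text.encode("gb2312")
    as_refs = text.encode("ascii", errors="xmlcharrefreplace")
    return {"gb2312": len(as_gb2312), "numeric_refs": len(as_refs)}


print(encoded_sizes("中文"))  # two-character sample, illustrative only
```

Each Chinese character takes two bytes in GB2312 but eight bytes as a five-digit decimal reference, so the text portion grows about fourfold. On image-heavy pages that overhead is negligible.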

HarryM

12:41 am on Feb 20, 2004 (gmt 0)

takagi,

Thanks for the help.

I've decided to go with GB2312. It may not be the easiest, but it's probably worth the effort.

Harry

bill

2:28 am on Feb 20, 2004 (gmt 0)

How did you handle the XML declarations in the header?

HarryM

3:37 pm on Feb 20, 2004 (gmt 0)

I haven't done anything special. My full page header is below.

<?xml version="1.0" encoding="GB2312"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh" lang="zh">
<head>
<meta http-equiv="Content-Language" content="zh" />
<meta http-equiv="Content-Type" content="text/html; charset=GB2312" />
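
Since a header like this declares the encoding in two places, one cheap sanity check is to confirm that the XML declaration and the meta tag agree. The Python sketch below does that with two regular expressions. The function name `declared_charsets` is hypothetical, and the patterns only cover headers written in the style shown above, not HTML in general.

```python
import re
from typing import Optional


def declared_charsets(header: str) -> dict[str, Optional[str]]:
    """Extract the encoding named in the XML declaration and in the meta tag."""
    xml_match = re.search(r'<\?xml[^>]*\bencoding="([^"]+)"', header)
    meta_match = re.search(r'<meta[^>]*\bcharset=([\w-]+)', header, re.IGNORECASE)
    return {
        "xml": xml_match.group(1).lower() if xml_match else None,
        "meta": meta_match.group(1).lower() if meta_match else None,
    }


HEADER = """<?xml version="1.0" encoding="GB2312"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh" lang="zh">
<head>
<meta http-equiv="Content-Language" content="zh" />
<meta http-equiv="Content-Type" content="text/html; charset=GB2312" />"""

found = declared_charsets(HEADER)
print(found, "consistent" if found["xml"] == found["meta"] else "MISMATCH")
```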

However, this may not be 100% foolproof. I still have one problem which I have raised on the Apache forum, but so far no reply.

I created a test GB2312 page and placed it on my live server. All browsers automatically switch encoding correctly between this and my other pages.

However only Opera switches encoding automatically when accessing similar pages on my local host test server. IE and Mozilla both remain set to Western European, although the pages render OK if I set the encoding manually. I am no expert on Apache and as I installed the local host server myself, I suspect the problem is something missing in the config file.

As I use the local host server to develop the pages, this is a bit of a pain. :(

Harry

takagi

4:18 am on Feb 21, 2004 (gmt 0)

What do you see when you run the page through the Server Header Check [webmasterworld.com]? Is there any encoding in the response headers?

HarryM

2:13 pm on Feb 21, 2004 (gmt 0)

The header of a GB2312 test page on my live server looks OK to me. It doesn't include any encoding information, but neither do the headers from other GB sites that I checked.

HTTP/1.1 200 OK
Date: Sat, 21 Feb 2004 13:32:48 GMT
Server: Apache/1.3.26 (Unix) PHP/4.3.4 mod_perl/1.27 mod_ssl/2.8.10 OpenSSL/0.9.6a
X-Powered-By: PHP/4.3.4
Connection: close
Content-Type: text/html

Unfortunately I can't use the utility on the identical page served by my localhost:8080/. Perhaps it's possible, but I don't know how.
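
One way to inspect a local server's headers without an online utility is a short script run on the same machine. The Python sketch below sends a HEAD request and extracts any charset from the Content-Type header. The function names and the example path are hypothetical, and port 8080 is taken from the localhost address mentioned above. If the local server turns out to send a charset that the live server does not (Apache can add one through its AddDefaultCharset directive, for example), that would be a plausible explanation, since a charset in the HTTP header generally takes precedence over the meta tag.

```python
import http.client
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def fetch_content_type(host: str, port: int, path: str = "/") -> Optional[str]:
    """Send a HEAD request and return the raw Content-Type header, if any."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().getheader("Content-Type")
    finally:
        conn.close()


# The live server's header shown above carries no charset at all:
print(charset_from_content_type("text/html"))
# A header that does name an encoding:
print(charset_from_content_type("text/html; charset=GB2312"))

# Example call against a local test server (hypothetical path):
# print(fetch_content_type("localhost", 8080, "/test-gb2312.php"))
```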

I have added some more details at the Apache forum where I also raised the problem. I suspect it is an Apache problem.

[webmasterworld.com...]

Thanks for your time, takagi. If you want to look at the test pages, I could sticky you the url.

Harry