I used the English version of Windows XP and Microsoft Publisher 2003 to create a Chinese website. To input the Chinese characters, I first typed them in NJStar and then copied and pasted them into Microsoft Publisher. I then saved the publication with UTF-8 encoding. However, I noticed that the Chinese characters are tagged as English even though they display as Chinese. I'm not sure I've described it clearly, so here's an example.
In my header, I have this:
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
but for the body I would have something like this:
...<span style='language:EN'> (some chinese characters)...
My Chinese characters (even those tagged as "language:EN") are displaying properly in the browser. However, I'm wondering if there would be any negative effects from the "language:EN". For example, would the search engines have trouble finding the words? Or would the search engine bots have trouble with it?
If there are any other pitfalls you can think of with the way I've built my website, I would very much appreciate it if you can share your knowledge with me. Thanks.
Microsoft Publisher is not the best tool for making websites. It makes some very messy and invalid code. If possible I would suggest you use something else. Publisher is really intended for making printed materials. It's not a bad tool for that purpose...but the web is a different story.
For example, the <span style='language:EN'> you mentioned is not a valid part of CSS1 or CSS2 to my knowledge (at least my validator didn't like it). I think you could delete that and nothing would change as far as the appearance on your page. I looked around the web for other examples of sites using that code and it seems that Publisher is the culprit in a lot of cases. If you would like to add some additional language tags to your code try these:
<html lang="zh">
<meta http-equiv="content-language" content="zh">
I fear you may have a lot of invalid code in your pages. You may benefit from some validation.
Validate your HTML here --> [validator.w3.org...]
Validate your CSS here --> [jigsaw.w3.org...]
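If you'd rather locate those declarations before deleting them by hand, a small Python sketch along these lines could list them. The function name `find_language_styles` and the sample string are made up for illustration, and the pattern only assumes the form quoted above (a style attribute containing `language:XX`); Publisher's real output may vary.

```python
import re

# Matches a style attribute whose value contains a bare "language:XX"
# declaration, e.g. style='language:EN'. Prefixed properties such as
# "mso-ansi-language" are deliberately skipped by the lookbehind.
LANGUAGE_STYLE = re.compile(
    r"""style\s*=\s*(['"])[^'"]*?(?<![\w-])language\s*:\s*([A-Za-z-]+)[^'"]*?\1""",
    re.IGNORECASE,
)

def find_language_styles(html):
    """Return the language codes found in non-standard style declarations."""
    return [m.group(2) for m in LANGUAGE_STYLE.finditer(html)]

sample = "<p><span style='language:EN'>text</span></p>"
print(find_language_styles(sample))
```

This only reports what it finds; it does not rewrite the file, so you can check the list before removing anything.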
Also, it looks like Microsoft Publisher is saving the pages as filtered HTML. I don't seem to be able to convert them back to regular HTML though.
If the only solution is for me to start over with FrontPage, sigh, then is there a way to quickly convert everything or do I have to lay it out from scratch again?
Thanks.
In my opinion it is better to use the encodings used by the Chinese themselves. Decide on which encoding to use - 'Big5' for Taiwan and Hong Kong, or 'GB2312' for mainland China and Singapore. I create identical pages in each encoding to cover both, with links between. (Site in my profile.)
For Big5 you could specify:
<html lang="zh">
<meta http-equiv="Content-Language" content="zh" />
<meta http-equiv="Content-Type" content="text/html; charset=Big5" />
For GB2312 you could specify:
<html lang="zh">
<meta http-equiv="Content-Language" content="zh" />
<meta http-equiv="Content-Type" content="text/html; charset=GB2312" />
Presumably you still have the NJStar WP files. If so you could download NJStar Communicator (free trial). It has an option 'Universal Code Convertor'. Copy the character phrase in NJStar WP and it shows automatically in the Convertor window (no need to paste). Then convert the code, and copy and paste the result into the html.
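If you'd rather not go through a converter program, the same re-encoding step can be sketched in a few lines of Python. This is a swapped-in technique, not a description of what NJStar does internally, and the function name `reencode` is made up for illustration. Note that it only changes the byte encoding: it does not turn simplified characters into traditional ones, so a simplified text may fail to encode as Big5.

```python
def reencode(data, source="utf-8", target="gb2312"):
    """Decode bytes from one encoding and re-encode them in another.

    Raises UnicodeEncodeError if a character has no code in the target.
    """
    return data.decode(source).encode(target)

utf8_bytes = "中文".encode("utf-8")       # three bytes per character in UTF-8
gb_bytes = reencode(utf8_bytes)            # two bytes per character in GB2312
big5_bytes = reencode(utf8_bytes, target="big5")
print(len(utf8_bytes), len(gb_bytes), len(big5_bytes))
```

Whichever target you pick, the charset in the page's meta tag has to match the encoding the file was actually saved in.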
Let's say I am building a page using simplified Chinese characters with some English. Below is the code that FrontPage generates. Obviously my site will have more than this, but in terms of getting the language settings correct for the search engines, is there anything wrong with this?
<html>
<head>
<meta http-equiv="Content-Language" content="zh-cn">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<title>New Page 1</title>
</head>
<body>
<p>这是中文.</p>
<p><span lang="en-us">This is English.</span></p>
</body>
</html>
A few other questions:
1) What is the "zh-cn" used for if the encoding is already set to gb2312? I don't necessarily want the page to be read as a mainland China web page.
2) If I set the encoding to big5, why do I still need to specify either "zh-tw" or "zh-hk"? I don't necessarily want my traditional Chinese pages to be read as a Hong Kong page or a Taiwan page. I just want it to be read as a page that uses traditional Chinese characters.
3) I guess I'm asking what is the purpose of setting zh-cn, zh-hk, or zh-tw if you are already specifying the encoding.
4) My page is primarily Chinese but there is some English. The search terms that I want to be found by search engines are in Chinese. In this case, should I use the Chinese encodings or UTF-8? I read that web pages that use multiple languages should use UTF-8.
5) Does the encoding that I use depend on the settings of my server? I am on a shared server and most of my web hosting company's customers are in the U.S., so if there are server settings, chances are they are set to accommodate English.
6) Sometimes FrontPage won't use the <span> tag when I go from Chinese to English, but the words still display properly. Do I want that <span> tag to be there?
Thanks.
Actually I am as confused about zh-tw and zh-hk, etc., as you. :)
There is no need to specify <span lang="en-us">. Both Big5 and GB2312 handle English ascii characters. You just have to be careful about using shortcuts such as — or / etc., which may not be recognized. (Actually those particular examples may be OK as I haven't tested them.)
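A quick way to check the ASCII point for yourself, as a Python sketch (the sample string is made up): plain English text comes out as the same bytes in Big5, GB2312, and UTF-8, so it needs no special handling in any of them.

```python
english = "This is English."

# ASCII characters keep the same single-byte codes in all of these encodings.
encoded = {name: english.encode(name) for name in ("ascii", "big5", "gb2312", "utf-8")}
all_identical = len(set(encoded.values())) == 1
print(all_identical)
```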
Big5 displays the English characters with normal spacing, whereas GB spaces them out as if they were characters. But there is no need to worry about this as this is quite normal on Chinese web sites.
The English characters will also look smaller because the Chinese characters take up the full height which in English is used for descenders and ascenders. You may need to set a line height to get larger spaces between the lines of characters.
You can input the English text into the Chinese in NJStar which pretty much guarantees that what you see in the WP will work. You can also insert Chinese punctuation symbols which are somewhat different to English. For instance in a list phrase such as "apples, pears, and oranges" the Chinese use a different style of comma.
There are academic arguments for and against UTF-8, but in my opinion it is better to go with the flow. The Chinese (and the Japanese) generally don't use it, so it's best to give them and their search engines what they would normally expect. I am sure you would be better off with Chinese search engines with Chinese encodings.
As to server settings, I have no idea. The best way to tell is to put up a page in Chinese, navigate to it from an English page, and if the browser automatically switches to the correct encoding and displays the characters correctly, then it should be OK. If you have to manually set encodings in your browser then you may have a problem, although it might be your page rather than the server.
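One way to narrow down whether the page or the server is at fault is to compare the charset in the server's Content-Type header with the one in the page's meta tag. Here is a minimal Python sketch, assuming you have already copied the header value from whatever tool you use to inspect responses; the function name `charset_from_header` and both sample values are made up for illustration.

```python
from email.message import Message

def charset_from_header(content_type):
    """Return the charset named in a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

server_header = "text/html; charset=ISO-8859-1"   # hypothetical server response
page_meta = "text/html; charset=gb2312"           # value from the page's meta tag

server_charset = charset_from_header(server_header)
if server_charset not in (None, charset_from_header(page_meta)):
    print("Server header and meta tag disagree:", server_charset)
```

If the server names a charset at all, browsers give the HTTP header precedence over the meta tag, so a mismatch like the one above would explain garbled characters even when the page itself is correct.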
Back in the day when Netscape was king and Windows was a lot less adept at handling multiple languages, the <meta http-equiv="content-language" content="zh"> tag was actually very helpful. If the user had chosen zh as one of their language choices in the browser options then Netscape (and other browsers) would automatically switch the page encoding in the browser. Pages without this tag required the user to manually switch the encoding of the page. Today's browsers are a lot better at handling different languages, so this is not as necessary as it was in the past. I've simply continued adding these tags to my Chinese and Japanese pages out of habit. There may still be browsers that require this. Some spiders may recognize this tag as well, but I've never seen any definitive evidence that they do. Regardless, it won't hurt to have this tag in there.
HarryM is correct about the UTF-8 encoding. Although I think it's a great idea, I am still being warned by local programmers not to use it on websites. It is recommended to use the proper local encodings.
The <span lang="en-us"> problem was an issue with older versions of FP. I haven't noticed this with FP 2003, but I do a lot of my page development in the HTML editor so I may be missing this.
[w3.org...]
Thanks for all the info so far. I've been trying to redo my site from scratch with FrontPage 2003 and CSS. OMG, what a struggle at first. I'll spare you the war stories and late night frustrations, but I finally got my page to validate. The other pages should just be a matter of copying and pasting text (I hope). I guess what doesn't kill you makes you stronger. Whew!
I do have a question about the DOCTYPE declaration though. Does the "EN" at the end specify english? Do I need to worry about that if my site is in Chinese?
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"...
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 ... etc.
No idea, what it means though...
As for zh-tw, zh-hk, etc., my understanding is that the suffix was supposed to trigger "cultural" behaviour, such as date formats, calendars, etc. Zh-hans and zh-hant seem like an afterthought by a standards organisation that initially didn't know enough about Chinese to realize that there were two scripts.
It looks like W3C is suggesting that people use "zh-hans" for simplified and "zh-hant" for traditional which makes more sense.
Yes, but in the same section they point out this:
Version 6.0 of Internet Explorer does not recognise either of these codes. Mozilla XX and Netscape 7.0 recognise the tags, but treat them both as Simplified Chinese.
If IE6 can't handle these codes then that could be a major issue. That page however is a good source of information. I'm going to have to read through it a bit more closely though.
I do have a question about the DOCTYPE declaration though. Does the "EN" at the end specify english?
This thread might help: Language in doctype [webmasterworld.com]
mbauser2
That abbreviation doesn't refer to the content of the document. It refers to the language used to develop the markup language -- HTML tag names are taken from English, so the language has to be //EN. Always. No exceptions.