Cantonese characters and UTF-8

Using Chinese as keywords in an english website.

         

apprentice

6:20 pm on Sep 8, 2006 (gmt 0)

10+ Year Member



I have been struggling with this a bit. Initially I wanted to display Cantonese characters on a website, and as I was already using UTF-8 I thought that wouldn't be an issue. But when I pasted a raw Chinese character into the HTML, the character was lost after the file was saved, as the file's encoding was different. So I used the numeric equivalent instead (i.e. &#blahblah). For example:

- 食物

I found an excellent resource that converts Chinese characters to Unicode numeric equivalents, which works fine when placing Chinese characters in the content of the page. My question is whether I can use the numeric representation of Chinese characters in the keywords meta tag. Would that be picked up properly by spiders, or would it cause confusion? And what about Chinese keywords in general: are they well supported by the major search engines?
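For anyone without a converter to hand, the same conversion can be sketched in a few lines of Python. This is only a minimal illustration, not the resource mentioned above, and the function name `to_numeric_refs` is made up for the example. It relies on Python's built-in `xmlcharrefreplace` error handler, which swaps every character that ASCII cannot represent for a decimal numeric reference.

```python
def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric reference.

    Hypothetical helper for illustration only.
    """
    # 'xmlcharrefreplace' is a standard codec error handler: any character the
    # target encoding (ASCII here) cannot represent is written as &#NNNN;.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_numeric_refs("food: 食物"))  # food: &#39135;&#29289;
```

Plain ASCII text passes through unchanged, so the function can be run over a whole line of mixed English and Chinese.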

Regards.

Edit: I pasted the raw Chinese character directly into this message but it was converted to the numeric equivalent.

encyclo

7:26 pm on Sep 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What editor are you using to create the file? If the file is saved correctly as UTF-8, then you should be able to use the character directly within the page without resorting to entity references. Are you simply declaring the UTF-8 charset via an HTTP header or meta element, or are you using a UTF-8-capable editor and specifying the encoding when you save?

If you do use entity references, they are interchangeable with the real character, so you should be able to use them in all parts of the page including meta elements.
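A rough way to check that interchangeability is to run both forms through an HTML parser and compare what comes out. The sketch below uses Python's standard `html.parser` module. The `MetaKeywords` class and `keywords_of` helper are hypothetical names for this example, and it only shows how one standards-following parser behaves, not what any particular search engine spider does.

```python
import html
from html.parser import HTMLParser


class MetaKeywords(HTMLParser):
    """Collect the content attribute of <meta name="keywords"> elements."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # The parser has already decoded character references in attribute values.
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            self.keywords.append(attrs.get("content"))


def keywords_of(markup: str) -> list:
    """Return the keywords strings found in a fragment of HTML."""
    parser = MetaKeywords()
    parser.feed(markup)
    parser.close()
    return parser.keywords


with_refs = '<meta name="keywords" content="&#39135;&#29289;">'
with_raw = '<meta name="keywords" content="食物">'

print(html.unescape("&#39135;&#29289;") == "食物")       # True
print(keywords_of(with_refs) == keywords_of(with_raw))  # True
```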

Non-Western characters cannot be displayed here on the board, as the defined charset is ISO-8859-1, not UTF-8.

apprentice

8:04 pm on Sep 9, 2006 (gmt 0)

10+ Year Member



Thank you so much for the reply, encyclo.

I am using HTML 4.01 for the page and declare the UTF-8 charset via the meta element. The editor I use is HTML-Kit and I save the page as *.html. You are probably right. I was just looking at a Chinese content page on Wikipedia: it uses UTF-8 but doesn't seem to use entity references. Also, Notepad has an option for saving as UTF-8. I think I was saving as ANSI up until now, so no wonder it wouldn't do the job. Would you suggest saving all pages of the site as UTF-8, for consistency? It's a very small site, so it won't be a pain. I'm not sure about this, but something tells me I should avoid having half the site's pages encoded as ANSI and half as UTF-8.
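To illustrate why the save encoding matters, here is a small Python sketch. It assumes "ANSI" means Windows-1252 (`cp1252` in Python), which is the usual case on Western-language Windows but is only an assumption about this setup. `ansi_to_utf8` is a hypothetical helper name.

```python
def ansi_to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode bytes from a legacy code page to UTF-8.

    Hypothetical helper; cp1252 is an assumed stand-in for "ANSI".
    """
    return data.decode(source_encoding).encode("utf-8")


# A Western code page has no slot for Chinese characters, so saving them
# in that encoding replaces each one with a placeholder.
print("食物".encode("cp1252", errors="replace"))  # b'??'

# Saved as UTF-8, each of these characters takes three bytes and survives.
print("食物".encode("utf-8"))  # b'\xe9\xa3\x9f\xe7\x89\xa9'

# Text that already fits the old code page converts to UTF-8 without loss.
legacy_bytes = "café".encode("cp1252")
print(ansi_to_utf8(legacy_bytes))  # b'caf\xc3\xa9'

# The reverse mismatch also fails: cp1252 bytes read as UTF-8 are invalid,
# which is what happens when a page declares UTF-8 but was saved as ANSI.
try:
    legacy_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```

If the existing pages contain only characters that exist in the old code page, re-encoding them this way should be lossless, so converting the whole site in one pass seems reasonable.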

Regards.