Pages in multiple Oriental scripts poorly indexed

Google doesn't like my (Unicode) pages with Chinese and Japanese mixed!

         

bathrobe

10:35 am on Dec 30, 2006 (gmt 0)




I've been away for a long time. My problem is this:

I have a non-commercial site that has content in multiple Oriental languages (including Chinese Traditional, Chinese Simplified, and Japanese scripts). Encoding is almost all utf-8.

I've been encountering what appears to be a reluctance by Google to index Chinese and Japanese content when they appear together on the same page.

1) For example, I have a page listing the way a certain famous sentence has been translated into Chinese and Japanese (over 60 versions).

If I input one translated version of the sentence into English-language Google, what comes up in first position is a POPUP from my site, included merely as a handy reference, listing all the Chinese versions (no Japanese). The main page of translations (multi-lingual) doesn't appear at all.

If I input the same sentence into Chinese-language Google, first place goes to a Chinese blog that has copied my content (and includes the Chinese versions only)! My POPUP comes in fifth place. Needless to say, the multi-lingual page doesn't come up at all.

Do a search on Chinese and Japanese versions together, and my site just doesn't register!

Put in the title of the particular book in Chinese and Japanese, and you get a lot of sites but not mine -- whether you search the entire web or confine your search to websites in Chinese or Japanese respectively.

2) Example two. I have what is probably the Internet's most authoritative list of scientific species names in Chinese and Japanese, and yet searching on the species names from both languages together mostly returns "did not match any documents". Needless to say, my site DOES include both names, and the page itself is being indexed by Google. It appears that the individual names on the page just aren't being indexed. (This does not happen with all species names -- there are some families where species are mysteriously listed but many more where they are not.)

This kind of problem afflicts all parts of the site to some extent or other.

I just can't seem to figure out exactly what is wrong. Some factors that might be relevant:

1. The site is hosted in the US. (This might cause Google to downgrade or ignore my results -- but the thing is, they're not consistent!)

2. The site has been going since 2000. However, I only converted to Unicode in 2005 -- prior to that I used a Western encoding, with Chinese characters etc. inserted as graphics. (Could it be that in the dusty corridors of Google's clogged up memories, my site has been semi-permanently assigned to Western encoding?)

3. In my pages I take advantage of Unicode to ensure that text from all these languages is readable by browsers. I don't make much attempt to distinguish between languages. (Does Google dislike pages that have both Chinese and Japanese content? Do they prefer sites that are in the native encodings for those languages? It does seem better if the page is in a Chinese or Japanese encoding rather than Unicode).

4. To address the above, I've tried wrapping every non-English word in some parts of the site in a language-specific tag (e.g., <span xml:lang="ja" lang="ja">). But despite this measure, Google is giving preference to Chinese-only (actually Chinese and English) popups rather than multi-lingual main pages, as in the first example above.
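Items 2 to 4 can at least be checked mechanically. The following is only a rough sketch in Python, not anything Google documents: it parses one page and reports which charset the meta tags declare (to catch a leftover Western declaration from before the Unicode conversion) and which lang / xml:lang values actually appear on which tags. The class name EncodingLangAudit, the function audit and the sample page are all made up for illustration, and the sample sentences are placeholders, not text from my site. It only looks at the HTML itself; the charset sent in the server's HTTP Content-Type header would need to be checked separately.

```python
from html.parser import HTMLParser


class EncodingLangAudit(HTMLParser):
    """Collect charset declarations and lang attributes from one HTML page.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.charsets = []  # every charset declared in <meta> tags, lowercased
        self.langs = []     # (tag, lang value) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            if attrs.get("charset"):
                # HTML5 style: <meta charset="utf-8">
                self.charsets.append(attrs["charset"].lower())
            elif (attrs.get("http-equiv") or "").lower() == "content-type":
                # Older style: <meta http-equiv="Content-Type" content="...; charset=...">
                content = (attrs.get("content") or "").lower()
                if "charset=" in content:
                    value = content.split("charset=", 1)[1].split(";")[0].strip()
                    self.charsets.append(value)
        # Record language tagging, preferring lang over xml:lang when both exist.
        lang = attrs.get("lang") or attrs.get("xml:lang")
        if lang:
            self.langs.append((tag, lang))


def audit(html_text):
    """Return (declared charsets, lang-tagged elements) for one page."""
    parser = EncodingLangAudit()
    parser.feed(html_text)
    parser.close()
    return parser.charsets, parser.langs


# Placeholder page; the sentences are sample text only.
sample = """<html lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<p><span xml:lang="ja" lang="ja">吾輩は猫である</span>
<span xml:lang="zh-Hant" lang="zh-Hant">我是貓</span></p>
</body></html>"""

charsets, langs = audit(sample)
print(charsets)
print(langs)
```

Running this over every page would show any that still declare a Western charset or that carry no language tagging at all. Whether Google pays any attention to either signal is a separate question that this can't answer.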

Why does Google appear to penalise multi-lingual content on Unicode-encoded pages? What more can I do? The species glossary is linked to from Chinese and Japanese Wikipedia, but this still doesn't appear to be enough to make Google take much notice!

As a general comment, my impression of Google is that it has incredibly clogged arteries. With their system of old 'authoritative sites', it's hard for anything new to get a look in. The sites that rank for my main keywords still seem to be much the same as they were back in 2002-2003. (Of course, this lumbering conservatism doesn't seem to stop highly-SEO'd spammy sites from clogging up the results! The only thing that has trouble getting into Google's results is real content!)

Happy New Year!

[edited by: tedster at 2:06 pm (utc) on Dec. 30, 2006]