Unicode sites showing wrong in Google


elgoognl

3:01 pm on Jan 29, 2006 (gmt 0)

Some of my sites show up like this in Google:
=========================================
ÿþ< HTML LANG = zh - CN > < HEAD > < META HTTP - EQUIV = " Content ...
Éb/OžŠ 0 ^/O†OžŠ 09N¥žžŠ 0 ^ØžŠ, casino Chinese china < / TITLE > < META NAME = " keywords " CONTENT = " íŒÎW ÿ²}ïíŒÎW ÿÚ} NíŒ"“J2b ÿíŒZS ÿJ2b ÿZSUY ÿ ...
========================================

They show OK in a browser.
I did read the archive on this issue
and tried everything: UTF-8, Unicode, etc.
Charsets and language tags are added.

related post is here [webmasterworld.com]

i used notepad

all help much appreciated

encyclo

10:49 am on Jan 30, 2006 (gmt 0)

i used notepad

This is what caused the problem: never use Notepad for anything relating to Unicode/UTF-8. Notepad adds a BOM (Byte Order Mark) to UTF-8 content even though it is quite unnecessary. The presence of a BOM can seriously hinder indexing.

To get rid of it you can use a hex editor to remove the offending characters, but it is difficult unless you know what you're doing. A better bet may be to copy/paste your source code out of a parsed page. Use a web-friendly text editor such as EditPlus, TextPad, HomeSite... to edit the pages. Once done, the earlier thread you mentioned gives some very good advice - declare the charset with an HTTP header, add a meta charset tag just before your <title> element and declare the content language on the <html> tag.
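For anyone without a hex editor, a short script can do the same job. The sketch below is only an illustration and assumes the file really is UTF-8 with a leading BOM (bytes EF BB BF); the function names strip_utf8_bom and clean_file are made up for the example.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def clean_file(path: str) -> bool:
    """Rewrite the file in place without its BOM; return True if it changed."""
    target = Path(path)
    original = target.read_bytes()
    cleaned = strip_utf8_bom(original)
    if cleaned != original:
        target.write_bytes(cleaned)
        return True
    return False
```

Working on raw bytes rather than decoded text means nothing else in the file is touched, so it is safe to run on a copy first and compare.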

elgoognl

10:06 pm on Jan 30, 2006 (gmt 0)

thank you so much encyclo
of course i added the charset and language tag
HomeSite does not work,
but I'll try to use another editor

lots of thanks again

texasville

11:21 pm on Jan 30, 2006 (gmt 0)

>>>>>This is what caused the problem: never use Notepad for anything relating to Unicode/UTF-8. Notepad adds a BOM (Byte Order Mark) to UTF-8 content even though it is quite unnecessary. The presence of a BOM can seriously hinder indexing. <<<<<

Geez....something else I had no idea about. Depresses me how much I DON'T know. Gets me depressed sometimes.

g1smd

11:24 pm on Jan 30, 2006 (gmt 0)

Obligatory posting: If Microsoft made it then it probably breaks web standards.

mrMister

9:34 am on Jan 31, 2006 (gmt 0)

Hi encyclo, Interesting comment.

Have you got a URL with more information about this issue?

[edited by: mrMister at 9:45 am (utc) on Jan. 31, 2006]

mrMister

9:37 am on Jan 31, 2006 (gmt 0)

of course i added the charset and language tag

This comment suggests to me that you are setting the character encoding in meta-equiv tags. Is this correct?

Have you set the Content-Type HTTP header correctly as well?

Also, is there any particular reason why you are using UTF-8? Could ISO-8859-1 be used in your case?

encyclo

11:33 am on Jan 31, 2006 (gmt 0)

Have you got a URL with more information about this issue?

Try:
[webmasterworld.com...]
[webmasterworld.com...]

For the original question, it may be a problem if you have the BOM but the content is actually encoded in something like GB or Big5 (I see we are talking about content in Chinese so ISO-8859-1 is not going to be appropriate). If Homesite won't open the file, then you may be forced to use a hex editor to remove the initial characters first.

If Microsoft made it then it probably breaks web standards.

Notepad might break web standards (although a BOM is possible in UTF-8, just not required), but to be fair Notepad simply is not a web editor - it is designed for simple text file manipulation within the context of the OS. If you are using an English version of Windows, it will save the file contents in the standard Windows encoding (windows-1252). There is an option to Save As Unicode, but the Unicode produced is not appropriate for web use.

Notepad is not broken per se, just that it is the wrong tool for the job.
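To make the difference between these save options concrete, here is a small sketch that prints the bytes one short string produces under each encoding. It assumes the "Unicode" option writes UTF-16LE with a BOM, which is what later posts in this thread report for WordPad; the sample text is hypothetical.

```python
import codecs

sample = "caf\u00e9"  # hypothetical sample text: "café"

# The same four characters, as three different byte sequences.
encodings = {
    "windows-1252": sample.encode("cp1252"),
    "UTF-8 with BOM": codecs.BOM_UTF8 + sample.encode("utf-8"),
    "UTF-16LE with BOM": codecs.BOM_UTF16_LE + sample.encode("utf-16-le"),
}

for name, raw in encodings.items():
    # bytes.hex(sep) needs Python 3.8 or later
    print(f"{name:18} {raw.hex(' ')}")
```

In the UTF-16LE line every ASCII letter is followed by a 00 byte, which is one plausible reason a crawler that reads the file as a single-byte encoding ends up with gaps between characters.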

elgoognl

3:39 pm on Jan 31, 2006 (gmt 0)

Mr.Mister,
Here you can see how my URLs show in Google
and how the charset and language are defined:
==========================================
ÿþ< html > < head > < META http - equiv = " Content - Type ...
5 9 A : > 5 8 = B 5 @ = 5 B - : 0 7 8 = > , russian [keyword], 3 4 5 K < > 6 5 B 5
... C ; O @ = 5 9 H 5 5 , russian casino , 5 2 @ >? 5 9 A : > 5 , ...
www.example.com/russian.html - 11k - 25 jan 2006 - In cache - Gelijkwaardige pagina's

ÿþ< HTML LANG = " zh - CN " > < HEAD > < META HTTP - EQUIV ... Éb/OžŠ 0 ^/O†OžŠ 09N¥žžŠ 0 ^Ø�žŠ, [keyword] Chinese china < / TITLE > < META ... b > casino < / b > - for - sale . info / < b > chinese < / b > / index . html ...
www.example.com/chinese.html - 14k - 25 jan 2006 - In cache - Gelijkwaardige pagina's

html lang = " ko " > < head > < META http - equiv = " content ...
... charset = ks _ c _ 5 6 0 1 - 1 9 8 7 " > < title > Korean [keyword] , 鶿剼
馨剼 < 8 1 5 柒 衙棠\ 檔陊 (鸛暖 x t怹 ...
www.example.com/korean.html - 7k - 25 jan 2006 - In cache - Gelijkwaardige pagina's

===================================================

its very hard to get it fixed
ppppppppffffffftttt

[edited by: tedster at 4:22 pm (utc) on Jan. 31, 2006]
[edit reason] use example.com, no specifics [/edit]

mrMister

4:14 pm on Jan 31, 2006 (gmt 0)

If at all possible, don't use meta-equiv tags to set your content-type, you should use real HTTP headers to be sure that it will be interpreted correctly.
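How the header gets set depends entirely on the server, which this thread does not name. Purely as an illustration, here is a minimal sketch using Python's built-in http.server; the page content, the handler name and the port are assumptions made up for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page used only for this example.
PAGE = '<html lang="zh-CN"><head><title>\u4e2d\u6587</title></head><body></body></html>'


def content_type(charset: str) -> str:
    """Build the Content-Type header value for an HTML page."""
    return f"text/html; charset={charset}"


class PageHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # The bytes sent must really be in the charset the header declares.
        body = PAGE.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", content_type("utf-8"))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Serve the page on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), PageHandler).serve_forever()
```

Calling serve() and requesting http://127.0.0.1:8000/ should show the charset in the response headers, independent of any meta tag in the page.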

Currently, on your chinese pages, you're setting your meta-equiv content-type to BIG5. However your page is actually encoded in UTF-16LE.
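A mismatch like this can be spotted with a few lines of script. The sketch below compares the encoding implied by a BOM with the charset named in the page's own meta tag; the function names are made up, the regular expression is a rough approximation rather than a real HTML parser, and UTF-32 BOMs are not handled.

```python
import codecs
import re
from typing import Optional

# BOM bytes mapped to the codec that decodes the rest of the file.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def bom_encoding(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def declared_charset(data: bytes) -> Optional[str]:
    """Return the charset named in the page source, lower-cased, if any."""
    # Decode per the BOM when there is one; latin-1 never fails otherwise.
    text = data.decode(bom_encoding(data) or "latin-1", errors="replace")
    match = re.search(r'charset\s*=\s*["\']?([\w-]+)', text, re.IGNORECASE)
    return match.group(1).lower() if match else None
```

If bom_encoding reports a UTF-16 variant while declared_charset reports big5, the file's bytes and its label disagree, which is the situation described above.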

elgoognl

5:09 pm on Jan 31, 2006 (gmt 0)

Tedster, thx for editing....srry...

MrMister

i will try that,
cant get rid of the BOM though

encyclo

5:14 pm on Jan 31, 2006 (gmt 0)

Currently, on your chinese pages, you're setting your meta-equiv content-type to BIG5. However your page is actually encoded in UTF-16LE.

In fact the page is probably encoded in BIG5 but the UTF-16 BOM takes precedence. Like I said, a hex editor is the only sure way - a quick Google search gives several free Windows hex editors. The first characters will look something like

FF FE
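The ÿþ at the start of the Google snippets corresponds to the bytes FF FE, the little-endian UTF-16 BOM (FE FF would be the big-endian form). As a rough alternative to a hex editor, this sketch prints the first bytes of some data in hex; the sample bytes are a hypothetical start of a UTF-16LE file, and for a real file you would pass the result of reading it in binary mode.

```python
def first_bytes_hex(data: bytes, count: int = 4) -> str:
    """Return the first few bytes as upper-case hex pairs."""
    return data[:count].hex(" ").upper()


sample = b"\xff\xfe<\x00H\x00"  # hypothetical start of a UTF-16LE file

print(first_bytes_hex(sample))       # the raw bytes, as a hex editor shows them
print(sample[:2].decode("cp1252"))   # the same two bytes read as windows-1252
```

Reading FF FE as windows-1252 gives exactly the two characters that appear in front of the HTML tag in the search results.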

its very hard to get it fixed

Yes, BOMs are very difficult to handle because the character is a zero-width no-break space (U+FEFF) - hence it is not readily visible in a standard text editor.

elgoognl

6:03 pm on Jan 31, 2006 (gmt 0)

thanks alot encyclo,

I will go and use a hex editor, because I can't get it fixed with TextPad and EditPlus.

By the way, I've got one file up which does not show the ÿþ, nor has the FF FE in hex, but it still does not look good:
=========================================
html lang = " ko " > < head > < META http - equiv = " content ...
... charset = ks _ c _ 5 6 0 1 - 1 9 8 7 " > < title >

g1smd

8:37 pm on Jan 31, 2006 (gmt 0)

I got myself a copy of PSPad to edit UTF-8 files, after I found out that the "Unicode" option in WordPad saves the file as UTF-16LE and that completely breaks the file in both Mozilla and Firefox, and probably breaks Opera and others too (but it works in IE of course!).

PSPad works a lot better than many other things that I have tried, but I am still having problems trying to cut and paste any Czech (some letters are replaced with full stops) or Greek (text is completely wrecked) texts, even when their source is known to be UTF-8 already.

mirrornl

9:37 pm on Feb 1, 2006 (gmt 0)

(ok, admin asked me to change my name
system problems with el****nl, nvm)

˙ţ< HTML LANG

Hmm, the board can't display the character.
It really looks like this on Google:

t< HTML LANG

its not a BOM, but what is it?

g1smd

9:54 pm on Feb 1, 2006 (gmt 0)

Did you save the file in some "WordProcessor" format, rather than Plain ASCII Text?

mirrornl

11:29 pm on Feb 1, 2006 (gmt 0)

I didn't save in ASCII...

At the moment I am trying to save my own page from the net
(including a folder #*$!xfolder/img #*$! etc)

it does not show strange codes in hexeditor this way

my son found out:)

waiting for my url to refresh in google

thumbs crossed:)

thanks all for kind help!
(been busy with this for ages...pfffffffffft)

ALbino

11:56 pm on Feb 1, 2006 (gmt 0)

Try UltraEdit, it works great.

mirrornl

12:24 am on Feb 2, 2006 (gmt 0)

ill wait a day to see how it shows up in G
thanks alot

mirrornl

7:55 pm on Feb 3, 2006 (gmt 0)

well,
it didn't work.
the ÿþ (=bom-sign) is gone,
but it still shows wrong:

< HTML LANG = zh - CN > < HEAD > < META HTTP - EQUIV = " Content ...

mrMister

4:57 am on Feb 4, 2006 (gmt 0)

well, it didn't work. the ÿþ (=bom-sign) is gone

It looks to me like you're still encoding in UTF-16 rather than UTF-8
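If the file really is UTF-16, removing the BOM alone will not help; the whole file has to be re-encoded. A minimal sketch of that conversion follows. It assumes little-endian byte order when no BOM is present, which is a guess based on the FF FE seen earlier in the thread; the function name is made up.

```python
import codecs


def utf16_bytes_to_utf8(raw: bytes) -> bytes:
    """Re-encode UTF-16 bytes as UTF-8 without a BOM."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic codec reads the BOM to pick the byte order and drops it.
        text = raw.decode("utf-16")
    else:
        # Assumption: little-endian, since there is no BOM to say otherwise.
        text = raw.decode("utf-16-le")
    return text.encode("utf-8")
```

To use it on a file, read the file in binary mode, pass the bytes through the function and write the result back in binary mode. The charset named in the meta tag and the HTTP header would then need to say utf-8 as well.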

leunga

7:22 pm on Feb 6, 2006 (gmt 0)

Hello all,

I think I have a similar problem, i.e. Unicode content showing wrongly in Google's SERPs, but in my case it affects the RSS feed ONLY. I have a website with Chinese characters encoded in UTF-8, and it is fine in the SERPs. However, the RSS feeds that were indexed and cached by Google show unrecognisable characters in the SERPs. In the results, I can see the feeds are labelled with an additional second line, i.e. "File Format: Unrecognized - View as HTML". Any ideas?

I followed my feed's link on my site and it appears below when browsed.

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<title>Comments for XXXX</title>
<link>http://www.mydomain.com</link>
<description>XXXX</description>
<pubDate>Mon, 04 Feb 2006 19:11:00 +0000</pubDate>
<generator>http://wordpress.org/?v=2.0</generator>
</channel>
</rss>

where XXXX is good Chinese characters. Why I can see good characters in browser, but Google can't see?

g1smd

9:16 pm on Feb 6, 2006 (gmt 0)

Is the <?xml ...... UTF-8"?> tag format exactly right?
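One way to check this is to look at the raw bytes of the feed rather than at what the browser renders. The sketch below flags anything sitting in front of the XML declaration and then tries to parse the feed with Python's standard library parser; the function name check_feed is made up, and passing the check does not prove Google will read the feed correctly.

```python
import xml.etree.ElementTree as ET


def check_feed(raw: bytes) -> list:
    """Return a list of problems found in the raw feed bytes (empty if none)."""
    problems = []
    if not raw.startswith(b"<?xml"):
        # A BOM, blank line or stray whitespace before the declaration.
        problems.append("bytes found before the XML declaration")
    try:
        ET.fromstring(raw)
    except ET.ParseError as exc:
        problems.append(f"not well-formed XML: {exc}")
    return problems
```

Fetching the feed URL with a command-line tool or script and passing the response body to check_feed shows what a crawler receives, which may differ from the tidy tree view a browser displays.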

mirrornl

9:41 pm on Feb 6, 2006 (gmt 0)

"Why I can see good characters in browser, but Google can't see?"

That's the bottom of the question.

For me,
I still can't convert to UTF-8.

mirrornl

12:51 am on Feb 19, 2006 (gmt 0)

bumperdeebump
srry
cant get it fixed for over 4 months now.......