They show OK in a browser.
I did read the archive on this issue,
tried everything: UTF-8, Unicode, etc.
Charsets and language tags are added.
Related post is here [webmasterworld.com]
I used Notepad.
All help much appreciated.
i used notepad
This is what caused the problem: never use Notepad for anything relating to Unicode/UTF-8. Notepad adds a BOM (Byte Order Mark) to UTF-8 content even though it is quite unnecessary. The presence of a BOM can seriously hinder indexing.
To get rid of it you can use a hex editor to remove the offending bytes, but it is difficult unless you know what you're doing. A better bet may be to copy/paste your source code out of a parsed page. Use a web-friendly text editor such as Edit Plus, TextPad, HomeSite... to edit the pages. Once done, the earlier thread you mentioned gives some very good advice - declare the charset with an HTTP header, add a meta charset tag just before your title element and declare the content language on the <html> tag.
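If a hex editor feels too risky, the same cleanup can be scripted. The following is only a minimal sketch in Python: the names `strip_bom` and `strip_bom_from_file` are made up for illustration, and it assumes the BOM is a stray prefix on content that is really in another encoding (as in this thread). Do not run it on a file that genuinely is UTF-16.

```python
import codecs

# BOM byte sequences discussed in this thread (UTF-8 and both UTF-16 orders).
BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

def strip_bom(data: bytes) -> bytes:
    """Return data without a leading BOM, if one is present."""
    for bom in BOMS:
        if data.startswith(bom):
            return data[len(bom):]
    return data

def strip_bom_from_file(path: str) -> bool:
    """Rewrite the file at path without its BOM. Returns True if a BOM was removed."""
    with open(path, "rb") as f:
        original = f.read()
    cleaned = strip_bom(original)
    if cleaned == original:
        return False  # nothing to do, leave the file untouched
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

Work on a copy of the page first; the function overwrites the file in place.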
lots of thanks again
Geez....something else I had no idea about. Depresses me how much I DON'T know. Gets me depressed sometimes.
of course i added the charset and language tag
This comment suggests to me that you are setting the character encoding in meta http-equiv tags. Is this correct?
Have you set the Content-Type HTTP header correctly as well?
Also, is there any particular reason why you are using UTF-8? Could ISO-8859-1 be used in your case?
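To compare the two declarations, something like the sketch below may help. It only parses header strings you paste in; it does not fetch anything. The function names are hypothetical, and the parsing relies on Python's standard `email.message.Message`, which understands the `type/subtype; charset=...` syntax.

```python
from email.message import Message

def charset_from_content_type(header_value: str):
    """Return the charset parameter of a Content-Type value (lowercased), or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

def charsets_agree(http_header: str, meta_content: str) -> bool:
    """True only if both values declare a charset and the two match."""
    http_cs = charset_from_content_type(http_header)
    meta_cs = charset_from_content_type(meta_content)
    return http_cs is not None and http_cs == meta_cs

# Example values (made up): the server says UTF-8, the meta tag says Big5.
print(charsets_agree("text/html; charset=UTF-8", "text/html; charset=big5"))
```

If the two disagree, or the HTTP header has no charset at all, that is worth fixing before worrying about anything else.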
Have you got a URL with more information about this issue?
Try:
[webmasterworld.com...]
[webmasterworld.com...]
For the original question, it may be a problem if you have the BOM but the content is actually encoded in something like GB or Big5 (I see we are talking about content in Chinese so ISO-8859-1 is not going to be appropriate). If Homesite won't open the file, then you may be forced to use a hex editor to remove the initial characters first.
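To illustrate why a BOM in front of Big5 content is a problem, here is a small Python sketch. It assumes a consumer that trusts the BOM over any declared charset; that behaviour is an assumption for the demo, not a statement about how any particular search engine works. The sample text and the function name `decode_like_bom_sniffer` are made up.

```python
import codecs

text = "中文"                      # sample Chinese text (hypothetical page content)
big5_bytes = text.encode("big5")   # the bytes the author meant to serve

# A stray UTF-16LE BOM in front of the Big5 bytes, as a careless save might leave it.
served = codecs.BOM_UTF16_LE + big5_bytes

def decode_like_bom_sniffer(data: bytes) -> str:
    """Decode as a BOM-trusting consumer might: the BOM wins over the declared Big5."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le", errors="replace")
    return data.decode("big5", errors="replace")

intact = decode_like_bom_sniffer(big5_bytes)   # no BOM: decoded as Big5
garbled = decode_like_bom_sniffer(served)      # BOM present: Big5 bytes misread as UTF-16

print(intact == text, garbled == text)
```

The bytes are identical apart from the two-byte prefix, yet the second result no longer matches the original text.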
If Microsoft made it then it probably breaks web standards.
Notepad might break web standards (although a BOM is possible in UTF-8, just not required), but to be fair Notepad simply is not a web editor - it is designed for simple text file manipulation within the context of the OS. If you are using an English version of Windows, it will save the file contents in the standard Windows encoding (windows-1252). There is an option to Save As Unicode, but the Unicode produced (UTF-16 with a BOM) is not appropriate for web use.
Notepad is not broken per se, just that it is the wrong tool for the job.
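For anyone curious what those save options mean at the byte level, here is a rough Python sketch. The mapping of Notepad's menu labels to encodings in the comments is my understanding of the Notepad of this era and should be treated as an assumption; the sample word is arbitrary.

```python
import codecs

text = "café"  # arbitrary sample with one non-ASCII character

# Assumed mapping of Notepad's "Save As" choices to encodings:
ansi = text.encode("cp1252")                               # "ANSI" on English Windows
utf8_bom = codecs.BOM_UTF8 + text.encode("utf-8")          # "UTF-8" (with BOM)
utf16_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le") # "Unicode" (UTF-16LE with BOM)

for label, data in (("ANSI", ansi), ("UTF-8", utf8_bom), ("Unicode", utf16_bom)):
    # Print each byte as two hex digits, the way a hex editor would show it.
    print(f"{label:8}", " ".join(f"{b:02X}" for b in data))
```

The same four characters come out as three different byte sequences, and two of them start with bytes that are not part of the text at all.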
html lang = " ko " > < head > < META http - equiv = " content ...
... charset = ks _ c _ 5 6 0 1 - 1 9 8 7 " > < title > Korean [keyword] , 鶿剼
馨剼 < 8 1 5 柒 衙棠\ 檔陊 (鸛暖 x t怹 ...
www.example.com/korean.html - 7k - 25 Jan 2006 - Cached - Similar pages
===================================================
It's very hard to get it fixed
ppppppppffffffftttt
[edited by: tedster at 4:22 pm (utc) on Jan. 31, 2006]
[edit reason] use example.com, no specifics [/edit]
MrMister
I will try that,
can't get rid of the BOM though
Currently, on your Chinese pages, you're setting your meta http-equiv content-type to Big5. However, your page is actually encoded in UTF-16LE.
In fact the page is probably encoded in Big5 but the UTF-16 BOM takes precedence. Like I said, a hex editor is the only sure way - a quick Google search gives several free Windows hex editors. The first characters will look something like FF FE (the UTF-16LE BOM; FE FF would be the big-endian form).
It's very hard to get it fixed
Yes, BOMs are very difficult to handle because they represent a zero-width no-break space (U+FEFF) - hence they are not readily visible in a standard text editor.
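Since the BOM cannot be seen, it has to be looked for in the bytes. A minimal Python sketch of that check follows; `detect_bom` is a made-up helper name and the sample bytes are invented for the demo.

```python
import codecs

def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is no BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None

sample = codecs.BOM_UTF8 + b"<html>"   # invented sample: a UTF-8 page saved with a BOM

# Decoding as plain UTF-8 keeps the BOM as an invisible first character (U+FEFF).
decoded = sample.decode("utf-8")
print(detect_bom(sample), len(decoded), repr(decoded[0]))

# The "utf-8-sig" codec drops a leading BOM while decoding.
print(repr(sample.decode("utf-8-sig")))
```

The decoded string is one character longer than what an editor appears to show, which is why the problem is so easy to miss.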
I will go and use a hex editor, because I can't get it fixed with TextPad and EditPlus.
BTW, I've got one file up which does not show the BOM and does not have the FF FE in hex, but it still does not look good:
=========================================
html lang = " ko " > < head > < META http - equiv = " content ...
... charset = ks _ c _ 5 6 0 1 - 1 9 8 7 " > < title >
PSPad works a lot better than many other things that I have tried, but I am still having problems trying to cut and paste any Czech (some letters are replaced with full stops) or Greek (text is completely wrecked) texts, even when their source is known to be UTF-8 already.
ATM I am trying to save my own page from the net
(including a folder #*$!xfolder/img #*$! etc)
it does not show strange codes in hexeditor this way
my son found out:)
waiting for my url to refresh in google
thumbs crossed:)
thanks all for kind help!
(been busy with this for ages...pfffffffffft)
I think I have a similar problem, i.e. Unicode content shown wrongly in Google's SERPs, but in my case it was on the RSS feed ONLY. I have a website with Chinese characters encoded in UTF-8, and it was fine in the SERPs. However, the RSS feeds which were indexed and cached by Google were showing unrecognisable characters in the SERPs. In the SERP results, I can see the feeds were labeled with an additional, second line, i.e. "File Format: Unrecognized - View as HTML". Any ideas?
I followed my feed's link on my site and it appears as below when browsed.
<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.0" -->
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<title>Comments for XXXX</title>
<link>http://www.mydomain.com</link>
<description>XXXX</description>
<pubDate>Mon, 04 Feb 2006 19:11:00 +0000</pubDate>
<generator>http://wordpress.org/?v=2.0</generator>
</channel>
</rss>
where XXXX is good Chinese characters. Why can I see good characters in the browser, but Google can't?