Untitled
xx<! DOCTYPE html PUBLIC " - / / W 3 C / / DTD XHTML 1 . 0 Strict / / EN
" " http : / / www . w 3 . org / TR / xhtml 1 / DTD / xhtml 1 - strict
..rather than the real content of the file, which has the same structure as my other, non-Unicode files. Those are indexed correctly, with titles, descriptions et al.
If I open a Unicode document in an editor that doesn't support Unicode, this is what I see, which makes me wonder if Google is reading it as an ASCII file. How does Google determine the nature of the file, i.e. Unicode or non-Unicode?
Browsers appear to have no problems reading these files. Naturally I'm uploading these pages in binary, as uploading in ASCII loses the Unicode.
Any ideas?
Thanks & Regards
davidpbrown
If you have the problem explained in that thread, you're not alone. A search for "ÿþ HTML" [google.com] shows 39,000 pages with the same trouble.
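For anyone wondering where the "ÿþ" comes from, here is a minimal Python sketch. It assumes the page was saved as little-endian UTF-16 with a byte-order mark, and that the reader wrongly treats the bytes as ISO-8859-1; the sample string is made up for illustration and this is not a statement about how Google actually parses pages.

```python
import codecs

# Hypothetical sample markup, standing in for the start of a real page.
text = "<!DOCTYPE html>"

# UTF-16 little-endian with an explicit byte-order mark (bytes FF FE).
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# A reader that assumes a single-byte encoding sees the BOM as two
# characters and every second byte as NUL.
misread = raw.decode("iso-8859-1")

# Showing the NUL bytes as spaces imitates the spaced-out snippet.
spaced = misread.replace("\x00", " ")
print(spaced)

# Decoding with the right codec consumes the BOM and recovers the text.
recovered = raw.decode("utf-16")
print(recovered)
```

The two leading characters of the misread string are the BOM bytes interpreted as ISO-8859-1, which would explain why that search turns up so many affected pages.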
You suggested
"I still don't understand why some pages have this problem, while other pages on the same server don't have this problem."
It may be too obvious to say, but files uploaded in ASCII don't have this problem; it's only the Unicode ones uploaded in binary.
I'm a little out of my depth with this. I'd understood Unicode to be UTF-8.
Why the talk of UTF-16? What's the difference?
Should Unicode files for the web all be UTF-16?
And if so, why is the encoding ID for Unicode "UTF-8"?
If there are two flavours, why doesn't Google handle both, rather than UTF-8 and not UTF-16?
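On the UTF-8 versus UTF-16 question: both are encodings of the same Unicode character set and differ only in how each character is written out as bytes. A small Python sketch, using a sample string of my own choosing, shows the difference:

```python
# Hypothetical sample: three Japanese characters, a space, three ASCII letters.
sample = "日本語 abc"

utf8_bytes = sample.encode("utf-8")       # 1 byte per ASCII char, 3 per kanji
utf16_bytes = sample.encode("utf-16-le")  # 2 bytes per character here, no BOM

print(len(sample), len(utf8_bytes), len(utf16_bytes))

# ASCII text survives unchanged in UTF-8 but gains a NUL byte per
# character in UTF-16, which is what confuses ASCII-only readers.
ascii_in_utf8 = "abc".encode("utf-8")
ascii_in_utf16 = "abc".encode("utf-16-le")
print(ascii_in_utf8, ascii_in_utf16)
```

Either encoding round-trips the full text, so the choice mostly matters for which one the software on the other end expects.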
The issue is letting Google know which language you use (see the first couple of lines of the source).
Let me copy it here.
****************************
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=big5">
<meta http-equiv="Content-Language" content="zh-tw">
<title>MY HOMEPAGE</title>
</head>
****************************
OK, I hope this helps. My page is in Big5 (Traditional Chinese) and the language/location is zh-tw (Taiwan).
If you use the Server Header Check [webmasterworld.com], enter the URL of a page which has the wrong text in Google's index, and press the enter-key, you will see something like this:
HTTP/1.1 200 OK
Date: Sat, 09 Aug 2003 14:54:15 GMT
Server: Apache/1.3.20 Sun Cobalt (Unix) mod_ssl/2.8.4 OpenSSL/0.9.6b PHP/4.1.2 mod_auth_pam_external/0.1 FrontPage/4.0.4.3 mod_perl/1.25
Last-Modified: Sun, 22 Jun 2003 09:18:08 GMT
ETag: "373771-1120-3ef57450"
Accept-Ranges: bytes
Content-Length: 4384
Connection: close
Content-Type: text/html
The last line has no 'charset', so it defaults to 'ISO-8859-1' (cf. The HTTP charset parameter [w3.org]), whereas it should be:
Content-Type: text/html; charset=utf-8
Since your site is on a server which is running an Apache version later than 1.3.10 this can be solved with the AddCharset directive [httpd.apache.org].
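To illustrate the defaulting rule, here is a rough Python sketch that reads the charset parameter out of a Content-Type header value and falls back to ISO-8859-1 when none is given. The function name charset_from_content_type is made up for this example, and it uses the standard library's email.message parser rather than anything specific to a crawler.

```python
from email.message import Message


def charset_from_content_type(value, default="ISO-8859-1"):
    """Return the charset parameter of a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


# The header the server currently sends, and the one it should send.
current = charset_from_content_type("text/html")
wanted = charset_from_content_type("text/html; charset=utf-8")
print(current, wanted)
```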
OTOH, you do declare a content type in the HTML head:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="ja" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="22 June 2003" />
<meta http-equiv="pics-label" content='(pics-1.1 "http://www.icra.org/ratingsv02.html" l r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l r (n 0 s 0 v 0 l 0) "http://www.classify.org/safesurf/" l r (SS~~000 2))' />
<meta http-equiv="Content-Type"
content="text/html; charset=utf-8" />
<meta http-equiv="Content-Language" content="ja" />
<meta http-equiv="Content-Style-Type" content="text/css" />
[edited by: takagi at 3:43 pm (utc) on Aug. 9, 2003]
"So now I'm confused, is it the newline after the "Content-Type"?"
I don't understand your confusion here; I think you've solved the problem. Well done, you.
Also make sure your UTF declaration matches your content, so don't upload a UTF-16 page with a UTF-8 tag.
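A crude way to check for that kind of mismatch before uploading is sketched below in Python. The helper names (bom_encoding, declared_charset, declaration_matches) are invented for this example, the check only looks at the byte-order mark and the first charset= it finds, and it ignores UTF-32, so treat it as a starting point rather than a validator.

```python
import codecs
import re

# Byte-order marks mapped to the encoding family they indicate.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def bom_encoding(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def declared_charset(data):
    """Return the first charset=... value found in the raw bytes, lowercased."""
    # Dropping NUL bytes lets the ASCII pattern match inside UTF-16 data too.
    match = CHARSET_RE.search(data.replace(b"\x00", b""))
    return match.group(1).decode("ascii").lower() if match else None


def declaration_matches(data):
    """True/False when BOM and declaration can be compared, None otherwise."""
    bom = bom_encoding(data)
    declared = declared_charset(data)
    if bom is None or declared is None:
        return None
    return bom == declared


# Hypothetical page fragment declaring UTF-8.
page = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
mismatched = codecs.BOM_UTF16_LE + page.encode("utf-16-le")
consistent = codecs.BOM_UTF8 + page.encode("utf-8")
print(declaration_matches(mismatched), declaration_matches(consistent))
```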
SN
I'm now using Worldpad - one of many suggested by Alan Wood's resources - [alanwood.net...]
Which of all these suggestions will fix my initial problem I don't know, but I'll report back once it's clear how Google then sees my files.
I've not tried uploading UTF-16 files which declare themselves as UTF-16, but I would expect that Google correctly understands all UTF encodings and that this is more a case of user error.
That there are different flavours of UTF encoding, all of which can encode the full Unicode set, was not obvious to me, and having used Wordpad, I wasn't aware I'd saved in anything but Unicode=UTF.
It's unfortunate, I think, that software often doesn't say which Unicode/UTF encoding it's using.
Anyhow, problem solved. Thank you, killroy.
davidpbrown
-------------
User Error: Replace user and press any key to continue.