Google's Unicode support - uploading Unicode documents


davidpbrown

10:56 am on Aug 8, 2003 (gmt 0)

10+ Year Member



Unicode pages on my site are being misread by Google.
I'm wondering if it's Google or the way I upload them.
I'd thought Google had full Unicode support now, but see my pages indexed as..

Untitled
xx<! DOCTYPE html PUBLIC " - / / W 3 C / / DTD XHTML 1 . 0 Strict / / EN
" " http : / / www . w 3 . org / TR / xhtml 1 / DTD / xhtml 1 - strict

..rather than the true nature of the file which has the same structure as other non-Unicode files which are indexed correctly, titles, description et al.

If I open a Unicode document in an editor that doesn't support Unicode, this is what I see, which makes me wonder whether Google is seeing it as an ASCII file. How does Google determine the nature of the file, i.e. Unicode/non-Unicode?

Browsers appear to have no problems reading these files. Naturally I'm uploading these pages in binary, as uploading in ASCII loses the Unicode.

Any ideas?
Thanks & Regards
davidpbrown

takagi

2:36 pm on Aug 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Looks like a problem with the Byte Order Mark (BOM) as mentioned in the thread: Strange characters begin cache - ÿþ [webmasterworld.com]

If you have the problem explained in that thread, you're not alone. A search for "ÿþ HTML" [google.com] shows 39,000 pages with the same trouble.
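
To see where the "ÿþ" comes from, here is a minimal Python sketch. It assumes the page was saved as little-endian UTF-16 with a BOM and is then read as if it were ISO-8859-1; the sample markup is made up for illustration and is not taken from any real page.

```python
import codecs

# Hypothetical page content, standing in for the real file.
page = "<!DOCTYPE html>"

# Little-endian UTF-16 preceded by a byte order mark (bytes FF FE).
raw = codecs.BOM_UTF16_LE + page.encode("utf-16-le")

# A reader that assumes a single-byte charset turns the BOM into two
# visible characters and sees a NUL byte after every ASCII letter.
misread = raw.decode("iso-8859-1")

print(repr(misread[:6]))
```

The NUL bytes between the letters would be one plausible explanation for the spaced-out text in the indexed snippet.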

davidpbrown

3:17 pm on Aug 9, 2003 (gmt 0)

10+ Year Member



takagi, that is exactly my problem, thanks for identifying it.

You suggested
"I still don't understand why some pages have this problem, while other pages on the same server don't have this problem."

I'm thinking it's too obvious to suggest files uploaded in ASCII don't have this problem, it's only the Unicode ones uploaded in binary.

I'm a little out of my depth with this.. I'd understood Unicode to be UTF-8
Why the talk of UTF-16, what's the difference?

Should Unicode files for the web all be UTF-16?
and if so why is the encoding ID for Unicode ="UTF-8"
if there are two flavours why doesn't Google handle both, or UTF-8 and not UTF-16?

mars9820

3:30 pm on Aug 9, 2003 (gmt 0)

10+ Year Member



davidpbrown... check my homepage in my profile. I have a Chinese version and it is correctly indexed.

The issue is to let Google know which language you use (see the first couple of lines of the source).

let me copy it here.

****************************

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=big5">
<meta http-equiv="Content-Language" content="zh-tw">
<title>MY HOMEPAGE</title>
</head>

****************************

OK... I hope this helps... my page is in Big5 (traditional Chinese)... the language location is zh-tw (Taiwan)

takagi

3:32 pm on Aug 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



After reading your profile, and visiting your site, I think there is another problem: the server header.

If you use the Server Header Check [webmasterworld.com], enter the URL of a page which has the wrong text in Google's index, and press the enter-key, you will see something like this:

HTTP/1.1 200 OK
Date: Sat, 09 Aug 2003 14:54:15 GMT
Server: Apache/1.3.20 Sun Cobalt (Unix) mod_ssl/2.8.4 OpenSSL/0.9.6b PHP/4.1.2 mod_auth_pam_external/0.1 FrontPage/4.0.4.3 mod_perl/1.25
Last-Modified: Sun, 22 Jun 2003 09:18:08 GMT
ETag: "373771-1120-3ef57450"
Accept-Ranges: bytes
Content-Length: 4384
Connection: close
Content-Type: text/html

The last line has no 'charset', so it defaults to 'ISO-8859-1' (cf. The HTTP charset parameter [w3.org]), whereas it should be:

Content-Type: text/html; charset=utf-8

Since your site is on a server which is running an Apache version later than 1.3.10 this can be solved with the AddCharset directive [httpd.apache.org].
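
As a rough illustration of the missing-charset point, this Python sketch reads the charset parameter from a Content-Type value and falls back to ISO-8859-1 when none is given. The function name `charset_from_content_type` is hypothetical, and real clients and crawlers may apply a different fallback.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset named in a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset parameter,
    # or None when the header carries no charset at all.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html"))
print(charset_from_content_type("text/html; charset=utf-8"))
```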

OTOH, you do have a Content-Type declaration in the HTML head:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="ja" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="22 June 2003" />
<meta http-equiv="pics-label" content='(pics-1.1 "http://www.icra.org/ratingsv02.html" l r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l r (n 0 s 0 v 0 l 0) "http://www.classify.org/safesurf/" l r (SS~~000 2))' />
<meta http-equiv="Content-Type"
content="text/html; charset=utf-8" />
<meta http-equiv="Content-Language" content="ja" />
<meta http-equiv="Content-Style-Type" content="text/css" />

So now I'm confused, is it the newline after the "Content-Type"?

[edited by: takagi at 3:43 pm (utc) on Aug. 9, 2003]

mars9820

3:33 pm on Aug 9, 2003 (gmt 0)

10+ Year Member



Sorry I forgot something...I saw in your message (message number 1).

quote :

xx<! DOCTYPE html PUBLIC " - / / W 3 C / / DTD XHTML 1 . 0 Strict / / EN

/quote

EN at the end of the line which indicates ENGLISH. Maybe that is the problem.

davidpbrown

4:10 pm on Aug 9, 2003 (gmt 0)

10+ Year Member



takagi -
What you suggest about the AddCharset directive is very helpful, and I expect it is exactly the solution I'm looking for. I think it's unfortunate that a declaration by the server is necessary, but I understand that Unicode files are different enough that this may be reasonable.

"So now I'm confused, is it the newline after the "Content-Type"?"
I don't understand your confusion here, I'm thinking you've solved the problem.. well done you.

takagi

4:14 pm on Aug 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



At least you can try to tackle this problem that way. Please keep us informed, if this helps or not.

killroy

5:15 pm on Aug 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Have you considered using UTF-8 instead? From the spacing it looks like you're using UTF-16, with 2 bytes for most characters, while the UTF-8 variant of Unicode uses 8 bits per character for plain ASCII and two to four bytes for everything else.

Also make sure your UTF declaration matches your content, so don't upload a UTF-16 page with a UTF-8 tag.

SN
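
For anyone who wants to check the byte counts themselves, a small Python sketch follows. The sample characters are arbitrary picks for illustration, and the UTF-16 figures exclude the BOM.

```python
# Arbitrary sample characters, written as escapes to keep the source ASCII.
samples = {
    "A": "ASCII letter",
    "\u00e9": "accented Latin letter",
    "\u65e5": "CJK ideograph",
    "\U0001F600": "character outside the Basic Multilingual Plane",
}


def encoded_lengths(ch):
    """Return (bytes in UTF-8, bytes in UTF-16 without a BOM) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch, label in samples.items():
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"{label}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```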

davidpbrown

6:43 pm on Aug 9, 2003 (gmt 0)

10+ Year Member



Good spot killroy,
I've always used Wordpad on Windows 98 and this apparently always saves as UTF-16.

I'm now using Worldpad - one of many suggested by Alan Wood's resources - [alanwood.net...]

Which of all these suggestions will fix my initial problem I don't know, but I'll report back later once it's clear how Google then sees my files.

g1smd

10:32 pm on Aug 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I would change the content slightly, perhaps add an embedded date or some comments in the source code, so that you can tell when the page has been spidered and cached again.

davidpbrown

8:41 pm on Sep 1, 2003 (gmt 0)

10+ Year Member



It's clear now that the source of the "ÿþ HTML" a.k.a. Byte Order Mark (BOM) problem, for me, was simply, and only that I had uploaded files in UTF-16. Saving and uploading files in UTF-8 the problem disappears - Google now presents the content correctly. (This without the server header suggesting encoding.)
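
For files that already exist as UTF-16, re-encoding can also be scripted rather than done by re-saving in an editor. This is a minimal Python sketch, assuming the input carries a BOM; the function name `utf16_to_utf8` and the sample markup are hypothetical.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 bytes (with a BOM) as UTF-8 without a BOM."""
    # The "utf-16" codec reads the BOM to pick the byte order, then drops it.
    text = data.decode("utf-16")
    return text.encode("utf-8")


# Stand-in for the bytes of a file saved as UTF-16 little-endian with a BOM.
original = "\ufeff<title>\u65e5\u672c\u8a9e</title>".encode("utf-16-le")
converted = utf16_to_utf8(original)

print(len(original), len(converted))
```

In practice the bytes would be read from and written back to disk in binary mode, so that nothing else re-interprets them along the way.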

I've not tried uploading UTF-16 files which declare themselves as UTF-16, but I would expect that Google does correctly understand all UTF encodings and that it's more a case of user error.

That there are different flavours of UTF encoding, all of which are able to encode the full Unicode set, was not obvious to me, and having used Wordpad, I wasn't aware I'd saved in anything but Unicode=UTF.

That software often doesn't say which Unicode/UTF encoding it's using is, I think, unfortunate.

Anyhow, problem solved.. Thank you killroy

davidpbrown
-------------
User Error: Replace user and press any key to continue.

g1smd

10:23 pm on Sep 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I hope you took the opportunity to add the Content-Type and Content-Language meta tags too?

davidpbrown

7:49 am on Sep 2, 2003 (gmt 0)

10+ Year Member



?g1smd

I don't know which page you were looking at.. Content-Type and Content-Language meta tags were always present as were date changes/updates?

GoogleGuy

5:50 pm on Sep 2, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hey, glad to hear that Google now presents the content correctly.. Just to echo what others have said, we use HTTP headers + meta tags + language ID to guess at charset and character encoding. If you can use meta tags to define the charset and character encoding, that will definitely help search engines figure things out too. Sounds like you're already in good shape though..
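
To make the "HTTP headers + meta tags" order concrete, here is a rough Python sketch of one way a client could combine the two signals. It is an assumption for illustration only, not Google's actual method: the function name `guess_charset`, the 1024-byte scan window, and the ISO-8859-1 fallback are all hypothetical choices, and the language ID step is left out.

```python
import re

# Matches charset=... inside a <meta> tag, with or without quotes,
# including a declaration that wraps onto a second line.
META_CHARSET = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


def guess_charset(header_charset, body, default="iso-8859-1"):
    """Prefer the HTTP header charset, then a meta tag, then a default."""
    # 1. A charset already parsed from the HTTP Content-Type header wins.
    if header_charset:
        return header_charset.lower()
    # 2. Otherwise look for a meta declaration near the top of the page.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Nothing declared: fall back to the assumed default.
    return default


body = b'<head>\n<meta http-equiv="Content-Type"\ncontent="text/html; charset=utf-8" />\n</head>'
print(guess_charset(None, body))
```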