Forum Moderators: Robert Charlton & goodroi

UTF or utf - what does Google think?

     
8:03 pm on Aug 22, 2008 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 19, 2003
posts: 46
votes: 0


Hi,

I just checked my Webmaster Tools and saw that my site is listed as having all its content in US-ASCII... Surprise, surprise.

It's a fully valid XHTML 1.0 Transitional site, at least according to the W3C validator.

But my header has utf-8 in lower case.

And Google itself writes UTF-8 in capitals.

Who is to blame here? And would there be any side effects? Not at the moment, judging by the traffic Google sends...

9:51 pm on Aug 22, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


I saw that my site is listed as having all its content in US-ASCII

What does your server header say for character encoding? Not the meta tag in the HTML source code, but the HTTP header sent from the server. These two should agree, but when they do not, the HTTP header takes priority.
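To make that precedence concrete, here is a minimal Python sketch. The function names (`charset_from_content_type`, `charset_from_meta`, `effective_charset`) are made up for illustration, and the fallback to the meta tag when the header carries no charset is an assumption about typical client behaviour, not a description of what Googlebot or Webmaster Tools actually does.

```python
import re
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, lower-cased, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def charset_from_meta(html):
    """Return the charset declared in the first matching <meta> tag, or None.

    Rough regex; handles both the http-equiv form and <meta charset="...">.
    """
    match = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', html, re.IGNORECASE)
    return match.group(1).lower() if match else None


def effective_charset(content_type_header, html):
    """The HTTP header wins; the meta tag is only consulted if the header is silent."""
    return charset_from_content_type(content_type_header) or charset_from_meta(html)


page = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
print(effective_charset("text/html", page))
print(effective_charset("text/html; charset=windows-1252", page))
```

The first call has no charset in the header, so the meta tag decides; the second shows the header overriding a conflicting meta tag.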

As far as case goes, I'm not aware that it's an issue - I see it both ways in the server header, but Google probably uses upper case as the standard for their reporting.
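For what it's worth, charset names are defined as case-insensitive, so `utf-8` and `UTF-8` name the same encoding. A quick way to convince yourself is to let Python's codec registry normalise the labels. This is only a sketch: `same_charset` is a hypothetical helper, and Python's alias table stands in for whatever lookup a crawler really uses.

```python
import codecs


def same_charset(a, b):
    """True when two charset labels resolve to the same Python codec."""
    return codecs.lookup(a).name == codecs.lookup(b).name


print(same_charset("utf-8", "UTF-8"))
print(same_charset("utf-8", "windows-1252"))
```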

10:14 pm on Aug 22, 2008 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 19, 2003
posts:46
votes: 0


Just Content-Type: text/html in the server header, checked with Live HTTP Headers in Firefox.

I checked a few others, on the same host and on another host, and they give the same: just text/html.

And there Webmaster Tools gives the correct character set, in that case windows-1252 (as declared in the head of the actual page).

I think it's a bug. A real Google bug. Either the bot doesn't understand lower-case utf-8, or Webmaster Tools doesn't report correctly.

Unless lower-case utf-8 is not valid as a charset name.

In that case, the W3C validator has a bug.

Anyway, I changed things to uppercase.

It would be nice to know if other webmasters see the same.

10:24 pm on Aug 22, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Your HTTP response headers should be specifying application/xhtml+xml not text/html, and should also be specifying the encoding as UTF-8. You should be seeing both with Live HTTP Headers.

Get those server response headers in shape, correct the HTML meta-tags if you use them, and then see if this "case error" goes away.

For Apache, see AddType and AddCharset in mod_mime.

Jim

10:34 pm on Aug 22, 2008 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 19, 2003
posts:46
votes: 0


Gee.

I checked a few other sites, even ones with windows-1252, and they are also listed entirely as US-ASCII.

Then there is a phpBB3 forum where ALL of the pages are in UTF-8 (uppercase), and there Webmaster Tools reports about 70% in US-ASCII, 29% in UTF-8 and 1% unknown...

Am I seeing ghosts?

[edited by: Gede at 10:36 pm (utc) on Aug. 22, 2008]

10:40 pm on Aug 22, 2008 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 19, 2003
posts: 46
votes: 0


Oh, the phpBB forum is with another hosting company, and that host does send UTF-8 in the server header...

[edited by: Gede at 10:40 pm (utc) on Aug. 22, 2008]

11:04 pm on Aug 22, 2008 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 19, 2003
posts:46
votes: 0


Oh, I think there is an "application/xhtml+xml" MIME type, but just for the xhtml and xht extensions.

I can't add this for PHP, can I?

12:01 am on Aug 23, 2008 (gmt 0)

Senior Member from CA 

WebmasterWorld Senior Member encyclo is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Aug 31, 2003
posts:9074
votes: 6


application/xhtml+xml

I hate to correct jdMorgan, but neither Internet Explorer nor Googlebot understands this MIME type, so you should absolutely stick to text/html even if you are using XHTML.

You may be right about this being a bug, as using lower-case (utf-8) should be accepted. However, I've not done any analysis so I can't say this with any degree of certainty. What's more, I always use upper-case so I don't have a test case to compare.

Is your content really UTF-8, in that you are using at least some multi-byte characters in your content, or are you using English with nothing outside the US-ASCII range?
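The reason this matters: text that stays within the US-ASCII range is byte-for-byte identical in US-ASCII and UTF-8, so a page like that is valid under either label. A rough Python sketch of the check (the function name `classify_bytes` and the three labels are my own, purely for illustration):

```python
def classify_bytes(data: bytes) -> str:
    """Label raw page bytes as 'us-ascii', 'utf-8', or 'other'.

    'us-ascii' means every byte is below 0x80, so the content is also valid UTF-8.
    'utf-8' means at least one multi-byte sequence is present and all of them decode.
    'other' means the bytes are not valid UTF-8 (for example windows-1252 text).
    """
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


print(classify_bytes(b"Plain English text"))
print(classify_bytes("日本語のページ".encode("utf-8")))
```

You could feed it the raw bytes of a saved page to see which bucket your own content falls into.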

12:46 am on Aug 23, 2008 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 19, 2003
posts:46
votes: 0


Yes I think so. I have pages in Chinese, Japanese, Korean, Russian etc on the same site, and it all works nicely.

1:11 am on Aug 23, 2008 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 19, 2003
posts:46
votes: 0


I found a possible explanation for why the phpBB3 forum is only partly listed as UTF-8 in Webmaster Tools: the host changed its config some time ago because of a move, so older pages that were not served as UTF-8 are still listed as US-ASCII. So it must be that the tools, at least, treat the response header as leading, and not the actual header in the page.

Googlebot has no problems indexing the site, as far as I can see, and the site does well enough.

I have asked the host to make changes so the site is served correctly, and if they don't want to, I'll put it in the header myself with PHP.