Google Page Analysis encoding incorrect

Is this something I should worry about?

         

HarryM

11:55 am on Oct 7, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I get odd results for some of my sites in Google Webmaster Tools, Statistics, Page Analysis. Sites that are in non-English languages are correctly seen as 100% text/html, and 100% in the correct encoding. For example:

French site encoded in ISO-8859-1 is seen as 100% text/html, ISO-8859-1 (Latin-1).
Chinese site encoded in GB2312 is seen as 100% text/html, GB (Simplified Chinese).
Chinese site encoded in Big5 is seen as 100% text/html, Big5 (Chinese).

But the English language sites are listed as either 100% US-ASCII, or mainly US-ASCII with only a few pages in ISO-8859-1.

I am at a loss to understand this or what to do about it. Does this have any implications for the way Google indexes these sites?

All sites use the same doctype format.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Language" content="en" />
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

For non-English language sites I change the encoding, etc., in the usual way. For example my French site has:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
<head>
<meta http-equiv="Content-Language" content="fr" />
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

Any suggestions welcome.

HarryM

1:03 pm on Oct 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Anybody any comments?

Harry

g1smd

1:19 pm on Oct 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Run an affected page through the W3C HTML validator just to see if that does pick up the correct encodings.

Run a few pages through WebBug to see if the server sends its own encoding declaration in the HTTP header, duplicating the one in the meta tag.

That might be the conflict.
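
The check suggested above can also be done without installing anything, once you have the raw `Content-Type` header value and the page source in hand. The following is only a minimal sketch: the helper names (`charset_from_header`, `charset_from_meta`, `charsets_conflict`) are made up for illustration, and the regular expressions assume simple, well-formed tags like the ones quoted in this thread.

```python
import re
from typing import Optional


def charset_from_header(content_type: str) -> Optional[str]:
    """Return the charset parameter found in a Content-Type value, if any."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)"?', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(html: str) -> Optional[str]:
    """Return the charset declared in the first http-equiv Content-Type meta tag."""
    for tag in re.findall(r'<meta\b[^>]*>', html, re.IGNORECASE):
        if re.search(r'http-equiv\s*=\s*["\']?content-type', tag, re.IGNORECASE):
            return charset_from_header(tag)
    return None


def charsets_conflict(header_value: str, html: str) -> bool:
    """True only when both sources declare a charset and the two differ."""
    header_cs = charset_from_header(header_value)
    meta_cs = charset_from_meta(html)
    return header_cs is not None and meta_cs is not None and header_cs != meta_cs


# Sample values: the meta tag is the one quoted in the first post,
# the header value is a hypothetical server response.
page = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />'
print(charsets_conflict("text/html; charset=US-ASCII", page))
print(charsets_conflict("text/html", page))
```

If the server sends no charset at all in the header, the sketch reports no conflict, since only the meta tag is then making a claim.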

HarryM

2:11 pm on Oct 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have checked a few pages with W3C from sites that Google sees as ASCII, and W3C sees them correctly as ISO-8859-1.

I haven't tried WebBug yet. Downloaded the zip, but I am very, very cautious about installing software so haven't installed it yet.

trinorthlighting

2:22 pm on Oct 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have a similar issue: my pages are ISO-8859, yet in Google Sitemaps they show up as US-ASCII.

Might be a Google bug. Are your pages indexed? Ours are....

HarryM

2:33 pm on Oct 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, all the pages are indexed. One of the sites has been up since 2002, but I have only recently added my sites to my Google account, which is when I noticed it.

HarryM

11:33 am on Oct 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have checked Google caches. In the source Google pre-pends an additional "Content-Type" meta tag as the first line of the Google stuff at the top of the page. Below that where my page starts is my doctype and meta tags.

In the caches of my non-English pages both meta tags are identical. I.e., in cached French pages both meta tags are "charset=ISO-8859-1", and in cached Chinese pages both are "charset=Big5" or "charset=GB2312".

But in the English pages the two meta tags are different.

The meta tag pre-pended by Google is: <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">.

Below the Google stuff is my doc type and meta tag: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />.

This has left me more confused. I can understand why Google would see the encoding of Chinese pages correctly even without my meta tag, because the text is actually created in that encoding. But the French pages are created with Notepad on the same PC and keyboard as the English pages. The only difference is that I occasionally toggle between FR and EN on the MS language interface to produce accented characters.

I am not enough of an expert on encoding to understand why this should be, or more importantly, whether I should worry about it. :(
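
One plausible reading of these observations, offered as an assumption and not as a confirmed description of how Google detects encodings: US-ASCII is a strict subset of ISO-8859-1, so a page whose bytes are all below 0x80 is valid in both, and a detector that looks at the bytes could reasonably report the narrower label. An English page typed in Notepad would normally contain no such bytes, while a French page with accented letters would. The sketch below illustrates the difference with a hypothetical helper, `is_pure_ascii`, and two made-up sample strings.

```python
def is_pure_ascii(raw: bytes) -> bool:
    """True when every byte is below 0x80, i.e. the data is valid US-ASCII."""
    return all(byte < 0x80 for byte in raw)


# Made-up sample pages, both saved as ISO-8859-1 as declared in the meta tag.
english = "<p>Welcome to my site about walking holidays.</p>".encode("iso-8859-1")
# \u00e9 is e-acute and \u00ee is i-circumflex; escapes keep this file ASCII-safe.
french = "<p>Un caf\u00e9, s'il vous pla\u00eet.</p>".encode("iso-8859-1")

print(is_pure_ascii(english))  # True: no byte needs more than 7 bits
print(is_pure_ascii(french))   # False: the accented letters are bytes >= 0x80
```

Under that assumption, the US-ASCII label would describe the bytes actually found on the page and would not contradict the ISO-8859-1 declaration.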

g1smd

12:07 pm on Oct 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I would report that as a possible bug to Google.

See if you can grab the attention of either Vanessa Fox or Adam Lasnik on that one. Getting the encoding wrong like that, where both encodings are very similar, isn't a major problem, but it would be nice if it were 100% right.

HarryM

12:59 pm on Oct 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think I may have stumbled on the answer. In the footer of all my English-language sites I was using &copy;, which probably generated a non-8859 character, possibly forcing Google to see the page as ASCII. None of my non-English sites use &copy;.

I have now deleted it from all, so will have to wait for the next crawl to see whether this was the problem.

[added]
trinorthlighting,
You mentioned you have a similar problem. I notice you also use &copy; in your site's footer. May be a coincidence, of course. :)
[/added]
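
The &copy; hypothesis above can be tested directly. The sketch below uses only Python's standard `html` module and checks two things: what bytes the entity reference itself occupies in the page source, and whether the character it stands for exists in ISO-8859-1. The variable names are illustrative only.

```python
import html

entity = "&copy;"

# The entity reference as typed in the page source consists of ASCII characters.
entity_bytes = entity.encode("ascii")

# The character it stands for is U+00A9 (copyright sign).
symbol = html.unescape(entity)

# ISO-8859-1 does contain that character, encoded as the single byte 0xA9.
symbol_bytes = symbol.encode("iso-8859-1")

print(entity_bytes)
print(hex(ord(symbol)), symbol_bytes)
```

Both encode calls succeed, so the entity is plain ASCII in the source and the symbol it produces is a regular ISO-8859-1 character. That suggests removing &copy; is unlikely to change the reported encoding on its own, which would match the experience reported in the reply below.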

[edited by: HarryM at 1:04 pm (utc) on Oct. 11, 2006]

trinorthlighting

1:04 pm on Oct 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That is what I thought as well, but it never changed for our site. I would not really worry about it too much if your pages are indexed and rank well.