Forum Moderators: open

Message Too Old, No Replies

Strange characters at the start of the cache - ÿþ

seems to prevent further indexing?


HenryUK

9:22 am on Apr 15, 2003 (gmt 0)

10+ Year Member



On a site for which I have been giving some friendly advice, the last update failed to index their pages despite a change in structure.

I note that for the page designed to give a hierarchical view of their content, Google's cache begins with ÿþ<html>

This seems to make all the code on the page display as plain text rather than render as HTML.

Those characters don't appear in the source of the page.

If you run a Google search for ÿþ<html> there are a lot of results "suffering" from the same thing. Run that search and look at the cache of any result to see what I mean.

My question: is this something to do with the software used to put the pages together, or is it a way Google somehow "tags" pages that it doesn't want to index?

In this case the home page of the site checks whether the user has cookies enabled and sends them to an info page about cookies if not. The index link does, however, appear on the cookies page. I suppose this could be seen as spider-sniffing and result in a ban, although the reasons for doing it this way are quite legitimate.

Any ideas about those strange characters though?

takagi

11:52 am on Apr 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think I see what you mean. When I checked Google's cache, I found a space between all the letters. This suggests an encoding problem: unlike ASCII, which uses one byte per character, UTF-16 Unicode needs two bytes per character. I have no idea why these pages are not doing well on Google; they load correctly in my browser. Also strange: in the snippet, the words do not show the extra space.
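The extra-space symptom can be reproduced with a short sketch (Python, purely illustrative - this is not Google's code): UTF-16 stores a zero byte alongside each ASCII letter, so a consumer that reads the bytes as a single-byte encoding sees a blank between the letters, plus ÿþ at the front where the byte order mark was.

```python
import codecs

# UTF-16 little-endian bytes for "<html>", with the byte order mark in front.
raw = codecs.BOM_UTF16_LE + "<html>".encode("utf-16-le")

# Mis-read the same bytes as a single-byte encoding (Latin-1):
misread = raw.decode("latin-1")

print(repr(misread))
# 'ÿþ<\x00h\x00t\x00m\x00l\x00>\x00'
# The BOM bytes FF FE come out as "ÿþ", and every second byte is a NUL,
# which some renderers of the era displayed as a blank between the letters.
```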

takagi

2:43 pm on Apr 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just did a check with the Server header checker [webmasterworld.com] on some of these 36,600 pages and found no pattern: different servers (Apache, Microsoft-IIS), the correct Content-Type (Content-Type: text/html), and no redirects or errors (200 OK). The size mentioned in the SERP matches the Content-Length in the header (where the header contains that field), so it is not double the size you would expect with all those extra spaces.
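For anyone who wants to repeat that check without the header checker tool, here is a minimal sketch (Python; the sample response is hypothetical, just to exercise the function) that pulls the status line, Content-Type, and Content-Length out of a raw HTTP response and compares the declared length to the actual body size:

```python
def check_response(raw: bytes) -> dict:
    """Split a raw HTTP response and report the fields checked above."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    status = lines[0]                       # e.g. "HTTP/1.1 200 OK"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "status": status,
        "content_type": headers.get("content-type"),
        "content_length": headers.get("content-length"),
        "actual_body_size": len(body),
    }

# Hypothetical response for illustration:
sample = (b"HTTP/1.1 200 OK\r\n"
          b"Content-Type: text/html\r\n"
          b"Content-Length: 13\r\n"
          b"\r\n"
          b"<html></html>")

print(check_response(sample))
```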

Words in the text can be found (no problem with the extra spaces), but so can the HTML code that is on the page.

The pages have no title in the SERP unless the page is in the Google Directory (in that case the title from the directory is used).

This truly looks like a bug in Google or its bots.

HenryUK

3:37 pm on Apr 15, 2003 (gmt 0)

10+ Year Member



Google bug? Hey, how about that then GG?

I have another theory that I am going to check out first...

thanks for the intelligent response and research takagi - I think this is an interesting one.

HenryUK

4:00 pm on Apr 15, 2003 (gmt 0)

10+ Year Member



Hmm. My theory no good. Written to Google on this, will post any interesting reply...

HenryUK

2:15 pm on Apr 16, 2003 (gmt 0)

10+ Year Member



Apparently it WAS a Unicode issue.

Don't know whether it is a general Unicode problem with Google indexing or whether the pages were set to Unicode inappropriately.

takagi

2:30 pm on Apr 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What makes you so sure it was a Unicode problem? And if so, why do these sites have the problem while others are OK?

takagi

6:58 am on May 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello Henry, I found some more information about these two characters. It has to do with the ZERO-WIDTH NON-BREAKING SPACE character, also known as the Byte Order Mark (BOM). When sending 16-bit code, the server can send the Most Significant Byte (MSB) first or the Least Significant Byte (LSB) first - big-endian vs little-endian, in other words.

At the end of this table [users.cybercity.dk] you can see that the two characters þ and ÿ have the hexadecimal codes FE and FF, so ÿþ is the byte sequence FF FE.

In paragraph 5.2.1 of the W3 document called HTML Document Representation [w3.org] you can read the following about this mark:

Furthermore, to maximize chances of proper interpretation, it is recommended that documents transmitted as UTF-16 always begin with a ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal FFFE, a character guaranteed never to be assigned. Thus, a user-agent receiving a hexadecimal FFFE as the first bytes of a text would know that bytes have to be reversed for the remainder of the text.
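The rule quoted above boils down to sniffing the first two bytes. A minimal sketch (Python, illustrative only) of what a byte-order-aware consumer would do:

```python
def utf16_byte_order(data: bytes):
    """Decide UTF-16 byte order from the BOM, per the rule quoted above."""
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"   # FE FF: most significant byte first
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"   # FF FE: bytes are reversed (little-endian)
    return None              # no BOM: byte order unknown

# A little-endian page starting with the BOM (U+FEFF):
page = "\ufeff<html>".encode("utf-16-le")
enc = utf16_byte_order(page)
print(enc)                               # utf-16-le
print(page.decode(enc).lstrip("\ufeff")) # <html>
```

A consumer that skips this step and assumes a single-byte encoding is exactly what produces the ÿþ seen at the start of the cached pages.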

I still don't understand why some pages have this problem, while other pages on the same server don't have this problem.

HenryUK

10:19 am on May 1, 2003 (gmt 0)

10+ Year Member



thanks takagi

I have passed this back to the people concerned and I will pass on any further feedback.

Henry