I note that for the page designed to give a hierarchical view of the site's content, Google's cache of the page begins with ÿþ<html>
This seems to make all the code on the page just text rather than HTML.
Those characters don't appear in the source of the page.
If you run a Google search for ÿþ<html> there are a lot of results "suffering" from the same thing. Run that search and look at the cache of any page to see what I mean.
My question: is this something to do with the software used in putting the page together, or is it a way that Google somehow "tags" pages that it doesn't want to index?
In this case the home page of the site checks whether the user has cookies enabled and, if not, sends the user to an info page about cookies. However, the index link does appear on the cookies page. I suppose this could be seen as spider-sniffing and result in a ban, although the reasons for doing it this way are quite legitimate.
Any ideas about those strange characters though?
Words in the text can be found by searching (the extra spaces are no problem), but so can the HTML code that is on the page.
The pages have no Title in the SERP unless it is a page in Google Directory (in that case the title from the directory is used).
This truly looks like a bug in Google or its bots.
At the end of this table [users.cybercity.dk] you can see that the two characters, ÿ and þ, have the hexadecimal codes FF and FE.
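For what it's worth, this is quick to check yourself. Below is a small Python sketch, assuming the page was saved as little-endian UTF-16 with a byte order mark and is then read as Latin-1 (ISO-8859-1). The sample string is made up for illustration, and I can't confirm this is what Google's cache actually does.

```python
# Hypothetical sample: the start of a page saved as UTF-16 (little-endian) with a BOM.
raw = "\ufeff<html>".encode("utf-16-le")

# The first two bytes on disk are the byte order mark.
print(raw[:2].hex())  # fffe

# Reading the same bytes as Latin-1 turns the BOM into two visible characters:
# U+00FF (y with diaeresis) and U+00FE (thorn), the pair seen in the cache.
misread = raw.decode("latin-1")
print(misread[:2] == "\u00ff\u00fe")  # True

# Every ASCII character is followed by a NUL byte, which may be what shows up
# as the "extra spaces" between letters.
print(repr(misread[2:]))
```

If that assumption holds, the tags would no longer look like tags to a parser, which would fit with the markup being shown as plain text.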
In paragraph 5.2.1 of the W3 document called HTML Document Representation [w3.org] you can read the following about this mark:
Furthermore, to maximize chances of proper interpretation, it is recommended that documents transmitted as UTF-16 always begin with a ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal FFFE, a character guaranteed never to be assigned. Thus, a user-agent receiving a hexadecimal FFFE as the first bytes of a text would know that bytes have to be reversed for the remainder of the text.
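The check described in that paragraph can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular browser or crawler does it. The function names `utf16_byte_order` and `decode_page` are my own, and the Latin-1 fallback is an assumption.

```python
import codecs
from typing import Optional


def utf16_byte_order(data: bytes) -> Optional[str]:
    """Return the codec implied by a leading UTF-16 BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF: big-endian
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE: byte-reversed, little-endian
        return "utf-16-le"
    return None


def decode_page(data: bytes) -> str:
    """Decode using the BOM when present; otherwise fall back to Latin-1 (an assumption)."""
    codec = utf16_byte_order(data)
    if codec is None:
        return data.decode("latin-1")
    return data[2:].decode(codec)  # skip the 2-byte BOM before decoding


# Hypothetical sample: a little-endian UTF-16 page that starts with a BOM.
sample = codecs.BOM_UTF16_LE + "<html>".encode("utf-16-le")
print(utf16_byte_order(sample))  # utf-16-le
print(decode_page(sample))       # <html>
```

A reader that skips this step and treats the bytes as a single-byte encoding would show the BOM as two literal characters at the start of the page.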
I still don't understand why some pages have this problem, while other pages on the same server don't have this problem.