Checking on our listings, I found that one that has been very consistent over 2 years has dropped a few spots and now shows the following as Google's description of the site:
File Format: Unrecognized - View as HTML
The title is getting pulled from the DMOZ listing, and "View as HTML" takes you to Google's cache.
The site has had no down time at all that I'm aware of.
Has anyone seen this before?
thanks
It is not unusual for Google to show the title, description & category from the Google Directory (basically the ODP with some delay) for pages it cannot spider or parse.
Here's how you can debug it yourself from Unix/Linux--you basically imitate a web browser or spider. Here's an example of fetching a page by hand from Google:
telnet www.google.com 80
Trying 216.239.39.99...
Connected to www.google.com.
Escape character is '^]'.
GET / HTTP/1.1
Host: www.google.com
(hit return once or twice until you get a response, which will look like the text below:)
HTTP/1.1 200 OK
Date: Wed, 30 Jul 2003 16:38:18 GMT
...
Cache-control: private
Content-Type: text/html <--- this line says what type of file it is.
Server: GWS/2.1
Content-length: 2691
Now if your page is www.foo.com/user1/test.html, you would type
telnet www.foo.com 80
and then do
GET /user1/test.html HTTP/1.1
Host: www.foo.com
and see what the webserver returns. This is all that a crawler does, except it also looks for links and follows them several billion times. ;)
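If you'd rather script the manual fetch than type it into telnet, here is a minimal Python sketch, assuming plain HTTP on port 80. The function names (build_request, fetch_raw, split_response) are made up for illustration, and the "Connection: close" header is an addition that was not in the telnet session; it just makes the server hang up after one response so the read loop knows when to stop.

```python
import socket


def build_request(host, path="/"):
    """Build the same request you would type into telnet, as bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        # Assumption: ask the server to close the connection after one
        # response, so fetch_raw() can simply read until end of stream.
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_raw(host, path="/", port=80, timeout=10.0):
    """Send the request over a plain TCP socket and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)


def split_response(raw):
    """Split a raw reply into (header text, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    return head.decode("iso-8859-1"), body
```

For the example page above, something like print(split_response(fetch_raw("www.foo.com", "/user1/test.html"))[0]) should print the status line and headers, the same lines telnet would show before the page itself.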
By the way, the "Host:" line is what allows an ISP to support virtual hosting--the bot says which domain it wants to fetch the page from. That's what allows an ISP to host many domains on one IP address. You can also use this technique to verify that an ISP is doing virtual hosting correctly. If you ask for pages from foo.com and get pages from someothercompany.com or yourisp.net, then tell your ISP to fix their virtual hosting. If you find virtual hosting errors, it could be that your ISP made a mistake, or maybe you didn't pay your ISP bill, so they've started serving their own content instead of yours. :)
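The virtual hosting check can be sketched the same way: connect to one IP address, send different "Host:" lines, and compare what comes back. This is only a rough sketch under my own assumptions. The helper names (request_bytes, fetch_from_ip, fingerprint, same_content) are hypothetical, and comparing a hash of the body is a crude test that can give false alarms on dynamic pages that change on every request.

```python
import hashlib
import socket


def request_bytes(host_header, path="/"):
    """Build a GET request whose Host: line names the site we want."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"  # assumption: lets us read until the server hangs up
        "\r\n"
    ).encode("ascii")


def fetch_from_ip(ip, host_header, path="/", port=80, timeout=10.0):
    """Connect to one IP address but ask for a specific virtual host."""
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        sock.sendall(request_bytes(host_header, path))
        data = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data


def fingerprint(raw):
    """Return (status line, SHA-256 hex digest of the body) for a raw reply."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0].decode("iso-8859-1")
    return status_line, hashlib.sha256(body).hexdigest()


def same_content(raw_a, raw_b):
    """True if two replies share a status line and have identical bodies."""
    return fingerprint(raw_a) == fingerprint(raw_b)
```

To use it, call fetch_from_ip twice against your server's IP address, once with your own domain and once with another domain you know lives on the same box. If same_content says the two replies match, the server may be ignoring the "Host:" line and serving one site for everything.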
So try that out. If the Content-Type: line doesn't say text/html, that's what needs to be fixed. If it does say text/html, then you might want to look into whether the webserver is sending binary data (e.g. an executable, or bad character encodings for non-English pages, etc.). Let us know what you find out, and good question! :)
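Both checks (does Content-Type say text/html, and does the body look like binary data) can be roughed out in Python too. Treat this as a sketch, not a definitive test: the function names are invented for illustration, and the binary heuristic (a NUL byte, or more than 10% control bytes in the first 1024 bytes) is an arbitrary choice of mine. It will not catch bad character encodings.

```python
def parse_headers(header_text):
    """Parse raw header text into (status line, dict keyed by lower-cased name)."""
    lines = header_text.replace("\r\n", "\n").split("\n")
    status_line = lines[0].strip()
    headers = {}
    for line in lines[1:]:
        if ":" not in line:
            continue  # skip blank or malformed lines
        name, _, value = line.partition(":")
        # A repeated header name overwrites the earlier value in this sketch.
        headers[name.strip().lower()] = value.strip()
    return status_line, headers


def is_html_type(headers):
    """True if Content-Type names text/html, ignoring any charset parameter."""
    media_type = headers.get("content-type", "").split(";", 1)[0]
    return media_type.strip().lower() == "text/html"


def looks_binary(body, sample_size=1024, threshold=0.10):
    """Rough heuristic: a NUL byte or many control bytes suggests binary data.

    sample_size and threshold are arbitrary values chosen for this sketch.
    """
    sample = body[:sample_size]
    if not sample:
        return False
    if b"\x00" in sample:
        return True
    # Count control bytes other than tab, line feed, form feed, carriage return.
    control = sum(1 for byte in sample if byte < 32 and byte not in (9, 10, 12, 13))
    return control / len(sample) > threshold
```

Feed parse_headers the header text you got back from the server, then check is_html_type on the result and looks_binary on the first part of the body. If the type is text/html but the body looks binary, that mismatch is worth chasing with whoever runs the webserver.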
P.S. I kinda spilled that out fast; definitely let me know if I made a typo/mistake in the above.
REQUEST: **************\n
GET / HTTP/1.1\r\n
Host: www.domain.com\r\n
Accept: */*\r\n
\r\n
RESPONSE: **************\n
HTTP/1.1 200 OK\r\n
Date: Thu, 31 Jul 2003 15:32:56 GMT\r\n
Content-Length: 17900\r\n
Content-Type: text/html\r\n
Set-Cookie: ASPSESSIONIDCQQTQTSC=KEIDNIHADKMOCMBHLFLIEMKE; path=/\r\n
Cache-control: private\r\n
Server: domain Anti-Track\r\n
\r\n
\r\n
I'm told the "HTTP/1.1 200 OK\r\n" means everything is cool and the "Content-Type: text/html\r\n" means the server is reporting the page as HTML.
We're still getting good traffic and enquiries - just hope it's not permanent.