File Format: Unrecognized - View as HTML

in the SERPS?

         

clickthinker

1:14 pm on Jul 29, 2003 (gmt 0)

10+ Year Member



Hi

While checking our listings, I noticed that one that has been very consistent for over 2 years has dropped a few spots and now has the following as Google's description of the site:

File Format: Unrecognized - View as HTML

The title is being pulled from the DMOZ listing, and 'View as HTML' takes you to Google's cache.
The site has had no downtime at all that I'm aware of.

Has anyone seen this before?
thanks

takagi

2:50 pm on Jul 29, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Usually the 'File Format:' string is shown when the file is not HTML or plain ASCII (like a .txt file) but in a format such as MS Word, MS Excel, MS PowerPoint, MS Access, Acrobat PDF, etc. In those cases there is usually also a 'View as HTML' link. So for some reason Google cannot parse your file. What extension does it have (.htm, .php, .shtml, etc.)?

It is not unusual for Google to show the title, description & category from the Google Directory (basically the ODP with some delay) for pages it cannot spider/parse.

takagi

3:17 pm on Jul 29, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One more question: are these pages dynamic (like .asp, .php) or static? Or to put it differently, perhaps a time-out occurred on a dynamic page. If Googlebot hits the server too hard (at too-short intervals), problems can occur. Maybe there is a relation between this 'File Format: Unrecognized' and a time-out.
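One quick way to look for the time-out theory above is to fetch the page yourself and see how long the server takes to answer. A minimal Python sketch (the URL below is a placeholder for your own dynamic page):

```python
import time
import urllib.request

def timed_fetch(url, timeout=10):
    """Fetch a URL and report the HTTP status and how long it took."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()          # pull the whole body, like a spider would
        status = resp.status
    return status, time.monotonic() - start

# status, elapsed = timed_fetch("http://www.example.com/page.asp")
```

A consistently slow or erratic `elapsed`, or an exception on the timeout, would support the idea that Googlebot occasionally gives up on the page.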

clickthinker

5:53 am on Jul 30, 2003 (gmt 0)

10+ Year Member



Hi takagi
Thanks for the reply.

The site is built in .asp - Most of the pages are static but there are dynamic elements.

Is there anything I can do?

takagi

2:39 pm on Jul 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In the SERP for the keywords you Stickymailed me, I could see the 'File Format: Unrecognized - View as HTML'. A search for 'info:www.domain.com' shows the same problem. But when I click on 'More results from www.domain.com', the real title and snippet are shown with a fresh tag ('28 Jul 2003'), and none of the other 68 pages have an unrecognized file type. So it looks like the problem was temporary. The cache looks OK, but maybe it is the cache from the date in the fresh tag. Strange.

MrSpeed

3:23 pm on Jul 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have the same thing happening on a site. The index page is a straight index.html page, nothing fancy about it, and it validates fine.

The site ranks VERY poorly for a term for which I would expect a top-three listing.

I'm not sure how long it has been like this.

GoogleGuy

4:49 pm on Jul 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmm. If I had to take a guess, I'd look for a misconfigured webserver. Just a shot in the dark, but I would guess that the webserver isn't returning text/html as the content type.

Here's how you can debug it yourself from Unix/Linux--you basically imitate a web browser or spider. Here's an example of fetching a page by hand from Google:

telnet www.google.com 80
Trying 216.239.39.99...
Connected to www.google.com.
Escape character is '^]'.
GET / HTTP/1.1
Host: www.google.com
(hit return once or twice until you get a response, which will look like the text below:)

HTTP/1.1 200 OK
Date: Wed, 30 Jul 2003 16:38:18 GMT
...
Cache-control: private
Content-Type: text/html <--- this line says what type of file it is.
Server: GWS/2.1
Content-length: 2691

Now if your page is www.foo.com/user1/test.html, you would type
telnet www.foo.com 80
and then do
GET /user1/test.html HTTP/1.1
Host: www.foo.com

and see what the webserver returns back. This is all that a crawler does, except it also looks for links and follows them several billion times. ;)
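The same check can be scripted instead of typed by hand over telnet. A rough Python sketch that sends the GET (the library adds the Host: header for you, just like the hand-typed request above) and reports the Content-Type the server returns; the host and path are placeholders for your own site:

```python
import http.client

def fetch_content_type(host, path="/", port=80):
    """Request a page the way a spider would and return
    (status code, Content-Type header)."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Content-Type")
    finally:
        conn.close()

# status, ctype = fetch_content_type("www.foo.com", "/user1/test.html")
```

If `ctype` doesn't start with `text/html` for a page that is supposed to be HTML, that's the misconfiguration to chase.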

By the way, the "Host:" line is what allows an ISP to support virtual hosting--the bot says which domain it wants to fetch the page from. That's what allows an ISP to host many domains on one IP address. You can also use this technique to verify that an ISP is doing virtual hosting correctly. If you ask for pages from foo.com and get pages from someothercompany.com or yourisp.net, then tell your ISP to fix their virtual hosting. If you find virtual hosting errors, it could be that your ISP made a mistake, or maybe you didn't pay your ISP bill, so they've started serving their own content instead of yours. :)
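The virtual-hosting check described above can also be sketched in a few lines of Python: ask the same IP address for two different Host: values and compare what comes back. The IP and domain names below are placeholders:

```python
import http.client

def fetch_with_host(ip, host, path="/", port=80):
    """Fetch a page from a specific IP while overriding the Host:
    header, so the server has to pick the virtual host itself."""
    conn = http.client.HTTPConnection(ip, port, timeout=10)
    try:
        # Passing our own Host header suppresses the automatic one.
        conn.request("GET", path, headers={"Host": host})
        resp = conn.getresponse()
        return resp.read()
    finally:
        conn.close()

# page_a = fetch_with_host("192.0.2.1", "www.foo.com")
# page_b = fetch_with_host("192.0.2.1", "someothercompany.com")
```

If every Host value returns the same page, the ISP's virtual hosting is broken in exactly the way described above.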

So try that out. If the Content-Type: line doesn't say text/html, that's what needs to be fixed. If it does say text/html, then you might want to look into whether the webserver is sending binary data (e.g. an executable, or bad character encodings for non-English pages, etc.). Let us know what you find out, and good question! :)

P.S. I kinda spilled that out fast; definitely let me know if I made a typo/mistake in the above.

clickthinker

3:40 pm on Jul 31, 2003 (gmt 0)

10+ Year Member



Hi
Thanks for the info!
I had our techie guys take a look at it.
They got me to use an app that emulates a spider.
Here's what I got:

REQUEST: **************\n
GET / HTTP/1.1\r\n
Host: www.domain.com\r\n
Accept: */*\r\n
\r\n
RESPONSE: **************\n
HTTP/1.1 200 OK\r\n
Date: Thu, 31 Jul 2003 15:32:56 GMT\r\n
Content-Length: 17900\r\n
Content-Type: text/html\r\n
Set-Cookie: ASPSESSIONIDCQQTQTSC=KEIDNIHADKMOCMBHLFLIEMKE; path=/\r\n
Cache-control: private\r\n
Server: domain Anti-Track\r\n
\r\n
\r\n

I'm told the "HTTP/1.1 200 OK\r\n" means everything is cool and the "Content-Type: text/html\r\n" means it's being served as HTML.

We're still getting good traffic and enquiries - just hope it's not permanent.

g1smd

11:54 pm on Aug 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is Google falling over your cookie (or is your site choosy about serving content when Google eats the cookie and gives nothing in return)?
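One way to test the cookie theory above: Googlebot doesn't return cookies, so fetch the page while deliberately sending no Cookie header and confirm you still get the full content back. A small sketch (host and path are placeholders):

```python
import http.client

def fetch_cookieless(host, path="/", port=80):
    """Fetch a page the way a first-time spider would: no Cookie
    header at all, ever."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        # No Cookie header is sent; any Set-Cookie from a previous
        # visit is simply ignored, as Googlebot would ignore it.
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()

# status, body = fetch_cookieless("www.domain.com")
```

If `body` here is shorter or different from what a cookie-accepting browser sees, the site is choosy about cookieless visitors, and that could explain the unrecognized file format.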