Google won't crawl if robots.txt returns a 500 error


pcote

8:18 pm on Jan 29, 2008 (gmt 0)

10+ Year Member



Hello everyone,

I'm new to WebmasterWorld, and I'm not sure if anyone has ever posted about this, but I discovered an interesting piece of info about Googlebot this week. One of my newer accounts did not have a robots.txt file on the server, and for some reason when the file was requested the server was returning a '500' response instead of a '404'. I found that even though all of the other files on the site worked properly, Google will not crawl a website if robots.txt returns a '500' response code, because it cannot tell whether the file actually exists. In layman's terms - this means that Google will not crawl a website unless it is sure it is allowed to.

I also noticed that Yahoo and MSN will index a site even if the robots file returns '500'.

Anyway, if you have a site that isn't being indexed by Google and you can't figure out why, try checking the response code that is served when your robots.txt file is requested.
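If it helps, here's a minimal sketch of that check in Python. It stands up a throwaway local web server (a stand-in for your real host - swap in your own domain) that answers robots.txt requests with a '500', then reports the status code a crawler would see:

```python
# Minimal sketch: serve /robots.txt with a chosen status code locally,
# then check what a crawler would see. The local server is a placeholder
# for your real host; point robots_status() at your own site instead.
import threading
import urllib.request
import urllib.error
from http.server import BaseHTTPRequestHandler, HTTPServer

def robots_status(base_url):
    """Return the HTTP status code served for /robots.txt."""
    try:
        with urllib.request.urlopen(base_url + "/robots.txt") as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # urlopen raises on 4xx/5xx; the code is on the exception

class Handler(BaseHTTPRequestHandler):
    status = 500  # simulate the misconfigured server from this thread

    def do_GET(self):
        self.send_response(self.status)
        self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

code = robots_status(base)
print(code)  # 500 - the ambiguous case where Google holds off crawling
server.shutdown()
```

A '404' here is fine (no robots.txt means everything is crawlable); a '500' is the ambiguous case described above.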

Hope this helps everyone,
Paul

tedster

9:45 pm on Jan 29, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome to the forums, Paul. That's a good observation about robots.txt, and pretty well-mannered behavior from googlebot, too. It makes sense for them to want a clear signal on the first spidering, even if they don't check robots.txt every time going into the future.

pcote

10:08 pm on Jan 29, 2008 (gmt 0)

10+ Year Member



Actually, the site was already in Google's index and the problem started when my client rebuilt the site and moved to a new host. Google was kind and left all of the old site's pages in the index, and it appeared to be waiting patiently for the server issue to be resolved - this went on for four months.

phranque

1:53 am on Jan 30, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



i thought google would crawl anything it wasn't specifically excluded from crawling.
however i found a recent thread on the google webmaster tools discussion group [groups.google.com], and the final response by googler johnmu (Jan 10, 3:06 am) verifies paul's observation.
i'm also guessing they check robots.txt every time they crawl a url that hasn't previously been indexed.

pcote

2:22 am on Jan 30, 2008 (gmt 0)

10+ Year Member



One of the most interesting points is that my client's old site and its rankings were not removed for the entire four month period. This behavior definitely presents an opportunity for spammers to change the page they're sending visitors to without losing their Google rankings - essentially using Google's cache to cloak a page.

phranque

3:36 am on Jan 30, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



by the way, welcome to WebmasterWorld [webmasterworld.com], paul!