Forum Moderators: Robert Charlton & goodroi
If Googlebot thinks there is no robots.txt, the door is wide open! That is a significant design flaw in terms of protecting website copyright.
At the time, Sitemaps did not indicate any 404 errors, certainly not for robots.txt. Sitemaps also indicated no other errors related in any way to robots.txt (unreachable URLs, blocks by robots.txt, etc.).
This morning (6/10/2006) Sitemaps has rediscovered that I do have a robots.txt. ------ The plot thickens
And on 6/9/2006 Mediapartners-Google did crawl my robots.txt, while Googlebot has not. That is some indication the Google bots are sharing robots.txt information.
On 06/06/2006 Googlebot did attempt to fetch robots.txt, but without the WWW. Googlebot was returned a 301, BUT Googlebot did not follow up at the time with another GET at www.example.com/robots.txt! This 301 mechanism does work and does return the correct path, so I don't know why Googlebot did not fetch robots.txt at the WWW address in response to the 301 return.
How do I know this 301 mechanism works?
On 5/28/2006 Googlebot fetched robots.txt and a 301 was returned by this site's server. In the same minute, Googlebot at the same IP address fetched robots.txt again and received a 200 (success) response. This is strong evidence the 301 mechanism was working at that point, and it certainly was today and yesterday when I tested it with the very helpful "Live HTTP headers".
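For anyone who wants to check the same thing, the follow-up behavior a crawler is supposed to show after a 301 can be sketched in a few lines of Python (example.com stands in for the real domain, as elsewhere in this post):

```python
from urllib.parse import urljoin

def next_fetch_after_redirect(requested_url, status, location):
    """Return the URL a well-behaved crawler should fetch next after a
    redirect, or None if no follow-up request is required.

    A 301 (or 302) carries a Location header; the crawler is expected to
    re-issue the GET there rather than conclude that no robots.txt exists.
    """
    if status in (301, 302) and location:
        # Location may be relative, so resolve it against the request URL.
        return urljoin(requested_url, location)
    return None

# The scenario in the logs above: the non-WWW request is redirected.
print(next_fetch_after_redirect(
    "http://example.com/robots.txt", 301,
    "http://www.example.com/robots.txt"))
# prints http://www.example.com/robots.txt
```

That second GET at the WWW address is exactly the step Googlebot skipped on 6/6.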
Conclusion:
So for some period of time between 6/6/2006 and 6/9/2006 Googlebot did not think I had a robots.txt file. Whoops!
Mediapartners did successfully crawl this site's robots.txt file every day since the 301 response on the 6th. So it is likely I saw a transient condition where Googlebot thought I didn't have a robots.txt only from 6/8/2006 to 6/9/2006. During this period Googlebot, the Mediapartners bot, and the AdWords bot (GoogleBot/2.1, note the uppercase "Bot") all crawled quite a few pages on this site.
Sidenotes:
This robots.txt gives Googlebot, and in fact all bots, unlimited access using:
User-agent: *
Disallow:
This site's robots.txt also gives Mediapartners (AdSense) unfettered access using:
User-agent: Mediapartners-Google*
Disallow:
which should of course be redundant information to Google.
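Those two records can be sanity-checked with Python's standard urllib.robotparser; a quick sketch (the URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:

User-agent: Mediapartners-Google*
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# An empty Disallow line blocks nothing, so every agent is allowed.
print(rp.can_fetch("Googlebot", "http://www.example.com/any/page.html"))
print(rp.can_fetch("Mediapartners-Google", "http://www.example.com/any/page.html"))
```

Both calls print True, which is why the Mediapartners record is redundant: the wildcard record already grants full access.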
Twice in May Googlebot requested robots.txt and was told "304", meaning the file had not changed.
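A 304 comes from a conditional GET: the bot sends an If-Modified-Since header and the server answers 304 when the file has not changed since that date. The server-side decision boils down to something like this (a simplified sketch, not this server's actual code; the dates are made up for illustration):

```python
from email.utils import parsedate_to_datetime

def conditional_get_status(if_modified_since, last_modified):
    """Return 304 if the resource has not changed since the client's
    If-Modified-Since date, else 200. A simplification of the HTTP rules."""
    if if_modified_since is None:
        return 200  # unconditional request: always send the body
    ims = parsedate_to_datetime(if_modified_since)
    lm = parsedate_to_datetime(last_modified)
    return 304 if lm <= ims else 200

# Hypothetical: robots.txt last changed May 1st, bot asks on May 28th.
print(conditional_get_status("Sun, 28 May 2006 12:00:00 GMT",
                             "Mon, 01 May 2006 00:00:00 GMT"))
# prints 304
```

The key point for this thread: a 304 means Googlebot already had a valid cached copy of robots.txt, which makes the later "no robots.txt" state even stranger.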
On 6/19/2006 I changed robots.txt to give Googlebot unlimited access to the site, and I believe this is confirmed by Sitemaps, but I'm not sure I fully understand the Sitemaps status indication of:
"Allowed by line 4: Disallow: Detected as a directory; specific files may have different restrictions"
The Sitemaps summary indicates Googlebot last successfully crawled this site's "home" page on 5/31/2006, which is technically correct, BUT Googlebot has been doing HEAD requests of this site's home page numerous times since, even on 6/9/2006. Also, Mediabot crawled the home page on 6/3/2006, perhaps indicating Mediabot and Googlebot are NOT sharing information about this site's home page, and perhaps other pages (which makes sense to me).
This site is fully indexed, with a few Supplemental pages, spurious URL errors (like // paths) which return success, and a few other URL requests for non-existent pages.
Another Sidenote:
Someone tried to fetch misspellings of a web page on this site. Interestingly, the Mediapartners bot (AdSense) immediately tried to fetch all the misspelled URLs!
I was in a hurry when I saw the problem, but I do believe I exited the robots status page and refreshed it several times, and the results were consistent. It was corrected the next morning. I wish I had done a "Save As" of the robots.txt status page to fully document the occurrence!
I think it may be related to the 301 redirect and Googlebot's failure to follow up and actually read robots.txt at the redirect address. Googlebot never read robots.txt (in the short term) after the redirect response. It may take this unusual scenario, where Googlebot tries to fetch:
example.com/robots.txt (no WWW)
The server says "no, that's at www.example.com/robots.txt", but in this case Googlebot never read the "www" robots.txt; it just ignored the redirect. So it may take an attempt at an incorrect domain to actually trigger the problem. I don't know where Google is picking up the non-WWW domain name; I do know the site is coded correctly.
I've since reviewed my error logs as well, and there is no error from an attempt to read robots.txt at the time of the 301 response. In fact, several other bots fetched robots.txt within a few minutes of the 301 redirect response.
Assuming this is a problem, Google should be a little more fault tolerant and not just assume there's no robots.txt when there has been one for several years: try reading robots.txt a couple more times, using its cached copy in the interim.
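That suggested fallback could look something like this (purely hypothetical logic, my suggestion and not Google's actual behavior):

```python
def robots_rules_to_use(fetch_ok, fresh_rules, cached_rules, retries_left):
    """A more fault-tolerant policy for a crawler: on a failed robots.txt
    fetch, keep using the cached copy and retry later, rather than
    assuming the site has no robots.txt at all. Hypothetical sketch."""
    if fetch_ok:
        return fresh_rules        # 200: trust the freshly fetched file
    if cached_rules is not None and retries_left > 0:
        return cached_rules       # transient failure: reuse the cache
    return ""                     # no history at all: treat as wide open

# Fetch failed, but we crawled this site for years; keep the old rules.
print(robots_rules_to_use(False, None, "User-agent: *\nDisallow: /private/", 2))
```

Only a site with no robots.txt history would ever fall through to the wide-open case.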
I'm lucky right now in that I've left the site wide open to Googlebot anyway! So it makes no difference.
Thanks for the feedback.