Forum Moderators: Robert Charlton & goodroi
If Googlebot thinks there is no robots.txt, the door is wide open! That is a significant design flaw in terms of protecting website copyright.
At the time, Sitemaps did not indicate any 404 errors, certainly not for robots.txt. Sitemaps also indicated no other errors related in any way to robots.txt (unreachable URLs, blocks by robots.txt, etc.).
This morning (6/10/2006) Sitemaps has rediscovered that I do have a robots.txt. ------ The plot thickens
And on 6/9/2006 Mediapartners-Google did crawl my robots.txt, while Googlebot has not. That is some indication the Google bots are sharing robots.txt information.
On 06/06/2006 Googlebot did attempt to fetch robots.txt, but without the WWW. Googlebot was returned a 301, BUT Googlebot did not follow up at the time with another GET at www.example.com/robots.txt! This 301 mechanism does work and does return the correct path, so I don't know why Googlebot did not fetch robots.txt at the WWW address in response to the 301 return.
How do I know this 301 mechanism works?
On 5/28/2006 Googlebot fetched robots.txt and a 301 was returned by this site's server. In the same minute, Googlebot at the same IP address fetched robots.txt again and received a 200 (success) response. This is strong evidence the 301 mechanism was working at that point, and it certainly was today and yesterday when I tested it with the very helpful "Live HTTP headers".
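For anyone who wants to check the same thing, the follow-up behavior a crawler is supposed to show after a 301 can be sketched in a few lines of Python (example.com stands in for the real domain, as elsewhere in this post):

```python
from urllib.parse import urljoin

def next_fetch_after_redirect(requested_url, status, location):
    """Return the URL a well-behaved crawler should fetch next after a
    redirect, or None if no follow-up request is required.

    A 301 (or 302) carries a Location header; the crawler is expected to
    re-issue the GET there rather than conclude that no robots.txt exists.
    """
    if status in (301, 302) and location:
        # Location may be relative, so resolve it against the request URL.
        return urljoin(requested_url, location)
    return None

# The scenario in the logs above: the non-WWW request is redirected.
print(next_fetch_after_redirect(
    "http://example.com/robots.txt", 301,
    "http://www.example.com/robots.txt"))
# prints http://www.example.com/robots.txt
```

That second GET at the WWW address is exactly the step Googlebot skipped on 6/6.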
Conclusion:
So for some period of time between 6/6/2006 and 6/9/2006 Googlebot did not think I had a robots.txt file. Whoops!
Mediapartners did successfully crawl this site's robots.txt file every day since the 301 response on the 6th. So it is likely I saw a transient condition where Googlebot thought I didn't have a robots.txt only from 6/8/2006 to 6/9/2006. During this period Googlebot, the Mediapartners bot, and the AdWords bot (GoogleBot/2.1, note the uppercase "Bot") all crawled quite a few pages on this site.
Sidenotes:
This robots.txt gives Googlebot, and in fact all bots, unlimited access using:
User-agent: *
Disallow:
This site's robots.txt also gives Mediapartners (AdSense) unfettered access using:
User-agent: Mediapartners-Google*
Disallow:
which should of course be redundant information to Google.
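Those two records can be sanity-checked with Python's standard urllib.robotparser; a quick sketch (the URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:

User-agent: Mediapartners-Google*
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# An empty Disallow line blocks nothing, so every agent is allowed.
print(rp.can_fetch("Googlebot", "http://www.example.com/any/page.html"))
print(rp.can_fetch("Mediapartners-Google", "http://www.example.com/any/page.html"))
```

Both calls print True, which is why the Mediapartners record is redundant: the wildcard record already grants full access.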
Twice in May Googlebot requested robots.txt and was told "304", meaning the file had not changed.
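A 304 comes from a conditional GET: the bot sends an If-Modified-Since header and the server answers 304 when the file has not changed since that date. The server-side decision boils down to something like this (a simplified sketch, not this server's actual code; the dates are made up for illustration):

```python
from email.utils import parsedate_to_datetime

def conditional_get_status(if_modified_since, last_modified):
    """Return 304 if the resource has not changed since the client's
    If-Modified-Since date, else 200. A simplification of the HTTP rules."""
    if if_modified_since is None:
        return 200  # unconditional request: always send the body
    ims = parsedate_to_datetime(if_modified_since)
    lm = parsedate_to_datetime(last_modified)
    return 304 if lm <= ims else 200

# Hypothetical: robots.txt last changed May 1st, bot asks on May 28th.
print(conditional_get_status("Sun, 28 May 2006 12:00:00 GMT",
                             "Mon, 01 May 2006 00:00:00 GMT"))
# prints 304
```

The key point for this thread: a 304 means Googlebot already had a valid cached copy of robots.txt, which makes the later "no robots.txt" state even stranger.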
On 6/19/2006 I changed robots.txt to give Googlebot unlimited access to the site, and I believe this is confirmed by Sitemaps, but I'm not sure I fully understand the Sitemaps status indication of:
"Allowed by line 4: Disallow: Detected as a directory; specific files may have different restrictions"
The Sitemaps summary indicates Googlebot last successfully crawled this site's "home" page on 5/31/2006, which is technically correct, BUT Googlebot has been doing HEAD requests of this site's home page numerous times since, even on 6/9/2006. Also, Mediabot crawled the home page on 6/3/2006, perhaps indicating Mediabot and Googlebot are NOT sharing information about this site's home page, and perhaps other pages (which makes sense to me).
This site is fully indexed, with a few Supplemental pages, spurious URL errors (like // paths) which return success, and a few other URL requests for non-existent pages.
Another Sidenote:
Someone tried to fetch misspellings of a web page on this site. Interestingly, the Mediapartners bot (AdSense) immediately tried to fetch all the misspelled URLs!
I was in a hurry when I saw the problem, but I do believe I exited the robots status page and refreshed it several times, and the results were consistent. It was corrected the next morning. I wish I had done a "Save As" of the robots.txt status page to fully document the occurrence!
I think it may be related to the 301 redirect and Googlebot's failure to follow up and actually read robots.txt at the redirect address. Googlebot never read robots.txt (in the short term) after the redirect response. It may take this unusual scenario, where Googlebot tries to fetch:
example.com/robots.txt (no WWW)
The server says "no, that's at www.example.com/robots.txt", but in this case Googlebot never read the "www" robots.txt; it just ignored the redirect. So it may take an attempt at an incorrect domain to actually trigger the problem. I don't know where Google is picking up the non-WWW domain name; I do know the site is coded correctly.
I've since reviewed my error logs as well, and there is no error from an attempt to read robots.txt at the time of the 301 response. In fact, several other bots fetched robots.txt within a few minutes of the 301 redirect response.
Assuming this is a problem, Google should be a little more fault tolerant and not just assume there's no robots.txt when there has been one for several years: try reading robots.txt a couple more times, using its cached copy in the interim.
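That suggested fallback could look something like this (purely hypothetical logic, my suggestion and not Google's actual behavior):

```python
def robots_rules_to_use(fetch_ok, fresh_rules, cached_rules, retries_left):
    """A more fault-tolerant policy for a crawler: on a failed robots.txt
    fetch, keep using the cached copy and retry later, rather than
    assuming the site has no robots.txt at all. Hypothetical sketch."""
    if fetch_ok:
        return fresh_rules        # 200: trust the freshly fetched file
    if cached_rules is not None and retries_left > 0:
        return cached_rules       # transient failure: reuse the cache
    return ""                     # no history at all: treat as wide open

# Fetch failed, but we crawled this site for years; keep the old rules.
print(robots_rules_to_use(False, None, "User-agent: *\nDisallow: /private/", 2))
```

Only a site with no robots.txt history would ever fall through to the wide-open case.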
I'm lucky right now in that I've left the site wide open to Googlebot anyway! So it makes no difference.
Thanks for the feedback.