Forum Moderators: Robert Charlton & goodroi


Sitemaps reported NO robots.txt on 6/9/2006

But I do have a robots.txt on this site!


bumpski

10:01 am on Jun 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This quirk could certainly explain many posts I've seen here where sites are being crawled by Googlebot even though Googlebot was banned by robots.txt.

If Googlebot thinks there is no robots.txt the door is wide open! Sort of a significant design flaw in terms of protecting website copyright.

At the time, Sitemaps did not indicate any 404 errors, certainly not for robots.txt. Sitemaps also indicated no other errors related in any way to robots.txt (unreachable URLs, blocks by robots.txt, etc.).

Certainly this morning (6/10/2006) Sitemaps rediscovered that I do have a robots.txt. The plot thickens.

And certainly on 6/9/2006 Mediapartners-Google did crawl my robots.txt; Googlebot did not. This is somewhat of an indication that the Google bots share robots.txt information.

On 6/6/2006 Googlebot did attempt to fetch robots.txt, but without the www. Googlebot was returned a 301, BUT Googlebot did not follow up at the time with another GET at www.example.com/robots.txt! The 301 mechanism does work and does return the correct path, so I don't know why Googlebot did not fetch robots.txt at the www address in response to the 301.

How do I know this 301 mechanism works?
On 5/28/2006 Googlebot fetched robots.txt and a 301 was returned by this site's server. Within the same minute, Googlebot at the same IP address fetched robots.txt again and received a 200 (success). This is strong evidence that the 301 mechanism was working at that point, and it certainly was yesterday and today, when I tested it with the very helpful Live HTTP Headers extension.
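The two-request exchange in that log entry can be sketched as follows. This is purely an illustration of the redirect hop, with a stub dictionary standing in for the server and placeholder URLs (example.com); it is not a real crawler.

```python
# Stubbed "server": the non-www robots.txt URL answers 301 with a Location
# target, and the www URL answers 200 with the file body.
RESPONSES = {
    "http://example.com/robots.txt": (301, "http://www.example.com/robots.txt"),
    "http://www.example.com/robots.txt": (200, "User-agent: *\nDisallow:"),
}

def fetch_robots(url, max_hops=3):
    """Follow 301 redirects up to max_hops and return (status, body)."""
    for _ in range(max_hops):
        status, payload = RESPONSES[url]
        if status == 301:
            url = payload  # Location header: retry at the redirect target
            continue
        return status, payload
    raise RuntimeError("too many redirects")

status, body = fetch_robots("http://example.com/robots.txt")
print(status)  # the well-behaved sequence ends in a 200
```

A crawler that follows up on the 301, as Googlebot did on 5/28, ends with the file; one that stops after the 301, as on 6/6, never sees it.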


Conclusion:
So for some period of time between 6/6/2006 and 6/9/2006, Googlebot did not think I had a robots.txt file. Whoops!
Mediapartners did successfully crawl this site's robots.txt file every day since the 301 response on the 6th. So it is likely I saw a transient condition where Googlebot thought I had no robots.txt only from 6/8/2006 to 6/9/2006. During this period Googlebot, the Mediapartners bot, and the AdWords bot (GoogleBot/2.1 - note the uppercase "Bot") all crawled quite a few pages on this site.

Sidenotes:

This robots.txt gives Googlebot, and in fact all bots, unlimited access using:
User-agent: *
Disallow:
Also, this site's robots.txt gives Mediapartners (AdSense) unfettered access using:
User-agent: Mediapartners-Google*
Disallow:
which should of course be redundant information to Google.
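For anyone who wants to verify that an empty Disallow really does mean "nothing is disallowed", the rules above can be sanity-checked offline with Python's standard-library robots.txt parser (the URL below is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# The same rules as the robots.txt quoted above.
robots_lines = [
    "User-agent: *",
    "Disallow:",
    "",
    "User-agent: Mediapartners-Google*",
    "Disallow:",
]

rp = RobotFileParser()
rp.parse(robots_lines)

# An empty Disallow disallows nothing, i.e. unlimited access for everyone.
googlebot_ok = rp.can_fetch("Googlebot", "http://www.example.com/any/page.html")
mediabot_ok = rp.can_fetch("Mediapartners-Google", "http://www.example.com/any/page.html")
print(googlebot_ok, mediabot_ok)
```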

Twice in May, Googlebot requested robots.txt and was told "304", i.e. the file has not changed.
On 6/19/2006 I had changed robots.txt to give Googlebot unlimited access to the site, and I believe this is confirmed by Sitemaps, but I'm not sure I fully understand the Sitemaps status indication of:

"Allowed by line 4: Disallow: Detected as a directory; specific files may have different restrictions"

This is indicated by the Sitemaps "Analysis of cached robots.txt" display.
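The 304 responses mentioned above come from a conditional GET: the crawler sends a validator taken from its cached copy, and the server answers 304 Not Modified if the file hasn't changed, so the crawler keeps its cache. A rough sketch with a stubbed server (dates and bodies are illustrative, not from this site's logs):

```python
LAST_MODIFIED = "Sun, 28 May 2006 07:00:00 GMT"  # illustrative date
CURRENT_BODY = "User-agent: *\nDisallow:"

def server(request_headers):
    # Stub server: robots.txt has not changed since LAST_MODIFIED.
    if request_headers.get("If-Modified-Since") == LAST_MODIFIED:
        return 304, None
    return 200, CURRENT_BODY

def conditional_get(cache):
    status, body = server({"If-Modified-Since": cache["last_modified"]})
    if status == 304:
        return cache["body"]  # 304: reuse the cached copy, nothing re-sent
    return body

cache = {"last_modified": LAST_MODIFIED, "body": CURRENT_BODY}
print(conditional_get(cache))
```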

The Sitemaps summary indicates Googlebot last successfully crawled this site's "home" page on 5/31/2006, which is technically correct, BUT Googlebot has been doing HEAD requests of this site's "home" page numerous times since, even on 6/9/2006. Also, Mediabot crawled this home page on 6/3/2006, perhaps indicating Mediabot and Googlebot are NOT sharing information about this site's home page, and perhaps other pages (which makes sense to me).

This site is fully indexed, with a few Supplemental pages, some spurious URL errors (like // paths) which return success, and a few other URL requests for non-existent pages.

Another sidenote:
Someone tried to fetch misspellings of a web page on this site. Interestingly, the Mediapartners bot (AdSense) immediately tried to fetch all the misspelled URLs!

speda1

8:54 pm on Jun 11, 2006 (gmt 0)

10+ Year Member



This also happened to me with one of my sites yesterday. I refreshed the page and then Google picked it up.

This leads me to believe that it is the Sitemaps tool itself that is reporting on your robots.txt file, not information gathered by one of the bots.

abates

9:57 pm on Jun 11, 2006 (gmt 0)

10+ Year Member



This happened to me, and Googlebot indexed a bunch of stuff I didn't want it to index. :P

Once it "found" the robots.txt again, I used the removal tool to remove the URLs from the index.

bumpski

12:31 am on Jun 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It will be interesting to hear if others witness this.

I was in a hurry when I saw the problem, but I do believe I exited the robots.txt status page and refreshed it several times, and the results were consistent. It was corrected the next morning. I wish I had just done a "Save As" of the robots.txt status page to fully document the occurrence!

I think it may be related to the 301 redirect and Googlebot's failure to follow up and actually read robots.txt at the redirect address. Googlebot never read robots.txt (in the short term) after the redirect response. It may take this unusual scenario, where Googlebot tries to fetch:
example.com/robots.txt (no www)
and the server says "no, that's at www.example.com/robots.txt", but Googlebot ignores the redirect and never reads the "www" robots.txt. So it could be that it takes an attempt at an incorrect domain to actually trigger the problem. I don't know where Google is picking up the non-www domain name; I do know the site is coded correctly.

I've since reviewed my error logs as well, and there is no error attempting to read robots.txt at the time of the 301 response. In fact, several other bots fetched robots.txt within a few minutes of the 301 redirect response.

Assuming this is a problem, Google should be a little more fault tolerant and not just assume there's no robots.txt when there had been one for several years: try reading robots.txt a couple more times, and use its cached copy in the interim.
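That retry-then-fall-back-to-cache idea could look something like the sketch below. This is not Google's actual algorithm, just a hypothetical illustration of the suggested behavior, with `fetch` standing in for an HTTP request:

```python
def robust_robots(fetch, cached_rules, retries=3):
    """Try fetching robots.txt a few times; on persistent failure,
    fall back to the previously cached rules instead of assuming
    the file is gone (which would open the whole site to crawling)."""
    for _ in range(retries):
        try:
            status, body = fetch()
        except OSError:
            continue  # network hiccup: try again
        if status == 200:
            return body  # fresh copy replaces the cache
        if status == 404:
            return ""  # genuinely absent: crawling is unrestricted
        # 5xx, unfollowed redirects, etc.: treat as transient and retry
    return cached_rules  # interim: keep honoring the old rules

# Illustrative: a fetch that keeps failing leaves the cached rules in force.
always_503 = lambda: (503, None)
result = robust_robots(always_503, cached_rules="User-agent: *\nDisallow: /private")
print(result)
```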

I'm lucky right now in that I've left the site wide open to Googlebot anyway! So it makes no difference.

Thanks for the feedback.