Google disobeying robots.txt

Do I hear the sound of the other shoe dropping?

uber_boy

3:40 pm on Feb 12, 2003 (gmt 0)

10+ Year Member



For the first time, googlebot is disobeying one of my robots.txt files. It used to have free run of the site in question, but I more or less banned all bots when I altered the file last month. And until today, googlebot was living by the new rules. However, about an hour ago googlebot started aggressively crawling this site and I have to say I'm a little shocked. It's not causing me any grief from a capacity perspective, but I've come to think of googlebot as trustworthy. Thus, I'd appreciate it if someone could shed a little light on these happenings...

lazerzubb

3:46 pm on Feb 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ah welcome to the real world ;)

NO spider follows your robots.txt file 100%.

You can contact Google about it though.
More information available at:
[google.com...]

So if it was sensitive content, you may be lucky and it may never show up in the SERPs.
If you contact them, keep your message brief and clear.

jdMorgan

3:47 pm on Feb 12, 2003 (gmt 0)


uber_boy,

First, validate your robots.txt [searchengineworld.com]. Then send a description of your problem along with a sample of your access log file to googlebot@google.com.

HTH,
Jim

amznVibe

3:57 pm on Feb 12, 2003 (gmt 0)


If Google checked your robots.txt earlier in the day and comes back later, I've noticed it will not check it again... so if you made changes during that time, they may not be followed.

uber_boy

4:25 pm on Feb 12, 2003 (gmt 0)


Thanks for the replies, everyone, and particularly to you, jdMorgan, for that validator link. It's mighty slick and, as suspected, confirmed that I lack style (but at least my file was validated). So should Google persist, I will do as suggested and send a quick message to the folks at Google...

GoogleGuy

10:55 pm on Feb 12, 2003 (gmt 0)


uber_boy, the googlebot tries to refetch your robots.txt file periodically, depending on how many pages we've fetched. Have you changed your robots.txt file recently?

You might try writing to googlebot@google.com if you still see a problem. Can you say what IP address and user agent it was?

rfgdxm1

11:12 pm on Feb 12, 2003 (gmt 0)


It occurs to me your site might have been down or unreachable the moment Googlebot tried to fetch robots.txt.

uber_boy

11:27 pm on Feb 12, 2003 (gmt 0)


Actually, I hadn't changed my robots.txt file in several weeks (although I've since added a bit of "style" to satisfy the validator). As for the culprit, a quick glance at the logs shows that I had every bot between 1 and 9 except for "crawl6". Were I to examine a few more records, I'd probably find the latter there as well. What makes all this particularly interesting to me is that I didn't get a whiff from the bots until today despite the fact that the deep crawl's been going on for a week...

uber_boy

6:36 pm on Feb 13, 2003 (gmt 0)


Just thought I'd wrap this thread up by noting that I did finally write to Google today. Much to my surprise, I had a personal reply from "The Google Team" within an hour. Granted, they fed me the standard message about robots.txt only being read once per day (despite my having made it VERY clear in my message that the file hadn't been changed in a month), but I was still impressed with the swiftness of their reply. I've since fired off another message so I'll be interested to see where things go from here.

uber_boy

4:37 pm on Feb 15, 2003 (gmt 0)


Heard from Google yesterday and again today. I have to say, it's a pretty responsive team there. That said, they informed me that the problem was trailing slashes on the directories I was trying to disallow. I've made the changes but that still doesn't explain why googlebot has obeyed these directory names in the past...

jdMorgan

5:42 pm on Feb 15, 2003 (gmt 0)


uber_boy,

Are you saying that a construct like


Disallow: /off_limits_directory/

is not being obeyed because of the trailing slash?
This should disallow all files in /off_limits_directory, but not a file called /off_limits_directory.html, for example.

Thanks,
Jim

GoogleGuy

7:31 pm on Feb 15, 2003 (gmt 0)


This is a good time to make the point that just because you have a valid robots.txt file doesn't mean that it does what you want. :) Just like with compiling a program--the program may compile because there are no syntax errors, but that doesn't mean that it does what you want.

I don't think I've heard of any actual robots.txt bugs at Google in several months, but it's always possible. Glad you got a useful answer back quickly, uber_boy.

chrisnrae

7:37 pm on Feb 15, 2003 (gmt 0)


I also had a bad robots.txt experience. I banned Google from a new domain while I was uploading files, tweaking, etc. I opened my hosting account on a Monday afternoon. Tuesday afternoon I uploaded the robots.txt first, then the rest of the files. By Tuesday night I had a PR of zero, with my domain IN Google as nothing more than a bare URL (no title, description, etc.)... I still don't understand why or how it happened.

Rae

jdMorgan

9:28 pm on Feb 15, 2003 (gmt 0)


Rae,

I'll wager you have the Google toolbar installed and enabled, and that you visited your new site. The other (remote) possibility is that your server log files are open and have been indexed. :o

Google will list any URL it becomes aware of in any way. However, it will not spider any page which you have disallowed in robots.txt. I've argued before that there is a question of semantics between "indexing" and "showing a link", but the robots exclusion standard does not say they can't show the link. So there it is, with no title or description - just a link.

If you really want to tell Google "Don't mention this page at all" you have to allow it to be spidered and place a meta robots noindex tag on the page itself.

BTW, Ask Jeeves/Teoma shows the same behaviour, but they're the only other one that I'm aware of.

Now that I know the fix, I can live with it.

Jim

chrisnrae

11:48 pm on Feb 15, 2003 (gmt 0)


Thanks :). Yes, I do have the toolbar installed and on. LOL, NOW I want Google to index it. I just wish it would have waited until the site was finished :). So now I am just waiting for the next crawl.

uber_boy

3:32 pm on Feb 16, 2003 (gmt 0)


I'm as surprised as you by this, jdMorgan. But I did indeed receive an email from Google telling me to remove every trailing slash from my robots.txt file. The "exact" wording (I've obviously changed the directory names) of the email was as follows:

Thanks for sending us the requested information.

To prevent the crawling of the disallowed directories, please make the following changes to your robots.txt file:

User-agent: *
Disallow: /foo/
Disallow: /bar/

Change To:

User-agent: *
Disallow: /foo
Disallow: /bar

Regards,
The Google Team

And as I've noted a couple of times, googlebot obeyed the original robots.txt file for the first week of the deep crawl.

jdMorgan

4:08 pm on Feb 16, 2003 (gmt 0)


Houston - er, Mountain View, we have a problem.

uber_boy,

The following is a quote from A Standard for Robot Exclusion [robotstxt.org]. Note the second sentence of the quote:

Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.

I hope that answer came from a "Level 1" tech out there, because it looks wrong. Please keep your log files of the "incident", 'cause I suspect this will need some looking into...

Robots are supposed to use simple prefix-matching to determine which resources are off-limits. If you say, "Disallow: /myfiles/", then www.example.com/myfiles/whatever.html is off-limits, and www.example.com/myfiles.html is not.
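
To make that concrete, here's a minimal sketch of the matching rule in Python. This is just my own illustration of what the standard describes (nobody outside Google knows what googlebot actually runs), and the `is_disallowed` helper is invented for the example:

```python
# Sketch of the prefix matching described in the robots exclusion
# standard: a URL path is off-limits if it starts with any Disallow
# value. Illustration only -- not Googlebot's actual code.

def is_disallowed(path, disallow_rules):
    """Return True if 'path' begins with any non-empty Disallow value."""
    return any(rule and path.startswith(rule) for rule in disallow_rules)

# With a trailing slash, only URLs inside the directory match:
rules = ["/myfiles/"]
print(is_disallowed("/myfiles/whatever.html", rules))  # True  (blocked)
print(is_disallowed("/myfiles.html", rules))           # False (allowed)

# Without the trailing slash, the prefix also matches the .html file:
rules = ["/myfiles"]
print(is_disallowed("/myfiles/whatever.html", rules))  # True
print(is_disallowed("/myfiles.html", rules))           # True
```

So dropping the trailing slash, as Google's email suggested, only *widens* the rule; it can't be needed to block the directory itself if the bot follows the standard.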

Best,
Jim

uber_boy

4:18 pm on Feb 16, 2003 (gmt 0)


I hear what you're saying, Jim, and am equally concerned. GoogleGuy's been following this thread so perhaps we can count on him (her?) to clarify this since, I agree, this seems to go against the standard.

GoogleGuy

9:10 pm on Feb 16, 2003 (gmt 0)


Interesting. I'll pass this back and ask what da skinny is. Thanks for giving more details, uber_boy and jdMorgan.

Romeo

9:43 pm on Feb 16, 2003 (gmt 0)


uber_boy,
I had a lot of Google crawling on my pages in recent days, and the Googlebots did read my robots.txt and obeyed the rule
User-agent: *
Disallow: /bottrap/
correctly, as they did in the past (note the trailing slash).

Is your Disallow statement the last in your file and does it end with a new line? Some bots seem to insist on the new line (I don't know if the Google bots do), and not all robots.txt verification programs check this.

Regards,
R.

uber_boy

9:59 pm on Feb 16, 2003 (gmt 0)


Thanks for the hints, Romeo. As near as I can tell, I'm doing everything you say. And as I keep noting, the crazy thing is that googlebot's obeyed exactly the same robots.txt file in the past and, for most of the deep crawl, obeyed it this time.

Painting

8:29 pm on Feb 23, 2003 (gmt 0)


Has anybody figured out why googlebot sometimes doesn't follow the robots.txt spec?

This is an important thread; just making sure it's not forgotten ;-)

isaac