Google not respecting robots.txt

Forum Moderators: goodroi

Found more than 500 pages in Google's index that are forbidden by robots.txt

roots

5:34 pm on Jan 19, 2007 (gmt 0)

I can't believe this is true.

In robots.txt I disallowed crawlers from indexing one page even before that page existed on the site (3 months ago). Today I found out that the pages can be found in Google's index. At first I thought this was a problem with my robots.txt instructions, but I took the destination URL of the page from the SERP and pasted it into the Google Webmaster tool that tests URLs against the robots.txt file from the web server. The result was: Blocked by line 2!

Anyone experienced something like that?
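
For anyone who wants to repeat that kind of check locally, here is a minimal sketch using Python's standard-library urllib.robotparser. The rules, the example.com URLs, and the helper name is_blocked are assumptions for illustration only; the actual robots.txt was not posted in this thread. Note that a "blocked" result only says the URL may not be crawled under those rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the real robots.txt from this thread is unknown.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private-page.html
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules disallow user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/private-page.html",
        "http://www.example.com/index.html",
    ):
        print(url, "blocked:", is_blocked(ROBOTS_TXT, "Googlebot", url))
```

The standard-library parser is not Google's tester and may differ from it in details, so treat its answer as a rough cross-check rather than proof of what Googlebot does.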

Quadrille

12:06 am on Jan 22, 2007 (gmt 0)

No direct experience, but I've heard tales of this happening when the URL has incoming links from other sites.

And if the site is dynamic, it may be that there are other routes to the same page.

Xenu might be your friend?

goodroi

1:56 am on Jan 22, 2007 (gmt 0)

What are you finding in the Google index? Is Google listing the URL only, or is it also listing a title and snippet?

piskie

1:59 am on Jan 22, 2007 (gmt 0)

Maybe, just maybe, Google only obeys robots.txt for whole directories and not individual files.

Eathan

5:54 am on Jan 22, 2007 (gmt 0)

I'm not sure about Googlebot, but the robots.txt checker in webmaster tools is case sensitive. Blocking /bob* in your robots file will not block /Bob* in the tester. May be common knowledge, but it stumped me the other day. Silly purchased cart software had mixed cases all over the place...
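
The case-sensitivity point can be reproduced with a small sketch, again using Python's urllib.robotparser rather than the Webmaster Tools tester itself. The paths and the helper name is_allowed are made up for illustration. The Python parser does plain prefix matching and does not implement the * wildcard, so the rule is written as /bob instead of /bob*.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: block everything whose path starts with lowercase /bob.
RULES = [
    "User-agent: *",
    "Disallow: /bob",
]


def is_allowed(path: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules above permit user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(RULES)
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Paths are compared case-sensitively, so only the lowercase one matches.
    print("/bob/cart.html allowed:", is_allowed("/bob/cart.html"))
    print("/Bob/cart.html allowed:", is_allowed("/Bob/cart.html"))
```

If a cart package mixes cases, one option is to list each casing that actually appears in its URLs as a separate Disallow line.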