
Google does not obey robots.txt

         

flex55

7:54 am on Sep 5, 2004 (gmt 0)

10+ Year Member



Yesterday I saw something very odd in my logs. I wonder if anyone else is seeing this:
There were many hits from IP 64.68.83.180 (which reverse-resolves to crawl35.googlebot.com and is owned by Google according to ARIN). It identifies itself with the user agent "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", unlike the regular "Googlebot/2.1 (+http://www.google.com/bot.html)" I'm used to seeing in my logs.
What caught my attention was that this IP clearly did not obey the robots.txt instructions.

What is that? A bug at Google?
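
For anyone who wants to repeat the reverse-lookup check described above, here is a minimal sketch of a forward-confirmed reverse DNS test. The function name `is_google_crawler` and the injectable `reverse`/`forward` parameters are hypothetical and exist only so the logic can be exercised without network access. Accepting only the `.googlebot.com` suffix is an assumption based on the hostname quoted in this post.

```python
import socket


def is_google_crawler(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to a googlebot.com host
    and that host resolves back to the same ip."""
    try:
        # gethostbyaddr returns (hostname, aliases, addresses)
        host = reverse(ip)[0]
    except OSError:
        return False

    # Assumption: only hosts under googlebot.com count as the crawler.
    if not host.lower().rstrip(".").endswith(".googlebot.com"):
        return False

    try:
        # gethostbyname_ex returns (hostname, aliases, addresses)
        addresses = forward(host)[2]
    except OSError:
        return False

    # The forward lookup must include the original address; otherwise
    # the reverse record could have been set by anyone.
    return ip in addresses
```

Calling `is_google_crawler("64.68.83.180")` with the default resolvers performs live DNS lookups, so the answer depends on what DNS returns today, not on what it returned in 2004.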

Brett_Tabke

11:48 am on Sep 9, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



That is a very common occurrence.

- Did it index your content?
- Did it display that content in Google?

If not, then it obeyed its interpretation of the robots.txt standard.

The only way to stop GoogleBot from reading and following your links is to cloak them off.

Marcia

3:55 am on Sep 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



They've got the index pages of two of my domains in the index as URL-only listings. Both domains have a robots.txt exclusion and do not have even one link to them.

rfgdxm1

4:03 am on Sep 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>If not, then it obeyed its interpretation of the robots.txt standard.

That, sir, would be a flawed interpretation of the robots.txt standard. That standard was originally intended more as "This site is for humans, not robots. If you are a robot, go away and don't waste the bandwidth I pay real money for."

rfgdxm1

4:06 am on Sep 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>They've got the index pages of two of my domains in the index as URL-only listings. Both domains have a robots.txt exclusion and do not have even one link to them.

This is A-OK per the standard if Google fetched only robots.txt, and nothing more. You have no right to stop Google from indexing your sites. Your only right is that they not waste your bandwidth.
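
One way to test the "fetched only robots.txt, and nothing more" condition against your own logs is sketched below, using Python's standard `urllib.robotparser`. The helper name `disallowed_hits` and the sample rules and paths are hypothetical. It assumes you have already pulled the requested paths for the crawler's IP out of your access log.

```python
from urllib.robotparser import RobotFileParser


def disallowed_hits(robots_txt, agent, requested_paths):
    """Return the requested paths that robots.txt forbids for agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        path for path in requested_paths
        # Fetching robots.txt itself is always permitted.
        if path != "/robots.txt" and not parser.can_fetch(agent, path)
    ]


# Hypothetical robots.txt that excludes every robot from the whole site.
rules = "User-agent: *\nDisallow: /\n"

# Hypothetical paths requested by the crawler's IP, taken from a log.
hits = ["/robots.txt", "/index.html"]

print(disallowed_hits(rules, "Googlebot", hits))
```

An empty list means the crawler stayed within the rules as this parser reads them. Pass the short robot name ("Googlebot") rather than the full user-agent string, because the parser matches `User-agent` lines against the part of the string before the first slash.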