Forum Moderators: goodroi


Googlebot tripping bad bot script

         

RobertRogers

7:47 pm on Nov 27, 2006 (gmt 0)

10+ Year Member



I have a number of websites with a "bad bot" trap. On just one of my sites, a particular Googlebot keeps springing the trap.

Does Googlebot sometimes mess up reading robots.txt, or is this perhaps a fake Googlebot?

In my robots.txt file I have this, which has worked for years:

User-agent: *
Disallow: /see-this/

This is the Googlebot that springs the trap:
---------------------------------------------
A bad robot hit /see-this/ 2006-11-27 (Mon) 00:34:21
address is 66.249.65.109
agent is Mozilla/5.0
(compatible; Googlebot/2.1; +http://www.google.com/bot.html)
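RobertRogers doesn't show his trap script, but a minimal sketch of a handler that produces a log entry in the format above might look like this (the function name and layout are my own assumptions, not his actual code):

```python
from datetime import datetime

def log_bad_bot(path, address, agent, when=None):
    """Format a trap log entry like the one posted above.

    A real trap would sit behind a disallowed URL such as /see-this/
    and append this entry to a log file (or ban the IP) on each hit.
    """
    when = when or datetime.utcnow()
    stamp = when.strftime("%Y-%m-%d (%a) %H:%M:%S")
    return (f"A bad robot hit {path} {stamp}\n"
            f"address is {address}\n"
            f"agent is {agent}")

print(log_bad_bot(
    "/see-this/", "66.249.65.109",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    when=datetime(2006, 11, 27, 0, 34, 21)))
```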

goodroi

11:10 am on Nov 28, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Hi RobertRogers,

Welcome to WebmasterWorld!
My first thought was that the bot is one of the many imitation Googlebots, but the IP you list is a genuine Google IP.
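One way to check whether a visitor claiming to be Googlebot really is Google is the reverse-then-forward DNS test Google itself recommends: look up the hostname for the IP, check that it ends in googlebot.com or google.com, then resolve that hostname back and confirm it returns the original IP. A sketch (the full check needs live DNS, so treat it as illustrative):

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_hostname(host):
    """True if the hostname belongs to one of Google's crawler domains."""
    return host.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip):
    """Reverse-then-forward DNS check for a claimed Googlebot IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]  # reverse lookup: IP -> hostname
    except socket.herror:
        return False
    if not is_google_hostname(host):
        return False
    try:
        # forward-confirm: the hostname must resolve back to the same IP
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
```

The suffix check alone is not enough: anyone can name a host fake.googlebot.com.evil.net, which is why the forward confirmation matters.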

The part of your robots.txt that you posted is valid, and Googlebot should not be crawling that path unless you have a section elsewhere in your robots.txt specifically for Googlebot that allows it. Do you?

cheers
goodroi

RobertRogers

7:13 pm on Nov 29, 2006 (gmt 0)

10+ Year Member



No, the robots.txt file hasn't changed for a couple of years. This is the first time googlebot visited the disallowed files.

After doing it twice, it seems the bot has learned its lesson and is behaving itself.

I guess even the big guys make mistakes.

lammert

3:56 pm on Dec 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do you use AdWords? The landing page quality bot (AdsBot-Google) is known to ignore the wildcard record in robots.txt by design. That specific bot should be recognizable by its User-Agent string, though.
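For reference, Google documents that its AdsBot crawler ignores the global `*` record and must be addressed by name, so blocking it requires an explicit section like this (the path is just the one from this thread):

```
User-agent: AdsBot-Google
Disallow: /see-this/
```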

jdMorgan

5:29 pm on Dec 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Another possible problem is that you have other records in your robots.txt file, not mentioned here, which are interfering with your intended function.

Most robots will obey the first robots.txt record that matches their user-agent string, or the "*" record -- whichever comes first. The major search robots go beyond that and obey the record that most specifically matches their user-agent string.

However, support for obeying multiple records that match to varying degrees is likely non-existent. Therefore, design your robots.txt assuming that any given robot will obey only one record, and the safest approach is to design for the simple first-match rule given above.

Another way to put this is that robots.txt records are per-robot, not per-URL-path.
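This per-robot behavior can be seen with Python's standard-library robots.txt parser: a robot that matches a named record uses only that record, and the "*" record never applies to it (the rules below are a hypothetical example, not RobertRogers' actual file):

```python
from urllib import robotparser

# Hypothetical robots.txt: a Googlebot-specific record plus a catch-all.
# An empty Disallow means "allow everything" for that record.
rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /see-this/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot matches its own record, so the "*" Disallow never applies to it.
print(rp.can_fetch("Googlebot", "http://example.com/see-this/"))  # True
# Any other robot falls back to the "*" record and is kept out.
print(rp.can_fetch("OtherBot", "http://example.com/see-this/"))   # False
```

This is exactly the kind of interaction goodroi asked about earlier in the thread: a Googlebot-specific record silently overrides the "*" record for Googlebot.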

Jim