Forum Moderators: DixonJones


Fake Google Bot? Where do I find Googlebot ip's?


rfontaine

7:34 pm on Dec 27, 2005 (gmt 0)

10+ Year Member



Recently I created a badbot trap on my site, with a robots.txt file that has been up for several months and validates.

Yesterday I caught these bots, but what concerns me is that a few mention "Google" (see the second one - is it a legit bot?).

I do not want to block legit googlebots. Is there a good list somewhere of legit googlebots?

Trapped bots:

24.57.8.78, agent is EasyDL/3.04 [keywen.com...]

66.249.65.238, agent is Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

64.233.178.136, agent is SonyEricssonK750i/R1N Browser/SEMC-Browser/4.2 Profile/MIDP-2.0 Configuration/CLDC-1.1 (Google WAP Proxy/1.0)

66.176.107.118, agent is Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)

68.169.221.230, agent is Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5

g1smd

7:44 pm on Dec 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I believe that 66.249.65.238 is within one of the IP ranges that Google uses.

The other one to worry about is the 3rd: people using Google's WAP proxy to view your site on their mobile phones are being blocked too.

The 4th and 5th look like real users with real browsers too. What detects their "botness"? Bandwidth used? Pages served per second? Access from a known "bad" IP?
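To check whether an IP really belongs to Googlebot without maintaining a list of IP ranges, one common technique is a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, check the hostname ends in googlebot.com or google.com, then resolve the hostname back and confirm it maps to the same IP. A minimal Python sketch (function names are ours; uses only the stdlib socket module):

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_host(hostname):
    # Genuine Googlebot reverse-DNS names end in googlebot.com or google.com.
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip):
    """Forward-confirmed reverse DNS: PTR lookup, suffix check,
    then resolve the name back and confirm it maps to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)           # reverse lookup
        if not is_google_host(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]   # forward confirm
    except OSError:   # covers socket.herror and socket.gaierror
        return False
```

The suffix check alone is not enough - anyone can put "googlebot" in a hostname - which is why the forward confirmation step matters.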

rfontaine

7:53 pm on Dec 27, 2005 (gmt 0)

10+ Year Member



Do you see anything wrong with this method, and why would it snare a googlebot?

The method uses a robots.txt file containing:

User-agent: *
Disallow: thebottrap.php

Then each web page has a hidden link to thebottrap.php.

If someone or something requests thebottrap.php, then I know they have either ignored the robots.txt file (or perhaps even used it to find the trap), or poked around the source code and discovered the hidden link that way.

When the trap is triggered, I log the IP address and prevent that IP from accessing my site again.
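In outline, the trap logic described above might look like this (a Python sketch with illustrative names, not the actual thebottrap.php):

```python
# Hypothetical sketch: when the hidden URL is requested, append the
# client's IP to a ban list that the rest of the site checks on each hit.
BANLIST = "banned_ips.txt"

def load_banned(path=BANLIST):
    try:
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()

def trap_hit(ip, path=BANLIST):
    """Called when the trap URL (e.g. /thebottrap.php) is requested."""
    if ip not in load_banned(path):
        with open(path, "a") as f:
            f.write(ip + "\n")

def is_banned(ip, path=BANLIST):
    return ip in load_banned(path)
```

Every other page would call is_banned() early and refuse service to listed IPs; as discussed below, an allowlist check belongs in front of trap_hit() to avoid collateral damage.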

What worries me is that, although the robots.txt file validates and has been up for a couple of months, it seems it may still have snared a Google IP.

Where can I get an up-to-date list of legit Google IPs?

Thank you in advance :-)

g1smd

8:09 pm on Dec 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A Disallowed URL must start with a /, I believe. Without it, bots may not recognise the path of the file they mustn't access - that's why they are still fetching it.
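With the leading slash, the file would read:

User-agent: *
Disallow: /thebottrap.php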

rfontaine

8:22 pm on Dec 27, 2005 (gmt 0)

10+ Year Member



Thank you, g1smd - you may well be right. That missing '/' could make all the difference.

I suspect the others could be people dissecting the HTML source.

jdMorgan

11:14 pm on Dec 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Your robots.txt was invalid: Robots use prefix-matching when determining whether they can fetch a file. Add the leading slash as mentioned above. You'll also need to allow at least a day with the corrected robots.txt in place before enabling the script, so that all robots have a fighting chance to fetch it.
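That prefix-matching can be illustrated with a minimal sketch (Python; real robots.txt parsers also handle Allow lines and other cases):

```python
def is_disallowed(url_path, disallow_rules):
    # A Disallow rule matches when the request path starts with the
    # rule text, compared character by character from the left.
    return any(url_path.startswith(rule) for rule in disallow_rules if rule)

# "thebottrap.php" (no slash) never prefix-matches a path like
# "/thebottrap.php", so the trap URL looked allowed to polite robots.
```

This is why the missing slash matters: the rule without it matches nothing, and well-behaved bots walked straight into the trap believing it was permitted.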

WAP proxies and Mozilla browsers that can do prefetching will need to be 'allowed' to fetch your 'bot-link; they are not robots and so don't read robots.txt.

For regular browsers, your 'bot-link needs to be hidden either as a comment or in other ways that cause browsers to ignore it. In many cases, WAP devices will display even those links, so the script must not ban them (you can modify the script or use mod_rewrite to handle this).
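One way to spare those agents is a user-agent allowlist checked before the ban is applied - a hedged sketch (the substrings are illustrative examples drawn from this thread, not an authoritative list, and user-agent strings can be forged):

```python
# Hypothetical pre-ban check: agents matching these substrings are
# never added to the ban list, even if they hit the trap URL.
SPARED_SUBSTRINGS = ("Google WAP Proxy", "Googlebot")

def should_ban(user_agent):
    return not any(s in user_agent for s in SPARED_SUBSTRINGS)
```

Because user agents are trivially spoofed, this is best combined with an IP or reverse-DNS check rather than relied on alone.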

Only about a third of the complexity of bot-trapping is in the scripting. The rest is being clever about 'bot-baiting and avoiding collateral damage. Be careful, since as this case demonstrates, it only takes a small error to ban important search 'bots and innocent users.

Jim

g1smd

11:25 pm on Dec 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



... as in this simple example in another thread: [webmasterworld.com...] perhaps?