Forum Moderators: DixonJones
Yesterday I caught these bots - but what concerns me is that a few mention "Google" (see the second bot - is this a legit bot?).
I do not want to block legit Googlebots. Is there a good list somewhere of legit Googlebot IPs?
Trapped bots:
24.57.8.78, agent is EasyDL/3.04 [keywen.com...]
66.249.65.238, agent is Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
64.233.178.136, agent is SonyEricssonK750i/R1N Browser/SEMC-Browser/4.2 Profile/MIDP-2.0 Configuration/CLDC-1.1 (Google WAP Proxy/1.0)
66.176.107.118, agent is Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)
68.169.221.230, agent is Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
The other one to worry about is the 3rd: it means people using a WAP proxy to view your site on their mobile phones are being blocked too.
The 4th and 5th look like real users with real browsers too; what detects their "botness"? Bandwidth used? Pages per second served? Access from a known "bad" IP?
The method uses a robots.txt file containing:
User-agent: *
Disallow: /thebottrap.php
Then each web page has a hidden link to thebottrap.php.
If someone or something requests thebottrap.php, I know they have either ignored the robots.txt file (or perhaps even used it to find the trap), or discovered the hidden link by poking around in the source code.
When a visitor hits thebottrap.php, I log the IP address and prevent that IP from accessing my site again.
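For illustration, here is a minimal sketch (in Python rather than PHP, with hypothetical file names) of what a trap script like thebottrap.php typically does: record the visitor's IP in a ban list, which the rest of the site then consults before serving pages.

```python
BANLIST = "banned_ips.txt"  # hypothetical file name

def ban_ip(ip, path=BANLIST):
    """Record an IP that requested the trap URL."""
    with open(path, "a") as f:
        f.write(ip + "\n")

def is_banned(ip, path=BANLIST):
    """Check an incoming request's IP against the ban list."""
    try:
        with open(path) as f:
            return ip in {line.strip() for line in f}
    except FileNotFoundError:
        # No one has hit the trap yet, so no one is banned.
        return False
```

The real script would pull the IP from the request environment (REMOTE_ADDR) and return a 403 for banned addresses.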
What worries me is that although the robots.txt file validates and has been up for a couple of months, it still seems to have snared a Google IP.
Where can I get an up-to-date list of legit Google IPs?
Thank you in advance :-)
WAP proxies, and Mozilla browsers that can do prefetching, will need to be 'allowed' to fetch your 'bot-link; they are not robots, and so don't read robots.txt.
For regular browsers, your 'bot-link needs to be hidden either as a comment or in other ways that cause browsers to ignore it. In many cases, WAP devices will display even those links, so the script must not ban them (you can modify the script or use mod_rewrite to handle this).
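One way to avoid banning those agents is to check the user-agent string against an allowlist before adding the IP to the ban list. A rough sketch (the substrings are illustrative, not exhaustive; tune them to your own logs):

```python
# Agents that may legitimately fetch the trap URL and must not be banned.
ALLOWED_AGENT_SUBSTRINGS = (
    "Google WAP Proxy",  # WAP gateways fetching on behalf of phone users
    "Googlebot",         # verify separately via reverse DNS before trusting
)

def should_ban(user_agent):
    """Ban only if the agent matches none of the allowed substrings."""
    return not any(s in user_agent for s in ALLOWED_AGENT_SUBSTRINGS)
```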
Only about a third of the complexity of bot-trapping is in the scripting. The rest is being clever about 'bot-baiting and avoiding collateral damage. Be careful, since as this case demonstrates, it only takes a small error to ban important search 'bots and innocent users.
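On the earlier question of legit Google IPs: rather than chasing an IP list, Google's own advice is to verify crawlers with forward-confirmed reverse DNS. A sketch with hypothetical helper names (the live check needs DNS access, so cache its results rather than running it on every request):

```python
import socket

def has_google_suffix(host):
    """Genuine Googlebot hostnames end in googlebot.com or google.com."""
    return host.endswith((".googlebot.com", ".google.com"))

def is_verified_googlebot(ip):
    """Forward-confirmed reverse DNS: reverse-resolve the IP, check the
    hostname's domain, then forward-resolve and confirm it matches."""
    try:
        host = socket.gethostbyaddr(ip)[0]  # reverse DNS lookup
    except OSError:
        return False
    if not has_google_suffix(host):
        return False
    try:
        return socket.gethostbyname(host) == ip  # forward confirmation
    except OSError:
        return False
```

The suffix check alone is not enough; the forward lookup stops anyone who fakes a PTR record pointing at googlebot.com.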
Jim