
protect site against badbots

interesting method found using bait dir in robots.txt, newbie inquires

     
2:47 pm on Mar 8, 2007 (gmt 0)

New User

10+ Year Member

joined:Jan 15, 2007
posts:2
votes: 0


Dear Friends,

I'm relatively new to webmastering and have just taken over a neglected site. The previous webmaster did nothing to protect it, and it obviously got swamped with huge amounts of spam e-mail, etc.
Now we want to update the site and would like to make sure that the spambots don't have an easy time of it.

First, where I've been so far:
I altered the .htaccess files to deny known bad user-agents, and tried to find the most up-to-date list of them. That was somewhat of a challenge, as I know little about this whole business.
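For reference, this is roughly the kind of thing I ended up with in .htaccess (the three user-agent names are just examples pulled from such a list, not a vetted set):

# deny a few known bad user-agents (names here are examples only)
SetEnvIfNoCase User-Agent "EmailSiphon"    bad_bot
SetEnvIfNoCase User-Agent "EmailCollector" bad_bot
SetEnvIfNoCase User-Agent "WebZIP"         bad_bot

Order Allow,Deny
Allow from all
Deny from env=bad_bot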

I found the post quoted below while googling. It is somewhat outdated (2001), so I'd like to know whether you think it is still a good method for locking out the naughty beasts. It involves setting a trap in the /robots.txt file: a non-existent directory is listed as disallowed, you log which bots specifically ignore that "disallow" request and go for it anyway, and a script then denies access to the requesting IPs. Quoted here:

++++++++++++++++++++++
Stopping the most pernicious and egregious spiders can be easy though:

1. use some tool that does what mod_rewrite does on your server
2. insert a Disallow: /email_addresses/ line into your robots.txt file
3. every time a visitor requests that explicitly disallowed directory, rewrite the request to a CGI script that logs their IP address
4. finally, configure your .htaccess/mod_rewrite files to deny access to any visitor whose IP address is in that log file.

Thus the spider is kicked/banned instantly, rather than much later when you get around to perusing your log files ... by which time it's too late.
++++++++++++++++++++++++
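If I understand the quoted method, the pieces would look roughly like this. The directory name comes from the quote, but the script name (trap.php) and the idea of appending "Deny from" lines are just my own guesses at an implementation.

In robots.txt:

User-agent: *
Disallow: /email_addresses/

In .htaccess:

RewriteEngine On
# hand any request for the bait directory to a logging script
RewriteRule ^email_addresses/ /trap.php [L]

# lines like the one below would then be appended by that script,
# one per trapped IP, so its next request is refused outright
# Deny from 192.0.2.1

The part I'm least sure about is whether letting a script write to .htaccess like that is wise.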

If you think this is a good method for a small community site, I would love to have suggestions on implementation. I know nothing of CGI and little of the Apache server, but I'm learning fast.

What do you all think? Is this a good method? If not, what are people doing these days?

TIA
Greetings
Glogo

3:10 pm on Mar 8, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time and 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


This presumes that "bad" robots will fetch robots.txt and then get the honeypot URL from it. The problem is that many, if not most, bad bots won't even fetch robots.txt.

So in addition to putting a trap in robots.txt, it's a good idea to use several other methods as well. See this thread [webmasterworld.com] for a Perl-based solution, and this thread [webmasterworld.com] for a PHP-based solution (you can use both).
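To give you the general flavour (this is only a rough sketch of that kind of PHP trap, not the actual script from either of those threads, and the file paths are placeholders), the trap page itself can be very small. The usual trick is to also hide a link to the bait URL somewhere in your pages, so robots that never read robots.txt but blindly follow links still fall into it.

<?php
// trap.php -- sketch of a honeypot logger; all paths are placeholders
$ip = $_SERVER['REMOTE_ADDR'];
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown';

// 1. log the offender for later review
file_put_contents('/path/to/badbots.log',
    date('r') . '  ' . $ip . '  ' . $ua . "\n",
    FILE_APPEND | LOCK_EX);

// 2. append a Deny line to the site's .htaccess so the next request
//    from this IP is refused outright
file_put_contents('/path/to/site/.htaccess',
    'Deny from ' . $ip . "\n",
    FILE_APPEND | LOCK_EX);

// 3. give the bot nothing useful this time, either
header('HTTP/1.0 403 Forbidden');
echo 'Forbidden';
?>

Whatever you end up using, whitelist your own IP and the major search-engine spiders before you let anything add bans automatically.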

Jim