Forum Moderators: coopster
I am posting a revised version of this PHP spider trap [webmasterworld.com], because of a flaw that was recognized by our local Apache Web Server [webmasterworld.com] guru. Thanks, jdMorgan [webmasterworld.com]!
Basically, it needed file locking to prevent the .htaccess file from being opened by one request while another was still writing to it. This could happen on a busy server.
Also, before I move on, I'd like to extend the credit for this script to Key_Master [webmasterworld.com]. Key_Master posted the original bad bot script [webmasterworld.com], written in Perl. There is a modified version [webmasterworld.com] as well.
How it Works
When the file getout.php is accessed, it opens your .htaccess file and adds the visitor's (the bad bot's) IP address to the list of banned IPs.
Before you do anything, you'll need to disallow the file (getout.php) in your robots.txt file, since any decent bot should be reading and obeying that file. Do not use the spider trap for a few days after adding the robots.txt disallow. You have to give the good bots enough time to read the amended robots file. If you start using the trap right away, you stand a chance of banning good spiders!
Example robots.txt disallow:
User-agent: *
Disallow: /getout.php
Next, create a new folder in your root folder and name it /trap/. You can name it anything, really, but that's what I use in the script, so you'll need to alter the script if you name it differently.
Chmod your .htaccess file to 644 and chmod getout.php to 755. Put getout.php in the root folder, or simply change the robots.txt file to reflect the location of the file if you put it elsewhere.
Add these lines to your .htaccess file at the very top.
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>
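When the trap fires, getout.php writes a line like the following to the top of .htaccess, which works together with the block above to deny that address (the IP shown here is just a made-up example):

```apache
SetEnvIf Remote_Addr ^66\.249\.66\.1$ getout
```

The dots are backslash-escaped because the SetEnvIf pattern is a regular expression; the script does that escaping for you.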
Ok, now you are ready to add some invisible links to your pages to catch the misbehaving bots. Don't forget to wait a few days for the good bots to catch the updated robots.txt file.
You can use a 1x1 transparent .gif for your links like so:
<a href="/getout.php" onclick="return false">
<img src="/clear.gif" /></a>
There are other ways as well: CSS absolute positioning, or the display or visibility properties. jdMorgan also suggests adding links within <!--comment tags-->.
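To illustrate the CSS approaches just mentioned, any of these will keep the link in the markup for crawlers while hiding it from human visitors (a sketch; the inline styles could equally go in a stylesheet):

```html
<!-- absolute positioning, moved off-screen -->
<a href="/getout.php" style="position:absolute; left:-9999px;">trap</a>

<!-- display property -->
<a href="/getout.php" style="display:none;">trap</a>

<!-- visibility property -->
<a href="/getout.php" style="visibility:hidden;">trap</a>
```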
getout.php
Any PHP peeps out there, feel free to suggest ways to streamline this code :)
<?php
$lock_dir = $_SERVER["DOCUMENT_ROOT"] . "/trap/lock";
$filename = $_SERVER["DOCUMENT_ROOT"] . "/.htaccess";
$bad_bot_ip = str_replace(".", "\.", $_SERVER["REMOTE_ADDR"]);
$content = "SetEnvIf Remote_Addr ^" . $bad_bot_ip . "$ getout\r\n";

// mkdir() is atomic, so the lock directory acts as a mutex.
// Retry up to 20 times before giving up.
function make_lock_dir(){
    global $lock_dir;
    $key = @mkdir($lock_dir, 0777);
    $i = 0;
    while ($key === FALSE && $i++ < 20) {
        clearstatcache();
        usleep(rand(5,85));
        $key = @mkdir($lock_dir, 0777);
    }
    return $key;
}

// Prepend the ban line to .htaccess, then release the lock.
function write_ban(){
    global $filename, $bad_bot_ip, $content, $lock_dir;
    $handle = fopen($filename, 'r');
    $content .= fread($handle, filesize($filename));
    fclose($handle);
    $handle = fopen($filename, 'w+');
    fwrite($handle, $content, strlen($content));
    fclose($handle);
    rmdir($lock_dir);
    print "Goodbye!";
}

// If the lock is more than two minutes old, assume the process
// that created it died, remove the stale lock, and try again.
function stale_check(){
    global $lock_dir;
    if (fileatime($lock_dir) < time() - 120){
        rmdir($lock_dir);
        if (make_lock_dir() !== FALSE) write_ban();
    } else {
        exit;
    }
}

if (make_lock_dir() !== FALSE) {
    write_ban();
} else {
    stale_check();
}
?>
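To see exactly what the script writes, here is the line-building step in isolation, with a made-up sample address:

```php
<?php
// build the ban line exactly as getout.php does, for a sample address
$ip = "66.249.66.1";                      // example address, not a real ban
$escaped = str_replace(".", "\\.", $ip);  // escape dots for the SetEnvIf regex
$line = "SetEnvIf Remote_Addr ^" . $escaped . "$ getout\r\n";
echo $line;
// prints: SetEnvIf Remote_Addr ^66\.249\.66\.1$ getout
```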
Enjoy! Thanks to Key_Master and jdMorgan!
[edited by: jatar_k at 4:39 pm (utc) on June 29, 2004]
[edit reason] Birdman requested edit [/edit]
--
I was using the previous CGI script and thought I'd try out this PHP one. It seems to be missing one thing I really liked: sending me an email whenever an IP gets banned.
The old cgi had this added to it
# trying to send an e-mail message
open(MAIL, "|/usr/sbin/sendmail -t") || die
"Content-type: text/text\n\nCan't open /usr/sbin/sendmail!";
print MAIL "To: bannedbots\@SiteLance\.com\n";
print MAIL "From: chowbotbanner\@SiteLance\.com\n";
print MAIL "Subject: You caught another one!\n";
print MAIL "The ip address \^$remaddr\$ has been banned on $date \n";
print MAIL "The associated user agent was $usragnt\n";
close(MAIL);
Is that 'safe' or something one should not do?
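For what it's worth, here is a rough PHP equivalent of that Perl block, using PHP's built-in mail() function. This is just a sketch: the addresses are placeholders, and mail_ban_notice() is a hypothetical helper you'd call from write_ban() after a successful ban.

```php
<?php
// Build the notification body from the banned IP and its user agent.
function ban_notice_body($ip, $user_agent)
{
    return "The IP address " . $ip . " has been banned on " . date("r") . "\n"
         . "The associated user agent was " . $user_agent . "\n";
}

// Hypothetical helper, not part of the original script.
// Replace the placeholder addresses with your own.
function mail_ban_notice($ip, $user_agent)
{
    return mail("bannedbots@example.com",     // To:
                "You caught another one!",    // Subject:
                ban_notice_body($ip, $user_agent),
                "From: botbanner@example.com");
}
```

Like the Perl version, mail() only tells you the message was handed off to the local mailer, not that it was delivered. It should be safe as long as nothing user-supplied (like the user agent string) ends up in the headers, since header injection is the main risk with scripts like this.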
---------------------------------------------------
Blocking Badly Behaved Bots update [webmasterworld.com]
An update/fix for a very useful routine
[edited by: jatar_k at 10:06 pm (utc) on June 23, 2005]
[edit reason] added link to update [/edit]