Forum Moderators: coopster

Message Too Old, No Replies

Updated PHP Bad Bot Script

AKA: Spider Trap

         

Birdman

12:54 pm on Jun 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello everyone,

I am posting a revised version of this PHP spider trap [webmasterworld.com], because of a flaw that was recognized by our local Apache Web Server [webmasterworld.com] guru. Thanks, jdMorgan [webmasterworld.com]!

Basically, it needed file locking to prevent the .htaccess file from being opened by one bot's request while another was already writing to it. This could happen on a busy server.

Also, before I move on, I'd like to extend credit for this script to Key_Master [webmasterworld.com]. Key_Master posted the original bad bot script [webmasterworld.com], written in Perl. There is also a modified version [webmasterworld.com].

How it Works
When the file getout.php is accessed, it opens your .htaccess file and adds the visitor's (bad bot's) IP address to the list of banned IPs.

Before you do anything, you'll need to disallow the file (getout.php) in your robots.txt file. Any decent bot should read and obey this file. Do not use the spider trap for a few days after adding the robots.txt disallow; you have to give the good bots enough time to read the amended robots file. If you start using the trap right away, you stand a chance of banning good spiders!

Example robots.txt disallow:
User-agent: *
Disallow: /getout.php

Next, create a new folder in your root folder. Name it /trap/. You can name it anything really, but that's what I have in the script so you'll need to alter the script if you name it differently.

Chmod your .htaccess file to 644 and chmod getout.php to 755. You should put getout.php in the root folder. Or, simply change the robots.txt file to reflect the location of the file if you put it elsewhere.

Add these lines to your .htaccess file at the very top.
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>

Ok, now you are ready to add some invisible links to your pages to catch the misbehaving bots. Don't forget to wait a few days for the good bots to fetch the updated robots.txt file.

You can use a 1x1 transparent .gif for your links like so:
<a href="/getout.php" onclick="return false">
<img src="/clear.gif" /></a>

There are other ways as well, such as CSS absolute positioning, or the display or visibility properties. jdMorgan also suggests adding links within <!--comment tags-->.
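For instance, a CSS-hidden variant of the trap link might look like this (the class name here is made up for illustration):

```html
<!-- Hypothetical sketch: hide the trap link off-screen with CSS. -->
<style type="text/css">
  .trap { position: absolute; left: -9999px; }
</style>
<a class="trap" href="/getout.php">Do not follow</a>

<!-- Or, per jdMorgan's suggestion, inside an HTML comment,
     which some badly written bots will still parse for URLs: -->
<!-- <a href="/getout.php">trap</a> -->
```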

getout.php
Any PHP peeps out there, feel free to suggest ways to streamline this code :)

<?php

// A directory is used as a mutex: mkdir() is atomic, so only one
// request can hold the lock at a time.
$lock_dir = $_SERVER["DOCUMENT_ROOT"] . "/trap/lock";
$filename = $_SERVER["DOCUMENT_ROOT"] . "/.htaccess";

// Escape the dots so the IP is matched literally in the SetEnvIf regex.
$bad_bot_ip = str_replace(".", "\.", $_SERVER["REMOTE_ADDR"]);
$content = "SetEnvIf Remote_Addr ^" . $bad_bot_ip . "$ getout\r\n";

// Try to acquire the lock, retrying up to 20 times with a short
// random back-off between attempts.
function make_lock_dir(){
    global $lock_dir;
    $key = @mkdir($lock_dir, 0777);
    $i = 0;
    while ($key === FALSE && $i++ < 20) {
        clearstatcache();
        usleep(rand(5, 85));
        $key = @mkdir($lock_dir, 0777);
    }
    return $key;
}

// Prepend the ban line to .htaccess, then release the lock.
function write_ban(){
    global $filename, $content, $lock_dir;
    $handle = fopen($filename, 'r');
    $content .= fread($handle, filesize($filename));
    fclose($handle);
    $handle = fopen($filename, 'w+');
    fwrite($handle, $content, strlen($content));
    fclose($handle);
    rmdir($lock_dir);
    print "Goodbye!";
}

// If the lock is more than two minutes old, assume the request that
// created it died and steal the lock; otherwise give up quietly.
function stale_check(){
    global $lock_dir;
    if (fileatime($lock_dir) < time() - 120) {
        rmdir($lock_dir);
        if (make_lock_dir() !== FALSE) write_ban();
    } else {
        exit;
    }
}

if (make_lock_dir() !== FALSE) {
    write_ban();
} else {
    stale_check();
}

?>

Enjoy! Thanks to Key_Master and jdMorgan!

[edited by: jatar_k at 4:39 pm (utc) on June 29, 2004]
[edit reason] Birdman requested edit [/edit]

jdMorgan

5:40 am on Sep 5, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, combine them as in your second example. It won't work if you have more than one "Order" directive in .htaccess, so that's the determining factor.

Jim

kwasher

5:40 am on Sep 5, 2004 (gmt 0)

10+ Year Member



Thanks JP! (again)

One more quick question

If my robots.txt file has this...

User-agent: *
Disallow: /somedirectory/

And I put the getout.php file in /somedirectory/

Then I no longer need to put the getout.php file in the robots.txt because the directory itself is already banned, yes?

kwasher

6:53 am on Sep 5, 2004 (gmt 0)

10+ Year Member



Uh... I meant JD (thinking of JPMorgan the banker)

--

I was using the previous CGI script and thought I'd try out this php one. It seems to be missing one thing I really liked... sending me an email whenever an IP gets banned.

The old cgi had this added to it

# trying to send an e-mail message
open(MAIL, "|/usr/sbin/sendmail -t") || die
"Content-type: text/text\n\nCan't open /usr/sbin/sendmail!";
print MAIL "To: bannedbots\@SiteLance\.com\n";
print MAIL "From: chowbotbanner\@SiteLance\.com\n";
print MAIL "Subject: You caught another one!\n";
print MAIL "The ip address \^$remaddr\$ has been banned on $date \n";
print MAIL "The associated user agent was $usragnt\n";
close(MAIL);
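A rough PHP equivalent using mail() could be called from write_ban() after the .htaccess update. This is just a sketch: the addresses are placeholders, the function name is made up, and mail() requires a working mail setup on the server.

```php
<?php
// Hypothetical sketch of the same notification in PHP, using mail().
// Addresses below are placeholders; adjust them to your own.
function notify_ban($ip, $user_agent) {
    $to      = "bannedbots@example.com";
    $subject = "You caught another one!";
    $body    = "The IP address ^$ip$ has been banned on " . date("r") . "\n"
             . "The associated user agent was $user_agent\n";
    $headers = "From: botbanner@example.com";
    return mail($to, $subject, $body, $headers);
}

// e.g. inside write_ban(), just before the "Goodbye!" print:
// notify_ban($_SERVER["REMOTE_ADDR"], $_SERVER["HTTP_USER_AGENT"]);
?>
```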

kwasher

9:41 am on Sep 5, 2004 (gmt 0)

10+ Year Member



Ach! I had the same problem. .htaccess has to be chmodded to at least 666 (same thing for the htaccess.txt test file), and the directory holding getout.php has to be 777.

Is that 'safe' or something one should not do?

---------------------------------------------------

Blocking Badly Behaved Bots update [webmasterworld.com]
An update/fix for a very useful routine

[edited by: jatar_k at 10:06 pm (utc) on June 23, 2005]
[edit reason] added link to update [/edit]
