
bad-bot script: follow-up?

Script posted for bad-bots had problems. Were they resolved?

     
6:10 am on Dec 2, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 20, 2002
posts:735
votes: 1


There is a script posted in this thread:

Ban malicious visitors with this Perl Script [webmasterworld.com]

The script was designed to catch bad-bots, but it appears that it protects only the cgi-bin directory into which it is placed.

One of the last messages in the thread is from the author, saying that he'll get back with a solution, but the thread ends shortly afterwards, and is now closed.

Have the problems been resolved?

Thank you.

7:17 am on Dec 2, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


There is no follow-up solution because the script is solid. It sounds like you have something misconfigured. Make sure your .htaccess file resides in the root directory of your site and not in the cgi-bin.

7:36 am on Dec 2, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


stapel,

That script protects any part of your directory hierarchy under the control of the .htaccess file that it modifies.

The only issue I know of is that it needed a flock(HTACCESS,2); statement after the .htaccess file is opened (yeah, after - Perl works that way) in order to lock the file. The .htaccess file then needs to remain open while it is modified so that the lock is not released. Therefore, the file needs to be opened in read/write mode. This locking guarantees safe operation in a multiprocess environment - it prevents "simultaneous" instances of the script from attempting to modify the .htaccess file.

Here's the version I'm using now. I'm not a PERL whiz - I just hacked it until it met my requirements. Only the change cited above was really needed.


#!/usr/local/bin/perl

$htadir = $ENV{DOCUMENT_ROOT};
$htafile = "/\.htaccess";

# Form full pathname to .htaccess file
$htapath = "$htadir"."$htafile";

# Get the bad-bot's IP address, convert to regular-expressions (regex) format by escaping all
# periods.
$remaddr = $ENV{REMOTE_ADDR};
$remaddr =~ s/\./\\\./gi;

# Get User-agent & current time
$usragnt = $ENV{HTTP_USER_AGENT};
$date = scalar localtime(time);

# Open the .htaccess file and wait for an exclusive lock. This prevents multiple instances of this
# script from running past the flock statement, and prevents them from trying to read and write the
# file at the same time, which would corrupt it. When .htaccess is closed, the lock is released.
#
# Open existing .htaccess file in r/w append mode, lock it, rewind to start, read current contents
# into array.
open(HTACCESS,"+>>$htapath") || die $!;
flock(HTACCESS,2);
seek(HTACCESS,0,0);
@contents = <HTACCESS>;
# Empty existing .htaccess file, then write new IP ban line and previous contents to it
truncate(HTACCESS,0);
print HTACCESS ("SetEnvIf Remote_Addr \^$remaddr\$ getout \# $date $usragnt\n");
print HTACCESS (@contents);
# close the .htaccess file, releasing lock - allow other instances of this script to proceed.
close(HTACCESS);

# Write html output to server response
print ("Content-type: text/html\n\n");
print ("<html><head><title>Fatal Error</title></head>\n");
print ("<body text=\"#000000\" bgcolor=\"#FFFFFF\">\n");
print ("<p>Fatal error</p></body></html>\n");
exit;

Basic install:
Upload this file to your cgi-bin directory and name it trap.pl. Set permissions with chmod 755 (owner: rwx, group: r-x, world: r-x). Place a link on a 1x1-pixel transparent .gif in one or more of your pages, and point it to /about.cgi?id=13 or similar. In .htaccess, rewrite /about.cgi to /cgi-bin/trap.pl (see the sketch below). Disallow /about.cgi in robots.txt so good 'bots won't fetch it. (Be sure to post the updated robots.txt hours or even days before uploading and installing this script - some slow-cycle robots need a long time to read it!)
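The thread doesn't spell out the exact rewrite rule, so here is a minimal sketch using mod_rewrite, assuming the decoy URL /about.cgi and the trap.pl name used above - adjust the names to whatever you actually use:

# In the root .htaccess - send the decoy URL to the trap script
RewriteEngine On
RewriteRule ^about\.cgi$ /cgi-bin/trap.pl [L]

# In robots.txt - well-behaved robots will skip the decoy
User-agent: *
Disallow: /about.cgi

If mod_rewrite isn't available to you, an Alias or ScriptAlias mapping in the server config can accomplish the same thing.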

Add the following code to your .htaccess to block the IP address records written by the script:


# Block bad-bots using lines written by bad_bot.pl script above
SetEnvIf Request_URI "^(/403.*\.html|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>

How this works:
If a bot finds the link and ignores the disallow in robots.txt, it will attempt to fetch /about.cgi. .htaccess rewrites that request to trap.pl. trap.pl runs, grabs the bot's IP address, and converts it to regular-expressions format. It then opens your .htaccess file and writes a new line at the beginning: SetEnvIf (IP-address) getout, followed by a comment containing a timestamp and the user-agent. It then closes your .htaccess file and serves a very short html page (included in the script) to the requestor.
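For illustration, the line the script prepends to .htaccess looks something like this (the IP address, timestamp, and user-agent here are invented):

SetEnvIf Remote_Addr ^192\.0\.2\.44$ getout # Mon Dec  2 07:36:14 2002 EmailSiphon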

On the next request from that IP address, the new .htaccess file is processed, and the "deny from env=getout" directive blocks that IP address and any others added by the script.

Note that, due to the construction of the "deny from env=getout" section added to .htaccess, all requestors, including bad-bots, are still allowed to fetch robots.txt and files beginning with "403" and ending with ".html". This gives the bots a chance to read robots.txt, and it also allows them to fetch my custom error pages. This last item is important to prevent an "infinite request loop" once a bot is banned.
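As a sketch, the custom error page can be wired up with the standard ErrorDocument directive in .htaccess; the filename /403error.html is just an example chosen to match the allowsome pattern above:

# Serve a page the banned bot is still allowed to fetch
ErrorDocument 403 /403error.html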


This script works very nicely - Thanks Key_Master!

Jim