Blocking badly behaved runaway WebCrawlers

PHP solution that doesn't need a bad bot list - Identifies them on the fly.

         

xlcus

11:47 pm on Jan 11, 2003 (gmt 0)

10+ Year Member



I have a fairly processor intensive script on one of my sites which is fine most of the time, but when it gets many repeated hits in quick succession from badly behaved webcrawlers which don't honour my robots.txt file, it brings my server to its knees.

I needed a way to block these inconsiderate bots, many of which identified themselves as standard browsers, so an .htaccess blacklist wasn't much help. Besides, a list like that would need updating every time a new bad bot was spotted.

I came up with a small bit of PHP code to put at the start of a script. It detects rapid repeated accesses from a particular IP address and then blocks that IP until the bombardment stops...

$itime = 10;     // Minimum average number of seconds between visits
$ipenalty = 60;  // Seconds before visitor is allowed back
$imaxvisit = 42; // Maximum visits allowed within $itime*$imaxvisit seconds
$iplogdir = "/sites/my.site.com/iplog/";

// One of 256 possible log files, picked from the last 2 hex digits
// of the MD5 hash of the visitor's IP address
$ipfile = substr(md5($_SERVER["REMOTE_ADDR"]), -2);
$oldtime = 0;
if (file_exists($iplogdir.$ipfile)) $oldtime = filemtime($iplogdir.$ipfile);

$time = time();
if ($oldtime < $time) $oldtime = $time;
$newtime = $oldtime + $itime;

// If the stored timestamp has drifted too far ahead of real time,
// apply the penalty and send the visitor away with a 503
if ($newtime >= $time + $itime*$imaxvisit)
{
    touch($iplogdir.$ipfile, $time + $itime*($imaxvisit-1) + $ipenalty);
    header("HTTP/1.0 503 Service Temporarily Unavailable");
    header("Connection: close");
    header("Content-Type: text/html");
    echo "<html><body><p><b>Server under heavy load</b><br>";
    echo "Please wait $ipenalty seconds and try again</p></body></html>";
    exit();
}
touch($iplogdir.$ipfile, $newtime);
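
For example, if you save the snippet as throttle.php (the filename and include path are just my own choice, call it whatever you like), the heavy script only needs to pull it in before doing any real work:

<?php
// throttle.php holds the rate-limiting snippet above
require '/sites/my.site.com/includes/throttle.php'; // sends a 503 and exits if the visitor is over the limit

// ...the processor-intensive part of the page only runs for
// visitors who made it past the check...
?>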

Notes...

  • $iplogdir needs to be a directory that's writable by the web server.
  • $itime is the minimum number of seconds between visits, averaged over a window of $itime*$imaxvisit seconds. So in the above example, a visitor isn't blocked for hitting the script several times in the first 10 seconds, as long as they don't visit more than 42 times within 420 seconds.
  • If the limit is reached, $ipenalty is the number of seconds the visitor has to wait before they are allowed back.

How it works...

For each visitor, an MD5 hash is made of their IP address, and the last 2 hex digits of it are used to pick one of 256 possible filenames. If this is a new visitor, or one who hasn't been seen for a while, the timestamp of that file is set to the current time; otherwise they must be a recent visitor, and the timestamp is increased by $itime. If they start loading the script more rapidly than once every $itime seconds, the timestamp on their IP's hashed file will increase faster than the actual time does. Once the timestamp gets too far ahead of the current time, they're branded a bad visitor and the penalty is applied by pushing the timestamp on their file even further ahead.

$itime, $ipenalty, $imaxvisit can be tweaked to fit your own traffic patterns.

Hope someone else finds my script useful. :) If you have any questions, ask away...

Giacomo

11:50 pm on May 28, 2003 (gmt 0)

10+ Year Member Top Contributors Of The Month



Derek T,

You're right, I forgot to modify the minimum number of seconds between requests. For Googlebot and other "friendly bots", this should read something like

$botbanner_itime = 5; // Minimum number of seconds between visits
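
If you're patching the original snippet rather than the updated script, the same idea is just a quick user-agent check before the rate maths, something along these lines (a rough sketch only, adjust the patterns to taste):

<?php
// Give known friendly crawlers a looser minimum interval than ordinary visitors
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$friendly = (stripos($ua, 'Googlebot') !== false ||
             stripos($ua, 'Slurp') !== false);      // example patterns only

$itime = $friendly ? 5 : 10;   // 5 seconds between visits for friendly bots, 10 for everyone else
?>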

Digger

6:46 pm on May 29, 2003 (gmt 0)

10+ Year Member



Thanks for the script and implementation instructions, Giacomo!

________________________

updated version
Blocking badly behaved bots [webmasterworld.com]

[edited by: jatar_k at 10:21 pm (utc) on Mar. 14, 2005]
[edit reason] added link [/edit]
