Now, because my site is on a temporary server and the main server is offline (though still connected to the web) whilst it gets an upgrade, I have had a chance to track down the flaw in the logic.
$newTime was being updated with the wrong value, causing even perfectly well-behaved bots to get blocked if they crawled the site for long enough.
Here are some figures using the W3C link-validator [validator.w3.org], which crawls at 1 hit/sec:
iMaxVisit = 10
iTime = 10 (seconds)
$newTime = $oldTime + $iTime;
oldT-test = ( $oldTime - $time - ( $iTime * $iMaxVisit ))

01:48:35 oldT=01:50:32 newT=01:50:42 oldT-test=17 Blocked
01:48:32 oldT=01:50:31 newT=01:50:41 oldT-test=19 Blocked
01:48:31 oldT=01:50:18 newT=01:50:28 oldT-test=7 Blocked
01:48:29 oldT=01:50:08 newT=01:50:18 oldT-test=-1
01:48:28 oldT=01:49:58 newT=01:50:08 oldT-test=-10
01:48:26 oldT=01:49:48 newT=01:49:58 oldT-test=-18
01:48:25 oldT=01:49:38 newT=01:49:48 oldT-test=-27
01:48:21 oldT=01:49:28 newT=01:49:38 oldT-test=-33
01:48:20 oldT=01:49:18 newT=01:49:28 oldT-test=-42
01:48:19 oldT=01:49:08 newT=01:49:18 oldT-test=-51
01:48:15 oldT=01:48:58 newT=01:49:08 oldT-test=-57
01:48:14 oldT=01:48:48 newT=01:48:58 oldT-test=-66
01:48:13 oldT=01:48:38 newT=01:48:48 oldT-test=-75
01:48:10 oldT=01:48:28 newT=01:48:38 oldT-test=-82
01:48:09 oldT=01:48:18 newT=01:48:28 oldT-test=-91
01:48:08 oldT=01:48:08 newT=01:48:18 oldT-test=-100
You can see that it gets blocked even though it is behaving itself. The problem is that each hit adds the full $iTime (10 secs) to $oldTime while real time advances by only 1 sec, so $oldTime races ahead of $time by 9 secs per hit and crosses the block threshold ( $iTime * $iMaxVisit = 100 secs ) after about a dozen hits. Changing the logic of the $newTime setting fixes it:
iMaxVisit = 10
iTime = 10
$newTime = $oldTime + ( $iTime / $iMaxVisit );

02:06:49 oldT=02:06:49 newT=02:06:50 oldT-test=-100
02:06:48 oldT=02:06:48 newT=02:06:49 oldT-test=-100
02:06:44 oldT=02:06:44 newT=02:06:45 oldT-test=-100
02:06:43 oldT=02:06:43 newT=02:06:44 oldT-test=-100
02:06:42 oldT=02:06:42 newT=02:06:43 oldT-test=-100
02:06:40 oldT=02:06:40 newT=02:06:41 oldT-test=-100
02:06:39 oldT=02:06:39 newT=02:06:40 oldT-test=-100
02:06:37 oldT=02:06:37 newT=02:06:38 oldT-test=-100
02:06:35 oldT=02:06:35 newT=02:06:36 oldT-test=-100
02:06:34 oldT=02:06:34 newT=02:06:35 oldT-test=-100
02:06:33 oldT=02:06:33 newT=02:06:34 oldT-test=-100

iMaxVisit = 5
iTime = 10
$newTime = $oldTime + ( $iTime / $iMaxVisit );

02:15:26 oldT=02:15:33 newT=02:15:35 oldT-test=-43
02:15:25 oldT=02:15:31 newT=02:15:33 oldT-test=-44
02:15:24 oldT=02:15:29 newT=02:15:31 oldT-test=-45
02:15:21 oldT=02:15:27 newT=02:15:29 oldT-test=-44
02:15:20 oldT=02:15:25 newT=02:15:27 oldT-test=-45
02:15:19 oldT=02:15:23 newT=02:15:25 oldT-test=-46
02:15:17 oldT=02:15:21 newT=02:15:23 oldT-test=-46
02:15:16 oldT=02:15:19 newT=02:15:21 oldT-test=-47
02:15:15 oldT=02:15:17 newT=02:15:19 oldT-test=-48
02:15:13 oldT=02:15:15 newT=02:15:17 oldT-test=-48
02:15:12 oldT=02:15:13 newT=02:15:15 oldT-test=-49
02:15:11 oldT=02:15:11 newT=02:15:13 oldT-test=-50
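
If you want to see the difference without waiting on a crawler, here is a minimal stand-alone simulation - illustrative only, not part of the routine - that replays a 1 hit/sec crawl through both formulas, using the same block test as the routine:

// Illustrative simulation only - not part of the blocking routine.
// Replays a crawl at 1 hit/sec through both $newTime formulas and applies
// the routine's block test: $oldTime >= $time + ( $iTime * $iMaxVisit ).
$iTime     = 10;    // secs; check interval
$iMaxVisit = 10;    // maximum visits allowed within $iTime

foreach( array( 'old', 'new' ) as $formula ) {
    $oldTime = 0;
    $blocked = FALSE;
    for( $time = 1; $time <= 60; $time++ ) {    // one hit per second
        if( $oldTime < $time ) { $oldTime = $time; }
        if( $oldTime >= $time + ( $iTime * $iMaxVisit )) {
            echo "$formula formula: blocked at hit $time\n";
            $blocked = TRUE;
            break;
        }
        $oldTime = ( $formula == 'old' )
            ? $oldTime + $iTime                      // buggy: adds the full interval per hit
            : $oldTime + ( $iTime / $iMaxVisit );    // fixed: adds interval/maxvisit per hit
    }
    if( !$blocked ) { echo "$formula formula: never blocked in 60 hits\n"; }
}
// prints: old formula: blocked at hit 13
//         new formula: never blocked in 60 hits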
And here is the modified routine:
// -------------- Start blocking badly-behaved bots -------
$oldSetting = ignore_user_abort( TRUE );
$remote     = $_SERVER[ 'REMOTE_ADDR' ];
if(( substr( $remote, 0, 10 ) == '66.249.64.' ) or   // Google has blocks 64.233.160.0 - 64.233.191.255
   ( substr( $remote, 0, 10 ) == '66.249.65.' ) or   // Google has blocks 66.249.64.0 - 66.249.95.255
   ( substr( $remote, 0, 10 ) == '66.249.66.' ) or   // Google has blocks 72.14.192.0 - 72.14.207.255
   ( substr( $remote, 0, 9 ) == '216.239.3' ) or     // Google has blocks 216.239.32.0 - 216.239.63.255
   ( substr( $remote, 0, 9 ) == '216.239.4' ) or
   ( substr( $remote, 0, 9 ) == '216.239.5' ) or
   ( substr( $remote, 0, 10 ) == '65.54.188.' ) or   // MS has blocks 65.52.0.0 - 65.55.255.255
   ( substr( $remote, 0, 10 ) == '207.46.98.' ) or   // MS has blocks 207.46.0.0 - 207.46.255.255
   ( substr( $remote, 0, 13 ) == '66.194.55.242' )   // Ocelli
  ) {
    // let well-behaved bots through
} else {
    $iTime      = 5;     // secs; check interval
    $iMaxVisit  = 20;    // maximum visits allowed within $iTime
    $iPenalty   = 60;    // seconds before visitor is allowed back
    $ipLength   = 3;     // integer; 2 = 256 files, 3 = 4,096 files
    $ipLogFile  = _B_DIRECTORY . _B_LOGFILE;
    $ipFile     = _B_DIRECTORY . substr( md5( $remote ), -$ipLength );
    $logLine    = '';
    $newTime    = 0;
    $time       = time();
    $oldTime    = ( file_exists( $ipFile ))
        ? filemtime( $ipFile )
        : 0;
    if( $oldTime < $time ) { $oldTime = $time; }
    // NB: $iTime / $iMaxVisit should be >= 1 sec; touch() truncates to whole
    // seconds, so a fractional increment would stall the counter
    $newTime = $oldTime + ( $iTime / $iMaxVisit );
    if( $oldTime >= $time + ( $iTime * $iMaxVisit )) {
        touch( $ipFile, $time + ( $iTime * ( $iMaxVisit - 1 )) + $iPenalty );
        header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
        header( 'Connection: close' );
        header( 'Content-Type: text/html' );
        echo "<html><body><p><b>Server under heavy load</b><br />";
        echo "More than $iMaxVisit visits from your IP-Address within the last $iTime secs. Please wait $iPenalty secs before retrying.</p></body></html>";
        $useragent = ( isset( $_SERVER[ 'HTTP_USER_AGENT' ]))
            ? $_SERVER[ 'HTTP_USER_AGENT' ]
            : '<unknown user agent>';
        $logLine = "$remote " . date( 'd/m/Y H:i:s' ) . " $useragent\n";
        $log     = ( file_exists( $ipLogFile )) ? file( $ipLogFile ) : array();    // file() warns on a missing file
        if( $fp = fopen( $ipLogFile, 'a' )) {          // a tiny danger of 2 threads interfering; live with it
            if( count( $log ) >= _B_LOGMAXLINES ) {    // otherwise grows like Topsy
                fclose( $fp );                         // flock() is disabled in some linux kernels (eg 2.4)
                array_shift( $log );                   // fopen, fclose put as close together as possible
                array_push( $log, $logLine );
                $logLine = implode( '', $log );
                $fp = fopen( $ipLogFile, 'w' );
            }
            fputs( $fp, $logLine );
            fclose( $fp );
        }
        exit();
    }
    touch( $ipFile, $newTime );
}
ignore_user_abort( $oldSetting );
// -------------- Stop blocking badly-behaved bots --------
Sorry for such a long post.
The final item now (possibly) is to reverse the use of atime & mtime ... I'll run the routine for another week to give it a good test
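
For anyone puzzled by the atime & mtime idea: a file's two timestamps can act as two free counters, because touch() sets them independently. A tiny illustration (the file name is just for the demo):

// Illustration only: storing two counters in one file's timestamps.
$file = '/tmp/bot-track-demo';    // demo file name
$now  = time();
touch( $file, $now, $now + 5 );   // mtime = start of window, atime = mtime + visit count
clearstatcache();                 // PHP caches stat() results
$visits   = fileatime( $file ) - filemtime( $file );    // = 5 visits
$duration = time() - filemtime( $file );                // secs since the window opened
echo "$visits visits in $duration secs\n";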
What is interesting here is that this character browses from my home-town (Nottingham, UK).

Blocked IPs:
* 62.254.0.30 [ nott-cache-5.server.ntli.net ] 1000 line(s)
62.254.0.30 09/05/2005 17:13:45 (11)
62.254.0.30 09/05/2005 17:13:44 (11)
62.254.0.30 09/05/2005 17:13:43 (3)
62.254.0.30 09/05/2005 17:13:37 (9)
62.254.0.30 09/05/2005 17:13:36 (11)
62.254.0.30 09/05/2005 17:13:35 (16)
62.254.0.30 09/05/2005 17:13:34 (11)
62.254.0.30 09/05/2005 17:13:33 (7)
//-------------------------------------------------------------------------------------
So, here is the re-written routine, incorporating all amendments:
The items prepended with an underscore are Constants that need to be define()'d somewhere in your script before this snippet of code gets used:
eg:
define( '_B_DIRECTORY', '/full/path/on/server/' );
define( '_B_LOGFILE', 'logfile.name' );
define( '_B_LOGMAXLINES', '1000' );
These Constants can just as well be variables, or literal values within the code - your choice.
Directory permissions: `_B_DIRECTORY` needs to be read-writeable by the apache-group.
Both $ipLogFile and $ipFile are created on-the-fly if not already existing.
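
A quick one-off check - my addition, not part of the routine - saves head-scratching if the permissions are wrong:

// One-off setup check (illustrative): confirms the tracking directory
// exists and the web server can actually write to it.
if( !is_dir( _B_DIRECTORY ) or !is_writable( _B_DIRECTORY )) {
    exit( 'Bot-blocker: ' . _B_DIRECTORY . ' is missing or not writable by the apache user' );
}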
//----------------Start-blocking-badly-behaved-bots---------------------------------------------
$oldSetting = ignore_user_abort( TRUE );
$remote     = $_SERVER[ 'REMOTE_ADDR' ];
$bInterval  = 10;       // secs; check interval (best < 30 secs)
$bMaxVisit  = 20;       // maximum visits allowed within $bInterval
$bPenalty   = 60;       // seconds before visitor is allowed back
$bTotVisit  = 500;      // total visits allowed within a 24-hr period
$bTotBlock  = 42200;    // secs; period to block long-duration scrapers
$ipLength   = 3;        // integer; 2 = 256 files, 3 = 4,096 files
$ipLogFile  = _B_DIRECTORY . _B_LOGFILE;
$ipFile     = _B_DIRECTORY . substr( md5( $remote ), -$ipLength );
$logLine    = '';
$time       = time();
$fileATime  = $time;    // access time: tracks visits
$fileMTime  = $time;    // modification time: tracks duration
if( file_exists( $ipFile )) {
    $fileATime = fileatime( $ipFile );
    $fileMTime = filemtime( $ipFile );
    // the following test keeps the tracking going, to catch slow scrapers
    if((( $time - $fileATime ) > $bInterval ) or (( $time - $fileMTime ) > 86400 )) {    // 86400 secs = 24 hrs
        $fileMTime = $fileATime = $time;
    }
    $fileATime++;
    $visits   = $fileATime - $fileMTime;
    $duration = $time - $fileMTime;    // secs
    if( $duration < 1 ) $duration = 1;
    $useragent = ( isset( $_SERVER[ 'HTTP_USER_AGENT' ]))
        ? $_SERVER[ 'HTTP_USER_AGENT' ]
        : '<unknown user agent>';
    // test for fast scrapers
    if(( $visits >= $bMaxVisit ) and (( $visits / $duration ) > ( $bMaxVisit / $bInterval ))) {
        $fileMTime = $time = $time - $bInterval;
        $fileATime = $time + $bMaxVisit + (( $bMaxVisit * $bPenalty ) / $bInterval );
        header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
        header( 'Connection: close' );
        header( 'Content-Type: text/html' );
        echo "<html><body><p><b>Server under heavy load</b><br />";
        echo "$visits visits from your IP-Address within the last $duration secs. Please wait $bPenalty secs before retrying.</p></body></html>";
        $logLine = "$remote " . date( 'd/m/Y H:i:s' ) . " $useragent\n";
    } elseif( $visits >= $bTotVisit ) {    // test for slow scrapers
        $fileMTime = $time = $time - $bInterval;
        $fileATime = $time + $bMaxVisit + (( $bMaxVisit * $bTotBlock ) / $bInterval );
        header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
        header( 'Connection: close' );
        header( 'Content-Type: text/html' );
        echo "<html><body><p><b>Server under undue load</b><br />";
        echo "$visits visits from your IP-Address within the last 24 hours. Please wait " . (( int )( $bTotBlock / 3600 )) . " hours before retrying.</p></body></html>";
        $logLine = "$remote " . date( 'd/m/Y H:i:s' ) . " $useragent (slow scraper)\n";
    }
    // log badly-behaved bots, then nuke 'em
    if( $logLine ) {
        touch( $ipFile, $fileMTime, $fileATime );
        $log = ( file_exists( $ipLogFile )) ? file( $ipLogFile ) : array();    // file() warns on a missing file
        if( $fp = fopen( $ipLogFile, 'a' )) {          // tiny danger of 2 threads interfering; live with it
            if( count( $log ) >= _B_LOGMAXLINES ) {    // otherwise grows like Topsy
                fclose( $fp );                         // flock() disabled in some kernels (eg 2.4)
                array_shift( $log );                   // fopen, fclose put as close together as possible
                array_push( $log, $logLine );
                $logLine = implode( '', $log );
                $fp = fopen( $ipLogFile, 'w' );
            }
            fputs( $fp, $logLine );
            fclose( $fp );
        }
        exit();
    }
}
touch( $ipFile, $fileMTime, $fileATime );
ignore_user_abort( $oldSetting );
//----------------Stop-blocking-badly-behaved-bots----------------------------------------------
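
If you want to see what the routine currently holds for a given visitor, the two timestamps read straight back; here is a small helper of my own (the name and output format are illustrative):

// Hypothetical debugging helper: reads the tracking file for one IP and
// decodes the two counters the routine keeps in its timestamps.
function _b_inspect( $remote, $ipLength = 3 ) {
    $ipFile = _B_DIRECTORY . substr( md5( $remote ), -$ipLength );
    if( !file_exists( $ipFile )) { echo "$remote: no tracking file\n"; return; }
    clearstatcache();    // stat results are cached per request
    $visits   = fileatime( $ipFile ) - filemtime( $ipFile );    // atime - mtime = visit count
    $duration = time() - filemtime( $ipFile );                  // now - mtime = window age in secs
    echo "$remote: $visits visits in $duration secs\n";
}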
1) Re: DDoS: Blocking single IPs won't be effective if the attacker has enough remote machines. “Researchers estimate that the number of zombie machines in botnets increases by 300,000 to 350,000 every month,” notes Gostev, and “the total number of zombies is estimated at several million.” (http://esj.com/enterprise/article.aspx?EditorialsID=1364)
2) Even if I block 10,000 bots, how fast can a webserver serve up the "Access Denied" message? In some cases, I wonder if one could programmatically tell one's firewall to send ICMP redirects (back to the originating machine) or ICMP "unreachable" messages... not sure how that would play with routing "valid" traffic, though.
3) If I am not worried about someone making a copy for their local browsing and they have a fast mirroring tool, I'd "try" to base denial of service on my webserver's load. It seems unnecessary to block fast requesters if one's webserver is lightly loaded.
4) Instead of serving up scripted pages with every page-load, one might use "squid" in its webserver-accelerator mode to serve up static content. With judicious use of the last-modified date, squid could reload a dynamic page only when the underlying database of content has changed. I.e. - if a dynamic page is database-driven, only have squid update its cached, static page when the database causes the page to change. If there is a need to serve up random/rotating "ads", maybe the squid-cached page could use a dynamic element only for the "ad". This is especially useful if one is serving up ads from a 3rd-party ad-placement service. Alternatively, one could force squid to reload the page every "N" seconds to pick up a new "ad": "N" could be low when server load is low, and higher when server load is high.
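
To make that last-modified idea concrete, here is a minimal PHP sketch - the get_db_last_change_time() function is a hypothetical stand-in for however your database records its last change:

// Sketch only: send Last-Modified from the database's own change time so an
// upstream cache (e.g. squid in accelerator mode) can keep serving its copy.
$dbChanged = get_db_last_change_time();    // hypothetical: unix time the content last changed
$since     = isset( $_SERVER[ 'HTTP_IF_MODIFIED_SINCE' ])
    ? strtotime( $_SERVER[ 'HTTP_IF_MODIFIED_SINCE' ])
    : 0;
if( $since >= $dbChanged ) {
    header( 'HTTP/1.0 304 Not Modified' );    // the cache's copy is still good
    exit();
}
header( 'Last-Modified: ' . gmdate( 'D, d M Y H:i:s', $dbChanged ) . ' GMT' );
// ... build and output the page as usual ...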
5) I might consider that IPs aren't always constant -- especially for home users. Even some DSL ISPs use DHCP, and a given machine may have a new IP every time it reboots. IP reuse by different users should be considered when deciding on "penalties". I.e. - a ban of minutes is probably harmless, but banning for days is another matter -- suppose one or more of my favorite site-users gets accidentally put on a "banned list" -- not the end of the world, but an inconvenience, nevertheless.
6) If the webserver runs a script every time to decide access, isn't that a separate process creation, a file access, and a reply with content? If one's webserver supports access controls by directory, it might be more efficient to modify an access file like ".htaccess" on the fly. Apache "should" be more efficient at blocking access than dynamically serving up one's own "access denied" page. A return code of "503 - Service Temporarily Unavailable" seems like a good choice for temporary blocking.
Hope I haven't written too much -- was just some random thoughts I had when reading this thread... :-)
Linda
First issue: do you have bandwidth costs for your site? If `yes' (and you run PHP) then the script is useful. If `no' (or you do not have access to PHP) then it is academic, for at least *one* of its principal values.
There are 2 main uses for the script:
1) Re: DDoS: Blocking single IPs won't be effective if the attacker has enough remote machines.

Absolutely true - the script as it stands is useless against this. The place to stop such things (if at all possible) is at the firewall.
3) ... I'd "try" to base denial of service on my webserver's load.

Here is a snippet of script to do that:
/*
 * _freebsd_loadavg() - Gets the max() system load average from uptime(1)
 *
 * The max() Load Average will be returned
 */
function _freebsd_loadavg() {
    $buffer = `uptime`;
    ereg( "averag(es|e): ([0-9][.][0-9][0-9]),[ ]*([0-9][.][0-9][0-9]),[ ]*([0-9][.][0-9][0-9]*)", $buffer, $load );
    return max(( float ) $load[ 2 ], ( float ) $load[ 3 ], ( float ) $load[ 4 ]);
}// _freebsd_loadavg()

/*
 * _linux_loadavg() - Gets the max() system load average from /proc/loadavg
 *
 * The max() Load Average will be returned
 */
function _linux_loadavg() {
    $buffer = '0 0 0';
    if( $f = fopen( '/proc/loadavg', 'r' )) {    // guard against fopen() failure
        if( !feof( $f )) $buffer = fgets( $f, 1024 );
        fclose( $f );
    }
    $load = explode( ' ', $buffer );
    return max(( float ) $load[ 0 ], ( float ) $load[ 1 ], ( float ) $load[ 2 ]);
}// _linux_loadavg()
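
And a hedged sketch of how they might be wired in - the threshold and the OS test are my illustration, not something proven in this thread:

// Illustrative only: refuse expensive page-builds when the box is busy.
// The threshold and OS detection below are assumptions; tune to taste.
$maxLoad = 5.0;
$load    = ( PHP_OS == 'Linux' ) ? _linux_loadavg() : _freebsd_loadavg();
if( $load > $maxLoad ) {
    header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
    header( 'Retry-After: 60' );
    exit( 'Server under heavy load - please retry shortly.' );
}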
6) ... it might be more efficient to modify an access file like ".htaccess" on the fly.

Modifying it is easy... de-modifying it is another matter.
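
For completeness, a rough sketch of that on-the-fly idea - the path and the marker format are hypothetical - which shows why de-modifying is the hard half: something else has to come back later and strip the lines out again.

// Hypothetical: append a per-IP deny to .htaccess, with a marker line so
// a separate cron job can strip expired entries later ("de-modifying").
$remote   = $_SERVER[ 'REMOTE_ADDR' ];
$htaccess = '/full/path/to/.htaccess';    // hypothetical path
$entry    = '# banned-until:' . ( time() + 3600 ) . "\ndeny from $remote\n";
if( $fp = fopen( $htaccess, 'a' )) {
    fputs( $fp, $entry );
    fclose( $fp );
}
// the cron script re-reads the file and drops each marker + deny pair
// whose banned-until timestamp has passed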
The point is that this script works and is easy. Other solutions are certainly possible. Why do you not explore them, and present your (improved) solution? That way, we all improve. I, for one, will certainly look forward to it.
#1 Where do you put the script - in the root?
I've got a file of pre-written sub-routines (function()s) which is included at the top of each web-script:
require_once( '/server/path/to/include.file' );
The routine in msg#36 is at the top of this include file.
#2 How is it called? Is it linked anywhere?

This question is answered by the above: as soon as any web-script is called by a browser, the include.file runs, and with it the blocking routine. Put the require_once() as close to the top of each script as possible, so that any blocking happens before the page does any real work.
I am just wary of catching the search-bots that I actually want crawling the site.
Blocked IPs:
* 66.249.66.172 [ crawl-66-249-66-172.googlebot.com ] 128 line(s)
128 total lines in log-file.
Log Line Lines
66.249.66.172 21/06/2005 01:48:41 1
66.249.66.172 21/06/2005 01:48:31 1
66.249.66.172 21/06/2005 01:48:21 1
66.249.66.172 21/06/2005 01:48:11 1
66.249.66.172 21/06/2005 01:47:57 1
66.249.66.172 21/06/2005 01:47:47 1
66.249.66.172 21/06/2005 01:47:37 1
66.249.66.172 21/06/2005 01:47:26 1
66.249.66.172 21/06/2005 01:47:16 1
66.249.66.172 21/06/2005 01:47:03 1
66.249.66.172 21/06/2005 01:46:52 1
66.249.66.172 21/06/2005 01:46:39 1
66.249.66.172 21/06/2005 01:46:25 1
...
66.249.66.172 21/06/2005 01:24:17 1
66.249.66.172 21/06/2005 01:24:06 1
66.249.66.172 21/06/2005 01:23:53 1
66.249.66.172 21/06/2005 01:23:43 1
66.249.66.172 21/06/2005 01:23:32 1
66.249.66.172 21/06/2005 01:23:21 (slow scraper) 1
I guess that $bTotVisit = 500 (total visits allowed within a 24-hr period) is too low. At one request every 10-14 secs, Googlebot alone can rack up well over 6,000 requests a day, so it hits a 500-visit ceiling within a couple of hours.
Still, all functions are now proven. They all work fine.
There's more than one kind of GBot, identified both by the user-agent string and by their behaviour:
1 HTTP/1.0 Googlebot/2.1 (+http://www.google.com/bot.html)
2 HTTP/1.1 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Here is the latest sequence on IP 66.249.65.232 that got this [#2] b*stard blocked, with:

$bInterval = 7; // secs; check interval (best < 30 secs)
$bMaxVisit = 14; // Maximum visits allowed within $bInterval

i.e. 14 or more requests arriving faster than 2 requests/sec get caught - and the log below shows it running at over 2 requests/sec:
[26/Jun/2005:06:15:38 +0100] "GET /search.php?start=5454&next=1 HTTP/1.1" 200 7348
[26/Jun/2005:06:15:39 +0100] "GET /search.php?start=4418&next=1 HTTP/1.1" 200 7361
[26/Jun/2005:06:15:39 +0100] "GET /search.php?start=3497&next=1 HTTP/1.1" 200 7260
[26/Jun/2005:06:15:40 +0100] "GET /search.php?start=5354&prev=1 HTTP/1.1" 200 7451
[26/Jun/2005:06:15:40 +0100] "GET /search.php?eeprom=245625-01&macro=9&with=1 HTTP/1.1" 200 6452
[26/Jun/2005:06:15:41 +0100] "GET /mfcs.php?mid=118&nid=13945 HTTP/1.1" 200 6366
[26/Jun/2005:06:15:41 +0100] "GET /search.php?start=5454&prev=1 HTTP/1.1" 200 7344
[26/Jun/2005:06:15:41 +0100] "GET /search.php?start=3267&next=1 HTTP/1.1" 200 7180
[26/Jun/2005:06:15:42 +0100] "GET /search.php?eeprom=PCMCIA%5C1456VQC_DATA%20FAX_PCMCIA_MODEM-C17A&with=1 HTTP/1.1" 200 6402
[26/Jun/2005:06:15:42 +0100] "GET /search.php?start=4688&next=1 HTTP/1.1" 200 7412
[26/Jun/2005:06:15:43 +0100] "GET /search.php?start=9372&prev=1 HTTP/1.1" 200 7243
[26/Jun/2005:06:15:43 +0100] "GET /search.php?start=2517&prev=1 HTTP/1.1" 200 7620
[26/Jun/2005:06:15:44 +0100] "GET /search.php?start=5505&next=1 HTTP/1.1" 200 7399
[26/Jun/2005:06:15:44 +0100] "GET /search.php?start=4354&prev=1 HTTP/1.1" 503 146
The discussion continues in Blocking Badly Behaved Bots #3 [webmasterworld.com]