Forum Moderators: coopster
Now, because my site is on a temporary server while the main server (still connected to the web) gets an upgrade, I have had a chance to find out the flaw in the logic.
The $newTime value was updated incorrectly, causing perfectly well-behaved bots always to get blocked if they crawled the site for long enough.
Here are some figures using the W3C link-validator [validator.w3.org], which crawls at 1 hit/sec:
iMaxVisit = 10
iTime = 10 (seconds)
$newTime = $oldTime + $iTime;
oldT-test = $oldTime - $time - ( $iTime * $iMaxVisit )
01:48:35 oldT=01:50:32 newT=01:50:42 oldT-test=17 Blocked
01:48:32 oldT=01:50:31 newT=01:50:41 oldT-test=19 Blocked
01:48:31 oldT=01:50:18 newT=01:50:28 oldT-test=7 Blocked
01:48:29 oldT=01:50:08 newT=01:50:18 oldT-test=-1
01:48:28 oldT=01:49:58 newT=01:50:08 oldT-test=-10
01:48:26 oldT=01:49:48 newT=01:49:58 oldT-test=-18
01:48:25 oldT=01:49:38 newT=01:49:48 oldT-test=-27
01:48:21 oldT=01:49:28 newT=01:49:38 oldT-test=-33
01:48:20 oldT=01:49:18 newT=01:49:28 oldT-test=-42
01:48:19 oldT=01:49:08 newT=01:49:18 oldT-test=-51
01:48:15 oldT=01:48:58 newT=01:49:08 oldT-test=-57
01:48:14 oldT=01:48:48 newT=01:48:58 oldT-test=-66
01:48:13 oldT=01:48:38 newT=01:48:48 oldT-test=-75
01:48:10 oldT=01:48:28 newT=01:48:38 oldT-test=-82
01:48:09 oldT=01:48:18 newT=01:48:28 oldT-test=-91
01:48:08 oldT=01:48:08 newT=01:48:18 oldT-test=-100
You can see that it gets blocked, even though there should be no problem. Changing the logic of the $newTime setting fixes it:
iMaxVisit = 10
iTime = 10
$newTime = $oldTime + ( $iTime / $iMaxVisit );
02:06:49 oldT=02:06:49 newT=02:06:50 oldT-test=-100
02:06:48 oldT=02:06:48 newT=02:06:49 oldT-test=-100
02:06:44 oldT=02:06:44 newT=02:06:45 oldT-test=-100
02:06:43 oldT=02:06:43 newT=02:06:44 oldT-test=-100
02:06:42 oldT=02:06:42 newT=02:06:43 oldT-test=-100
02:06:40 oldT=02:06:40 newT=02:06:41 oldT-test=-100
02:06:39 oldT=02:06:39 newT=02:06:40 oldT-test=-100
02:06:37 oldT=02:06:37 newT=02:06:38 oldT-test=-100
02:06:35 oldT=02:06:35 newT=02:06:36 oldT-test=-100
02:06:34 oldT=02:06:34 newT=02:06:35 oldT-test=-100
02:06:33 oldT=02:06:33 newT=02:06:34 oldT-test=-100
iMaxVisit = 5
iTime = 10
$newTime = $oldTime + ( $iTime / $iMaxVisit );
02:15:26 oldT=02:15:33 newT=02:15:35 oldT-test=-43
02:15:25 oldT=02:15:31 newT=02:15:33 oldT-test=-44
02:15:24 oldT=02:15:29 newT=02:15:31 oldT-test=-45
02:15:21 oldT=02:15:27 newT=02:15:29 oldT-test=-44
02:15:20 oldT=02:15:25 newT=02:15:27 oldT-test=-45
02:15:19 oldT=02:15:23 newT=02:15:25 oldT-test=-46
02:15:17 oldT=02:15:21 newT=02:15:23 oldT-test=-46
02:15:16 oldT=02:15:19 newT=02:15:21 oldT-test=-47
02:15:15 oldT=02:15:17 newT=02:15:19 oldT-test=-48
02:15:13 oldT=02:15:15 newT=02:15:17 oldT-test=-48
02:15:12 oldT=02:15:13 newT=02:15:15 oldT-test=-49
02:15:11 oldT=02:15:11 newT=02:15:13 oldT-test=-50
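The difference between the two update rules boils down to a little arithmetic. This sketch uses hypothetical stand-in values matching the figures above ($iTime = 10, $iMaxVisit = 10); it is an illustration, not part of the routine:

```php
<?php
// Sketch of the two $newTime update rules.
$iTime     = 10;
$iMaxVisit = 10;
$oldTime   = 1000;   // stand-in for a filemtime() value

// Old rule: every hit pushes the stored time forward by a full $iTime,
// so a polite 1-hit/sec crawler still gains ($iTime - 1) secs per hit
// and is eventually blocked no matter how well it behaves.
$newTimeOld = $oldTime + $iTime;                     // 1010

// New rule: each hit only "costs" $iTime / $iMaxVisit secs, so the
// stored time outruns the clock only when hits arrive faster than
// $iMaxVisit hits per $iTime secs.
$newTimeNew = $oldTime + ( $iTime / $iMaxVisit );    // 1001
```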
And here is the modified routine:
// -------------- Start blocking badly-behaved bots -------
$oldSetting= ignore_user_abort( TRUE );
$remote = $_SERVER[ 'REMOTE_ADDR' ];
if(( substr( $remote, 0, 10 ) == '66.249.64.' ) or // Google has blocks 64.233.160.0 - 64.233.191.255
( substr( $remote, 0, 10 ) == '66.249.65.' ) or // Google has blocks 66.249.64.0 - 66.249.95.255
( substr( $remote, 0, 10 ) == '66.249.66.' ) or // Google has blocks 72.14.192.0 - 72.14.207.255
( substr( $remote, 0, 9 ) == '216.239.3' ) or // Google has blocks 216.239.32.0 - 216.239.63.255
( substr( $remote, 0, 9 ) == '216.239.4' ) or
( substr( $remote, 0, 9 ) == '216.239.5' ) or
( substr( $remote, 0, 10 ) == '65.54.188.' ) or // MS has blocks 65.52.0.0 - 65.55.255.255
( substr( $remote, 0, 10 ) == '207.46.98.' ) or // MS has blocks 207.46.0.0 - 207.46.255.255
( substr( $remote, 0, 13 ) == '66.194.55.242' ) // Ocelli
) {
// let well-behaved bots through
} else {
$iTime= 5;// secs; check interval
$iMaxVisit= 20;// Maximum visits allowed within $iTime
$iPenalty= 60;// Seconds before visitor is allowed back
$ipLength= 3;// integer; 2 = 255 files, 3 = 4,096 files
$ipLogFile= _B_DIRECTORY . _B_LOGFILE;
$ipFile= _B_DIRECTORY . substr( md5( $remote ), -$ipLength );
$logLine= '';
$newTime= 0;
$time= time();
$oldTime= ( file_exists( $ipFile ))
? filemtime( $ipFile )
: 0;
if( $oldTime < $time ) { $oldTime = $time; }
$newTime= $oldTime + ( $iTime / $iMaxVisit );
if( $oldTime >= $time + ( $iTime * $iMaxVisit )) {
touch( $ipFile, $time + ( $iTime * ( $iMaxVisit - 1 )) + $iPenalty );
header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server under heavy load</b><br />";
echo "More than $iMaxVisit visits from your IP-Address within the last $iTime secs. Please wait $iPenalty secs before retrying.</p></body></html>";
$useragent= ( isset( $_SERVER[ 'HTTP_USER_AGENT' ]))
? $_SERVER[ 'HTTP_USER_AGENT' ]
: '<unknown user agent>';
$logLine= "$remote ". date( 'd/m/Y H:i:s' ) ." $useragent\n";
$log= file( $ipLogFile );
if( $fp = fopen( $ipLogFile, 'a' )) {// a tiny danger of 2 threads interfering; live with it
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
fclose( $fp );// flock() is disabled in some linux kernels (eg 2.4)
array_shift( $log );// fopen, fclose put as close together as possible
array_push( $log, $logLine );
$logLine= implode( '', $log );
$fp= fopen( $ipLogFile, 'w' );
}
fputs( $fp, $logLine );
fclose( $fp );
}
exit();
}
touch( $ipFile, $newTime );
}
ignore_user_abort( $oldSetting );
// -------------- Stop blocking badly-behaved bots --------
Sorry for such a long post.
$ipLogFile= _B_DIRECTORY . _B_LOGFILE;
$ipFile= _B_DIRECTORY . substr( md5( $remote ), -$ipLength );
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
With the old script it was very easy, I just had to change
$iplogdir = "/home/whatever/public_html/whatever/";
and $iplogfile = "iplog.dat";
I do not know where the ban log is placed, or how to create it.
If I just try to use the script without changing anything I of course get the error:
Warning: touch(): Unable to create file _B_DIRECTORY7e1 because Permission denied
This is a great script, but due to my lack of PHP knowledge I do not know what to do with:
$ipLogFile= _B_DIRECTORY . _B_LOGFILE;
$ipFile= _B_DIRECTORY . substr( md5( $remote ), -$ipLength );
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
define( '_B_DIRECTORY', '/path/on/server/' );
define( '_B_LOGFILE', 'logfile.name' );
define( '_B_LOGMAXLINES', '1000' );
Directory permissions: `_B_DIRECTORY` ("Block-directory") needs to be (at least) read-writeable by the apache-group:
# ls -al _B_DIRECTORY
drwxrwxrwx 2 owner apache-owner 4096 Mar 23 07:02 _B_DIRECTORY
PS Don't you hate the way that this forum compresses quotes into a tall, thin block? No need, as the quote will fit into the horizontal space available. Pshah!
<?php
define( '_B_DIRECTORY', '/home/pro/public_html/forum/ban/' );
define( '_B_LOGFILE', 'log.txt' );
define( '_B_LOGMAXLINES', '3000' );
// -------------- Start blocking badly-behaved bots -------
$oldSetting= ignore_user_abort( TRUE );
$remote = $_SERVER[ 'REMOTE_ADDR' ];
if(( substr( $remote, 0, 10 ) == '66.249.64.' ) or // Google has blocks 64.233.160.0 - 64.233.191.255
( substr( $remote, 0, 10 ) == '66.249.65.' ) or // Google has blocks 66.249.64.0 - 66.249.95.255
( substr( $remote, 0, 10 ) == '66.249.66.' ) or // Google has blocks 72.14.192.0 - 72.14.207.255
( substr( $remote, 0, 9 ) == '216.239.3' ) or // Google has blocks 216.239.32.0 - 216.239.63.255
( substr( $remote, 0, 9 ) == '216.239.4' ) or
( substr( $remote, 0, 9 ) == '216.239.5' ) or
( substr( $remote, 0, 10 ) == '65.54.188.' ) or // MS has blocks 65.52.0.0 - 65.55.255.255
( substr( $remote, 0, 10 ) == '207.46.98.' ) or // MS has blocks 207.46.0.0 - 207.46.255.255
( substr( $remote, 0, 13 ) == '66.194.55.242' ) // Ocelli
) {
// let well-behaved bots through
} else {
$iTime= 5;// secs; check interval
$iMaxVisit= 3;// Maximum visits allowed within $iTime
$iPenalty= 60;// Seconds before visitor is allowed back
$ipLength= 3;// integer; 2 = 255 files, 3 = 4,096 files
$ipLogFile= _B_DIRECTORY . _B_LOGFILE;
$ipFile= _B_DIRECTORY . substr( md5( $remote ), -$ipLength );
$logLine= '';
$newTime= 0;
$time= time();
$oldTime= ( file_exists( $ipFile ))
? filemtime( $ipFile )
: 0;
if( $oldTime < $time ) { $oldTime = $time; }
$newTime= $oldTime + ( $iTime / $iMaxVisit );
if( $oldTime >= $time + ( $iTime * $iMaxVisit )) {
touch( $ipFile, $time + ( $iTime * ( $iMaxVisit - 1 )) + $iPenalty );
header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server under heavy load</b><br />";
echo "More than $iMaxVisit visits from your IP-Address within the last $iTime secs. Please wait $iPenalty secs before retrying.</p></body></html>";
$useragent= ( isset( $_SERVER[ 'HTTP_USER_AGENT' ]))
? $_SERVER[ 'HTTP_USER_AGENT' ]
: '<unknown user agent>';
$logLine= "$remote ". date( 'd/m/Y H:i:s' ) ." $useragent\n";
$log= file( $ipLogFile );
if( $fp = fopen( $ipLogFile, 'a' )) {// a tiny danger of 2 threads interfering; live with it
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
fclose( $fp );// flock() is disabled in some linux kernels (eg 2.4)
array_shift( $log );// fopen, fclose put as close together as possible
array_push( $log, $logLine );
$logLine= implode( '', $log );
$fp= fopen( $ipLogFile, 'w' );
}
fputs( $fp, $logLine );
fclose( $fp );
}
exit();
}
touch( $ipFile, $newTime );
}
ignore_user_abort( $oldSetting );
?>
Edit:
----------------
The old code works like a charm. I've just integrated this part into the old code:
$remote = $_SERVER[ 'REMOTE_ADDR' ];
if(( substr( $remote, 0, 10 ) == '66.249.64.' ) or // Google has blocks 64.233.160.0 - 64.233.191.255
( substr( $remote, 0, 10 ) == '66.249.65.' ) or // Google has blocks 66.249.64.0 - 66.249.95.255
( substr( $remote, 0, 10 ) == '66.249.66.' ) or // Google has blocks 72.14.192.0 - 72.14.207.255
( substr( $remote, 0, 9 ) == '216.239.3' ) or // Google has blocks 216.239.32.0 - 216.239.63.255
( substr( $remote, 0, 9 ) == '216.239.4' ) or
( substr( $remote, 0, 9 ) == '216.239.5' ) or
( substr( $remote, 0, 10 ) == '65.54.188.' ) or // MS has blocks 65.52.0.0 - 65.55.255.255
( substr( $remote, 0, 10 ) == '207.46.98.' ) or // MS has blocks 207.46.0.0 - 207.46.255.255
( substr( $remote, 0, 13 ) == '66.194.55.242' ) // Ocelli
However I got one problem: an error with phpBB ("output already sent", etc.).
If I delete the ?> at the end of the script it works smoothly and without any problem.
Strange, but it works!?
[edited by: jatar_k at 4:44 pm (utc) on April 4, 2005]
[edit reason] fixed link [/edit]
$iTime= 5;
$iPenalty= 600;
$iMaxVisit= 15;
Excellent work, and it seems to be a first defense not only against bots but also against brute-forcing and DoS-ing of your site.
Thanks
$iTime= 5;// secs; check interval
$iMaxVisit= 2;// Maximum visits allowed within $iTime
If I set the configuration higher:
$iTime= 5;// secs; check interval
$iMaxVisit= 5;// Maximum visits allowed within $iTime
I wasn't able to temporarily ban myself, even with constant clicking/refreshing 100-150 times.
Thanks, but I prefer the first script, which prevents brute-forcing and DoS excellently.
Actually, it works.
As a practical example, on my own site on the very day that I posted the first message the modified routine as above stopped someone trying to hit my site at over 25 times a second. Here is an extract from the log-file (the number at the end of each line is the number of times blocked in that second of time):
69.227.20.74 10/03/2005 19:45:03 Mozilla/4.0 (compatible; MSIE 5.0; Windows 98) 25
69.227.20.74 10/03/2005 19:45:02 Mozilla/4.0 (compatible; MSIE 5.0; Windows 98) 24
69.227.20.74 10/03/2005 19:45:01 Mozilla/4.0 (compatible; MSIE 5.0; Windows 98) 24
69.227.20.74 10/03/2005 19:45:00 Mozilla/4.0 (compatible; MSIE 5.0; Windows 98) 25
69.227.20.74 10/03/2005 19:44:59 Mozilla/4.0 (compatible; MSIE 5.0; Windows 98) 25
Anyway, having said all that, if you are content with what you have then I fully understand staying with it. I'm just very pernickety...
(sorry to have replied so slowly)
$iTime= 5;// secs; check interval
$iMaxVisit= 5;// Maximum visits allowed within $iTime
even with more than 100 repeated clicks.
The script was able to block me only with:
$iTime= 5;// secs; check interval
$iMaxVisit= 2;// Maximum visits allowed within $iTime
The old script with the configuration
$iTime= 5;
$iPenalty= 600;
$iMaxVisit= 15;
has already blocked a couple of IPs.
It did not block any good bot, but some Firefox users have problems if they use "Drag and Go", a Firefox extension, to open multiple threads in the background and review them later.
I've already contacted them, and they will not use "Drag and Go" anymore, to avoid being temporarily banned again.
Maybe the old script is not perfect, but the new script failed to work with the example settings:
$iTime= 5;// secs; check interval
$iMaxVisit= 5;// Maximum visits allowed within $iTime
If a visitor/bot, after the 25 clicks that are allowed, makes more than 5 clicks within 5 seconds, the script does not ban them and acts as if nothing happened.
I do not know what the problem is, but the behaviour you describe does not match my testing.
$itime = 5; // Minimum number of seconds between visits
$ipenalty = 300; // Seconds before visitor is allowed back
$imaxvisit = 15; // Maximum visits
New script:
$iTime= 5;// secs; check interval
$iMaxVisit= 15;// Maximum visits allowed within $iTime
$iPenalty= 300;// Seconds before visitor is allowed back
The old script works according to the formula 15 x 5 = 75: if you make 15 or more clicks within 5 seconds you'll get a temporary ban.
The new script failed to block anything, even with 200 fast clicks after the 75-second period.
There is something wrong.
btw
It would be useful to log not only IP/date/time/user agent but also the page of your site where the refreshing happened.
New script failed to block anything
filemtime and touch work with integers, whereas the routine tries to set with a float. With your values this effectively leaves the mtime unchanged, hence no block.
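A quick way to see the truncation in isolation (hypothetical temp file, not part of the routine; the explicit cast below mimics what PHP does to touch()'s integer parameter):

```php
<?php
// touch() takes an integer timestamp, so a per-hit increment smaller
// than 1 second is truncated away and the mtime never advances --
// hence no block when $iTime / $iMaxVisit < 1 (eg 5 / 15).
$f = tempnam( sys_get_temp_dir(), 'blk' );
$t = time() + 100;
touch( $f, $t );
clearstatcache();
$mtimeBefore = filemtime( $f );           // equal to $t

// the cast is what happens internally to touch()'s int parameter
touch( $f, (int)( $t + ( 5 / 15 )));      // 5/15 = 0.33..., fraction lost
clearstatcache();
$mtimeAfter = filemtime( $f );            // still equal to $t
unlink( $f );
```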
I am trying to evaluate a correct block-test. At this instant it is twisting my head around.
BTW:
What jdMorgan is referring to is the difference between refresh and shift-refresh (or F5). The former will just refresh the page layout from already-downloaded files, whereas the latter two force a re-load of the files, regardless of the cache settings between your computer and the server (do not forget proxies). The very best I managed was 6 per sec, and most were 4 per sec or less.
I used the following snippet of php:
$fileTime= ( file_exists( $ipFile ))
? filemtime( $ipFile )
: 0;
$oldTime= $fileTime;
if( $oldTime < $time ) { $oldTime = $time; }
$newTime= $oldTime + ( $iTime / $iMaxVisit );
$testTime= $time + ( $iTime * $iMaxVisit );
if( $oldTime >= $time + ( $iTime * $iMaxVisit )) {
touch( $ipFile, $time + ( $iTime * ( $iMaxVisit - 1 )) + $iPenalty );
header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server under heavy load</b><br />";
echo "More than $iMaxVisit visits from your IP-Address within the last $iTime secs. Please wait $iPenalty secs before retrying.</p></body></html>";
$logLine= date( 'd/m/Y H:i:s' ) ." \$fileTime=$fileTime \$oldTime=$oldTime \$newTime=$newTime \$testTime=$testTime (Blocked)\n";
$log= file( $ipLogFile );
if( $fp = fopen( $ipLogFile, 'a' )) {// tiny danger of 2 threads interfering; live with it
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
fclose( $fp );// flock() disabled in some kernels (eg 2.4)
array_shift( $log );// fopen,fclose put close together as possible
array_push( $log, $logLine );
$logLine= implode( '', $log );
$fp= fopen( $ipLogFile, 'w' );
}
fputs( $fp, $logLine );
fclose( $fp );
}
exit();
} else {
touch( $ipFile, $newTime );
header( 'HTTP/1.0 200 Service Is Fine' );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server working well</b><br />";
echo "This is just a little text to send before quitting.</p></body></html>";
$logLine= date( 'd/m/Y H:i:s' ) ." \$fileTime=$fileTime \$oldTime=$oldTime \$newTime=$newTime \$testTime=$testTime\n";
$log= file( $ipLogFile );
if( $fp = fopen( $ipLogFile, 'a' )) {// tiny danger of 2 threads interfering; live with it
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
fclose( $fp );// flock() disabled in some kernels (eg 2.4)
array_shift( $log );// fopen,fclose put close together as possible
array_push( $log, $logLine );
$logLine= implode( '', $log );
$fp= fopen( $ipLogFile, 'w' );
}
fputs( $fp, $logLine );
fclose( $fp );
}
exit();
}
The new routine uses both mtime (modification time) and atime (access time) to track visits and duration. This will be a problem for some servers, which can be mounted with atime updates disabled [uk.php.net] to increase performance. This is not the situation on my server, but I obviously cannot speak for yours.
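The bookkeeping can be sketched in isolation (hypothetical temp file, not the full routine): atime marks the first visit, mtime is bumped by one "tick" per hit, so visits = mtime - atime and duration = now - atime.

```php
<?php
// Stripped-down sketch of the mtime/atime counter.
$f    = tempnam( sys_get_temp_dir(), 'blk' );
$time = time();
touch( $f, $time, $time );               // first visit: both clocks = now

for( $hit = 0; $hit < 5; $hit++ ) {      // five more rapid hits
    clearstatcache();
    $fileATime = fileatime( $f );        // unchanged: first-visit time
    $fileMTime = filemtime( $f ) + 1;    // one tick per visit
    touch( $f, $fileMTime, $fileATime );
}
clearstatcache();
$visits = filemtime( $f ) - fileatime( $f );   // ticks since first visit
unlink( $f );
```

Note that setting atime explicitly via touch() works even on noatime mounts; it is only the automatic atime-on-read update that those mounts suppress.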
Here are snippets from a specially-constructed log-file, showing the block coming into effect. Just to make life more interesting, I have also changed some of the variables to try to make them a little more understandable.
$bInterval = 5; // secs; check interval
$bMaxVisit = 15; // Maximum visits allowed within $bInterval
On the emboldened line ($visits=60) there are 60 visits in 20 seconds, which is (just) within the limits. The next visit goes over the limit, and trips the block:
16/04/2005 03:33:13 $fileATime=1113618793 $fileMTime=1113618794 $visits=1 $duration=1
16/04/2005 03:33:14 $fileATime=1113618793 $fileMTime=1113618795 $visits=2 $duration=1
16/04/2005 03:33:14 $fileATime=1113618793 $fileMTime=1113618796 $visits=3 $duration=1
16/04/2005 03:33:15 $fileATime=1113618793 $fileMTime=1113618797 $visits=4 $duration=2
...
16/04/2005 03:33:33 $fileATime=1113618793 $fileMTime=1113618852 $visits=59 $duration=20
16/04/2005 03:33:33 $fileATime=1113618793 $fileMTime=1113618853 $visits=60 $duration=20
16/04/2005 03:33:33 $fileATime=1113618793 $fileMTime=1113618854 $visits=61 $duration=20 (Blocked)
16/04/2005 03:33:34 $fileATime=1113618793 $fileMTime=1113618944 $visits=151 $duration=21 (Blocked)
16/04/2005 03:33:34 $fileATime=1113618793 $fileMTime=1113618945 $visits=152 $duration=21 (Blocked)
The items prepended with an underscore are Constants that need to be define()'d somewhere in your script before this snippet of code gets used:
eg:
define( '_B_DIRECTORY', '/full/path/on/server/' );
define( '_B_LOGFILE', 'logfile.name' );
define( '_B_LOGMAXLINES', '1000' );
These Constants can be variables or even constant-values within the code - your choice.
Directory permissions: `_B_DIRECTORY` needs to be read-writeable by the apache-group.
Both $ipLogFile and $ipFile are created on-the-fly if not already existing.
//----------------Start-blocking-badly-behaved-bots---------------------------------------------
$bInterval= 5;// secs; check interval
$bMaxVisit= 10;// Maximum visits allowed within $bInterval
$bPenalty= 60;// Seconds before visitor is allowed back
$ipLength= 3;// integer; 2 = 255 files, 3 = 4,096 files
$ipLogFile= _B_DIRECTORY . _B_LOGFILE;
$ipFile= _B_DIRECTORY . substr( md5( $remote ), -$ipLength );
$logLine= '';
$time= time();
$fileATime= $time;// access time used to track duration
$fileMTime= $time;// modification time used to track visits
if( file_exists( $ipFile )) {
$fileATime= fileatime( $ipFile );
$fileMTime= filemtime( $ipFile );
if( $fileMTime < $time ) { $fileMTime = $fileATime = $time; }
$fileMTime++;
$visits= $fileMTime - $fileATime;
$duration= $time - $fileATime;// secs
if( $duration < 1 ) $duration = 1;
if(( $visits > $bMaxVisit ) and (( $visits / $duration ) > ( $bMaxVisit / $bInterval ))) {
touch( $ipFile, $time + ( $bInterval * ( $bMaxVisit - 1 )) + $bPenalty, $fileATime );
header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server under heavy load</b><br />";
echo "$visits visits from your IP-Address within the last $duration secs. Please wait $bPenalty secs before retrying.</p></body></html>";
$useragent= ( isset( $_SERVER[ 'HTTP_USER_AGENT' ]))
? $_SERVER[ 'HTTP_USER_AGENT' ]
: '<unknown user agent>';
$logLine= "$remote ". date( 'd/m/Y H:i:s' ) ." $useragent\n";
$log= file( $ipLogFile );
if( $fp = fopen( $ipLogFile, 'a' )) {// tiny danger of 2 threads interfering; live with it
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
fclose( $fp );// flock() disabled in some kernels (eg 2.4)
array_shift( $log );// fopen,fclose put close together as possible
array_push( $log, $logLine );
$logLine= implode( '', $log );
$fp= fopen( $ipLogFile, 'w' );
}
fputs( $fp, $logLine );
fclose( $fp );
}
exit();
}
}
touch( $ipFile, $fileMTime, $fileATime );
}
ignore_user_abort( $oldSetting );
//----------------Stop-blocking-badly-behaved-bots----------------------------------------------
HTH
touch( $ipFile, $fileMTime, $fileATime );
Warning: touch(): Unable to create file because No such file or directory in /home/pb/public_html/forum/refreshblock2.php on line 63
Line 63 is:
touch( $ipFile, $fileMTime, $fileATime );
I already have the directory, with log.txt inside it:
define( '_B_DIRECTORY', '/home/pb/public_html/forum/ban/' );
define( '_B_LOGFILE', 'log.txt' );
define( '_B_LOGMAXLINES', '3000' );
What needs to be done here?
Thanks
I did the changes but got an error on line 63:
touch( $ipFile, $fileMTime, $fileATime );
Warning: touch(): Unable to create file because No such file or directory...
touch creates the file if it does not already exist, so the problem can only be with the directory.
Re-check the directory name (watch out for case: `ban' is different from `Ban' on Linux, for example).
Re-check the directory permissions (see msg 5), although I doubt that this will be the problem, since the error should be different.
Finally, if nothing else works, change line 63 to:
touch( $ipFile, $fileMTime );
and, if the error message goes away, then I reckon that your server is mounted with atime updates disabled. This latter is a pure guess, as I do not have any experience of that situation.
After a little research: the 3rd parameter, atime, is not present in the PHP 4.2.2 documentation and I am not sure at what version it was added. There are bugs [php.net] in touch in versions 4.2.3, 4.2.1 and 4.0.2. If all else fails you will have to revert to the original code. Just make sure that you include the section to let good bots through, 'cos otherwise they will be blocked (on my multi-thousand-page site they crawl so long that they will *always* get blocked, even at 1-hit-every-3-secs).
The code in msg 1 can be used if $itime >= $imaxvisit and $itime / $imaxvisit evaluates to an integer, eg:
$iTime= 15;
$iMaxVisit= 15;
would work fine (15 hits in 15 secs is a lot for a single visitor).
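That divisibility condition can be captured in a tiny helper (a hypothetical function, not part of the original script). For positive integers the modulo test below is exactly "increment is a whole number of seconds", and it also implies $itime >= $imaxvisit:

```php
<?php
// Hypothetical helper: is the per-hit increment $iTime / $iMaxVisit a
// whole number of seconds? If not, touch()'s integer cast swallows it.
function incrementIsWhole( $iTime, $iMaxVisit ) {
    return ( $iTime % $iMaxVisit ) === 0;
}
```

incrementIsWhole( 15, 15 ) is true (a 1-second increment works), while incrementIsWhole( 5, 15 ) is false: the 0.33-second increment would be truncated away.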
Finally, the section of php posted in msg 15 was lifted direct from the page which is now running live on my (active) server. It is working fine, and I am eagerly waiting for it to catch yet more bots!
Addendum: Dolcevita:
It would be useful to log not only IP/date/time/user agent but also the page of your site where the refreshing happened.
Three questions:
1/ 15 visits per 5 secs; is that a good value?
2/ Is there any better solution for allowing well-behaved bots than an if/else statement checking the remote address?
3/ If not, any idea where I could get a list of the must-have friendly bots?
Thanks
$bInterval= 5;// secs; check interval
$bMaxVisit= 10;// Maximum visits allowed within $bInterval
$bPenalty= 300;// Seconds before visitor is allowed back
$ipLength= 3;// integer; 2 = 255 files, 3 = 4,096 files
And please tell me how I can ban them from eating my bandwidth and visiting my server.
PS: I do not trust webmasters from ex-Iron Curtain countries (cz = Czechoslovakia, pl = Poland, and so on); scam artists, hackers, 302 redirects... from those countries.
[edited by: jatar_k at 6:35 pm (utc) on April 20, 2005]
[edit reason] no urls thanks [/edit]
- spiders hit the site too fast
- offline downloaders and scrapers; with 40K-80K pages this must be stopped
Seems like with a few modifications this script could also set a page download limit per IP.
The issue then becomes letting the legitimate spiders pass vs blocking downloaders and scrapers.
Seems easy enough with a large list of exceptions.
The problem I have regarding the penalty time persists.
It is longer than indicated; that is, even if I press refresh after a longer period of time, I still get banned.
The touch() command within the banned section was based on the old routine, and not re-written for the new algorithm. It is now re-written (see below), and works fine:
$bInterval= 5;// secs; check interval
$bMaxVisit= 10;// Maximum visits allowed within $bInterval
$bPenalty= 60;// Seconds before visitor is allowed back
20/04/2005 18:44:51 $fileATime=1114019091 $fileMTime=1114019092 $visits=1 $duration=1
20/04/2005 18:44:51 $fileATime=1114019091 $fileMTime=1114019093 $visits=2 $duration=1
...
20/04/2005 18:44:54 $fileATime=1114019091 $fileMTime=1114019100 $visits=9 $duration=3
20/04/2005 18:44:54 $fileATime=1114019089 $fileMTime=1114019219 $visits=10 $duration=3 (Blocked)
20/04/2005 18:44:55 $fileATime=1114019090 $fileMTime=1114019220 $visits=131 $duration=6 (Blocked)
20/04/2005 18:46:09 $fileATime=1114019090 $fileMTime=1114019221 $visits=131 $duration=79
After waiting for (a little more than) 60 secs it let me back in again (the new coding resets both $fileATime and $fileMTime to do this).
artdeco:
...please tell me how can I ban them from eating my Bandwidth
incrediBILL:
The issue then becomes letting the legitimate spiders pass vs blocking downloaders and scrapers.
Finally (and apologies in advance for making these pages so large) here is the full revised routine, with IP-checking for legit spiders now removed:
The items prepended with an underscore are Constants that need to be define()'d somewhere in your script before this snippet of code gets used:
eg:
define( '_B_DIRECTORY', '/full/path/on/server/' );
define( '_B_LOGFILE', 'logfile.name' );
define( '_B_LOGMAXLINES', '1000' );
These Constants can be variables or even constant-values within the code - your choice.
Directory permissions: `_B_DIRECTORY` needs to be read-writeable by the apache-group.
Both $ipLogFile and $ipFile are created on-the-fly if not already existing.
//----------------Start-blocking-badly-behaved-bots---------------------------------------------
$oldSetting= ignore_user_abort( TRUE );
$bInterval= 5;// secs; check interval
$bMaxVisit= 10;// Maximum visits allowed within $bInterval
$bPenalty= 60;// Seconds before visitor is allowed back
$ipLength= 3;// integer; 2 = 255 files, 3 = 4,096 files
$ipLogFile= _B_DIRECTORY . _B_LOGFILE;
$ipFile= _B_DIRECTORY . substr( md5( $remote ), -$ipLength );
$logLine= '';
$time= time();
$fileATime= $time;// access time:-tracks duration
$fileMTime= $time;// modification time:-tracks visits
if( file_exists( $ipFile )) {
$fileATime= fileatime( $ipFile );
$fileMTime= filemtime( $ipFile );
if( $fileMTime < $time ) { $fileMTime = $fileATime = $time; }
$fileMTime++;
$visits= $fileMTime - $fileATime;
$duration= $time - $fileATime;// secs
if( $duration < 1 ) $duration = 1;
if(( $visits >= $bMaxVisit ) and (( $visits / $duration ) > ( $bMaxVisit / $bInterval ))) {
$fileATime= $time = $time - $bInterval;
$fileMTime= $time + $bMaxVisit + (( $bMaxVisit * $bPenalty ) / $bInterval );
touch( $ipFile, $fileMTime, $fileATime );
header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server under heavy load</b><br />";
echo "$visits visits from your IP-Address within the last $duration secs. Please wait $bPenalty secs before retrying.</p></body></html>";
$remote = $_SERVER[ 'REMOTE_ADDR' ];
$useragent= ( isset( $_SERVER[ 'HTTP_USER_AGENT' ]))
? $_SERVER[ 'HTTP_USER_AGENT' ]
: '<unknown user agent>';
$logLine= "$remote ". date( 'd/m/Y H:i:s' ) ." $useragent\n";
$log= file( $ipLogFile );
if( $fp = fopen( $ipLogFile, 'a' )) {// tiny danger of 2 threads interfering; live with it
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
fclose( $fp );// flock() disabled in some kernels (eg 2.4)
array_shift( $log );// fopen,fclose put close together as possible
array_push( $log, $logLine );
$logLine= implode( '', $log );
$fp= fopen( $ipLogFile, 'w' );
}
fputs( $fp, $logLine );
fclose( $fp );
}
exit();
}
}
touch( $ipFile, $fileMTime, $fileATime );
ignore_user_abort( $oldSetting );
//----------------Stop-blocking-badly-behaved-bots----------------------------------------------
An illegitimate spider will not care about your bandwidth - it will scrape your site as fast as you let it. This is the mark of the beast.
I think you missed my second point.
The script as presented fixes issue #1, which is the bandwidth hogs.
I want to stop competitors from crawling my pages after a few hundred pages or so, instead of letting them get access to 40k+ pages. A human has a limit to how much information they can process in a day, but the offline downloading and scraping, even at a slow rate, just keeps going and going until they have my entire site.
I'm looking into a modification that will track that behavior and just shut them down after a few hundred pages with a nice friendly message like "You have exceeded your page view allotment for today, come back tomorrow you greedy pig".
I want to stop competitors from crawling my pages ... until they have my entire site.
I'm looking into a modification that will track that behavior and just shut them down after a few hundred pages
The routine creates a small file per visitor ($ipFile) and uses the mod/access time of this file to activate a block. The same file could also contain content...
As written, the routine pre-processes a page. What you want would require that it is split to both pre- and post-process a page: ob_start() at the beginning of the page, then ob_get_contents() and strlen() at the end, with the byte-count added to $ipFile.
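A minimal sketch of that pre-/post-processing split (all names hypothetical; this is an illustration of the output-buffering idea, not a worked-out tally):

```php
<?php
// Buffer the page as it renders, measure it at the end; the byte-count
// could then be added to the visitor's $ipFile tally for a per-IP
// download limit.
ob_start();                               // at the very top of the page

// ... the page renders here ...
echo "<html><body>page content</body></html>";

$pageBytes = strlen( ob_get_contents() ); // size of the generated page
ob_end_flush();                           // send the page on to the visitor
```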
[added]:
Y'know, after preening myself on my clever, clever reply, it has occurred to me that there is an even simpler solution...
The routine tracks the number of visits in $visits. It would be very easy to add a check for the gross number of visits. $visits is reset to zero when the mod-time falls behind the current time (a hit-rate of less than 1 a sec, which admittedly is 86k hits in 24 hours). My experience across the last 18 months, however, is that *no-one* scraping a site has the patience to hit your site so slowly (look at msg#9). Hence my original comment.
I'm looking into a modification that will track that behavior and just shut them down after a few hundred pages
[after yet more thinking]:
A very sensible modification to the routine in msg#25 is to adjust the set-zero time (the modified line is the if-test):
$fileATime= fileatime( $ipFile );
$fileMTime= filemtime( $ipFile );
if(( $time - $fileMTime ) > $bInterval ) { $fileMTime = $fileATime = $time; }
$fileMTime++;
Provided $bInterval is not huge this will be OK, and it will also maintain the tracking of $visits and $duration for slow scrapers. It will now be possible to add new block values, then test:
$bTotBlock=86400;//secs; period to block long-duration scrapers
$bTotVisit=1000;//total visits allowed within a 24-hr period
...
if(( $visits >= $bTotVisit ) and ( $duration < 86400 )) {
$fileATime= $time = $time - $bInterval;
$fileMTime= $time + $bMaxVisit + ( $bMaxVisit * $bTotBlock / $bInterval );
touch( $ipFile, $fileMTime, $fileATime );
(your msg goes here!)
exit();
}
You will obviously have to be careful to choose values which will not block search-bots, else re-implement the IP-checking of msg#1.
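As an aside, one alternative to hard-coded IP prefixes (a sketch, not from this thread) is a reverse-DNS check with forward confirmation. The resolver callables are injected here so the logic can be exercised without a network; live code would pass 'gethostbyaddr' and 'gethostbynamel':

```php
<?php
// Verify a claimed bot: reverse-DNS the IP, check the hostname's domain,
// then resolve the hostname forward and confirm it maps back to the IP.
function isVerifiedBot( $ip, $domain, $resolveAddr, $resolveName ) {
    $host = call_user_func( $resolveAddr, $ip );       // IP -> hostname
    if( substr( $host, -strlen( $domain )) !== $domain ) {
        return FALSE;                                  // wrong domain
    }
    $ips = call_user_func( $resolveName, $host );      // hostname -> IPs
    return is_array( $ips ) && in_array( $ip, $ips );  // forward-confirm
}
```

This avoids maintaining prefix lists by hand, at the cost of DNS lookups (which would want caching on a busy site).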
The following line:
$remote = $_SERVER[ 'REMOTE_ADDR' ];
appears after $remote is used. It should be at this position:
$oldSetting= ignore_user_abort( TRUE );
$remote = $_SERVER[ 'REMOTE_ADDR' ];
$bInterval= 5;// secs; check interval
Checking my block log, the routine in msg#25 caught 96 attempted visits from 35 IP-addresses in a 15 minute span yesterday (Friday 22). My live site has all error messages switched off, so the notice on using a variable that is not yet declared did not appear. Sigh.
Sorry to all.
The revised routine (complete with the if() clause for well-behaved bots) has been running without problems for 5 days now. In fact, I was getting worried that it might not be working. Then I checked the block log this evening...
(the number at the end of each line is the number of times blocked within that second):
$bInterval= 10;
$bMaxVisit= 20;
$bPenalty= 60;
$bTotVisit= 500;
$bTotBlock= 42200;
$ipLength= 3;
24.174.219.197 26/04/2005 23:00:42 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 2
24.174.219.197 26/04/2005 23:00:41 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 5
24.174.219.197 26/04/2005 23:00:40 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 6
24.174.219.197 26/04/2005 23:00:39 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 6
24.174.219.197 26/04/2005 23:00:38 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 5
24.174.219.197 26/04/2005 23:00:37 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 6
24.174.219.197 26/04/2005 23:00:36 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 4
24.174.219.197 26/04/2005 23:00:35 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 5
24.174.219.197 26/04/2005 23:00:34 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 3
( 24.174.219.197 = cpe-24-174-219-197.elp.res.rr.com )
The final item now (possibly) is to reverse the use of atime & mtime. At the moment, access-time lags behind modification-time, which does not make logical sense. Also, any access to one of the $ipFile files from outside of the routine may trip a block. I would welcome any comments on this.
I'll run the routine for another week to give it a good test (the previous mistakes have spooked me), then post the listing. If people want it.