
PHP Server Side Scripting Forum

Blocking Badly Behaved Bots
An update/fix for a very useful routine
AlexK




msg:1299766
 3:07 am on Mar 10, 2005 (gmt 0)

I have been using (a slightly modified version of) the routine published on WebmasterWorld here [webmasterworld.com] for the last 18 months. It certainly stopped unruly bots and site-scrapers, but the logic was flawed somewhere.

Now, because my site is on a temporary server, and the main server is offline (but still connected to the web) whilst it gets an upgrade, I have had a chance to find out the flaw in the logic.

The "$newtime" was updated with the wrong value, causing perfectly-well-behaved bots to always get blocked if they crawled the site for long enough.

Here are some figures using the W3C link-validator [validator.w3.org], which crawls at 1 hit/sec:

iMaxVisit = 10 
iTime = 10 (seconds)
$newTime = $oldTime + $iTime;
oldT-test = ( $oldTime - $time - ( $iTime * $iMaxVisit ))

01:48:35 oldT=01:50:32 newT=01:50:42 oldT-test=17 Blocked
01:48:32 oldT=01:50:31 newT=01:50:41 oldT-test=19 Blocked
01:48:31 oldT=01:50:18 newT=01:50:28 oldT-test=7 Blocked
01:48:29 oldT=01:50:08 newT=01:50:18 oldT-test=-1
01:48:28 oldT=01:49:58 newT=01:50:08 oldT-test=-10
01:48:26 oldT=01:49:48 newT=01:49:58 oldT-test=-18
01:48:25 oldT=01:49:38 newT=01:49:48 oldT-test=-27
01:48:21 oldT=01:49:28 newT=01:49:38 oldT-test=-33
01:48:20 oldT=01:49:18 newT=01:49:28 oldT-test=-42
01:48:19 oldT=01:49:08 newT=01:49:18 oldT-test=-51
01:48:15 oldT=01:48:58 newT=01:49:08 oldT-test=-57
01:48:14 oldT=01:48:48 newT=01:48:58 oldT-test=-66
01:48:13 oldT=01:48:38 newT=01:48:48 oldT-test=-75
01:48:10 oldT=01:48:28 newT=01:48:38 oldT-test=-82
01:48:09 oldT=01:48:18 newT=01:48:28 oldT-test=-91
01:48:08 oldT=01:48:08 newT=01:48:18 oldT-test=-100

You can see that the validator gets blocked, even though there should be no problem. Changing the logic of the $newTime setting fixes it:

iMaxVisit = 10 
iTime = 10
$newTime = $oldTime + ( $iTime / $iMaxVisit );

02:06:49 oldT=02:06:49 newT=02:06:50 oldT-test=-100
02:06:48 oldT=02:06:48 newT=02:06:49 oldT-test=-100
02:06:44 oldT=02:06:44 newT=02:06:45 oldT-test=-100
02:06:43 oldT=02:06:43 newT=02:06:44 oldT-test=-100
02:06:42 oldT=02:06:42 newT=02:06:43 oldT-test=-100
02:06:40 oldT=02:06:40 newT=02:06:41 oldT-test=-100
02:06:39 oldT=02:06:39 newT=02:06:40 oldT-test=-100
02:06:37 oldT=02:06:37 newT=02:06:38 oldT-test=-100
02:06:35 oldT=02:06:35 newT=02:06:36 oldT-test=-100
02:06:34 oldT=02:06:34 newT=02:06:35 oldT-test=-100
02:06:33 oldT=02:06:33 newT=02:06:34 oldT-test=-100

iMaxVisit = 5
iTime = 10
$newTime = $oldTime + ( $iTime / $iMaxVisit );

02:15:26 oldT=02:15:33 newT=02:15:35 oldT-test=-43
02:15:25 oldT=02:15:31 newT=02:15:33 oldT-test=-44
02:15:24 oldT=02:15:29 newT=02:15:31 oldT-test=-45
02:15:21 oldT=02:15:27 newT=02:15:29 oldT-test=-44
02:15:20 oldT=02:15:25 newT=02:15:27 oldT-test=-45
02:15:19 oldT=02:15:23 newT=02:15:25 oldT-test=-46
02:15:17 oldT=02:15:21 newT=02:15:23 oldT-test=-46
02:15:16 oldT=02:15:19 newT=02:15:21 oldT-test=-47
02:15:15 oldT=02:15:17 newT=02:15:19 oldT-test=-48
02:15:13 oldT=02:15:15 newT=02:15:17 oldT-test=-48
02:15:12 oldT=02:15:13 newT=02:15:15 oldT-test=-49
02:15:11 oldT=02:15:11 newT=02:15:13 oldT-test=-50
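To see the arithmetic concretely, here is a small stand-alone simulation (illustrative only, not part of the routine) that replays a polite 1-hit/sec crawler against both $newTime rules. The old rule advances the stored time by $iTime per hit while real time advances by only 1 sec, so the gap grows by 9 secs per hit and any sufficiently long crawl must trip the block; the new rule advances it by the allowed average interval, so the gap never grows:

$iTime= 10;// secs; check interval
$iMaxVisit= 10;// Maximum visits allowed within $iTime
foreach( array( 'old', 'new' ) as $rule ) {
$oldTime= 0;
$blockedAt= 0;
for( $time = 1; $time <= 200; $time++ ) {// one hit per second
if( $oldTime < $time ) { $oldTime = $time; }
if( $oldTime >= $time + ( $iTime * $iMaxVisit )) { $blockedAt = $time; break; }
$oldTime= ( $rule == 'old' )
? $oldTime + $iTime// old rule: +$iTime per hit
: $oldTime + ( $iTime / $iMaxVisit );// new rule: +1 sec per hit
}
echo "$rule rule: ", ( $blockedAt ? "blocked at hit $blockedAt" : "never blocked" ), "\n";
}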

And here is the modified routine:

// -------------- Start blocking badly-behaved bots -------
$oldSetting= ignore_user_abort( TRUE );
$remote = $_SERVER[ 'REMOTE_ADDR' ];
if(( substr( $remote, 0, 10 ) == '66.249.64.' ) or// Google has blocks 64.233.160.0 - 64.233.191.255
( substr( $remote, 0, 10 ) == '66.249.65.' ) or// Google has blocks 66.249.64.0 - 66.249.95.255
( substr( $remote, 0, 10 ) == '66.249.66.' ) or// Google has blocks 72.14.192.0 - 72.14.207.255
( substr( $remote, 0, 9 ) == '216.239.3' ) or// Google has blocks 216.239.32.0 - 216.239.63.255
( substr( $remote, 0, 9 ) == '216.239.4' ) or
( substr( $remote, 0, 9 ) == '216.239.5' ) or
( substr( $remote, 0, 10 ) == '65.54.188.' ) or// MS has blocks 65.52.0.0 - 65.55.255.255
( substr( $remote, 0, 10 ) == '207.46.98.' ) or// MS has blocks 207.46.0.0 - 207.46.255.255
( substr( $remote, 0, 13 ) == '66.194.55.242' )// Ocelli
) {
// let well-behaved bots through
} else {
$iTime= 5;// secs; check interval
$iMaxVisit= 20;// Maximum visits allowed within $iTime
$iPenalty= 60;// Seconds before visitor is allowed back
$ipLength= 3;// integer; 2 = 255 files, 3 = 4,096 files
$ipLogFile= _B_DIRECTORY . _B_LOGFILE;
$ipFile= _B_DIRECTORY . substr( md5( $remote ), -$ipLength );
$logLine= '';
$newTime= 0;
$time= time();
$oldTime= ( file_exists( $ipFile ))
? filemtime( $ipFile )
: 0;
if( $oldTime < $time ) { $oldTime = $time; }
$newTime= $oldTime + ( $iTime / $iMaxVisit );
if( $oldTime >= $time + ( $iTime * $iMaxVisit )) {
touch( $ipFile, $time + ( $iTime * ( $iMaxVisit - 1 )) + $iPenalty );
header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server under heavy load</b><br />";
echo "More than $iMaxVisit visits from your IP-Address within the last $iTime secs. Please wait $iPenalty secs before retrying.</p></body></html>";
$useragent= ( isset( $_SERVER[ 'HTTP_USER_AGENT' ]))
? $_SERVER[ 'HTTP_USER_AGENT' ]
: '<unknown user agent>';
$logLine= "$remote ". date( 'd/m/Y H:i:s' ) ." $useragent\n";
$log= file( $ipLogFile );
if( $fp = fopen( $ipLogFile, 'a' )) {// a tiny danger of 2 threads interfering; live with it
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
fclose( $fp );// flock() is disabled in some linux kernels (eg 2.4)
array_shift( $log );// fopen, fclose put as close together as possible
array_push( $log, $logLine );
$logLine= implode( '', $log );
$fp= fopen( $ipLogFile, 'w' );
}
fputs( $fp, $logLine );
fclose( $fp );
}
exit();
}
touch( $ipFile, $newTime );
}
ignore_user_abort( $oldSetting );
// -------------- Stop blocking badly-behaved bots --------

Sorry for such a long post.

 

jatar_k




msg:1299767
 10:22 pm on Mar 14, 2005 (gmt 0)

I guess I forgot to say thanks, AlexK. I added a link from the original thread to this updated version.

bloke in a box




msg:1299768
 12:57 pm on Mar 15, 2005 (gmt 0)

Nice, that's one routine I'll certainly find useful.

Thanks. :)

dolcevita




msg:1299769
 9:51 am on Mar 23, 2005 (gmt 0)

This is a great script, but my PHP is weak and I do not know what to do with these lines:

$ipLogFile= _B_DIRECTORY . _B_LOGFILE;
$ipFile= _B_DIRECTORY . substr( md5( $remote ), -$ipLength );
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy

With the old script it was very easy, I just changed

$iplogdir = "/home/whatever/public_html/whatever/";

and $iplogfile = "iplog.dat";

I do not know what the ban log is, where it should be placed, or how to create it.
If I just try to use the script without changing anything, I of course get this error:
Warning: touch(): Unable to create file _B_DIRECTORY7e1 because Permission denied

AlexK




msg:1299770
 2:34 am on Mar 24, 2005 (gmt 0)

This is a great script, but my PHP is weak and I do not know what to do with these lines:

$ipLogFile= _B_DIRECTORY . _B_LOGFILE;
$ipFile= _B_DIRECTORY . substr( md5( $remote ), -$ipLength );
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy


The items prepended with an underscore are Constants that need to be define()'d somewhere in your script before this snippet of code gets used:
eg:

define( '_B_DIRECTORY', '/path/on/server/' );
define( '_B_LOGFILE', 'logfile.name' );
define( '_B_LOGMAXLINES', '1000' );

It will do no harm to change these Constants to variables or even constant-values within the code - your choice. I make use of 2 of these Constants elsewhere in another file, and gather all Constants together in one include-file, hence their use.

Directory permissions: `_B_DIRECTORY` ("Block-directory") needs to be (at least) read-writeable by the apache-group:
# ls -al _B_DIRECTORY
drwxrwxrwx 2 owner apache-owner 4096 Mar 23 07:02 _B_DIRECTORY

I would also advise that _B_DIRECTORY should, if at all possible, sit outside the web-directory root, so that it is not accessible over HTTP.
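For example, a purely hypothetical layout (the paths are illustrative only):

// web-root is /home/example/public_html/ ; the block-directory sits
// beside it rather than under it, so its files can never be fetched over HTTP
define( '_B_DIRECTORY', '/home/example/botblock/' );
define( '_B_LOGFILE', 'blocklog.txt' );
define( '_B_LOGMAXLINES', 1000 );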

PS Don't you hate the way that this forum compresses quotes into a tall, thin block? No need, as the quote will fit into the horizontal space available. Pshah!

dolcevita




msg:1299771
 12:28 pm on Apr 4, 2005 (gmt 0)

I tried it, but it does not work on my board.
The previous file worked well (your first block file, posted a couple of months ago [webmasterworld.com...]), but with the new one I have a problem: it just does not block anything and actually does not work.

<?php
define( '_B_DIRECTORY', '/home/pro/public_html/forum/ban/' );
define( '_B_LOGFILE', 'log.txt' );
define( '_B_LOGMAXLINES', '3000' );
// -------------- Start blocking badly-behaved bots -------
$oldSetting= ignore_user_abort( TRUE );
$remote = $_SERVER[ 'REMOTE_ADDR' ];
if(( substr( $remote, 0, 10 ) == '66.249.64.' ) or// Google has blocks 64.233.160.0 - 64.233.191.255
( substr( $remote, 0, 10 ) == '66.249.65.' ) or// Google has blocks 66.249.64.0 - 66.249.95.255
( substr( $remote, 0, 10 ) == '66.249.66.' ) or// Google has blocks 72.14.192.0 - 72.14.207.255
( substr( $remote, 0, 9 ) == '216.239.3' ) or// Google has blocks 216.239.32.0 - 216.239.63.255
( substr( $remote, 0, 9 ) == '216.239.4' ) or
( substr( $remote, 0, 9 ) == '216.239.5' ) or
( substr( $remote, 0, 10 ) == '65.54.188.' ) or// MS has blocks 65.52.0.0 - 65.55.255.255
( substr( $remote, 0, 10 ) == '207.46.98.' ) or// MS has blocks 207.46.0.0 - 207.46.255.255
( substr( $remote, 0, 13 ) == '66.194.55.242' )// Ocelli
) {
// let well-behaved bots through
} else {
$iTime= 5;// secs; check interval
$iMaxVisit= 3;// Maximum visits allowed within $iTime
$iPenalty= 60;// Seconds before visitor is allowed back
$ipLength= 3;// integer; 2 = 255 files, 3 = 4,096 files
$ipLogFile= _B_DIRECTORY . _B_LOGFILE;
$ipFile= _B_DIRECTORY . substr( md5( $remote ), -$ipLength );
$logLine= '';
$newTime= 0;
$time= time();
$oldTime= ( file_exists( $ipFile ))
? filemtime( $ipFile )
: 0;
if( $oldTime < $time ) { $oldTime = $time; }
$newTime= $oldTime + ( $iTime / $iMaxVisit );
if( $oldTime >= $time + ( $iTime * $iMaxVisit )) {
touch( $ipFile, $time + ( $iTime * ( $iMaxVisit - 1 )) + $iPenalty );
header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server under heavy load</b><br />";
echo "More than $iMaxVisit visits from your IP-Address within the last $iTime secs. Please wait $iPenalty secs before retrying.</p></body></html>";
$useragent= ( isset( $_SERVER[ 'HTTP_USER_AGENT' ]))
? $_SERVER[ 'HTTP_USER_AGENT' ]
: '<unknown user agent>';
$logLine= "$remote ". date( 'd/m/Y H:i:s' ) ." $useragent\n";
$log= file( $ipLogFile );
if( $fp = fopen( $ipLogFile, 'a' )) {// a tiny danger of 2 threads interfering; live with it
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
fclose( $fp );// flock() is disabled in some linux kernels (eg 2.4)
array_shift( $log );// fopen, fclose put as close together as possible
array_push( $log, $logLine );
$logLine= implode( '', $log );
$fp= fopen( $ipLogFile, 'w' );
}
fputs( $fp, $logLine );
fclose( $fp );
}
exit();
}
touch( $ipFile, $newTime );
}
ignore_user_abort( $oldSetting );
?>

Edit:
----------------

The old code works like a charm. I've just integrated this part into the old code:

$remote = $_SERVER[ 'REMOTE_ADDR' ];
if(( substr( $remote, 0, 10 ) == '66.249.64.' ) or// Google has blocks 64.233.160.0 - 64.233.191.255
( substr( $remote, 0, 10 ) == '66.249.65.' ) or// Google has blocks 66.249.64.0 - 66.249.95.255
( substr( $remote, 0, 10 ) == '66.249.66.' ) or// Google has blocks 72.14.192.0 - 72.14.207.255
( substr( $remote, 0, 9 ) == '216.239.3' ) or// Google has blocks 216.239.32.0 - 216.239.63.255
( substr( $remote, 0, 9 ) == '216.239.4' ) or
( substr( $remote, 0, 9 ) == '216.239.5' ) or
( substr( $remote, 0, 10 ) == '65.54.188.' ) or// MS has blocks 65.52.0.0 - 65.55.255.255
( substr( $remote, 0, 10 ) == '207.46.98.' ) or// MS has blocks 207.46.0.0 - 207.46.255.255
( substr( $remote, 0, 13 ) == '66.194.55.242' )// Ocelli

However, I got one problem: an error with phpBB ("output already sent" etc).
If I delete the ?> at the end of the script, it works smoothly and without any problem.
Strange, but it works!?
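This is actually expected behaviour: anything after the closing PHP tag, even a single trailing newline, is sent to the browser as output, and once output has started any later header() call fails; that is exactly phpBB's "output already sent" complaint. Omitting the closing tag from a pure-PHP file is legal and side-steps the problem. A minimal sketch (the file name is hypothetical):

<?php
// refreshblock.php: a pure-PHP include file
function block_check() {
// ... bot-blocking logic ...
}
// deliberately no closing tag here: if the file ended with the tag plus
// a stray newline, that newline would count as output, and a later
// header() call in the including script would fail with
// "Cannot modify header information - headers already sent"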

[edited by: jatar_k at 4:44 pm (utc) on April 4, 2005]
[edit reason] fixed link [/edit]

dolcevita




msg:1299772
 9:28 am on Apr 6, 2005 (gmt 0)

Great, great script.
Today the script caught its first 2 IPs.
However, as I said before, I am still using the old script because the new one does not work for me.
I set the old script's config to:

$iTime= 5;
$iPenalty= 600;
$iMaxVisit= 15;

It works excellently, and it seems to be a first defense not only against bots but also against brute-forcing and DoS-ing of your site.

Thanks

dolcevita




msg:1299773
 7:59 pm on Apr 6, 2005 (gmt 0)

Actually, it works. I was able to simulate a bot on my own site with this configuration:

$iTime= 5;// secs; check interval
$iMaxVisit= 2;// Maximum visits allowed within $iTime

If I set the configuration higher,

$iTime= 5;// secs; check interval
$iMaxVisit= 5;// Maximum visits allowed within $iTime

I was not able to temp-ban myself even with a constant 100-150 clicks/refreshes.
Thanks, but I prefer the first script; it is excellent at preventing brute-forcing and DoS.

AlexK




msg:1299774
 1:24 am on Apr 13, 2005 (gmt 0)

dolcevita:
Actually, it works.

The old script certainly works... the problem is that it will catch *anybody* (particularly any bot) that browses the site for any period of time. The first table in the first message shows that (with an interval of 10 secs) the test-value gains roughly 9 secs on the threshold with every hit, so with iMaxVisit = 10 the bot is blocked after about a dozen hits, however politely it crawls.

As a practical example, on my own site on the very day that I posted the first message the modified routine as above stopped someone trying to hit my site at over 25 times a second. Here is an extract from the log-file (the number at the end of each line is the number of times blocked in that second of time):
69.227.20.74 10/03/2005 19:45:03 Mozilla/4.0 (compatible; MSIE 5.0; Windows 98) 25
69.227.20.74 10/03/2005 19:45:02 Mozilla/4.0 (compatible; MSIE 5.0; Windows 98) 24
69.227.20.74 10/03/2005 19:45:01 Mozilla/4.0 (compatible; MSIE 5.0; Windows 98) 24
69.227.20.74 10/03/2005 19:45:00 Mozilla/4.0 (compatible; MSIE 5.0; Windows 98) 25
69.227.20.74 10/03/2005 19:44:59 Mozilla/4.0 (compatible; MSIE 5.0; Windows 98) 25

Strictly, the if() routine for well-behaved bots should not now be necessary.

Anyway, having said all that, if you are content with what you have then I fully understand staying with it. I'm just very pernickety...

(sorry to have replied so slowly)

dolcevita




msg:1299775
 4:19 pm on Apr 13, 2005 (gmt 0)

Because my site was under attack a couple of times, I am happier with the older script.
As I said, the new script failed to block me in my own testing with:

$iTime= 5;// secs; check interval
$iMaxVisit= 5;// Maximum visits allowed within $iTime

even with more than 100 repeated clicks.
The script was able to block me only with:

$iTime= 5;// secs; check interval
$iMaxVisit= 2;// Maximum visits allowed within $iTime

The old script, with the configuration

$iTime= 5;
$iPenalty= 600;
$iMaxVisit= 15;

has already blocked a couple of IPs.
It did not block any good bot, but some users of Firefox had problems when they used "Drag and Go", a Firefox extension, to open multiple threads in the background and review them later.
I have already contacted them, and they are not going to use "Drag and Go" any more, to avoid being temp-banned again.
Maybe the old script is not perfect, but the new script failed to work with, for example:

$iTime= 5;// secs; check interval
$iMaxVisit= 5;// Maximum visits allowed within $iTime

If a visitor/bot, after 25 clicks (which is allowed), made more than 5 clicks within 5 seconds, the script did not ban them and acted as if nothing had happened.
I do not know what the problem is, but the behaviour you describe does not match what I see in my testing.

AlexK




msg:1299776
 1:46 am on Apr 14, 2005 (gmt 0)

dolcevita:
...new script failed to block me in my own testing

My main site is still on a temp server (the ISP is dragging its feet), so I will make a point of re-testing because of what you say. Not tonight though... far too late!

jdMorgan




msg:1299777
 2:24 am on Apr 14, 2005 (gmt 0)

> i wasn't able to temp. ban myself even with constatntly 100-150 clicking - refresh.

Make sure you flush and disable your browser cache if you are using "clicks" to test...

Jim

dolcevita




msg:1299778
 8:49 am on Apr 14, 2005 (gmt 0)

The cache is disabled in my browser, and I just tested both scripts under the same conditions.
Old script:

$itime = 5; // Minimum number of seconds between visits
$ipenalty = 300; // Seconds before visitor is allowed back
$imaxvisit = 15; // Maximum visits

New script:

$iTime= 5;// secs; check interval
$iMaxVisit= 15;// Maximum visits allowed within $iTime
$iPenalty= 300;// Seconds before visitor is allowed back

The old script works according to the formula 15 x 5 = 75: if you make 15 or more clicks within 5 seconds, you get a temporary ban. The new script failed to block anything, even with 200 fast clicks after the 75-sec period.
There is something wrong.

btw

It would be useful to log not only IP/date/time/user-agent but also the page of your site where the refreshing happened.

AlexK




msg:1299779
 11:31 pm on Apr 15, 2005 (gmt 0)

dolcevita:
The new script failed to block anything

Mea culpa, dolcevita. I did the checks using the values you used ($itime = 5, $ipenalty = 300, $imaxvisit = 15) and you are right. The problem is that filemtime() and touch() work with integers, whereas the routine tries to set the mtime with a float. With your values this effectively leaves the mtime unchanged, hence no block.
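The truncation is easy to demonstrate with a throw-away test (the /tmp path is just an assumption for the demo):

$ipFile= '/tmp/ipfile-demo';
touch( $ipFile, time() );
clearstatcache();
$oldTime= filemtime( $ipFile );
$newTime= $oldTime + ( 5 / 15 );// $iTime / $iMaxVisit = 0.333...
touch( $ipFile, $newTime );// the float is cast to int: the fraction is lost
clearstatcache();
echo filemtime( $ipFile ) - $oldTime;// prints 0 -- the mtime never advances

With $iTime = 15 and $iMaxVisit = 15 the increment is exactly 1 and the block can build up; with 5/15 it is silently 0.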

I am trying to evaluate a correct block-test. At this instant it is twisting my head around.

BTW:
What jdMorgan is referring to is the difference between a plain refresh and a forced refresh (shift-refresh, or Ctrl+F5). The former just redraws the page from already-downloaded files, whereas the latter forces a re-load of the files, regardless of the cache settings between your computer and the server (do not forget proxies). The very best I managed was 6 hits/sec, and most runs were 4/sec or less.

I used the following snippet of php:
$fileTime= ( file_exists( $ipFile ))
? filemtime( $ipFile )
: 0;
$oldTime= $fileTime;// carry the file's mtime forward
if( $oldTime < $time ) { $oldTime = $time; }
$newTime= $oldTime + ( $iTime / $iMaxVisit );
$testTime= $time + ( $iTime * $iMaxVisit );
if( $oldTime >= $time + ( $iTime * $iMaxVisit )) {
touch( $ipFile, $time + ( $iTime * ( $iMaxVisit - 1 )) + $iPenalty );
header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server under heavy load</b><br />";
echo "More than $iMaxVisit visits from your IP-Address within the last $iTime secs. Please wait $iPenalty secs before retrying.</p></body></html>";
$logLine= date( 'd/m/Y H:i:s' ) ." \$fileTime=$fileTime \$oldTime=$oldTime \$newTime=$newTime \$testTime=$testTime (Blocked)\n";
$log= file( $ipLogFile );
if( $fp = fopen( $ipLogFile, 'a' )) {// tiny danger of 2 threads interfering; live with it
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
fclose( $fp );// flock() disabled in some kernels (eg 2.4)
array_shift( $log );// fopen,fclose put close together as possible
array_push( $log, $logLine );
$logLine= implode( '', $log );
$fp= fopen( $ipLogFile, 'w' );
}
fputs( $fp, $logLine );
fclose( $fp );
}
exit();
} else {
touch( $ipFile, $newTime );
header( 'HTTP/1.0 200 Service Is Fine' );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server working well</b><br />";
echo "This is just a little text to send before quitting.</p></body></html>";
$logLine= date( 'd/m/Y H:i:s' ) ." \$fileTime=$fileTime \$oldTime=$oldTime \$newTime=$newTime \$testTime=$testTime\n";
$log= file( $ipLogFile );
if( $fp = fopen( $ipLogFile, 'a' )) {// tiny danger of 2 threads interfering; live with it
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
fclose( $fp );// flock() disabled in some kernels (eg 2.4)
array_shift( $log );// fopen,fclose put close together as possible
array_push( $log, $logLine );
$logLine= implode( '', $log );
$fp= fopen( $ipLogFile, 'w' );
}
fputs( $fp, $logLine );
fclose( $fp );
}
exit();
}

I'll get back to you when I've untwisted my head.

AlexK




msg:1299780
 3:40 am on Apr 16, 2005 (gmt 0)

This is now solved by going back to first principles. It uses both the mtime (modification time) and atime (access time) of the IP-file to track both visits and duration. This will be a problem for some servers, which can be mounted with atime updates disabled [uk.php.net] to increase performance. This is not the situation on my server, but I obviously cannot speak for yours.
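In effect the empty file becomes a two-field record. Restating the arithmetic of the routine below, with figures taken from the log shown next:

// atime = start of the current burst; mtime = atime + visits-so-far
$visits= $fileMTime - $fileATime;// eg 1113618853 - 1113618793 = 60
$duration= $time - $fileATime;// eg 1113618813 - 1113618793 = 20 secs
// a block trips when the average rate exceeds the allowed rate:
// ( $visits / $duration ) > ( $bMaxVisit / $bInterval ), ie 61/20 > 15/5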

Here are snippets from a specially-constructed log-file, showing the block coming into effect. Just to make life more interesting, I have also changed some of the variables to try to make them a little more understandable.

$bInterval = 5; // secs; check interval
$bMaxVisit = 15; // Maximum visits allowed within $bInterval

On the 60-visit line (emboldened in the original post) there are 60 visits in 20 seconds, which is (just) within the limits. The next visit goes over the limit, and trips the block:
16/04/2005 03:33:13 $fileATime=1113618793 $fileMTime=1113618794 $visits=1 $duration=1
16/04/2005 03:33:14 $fileATime=1113618793 $fileMTime=1113618795 $visits=2 $duration=1
16/04/2005 03:33:14 $fileATime=1113618793 $fileMTime=1113618796 $visits=3 $duration=1
16/04/2005 03:33:15 $fileATime=1113618793 $fileMTime=1113618797 $visits=4 $duration=2
...
16/04/2005 03:33:33 $fileATime=1113618793 $fileMTime=1113618852 $visits=59 $duration=20
16/04/2005 03:33:33 $fileATime=1113618793 $fileMTime=1113618853 $visits=60 $duration=20
16/04/2005 03:33:33 $fileATime=1113618793 $fileMTime=1113618854 $visits=61 $duration=20 (Blocked)
16/04/2005 03:33:34 $fileATime=1113618793 $fileMTime=1113618944 $visits=151 $duration=21 (Blocked)
16/04/2005 03:33:34 $fileATime=1113618793 $fileMTime=1113618945 $visits=152 $duration=21 (Blocked)

And here is the modified routine. I did also check it live on the server exactly as shown, and was able to trip the block:
The items prepended with an underscore are Constants that need to be define()'d somewhere in your script before this snippet of code gets used:
eg:
define( '_B_DIRECTORY', '/full/path/on/server/' );
define( '_B_LOGFILE', 'logfile.name' );
define( '_B_LOGMAXLINES', '1000' );
These Constants can be variables or even constant-values within the code - your choice.
Directory permissions: `_B_DIRECTORY` needs to be read-writeable by the apache-group.
Both $ipLogFile and $ipFile are created on-the-fly if not already existing.
//----------------Start-blocking-badly-behaved-bots---------------------------------------------
$bInterval= 5;// secs; check interval
$bMaxVisit= 10;// Maximum visits allowed within $bInterval
$bPenalty= 60;// Seconds before visitor is allowed back
$ipLength= 3;// integer; 2 = 255 files, 3 = 4,096 files
$ipLogFile= _B_DIRECTORY . _B_LOGFILE;
$ipFile= _B_DIRECTORY . substr( md5( $remote ), -$ipLength );
$logLine= '';
$time= time();
$fileATime= $time;// access time used to track duration
$fileMTime= $time;// modification time used to track visits
if( file_exists( $ipFile )) {
$fileATime= fileatime( $ipFile );
$fileMTime= filemtime( $ipFile );
if( $fileMTime < $time ) { $fileMTime = $fileATime = $time; }
$fileMTime++;
$visits= $fileMTime - $fileATime;
$duration= $time - $fileATime;// secs
if( $duration < 1 ) $duration = 1;
if(( $visits > $bMaxVisit ) and (( $visits / $duration ) > ( $bMaxVisit / $bInterval ))) {
touch( $ipFile, $time + ( $bInterval * ( $bMaxVisit - 1 )) + $bPenalty, $fileATime );
header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server under heavy load</b><br />";
echo "$visits visits from your IP-Address within the last $duration secs. Please wait $bPenalty secs before retrying.</p></body></html>";
$useragent= ( isset( $_SERVER[ 'HTTP_USER_AGENT' ]))
? $_SERVER[ 'HTTP_USER_AGENT' ]
: '<unknown user agent>';
$logLine= "$remote ". date( 'd/m/Y H:i:s' ) ." $useragent\n";
$log= file( $ipLogFile );
if( $fp = fopen( $ipLogFile, 'a' )) {// tiny danger of 2 threads interfering; live with it
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
fclose( $fp );// flock() disabled in some kernels (eg 2.4)
array_shift( $log );// fopen,fclose put close together as possible
array_push( $log, $logLine );
$logLine= implode( '', $log );
$fp= fopen( $ipLogFile, 'w' );
}
fputs( $fp, $logLine );
fclose( $fp );
}
exit();
}
}
touch( $ipFile, $fileMTime, $fileATime );
}// closes the else{} of the good-bots opt-out from msg#1 (not shown in this snippet)
ignore_user_abort( $oldSetting );
//----------------Stop-blocking-badly-behaved-bots----------------------------------------------

It should now be OK to remove the good-bots opt-out at the beginning, but (being a cautious fellow) I've left it in.

HTH

dolcevita




msg:1299781
 10:43 am on Apr 16, 2005 (gmt 0)

I made the changes but got an error on line 63:

touch( $ipFile, $fileMTime, $fileATime );

Warning: touch(): Unable to create file because No such file or directory in /home/pb/public_html/forum/refreshblock2.php on line 63

Line 63 is:
touch( $ipFile, $fileMTime, $fileATime );

I already have the directory, with log.txt inside it:

define( '_B_DIRECTORY', '/home/pb/public_html/forum/ban/' );
define( '_B_LOGFILE', 'log.txt' );
define( '_B_LOGMAXLINES', '3000' );

What must be done here?

Thanks

AlexK




msg:1299782
 3:28 pm on Apr 17, 2005 (gmt 0)

dolcevita:
I made the changes but got an error on line 63:
touch( $ipFile, $fileMTime, $fileATime );
Warning: touch(): Unable to create file because No such file or directory...

touch creates the file if it does not already exist, so the problem can only be with the directory.

Re-check the directory name (watch out for case: `ban' is different from `Ban' on Linux, for example).

Re-check the directory permissions (see msg 5), although I doubt that this will be the problem, since the error should be different.

Finally, if nothing else works, change line 63 to:
touch( $ipFile, $fileMTime );
and--if the error message goes away--then I reckon that your server is mounted with atime updates disabled. This latter is a pure guess, as I do not have any experience of that situation.
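A defensive variant (untested; just a sketch) copes with both situations automatically by falling back to the 2-parameter call when the 3-parameter one fails:

// assumption: where the atime parameter (or atime updating) is
// unavailable the 3-arg call fails; fall back, losing only the
// duration tracking, not the visit counting
if( !@touch( $ipFile, $fileMTime, $fileATime )) {
touch( $ipFile, $fileMTime );
}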

After a little research: the 3rd parameter atime is not present in the PHP 4.2.2 documentation and I am not sure at what version it was added. There are bugs [php.net] in touch in Version 4.2.3, 4.2.1 and 4.0.2. If all else fails you will have to revert to the original code. Just make sure that you include the section to let good-bots through, 'cos otherwise they will be blocked (on my multi-thousand page site they crawl so long that they will *always* get blocked, even at 1-hit-every-3-secs).

The code in msg 1 can be used if $itime >= $imaxvisit and $itime / $imaxvisit evaluates to an integer. eg:
$iTime= 15;
$iMaxVisit= 15;
would work fine (15 hits in 15 secs is a lot for a single visitor).

Finally, the section of php posted in msg 15 was lifted direct from the page which is now running live on my (active) server. It is working fine, and I am eagerly waiting for it to catch yet more bots!

Addendum: dolcevita:
It would be useful to log not only IP/date/time/user-agent but also the page of your site where the refreshing happened.

You can add anything that you like to the log-file, but I would advise keeping the lines reasonably short. Some User-Agent fields are very long.

dolcevita




msg:1299783
 6:52 pm on Apr 17, 2005 (gmt 0)

Sorry for the trouble. It works; I made a mistake when replacing the old code with the new.
I do not know where the error was, but after copy/pasting the new code again and then modifying the files, I finally got it working.
Thanks for the script.

AlexK




msg:1299784
 3:24 am on Apr 18, 2005 (gmt 0)

dolcevita:
Sorry for the trouble. It works

(wipes brow) Phew! Glad that you are back to the good life. And many thanks for your comments - without your help I would have been blithely sailing along letting all and sundry rip off my site.

tomda




msg:1299785
 7:09 am on Apr 19, 2005 (gmt 0)

The script works great. It seems that the penalty time is longer than what is indicated, but who cares, if it is just for robots.

Three questions:

1/ 15 visits per 5 secs: is that a good value?

2/ Is there any better solution for allowing well-behaved bots than an if/else statement checking the remote address?

3/ If not, any idea where I could get a list of the must-have friendly bots?

Thanks

dolcevita




msg:1299786
 12:29 pm on Apr 19, 2005 (gmt 0)

It is not only a script against robots. It is a first defense against DDoS attacks and brute-forcing of your server.
I have used the script (the old, and now the new) for about 3-4 weeks, and no good robot has been caught.

$bInterval= 5;// secs; check interval
$bMaxVisit= 10;// Maximum visits allowed within $bInterval
$bPenalty= 300;// Seconds before visitor is allowed back
$ipLength= 3;// integer; 2 = 255 files, 3 = 4,096 files

tomda




msg:1299787
 10:14 am on Apr 20, 2005 (gmt 0)

Thanks, Dolce, for your feedback.

The problem I have regarding the penalty time persists.

It is longer than what is indicated; even if I press refresh after a longer period of time, I still get banned.

Am I the only one with this problem?

artdeco




msg:1299788
 11:08 am on Apr 20, 2005 (gmt 0)

Does anyone else get visits from

and please tell me how I can ban them from eating my bandwidth and visiting my server.
PS: I do not trust webmasters from ex-Iron Curtain countries (cz = Czechoslovakia, pl = Poland, and so on): scam artists, hackers, 302 redirects... from those countries.

[edited by: jatar_k at 6:35 pm (utc) on April 20, 2005]
[edit reason] no urls thanks [/edit]

incrediBILL




msg:1299789
 4:15 pm on Apr 20, 2005 (gmt 0)

I have 2 problems, and this addresses one of them:

- spiders hit the site too fast
- offline downloaders and scrapers; with 40K-80K pages, that has to stop

It seems that with a few modifications this script could also set a page-download limit per IP.

The issue then becomes letting the legitimate spiders pass vs blocking downloaders and scrapers.

Seems easy enough with a large list of exceptions.

AlexK




msg:1299790
 6:38 pm on Apr 20, 2005 (gmt 0)

tomda:
The problem I have regarding the penalty time is persistent.
It is longer than what it is indicated, that is even if I press refresh after a longer period of time, I still get banned.

Another boo-boo on my part: the touch() command within the banned-section was based on the old routine, and not re-written for the new algorithm. It is now re-written (see below), and works fine:
$bInterval= 5;// secs; check interval
$bMaxVisit= 10;// Maximum visits allowed within $bInterval
$bPenalty= 60;// Seconds before visitor is allowed back
20/04/2005 18:44:51 $fileATime=1114019091 $fileMTime=1114019092 $visits=1 $duration=1
20/04/2005 18:44:51 $fileATime=1114019091 $fileMTime=1114019093 $visits=2 $duration=1
...
20/04/2005 18:44:54 $fileATime=1114019091 $fileMTime=1114019100 $visits=9 $duration=3
20/04/2005 18:44:54 $fileATime=1114019089 $fileMTime=1114019219 $visits=10 $duration=3 (Blocked)
20/04/2005 18:44:55 $fileATime=1114019090 $fileMTime=1114019220 $visits=131 $duration=6 (Blocked)
20/04/2005 18:46:09 $fileATime=1114019090 $fileMTime=1114019221 $visits=131 $duration=79
After waiting for (a little more than) 60 secs it let me back in again (the new coding resets both $fileATime and $fileMTime to do this).

artdeco:
...please tell me how can I ban them from eating my Bandwidth

The routine is impartial to IP. You could add an IP-checking routine immediately before the ban-check and increase the number of visits, which would effectively ban them (look at msg 1), but really any such check should be done via iptables (the firewall on your server).
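If iptables is not available to you (on shared hosting, say), a minimal PHP-level equivalent in the style of the msg#1 prefix-checks, placed immediately before the ban-check, would look something like this (the prefixes are placeholders only):

$denied= array( '10.0.0.', '192.0.2.' );// substitute the real offenders
foreach( $denied as $prefix ) {
if( substr( $remote, 0, strlen( $prefix )) == $prefix ) {
header( 'HTTP/1.0 403 Forbidden' );
header( 'Connection: close' );
exit();
}
}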

incrediBILL:
The issue then becomes letting the legitimate spiders pass vs blocking downloaders and scrapers.

An illegitimate spider will not care about your bandwidth - it will scrape your site as fast as you let it. This is the mark of the beast.

Finally (and apologies in advance for making these pages so large) here is the full revised routine, with IP-checking for legit spiders now removed:
The items prepended with an underscore are Constants that need to be define()'d somewhere in your script before this snippet of code gets used:
eg:
define( '_B_DIRECTORY', '/full/path/on/server/' );
define( '_B_LOGFILE', 'logfile.name' );
define( '_B_LOGMAXLINES', '1000' );
These Constants can be variables or even constant-values within the code - your choice.
Directory permissions: `_B_DIRECTORY` needs to be read-writeable by the apache-group.
Both $ipLogFile and $ipFile are created on-the-fly if not already existing.
//----------------Start-blocking-badly-behaved-bots---------------------------------------------
$oldSetting= ignore_user_abort( TRUE );
$bInterval= 5;// secs; check interval
$bMaxVisit= 10;// Maximum visits allowed within $bInterval
$bPenalty= 60;// Seconds before visitor is allowed back
$ipLength= 3;// integer; 2 = 255 files, 3 = 4,096 files
$ipLogFile= _B_DIRECTORY . _B_LOGFILE;
$ipFile= _B_DIRECTORY . substr( md5( $remote ), -$ipLength );
$logLine= '';
$time= time();
$fileATime= $time;// access time:-tracks duration
$fileMTime= $time;// modification time:-tracks visits
if( file_exists( $ipFile )) {
$fileATime= fileatime( $ipFile );
$fileMTime= filemtime( $ipFile );
if( $fileMTime < $time ) { $fileMTime = $fileATime = $time; }
$fileMTime++;
$visits= $fileMTime - $fileATime;
$duration= $time - $fileATime;// secs
if( $duration < 1 ) $duration = 1;
if(( $visits >= $bMaxVisit ) and (( $visits / $duration ) > ( $bMaxVisit / $bInterval ))) {
$fileATime= $time = $time - $bInterval;
$fileMTime= $time + $bMaxVisit + (( $bMaxVisit * $bPenalty ) / $bInterval );
touch( $ipFile, $fileMTime, $fileATime );
header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server under heavy load</b><br />";
echo "$visits visits from your IP-Address within the last $duration secs. Please wait $bPenalty secs before retrying.</p></body></html>";
$remote = $_SERVER[ 'REMOTE_ADDR' ];// NB: set too late -- $remote is already used above to build $ipFile; see the correction a few messages below
$useragent= ( isset( $_SERVER[ 'HTTP_USER_AGENT' ]))
? $_SERVER[ 'HTTP_USER_AGENT' ]
: '<unknown user agent>';
$logLine= "$remote ". date( 'd/m/Y H:i:s' ) ." $useragent\n";
$log= file( $ipLogFile );
if( $fp = fopen( $ipLogFile, 'a' )) {// tiny danger of 2 threads interfering; live with it
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
fclose( $fp );// flock() disabled in some kernels (eg 2.4)
array_shift( $log );// fopen,fclose put close together as possible
array_push( $log, $logLine );
$logLine= implode( '', $log );
$fp= fopen( $ipLogFile, 'w' );
}
fputs( $fp, $logLine );
fclose( $fp );
}
exit();
}
}
touch( $ipFile, $fileMTime, $fileATime );
ignore_user_abort( $oldSetting );
//----------------Stop-blocking-badly-behaved-bots----------------------------------------------

incrediBILL




msg:1299791
 12:33 am on Apr 21, 2005 (gmt 0)

An illegitimate spider will not care about your bandwidth - it will scrape your site as fast as you let it. This is the mark of the beast.

I think you missed my second point.

The script as presented fixes issue #1, the bandwidth hogs.

I want to stop competitors from crawling my pages after a few hundred pages or so, instead of letting them get access to 40k+ pages. A human has a limit to how much information they can process in a day, but the offline downloading and scraping, even at a slow rate, just keeps going and going until they have my entire site.

I'm looking into a modification that will track that behavior and just shut them down after a few hundred pages with a nice friendly message like "You have exceeded your page view allotment for today, come back tomorrow you greedy pig"

AlexK




msg:1299792
 6:38 am on Apr 21, 2005 (gmt 0)

incrediBILL:
I want to stop competitors from crawling my pages ... until they have my entire site.
I'm looking into a modification that will track that behavior and just shut them down after a few hundred pages

Hmmm. The routine as written creates an empty file ($ipFile) and uses the mod/access time of this file to activate a block. The same file could also contain content...

As written, the routine pre-processes a page. What you want would require that it is split to both pre- and post-process a page (a rough sketch follows the list):

1. Use ob_start() at the beginning of the page
2. Use ob_get_contents() and strlen() at the end
3. Increment the integer stored in $ipFile
4. Adjust the mod/access time
5. Check against your limits during pre-processing
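A rough sketch of those steps (untested; the side-file name and the 500-page limit are hypothetical, and the count lives in a separate file so that writing it does not disturb the mod/access times the routine keeps on $ipFile):

// --- top of page (pre-process) ---
ob_start();
$countFile= $ipFile . '.cnt';// hypothetical side-file
$pages= file_exists( $countFile ) ? (int)file_get_contents( $countFile ) : 0;
if( $pages > 500 ) {// daily page allotment
header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
exit( 'You have exceeded your page-view allotment for today.' );
}
// ... normal page generation ...
// --- bottom of page (post-process) ---
$bytes= strlen( ob_get_contents() );// could rate-limit on bytes instead
ob_end_flush();
$fp= fopen( $countFile, 'w' );
fputs( $fp, (string)( $pages + 1 ));
fclose( $fp );
// a real version would also reset the count every 24 hrs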

You owe me a beer; a pint of ice-cold Guinness, please.

[added]:
Y'know, after preening myself on my clever, clever reply, it has occurred to me that there is an even simpler solution...

The routine tracks the number of visits in $visits. It would be very easy to add a check for the gross number of visits.

$visits is reset to zero when the mod-time falls behind the current time (a hit-rate of less than 1 a sec, which admittedly is 86k hits in 24 hours). My experience across the last 18 months, however, is that *no-one* scraping a site has the patience to hit your site so slowly (look at msg#9). Hence my original comment.

AlexK




msg:1299793
 8:38 am on Apr 21, 2005 (gmt 0)

incrediBILL:
I'm looking into a modification that will track that behavior and just shut them down after a few hundred pages

[after yet more thinking]:
A very sensible modification to the routine in msg#25 is to adjust the set-zero time (the change is in the if() test, emboldened in the original post):
$fileATime= fileatime( $ipFile );
$fileMTime= filemtime( $ipFile );
if(( $time - $fileMTime ) > $bInterval ) { $fileMTime = $fileATime = $time; }
$fileMTime++;

As long as $bInterval is not huge this will be OK, and it will also maintain the tracking of $visits and $duration for slow scrapers. It will now be possible to add new block values, then test:
$bTotBlock=86400;//secs; period to block long-duration scrapers
$bTotVisit=1000;//total visits allowed within a 24-hr period
...
if(( $visits >= $bTotVisit ) and ( $duration < 86400 )) {
$fileATime= $time = $time - $bInterval;
$fileMTime= $time + $bMaxVisit + ( $bMaxVisit * $bTotBlock / $bInterval );
touch( $ipFile, $fileMTime, $fileATime );
(your msg goes here!)
exit();
}

(none of the above has been checked in practice, although I *will* implement the first bit on my own server. I think.)

You will obviously have to be careful to choose values which will not block search-bots, else re-implement the IP-checking of msg#1.

AlexK




msg:1299794
 2:56 am on Apr 23, 2005 (gmt 0)

mea culpa, mea twit

The following line:
$remote = $_SERVER[ 'REMOTE_ADDR' ];
appears after $remote is used. It should be at this position:

$oldSetting= ignore_user_abort( TRUE );
$remote = $_SERVER[ 'REMOTE_ADDR' ];
$bInterval= 5;// secs; check interval

Checking my block log, the routine in msg#25 caught 96 attempted visits from 35 IP-addresses in a 15 minute span yesterday (Friday 22). My live site has all error messages switched off, so the notice on using a variable that is not yet declared did not appear. Sigh.

Sorry to all.

AlexK




msg:1299795
 6:04 am on Apr 28, 2005 (gmt 0)

The updated routine (with all amendments, including incrediBILL's request for trapping slow, long-term scrapers, but without the if() clause for well-behaved bots) has been running without problems for 5 days now. In fact, I was getting worried that it might not be working. Then I checked the block log this evening...

(the number at the end of each line is the number of times blocked within that second):
$bInterval= 10;
$bMaxVisit= 20;
$bPenalty= 60;
$bTotVisit= 500;
$bTotBlock= 42200;
$ipLength= 3;
24.174.219.197 26/04/2005 23:00:42 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 2
24.174.219.197 26/04/2005 23:00:41 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 5
24.174.219.197 26/04/2005 23:00:40 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 6
24.174.219.197 26/04/2005 23:00:39 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 6
24.174.219.197 26/04/2005 23:00:38 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 5
24.174.219.197 26/04/2005 23:00:37 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 6
24.174.219.197 26/04/2005 23:00:36 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 4
24.174.219.197 26/04/2005 23:00:35 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 5
24.174.219.197 26/04/2005 23:00:34 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1) 3
( 24.174.219.197 = cpe-24-174-219-197.elp.res.rr.com )

20 pages, then 42 attempts in 8 seconds blocked until this scraper got the point. It's a very good feeling. Also, no search-bots blocked at all. Excellent.

The final item now (possibly) is to reverse the use of atime & mtime. At the moment, access-time lags behind modification-time, which does not make logical sense. Also, any access to one of the $ipFile tracking-files from outside the routine may trip a block. I would welcome any comments on this.

I'll run the routine for another week to give it a good test (the previous mistakes have spooked me), then post the listing. If people want it.
