
PHP Server Side Scripting Forum

    
Integrating IP-Whitelist with Unruly Bot-Blocking Script
An answer to a sticky from someone having problems
AlexK




msg:3290086
 7:19 pm on Mar 22, 2007 (gmt 0)

I got a sticky from "a fellow Yorkshireman" having problems integrating the IP-Whitelist into the Unruly Bot-Blocking Script that I champion. These things are always best discussed in the forums - all can learn, and others can chip in with better ideas.

This is the sticky, heavily edited to cut down on the PHP listing (the full listing is in the script download):
...for some reason i can't get the ip whitelist working with it. Obviously i don't want to block legitimate bots and an ip list is a better idea than the user agents for reasons you mentioned in your post.
.
I copied the ip whitelist code over the relevant lines at the top of the file. So :
.
$oldSetting= ignore_user_abort( TRUE );// otherwise can screw-up logfile
if(!empty( $GLOBALS[ '_SERVER' ])) {
<set $_SERVER_ARRAY>
}
global ${$_SERVER_ARRAY};
$ipRemote= ${$_SERVER_ARRAY}[ 'REMOTE_ADDR' ];
.
Became:
.
function ipIsInNet( $ip, $net ) {
<function coding>
}
$oldSetting= ignore_user_abort( TRUE );
if(!empty( $GLOBALS[ '_SERVER' ])) {
<set $_SERVER_ARRAY>
}
global ${$_SERVER_ARRAY};
$ipRemote= ${$_SERVER_ARRAY}[ 'REMOTE_ADDR' ];
if(
<'good' bot == TRUE>
) {
// let well-behaved bots through
} else {
// block routine
}
.
I then put the rest of the blocking code inside the else loop, but it didn't seem to do anything when i tested it.

First, I do not personally use this Whitelist. Having said that...

function ipIsInNet() is in the right place, so let's concentrate on the Whitelist. In the code are 2 comments:

    line 63 // test for slow scrapers
    line 79 // test for fast scrapers
...and it is the if() code that follows both of those that wants to be actioned when *not* a 'good' bot. Probably, then, better to reverse the logic of the given Whitelist code. Thus:
59 if( $duration > $bStartOver ) {// restart tracking
60 $startTime= $hitsTime = $time;
61 $duration= $visits = 1;
62 } else if( $duration < 1 ) $duration = 1;
-- if(!(
-- <test for 'good' bots>
-- )) {
63 // test for slow scrapers
64 if() {
...
79 // test for fast scrapers
80 } elseif() {
...
97 }
-- } // <end of if() test for 'good' bots, reversed>
98 // log badly-behaved bots, then nuke 'em
99 if( $bLogLine ) {
<rest of routine follows as normal>
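
The reversed test can be sketched on its own, away from the script. This is a minimal illustration, not the script itself: the whitelist is cut down to two of the CIDR entries from the script comments, and isGoodBot() and shouldRunScraperTests() are hypothetical helpers introduced only for this example. ipIsInNet() is the function from the Whitelist code.

```php
<?php
// ipIsInNet() as given in the Whitelist code; $net is an IP range in CIDR format.
function ipIsInNet( $ip, $net ) {
    if( preg_match( '/^([^\/]+)\/([^\/]+)$/', $net, $ms )) {
        $mask = 0xFFFFFFFF << ( 32 - $ms[2] );
        return ( ip2long( $ip ) & $mask ) == ( ip2long( $ms[1] ) & $mask );
    }
    return FALSE;
}

// Hypothetical helper: TRUE when $ip falls inside any whitelisted range.
function isGoodBot( $ip ) {
    $whitelist = array(
        '66.249.64.0/19',   // Google
        '74.6.0.0/16',      // Inktomi
    );
    foreach( $whitelist as $net ) {
        if( ipIsInNet( $ip, $net )) return TRUE;
    }
    return FALSE;
}

// Hypothetical wrapper standing in for lines 63 - 97 of the script.
function shouldRunScraperTests( $ip ) {
    // reversed logic: the scraper tests run only when *not* a 'good' bot
    return !isGoodBot( $ip );
}

var_dump( shouldRunScraperTests( '66.249.70.1' ));  // inside the Google range
var_dump( shouldRunScraperTests( '203.0.113.9' ));  // not in the whitelist
```

With this shape the logging and exit() code that follows the tests stays outside the whitelist check, which is why the closing brace goes before line 98.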

My final piece of advice would be to make sure that you use
error_reporting( E_ALL ); at the top whilst testing, and fix ALL the ugly warnings and notices.

HTH, and do report back.

[edited by: coopster at 3:37 am (utc) on Mar. 27, 2007]
[edit reason] linked to library thread [/edit]

 

rsmarsha




msg:3294095
 8:38 am on Mar 27, 2007 (gmt 0)

I tested the ip block code by adding my own ip to the top, and echo'ing messages in the loop. I can't get the ip whitelist to actually execute the "let through" loop, it always runs through the "block" loop instead for some reason. I checked $ipRemote is echo'ing the correct value.

AlexK




msg:3295318
 1:04 pm on Mar 28, 2007 (gmt 0)

Hello rsmarsha - sorry, but I cannot repeat your result.

I copied the block of code in the current script [webmasterworld.com] (download link end of msg#1) from lines 230 - 271 into "test.php" and added the following at the bottom:
if(
...
) {
// let well-behaved bots through
echo "\$ipRemote=`$ipRemote`; allowed through<br />";
} else {
// block routine
echo "\$ipRemote=`$ipRemote`; rejected<br />";
}

That gave me "$ipRemote=`192.168.1.1`; rejected". So the following was added at the top of the
if() block:

ipIsInNet( $ipRemote, '192.168.1.1/32' ) or// me

...and re-uploaded and re-run. On the screen in front of me now is "$ipRemote=`192.168.1.1`; allowed through".

So, my apologies, but obviously there is a copy mistake somewhere in your code.

[edited by: eelixduppy at 4:51 pm (utc) on April 2, 2007]
[edit reason] obfuscated IPs [/edit]

rsmarsha




msg:3300007
 1:55 pm on Apr 2, 2007 (gmt 0)

I had to enter :

ipIsInNet( $ipRemote, '192.168.1.1/167' )

instead of just 167 on the end for it to work.

I'm unsure how the following blocks the whole range, to me it looks like it just blocks 74.6.0.0 - 74.6.0.16.

ipIsInNet( $ipRemote, '74.6.0.0/16' ) or// Inktomi has blocks 74.6.0.0 - 74.6.255.255

[edited by: eelixduppy at 4:56 pm (utc) on April 2, 2007]
[edit reason] obfuscated IP [/edit]

rsmarsha




msg:3300027
 2:08 pm on Apr 2, 2007 (gmt 0)

I looked into the post i made above and am i right in thinking that this whitelist uses the CIDR range?

In that case instead of 192.168.88.167/167 to let 192.168.88.167 through, should I put 192.168.88.167/32?

[edited by: eelixduppy at 8:05 pm (utc) on April 2, 2007]
[edit reason] obfuscated IPs [/edit]

AlexK




msg:3300075
 2:48 pm on Apr 2, 2007 (gmt 0)

am i right in thinking that this whitelist uses the CIDR range?

Correct. That's why the example I gave ends in "/32" (a single IP). An alternative (again drawn from the Whitelist in the script comments) is:

( substr( $ipRemote, 0, 13 ) == '66.194.55.242' ) or // Ocelli
(note that you would need to modify that for your IP to *14* digits)
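
For anyone else puzzled by the number after the slash: it is a count of leading bits that must match, not the last value in the range. So "/16" pins the first two octets and leaves the other two free, and "/32" pins all four. The sketch below prints the first and last address that a CIDR entry covers. cidrRange() is a hypothetical helper written for this illustration (it is not part of the script), and it assumes 64-bit PHP and IPv4.

```php
<?php
// Hypothetical helper: return the first and last address covered by a CIDR block.
// Assumes 64-bit PHP and IPv4 addresses.
function cidrRange( $net ) {
    list( $base, $bits ) = explode( '/', $net );
    $bits = (int) $bits;
    // netmask with the top $bits bits set, trimmed to 32 bits
    $mask = ( 0xFFFFFFFF << ( 32 - $bits )) & 0xFFFFFFFF;
    $first = ip2long( $base ) & $mask;          // all free bits cleared
    $last = $first | ( ~$mask & 0xFFFFFFFF );   // all free bits set
    return array( long2ip( $first ), long2ip( $last ));
}

foreach( array( '74.6.0.0/16', '66.249.64.0/19', '192.168.1.1/32' ) as $net ) {
    list( $first, $last ) = cidrRange( $net );
    echo "$net covers $first - $last\n";
}
```

Note also that an entry with no "/bits" part at all, such as '192.168.1.1', fails the preg_match() inside ipIsInNet(), so the function returns FALSE. A single address has to be written with /32.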

If neither of the above works, then you had better check that you have copied the code correctly. Even better, do exactly what I showed in msg#3295318 ...

OK, here's the code:

<?php
// test.php - to test whitelist coding, and personal IP
// lines 230 - 271 from bot-block.php.txt, with small amendments at bottom
function ipIsInNet( $ip, $net ) {
// note that $net is IP-range in CIDR format
if( preg_match( '/^([^\/]+)\/([^\/]+)$/', $net, $ms )) {
$mask = 0xFFFFFFFF << ( 32 - $ms[2] );
return ( ip2long( $ip ) & $mask ) == ( ip2long( $ms[1] ) & $mask );
}
return FALSE;
}
$oldSetting= ignore_user_abort( TRUE );
if(!empty( $GLOBALS[ '_SERVER' ])) {
$_SERVER_ARRAY= '_SERVER';
} elseif(!empty( $GLOBALS[ 'HTTP_SERVER_VARS' ])) {
$_SERVER_ARRAY= 'HTTP_SERVER_VARS';
} else {
$_SERVER_ARRAY= 'GLOBALS';
}
global ${$_SERVER_ARRAY};
$ipRemote= ${$_SERVER_ARRAY}[ 'REMOTE_ADDR' ];
if(
ipIsInNet( $ipRemote, '64.62.128.0/20' ) or// Gigablast has blocks 64.62.128.0 - 64.62.255.255
ipIsInNet( $ipRemote, '66.154.100.0/22' ) or// Gigablast has blocks 66.154.100.0 - 66.154.103.255
ipIsInNet( $ipRemote, '64.233.160.0/19' ) or// Google has blocks 64.233.160.0 - 64.233.191.255
ipIsInNet( $ipRemote, '66.249.64.0/19' ) or// Google has blocks 66.249.64.0 - 66.249.95.255
ipIsInNet( $ipRemote, '72.14.192.0/19' ) or// Google has blocks 72.14.192.0 - 72.14.239.255
ipIsInNet( $ipRemote, '72.14.224.0/20' ) or
ipIsInNet( $ipRemote, '216.239.32.0/19' ) or// Google has blocks 216.239.32.0 - 216.239.63.255
ipIsInNet( $ipRemote, '66.196.64.0/18' ) or// Inktomi has blocks 66.196.64.0 - 66.196.127.255
ipIsInNet( $ipRemote, '74.6.0.0/16' ) or// Inktomi has blocks 74.6.0.0 - 74.6.255.255
ipIsInNet( $ipRemote, '66.228.160.0/19' ) or// Overture has blocks 66.228.160.0 - 66.228.191.255
ipIsInNet( $ipRemote, '68.142.192.0/18' ) or// Inktomi has blocks 68.142.192.0 - 68.142.255.255
ipIsInNet( $ipRemote, '72.30.0.0/16' ) or// Inktomi has blocks 72.30.0.0 - 72.30.255.255
ipIsInNet( $ipRemote, '64.4.0.0/18' ) or// MS-Hotmail has blocks 64.4.0.0 - 64.4.63.255
ipIsInNet( $ipRemote, '65.52.0.0/14' ) or// MS has blocks 65.52.0.0 - 65.55.255.255
ipIsInNet( $ipRemote, '207.46.0.0/16' ) or// MS has blocks 207.46.0.0 - 207.46.255.255
ipIsInNet( $ipRemote, '207.68.128.0/18' ) or// MS has blocks 207.68.128.0 - 207.68.207.255
ipIsInNet( $ipRemote, '207.68.192.0/20' ) or
ipIsInNet( $ipRemote, '65.192.0.0/11' ) or// Teoma has blocks 65.192.0.0 - 65.223.255.255
( substr( $ipRemote, 0, 13 ) == '66.194.55.242' )// Ocelli
) {
// let well-behaved bots through
echo "\$ipRemote=`$ipRemote`; allowed through<br />";
} else {
// block routine
echo "\$ipRemote=`$ipRemote`; rejected<br />";
}
?>

[edited by: AlexK at 2:50 pm (utc) on April 2, 2007]

rsmarsha




msg:3300814
 7:28 am on Apr 3, 2007 (gmt 0)

Thanks

I did all that and it works fine. :) It was just the ip i was entering wrong i think, i entered my ip for checking in the normal format. :)

Thanks for your help, much appreciated.

rsmarsha




msg:3307205
 7:34 am on Apr 10, 2007 (gmt 0)

Any ideas on how to calculate a cidr number from a range of ips?

*edit*

I think i have found a calculator for CIDR at :

[subnet-calculator.com...]

Managed to work a CIDR for my first example.
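
The calculation can also be done in PHP. The following is only a sketch of one common approach, not anything from the bot-blocking script: starting at the bottom of the range, take the largest block that is aligned on the current address and does not run past the end, then repeat from the next address. rangeToCidrs() is a hypothetical name, and the code assumes 64-bit PHP and IPv4.

```php
<?php
// Hypothetical helper: split an inclusive IP range into the CIDR blocks
// that cover it exactly. Assumes 64-bit PHP and IPv4 addresses.
function rangeToCidrs( $startIp, $endIp ) {
    $start = ip2long( $startIp );
    $end = ip2long( $endIp );
    if( $start === FALSE or $end === FALSE or $start > $end ) return array();
    $cidrs = array();
    while( $start <= $end ) {
        $bits = 32;                                  // begin with a single address
        while( $bits > 0 ) {
            $bigger = 1 << ( 33 - $bits );           // size of a block one bit larger
            if(( $start % $bigger ) != 0 ) break;    // $start is not aligned on it
            if(( $start + $bigger - 1 ) > $end ) break; // it would run past $end
            $bits--;
        }
        $cidrs[] = long2ip( $start ) . '/' . $bits;
        $start += 1 << ( 32 - $bits );               // move to the address after this block
    }
    return $cidrs;
}

print_r( rangeToCidrs( '72.14.192.0', '72.14.239.255' ));
print_r( rangeToCidrs( '74.6.0.0', '74.6.255.255' ));
```

A range that does not sit on a single power-of-two boundary comes back as more than one entry, which is why some bots in the Whitelist need two ipIsInNet() lines.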

AlexK




msg:3307327
 12:01 pm on Apr 10, 2007 (gmt 0)

Nice site! (into my bookmarks)

touchring




msg:3577607
 6:30 am on Feb 18, 2008 (gmt 0)

Hi alex, a silly question, but which code comes in here?

-- if(!(
-- <test for 'good' bots>;
-- )) {

I place the following code but encountered an error:

if(!(
function ipIsInNet( $ip, $net ) {
// note that $net is IP-range in CIDR format
if( preg_match( '/^([^\/]+)\/([^\/]+)$/', $net, $ms )) {
$mask = 0xFFFFFFFF << ( 32 - $ms[2] );
return ( ip2long( $ip ) & $mask ) == ( ip2long( $ms[1] ) & $mask );
}
return FALSE;
....
ipIsInNet( $ipRemote, '207.68.192.0/20' ) or
ipIsInNet( $ipRemote, '65.192.0.0/11' ) or// Teoma has blocks 65.192.0.0 - 65.223.255.255
( substr( $ipRemote, 0, 13 ) == '66.194.55.242' )// Ocelli
)) {

Thanks for advice.

AlexK




msg:3577969
 4:13 pm on Feb 18, 2008 (gmt 0)

Hello touchring, and if it hasn't already been said to you, welcome to WebmasterWorld!

You have placed the function ipIsInNet() inside (rather than outside) the if() clause. Place that function declaration at top/bottom of your script--or, better, have a file of common functions that you use, and include it at the top--and try again.
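
A minimal sketch of that layout, with the whitelist cut down to two entries. The function_exists() guard and the file name in the comment are assumptions added for the illustration; they are one way to keep a shared functions file safe to include more than once.

```php
<?php
// In a real setup this guard would live in a shared file
// (hypothetical name: common-functions.php) pulled in with require_once.
if( !function_exists( 'ipIsInNet' )) {
    function ipIsInNet( $ip, $net ) {
        // note that $net is IP-range in CIDR format
        if( preg_match( '/^([^\/]+)\/([^\/]+)$/', $net, $ms )) {
            $mask = 0xFFFFFFFF << ( 32 - $ms[2] );
            return ( ip2long( $ip ) & $mask ) == ( ip2long( $ms[1] ) & $mask );
        }
        return FALSE;
    }
}

$ipRemote = '66.249.70.1';   // stand-in for the REMOTE_ADDR value

// the if() clause holds only *calls* to the function, never its declaration
if( !(
    ipIsInNet( $ipRemote, '66.249.64.0/19' ) or
    ipIsInNet( $ipRemote, '74.6.0.0/16' )
)) {
    $result = 'not whitelisted: run the scraper tests';
} else {
    $result = 'whitelisted: skip the scraper tests';
}
echo $result, "\n";
```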

If it still fails, reduce it down to the bare essentials and try again. Also try the trouble-shooting that I posted.

If everything fails, and you are tearing your hair out, post the snippet for review.

touchring




msg:3584634
 4:04 am on Feb 26, 2008 (gmt 0)

Thanks Alex, let me try to work it out. I was checking the logs over the last few days, and I noticed that Google or Mediabot is still being blocked even with the following settings, so I guess I need to work on the whitelist again. :)

$bInterval= 7;// secs; check interval (best > 5 < 30 secs)
$bMaxVisit= 14;// Max visits allowed within $bInterval (MUST be > $bInterval)
$bPenalty= 60;// Seconds before visitor is allowed back
$bTotVisit= 8000;// tot visits within $bStartOver (0==no slow-scraper block)
$bStartOver= 86400;// secs, default 1 day; restart tracking

touchring




msg:3584669
 5:09 am on Feb 26, 2008 (gmt 0)

I've replaced a whole chunk of code - to avoid messing up the formatting, I uploaded the amended PHP script to: <snip>

Got this error, any suggestions? Thanks again. :)

Warning: touch() [function.touch]: Unable to access in /home/virtual/site1/fst/var/www/html/block/bot-whitelist.php on line 126

[edited by: eelixduppy at 5:12 am (utc) on Feb. 26, 2008]
[edit reason] no URLs, please [/edit]

AlexK




msg:3586643
 12:37 am on Feb 28, 2008 (gmt 0)

The error comes on the last-line-but-one of the routine, so it has nothing to do with the bot-blocking script or your amendments.

Finding why you get this error involves perfectly standard PHP bug-hunting. What is the location of the file that it is trying to touch()? Does the directory exist? Is PHP allowed to access that directory? Etc, etc.

touchring




msg:3589295
 2:17 pm on Mar 2, 2008 (gmt 0)

I'm ashamed to say that my php skills are novice at best - i could do echo, if-else, some google debugging though. Here's the amended code i use to try and incorporate whitelisting. :)

While making the replacement, these are the lines i took note of, this is as much as i can understand from the comments:

function ipIsInNet( $ip, $net ) { 
....
$oldSetting= ignore_user_abort( TRUE );
....
)) {
// these are now NOT any of the above
// test for slow scrapers
if( ... ) {
...
// test for fast scrapers
} elseif( ... ) {
...
}
}
// log badly-behaved bots, then nuke 'em
if( $bLogLine ) {
(and so on with normal coding)
.....

Amended code:


// -------------- Start blocking badly-behaved bots : top code -------

function ipIsInNet( $ip, $net ) {
// note that $net is IP-range in CIDR format
if( preg_match( '/^([^\/]+)\/([^\/]+)$/', $net, $ms )) {
$mask = 0xFFFFFFFF << ( 32 - $ms[2] );
return ( ip2long( $ip ) & $mask ) == ( ip2long( $ms[1] ) & $mask );
}
return FALSE;
}

$oldSetting= ignore_user_abort( TRUE );
if( !empty( $GLOBALS[ '_SERVER' ])) {
$_SERVER_ARRAY= '_SERVER';
} elseif( !empty( $GLOBALS[ 'HTTP_SERVER_VARS' ])) {
$_SERVER_ARRAY= 'HTTP_SERVER_VARS';
} else {
$_SERVER_ARRAY= 'GLOBALS';
}
global ${$_SERVER_ARRAY};
$ipRemote= ${$_SERVER_ARRAY}[ 'REMOTE_ADDR' ];
$bInterval= 7;// secs; check interval (best > 5 < 30 secs)
$bMaxVisit= 14;// Max visits allowed within $bInterval (MUST be > $bInterval)
$bPenalty= 60;// Seconds before visitor is allowed back
$bTotVisit= 10;// tot visits within $bStartOver (0==no slow-scraper block)
//$bTotVisit= 8000;// tot visits within $bStartOver (0==no slow-scraper block)
$bStartOver= 86400;// secs, default 1 day; restart tracking
$ipLength= 3;// integer; 2=255 files, 3=4,096 files (best > 1 < 6)
$ipLogFile= _B_DIRECTORY . _B_LOGFILE;
if( !(
ipIsInNet( $ipRemote, '220.255.4.134' ) or
ipIsInNet( $ipRemote, '64.1.215.164' ) or
ipIsInNet( $ipRemote, '66.249.73.205' ) or
ipIsInNet( $ipRemote, '64.62.128.0/20' ) or// Gigablast has blocks 64.62.128.0 - 64.62.255.255
ipIsInNet( $ipRemote, '66.154.100.0/22' ) or// Gigablast has blocks 66.154.100.0 - 66.154.103.255
ipIsInNet( $ipRemote, '64.233.160.0/19' ) or// Google has blocks 64.233.160.0 - 64.233.191.255
ipIsInNet( $ipRemote, '66.249.64.0/19' ) or// Google has blocks 66.249.64.0 - 66.249.95.255
ipIsInNet( $ipRemote, '72.14.192.0/19' ) or// Google has blocks 72.14.192.0 - 72.14.239.255
ipIsInNet( $ipRemote, '72.14.224.0/20' ) or
ipIsInNet( $ipRemote, '216.239.32.0/19' ) or// Google has blocks 216.239.32.0 - 216.239.63.255
ipIsInNet( $ipRemote, '66.196.64.0/18' ) or// Inktomi has blocks 66.196.64.0 - 66.196.127.255
ipIsInNet( $ipRemote, '74.6.0.0/16' ) or// Inktomi has blocks 74.6.0.0 - 74.6.255.255
ipIsInNet( $ipRemote, '66.228.160.0/19' ) or// Overture has blocks 66.228.160.0 - 66.228.191.255
ipIsInNet( $ipRemote, '68.142.192.0/18' ) or// Inktomi has blocks 68.142.192.0 - 68.142.255.255
ipIsInNet( $ipRemote, '72.30.0.0/16' ) or// Inktomi has blocks 72.30.0.0 - 72.30.255.255
ipIsInNet( $ipRemote, '64.4.0.0/18' ) or// MS-Hotmail has blocks 64.4.0.0 - 64.4.63.255
ipIsInNet( $ipRemote, '65.52.0.0/14' ) or// MS has blocks 65.52.0.0 - 65.55.255.255
ipIsInNet( $ipRemote, '207.46.0.0/16' ) or// MS has blocks 207.46.0.0 - 207.46.255.255
ipIsInNet( $ipRemote, '207.68.128.0/18' ) or// MS has blocks 207.68.128.0 - 207.68.207.255
ipIsInNet( $ipRemote, '207.68.192.0/20' ) or
ipIsInNet( $ipRemote, '65.192.0.0/11' ) or// Teoma has blocks 65.192.0.0 - 65.223.255.255
( substr( $ipRemote, 0, 13 ) == '66.194.55.242' )// Ocelli
)) {
// test for slow scrapers
if(
( $bTotVisit > 0 ) and
( $visits >= $bTotVisit )
) {
$useragent= ( isset( ${$_SERVER_ARRAY}[ 'HTTP_USER_AGENT' ]))
? ${$_SERVER_ARRAY}[ 'HTTP_USER_AGENT' ]
: '<unknown user agent>';
$wait= ( int ) ( $bStartOver - $duration + 1 );// secs
header( 'HTTP/1.0 503 Service Unavailable' );
header( "Retry-After: $wait" );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server Down.</b><br />";
//echo "$visits visits from your IP-Address within the last ". (( int ) ( $duration / 3600 )) ." hours. Please wait ". (( int ) ( $wait / 3600 )) ." hours before retrying.</p></body></html>";
$bLogLine= "$ipRemote ". date( 'd/m/Y H:i:s' ) ." $useragent (slow scraper stopped)\n";
// test for fast scrapers
} elseif(
( $visits >= $bMaxVisit ) and
(( $visits / $duration ) > ( $bMaxVisit / $bInterval ))
) {
$startTime= $time;
$hitsTime= $time + (( $bMaxVisit * $bPenalty ) / $bInterval );
$wait= ( int ) ( $hitsTime - $startTime + 1 );
$useragent= ( isset( ${$_SERVER_ARRAY}[ 'HTTP_USER_AGENT' ]))
? ${$_SERVER_ARRAY}[ 'HTTP_USER_AGENT' ]
: '<unknown user agent>';
header( 'HTTP/1.0 503 Service Unavailable' );
header( "Retry-After: $wait" );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server Down.</b><br />";
//echo "You are scraping this site too quickly. Please wait at least $wait secs before retrying.</p></body></html>";
$bLogLine= "$ipRemote ". date( 'd/m/Y H:i:s' ) ." $useragent (fast scraper stopped)\n";
}
// log badly-behaved bots, then nuke 'em
if( $bLogLine ) {
touch( $ipFile, $startTime, $hitsTime );
$log= file( $ipLogFile );// flock() disabled in some kernels (eg 2.4)
if( $fp = fopen( $ipLogFile, 'a' )) {// tiny danger of 2 threads interfering; live with it
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
fclose( $fp );// fopen,fclose put close together as possible
while( count( $log ) >= _B_LOGMAXLINES ) array_shift( $log );
array_push( $log, $bLogLine );
$bLogLine= implode( '', $log );
$fp= fopen( $ipLogFile, 'w' );
}
fputs( $fp, $bLogLine );
fclose( $fp );
}
exit();
}
}
touch( $ipFile, $startTime, $hitsTime );
ignore_user_abort( $oldSetting );
// -------------- Stop blocking badly-behaved bots : top code --------


AlexK




msg:3590333
 9:44 pm on Mar 3, 2008 (gmt 0)

Jeez, touchring!

The way that you transfer from novice to expert is by practice. And actually listening to what others say. Then practising it.

The error comes on the last-line-but-one of the routine, so it has nothing to do with the bot-blocking script or your amendments.

So, your script with amendments is fine.

Finding why you get this error involves perfectly standard PHP bug-hunting.

  1. What is the location of the file that it is trying to touch()? (consult the error message)
  2. Does that directory exist?
  3. Is PHP allowed to access that directory? (dir permissions, etc)
  4. Etc, etc.

Do some obvious stuff. If the directory already exists, write a tiny test-script with just the touch() line in it. Or something.
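
Such a test-script might look like the sketch below. It walks the same checklist in order: is there a file name at all, does its directory exist, can PHP write there, and does touch() succeed. diagnoseTouch() is a hypothetical helper, and the temp-dir path is only a stand-in for the $ipFile value that the real script builds from _B_DIRECTORY.

```php
<?php
// Hypothetical helper: report which precondition for touch() fails first.
function diagnoseTouch( $file ) {
    $report = array();
    $report['file_given'] = ( is_string( $file ) and $file !== '' );
    $dir = $report['file_given'] ? dirname( $file ) : '';
    $report['dir_exists'] = ( $dir !== '' and is_dir( $dir ));
    $report['dir_writable'] = ( $report['dir_exists'] and is_writable( $dir ));
    // touch() is attempted only when every earlier check has passed
    $report['touch_ok'] = ( $report['dir_writable'] and @touch( $file ));
    return $report;
}

// stand-in path for this test only
$testFile = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'touch-test-' . getmypid();
print_r( diagnoseTouch( $testFile ));
print_r( diagnoseTouch( '' ));   // an empty file name fails at the first check
if( file_exists( $testFile )) unlink( $testFile );
```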

touchring




msg:3590690
 10:12 am on Mar 4, 2008 (gmt 0)

Thanks for replying. :)

1). What is the location of the file that it is trying to touch()? (consult the error message)

>>> Warning: touch() [function.touch]: Unable to access in /home/virtual/site1/fst/var/www/html/block/bot-whitelist.php on line 126

Line 126 is basically just this line:
touch( $ipFile, $startTime, $hitsTime );

I checked bot-whitelist.php, chmod 755, same as the original bot.php that worked without any issues.

I also tried renaming bot-whitelist.php to bot.php (exactly same name, same CHMOD value, file path and _B_DIRECTORY folder as the working bot.php), but no help.

2). Does that directory exist?

>>> Do you mean _B_DIRECTORY? Yes, it exists. Bot.php (original script) can write to that directory.

3). Is PHP allowed to access that directory? (dir permissions, etc)

>>> Yes.

AlexK




msg:3591651
 7:18 am on Mar 5, 2008 (gmt 0)

Line 126 is basically just this line:
touch( $ipFile, $startTime, $hitsTime );

Yup - "last-line-but-one of the routine".

Warning: touch() [function.touch]: Unable to access in /home/...

The error is causing you (and me!) confusion because the value for $ipFile that should be there is apparently missing (there will be a space on screen in the original error, which gets collapsed in a browser unless you use <pre>...</pre>). That means that $ipFile has a value of '' (empty string) or--more likely--NULL.

(I've just copied the snippet into my text editor & searched for it)

Sure 'nuff, you have forgotten to declare/initialise it.

There is a good lesson for you & other PHP-learners here, touchring - set error reporting to EVERYTHING when testing:

error_reporting( E_ALL );

You would then have had a Notice on screen about uninitialised variables, which would have alerted you to the source of the error.
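
A small sketch of the effect, for illustration only. It turns on full error reporting, then reads a variable that was never assigned, much as the snippet did with $ipFile. The custom error handler is an assumption added so that the message can be collected and inspected; in normal testing you would simply read the Notice on screen (recent PHP versions report an undefined variable as a Warning rather than a Notice).

```php
<?php
error_reporting( E_ALL );   // report everything, including notices

// For this demonstration only: collect the messages instead of printing them.
$messages = array();
set_error_handler( function( $errno, $errstr ) use ( &$messages ) {
    $messages[] = $errstr;
    return TRUE;   // handled; PHP's own handler is skipped
});

function describeIpFile() {
    // $ipFile is never assigned in this scope, mirroring the posted snippet
    return "ipFile=`" . $ipFile . "`";
}

$line = describeIpFile();
restore_error_handler();

echo $line, "\n";       // the value is empty, just like the gap in the touch() warning
print_r( $messages );   // the message names the variable that was never set
```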

Difficult to swallow--but essential--lessons, huh?

touchring




msg:3593812
 7:26 am on Mar 7, 2008 (gmt 0)

Thanks, after adding in "error_reporting( E_ALL );", the error becomes:

Notice: Undefined variable: visits in /home/virtual/site1/fst/var/www/html/directory/block/bot.php on line 81
Notice: Undefined variable: visits in /home/virtual/site1/fst/var/www/html/directory/block/bot.php on line 96
Notice: Undefined variable: bLogLine in /home/virtual/site1/fst/var/www/html/directory/block/bot.php on line 114
Notice: Undefined variable: ipFile in /home/virtual/site1/fst/var/www/html/directory/block/bot.php on line 131
Notice: Undefined variable: startTime in /home/virtual/site1/fst/var/www/html/directory/block/bot.php on line 131
Notice: Undefined variable: hitsTime in /home/virtual/site1/fst/var/www/html/directory/block/bot.php on line 131
Warning: touch() [function.touch]: Unable to access in /home/virtual/site1/fst/var/www/html/directory/block/bot.php on line 131

So i copied over the entire chunk of code and inserted just below - $ipLogFile= _B_DIRECTORY . _B_LOGFILE;


if( $ipLength > 3 ) {// 4=65,025 files, 5=1,044,480 files
// make sure that _B_DIRECTORY is inside script dir
// eg: if /path/to/script/ then /path/to/script/block/
// not: /path/to/block/
$bDirPrefix= 'b_';
$tmp= substr( md5( $ipRemote ), -$ipLength );
$ipFile= _B_DIRECTORY . $bDirPrefix . substr( $tmp, 0, 2 );// 255 dirs
if( !is_dir( $ipFile )) {
$oldMask= umask( 0 );// prevent umask value interfering
if( !mkdir( $ipFile, 0700 )) die( "Failed to create dir: '$ipFile'" );
umask( $oldMask );
}
$ipFile.= DIRECTORY_SEPARATOR . substr( $tmp, 2 );
} else {
$ipFile= _B_DIRECTORY . substr( md5( $ipRemote ), -$ipLength );
}
$bLogLine= '';
$time= $startTime = $hitsTime = time();

Now the error becomes:

Notice: Undefined variable: visits in /home/virtual/site1/fst/var/www/html/directory/block/bot.php on line 101
Notice: Undefined variable: visits in /home/virtual/site1/fst/var/www/html/directory/block/bot.php on line 116

So, i added -
$visits = $hitsTime - $startTime;

Now there's no more errors, but the blocking also does not work, even with $bTotVisit= 10;

This is the final code, take note of "// debug start" and "// debug end":

// -------------- Start blocking badly-behaved bots : top code -------

error_reporting( E_ALL );

function ipIsInNet( $ip, $net ) {
// note that $net is IP-range in CIDR format
if( preg_match( '/^([^\/]+)\/([^\/]+)$/', $net, $ms )) {
$mask = 0xFFFFFFFF << ( 32 - $ms[2] );
return ( ip2long( $ip ) & $mask ) == ( ip2long( $ms[1] ) & $mask );
}
return FALSE;
}

$oldSetting= ignore_user_abort( TRUE );
if( !empty( $GLOBALS[ '_SERVER' ])) {
$_SERVER_ARRAY= '_SERVER';
} elseif( !empty( $GLOBALS[ 'HTTP_SERVER_VARS' ])) {
$_SERVER_ARRAY= 'HTTP_SERVER_VARS';
} else {
$_SERVER_ARRAY= 'GLOBALS';
}
global ${$_SERVER_ARRAY};
$ipRemote= ${$_SERVER_ARRAY}[ 'REMOTE_ADDR' ];
$bInterval= 7;// secs; check interval (best > 5 < 30 secs)
$bMaxVisit= 14;// Max visits allowed within $bInterval (MUST be > $bInterval)
$bPenalty= 60;// Seconds before visitor is allowed back
$bTotVisit= 10;// tot visits within $bStartOver (0==no slow-scraper block)
//$bTotVisit= 8000;// tot visits within $bStartOver (0==no slow-scraper block)
$bStartOver= 86400;// secs, default 1 day; restart tracking
$ipLength= 3;// integer; 2=255 files, 3=4,096 files (best > 1 < 6)
$ipLogFile= _B_DIRECTORY . _B_LOGFILE;

// debug start

if( $ipLength > 3 ) {// 4=65,025 files, 5=1,044,480 files
// make sure that _B_DIRECTORY is inside script dir
// eg: if /path/to/script/ then /path/to/script/block/
// not: /path/to/block/
$bDirPrefix= 'b_';
$tmp= substr( md5( $ipRemote ), -$ipLength );
$ipFile= _B_DIRECTORY . $bDirPrefix . substr( $tmp, 0, 2 );// 255 dirs
if( !is_dir( $ipFile )) {
$oldMask= umask( 0 );// prevent umask value interfering
if( !mkdir( $ipFile, 0700 )) die( "Failed to create dir: '$ipFile'" );
umask( $oldMask );
}
$ipFile.= DIRECTORY_SEPARATOR . substr( $tmp, 2 );
} else {
$ipFile= _B_DIRECTORY . substr( md5( $ipRemote ), -$ipLength );
}
$bLogLine= '';
$time= $startTime = $hitsTime = time();

$visits= $hitsTime - $startTime;

// debug end

if( !(
ipIsInNet( $ipRemote, '220.255.4.134' ) or
ipIsInNet( $ipRemote, '64.1.215.164' ) or
ipIsInNet( $ipRemote, '66.249.73.205' ) or
ipIsInNet( $ipRemote, '64.62.128.0/20' ) or// Gigablast has blocks 64.62.128.0 - 64.62.255.255
ipIsInNet( $ipRemote, '66.154.100.0/22' ) or// Gigablast has blocks 66.154.100.0 - 66.154.103.255
ipIsInNet( $ipRemote, '64.233.160.0/19' ) or// Google has blocks 64.233.160.0 - 64.233.191.255
ipIsInNet( $ipRemote, '66.249.64.0/19' ) or// Google has blocks 66.249.64.0 - 66.249.95.255
ipIsInNet( $ipRemote, '72.14.192.0/19' ) or// Google has blocks 72.14.192.0 - 72.14.239.255
ipIsInNet( $ipRemote, '72.14.224.0/20' ) or
ipIsInNet( $ipRemote, '216.239.32.0/19' ) or// Google has blocks 216.239.32.0 - 216.239.63.255
ipIsInNet( $ipRemote, '66.196.64.0/18' ) or// Inktomi has blocks 66.196.64.0 - 66.196.127.255
ipIsInNet( $ipRemote, '74.6.0.0/16' ) or// Inktomi has blocks 74.6.0.0 - 74.6.255.255
ipIsInNet( $ipRemote, '66.228.160.0/19' ) or// Overture has blocks 66.228.160.0 - 66.228.191.255
ipIsInNet( $ipRemote, '68.142.192.0/18' ) or// Inktomi has blocks 68.142.192.0 - 68.142.255.255
ipIsInNet( $ipRemote, '72.30.0.0/16' ) or// Inktomi has blocks 72.30.0.0 - 72.30.255.255
ipIsInNet( $ipRemote, '64.4.0.0/18' ) or// MS-Hotmail has blocks 64.4.0.0 - 64.4.63.255
ipIsInNet( $ipRemote, '65.52.0.0/14' ) or// MS has blocks 65.52.0.0 - 65.55.255.255
ipIsInNet( $ipRemote, '207.46.0.0/16' ) or// MS has blocks 207.46.0.0 - 207.46.255.255
ipIsInNet( $ipRemote, '207.68.128.0/18' ) or// MS has blocks 207.68.128.0 - 207.68.207.255
ipIsInNet( $ipRemote, '207.68.192.0/20' ) or
ipIsInNet( $ipRemote, '65.192.0.0/11' ) or// Teoma has blocks 65.192.0.0 - 65.223.255.255
( substr( $ipRemote, 0, 13 ) == '66.194.55.242' )// Ocelli
)) {
// test for slow scrapers
if(
( $bTotVisit > 0 ) and
( $visits >= $bTotVisit )
) {
$useragent= ( isset( ${$_SERVER_ARRAY}[ 'HTTP_USER_AGENT' ]))
? ${$_SERVER_ARRAY}[ 'HTTP_USER_AGENT' ]
: '<unknown user agent>';
$wait= ( int ) ( $bStartOver - $duration + 1 );// secs
header( 'HTTP/1.0 503 Service Unavailable' );
header( "Retry-After: $wait" );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server Down.</b><br />";
//echo "$visits visits from your IP-Address within the last ". (( int ) ( $duration / 3600 )) ." hours. Please wait ". (( int ) ( $wait / 3600 )) ." hours before retrying.</p></body></html>";
$bLogLine= "$ipRemote ". date( 'd/m/Y H:i:s' ) ." $useragent (slow scraper stopped)\n";
// test for fast scrapers
} elseif(
( $visits >= $bMaxVisit ) and
(( $visits / $duration ) > ( $bMaxVisit / $bInterval ))
) {
$startTime= $time;
$hitsTime= $time + (( $bMaxVisit * $bPenalty ) / $bInterval );
$wait= ( int ) ( $hitsTime - $startTime + 1 );
$useragent= ( isset( ${$_SERVER_ARRAY}[ 'HTTP_USER_AGENT' ]))
? ${$_SERVER_ARRAY}[ 'HTTP_USER_AGENT' ]
: '<unknown user agent>';
header( 'HTTP/1.0 503 Service Unavailable' );
header( "Retry-After: $wait" );
header( 'Connection: close' );
header( 'Content-Type: text/html' );
echo "<html><body><p><b>Server Down.</b><br />";
//echo "You are scraping this site too quickly. Please wait at least $wait secs before retrying.</p></body></html>";
$bLogLine= "$ipRemote ". date( 'd/m/Y H:i:s' ) ." $useragent (fast scraper stopped)\n";
}
// log badly-behaved bots, then nuke 'em
if( $bLogLine ) {
touch( $ipFile, $startTime, $hitsTime );
$log= file( $ipLogFile );// flock() disabled in some kernels (eg 2.4)
if( $fp = fopen( $ipLogFile, 'a' )) {// tiny danger of 2 threads interfering; live with it
if( count( $log ) >= _B_LOGMAXLINES ) {// otherwise grows like Topsy
fclose( $fp );// fopen,fclose put close together as possible
while( count( $log ) >= _B_LOGMAXLINES ) array_shift( $log );
array_push( $log, $bLogLine );
$bLogLine= implode( '', $log );
$fp= fopen( $ipLogFile, 'w' );
}
fputs( $fp, $bLogLine );
fclose( $fp );
}
exit();
}
}
touch( $ipFile, $startTime, $hitsTime );
ignore_user_abort( $oldSetting );
// -------------- Stop blocking badly-behaved bots : top code --------
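
In the listing above, $visits is computed as $hitsTime - $startTime immediately after both were set to time(), so it is always 0 and neither scraper test can fire. The final touch( $ipFile, $startTime, $hitsTime ) hints at where real values would come from: touch() stores its second argument as the file's modification time and its third as the access time. The sketch below works under that assumption and is not the original script's code. It reads the two timestamps back with filemtime() and fileatime() and treats their difference as the visit count; the helper name and the counting rule are hypothetical.

```php
<?php
// Hypothetical helper: load the per-IP state that an earlier request saved with touch().
function loadVisitState( $ipFile, $now ) {
    if( !file_exists( $ipFile )) {
        // first visit from this IP: nothing recorded yet
        return array( 'startTime' => $now, 'hitsTime' => $now, 'visits' => 0 );
    }
    clearstatcache();                       // avoid stale cached timestamps
    $startTime = filemtime( $ipFile );      // 2nd argument of the earlier touch()
    $hitsTime = fileatime( $ipFile );       // 3rd argument of the earlier touch()
    return array(
        'startTime' => $startTime,
        'hitsTime' => $hitsTime,
        'visits' => $hitsTime - $startTime, // assumed counting rule
    );
}

// stand-in path for this test only
$ipFile = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'visit-state-' . getmypid();
$now = time();

$state = loadVisitState( $ipFile, $now );                       // no file yet
touch( $ipFile, $state['startTime'], $state['hitsTime'] + 1 );  // record one visit
$state = loadVisitState( $ipFile, $now );                       // read it back

echo "visits so far: {$state['visits']}\n";
unlink( $ipFile );
```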

AlexK




msg:3594758
 3:38 am on Mar 8, 2008 (gmt 0)

Now there's no more errors, but the blocking also does not work, even with $bTotVisit= 10;

I got so tired of people saying "it doesn't work" that I started a log of all the (attempted) scrapers blocked by this routine on my site (go to my site--url in profile--then > Forums >(bottom of page)> Site Info & Diary > 'Roll of Dishonour'). Blocks are a daily event, and I usually only post these days when there are 4 or more in the log on the same day.

There is a far simpler way for you to handle this.

The problems with the G-bot etc are caused by the slow-scraper block, so do not use it. It is then a good idea to set the roll-over period to be shorter than 24 hours, so:

$bTotVisit= 0;// tot visits within $bStartOver (0==no slow-scraper block)
$bStartOver= 10800;// secs, 3 hours; restart tracking

That will catch all the fast scrapers, and leave the search-bots alone. The script will then be smaller, faster and ready-debugged.
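
With $bTotVisit set to 0 the slow-scraper if() can never be true (it starts with $bTotVisit > 0), so only the fast-scraper test remains. Its arithmetic, using the settings posted earlier in the thread, can be sketched as below. The two function names are hypothetical wrappers around expressions taken from the script.

```php
<?php
// Fast-scraper test: the condition from the elseif() in the script.
function isFastScraper( $visits, $duration, $bMaxVisit, $bInterval ) {
    return ( $visits >= $bMaxVisit ) &&
        (( $visits / $duration ) > ( $bMaxVisit / $bInterval ));
}

// Length of the penalty window that the script adds to $hitsTime, in seconds.
function penaltySeconds( $bMaxVisit, $bPenalty, $bInterval ) {
    return ( $bMaxVisit * $bPenalty ) / $bInterval;
}

$bInterval = 7;    // secs; check interval
$bMaxVisit = 14;   // max visits allowed within $bInterval
$bPenalty = 60;    // seconds before visitor is allowed back

var_dump( isFastScraper( 20, 5, $bMaxVisit, $bInterval ));   // 4 hits/sec against a limit of 2/sec
var_dump( isFastScraper( 20, 60, $bMaxVisit, $bInterval ));  // about 0.33 hits/sec
var_dump( isFastScraper( 10, 1, $bMaxVisit, $bInterval ));   // fast, but fewer than $bMaxVisit visits
echo penaltySeconds( $bMaxVisit, $bPenalty, $bInterval ), "\n";
```

Both parts of the test must hold: a visitor needs at least $bMaxVisit hits on record and an average rate above $bMaxVisit / $bInterval hits per second.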

touchring




msg:3598644
 4:58 pm on Mar 12, 2008 (gmt 0)

I got so tired of people saying "it doesn't work" that I started a log of all the (attempted) scrapers blocked by this routine on my site (go to my site--url in profile--then > Forums >(bottom of page)> Site Info & Diary > 'Roll of Dishonour'). Blocks are a daily event, and I usually only post these days when there are 4 or more in the log on the same day.

Thanks, the code works (the original one without the whitelist amendments i tried doing). Well, except for a problem with Googlebot, which got caught by the script even with the following setting.

$bTotVisit= 10000;// tot visits within $bStartOver
$bStartOver= 86400;// secs, default 1 day; restart

I just checked Google webmaster, and Googlebot stopped indexing my site from that day, 26th Feb. :(

There is a far simpler way for you to handle this.

The problems with the G-bot etc are caused by the slow-scraper block, so do not use it. It is then a good idea to set the roll-over period to be shorter than 24 hours, so:

$bTotVisit= 0;// tot visits within $bStartOver (0==no slow-scraper block)
$bStartOver= 10800;// secs, 3 hours; restart tracking

Yup, trying this now. Thanks. :)
