Forum Moderators: coopster
The original revised routine works fine; in fact, since June there have been 677 attempts by 20 different people to rip off my site, all blocked by the routine. However, I have noticed a mistake in the long-term scraper code:
should be:
if( file_exists( $ipFile )) {
    $fileATime = fileatime( $ipFile );
    $fileMTime = filemtime( $ipFile );
    $fileATime++;
    $visits   = $fileATime - $fileMTime;
    $duration = $time - $fileMTime;	// secs
    // foll test also keeps tracking going to catch slow scrapers
    if( $duration > 86400 ) {	// 24 hours; start over
        $fileMTime = $fileATime = $time;
        $duration  = $visits = 1;
    } else if( $duration < 1 ) $duration = 1;
    // test for slow scrapers
    if( $visits >= $bTotVisit ) {
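The fragment above stops mid-test; for readers outside the thread, here is a minimal, self-contained sketch of the same atime/mtime counter trick. The temp-file path and the $bTotVisit threshold are placeholder assumptions, not the live script's values:

```php
<?php
// mtime holds the start of the tracking window; atime is (ab)used as a
// visit counter (window start + number of visits).
$ipFile    = sys_get_temp_dir() . '/bb_demo_ip';	// placeholder per-IP file
$bTotVisit = 1000;	// assumed slow-scraper threshold
$time      = time();

if( !file_exists( $ipFile )) {
    touch( $ipFile, $time, $time );	// new window: mtime = atime = start
}
clearstatcache();	// PHP caches stat info between calls
$fileATime = fileatime( $ipFile );
$fileMTime = filemtime( $ipFile );
$fileATime++;	// record this visit
$visits   = $fileATime - $fileMTime;
$duration = $time - $fileMTime;	// secs
if( $duration > 86400 ) {	// 24 hours; start over
    $fileMTime = $fileATime = $time;
    $duration  = $visits = 1;
}
$blocked = ( $visits >= $bTotVisit );	// slow-scraper test
touch( $ipFile, $fileMTime, $fileATime );	// write counters back
echo $blocked ? "blocked\n" : "ok\n";
```

The point of the trick is that no file contents are ever read or written; the two timestamps carry all the state, so one stat plus one touch() per request is the whole cost.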
The modified routine works extremely well. In fact, so well that you may need to carefully consider whether to use it. You see, the scrapers that it caught were the Adsense bot and (just a handful of times) the Yahoo! Slurp! bot.
That looks as though it would also stop the AdSense bot, which I would think would be a problem for an AdSense website.
$bot = 'N';
// Check if it is a good bot
$agents = array( 'Googlebot', 'Yahoo', 'msnbot', 'Jeeves', 'Mediapartners' );
$ua = $_SERVER['HTTP_USER_AGENT'];
foreach( $agents as $agent ) {
    if( strpos( $ua, $agent ) !== false ) {
        $bot = 'Y';
    }
}
if( $bot == 'N' ) {
    // ... run the bad-behaviour blocking script here ...
}
You can also use the ip address based exclusions outlined earlier in the thread if you prefer.
the scrapers that it caught were the Adsense bot
I'd forgotten all about that!
It has not been a problem on my site after the first week or so. Nothing has changed on the site, even though hits for Nov were 155% of Oct's figures. Specifically, Adsense revenue for Oct not only held, but actually rose. I do not think that the block-script was connected to this!
In general, my attitude is NOT to use whitelist-exclusions, although I thoroughly understand that others may be more nervous of the consequences (it may be a good idea to include the IP-address whitelist code as comments to the routine; I'll do that, and upload it later). I take the view that bots (just like people) should be judged by their behaviour, rather than their parents. My background is northern England working class, so you will perhaps understand my suspicion of aristocratic tendencies.
Gibisan:
good bot test
Nice and simple idea, but it has the fatal flaw of opening a door to anyone who can fake the User-Agent (dead easy).
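One way to close that door, as a sketch: accept the big engines only after a reverse-plus-forward DNS check rather than on User-Agent alone (the googlebot.com / google.com suffixes below are Google's documented ones; other engines publish their own):

```php
<?php
// Verify a claimed Googlebot IP: the reverse lookup must land in Google's
// domain, and the forward lookup of that name must return the same IP.
function isRealGooglebot( $ip ) {
    $host = gethostbyaddr( $ip );	// PTR lookup; returns the IP itself on failure
    if( !preg_match( '/\.(googlebot|google)\.com$/', $host )) {
        return false;	// PTR name is not in Google's domain
    }
    return gethostbyname( $host ) == $ip;	// forward-confirm the name
}
```

The two DNS lookups are slow, so in practice you would only run this check once per IP and cache the verdict.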
Request for assistance:
Does anyone know of a simple, swift preg-routine to test for IP-blocks of the style:
123.123.123.0/8?
What if a client/bot visits you once in the morning, then comes back in the afternoon and hits you 1000 times within 1-10 secs?
Is that outside the spirit of this script?
Okay, it's for "badly behaved bots", not for "attackers". But there may be some "good" user who surfs your site regularly (duration > 1000, visits = 100) and suddenly wants to download all the links from one of your pages, or even starts to use webzip/webdownload. It will crawl very fast, until it hits the 1000th page.
This technique works well on my site, which is dynamic and takes 0.5 to 1 seconds to generate many pages, as the 5th or 6th page request can sometimes tell the 1st and 2nd to abort.
check for banned bots at the start AND stop of each web page
Unless atime/mtime changes are not written out until the end of the script operation (not the case, AFAIK) your comments do not apply.
There are 9 lines of code for a non-block from read to write:
$fileATime = fileatime( $ipFile );
$fileMTime = filemtime( $ipFile );
$fileATime++;
$visits   = $fileATime - $fileMTime;
$duration = $time - $fileMTime;	// secs
if( restart tracking? )
if( slow scraper? )
} elseif( fast scraper? )
touch( $ipFile, $fileMTime, $fileATime );
At this point $ipFile is changed globally, and any other thread can react to it. I've been using, and checking, variants of this script since 2003, with up to 15 page requests/sec attempted, and have not yet detected what you suggest.
Oh! I've just (I think) understood what you are saying: prevent page delivery at the bottom of the page.
Ah yes, good idea! The one issue here is that PHP caches file info, so that cache would have to be discarded with clearstatcache()... yes, that needs adding. Fast bots only, I think. I'll add it asap.
Request for assistance: Does anyone know of a simple, swift preg-routine to test for IP-blocks of the style: 123.123.123.0/8?
<?php
function IsIPInNet( $ip, $net ) {
    if( preg_match( '/^([^\/]+)\/([^\/]+)$/', $net, $ms )) {
        $mask = 0xFFFFFFFF << ( 32 - $ms[2] );
        return ( ip2long( $ip ) & $mask ) == ( ip2long( $ms[1] ) & $mask );
    }
    return false;
}
echo IsIPInNet( '1.2.3.4', '1.2.3.0/24' ) ? "yes" : "no";	// yes
echo IsIPInNet( '2.2.3.4', '1.2.3.0/24' ) ? "yes" : "no";	// no
?>
incrediBILL:
A snippet of optional code added to enact your suggestion. Many thanks! (you both have credits on the page)
New version uploaded just now. Small name-changes to some variables to help keep the namespace unique, but main code algorithm remains unchanged.
One word of caution regarding prepended code:
Make sure it has NO BROWSER OUTPUT or it will interrupt SESSION creation in your following php programs.
In the bad-bot program be careful to erase or fill every blank line and make sure <?php is the very first line. Blank lines in php code leak through the php parser and go straight to the browser thus upsetting the rule that SESSION must be the first output of a php program (unless you want to hassle with ob_start and ob_flush, etc.). It's easier to just close up the blank lines, I think.
>> Blank lines in php code leak through the php parser and go straight to the browser
unless I am misunderstanding you, this isn't true
blank lines inside <?php don't go to the browser at all
blank lines before <?php and after ?> do, though
<sorry for OT AlexK>
let's stay on topic folks, great routine, great support
thanks again AlexK
saltlakejohn:
The point about spaces before/after prepend files is a good one, and will be added to the comments with a credit - thanks.
On to my point:
Does anyone have a very high traffic site using this? If so, how well does it handle monthly traffic of say 5 to 10 million page impressions?
The reason I ask is I'm about to launch a Yellow Pages type site which will not have many users to start with, but it will be spider food. I'd like to make sure it's not going to cripple the server.
I can see that an integer value of 2 or 3 is suggested to allow 256 or 4096 file combinations (based on the number of characters taken from the md5 representation of the IP address).
Surely a value of 2 will bar 0.4 percent of all traffic for any (and every) banned md5 substring.
The problem is reduced with a setting of 3, but it still seems an inaccurate way to track the myriad of possible addresses.
I would suggest an extension of the length of substring used, maybe to 5. The first 2 of these could represent a set of 256 directories each of which having up to 4096 files.
I'm suggesting a directory structure as I'm not sure that all systems would cope well with a million files in a single directory (and 65K in one is a strain for some).
Alternatively it could be set at 2+2, giving 65536 possibilities.
I will probably be implementing this on my server, I just thought it may be of interest to you all.
Currently it's feasible, although unlikely, that a sustained rogue-bot attack spread over a large set of IPs could knock out a considerable proportion of traffic for any site implementing the 256-possibility version of the script.
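The fractions being argued over are easy to check: a single banned md5 suffix of n hex characters matches 1 in 16^n of all possible hash values:

```php
<?php
// Share of traffic caught by one banned suffix, per suffix length.
foreach( array( 2, 3, 5 ) as $n ) {
    printf( "%d chars: 1 in %7d = %.4f%%\n", $n, pow( 16, $n ), 100 / pow( 16, $n ));
}
// 2 chars: 1 in     256 = 0.3906%
// 3 chars: 1 in    4096 = 0.0244%
// 5 chars: 1 in 1048576 = 0.0001%
```

So the quoted "0.4 percent" for a 2-character suffix is the 1-in-256 figure, and each extra character divides the collateral damage by 16.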
how well does it handle monthly traffic of say 5 to 10 million page impressions?
suggest an extension of the length of substring used, maybe to 5 ... directory structure
$ipLength = 3;	// integer; 2 = 256 files, 3 = 4,096 files
$ipLogFile = _B_DIRECTORY . _B_LOGFILE;
$ipFile = _B_DIRECTORY . substr( md5( $ipRemote ), -$ipLength );
Becomes:
$ipLength = 5;	// integer; 4 = 65,536 files spread across 256 folders, 5 = 1 million files spread across 256 folders
$ipLogFile = _B_DIRECTORY . _B_LOGFILE;
$ipFile = _B_DIRECTORY . substr( md5( $ipRemote ), -$ipLength, 2 ) . "/" . substr( md5( $ipRemote ), -( $ipLength - 2 ) );
Also, there is the small matter of setting up the 256 folders. I used a 'quick and dirty' script (shown below, doctored as I didn't even bother with the constant); I'll leave it for someone else to do a better job of it.
<?php
// Create the 256 two-hex-character shard directories (00 .. ff),
// checking that each one is actually created.
//
// In keeping with the previous script
define( '_B_DIRECTORY', '/full/path/on/server/to/block/dir/' );
$hex = '0123456789abcdef';
for( $i = 0; $i < 16; $i++ ) {
    for( $j = 0; $j < 16; $j++ ) {
        $dir = _B_DIRECTORY . $hex[ $j ] . $hex[ $i ];
        if( !is_dir( $dir ) && !mkdir( $dir, 0775 )) {
            echo "Could not create $dir\n";
        }
    }
}
?>
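An alternative sketch that avoids the setup script altogether: create each shard directory on demand at the point where $ipFile is computed. Names here ($ipRemote, _B_DIRECTORY) are as in the main routine; this is a hedged variant, not tested against the live script:

```php
<?php
// On-demand shard creation: the first two hex chars of the suffix pick
// the folder, the remaining chars name the file; mkdir only runs when
// the folder is missing, so there is no one-off setup step.
$hash  = md5( $ipRemote );	// $ipRemote as set earlier in the routine
$ipDir = _B_DIRECTORY . substr( $hash, -5, 2 );
if( !is_dir( $ipDir )) {
    mkdir( $ipDir, 0775 );
}
$ipFile = $ipDir . '/' . substr( $hash, -3 );
```

After the first day or so of traffic the is_dir() test almost always succeeds, so the steady-state cost is one extra stat per request.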
[The phpBB2 folks have changed the auto_login_key algorithm, which has nicely screwed my interface-class. I am busy re-writing the sessions part of the wretched thing. Thank God it is not live yet (but should be). So, any excuse to get away from that. And I did not think that the code should take too long, and it did not.]
The code additions are pretty much the same as the previous message, but it checks for both $ipLength and whether a dir exists or not.
Q for inbound:
mkdir( _B_DIRECTORY . "00", 0775);
I've been looking at IP restrictions for known problem countries, do you think that would be worthwhile?
I know that IP tables are a pain, I'm just talking about the bigger blocks that are allocated to bot-loving nations.
It makes sense for us in particular as we have 100% UK specific advertisers (being Local Search specialists) so traffic from China is of little interest to us.
I've been looking at IP restrictions for known problem countries
Seriously, that's something that has also occurred to myself. If you look at the comments under the routine (begins line#243) you will see a simple, reasonably-quick algorithm for ID-ing any block of IP-numbers (thanks to Hanu) and line #254 has an even quicker means of ID-ing a single IP. By reversing the logic given in that routine (as written, it lets well-behaved bots through) and supplying a suitable error header() you can easily block any IP or IP-block that you wish.
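As a sketch of that reversed logic (it assumes IsIPInNet() from the earlier post is already included; the netblock is purely illustrative, not a recommendation):

```php
<?php
// Refuse service to any request whose IP falls in a listed netblock.
// IsIPInNet() is the CIDR tester posted earlier in this thread.
$badNets = array( '221.0.0.0/8' );	// illustrative example only
foreach( $badNets as $net ) {
    if( IsIPInNet( $_SERVER['REMOTE_ADDR'], $net )) {
        header( 'HTTP/1.1 403 Forbidden' );
        exit;	// stop before any page work is done
    }
}
```

Run as early as possible (a prepend file is ideal), the blocked request costs almost nothing.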
Here is a more sophisticated method of obtaining the IP, which may be useful (note that this board converts all pipe (|) entries into a broken vertical bar (¦), and they need converting back):
function _getUserIP() {	// Obtain and encode users IP
    // from phpBB2 (common.php + functions.php)
    global $HTTP_SERVER_VARS, $HTTP_ENV_VARS, $REMOTE_ADDR;
    if( getenv( 'HTTP_X_FORWARDED_FOR' ) != '' ) {
        $client_ip = ( !empty( $HTTP_SERVER_VARS[ 'REMOTE_ADDR' ] ))
            ? $HTTP_SERVER_VARS[ 'REMOTE_ADDR' ]
            : (( !empty( $HTTP_ENV_VARS[ 'REMOTE_ADDR' ] ))
                ? $HTTP_ENV_VARS[ 'REMOTE_ADDR' ]
                : $REMOTE_ADDR
            );
        $entries = explode( ',', getenv( 'HTTP_X_FORWARDED_FOR' ));
        reset( $entries );
        while( list( , $entry ) = each( $entries )) {
            $entry = trim( $entry );
            if( preg_match( "/^([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)/", $entry, $ip_list )) {
                $private_ip = array( '/^0\./', '/^127\.0\.0\.1/',
                    '/^192\.168\..*/',
                    '/^172\.((1[6-9])|(2[0-9])|(3[0-1]))\..*/',
                    '/^10\..*/', '/^224\..*/',
                    '/^240\..*/'
                );
                $found_ip = preg_replace( $private_ip, $client_ip, $ip_list[ 1 ] );
                if( $client_ip != $found_ip ) { $client_ip = $found_ip; break; }
            }
        }
    } else {
        $client_ip = ( !empty( $HTTP_SERVER_VARS[ 'REMOTE_ADDR' ] ))
            ? $HTTP_SERVER_VARS[ 'REMOTE_ADDR' ]
            : (( !empty( $HTTP_ENV_VARS[ 'REMOTE_ADDR' ] ))
                ? $HTTP_ENV_VARS[ 'REMOTE_ADDR' ]
                : $REMOTE_ADDR
            );
    }	// if( getenv( 'HTTP_X_FORWARDED_FOR' ) != '' ) else
    $ip_sep = explode( '.', $client_ip );
    return sprintf( '%02x%02x%02x%02x', $ip_sep[ 0 ], $ip_sep[ 1 ], $ip_sep[ 2 ], $ip_sep[ 3 ] );
}	// function BB2_Interface::_getUserIP()
It has occurred to me that a better means of doing that would be to integrate it into IPTables for the Firewall but, once again, any work on all of that is for the future. Much more pressing concerns for myself at this moment.
I've been running a self-made script based on IP tables from MaxMind (GeoLite Country).
Any code that you could send will be welcome. It is clearly better if that code makes use of facilities available on every computer (hence my consideration of iptables, even though that would be linux-only), but all help greatly welcomed.
If a fast- or a slow-scraper is to be blocked for a period of time it is much more efficient for that to be done at the firewall, probably with an error-message at the first block. It would require complete interface code for the firewall, I think, and the time that that would take causes me not to approach it at this moment. There is nothing to stop me acquiring some sample code for the future, however, and I would greatly appreciate any assistance in that.
Over to you.
My current site is a .com; I have also got a .co.uk site which is both rather old (it is the original site) and full of .html pages. Checking through the Apache logs this morning I discovered the following:
217.160.75.202 - - [15/Jan/2006:12:03:40 +0000] "GET /robots.txt HTTP/1.1" 200 23 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:40 +0000] "GET / HTTP/1.1" 200 1417 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:40 +0000] "GET /toolbar.html HTTP/1.1" 200 2102 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:40 +0000] "GET /home.html HTTP/1.1" 200 11657 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:41 +0000] "GET / HTTP/1.1" 200 1417 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:41 +0000] "GET /mfc/index.html HTTP/1.1" 301 240 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:41 +0000] "GET /chips/index.html HTTP/1.1" 301 242 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:41 +0000] "GET /help/index.html HTTP/1.1" 301 241 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:41 +0000] "GET /modblame.html HTTP/1.1" 200 52104 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:43 +0000] "GET /help/56k.html HTTP/1.1" 200 71030 "-" "DTAAgent" In:- Out:-:-pct.
...
217.160.75.202 - - [15/Jan/2006:12:09:31 +0000] "GET /stats.php?page=chips/amb562.html&domain=1 HTTP/1.1" 200 2784 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:09:32 +0000] "GET /? HTTP/1.1" 200 1417 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:09:32 +0000] "GET /stats.php?page=help/form.html&silent=plain&domain=1 HTTP/1.1" 200 6 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:09:32 +0000] "GET /stats.php?page=help/form1.html&silent=plain&domain=1 HTTP/1.1" 200 6 "-" "DTAAgent" In:- Out:-:-pct.
Damn them.
Not a lot of pages, but that will at a minimum slow the server down, plus I have to pay for that scraper's bandwidth (the .com site has 100,000+ pages on it, which is why my main focus on blocking scrapers is with that site).
Damn them!
The bot code was put into its own file (block_bad_bots.php) and the following added to the bottom of the VirtualHost directive for the .co.uk site in httpd.conf (Apache servers only):
# php directives
# 2006-01-19 added to block bad-bots -AK
#
# <IfModule mod_php4.c>
AddType application/x-httpd-php .html
php_value auto_prepend_file "/server/path/to/file/block_bad_bots.php"
# </IfModule>
# End of php directives
Then, a simple apachectl graceful put it up and running.
Notes:
1 I know that php is running on my system, so the IfModules are commented out, and are there for reference only.
2 It is incredibly difficult to trip the block with a normal browser. I had to add an echo "Yes, I am here, dumbo"; to the code to prove to myself that it was working.
Warning: touch(): Unable to create file _B_DIRECTORYccd because Permission denied in [script path here] on line 114
I know this came up before here [webmasterworld.com] but the suggestions there don't seem to fix the problem.
- The defined directory definitely exists
- The directory has been chmod'ed 777
- Removing the $fileATime from line 114 doesn't remove the error
Any ideas? The only alternative was to use the old script, but that seems to have fallen well behind since 2003.
Warning: touch(): Unable to create file _B_DIRECTORYccd
The filename in the warning contains the literal text _B_DIRECTORY, which suggests the constant was never defined, so PHP has fallen back to treating the bare name as a string. Change the error-reporting to show Notices, and you will get an 'undefined constant' Notice confirming that if so.
The other obvious thing is to echo() out the value of $ipFile once it has been set, and test on the server whether you can touch it.