Forum Moderators: coopster


Blocking Badly Behaved Bots #3

Small correction to a previously-posted routine


AlexK

1:22 pm on Oct 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, this is a correction to an update [webmasterworld.com] (which itself had several corrections [webmasterworld.com]) to a posting [webmasterworld.com]. Sigh and double-sigh.

The original revised routine works fine--in fact, since June there have been 677 attempts by 20 different people to rip off my site, all blocked by the routine. However, I have noticed a mistake in the long-term scraper code, which should read:

if( file_exists( $ipFile )) {
    $fileATime = fileatime( $ipFile );
    $fileMTime = filemtime( $ipFile );
    $fileATime++;
    $visits   = $fileATime - $fileMTime;
    $duration = $time - $fileMTime;    // secs
    // foll test also keeps tracking going to catch slow scrapers
    if( $duration > 86400 ) {          // 24 hours; start over
        $fileMTime = $fileATime = $time;
        $duration  = $visits = 1;
    } else if( $duration < 1 ) $duration = 1;
    // test for slow scrapers
    if( $visits >= $bTotVisit ) {

Rather than re-posting the whole (revised) routine all over again, for just a one-line change, the entire bot-block routine can be downloaded at this link [download.modem-help.co.uk]. I warmly endorse it to you.
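For anyone following along, the trick the snippet relies on is that each tracking file stores its visit count in the gap between its two timestamps--no file contents are ever read or written. A minimal sketch of the idea (the file path here is illustrative, not from the routine):

```php
<?php
// The tracking file's atime is kept at (mtime + visits), so the visit
// count is simply the difference between the two stamps; touch()
// updates both in one call.
$ipFile = sys_get_temp_dir() . '/bb_demo';
$time   = time();

touch( $ipFile, $time, $time + 1 );   // first visit: mtime = start, atime = start + 1
clearstatcache();                     // PHP caches stat info; discard it

$visits   = fileatime( $ipFile ) - filemtime( $ipFile );
$duration = time() - filemtime( $ipFile );

echo "visits=$visits duration={$duration}s\n";
unlink( $ipFile );
```

Because the files stay zero-length, the whole tracker costs one stat and one touch per request.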

[edited by: coopster at 1:57 pm (utc) on Aug. 7, 2008]

AlexK

7:46 pm on Dec 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My site has Adsense on every page.

Sorry, ogletree, but I have no idea what has triggered your question.

ogletree

10:22 pm on Dec 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The modified routine works extremely well. In fact, so well that you may need to carefully consider whether to use it. You see, the scrapers that it caught were the Adsense bot and (just a handful of times) the Yahoo! Slurp! bot.

That looked like it would stop the Adsense bot, which I would think would be a problem for an Adsense website.

Gibisan

1:03 am on Dec 17, 2005 (gmt 0)

10+ Year Member



Mediapartners-Google was the first bot that the script caught during my testing. I added it to the good-bot test which precedes the badly-behaved checks, and now there is no problem.

$bot = 'N';
// Check if it is a good bot
$agents = array( 'Googlebot', 'Yahoo', 'msnbot', 'Jeeves', 'Mediapartners' );
$ua = $_SERVER['HTTP_USER_AGENT'];

foreach ( $agents as $agent )
{
    if ( strpos( $ua, $agent ) !== false )
    {
        $bot = 'Y';
    }
}

if ( $bot == 'N' )
{
    // run the badly-behaved-bot checks here
}

You can also use the ip address based exclusions outlined earlier in the thread if you prefer.

AlexK

9:16 am on Dec 17, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



the scrapers that it caught were the Adsense bot

I'd forgotten all about that!

It has not been a problem on my site after the first week or so. Nothing has changed on the site, even though hits for Nov were 155% of Oct's figures. Specifically, Adsense revenue for Oct not only held, but actually rose. I do not think that the block-script was connected to this!

In general, my attitude is NOT to use whitelist-exclusions, although I thoroughly understand that others may be more nervous of the consequences (it may be a good idea to include the IP-address whitelist code as comments to the routine; I'll do that, and upload it later). I take the view that bots (just like people) should be judged by their behaviour, rather than their parents. My background is northern England working class, so you will perhaps understand my suspicion of aristocratic tendencies.

Gibisan:

good bot test

Nice and simple idea, but it has the fatal flaw of opening a door for anyone who can fake the User-Agent (dead easy).
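For what it's worth, the standard way to close that door is forward-confirmed reverse DNS: only trust the User-Agent claim if the IP's hostname belongs to the engine and resolves back to the same IP. A sketch (the hostname suffixes and function names are my own illustrations, not part of the routine):

```php
<?php
// Does $host end in $suffix? (pure helper, testable offline)
function hostMatchesSuffix( $host, $suffix ) {
    return substr( $host, -strlen( $suffix ) ) === $suffix;
}

// Forward-confirmed reverse DNS: reverse-resolve the IP, check the
// hostname belongs to a known engine, then forward-resolve it and
// require the same IP back. Suffix list is illustrative.
function isVerifiedCrawler( $ip ) {
    $host = gethostbyaddr( $ip );   // returns the IP string on failure
    $suffixes = array( '.googlebot.com', '.crawl.yahoo.net', '.search.msn.com' );
    foreach( $suffixes as $suffix ) {
        if( hostMatchesSuffix( $host, $suffix ) ) {
            return gethostbyname( $host ) === $ip;   // forward confirmation
        }
    }
    return false;
}
```

A faked User-Agent then buys nothing: the whitelist decision never looks at the UA string at all.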

Request for assistance:

Does anyone know of a simple, swift preg-routine to test for IP-blocks of the style:

123.123.123.0/8
?

AlexK

11:35 am on Dec 17, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A new version of the routine uploaded just now.

The main routine algorithm remains unchanged; the start-over period has been made into a variable ($bStartOver, to restart tracking; default 1 day), and comments--drawn from this thread, with many thanks to all--have been added to the bottom.

Xuefer

7:49 am on Dec 18, 2005 (gmt 0)

10+ Year Member



Good thread, but how about the performance?
Storing a lot of files on the HD and looking each one up - maybe enable ext3 with dir_index?
But how do you tune tmpfs settings for lots of files?
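One way to take the disk out of the picture, for anyone going the tmpfs route: mount a small RAM-backed filesystem over the tracking directory, sizing the inode count rather than the space. An illustrative /etc/fstab line (path, size and inode figures are my assumptions, not from this thread):

```
tmpfs  /var/bot-block  tmpfs  size=16m,nr_inodes=70k,mode=0700  0  0
```

Since each tracking file is zero-length (only its timestamps matter), nr_inodes is the limit that counts, not size.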

Xuefer

9:38 am on Dec 18, 2005 (gmt 0)

10+ Year Member



mtime/atime is nice, but can only be used for a simple algo.

What if a client/bot visits you once in the morning, then comes back in the afternoon and hits you 1000 times within 1-10 secs? Is that outside the spirit of this script?

Okay, it's for "badly behaved bots", not for "attackers". But there are "good" users who surf your site regularly (duration > 1000, visits = 100) and then suddenly want to download every link from a page, or even start to use webzip/webdownload. It will crawl very fast, until it hits the 1000th page.

incrediBILL

9:49 am on Dec 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



HINT: as opposed to just checking for banned-bot behaviour at the top of the page, if you check for banned bots at the start AND end of each web page you can abort many simultaneous requests, as the 10th page request in 2 seconds might be detected by the 3rd page still trying to generate.

This technique works well on my site, which is dynamic and takes 0.5 to 1 seconds to generate many pages, as the 5th or 6th page can sometimes tell the 1st and 2nd to abort.

AlexK

4:53 pm on Dec 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



incrediBILL:
check for banned bots at the start AND stop of each web page

Unless atime/mtime changes are not written out until the end of the script operation (not the case, AFAIK) your comments do not apply.

There are 9 lines of code for a non-block from read to write:

$fileATime= fileatime( $ipFile );
$fileMTime= filemtime( $ipFile );
$fileATime++;
$visits= $fileATime - $fileMTime;
$duration= $time - $fileMTime;// secs
if( restart tracking?
if( slow scraper?
} elseif( fast scraper?
touch( $ipFile, $fileMTime, $fileATime );

At this point $ipFile is changed globally, and any other thread can react to it. I've been using--and checking--variants of this script since 2003, with up to 15 page requests/sec attempted, and have not yet detected what you suggest.

Oh! I've just (I think) understood what you are saying: prevent page delivery at the bottom of the page.

Ah yes, good idea! The one issue here is that PHP caches file info, so that cache would have to be discarded with clearstatcache()... yes! that needs adding. Fast bots only, I think. I'll add it asap.
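For reference, the bottom-of-page check can be as small as this (a sketch; $ipFile and the visit arithmetic follow the routine's conventions, but the helper name and the limit of 14 are mine):

```php
<?php
// Re-check at the *end* of page generation: another thread may have
// pushed this IP over the limit while we were building the page.
// clearstatcache() is essential, or PHP returns the stamps it cached
// at the top-of-page check.
function shouldAbortPage( $ipFile, $maxVisits ) {
    clearstatcache();                  // discard cached fileatime/filemtime
    if( !file_exists( $ipFile ) ) {
        return false;
    }
    $visits = fileatime( $ipFile ) - filemtime( $ipFile );
    return $visits >= $maxVisits;
}

// At the bottom of the page template:
// if( shouldAbortPage( $ipFile, 14 ) ) exit;
```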

AlexK

4:58 pm on Dec 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Xuefer:

See msg#24 + 26.

I do not have the answer to your questions, other than: on Linux k2.4 and k2.6 it is blisteringly fast.

Hanu

7:57 pm on Dec 18, 2005 (gmt 0)

10+ Year Member



Request for assistance:

Does anyone know of a simple, swift preg-routine to test for IP-blocks of the style: 123.123.123.0/8?

<?php

function IsIPInNet( $ip, $net ) {
    if( preg_match( '/^([^\/]+)\/([^\/]+)$/', $net, $ms ) ) {
        $mask = 0xFFFFFFFF << ( 32 - $ms[2] );
        return ( ip2long( $ip ) & $mask ) == ( ip2long( $ms[1] ) & $mask );
    }
    return false;
}

echo IsIPInNet( '1.2.3.4', '1.2.3.0/24' ) ? "yes" : "no";
echo IsIPInNet( '2.2.3.4', '1.2.3.0/24' ) ? "yes" : "no";

?>

AlexK

4:04 am on Dec 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hanu:
Many thanks - and very nice, too. It has been added to the IP-whitelist code.

incrediBILL:
A snippet of optional code added to enact your suggestion. Many thanks! (you both have credits on the page)

New version uploaded just now. Small name-changes to some variables to help keep the namespace unique, but main code algorithm remains unchanged.

saltlakejohn

7:35 am on Dec 21, 2005 (gmt 0)



I downloaded Alex Kemp's masterful code to block bad bots (downloaded at <see url in msg 1>). It works super well, and I made it an auto-prepend to every page view on my sites.

One word of caution regarding prepended code:
Make sure it has NO BROWSER OUTPUT or it will interrupt SESSION creation in your following php programs.

In the bad-bot program be careful to erase or fill every blank line and make sure <?php is the very first line. Blank lines in php code leak through the php parser and go straight to the browser thus upsetting the rule that SESSION must be the first output of a php program (unless you want to hassle with ob_start and ob_flush, etc.). It's easier to just close up the blank lines, I think.

[edited by: jatar_k at 2:29 pm (utc) on Dec. 21, 2005]
[edit reason] removed url [/edit]

jatar_k

8:27 am on Dec 21, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Welcome to WebmasterWorld saltlakejohn,

>> Blank lines in php code leak through the php parser and go straight to the browser

unless I am misunderstanding you, this isn't true

blank lines inside <?php don't go to the browser at all

blank lines before <?php and after ?> do, though

<sorry for OT AlexK>

let's stay on topic folks, great routine, great support

thanks again AlexK

tomda

8:45 am on Dec 21, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Let's stay on topic folks

Just a suggestion:
May be we could create a "Blocking Badly Behaved Bots #4" with all the code (including the last update).

AlexK

10:20 am on Dec 21, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



tomda:
The link in the first message is continuously updated with all code changes.

saltlakejohn:
The point about spaces before/after prepend files is a good one, and will be added to the comments with a credit - thanks.

[edited by: jatar_k at 2:37 pm (utc) on Dec. 21, 2005]

[edited by: coopster at 2:49 am (utc) on Aug. 6, 2008]

inbound

9:36 pm on Jan 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This routine seems to be ideal for what I need. I love PHP, especially the community spirit that is often displayed.

On to my point:

Does anyone have a very high traffic site using this? If so, how well does it handle monthly traffic of say 5 to 10 million page impressions?

The reason I ask is I'm about to launch a Yellow Pages type site which will not have many users to start with, but it will be spider food. I'd like to make sure it's not going to cripple the server.

inbound

3:02 am on Jan 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A little question on the number of unique tracking files.

I can see that an integer value of 2 or 3 is suggested to allow 256 or 4096 file combinations (based on the number of characters taken from the md5 representation of the IP address).

Surely a value of 2 will bar 0.4 percent of all traffic for any (and every) banned md5 substring.

The problem is reduced with a setting of 3, but it still seems an inaccurate way to track the myriad of possible addresses.

I would suggest extending the length of the substring used, maybe to 5. The first 2 characters could select one of 256 directories, each holding up to 4096 files.

I'm suggesting a directory structure as I'm not sure that all systems would cope well with a million files in a single directory (and 65K in one is a strain for some).

Alternatively it could be set at 2+2, giving 65536 possibilities.

I will probably be implementing this on my server, I just thought it may be of interest to you all.

Currently it is feasible, although unlikely, that a sustained rogue-bot attack spread over a large set of IPs could knock out a considerable proportion of traffic for any site implementing the 256-possibility version of the script.
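To put rough numbers on the concern above (a quick sanity check of the fractions involved, nothing more):

```php
<?php
// Share of all IPs that land in any single tracking file, for each
// md5-substring length discussed in this thread.
foreach( array( 2, 3, 4, 5 ) as $len ) {
    $buckets = pow( 16, $len );
    printf( "%d hex chars: %7d buckets, %.4f%% of IPs per bucket\n",
            $len, $buckets, 100 / $buckets );
}
```

Two hex characters give 256 buckets, so banning one bucket catches roughly 0.39% of all possible addresses; at five characters the collateral damage per ban drops to about one in a million.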

AlexK

5:06 am on Jan 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



inbound:
how well does it handle monthly traffic of say 5 to 10 million page impressions?

My site is only ~a quarter million/month (a breeze at that), so we will have to wait for someone else to report - perhaps yourself?

suggest an extension of the length of substring used, maybe to 5 ... directory structure

Excellent suggestion. This will be added to comments for later work. Unfortunately I cannot tackle it at this instant - pressing work on my own site to complete.

inbound

2:43 am on Jan 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



the code change is very simple:

$ipLength = 3;  // integer; 2 = 256 files, 3 = 4,096 files
$ipLogFile = _B_DIRECTORY . _B_LOGFILE;
$ipFile = _B_DIRECTORY . substr( md5( $ipRemote ), -$ipLength );

Becomes:

$ipLength = 5;  // integer; 4 = 65,536 files spread across 256 folders, 5 = 1 million files spread across 256 folders
$ipLogFile = _B_DIRECTORY . _B_LOGFILE;
$ipFile = _B_DIRECTORY . substr( md5( $ipRemote ), -$ipLength, 2 ) . "/" . substr( md5( $ipRemote ), -( $ipLength - 2 ) );

Also, there is the small matter of setting up the 256 folders. The script below (doctored, as I didn't even bother with the constant) handles that; feel free to do a better job of it.

<?php
// Sets up the 256 two-character hex folders ("00" to "ff") for the
// tracking files. Note: the return values of mkdir() are not checked.
//
// In keeping with the previous script
define( '_B_DIRECTORY', '/full/path/on/server/to/block/dir/' );

$hex = '0123456789abcdef';
for( $i = 0; $i < 16; $i++ ) {      // second character
    for( $j = 0; $j < 16; $j++ ) {  // first character
        mkdir( _B_DIRECTORY . $hex[ $j ] . $hex[ $i ], 0775 );
    }
}
?>

AlexK

4:27 am on Jan 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've actually done it! (plus now uploaded).

[The phpBB2 folks have changed the auto_login_key algorithm, which has nicely screwed my interface-class. I am busy re-writing the sessions part of the wretched thing. Thank God it is not live yet (but should be). So, any excuse to get away from that. And I did not think that the code should take too long, and it did not.]

The code additions are pretty much the same as in the previous message, but the new code checks both $ipLength and whether a dir already exists.

Q for inbound:

mkdir( _B_DIRECTORY . "00", 0775);

I've used "0700" rather than "0775" (paranoia) - that should not cause a problem, since the same owner-process that created it will be read/writing the contents. Agree?

inbound

5:06 am on Jan 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



0700 is probably better, as not everyone will have the folders inaccessible from the web (although they should be).

I've been looking at IP restrictions for known problem countries, do you think that would be worthwhile?

I know that IP tables are a pain, I'm just talking about the bigger blocks that are allocated to bot-loving nations.

It makes sense for us in particular as we have 100% UK specific advertisers (being Local Search specialists) so traffic from China is of little interest to us.

AlexK

6:18 pm on Jan 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



inbound:
I've been looking at IP restrictions for known problem countries

Yup, that is how a little, useful routine becomes a behemoth!

Seriously, that's something that has also occurred to me. If you look at the comments under the routine (beginning at line #243) you will see a simple, reasonably quick algorithm for ID-ing any block of IP-numbers (thanks to Hanu), and line #254 has an even quicker means of ID-ing a single IP. By reversing the logic given in that routine (as written, it lets well-behaved bots through) and supplying a suitable error header(), you can easily block any IP or IP-block that you wish.
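To make the reversal concrete, here is a sketch of the blacklist form (IsIPInNet() is Hanu's helper from earlier in the thread, reproduced for completeness; the blocked ranges are invented for illustration, not a recommendation):

```php
<?php
// Hanu's CIDR test, from earlier in the thread
function IsIPInNet( $ip, $net ) {
    if( preg_match( '/^([^\/]+)\/([^\/]+)$/', $net, $ms ) ) {
        $mask = 0xFFFFFFFF << ( 32 - $ms[2] );
        return ( ip2long( $ip ) & $mask ) == ( ip2long( $ms[1] ) & $mask );
    }
    return false;
}

// Reversed logic: treat the list as a *blacklist* of ranges
function isBlockedIP( $ip, $nets ) {
    foreach( $nets as $net ) {
        if( IsIPInNet( $ip, $net ) ) {
            return true;
        }
    }
    return false;
}

// Illustrative ranges only
$blockedNets = array( '222.0.0.0/8', '61.128.0.0/10' );
$remote = isset( $_SERVER['REMOTE_ADDR'] ) ? $_SERVER['REMOTE_ADDR'] : '';
if( isBlockedIP( $remote, $blockedNets ) ) {
    header( 'HTTP/1.0 403 Forbidden' );   // suitable error header, then stop
    exit;
}
```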

Here is a more sophisticated method of obtaining the IP, which may be useful:

function _getUserIP() {// Obtain and encode user's IP
    // from phpBB2 (common.php + functions.php)
    global $HTTP_SERVER_VARS, $HTTP_ENV_VARS, $REMOTE_ADDR;
    if( getenv( 'HTTP_X_FORWARDED_FOR' ) != '' ) {
        $client_ip = ( !empty( $HTTP_SERVER_VARS[ 'REMOTE_ADDR' ] ))
            ? $HTTP_SERVER_VARS[ 'REMOTE_ADDR' ]
            : (( !empty( $HTTP_ENV_VARS[ 'REMOTE_ADDR' ] ))
                ? $HTTP_ENV_VARS[ 'REMOTE_ADDR' ]
                : $REMOTE_ADDR
            );
        $entries = explode( ',', getenv( 'HTTP_X_FORWARDED_FOR' ));
        reset( $entries );
        while( list( , $entry ) = each( $entries )) {
            $entry = trim( $entry );
            if( preg_match( "/^([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)/", $entry, $ip_list )) {
                $private_ip = array( '/^0\./', '/^127\.0\.0\.1/',
                                     '/^192\.168\..*/',
                                     '/^172\.((1[6-9])|(2[0-9])|(3[0-1]))\..*/',
                                     '/^10\..*/', '/^224\..*/',
                                     '/^240\..*/'
                );
                $found_ip = preg_replace( $private_ip, $client_ip, $ip_list[ 1 ] );
                if( $client_ip != $found_ip ) { $client_ip = $found_ip; break; }
            }
        }
    } else {
        $client_ip = ( !empty( $HTTP_SERVER_VARS[ 'REMOTE_ADDR' ] ))
            ? $HTTP_SERVER_VARS[ 'REMOTE_ADDR' ]
            : (( !empty( $HTTP_ENV_VARS[ 'REMOTE_ADDR' ] ))
                ? $HTTP_ENV_VARS[ 'REMOTE_ADDR' ]
                : $REMOTE_ADDR
            );
    }// if( getenv( 'HTTP_X_FORWARDED_FOR' ) != '' ) else
    $ip_sep = explode( '.', $client_ip );
    return sprintf( '%02x%02x%02x%02x', $ip_sep[ 0 ], $ip_sep[ 1 ], $ip_sep[ 2 ], $ip_sep[ 3 ] );
}// function BB2_Interface::_getUserIP()

It has occurred to me that a better means of doing that would be to integrate it into IPTables for the Firewall but, once again, any work on all of that is for the future. Much more pressing concerns for myself at this moment.

inbound

6:38 pm on Jan 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I agree that it's not the place for it. I had second thoughts after posting it.

Given that there are thousands (or tens of thousands) of ever-changing IP ranges to handle I think it was a daft suggestion, just one born out of 'it could do that too'.

incrediBILL

12:02 am on Jan 5, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Not sure it was completely daft as I blocked most of one country that seems to have no respect for copyright and appears to encourage sites to be downloaded and filtered for local consumption.

blend27

6:53 pm on Jan 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I am going to try to write/convert this in ColdFusion MX and see what happens - very interesting. I've been running a self-made script based on IP tables from MaxMind (GeoLite Country) and User-Agent strings, but not the frequency of visits - should be fun.

Blend27

AlexK

9:42 am on Jan 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



blend27:
I've been running a self made script based on IP tables from MaxMind(GeoLite Country)

Any code that you could send will be welcome. It is clearly better if such code makes use of facilities available on every computer (hence my consideration of iptables, even though that would be Linux-only), but all help is greatly welcomed.

If a fast- or slow-scraper is to be blocked for a period of time, it is much more efficient for that to be done at the firewall, probably with an error-message at the first block. It would require complete interface code for the firewall, I think, and the time that would take stops me approaching it at this moment. There is nothing to stop me acquiring some sample code for the future, however, and I would greatly appreciate any assistance with that.

Over to you.

AlexK

11:57 am on Jan 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How to add this code to HTML sites:

My current site is a .com; I have also got a .co.uk site which is both rather old (it is the original site) and full of .html pages. Checking through the Apache logs this morning I discovered the following:

217.160.75.202 - - [15/Jan/2006:12:03:40 +0000] "GET /robots.txt HTTP/1.1" 200 23 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:40 +0000] "GET / HTTP/1.1" 200 1417 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:40 +0000] "GET /toolbar.html HTTP/1.1" 200 2102 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:40 +0000] "GET /home.html HTTP/1.1" 200 11657 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:41 +0000] "GET / HTTP/1.1" 200 1417 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:41 +0000] "GET /mfc/index.html HTTP/1.1" 301 240 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:41 +0000] "GET /chips/index.html HTTP/1.1" 301 242 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:41 +0000] "GET /help/index.html HTTP/1.1" 301 241 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:41 +0000] "GET /modblame.html HTTP/1.1" 200 52104 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:03:43 +0000] "GET /help/56k.html HTTP/1.1" 200 71030 "-" "DTAAgent" In:- Out:-:-pct.
...
217.160.75.202 - - [15/Jan/2006:12:09:31 +0000] "GET /stats.php?page=chips/amb562.html&domain=1 HTTP/1.1" 200 2784 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:09:32 +0000] "GET /? HTTP/1.1" 200 1417 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:09:32 +0000] "GET /stats.php?page=help/form.html&silent=plain&domain=1 HTTP/1.1" 200 6 "-" "DTAAgent" In:- Out:-:-pct.
217.160.75.202 - - [15/Jan/2006:12:09:32 +0000] "GET /stats.php?page=help/form1.html&silent=plain&domain=1 HTTP/1.1" 200 6 "-" "DTAAgent" In:- Out:-:-pct.

(511 pages, at up to 8 pages/sec)

Damn them.

Not a lot of pages, but that will at a minimum slow the server down, plus I have to pay for that scraper's bandwidth (the .com site has 100,000+ pages on it, which is why my main focus on blocking scrapers is with that site).

Damn them!

The bot code was put into its own file (block_bad_bots.php) and the following added to the bottom of the VirtualHost directive for the .co.uk site in httpd.conf (Apache servers only):

# php directives
# 2006-01-19 added to block bad-bots -AK
#
# <IfModule mod_php4.c>
AddType application/x-httpd-php .html
php_value auto_prepend_file "/server/path/to/file/block_bad_bots.php"
# </IfModule>
# End of php directives

Then, a simple

# apachectl graceful
put it up and running.

Notes:
1. I know that PHP is running on my system, so the IfModule lines are commented out and are there for reference only.
2. It is incredibly difficult to trip the block with a normal browser. I had to add an

echo "Yes, I am here, dumbo";

to the code to prove to myself that it was working.
3. Making the HTML files be parsed by PHP removes the normal Apache Content-Negotiation (supplying a 304, etc). My next post will show how to fix that.

Umbra

1:18 pm on Jan 19, 2006 (gmt 0)

10+ Year Member



I don't know anything about PHP, and I tried the script, but it didn't seem to work. It throws an error when I run it directly from the browser:

Warning: touch(): Unable to create file _B_DIRECTORYccd because Permission denied in [script path here] on line 114

I know this came up before here [webmasterworld.com] but the suggestions there don't seem to fix the problem.
- The defined directory definitely exists
- The directory has been chmod'ed 777
- Removing the $fileATime from line 114 doesn't remove the error

Any ideas? The only alternative is to use the old script, but that seems to have fallen way behind since 2003.

AlexK

2:17 pm on Jan 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Umbra:
Warning: touch(): Unable to create file _B_DIRECTORYccd

(best guess) You have not defined _B_DIRECTORY, so it is trying to create a file "_B_DIRECTORYccd" rather than a file "ccd" within the directory defined by "_B_DIRECTORY".

Change the error-reporting to show Notices, and you will get an Error-Notice about that if so.

The other obvious thing is to echo() out the value of $ipFile once it has been set, and then test on the server whether you can touch it.
This 88 message thread spans 3 pages.