Forum Moderators: coopster
The original revised-routine works fine--in fact, so far since June, there have been 677 attempts by 20 different people to rip off my site, all blocked by the routine. However, I have noticed a mistake in the long-term scraper code, which should read:
if( file_exists( $ipFile )) {
    $fileATime = fileatime( $ipFile );
    $fileMTime = filemtime( $ipFile );
    $fileATime++;
    $visits   = $fileATime - $fileMTime;
    $duration = $time - $fileMTime;    // secs
    // foll test also keeps tracking going to catch slow scrapers
    if( $duration > 86400 ) {          // 24 hours; start over
        $fileMTime = $fileATime = $time;
        $duration  = $visits = 1;
    } else if( $duration < 1 ) $duration = 1;
    // test for slow scrapers
    if( $visits >= $bTotVisit ) {
The modified routine works extremely well. In fact, so well that you may need to carefully consider whether to use it. You see, the scrapers that it caught were the Adsense bot and (just a handful of times) the Yahoo! Slurp! bot.
How the corrected routine works:
Requests from a particular IP are counted, and reset every 24 hours. If the count exceeds $bTotVisit then the request is blocked with a 503 Service Temporarily Unavailable and will stay that way until the next reset.
Fast scrapers (the more-normal culprits) are blocked by checking whether their hit-rate exceeds $bMaxVisit / $bInterval. If so, the count period is reset to keep them out for at least $bPenalty secs.
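For readers who want to see the shape of that fast-scraper test: the comparison itself is quoted verbatim later in this thread, but everything else below (the way the timestamps are pushed apart to enforce $bPenalty, and the 503 headers) is only my guess at a minimal implementation, not the actual routine:
// sketch only: fast-scraper test, tripped when the hit-rate exceeds the allowed ratio
if(( $visits / $duration ) > ( $bMaxVisit / $bInterval )) {
    // assumed lock-out: push M-time back and A-time forward so the ratio
    // stays tripped for at least $bPenalty secs after the last request
    $fileMTime = $time - $bInterval;
    $fileATime = $fileMTime + $bPenalty;
    touch( $ipFile, $fileMTime, $fileATime );
    header( 'HTTP/1.1 503 Service Temporarily Unavailable' );
    header( 'Retry-After: ' . $bPenalty );
    exit;
}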
This is what the block-logfile looks like for slow scrapers (notice that the Adsense bot makes a request every 10 secs). In normal use, this wretched bot is prone to making up to 6 requests for the same page in short succession.
Blocked IPs:
.
* 66.249.65.19 [ crawl-66-249-65-19.googlebot.com ] 1000 line(s)
.
1000 total lines in log-file.
Log Line                                                                  Lines
66.249.65.19 13/10/2005 21:38:25 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:38:15 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:38:05 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:37:55 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:37:44 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:37:34 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:37:24 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:37:13 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:34:10 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:34:00 Mediapartners-Google/2.1 (slow scraper) 1
The first line above was the last block of that particular period (the count was then given its 24-hour reset). This behaviour made no difference to my Adsense revenue for the blocked period; in fact, the click-through rate was exceptionally good.
The Adsense-bot and the Yahoo Slurp!-bot are hitting my site like absolute maniacs. For the first 13 days of October:
humans: 64,877 pages
robots: 64,901 pages
Inktomi Slurp: 22,907 pages + 1,116 hits on robots.txt + a handful of 503s.
Google AdSense: 8,019 pages + 28 hits on robots.txt + 6,690 "503 Server-busy"
It gets worse if we consider bandwidth.
The Yahoo Slurp! bot uses the If-Modified-Since request header [webmasterworld.com], and thus gets 304 Not Modified responses for many of its requests. It can also handle compressed pages, which reduces the bandwidth by up to 80%.
The Adsense bot does not use the If-Modified-Since request header (why not?) but, thankfully, does handle compressed pages.
The MSNBot, just for completeness, makes use of neither and, consequently, takes more bandwidth than all the other bots put together.
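(Side note, separate from the blocking routine: for anyone unfamiliar with the mechanism, this is a rough sketch of how a PHP page can honour If-Modified-Since and hand back a cheap 304. Using the script's own mtime for $lastModified is only an assumption - use whatever freshness timestamp suits your pages.)
$lastModified = filemtime( __FILE__ );    // assumption: page freshness == script mtime
header( 'Last-Modified: ' . gmdate( 'D, d M Y H:i:s', $lastModified ) . ' GMT' );
if( isset( $_SERVER['HTTP_IF_MODIFIED_SINCE'] )
        && strtotime( $_SERVER['HTTP_IF_MODIFIED_SINCE'] ) >= $lastModified ) {
    header( 'HTTP/1.1 304 Not Modified' );    // nothing changed: tiny response, no body
    exit;
}
ob_start( 'ob_gzhandler' );    // compressed output - the other bandwidth saver mentioned above
// ... generate the page as normal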
I am prepared to pay the price of this blocking, since I am sick and tired of being hit so hard by the search-bots, yet get so little in return. You may not be.
I'm running this beautiful script by adding the following to my .htaccess:
<IfModule mod_php4.c>
php_value auto_prepend_file "/path/to/file/block_bad.php"
</IfModule>
As I'm going to update PHP to v5.0.4 in the next few days (loaded module: mod_php5), is it enough to change <IfModule mod_php4.c> to <IfModule mod_php5.c> in my .htaccess?!?
Also please: is this ultimate version of the script compatible with PHP v5.0.4?
I'm actually using the previous one [webmasterworld.com...]
Thanks in advance,
tito
Also please: is this ultimate version of the script compatible with PHP v5.0.4?
The main difference between this version and the previous routine is that the well-behaved-bots whitelist has been dropped: every IP is now judged on its behaviour alone.
For your reference, here is the well-behaved bots exclusion code that I previously employed:
$remote = $_SERVER[ 'REMOTE_ADDR' ];
if(( substr( $remote, 0, 10 ) == '66.249.64.' ) or    // Google has blocks 64.233.160.0 - 64.233.191.255
   ( substr( $remote, 0, 10 ) == '66.249.65.' ) or    // Google has blocks 66.249.64.0 - 66.249.95.255
   ( substr( $remote, 0, 10 ) == '66.249.66.' ) or    // Google has blocks 72.14.192.0 - 72.14.207.255
   ( substr( $remote, 0, 9 )  == '216.239.3' )  or    // Google has blocks 216.239.32.0 - 216.239.63.255
   ( substr( $remote, 0, 9 )  == '216.239.4' )  or
   ( substr( $remote, 0, 9 )  == '216.239.5' )  or
   ( substr( $remote, 0, 10 ) == '65.54.188.' ) or    // MS has blocks 65.52.0.0 - 65.55.255.255
   ( substr( $remote, 0, 10 ) == '207.46.98.' ) or    // MS has blocks 207.46.0.0 - 207.46.255.255
   ( substr( $remote, 0, 13 ) == '66.194.55.242' )    // Ocelli
  ) {
    // let well-behaved bots through
} else {
    // block routine
}
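(If anyone wants to generalise those prefix tests, a CIDR-style check is one option. The sketch below is only illustrative: ipInCidr is a made-up helper, and the ranges are simply lifted from the comments above, so verify them before relying on them.)
// sketch: whitelist test using CIDR ranges instead of string prefixes
function ipInCidr( $ip, $cidr ) {
    list( $net, $bits ) = explode( '/', $cidr );
    $mask = -1 << ( 32 - ( int ) $bits );
    return ( ip2long( $ip ) & $mask ) == ( ip2long( $net ) & $mask );
}
$goodBotRanges = array(    // from the comments above - re-check before use
    '66.249.64.0/19', '216.239.32.0/19', '65.52.0.0/14', '207.46.0.0/16' );
$isGoodBot = false;
foreach( $goodBotRanges as $range ) {
    if( ipInCidr( $_SERVER['REMOTE_ADDR'], $range )) { $isGoodBot = true; break; }
}
// if( $isGoodBot ) let it through, else run the block routine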
$bTotVisit = 0 switches the slow-scraper code off.
Google Adsense + Slurp! seem to have re-programmed themselves, and the former problems no longer occur. MSNBot has stepped into the breach and gets blocked every 24 hours (for the last 7 days or more). That bot takes way too much bandwidth so, once again, I am happy with that.
I would be interested to hear any suggestions on making the block-value dynamic, rather than fixed.
No problem there. Should that document include the php start and end tags? I assumed the answer would be yes.
My log files are all zero byte files. Should I assume that no activity will be logged until the triggers are reached? And what cleans up those files after 24 hours? It looks like I need to write a cron job if I want to keep a clean directory.
I didn't look over the code closely enough to answer my own questions, opting instead to trust the code I found here and make the necessary changes for my setup.
grandpa:
Should that document include the php start and end tags?
Yes - the auto-prepended file is parsed as an ordinary PHP file, so it needs its own <?php ... ?> tags.
Should I assume that no activity will be logged until the triggers are reached?
And what cleans up those files after 24 hours? It looks like I need to write a cron job if I want to keep a clean directory.
The A-time (access time) and M-time (modification time) are used to track activity with each file. It is these that are reset, either when the 24-hour limit has passed (both) or when there is continual activity (just one). This is the reason that the routine is so quick and uses so few resources.
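(A rough sketch of that bookkeeping, as I understand it - not the exact code: M-time marks the start of the counting period and A-time is bumped by one second per visit, so both the visit count and the elapsed duration can be read straight back out of the two timestamps.)
// sketch of the A-time/M-time counter described above
if( !file_exists( $ipFile )) {
    touch( $ipFile, $time, $time + 1 );          // new period: M-time = now, one visit recorded
} else {
    $fileMTime = filemtime( $ipFile );
    $fileATime = fileatime( $ipFile ) + 1;       // +1 second == +1 visit
    if(( $time - $fileMTime ) > 86400 ) {        // 24-hr rollover: start the period again
        $fileMTime = $time;
        $fileATime = $time + 1;
    }
    touch( $ipFile, $fileMTime, $fileATime );
    $visits   = $fileATime - $fileMTime;
    $duration = max( 1, $time - $fileMTime );
}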
It may be a good idea to set up a little batch job to give yourself a simple way to delete the log-blocks file, should you ever want to (the routine auto-creates the log-blocks file if it does not exist, and rolls over when the line-limit is reached, so even that is not strictly necessary). A way to read it is always useful, though.
If otherwise-welcome 'bots are overrunning your site with requests and tripping the script, look into using the Crawl-delay directive in robots.txt. Yahoo, MSN, and Ask Jeeves/Teoma all support this directive, possibly others as well. Check the specific search engines' robot information page for up-to-date info.
Example robots.txt entry:
User-agent: msnbot
Crawl-delay: 90
Disallow: /cgi-bin
This may help with some of the problems described above.
Jim
I notice that both MSN & Yahoo specify it as "Crawl-delay" whilst Ask Jeeves (Teoma) has it as "Crawl-Delay". Let's hope that this will not cause issues although, to be fair, Teoma has never been an issue on my site, only the first two.
Continuing to be OT (sorry), I'd like to see a robots protocol 2.0 come out that would make some of the new directives part of a standard. Things like crawl-delay, noarchive and URL wildcards. Based on this thread, for example, Google does not recognize crawl-delay (but implements wildcards and noarchive). MSN has crawl-delay and noarchive, but not wildcards.
URL wildcards.
You're fine and your * for robots *is* part of the protocol and, in fact, Google uses it in their own robots.txt [google.com]. Google, however, has an extension that allows
Disallow: *.gif
That extension is not recognized, as far as I can see, by MSN. According to [google.com...]
Additionally, Google has introduced increased flexibility to the robots.txt file standard through the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a name....
To remove all files of a specific file type (for example, .gif), you'd use the following robots.txt entry:
User-agent: Googlebot
Disallow: /*.gif$
To remove dynamically generated pages, you'd use this robots.txt entry:
User-agent: Googlebot
Disallow: /*?
It's that extension that I don't believe is observed by MSN or others.
Dayo_UK, in the middle of one of the vast Google Jagger-update threads, reported how the G-Mozilla bot had brought his DB crashing to the ground with 20 page-requests/second across an extended period. This routine could have saved his site from going down.
Yesterday, I got a brand-new Japanese bot taking a page every 11 seconds for 18 hours. Here are the first few accesses:
133.9.238.95 - - [05/Nov/2005:10:47:03 +0000] "GET /robots.txt HTTP/1.1" 200 39 "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:04 +0000] "HEAD / HTTP/1.1" 200 - "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:04 +0000] "HEAD /index.html HTTP/1.1" 301 - "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:05 +0000] "HEAD /index.htm HTTP/1.1" 301 - "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:06 +0000] "HEAD /Default.htm HTTP/1.1" 404 - "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:16 +0000] "GET / HTTP/1.1" 200 38623 "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:28 +0000] "GET /favicon.ico HTTP/1.1" 200 1406 "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:38 +0000] "GET /style.css HTTP/1.1" 200 10325 "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:49 +0000] "GET /mfcs.php HTTP/1.1" 200 40931 "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
Found you while searching for data on the e-SocietyRobot ... it's crawled my site today. It did follow robots.txt, but it's still not polite to crawl ALL of a site, uninterrupted, at a rate of a page every 11 seconds.
I put the crawl-delay into my robots.txt file and will be adding your botblocking routine within the next few minutes.
Many thanks!
AlexK, in your newest version (2005-11-20), thanks for pointing out "// It is best if _B_DIRECTORY is not web-accessible", as mine was.
I'm not a programmer and was wondering about the difference from an earlier version (Oct., I think) which had the line $bTotBlock = 3600; // secs; period to block long-duration scrapers, but the 2005-11-20 version does not use $bTotBlock?
thanks.
So I used my older version and changed $bTotVisit to 1000 from 777. Could this be related to my "$bTotBlock" question above? Line 56 is:
echo "$visits visits from your IP-Address within the last ". (( int ) $duration / 3600 ) ." hours. Please wait ". (( int ) ( 86400 - $duration ) / 3600 ) ." hours before retrying.</p></body></html>";
<edit>
FYI I'm talking about the 2005-11-20 version
</edit>
dbar:
ops just uploaded and had an error
I've just done a file comparison on my live code and the code snippet, and there are no differences. So, my apologies, dbar, but you have made an oops when you copied/edited the new code.
was wondering about the difference from an earlier version ... the 2005-11-20 version does not use $bTotBlock?
The previous code blocked 2 slow scrapers (both "good" bots) and I then discovered that each bot's programmed behaviour was to make another GET request every 10 (or however many) secs. That was bad enough (8,400 wasted requests), but the previous behaviour was also to re-program $fileATime and $fileMTime in such a way that the slow scrapers would effectively have been blocked forever, or until $bTotBlock secs after they stopped.
So, I re-thought that bit out, and the new code now allows $bTotVisit visits from any IP within a 24 hour period. It does this by re-setting $fileMTime to the current time every 24 hours.
That works fine on my site - it is fairly busy (about 10,000 pages daily), and just one bot gets blocked occasionally:
Blocked IPs:
* 64.124.122.228 [ 64.124.122.228.gw.xigs.net ] 1000 line(s)
1000 total lines in log-file.
.
64.124.122.228 21/11/2005 02:59:26 RufusBot (Rufus Web Miner; [64.124.122.252...] (slow scraper) 1
64.124.122.228 21/11/2005 02:58:38 RufusBot (Rufus Web Miner; [64.124.122.252...] (slow scraper) 1
64.124.122.228 21/11/2005 02:57:50 RufusBot (Rufus Web Miner; [64.124.122.252...] (slow scraper) 1
I am content with that. You will likely want to adjust $bTotVisit to a suitable figure for your site. It would perhaps also be a good idea to allow the 24-hour period to be a variable; I shall change that soon-ish.
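(For anyone who wants to experiment before that change appears, the edit would presumably be as small as the following - $bPeriod is my own name, not one from the script:)
$bPeriod = 86400;    // secs; long-term counting period (currently hard-coded as 24 hours)
// ... rest of the routine unchanged ...
if( $duration > $bPeriod ) {    // period over; start the count again
    $fileMTime = $fileATime = $time;
    $duration  = $visits = 1;
}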
I've set $bInterval = 20 and $bMaxVisit = 4 and keep hitting the refresh button and jumping to different pages, but I'm not getting banned. Shouldn't I be restricted to 4 in 20 seconds with this configuration? Thanks.
dbar, two things to make sure of when testing the script.
The first is that you are refreshing your pages and not using cached ones - use Ctrl+F5 to refresh the page. Second, delete your ipfile at the start of each test. If you leave the ipfile sitting for a while, then the longer it is left, the less likely it is to meet the (( $visits / $duration ) > ( $bMaxVisit / $bInterval )) test.
I did wonder initially why I could not get the fast-scraper part to trip, until I looked at what was actually happening with the $fileATime & $fileMTime values during a fast scrape. I realised the reason it was not tripping was that I had left the ipfile for too long. I thought this might be a problem with the script if someone visited your site first and then came back later to scrape it, but decided that they would still get caught by the $bTotVisit page limit, which for me is acceptable.
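(To put numbers on that: suppose the ipfile already existed and was last reset ten minutes before the refresh test. The figures below are only illustrative, and the variable names are just for the worked example.)
// illustrative arithmetic, not part of the script
$visits   = 5 + 20;                      // 5 earlier visits plus 20 rapid refreshes just now
$duration = 600;                         // the counting period started ten minutes ago
$hitRate  = $visits / $duration;         // = 0.042 visits/sec
$tripRate = $bMaxVisit / $bInterval;     // = 4 / 20 = 0.2 visits/sec with the values quoted above
// $hitRate < $tripRate, so the fast-scraper test never fires until the ipfile
// is deleted or the 24-hour rollover resets the period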
(the routine updates $ipFile for each visit) ... In practice, most accesses to your site--unless it is very busy--will cause a rollover of the dates (due to $time - $fileMTime > 86400), causing $fileATime = $fileMTime = $time to be tripped, and thus a start-over.
The default trip-rate is 2 / sec ($bMaxVisit / $bInterval = 14 / 7), which means that, in the absolute worst-case scenario, a fast scraper will get 999 pages in 500 secs. That is a lot, but a reasonable compromise without using some form of whitelist-exclusions (my attitude is a-scraper-is-a-scraper whether they are wearing an admittance-badge or not; it is their behaviour which determines their status).
PS: This thread is very current to events on WebmasterWorld [webmasterworld.com] (the link, and following pages, have some excellent alternative suggestions for those with ultra-busy sites).
$bInterval = 3;    // secs; check interval (best < 30 secs)
$bMaxVisit = 2;    // Maximum visits allowed within $bInterval
After the ban was over I could refresh pages even 20 times a second, and the routine to stop fast bots does not seem to work any more. So the routine (fast bots) works once and then not again.
I wonder why, and could something be done so that this does not happen?
Thanks
So the routine (fast bots) works once and then not again
With your values ($bInterval= 3, $bMaxVisit= 2), the trip ratio is 1.5. After being tripped, $ipFile would be set to:
$fileMTime = refTime - 3
$fileATime = refTime + 97
$fileMTime = refTime - 3
$fileATime = refTime + 98
$visits = 101
$duration = 3
Visit 1:
$fileMTime = refTime - 3
$fileATime = refTime + 98
$visits = 101
$duration = 68
Visit 2:
$fileMTime = refTime - 3
$fileATime = refTime + 99
$visits = 102
$duration = 68
Visit 3:
$fileMTime = refTime - 3
$fileATime = refTime + 100
$visits = 103
$duration = 68
So, after being tripped, it would take just 3 more visits (in the same second) to trip the block again. I have no idea why you cannot do this on your system, although I would advise you that your value for $bMaxVisit is unreasonably low. $bInterval = 18 and $bMaxVisit = 12 might be better, else you may be triggering blocks continually.
Do you have the latest code? It is marked "2005-11-20" at the top.
Finally, the max any scraper can take is $bTotVisit (default 1,000 pages) in any 24 hour period, slow or fast.