Forum Moderators: coopster


Blocking Badly Behaved Bots #3

Small correction to a previously-posted routine

         

AlexK

1:22 pm on Oct 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, this is a correction to an update [webmasterworld.com] (which itself had several corrections [webmasterworld.com]) to a posting [webmasterworld.com]. Sigh and double-sigh.

The original revised-routine works fine: in fact, since June there have been 677 attempts by 20 different people to rip off my site, all blocked by the routine. However, I have noticed a mistake in the long-term scraper code. The corrected block should read:

if( file_exists( $ipFile )) {
	$fileATime = fileatime( $ipFile );
	$fileMTime = filemtime( $ipFile );
	$fileATime++;
	$visits   = $fileATime - $fileMTime;
	$duration = $time - $fileMTime;	// secs
	// foll test also keeps tracking going, to catch slow scrapers
	if( $duration > 86400 ) {	// 24 hours; start over
		$fileMTime = $fileATime = $time;
		$duration  = $visits = 1;
	} else if( $duration < 1 ) $duration = 1;
	// test for slow scrapers
	if( $visits >= $bTotVisit ) {

Rather than re-posting the whole (revised) routine all over again for just a one-line change, the entire bot-block routine can be downloaded at this link [download.modem-help.co.uk]. I warmly commend it to you.

[edited by: coopster at 1:57 pm (utc) on Aug. 7, 2008]

Umbra

3:07 pm on Jan 19, 2006 (gmt 0)

10+ Year Member



Thanks AlexK,

Like I said, I'm a complete newbie to PHP. I discovered the function error_reporting(E_ALL); (maybe you should include this in the script for other PHP newbies?) and that clued me in.
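For reference, this is all it is, placed at the very top of the script while setting up (the display_errors line is my addition here, so errors actually show in the browser; remove both once things work):

<?php
// show every notice and warning while setting the script up
error_reporting( E_ALL );
ini_set( 'display_errors', '1' );	// assumes display_errors is off in php.ini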

I didn't realize that in PHP the slash-asterisk starts a multi-line comment. I simply edited your example '/full/path/on/server/to/block/dir/' on that exact line and assumed the Constants were now defined. I now realize I had to copy-and-paste the Constants below the line "Start blocking badly-behaved bots : top code".

I hope this will help other PHP newbies.

AlexK

4:01 pm on Jan 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've only found one thing that really gets me to learn, and that is to make a mistake.

Glad you got it sorted, Umbra.

AlexK

4:41 pm on Jan 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



2nd part of adding the bot-block-code to HTML pages:

The previous steps work fine in blocking fast- and slow-scrapers, but have the downside that, since the HTML pages are now actually PHP pages, the default Apache Content-Negotiation behaviour of supplying a 304 for not-changed pages, etc, is gone. This post shows how to fix that.

First, you will need a PHP-Class which can be found here [webmasterworld.com] (it is called "Conteg.include", written by me, and allows content-negotiation to be easily added to PHP-pages).

Next, add the following two lines (the ob_start() and require_once() calls) to the bottom of the block_bad_bots.php file:

// -------------- Stop blocking badly-behaved bots : top code --------
ob_start();	// (new) buffer the page so Conteg can process it
require_once( '/server/path/to/file/Conteg.include' );	// (new)
?>

Next, create the following block_bad_bots_bot.php file:

<?php
new Conteg( array(
	'modified'          => filemtime( ${$_SERVER_ARRAY}[ 'PATH_TRANSLATED' ]),
	'use_etag'          => TRUE,
	'use_apache_notes'  => TRUE
));
?>
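For anyone puzzled by the ${$_SERVER_ARRAY} syntax: the main bot-block file sets $_SERVER_ARRAY to the name of the server-variables array appropriate to the PHP version, so the same code runs on older and newer PHP. A sketch of the idea (the exact line in block_bad_bots.php may differ):

$_SERVER_ARRAY = ( phpversion() >= '4.1.0' )
	? '_SERVER'	// superglobal, PHP 4.1.0+
	: 'HTTP_SERVER_VARS';	// older PHP 4
// ${$_SERVER_ARRAY}[ 'PATH_TRANSLATED' ] then resolves on either version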

Finally, after uploading both files, add the following line to httpd.conf (or .htaccess) (Apache servers only):

# php directives
# 2006-01-19 added to block bad-bots -AK
#
# <IfModule mod_php4.c>
AddType application/x-httpd-php .html
php_value auto_prepend_file "/server/path/to/file/block_bad_bots.php"
php_value auto_append_file "/server/path/to/file/block_bad_bots_bot.php"
# </IfModule>
# End of php directives
(the added line is the php_value auto_append_file directive; bold does not survive on this board)

...and restart Apache as before.

Notes:
1 Both block_bad_bots.php and block_bad_bots_bot.php should live outside the web-directory (inaccessible from the web) if at all possible.
2 'PATH_TRANSLATED' worked on my server, but the PHP Manual [php.net] does state that some Apache 2 users may need to put

AcceptPathInfo On

inside httpd.conf (to define PATH_INFO). Probably best to check first.
3 The 'use_apache_notes' => TRUE line within the parameter-array is to allow compression stats to be reported within the access-log. It is not essential. If you want to use it, the following will need to be added to httpd.conf (Apache2):

<IfModule mod_deflate.c>
AddOutputFilterByType DEFLATE text/html text/plain text/css text/xml application/xml application/xhtml+xml
#
# accommodate Netscape 4.x (next line) + 4.06-4.08 (2nd line)
BrowserMatch ^Mozilla/4 gzip-only-text/html
BrowserMatch ^Mozilla/4\.0[678] no-gzip
#
# bug in mod_setenvif up to Apache 2.0.48; foll regex OK
BrowserMatch \bMSI[E] !no-gzip !gzip-only-text/html
#
# 2005-09-06 Don't compress images or add UA-Vary
# the 'no-gzip' should be superfluous
SetEnvIfNoCase Request_URI \.(?:gif|jpe?g|png)$ no-gzip dont-vary
#
# Make sure proxies don't deliver the wrong content
<IfModule mod_headers.c>
Header append Vary User-Agent env=!dont-vary
</IfModule>
#
# 2005-09-06 Info to put deflate stats into logs
DeflateFilterNote Input instream
DeflateFilterNote Output outstream
DeflateFilterNote Ratio ratio
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" In:%{instream}n Out:%{outstream}n:%{ratio}npct." deflate
</IfModule>
(the important lines for the stats are the DeflateFilterNote and LogFormat lines at the end; note that this board converts pipe-chars to broken-bar (¦) chars, so make sure the SetEnvIfNoCase regex uses real pipes)

So, am I good to you, or what?

AlexK

5:04 pm on Jan 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Umbra:
Your suggestion added (though not yet uploaded).

StupidScript

5:42 pm on Jan 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So, am I good to you, or what?

Oh, Alex ... you're very very good to ALL of us! ;)

Umbra

7:17 pm on Jan 19, 2006 (gmt 0)

10+ Year Member



That update works for me!

P.S. Another webpage suggests adding "XBitHack full" to .htaccess, although I haven't got that to work yet as an alternative to Conteg.include

P.P.S. Any danger that search engines may detect this as a form of cloaking, or does Conteg.include make these pages absolutely indistinguishable from any other static HTML page?

AlexK

8:17 pm on Jan 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Umbra:
Any danger that search engines may detect this as a form of cloaking

You could actually leave out the 2nd part of the post, and still no SE would find it to be cloaked-content.

Remember, Apache implements Content-Negotiation only because the config-statements within httpd.conf tell it to. If those statements are missing, then so is the Content-Negotiation. The point is that the default httpd.conf supplied with Apache already has (most of) Content-Negotiation implemented.

does Conteg.include make these pages absolutely indistinguishable from any other static HTML page?

Almost.

The array parameters for Conteg as shown above implement weak ETags, whereas Apache would (by default) implement strong ETags.

The default for the Class is weak ETags, since both the Class and weak ETags were designed with dynamic pages in mind. My own site is a good example: a central body of content (which changes rarely) surrounded by site-relevant, but not content-relevant, material which changes often (page-stats, etc). Strong ETags are for pages where, if the ETag is the same, the entire page is byte-for-byte the same. A weak ETag says only that the principal content is the same, though other, less important, content may have changed.

The principal difference between the two is that strong ETags (but not weak) also allow the use of Range & If-Range, which can give further bandwidth savings. One further consideration is CPU-load and speed: the Class implements strong ETags by MD5-ing the entire page content, and that can take some time with extremely large pages.

I actually could have used strong ETags on my site, since the HTML pages do not have any dynamic content within them (look at the documentation within the Class, and the header for setup(), if you wish to use this; it is not difficult, and only means including some extra array elements within the Class declaration - the rest is then handled by the Class). Within this forum, however, I thought it best to keep to weak ETags, since that will safely handle all cases.

If you want to use Ranges, add:

array(
	'weak_etag'         => FALSE,
	'use_accept_ranges' => TRUE
)
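Merged into the declaration shown earlier, that would look something like this (a sketch; check the documentation within the Class for the exact parameter spelling):

new Conteg( array(
	'modified'          => filemtime( ${$_SERVER_ARRAY}[ 'PATH_TRANSLATED' ]),
	'use_etag'          => TRUE,
	'weak_etag'         => FALSE,	// strong ETags: byte-for-byte identical pages
	'use_accept_ranges' => TRUE,	// allows Range / If-Range requests
	'use_apache_notes'  => TRUE
));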

It is also worth mentioning that the following (defaults) need checking for your pages:

array(
	'charset' => 'ISO-8859-1',	// check is accurate
	'lang'    => 'en'		// check is accurate
)

Umbra

8:33 pm on Jan 19, 2006 (gmt 0)

10+ Year Member



Yes, I did notice the ETag format was different. According to the server header checker (although every header checker seems to show slight differences?) there were also some additional headers:
X-Powered-By: PHP/4.4.0
Expires: [date]
Vary: User-Agent,Accept-Encoding
Content-Encoding: gzip
X-Content-Encoded-By: class.Conteg.0.10

In our case, static HTML pages show the header "Accept-Ranges: bytes", which doesn't appear with Conteg.include.

Do you have any input on this? Many thanks for your time!

AlexK

8:53 pm on Jan 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Umbra:
there were also some additional headers:
X-Powered-By: PHP/4.4.0

That one comes from PHP itself (the expose_php setting); the rest are from the Class.

Vary: User-Agent,Accept-Encoding
Content-Encoding: gzip

You get (dynamic, load-balanced) Compression by default. Use 'use_accept_encode' => FALSE if you do not want it.

adding a default charset may not be a good idea

Not adding a charset is an even worse idea! Use 'charset' => '' if you do not want it.

All other questions are already answered in msg #67 ("Accept-Ranges: bytes" is switched on with 'use_accept_ranges' => TRUE).

Xuefer

11:49 am on Jan 20, 2006 (gmt 0)

10+ Year Member



I have modified the script slightly to store $time + $bantime as the "ban" mark; $fileATime (hits) is reset to 0. On each subsequent request it checks whether the recorded time is > $time, to know if the IP is already banned.
What do you think of it?

AlexK

12:23 pm on Jan 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Xuefer:
Post a snippet of changed code; highlight the specific changes in bold, perhaps.

I am always interested in improvements, as long as the whole thing does not threaten to become unmanageable.

Xuefer

2:33 am on Jan 21, 2006 (gmt 0)

10+ Year Member



here we go:

diff -u bot-block.php.old bot-block.php
--- bot-block.php.old 2006-01-21 10:20:16.000000000 +0800
+++ bot-block.php 2006-01-21 10:26:51.000000000 +0800
@@ -58,6 +58,19 @@
$fileATime++;
$visits = $fileATime - $fileMTime;
$duration = $time - $fileMTime; // secs
+ if ($duration < 0) {
+ $bantime = -$duration;
+ // keep banning if the client ignores the 503/error message
+ $fileMTime = $time + max($bantime, $bPenalty); // ban-expiry must stay in the future
+ touch($ipFile, $fileMTime, $fileATime);
+
+ header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
+ header( 'Connection: close' );
+ header( 'Content-Type: text/html' );
+ echo "<html><body><p><b>Server under load</b><br />";
+ echo "please do NOT try again within $bantime seconds.</p></body></html>";
+ exit;
+ }
if( $duration > $bStartOver ) { // default 24 hours; restart tracking
$fileMTime = $fileATime = $time;
$duration = $visits = 1;
@@ -76,13 +89,14 @@
echo "<html><body><p><b>Server under undue load</b><br />";
echo "$visits visits from your IP-Address within the last ". (( int ) $duration / 3600 ) ." hours. Please wait ". (( int ) ( $bStartOver - $duration ) / 3600 ) ." hours before retrying.</p></body></html>";
$bLogLine = "$ipRemote ". date( 'd/m/Y H:i:s' ) ." $useragent (slow scraper)\n";
+ $bantime = $bStartOver - $duration;
+ $fileMTime = $time + $bantime;
+ $fileATime = $time;
// test for fast scrapers
} elseif(
( $visits >= $bMaxVisit ) and
(( $visits / $duration ) > ( $bMaxVisit / $bInterval ))
) {
- $fileMTime = $time = $time - $bInterval;
- $fileATime = $time + $bMaxVisit + (( $bMaxVisit * $bPenalty ) / $bInterval );
$useragent = ( isset( ${$_SERVER_ARRAY}[ 'HTTP_USER_AGENT' ]))
? ${$_SERVER_ARRAY}[ 'HTTP_USER_AGENT' ]
: '<unknown user agent>';
@@ -92,6 +106,8 @@
echo "<html><body><p><b>Server under heavy load</b><br />";
echo "$visits visits from your IP-Address within the last $duration secs. Please wait at least $bPenalty secs before retrying.</p></body></html>";
$bLogLine = "$ipRemote ". date( 'd/m/Y H:i:s' ) ." $useragent (fast scraper)\n";
+ $fileMTime = $time + $bPenalty;
+ $fileATime = $time;
}
// log badly-behaved bots, then nuke 'em
if( $bLogLine ) {

It isn't much of an improvement, and it's true that it loses some info, but this way a bot that keeps crawling without noticing the error message can be banned forever, without ever hitting $bStartOver (if someone uses a much lower value for it).
On the coding side, it's easier to decide how many seconds to ban.

Btw, I don't really like the names $fileMTime and $fileATime; why not name them something like $startTime and $hits, and just convert when calling stat()/touch()?

AlexK

2:24 pm on Jan 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the diff file, Xuefer.

+ if ($duration < 0) {
+ $bantime = -$duration;
+ // keep banning if the client ignore 503/errormessage

There is a problem here.

From experience, slow-scrapers just slow down (they all seem programmed, once blocked, to re-attempt a page-load once every 10 secs/1 minute/whatever, and will keep that up forever, or until $bStartOver is hit). Fast-scrapers, of course, pay no attention to either the 503 header or the HTML error-msg. In all the time running these scripts (approaching 3 years) there has been just one occasion on which a human has tripped the (slow-scraper) block and then desisted from further scrapes. If I understand your code correctly (not at all certain!) it will ban slow bots forever.

Again from experience, my suspicion is that the slow-scraper block error msgs *have* been read by the SE operatives, and that their bots have been re-programmed accordingly. I say this because in the early days of using the (more recent) code (msg#3) [webmasterworld.com] it caught many 'good' bots repetitively, but then stopped doing so; I did not make any changes, so they must have. There is no means for me to confirm this, however, apart from the fact that I know that the SE-operatives read these boards.

i don't really like the name of $fileMTime $fileATime, why not name it something like $startTime $hits

That is a good idea; consider it done (although not uploaded yet) for $fileMTime; I cannot yet find a good name for $fileATime.

// test for fast scrapers
...
+ $fileMTime = $time + $bPenalty;
+ $fileATime = $time;

The fast-scraper reset algorithm does need changing. I had a go, and got my mind twisted up, yet again (this routine does that to me all the time).

I am ploughing through an iptables tutorial [iptables-tutorial.frozentux.net] at this instant. When that is done, I shall find a better name for $fileATime (can you suggest one?), fix the fast-scraper reset algorithm, and then upload the whole routine.

AlexK

12:44 am on Jan 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Still only halfway through the iptables tutorial, but after a little nap (I'm getting old) have solved the fast-scraper reset:

$startTime  = $time;
$accessTime = $time + (( $bMaxVisit * $bPenalty ) / $bInterval );

You will also note that $fileATime is now named $accessTime, which seems a reasonable nomenclature.

The reasoning is based on the fast-scraper test:

(( $visits / $duration ) > ( $bMaxVisit / $bInterval ))

The blocking then needs to stop when:
(( $visits / $duration ) == ( $bMaxVisit / $bInterval ))

(and of course $duration is then $bPenalty, which allows us to solve for $visits - very neat)
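To spell the algebra out: the seeded visit-count is $accessTime - $startTime = ( $bMaxVisit * $bPenalty ) / $bInterval, so after $bPenalty seconds:

$visits / $duration = (( $bMaxVisit * $bPenalty ) / $bInterval ) / $bPenalty = $bMaxVisit / $bInterval

which is exactly the point at which the test above stops matching, and the block lifts.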

Upgraded script now uplifted.

Xuefer

1:53 am on Jan 24, 2006 (gmt 0)

10+ Year Member



Cool, your points are very valuable. I'll start thinking about rolling back to chmod(), the way I did it before the $time + $bantime patch.
The reason I have to mark a "banned" status is that I don't like logging every time the client gets a ban message; I want just one log entry until the ban is reset.

For the name:
I've made that suggestion already. Why name the variable after the storage rather than the usage? Do you name your user/password/email/... columns field1, field2, field3... in MySQL just because they are fields from MySQL's point of view?
It's fileatime/filemtime on disk, but we use it as $startTime and $hits.


$startTime = filemtime( $ipfile );
$visits    = fileatime( $ipfile ) - $startTime;
$duration  = $time - $startTime;
// never, ever reference atime/mtime below this line until touch()
if( $duration > 86400 ) {	// 24 hours; start over
	$startTime = $time;
	$duration  = $visits = 1;
}
.....
touch( $ipfile, $startTime, $startTime + $visits );

etc...

AlexK

1:20 pm on Jan 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Xuefer:
for the name: ... why name the variable relatived to the storage not usage?

Because $accessTime is not hits; $accessTime - $startTime is hits.

My apologies, Xuefer, but I am subject to an extremely anal-retentive, literal, mind which just will not let me rest if I get the wrong name.

$hitsOffsetTime would have been accurate, but long-winded, so in the end I settled on "$accessTime", which is accurate, if not that helpful.

My apologies again - I know that it is irritating. It will be no comfort to you, but try and imagine what it must be like for me; I have to live with this mind 24 hours a day! And I built it, so cannot blame anyone else.

Tell you what - how about we compromise on "$hitsTime"? I think I could live with that (unless you have better).

i don't like logging each time the client get a ban message

Then alter the log-routine.

Search for $bLogLine inside $log, and (perhaps) increment a counter in the logline. (You will obviously have to allow for any changing variables in the logline, or only search for the IP.)
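A sketch of the sort of change I mean (the "(xN)" counter format is just an illustration, not something the routine already does):

// hypothetical: collapse repeat offences from one IP into a single logline with a counter
$found = FALSE;
foreach( $log as $i => $line ) {
	if( strpos( $line, $ipRemote ) === 0 ) {	// this IP is already logged
		if( preg_match( '!\(x(\d+)\)\s*$!', $line, $m )) {
			$log[ $i ] = preg_replace( '!\(x\d+\)\s*$!', '(x'. ( $m[ 1 ] + 1 ) .")\n", $line );
		} else {
			$log[ $i ] = rtrim( $line ) ." (x2)\n";
		}
		$found = TRUE;
		break;
	}
}
if( ! $found ) $log[] = $bLogLine;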

Xuefer

2:42 am on Jan 25, 2006 (gmt 0)

10+ Year Member



I know $hits/$visits != $accessTime.
My recommended code is: $visits = fileatime($ipfile) - $startTime; after that line we have $visits, not $accessTime (not just the name changed), until we save it back with touch().
But never mind, I was just making a suggestion; you can certainly keep your design pattern.
I can live with it, as it's small enough for me to read if I take some time. :)

Btw, I've learned so much from your 3 threads.

AlexK

4:14 am on Feb 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Have just discovered the Retry-After header (see the W3 HTTP/1.1 Header definition [w3.org]), which is specifically designed for 503 responses, and have added it to the routine (and my site).
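The addition amounts to one extra header sent alongside the 503 (a sketch; I am assuming $bPenalty as the wait-period here, and the routine may use a different value):

header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
header( 'Retry-After: '. ( int ) $bPenalty );	// seconds the client should wait before retrying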

Maybe now the wretched AdSense-bot will stop its multi-thousand requests each day.

Give an hour or so for me to upload the changes.

AlexK

2:25 am on Feb 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Neither the Google Mozilla-bot nor the G AdSense-bot pays any attention to Retry-After.

Yesterday both were going crackers on my site [webmasterworld.com] on the same IP and got blocked at 21:56:38 GMT. Since then the 2 bots between them have made another 2,000 attempted accesses. Ignorant little blighters.

# egrep -c "66\.249\.66\.147 - - \[16/Feb/2006:2[123](.*) 503 " access_log
1216
# egrep -c "66\.249\.66\.147 - - \[17/Feb/2006(.*) 503 " access_log
733

wheel

7:11 pm on Feb 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Two quick questions:
- would this benefit from being converted to a compiled program? I'm likely going to implement this on a site where I both hope to have lots of user traffic and fully expect a ton of bots/scrapers, so I'm concerned about speed. Not sure what kind of hit this causes; if it would be at all noticeable, I can have our developer convert it.
- would this be used instead of a comprehensive htaccess list, or in conjunction with one? I'm wondering if I need to investigate both areas, or if just this script will do the trick.

AlexK

10:36 pm on Feb 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



wheel:
would this benefit from being converted to a compiled program?

This routine is one of the few where compilation would offer little or no benefit in program timings.

2 reasons:

  1. There are hardly any lines of code in normal operation.
  2. One single line of code consumes most of the op-time, and compilation is unlikely to offer savings there.

if( file_exists( $ipFile )) {

That line of code involves a disk access. The very first access will consume a (relatively) large amount of time. After that point, the system OS will cache the file, and thus the full routine op-time will depend heavily upon the efficiency of your server's OS and sub-systems.
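If you want to see the cost on your own box, a rough timing sketch (PHP 4-era microtime(); remember PHP also caches stat() results within a request, so call clearstatcache() between runs):

list( $usec, $sec ) = explode( ' ', microtime());
$start = ( float ) $usec + ( float ) $sec;
file_exists( $ipFile );	// first call: a real disk access
list( $usec, $sec ) = explode( ' ', microtime());
echo ((( float ) $usec + ( float ) $sec ) - $start ) ." secs\n";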

would this be used instead of a comprehensive htaccess list?

The greatest advantage (to my mind) of this routine is that it is, just like the lady on top of Justice Hall, blind in its treatment of each page-access. It treats every one exactly the same. That is also why it is so fast. Of course, if you wish to tip the scales for your server (just like Anubis [si.umich.edu] in the famous paintings of Ancient Egypt) that is your right.

If you wish for the very best program speed, then I would advise that you investigate using PHP Accelerator [php-accelerator.co.uk] (or something similar). PHPA works exactly as described above. The very first access takes, relatively, a long time, since it involves a disk access plus compilation time. After that point the (compiled) script is cached, and vast time savings are achieved for each access.

AlexK

12:40 pm on Mar 6, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yet another small addition has been made to the routine (thanks incrediBILL), plus uploaded (just now).

The additions extend the (optional) WhiteList routines. For reference, this is now the full whitelist-exclusion:

ipIsInNet( $ipRemote, '64.62.128.0/20' )  or	// Gigablast has blocks 64.62.128.0 - 64.62.255.255
ipIsInNet( $ipRemote, '66.154.100.0/22' ) or	// Gigablast has blocks 66.154.100.0 - 66.154.103.255
ipIsInNet( $ipRemote, '64.233.160.0/19' ) or	// Google has blocks 64.233.160.0 - 64.233.191.255
ipIsInNet( $ipRemote, '66.249.64.0/19' )  or	// Google has blocks 66.249.64.0 - 66.249.95.255
ipIsInNet( $ipRemote, '72.14.192.0/19' )  or	// Google has blocks 72.14.192.0 - 72.14.239.255
ipIsInNet( $ipRemote, '72.14.224.0/20' )  or
ipIsInNet( $ipRemote, '216.239.32.0/19' ) or	// Google has blocks 216.239.32.0 - 216.239.63.255
ipIsInNet( $ipRemote, '66.196.64.0/18' )  or	// Inktomi has blocks 66.196.64.0 - 66.196.127.255
ipIsInNet( $ipRemote, '66.228.160.0/19' ) or	// Overture has blocks 66.228.160.0 - 66.228.191.255
ipIsInNet( $ipRemote, '68.142.192.0/18' ) or	// Inktomi has blocks 68.142.192.0 - 68.142.255.255
ipIsInNet( $ipRemote, '72.30.0.0/16' )    or	// Inktomi has blocks 72.30.0.0 - 72.30.255.255
ipIsInNet( $ipRemote, '64.4.0.0/18' )     or	// MS-Hotmail has blocks 64.4.0.0 - 64.4.63.255
ipIsInNet( $ipRemote, '65.52.0.0/14' )    or	// MS has blocks 65.52.0.0 - 65.55.255.255
ipIsInNet( $ipRemote, '207.46.0.0/16' )   or	// MS has blocks 207.46.0.0 - 207.46.255.255
ipIsInNet( $ipRemote, '207.68.128.0/18' ) or	// MS has blocks 207.68.128.0 - 207.68.207.255
ipIsInNet( $ipRemote, '207.68.192.0/20' ) or
ipIsInNet( $ipRemote, '65.192.0.0/11' )   or	// Teoma has blocks 65.192.0.0 - 65.223.255.255
( substr( $ipRemote, 0, 13 ) == '66.194.55.242' )	// Ocelli
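For reference, ipIsInNet() itself is in the download; a minimal CIDR test along the same lines would be (a sketch, not the routine's exact code):

function ipIsInNet( $ip, $net ) {
	// eg: ipIsInNet( '66.249.64.5', '66.249.64.0/19' ) == TRUE
	list( $base, $bits ) = explode( '/', $net );
	$mask = -1 << ( 32 - ( int ) $bits );	// top $bits bits set
	return (( ip2long( $ip ) & $mask ) == ( ip2long( $base ) & $mask ));
}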

AlexK

9:09 pm on Mar 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For those making use of Conteg, please note that there is a small bugfix, and it is now v0.11 [webmasterworld.com].

AlexK

9:52 pm on Mar 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yet another small bugfix to the routine (a download link is in the first message).

Google Mozilla-bot has been going absolutely bonkers on my site [webmasterworld.com], constantly tripping the slow-scraper block for the past 28 days (over 200,000 503 Server-busy responses, if you can believe that - I find it difficult). With such a lot of activity there have been instances of 2 threads interfering, and the log-file has grown to more lines than it should (s/b 1,000 lines, is 1,003). Thus, the array_shift code has been modified to auto-fix that.
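The fix is nothing clever; it just trims the in-memory log back to its cap before re-writing, along these lines (a sketch; 1,000 is the cap mentioned above):

// 2 threads can each append before either re-writes, so enforce the cap on every pass
while( count( $log ) > 1000 ) {
	array_shift( $log );	// discard the oldest line(s)
}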

The modified routine will be uploaded after Prison Break has finished on Channel 5, OK?

steemar

3:48 pm on Mar 28, 2006 (gmt 0)



Been using the [original] bot-blocker for some time now. Works well. Now that I have rediscovered the thread, I will certainly make the many changes to make things slicker and/or more effective.

Recently I also discovered Bad Behavior, which was originally designed as a WordPress/Drupal mod to stop spamming but works stand-alone. It inspects browser headers etc. and blocks if things are not as they should be.

Anyone else had a look at it?

interbuy

8:50 pm on Apr 8, 2006 (gmt 0)

10+ Year Member



Is it supposed to place a zero-byte file for each IP that accesses your site? Mine only puts in a couple an hour, and I have hundreds of accesses an hour on my site.

-rw-r--r-- 1 apache apache 0 Apr 8 12:10 28a

Also so far no iplog.

I have hammered my site with Webcopier and it did not stop or slow me down.

I have auto-prepended the script in .htaccess.

Any help is appreciated.

AlexK

1:55 pm on Apr 26, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



interbuy:
Is it supposed to place a zero file for each IP that access your site

First, many apologies for taking so long to answer your question - your post somehow slipped by me. Now the answer...

The answer, really, is both yes and no (how helpful!).

It depends on the value of $ipLength. As one example, if $ipLength=2, then _B_DIRECTORY will fill with a maximum of 255 files. You will appreciate that there are rather more IPs than that and, therefore, each $ipFile will be shared amongst IPs. Once the full number are created, that is it (access- and modification-times are updated, but no more are created).

$ipLength=2 is a good value for a small site, $ipLength=3 for a medium site, and higher values for much busier sites.
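(For the curious: the sharing happens because $ipFile is derived from just the first $ipLength characters of a hash of the IP; a hypothetical reconstruction, not the routine's exact line:

$ipFile = _B_DIRECTORY . substr( md5( $ipRemote ), 0, $ipLength );

With so few characters, many IPs necessarily map onto the same file.)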

Also so far no iplog.
I have hammered my site with Webcopier and it did not stop or slow me down.

The routine will not slow down web-accesses (God forbid!), but it will stop over-active or long-term scrapers. My $ipLogFile is continually filled with accesses from "Mediapartners-Google" and "Mozilla/5.0 (compatible; Googlebot" on the same IP (64,459 Server busy so far this month), so, try to believe me: it works.

PS:
Try DTAAgent - that is the last fast scraper that I caught (26/04/2006 08:53:24, hitting the site at up to 20 times/sec).

[edited by: AlexK at 2:09 pm (utc) on April 26, 2006]

AlexK

2:07 pm on Apr 26, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I recommend that the value of $bTotVisit be upped to 1500.

The AdSense-bot ("Mediapartners-Google") and the current Google-bot ("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)") customarily share the same IP. That is leading to continual slow-scraper entries in the log-file (although my AdSense earnings are unaffected - I wish that Google would sort this out).

PS:
The old Googlebot is no more. It is defunct, has joined the choir invisible, etc etc.



continued here
[webmasterworld.com...]

[edited by: jatar_k at 5:22 pm (utc) on Oct. 19, 2006]
