Forum Moderators: coopster
The original revised routine works fine - in fact, since June there have been 677 attempts by 20 different people to rip off my site, all blocked by the routine. However, I have noticed a mistake in the long-term scraper code:
It should be:
if( file_exists( $ipFile )) {
    $fileATime = fileatime( $ipFile );
    $fileMTime = filemtime( $ipFile );
    $fileATime++;
    $visits   = $fileATime - $fileMTime;
    $duration = $time - $fileMTime; // secs
    // foll test also keeps tracking going to catch slow scrapers
    if( $duration > 86400 ) { // 24 hours; start over
        $fileMTime = $fileATime = $time;
        $duration  = $visits = 1;
    } else if( $duration < 1 ) $duration = 1;
    // test for slow scrapers
    if( $visits >= $bTotVisit ) {
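For anyone puzzled by the timestamp juggling above: the routine stores its per-IP counters in the file's timestamps rather than in file content - mtime holds the start of the tracking window, atime holds (start + visits). A minimal stand-alone sketch of the trick (the file name and numbers here are illustrative only, not the script's values):

```php
<?php
// Each IP-file stores the window start in its mtime and (start + visits)
// in its atime, so no file content is ever written.
$ipFile = tempnam( sys_get_temp_dir(), 'bb' );
$time   = time();

$start  = $time - 600;  // window began 10 minutes ago
$visits = 5;            // 5 page-loads so far
touch( $ipFile, $start, $start + $visits ); // touch( $file, $mtime, $atime )

clearstatcache();
$fileMTime = filemtime( $ipFile );
$fileATime = fileatime( $ipFile );

$visitsRead   = $fileATime - $fileMTime; // recovers 5
$durationRead = $time - $fileMTime;      // recovers 600 secs
echo "$visitsRead visits in $durationRead secs\n";
unlink( $ipFile );
```

The win is that a directory of zero-byte files carries all the tracking state, with no reads or writes beyond stat() and touch().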
[edited by: coopster at 1:57 pm (utc) on Aug. 7, 2008]
Like I said, I'm a complete newbie to PHP. I discovered the function error_reporting(E_ALL); (maybe you should include this in the script for other PHP newbies?) and that clued me in.
I didn't realize that in PHP a slash-asterisk (/*) starts a multi-line comment. I simply edited your example '/full/path/on/server/to/block/dir/' on that exact same line and assumed the Constants were now defined. I now realize I had to copy-and-paste the Constants below the line "Start blocking badly-behaved bots : top code".
I hope this will help other PHP newbies.
The previous steps work fine at blocking fast and slow scrapers, but have the downside that (since HTML pages are now actually PHP pages) the default Apache Content-Negotiation behaviour of supplying a 304 for unchanged pages, etc, is gone. This post shows how to fix that.
First, you will need a PHP Class, which can be found here [webmasterworld.com] (it is called "Conteg.include", written by me, and allows Content-Negotiation to be easily added to PHP pages).
Next, add the following lines to the bottom of the block_bad_bots.php file:

// -------------- Stop blocking badly-behaved bots : top code --------
ob_start();
require_once( '/server/path/to/file/Conteg.include' );
?>

...and put the following into the block_bad_bots_bot.php file (auto-appended via the php directives below):

<?php
new Conteg( array(
    'modified'          => filemtime( ${$_SERVER_ARRAY}[ 'PATH_TRANSLATED' ]),
    'use_etag'          => TRUE,
    'use_apache_notes'  => TRUE
));
?>
# php directives (the new addition is the auto_append_file line)
# 2006-01-19 added to block bad-bots -AK
#
# <IfModule mod_php4.c>
AddType application/x-httpd-php .html
php_value auto_prepend_file "/server/path/to/file/block_bad_bots.php"
php_value auto_append_file "/server/path/to/file/block_bad_bots_bot.php"
# </IfModule>
# End of php directives
...and restart Apache as before.
Notes:
1 Both block_bad_bots.php and block_bad_bots_bot.php should be outside of the web-directory (inaccessible from the web) if at all possible.
2 'PATH_TRANSLATED' worked on my server, but the PHP Manual [php.net] does state that some Apache 2 users may need to use AcceptPathInfo On inside httpd.conf (to define PATH_INFO). Probably best to check first.
3 The 'use_apache_notes' => TRUE line within the parameter-array allows compression stats to be reported within the access-log. It is not essential. If you want to use it, the following will need to be added to httpd.conf (Apache 2):
<IfModule mod_deflate.c>
AddOutputFilterByType DEFLATE text/html text/plain text/css text/xml application/xml application/xhtml+xml
#
# accommodate Netscape 4.x (next line) + 4.06-4.08 (2nd line)
BrowserMatch ^Mozilla/4 gzip-only-text/html
BrowserMatch ^Mozilla/4\.0[678] no-gzip
#
# bug in mod_setenvif up to Apache 2.0.48; foll regex OK
BrowserMatch \bMSI[E] !no-gzip !gzip-only-text/html
#
# 2005-09-06 Don't compress images or add UA-Vary
# the 'no-gzip' should be superfluous
SetEnvIfNoCase Request_URI \.(?:gif|jpe?g|png)$ no-gzip dont-vary
#
# Make sure proxies don't deliver the wrong content
<IfModule mod_headers.c>
Header append Vary User-Agent env=!dont-vary
</IfModule>
#
# 2005-09-06 Info to put deflate stats into logs
DeflateFilterNote Input instream
DeflateFilterNote Output outstream
DeflateFilterNote Ratio ratio
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" In:%{instream}n Out:%{outstream}n:%{ratio}npct." deflate
</IfModule>
So, am I good to you, or what?
P.S. Another webpage suggests adding "XBitHack full" to .htaccess, although I haven't got that to work yet as an alternative to Conteg.include
P.P.S. Any danger that search engines may detect this as a form of cloaking, or does Conteg.include make these pages absolutely indistinguishable from any other static HTML page?
Any danger that search engines may detect this as a form of cloaking
Remember, Apache implements Content-Negotiation only because the config-statements within httpd.conf tell it to. If those statements are missing, then so is the Content-Negotiation. The point is that the default httpd.conf supplied with Apache already has (most of) Content-Negotiation implemented.
does Conteg.include make these pages absolutely indistinguishable from any other static HTML page?
The array parameters for Conteg as shown above implement weak ETags, whereas Apache would (by default) implement strong ETags.
The default for the Class is weak ETags, since both the Class and weak ETags were designed with dynamic pages in mind. My own site is a good example: a central body of content (which changes rarely) surrounded by site-relevant, but not content-relevant, material which changes often (page-stats, etc). Strong ETags were designed for pages where, if the ETag is the same, the entire page is byte-for-byte the same. A weak ETag says that the principal content is the same, though other, less important, content may have changed.
The principal difference between the two is that strong-ETags (but not weak) will also allow the use of Range & If-Range, which can allow further bandwidth savings. One further consideration is CPU-load and speed. The Class implements strong-ETags by MD5-ing the entire page content, and that can take some time with extremely large pages.
I actually could have used strong ETags on my site, since the HTML pages do not have any dynamic content within them (look at the documentation within the Class, and the header for setup(), if you wish to use this; it is not difficult, and only means including some extra array elements within the Class declaration - the rest is then handled by the Class). Within this forum, however, I thought it best to keep to weak ETags, since that will safely handle all cases.
If you want to use Ranges, add:
array(
'weak_etag' => FALSE,
'use_accept_ranges' => TRUE
)
It is also worth mentioning that the following (defaults) need checking for your pages:
array(
'charset' => 'ISO-8859-1', // check is accurate
'lang' => 'en' // check is accurate
)
In our case, static HTML pages show the header "Accept-Ranges: bytes", which does not appear with Conteg.include.
Do you have any input on this? Many thanks for your time!
There were also some additional headers:
X-Powered-By: PHP/4.4.0
Vary: User-Agent,Accept-Encoding
Content-Encoding: gzip
Set 'use_accept_encode' => FALSE if you do not want it.
adding a default charset may not be a good idea
Set 'charset' => '' if you do not want it.
All other questions are already answered in msg #67 ("Accept-Ranges: bytes" is switched on with 'use_accept_ranges' => TRUE).
diff -u bot-block.php.old bot-block.php
--- bot-block.php.old 2006-01-21 10:20:16.000000000 +0800
+++ bot-block.php 2006-01-21 10:26:51.000000000 +0800
@@ -58,6 +58,19 @@
$fileATime++;
$visits = $fileATime - $fileMTime;
$duration = $time - $fileMTime; // secs
+ if ($duration < 0) {
+ $bantime = -$duration;
+ // keep banning if the client ignores the 503/error-message
+ $fileMTime = max($bantime, $bPenalty);
+ touch($ipFile, $fileMTime, $fileATime);
+
+ header( 'HTTP/1.0 503 Service Temporarily Unavailable' );
+ header( 'Connection: close' );
+ header( 'Content-Type: text/html' );
+ echo "<html><body><p><b>Server under load</b><br />";
+ echo "please do NOT try again within $bantime seconds</p></body></html>";
+ exit;
+ }
if( $duration > $bStartOver ) { // default 24 hours; restart tracking
$fileMTime = $fileATime = $time;
$duration = $visits = 1;
@@ -76,13 +89,14 @@
echo "<html><body><p><b>Server under undue load</b><br />";
echo "$visits visits from your IP-Address within the last ". (( int ) $duration / 3600 ) ." hours. Please wait ". (( int ) ( $bStartOver - $duration ) / 3600 ) ." hours before retrying.</p></body></html>";
$bLogLine = "$ipRemote ". date( 'd/m/Y H:i:s' ) ." $useragent (slow scraper)\n";
+ $bantime = $bStartOver - $duration;
+ $fileMTime = $time + $bantime;
+ $fileATime = $time;
// test for fast scrapers
} elseif(
( $visits >= $bMaxVisit ) and
(( $visits / $duration ) > ( $bMaxVisit / $bInterval ))
) {
- $fileMTime = $time = $time - $bInterval;
- $fileATime = $time + $bMaxVisit + (( $bMaxVisit * $bPenalty ) / $bInterval );
$useragent = ( isset( ${$_SERVER_ARRAY}[ 'HTTP_USER_AGENT' ]))
? ${$_SERVER_ARRAY}[ 'HTTP_USER_AGENT' ]
: '<unknown user agent>';
@@ -92,6 +106,8 @@
echo "<html><body><p><b>Server under heavy load</b><br />";
echo "$visits visits from your IP-Address within the last $duration secs. Please wait at least $bPenalty secs before retrying.</p></body></html>";
$bLogLine = "$ipRemote ". date( 'd/m/Y H:i:s' ) ." $useragent (fast scraper)\n";
+ $fileMTime = $time + $bPenalty;
+ $fileATime = $time;
}
// log badly-behaved bots, then nuke 'em
if( $bLogLine ) {
It isn't much of an improvement - true, it loses some info - but this way a bot can be banned forever if it keeps crawling without noticing the error message and without hitting $bStartOver (if someone has a much lower value).
On the coding side, it's easier to decide how many seconds to ban.
btw, I don't really like the names $fileMTime / $fileATime - why not name them something like $startTime / $hits, and only convert when calling stat/touch?
+ if ($duration < 0) {
+ $bantime = -$duration;
+ // keep banning if the client ignores the 503/error-message
From experience, slow scrapers just slow down (they all seem programmed, once blocked, to re-attempt a page-load once every 10 secs/1 minute/whatever, and will keep that up forever, or until $bStartOver is hit). Fast scrapers, of course, pay no attention either to the 503 header or to the html error-msg. In all the time running these scripts (approaching 3 years) there has been just 1 occasion on which a human tripped the (slow-scraper) block and then desisted from further scrapes. If I understand your code correctly (not at all certain!) it will ban slow bots forever.
Again from experience, my suspicion is that the slow-scraper block error-msgs *have* been read by the SE operatives, and that their bots have been re-programmed accordingly. I say this because in the early days of using the (more recent) code (msg#3) [webmasterworld.com] it caught many 'good' bots repetitively, but then stopped doing so, and I had made no changes, so they must have. There is no means for me to confirm this, however, apart from the fact that I know that the SE operatives read these boards.
I don't really like the names $fileMTime / $fileATime - why not name them something like $startTime / $hits
// test for fast scrapers
...
+ $fileMTime = $time + $bPenalty;
+ $fileATime = $time;
I am ploughing through an iptables tutorial [iptables-tutorial.frozentux.net] at this instant. When that is done, I shall find a better name for $fileATime (can you suggest one?), and fix the fast-scraper reset algorithm, then upload the whole routine.
$startTime = $time;
$accessTime = $time + (( $bMaxVisit * $bPenalty ) / $bInterval );
$fileATime is now named $accessTime, which seems a reasonable nomenclature.
The reasoning is based on the fast-scraper test:
(( $visits / $duration ) > ( $bMaxVisit / $bInterval ))
On reset, the stored values are placed exactly on that threshold:
(( $visits / $duration ) == ( $bMaxVisit / $bInterval ))
$duration is $bPenalty, which allows us to solve for $visits (very neat).
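To make the arithmetic concrete, here is a quick check with illustrative numbers (the values for $bMaxVisit, $bInterval and $bPenalty are my assumptions, not the script's defaults):

```php
<?php
// Check that the reset values sit exactly on the fast-scraper threshold.
$bMaxVisit = 30;  // max visits allowed...
$bInterval = 60;  // ...within this many seconds
$bPenalty  = 600; // ban length in seconds

// With $duration == $bPenalty, solving
//   $visits / $duration == $bMaxVisit / $bInterval
// for $visits gives:
$visits   = ( $bMaxVisit * $bPenalty ) / $bInterval;
$duration = $bPenalty;

echo $visits, "\n"; // 300
var_dump(( $visits / $duration ) == ( $bMaxVisit / $bInterval )); // bool(true)
```

That is exactly the $accessTime offset shown above: the stored state decays back below the threshold only after $bPenalty seconds.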
Upgraded script now uploaded.
For the name:
I've made a suggestion already - why name the variable after the storage rather than the usage? Do you name your user/password/email/... columns field1, field2, field3... in MySQL just because they are actually fields from MySQL's point of view?
On disk it's fileatime/filemtime, but we use it as $startTime, $hits:
$startTime = filemtime( $ipfile );
$visits = fileatime( $ipfile ) - $startTime;
$duration = $time - $startTime;
// never ever reference atime/mtime below this line until touch()
if ( $duration > 86400 ) { // 24 hours; start over
    $startTime = $time;
    $duration = $visits = 1;
}
.....
touch( $ipfile, $startTime, $startTime + $visits );
for the name: ... why name the variable after the storage rather than the usage?
$accessTime is not hits; $accessTime - $startTime is hits.
My apologies, Xuefer, but I am subject to an extremely anal-retentive, literal, mind which just will not let me rest if I get the wrong name.
$hitsOffsetTime would have been accurate, but long-winded, so in the end I settled on "$accessTime", which is accurate, if not that helpful.
My apologies again - I know that it is irritating. It will be no comfort to you, but try and imagine what it must be like for me; I have to live with this mind 24 hours a day! And I built it, so cannot blame anyone else.
Tell you what - how about we compromise on "$hitsTime"? I think I could live with that (unless you have better).
i don't like logging each time the client get a ban message
Search for $bLogLine inside $log, and (perhaps) increment a counter in the log-line. (You will obviously have to allow for any changing variables in the log-line, or only search for the IP.)
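A sketch of that suggestion (the function name, log-line format and the "(xN)" counter style are my own inventions, not part of the script): before appending $bLogLine, look for an existing line for the same IP and bump a counter instead of adding a duplicate.

```php
<?php
// Hypothetical helper: log each banned IP once, with a repeat-counter.
// Matching on the IP prefix alone is a simplification (e.g. '1.2.3.4'
// would also match '1.2.3.45'); a real version should match the full IP.
function bLogOnce( $logFile, $ipRemote, $bLogLine ) {
    $lines = file_exists( $logFile )
        ? file( $logFile, FILE_IGNORE_NEW_LINES )
        : array();
    foreach ( $lines as $i => $line ) {
        if ( strpos( $line, $ipRemote ) === 0 ) { // found an earlier entry
            $n = preg_match( '/ \(x(\d+)\)$/', $line, $m ) ? (int) $m[1] + 1 : 2;
            $lines[$i] = preg_replace( '/ \(x\d+\)$/', '', $line ) . " (x$n)";
            file_put_contents( $logFile, implode( "\n", $lines ) . "\n" );
            return;
        }
    }
    // first sighting of this IP: append as normal
    file_put_contents( $logFile, rtrim( $bLogLine ) . "\n", FILE_APPEND );
}
```

Repeated bans from one IP then show as a single line ending "(xN)" rather than N lines.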
btw, I've learned so much from your 3 threads
Maybe now the wretched AdSense-bot will stop its multi-thousand requests each day.
Give an hour or so for me to upload the changes.
Yesterday both were going crackers on my site [webmasterworld.com] on the same IP and got blocked at 21:56:38 GMT. Since then the 2 bots between them have made another 2,000 attempted accesses. Ignorant little blighters.
# egrep -c "66\.249\.66\.147 - - \[16/Feb/2006:2[123](.*) 503 " access_log
1216
# egrep -c "66\.249\.66\.147 - - \[17/Feb/2006(.*) 503 " access_log
733
would this benefit from being converted to a compiled program?
2 reasons:
if( file_exists( $ipFile )) {
would this be used instead of a comprehensive htaccess list?
If you wish for the very best program speed, then I would advise that you investigate using PHP Accelerator [php-accelerator.co.uk] (or something similar). PHPA works exactly as described above. The very first access takes, relatively, a long time, since it involves a disk access plus compilation time. After that point the (compiled) script is cached, and vast time savings are achieved for each access.
The additions extend the (optional) WhiteList routines. For reference, this is now the full whitelist-exclusion:
ipIsInNet( $ipRemote, '64.62.128.0/20' ) or   // Gigablast has blocks 64.62.128.0 - 64.62.255.255
ipIsInNet( $ipRemote, '66.154.100.0/22' ) or  // Gigablast has blocks 66.154.100.0 - 66.154.103.255
ipIsInNet( $ipRemote, '64.233.160.0/19' ) or  // Google has blocks 64.233.160.0 - 64.233.191.255
ipIsInNet( $ipRemote, '66.249.64.0/19' ) or   // Google has blocks 66.249.64.0 - 66.249.95.255
ipIsInNet( $ipRemote, '72.14.192.0/19' ) or   // Google has blocks 72.14.192.0 - 72.14.239.255
ipIsInNet( $ipRemote, '72.14.224.0/20' ) or
ipIsInNet( $ipRemote, '216.239.32.0/19' ) or  // Google has blocks 216.239.32.0 - 216.239.63.255
ipIsInNet( $ipRemote, '66.196.64.0/18' ) or   // Inktomi has blocks 66.196.64.0 - 66.196.127.255
ipIsInNet( $ipRemote, '66.228.160.0/19' ) or  // Overture has blocks 66.228.160.0 - 66.228.191.255
ipIsInNet( $ipRemote, '68.142.192.0/18' ) or  // Inktomi has blocks 68.142.192.0 - 68.142.255.255
ipIsInNet( $ipRemote, '72.30.0.0/16' ) or     // Inktomi has blocks 72.30.0.0 - 72.30.255.255
ipIsInNet( $ipRemote, '64.4.0.0/18' ) or      // MS-Hotmail has blocks 64.4.0.0 - 64.4.63.255
ipIsInNet( $ipRemote, '65.52.0.0/14' ) or     // MS has blocks 65.52.0.0 - 65.55.255.255
ipIsInNet( $ipRemote, '207.46.0.0/16' ) or    // MS has blocks 207.46.0.0 - 207.46.255.255
ipIsInNet( $ipRemote, '207.68.128.0/18' ) or  // MS has blocks 207.68.128.0 - 207.68.207.255
ipIsInNet( $ipRemote, '207.68.192.0/20' ) or
ipIsInNet( $ipRemote, '65.192.0.0/11' ) or    // Teoma has blocks 65.192.0.0 - 65.223.255.255
( substr( $ipRemote, 0, 13 ) == '66.194.55.242' ) // Ocelli
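The thread never shows ipIsInNet() itself. For completeness, here is a minimal IPv4-only sketch of what such a helper might look like - this is my own reconstruction, not AlexK's original:

```php
<?php
// Hypothetical reconstruction of ipIsInNet(): true if $ip falls inside
// the IPv4 CIDR range $cidr (e.g. '66.249.64.0/19').
function ipIsInNet( $ip, $cidr ) {
    list( $net, $bits ) = explode( '/', $cidr );
    $mask = -1 << ( 32 - (int) $bits ); // high $bits bits set, host bits clear
    return (( ip2long( $ip ) & $mask ) === ( ip2long( $net ) & $mask ));
}

var_dump( ipIsInNet( '66.249.66.147', '66.249.64.0/19' ) ); // bool(true)
var_dump( ipIsInNet( '72.30.1.1',     '66.249.64.0/19' ) ); // bool(false)
```

Masking both the candidate IP and the network base with the same mask means a sloppy base address (host bits set) still compares correctly.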
The Google Mozilla-bot has been going absolutely bonkers on my site [webmasterworld.com] and constantly tripping the slow-scraper block for the past 28 days (over 200,000 503 Server-busy responses, if you can believe that - I find it difficult). With such a lot of activity there have been instances of 2 threads interfering, and the log-file has grown to more lines than it should (s/b 1,000 lines, is 1,003). Thus, array_shift has been modified to auto-fix that.
The modified routine will be uploaded after Prison Break has finished on Channel 5, OK?
I also recently discovered Bad Behavior, which was originally designed as a WordPress/Drupal mod to stop spamming but works stand-alone. It inspects browser headers etc and blocks if things are not as they should be.
Anyone else had a look at it?
-rw-r--r-- 1 apache apache 0 Apr 8 12:10 28a
Also, so far no iplog.
I have hammered my site with Webcopier and it did not stop or slow me down.
I have auto-prepended the script in .htaccess.
Any help is appreciated.
Is it supposed to place a zero file for each IP that access your site
The answer, really, is both yes and no (how helpful!).
It depends on the value of $ipLength. As one example, if $ipLength=2, then _B_DIRECTORY will fill with a maximum of 255 files. You will appreciate that there are rather more IPs than that and, therefore, each $ipFile will be shared amongst IPs. Once the full number are created, that is it (access- and modification-times are updated, but no more are created).
$ipLength=2 is a good value for a small site, $ipLength=3 for a medium site, and higher values for much busier sites.
Also so far no iplog.
I have hammered my site with Webcopier and it did not stop or slow me down.
$ipLogFile is continually filled with accesses from Mediapartners-Google and "Mozilla/5.0 (compatible; Googlebot" sharing the same IP (64,459 Server-busy so far this month), so - try to believe me - it works.
PS
Try DTAAgent - that is the last fast scraper that I caught (26/04/2006 08:53:24 - hitting the site at up to 20 times/sec).
[edited by: AlexK at 2:09 pm (utc) on April 26, 2006]
$bTotVisit be upped to 1500.
The AdSense-bot ("Mediapartners-Google") and the current Google-bot ("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)") customarily share the same IP. That is leading to continual slow-scraper entries in the log-file (although my AdSense earnings are unaffected - I wish that Google would sort this out).
PS:
The old Googlebot is no more. It is defunct, has joined the choir invisible, etc etc.
[edited by: jatar_k at 5:22 pm (utc) on Oct. 19, 2006]