Forum Moderators: coopster
The original revised-routine works fine--in fact, so far since June, there have been 677 attempts by 20 different people to rip off my site, all blocked by the routine. However, I have noticed a mistake in the long-term scraper code, which should read:
if( file_exists( $ipFile )) {
    $fileATime = fileatime( $ipFile );
    $fileMTime = filemtime( $ipFile );
    $fileATime++;
    $visits   = $fileATime - $fileMTime;
    $duration = $time - $fileMTime;    // secs
    // foll test also keeps tracking going to catch slow scrapers
    if( $duration > 86400 ) {          // 24 hours; start over
        $fileMTime = $fileATime = $time;
        $duration  = $visits = 1;
    } else if( $duration < 1 ) $duration = 1;
    // test for slow scrapers
    if( $visits >= $bTotVisit ) {
The modified routine works extremely well. In fact, so well that you may need to carefully consider whether to use it. You see, the scrapers that it caught were the Adsense bot and (just a handful of times) the Yahoo! Slurp! bot.
How the corrected routine works:
Requests from a particular IP are counted, and reset every 24 hours. If the count exceeds $bTotVisit then the request is blocked with a 503 Service Temporarily Unavailable and will stay that way until the next reset.
Fast scrapers (the more-normal culprits) are blocked by checking whether their hit-rate exceeds $bMaxVisit / $bInterval. If so, the count period is reset to keep them out for at least $bPenalty secs.
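For readers who want to see the shape of that fast-scraper test: the comparison itself is quoted verbatim later in this thread, but everything else below (the way the timestamps are pushed apart to enforce $bPenalty, and the 503 headers) is only my guess at a minimal implementation, not the actual routine:
// sketch only: fast-scraper test, tripped when the hit-rate exceeds the allowed ratio
if(( $visits / $duration ) > ( $bMaxVisit / $bInterval )) {
    // assumed lock-out: push M-time back and A-time forward so the ratio
    // stays tripped for at least $bPenalty secs after the last request
    $fileMTime = $time - $bInterval;
    $fileATime = $fileMTime + $bPenalty;
    touch( $ipFile, $fileMTime, $fileATime );
    header( 'HTTP/1.1 503 Service Temporarily Unavailable' );
    header( 'Retry-After: ' . $bPenalty );
    exit;
}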
This is what the block-logfile looks like for slow scrapers (notice that the Adsense bot makes a request every 10 secs). In normal use, this wretched bot is prone to making up to 6 requests for the same page in short succession.
Blocked IPs:
.
* 66.249.65.19 [ crawl-66-249-65-19.googlebot.com ] 1000 line(s)
.
1000 total lines in log-file.
Log Line                                                                  Lines
66.249.65.19 13/10/2005 21:38:25 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:38:15 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:38:05 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:37:55 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:37:44 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:37:34 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:37:24 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:37:13 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:34:10 Mediapartners-Google/2.1 (slow scraper) 1
66.249.65.19 13/10/2005 21:34:00 Mediapartners-Google/2.1 (slow scraper) 1
The first line above was the last block of that particular period (the count was then given its 24-hour reset). This behaviour made no difference to my Adsense revenue for the blocked period; in fact, the click-through rate was exceptionally good.
The Adsense-bot and the Yahoo Slurp!-bot are hitting my site like absolute maniacs. For the first 13 days of October:
humans: 64,877 pages
robots: 64,901 pages
Inktomi Slurp: 22,907 pages + 1,116 hits on robots.txt + a handful of 503s.
Google AdSense: 8,019 pages + 28 hits on robots.txt + 6,690 "503 Server-busy"
It gets worse if we consider bandwidth.
The Yahoo Slurp! bot uses the If-Modified-Since request header [webmasterworld.com], and thus gets 304 Not Modified responses for many of its requests. It can also handle compressed pages, which reduces the bandwidth by up to 80%.
The Adsense bot does not use the If-Modified-Since request header (why not?) but, thankfully, does handle compressed pages.
The MSNBot, just for completeness, makes use of neither and, consequently, takes more bandwidth than all the other bots put together.
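(Side note, separate from the blocking routine: for anyone unfamiliar with the mechanism, this is a rough sketch of how a PHP page can honour If-Modified-Since and hand back a cheap 304. Using the script's own mtime for $lastModified is only an assumption - use whatever freshness timestamp suits your pages.)
$lastModified = filemtime( __FILE__ );    // assumption: page freshness == script mtime
header( 'Last-Modified: ' . gmdate( 'D, d M Y H:i:s', $lastModified ) . ' GMT' );
if( isset( $_SERVER['HTTP_IF_MODIFIED_SINCE'] )
        && strtotime( $_SERVER['HTTP_IF_MODIFIED_SINCE'] ) >= $lastModified ) {
    header( 'HTTP/1.1 304 Not Modified' );    // nothing changed: tiny response, no body
    exit;
}
ob_start( 'ob_gzhandler' );    // compressed output - the other bandwidth saver mentioned above
// ... generate the page as normal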
I am prepared to pay the price of this blocking, since I am sick and tired of being hit so hard by the search-bots, yet get so little in return. You may not be.
I'm running this beautiful script by adding the following to my .htaccess:
<IfModule mod_php4.c>
php_value auto_prepend_file "/path/to/file/block_bad.php"
</IfModule>
As I'm going to update PHP to v5.0.4 in the next few days (loaded module: mod_php5), is it enough to change <IfModule mod_php4.c> to <IfModule mod_php5.c> in my .htaccess?!?
Also please: is this ultimate version of the script compatible with PHP v5.0.4?
I'm actually using the previous one [webmasterworld.com...]
Thanks in advance,
tito
Also please: is this ultimate version of the script compatible with PHP v5.0.4?
The main difference between this version and the previous routine is that the well-behaved-bots whitelist has been dropped: every IP is now judged on its behaviour alone.
For your reference, here is the well-behaved bots exclusion code that I previously employed:
$remote = $_SERVER[ 'REMOTE_ADDR' ];
if(( substr( $remote, 0, 10 ) == '66.249.64.' ) or    // Google has blocks 64.233.160.0 - 64.233.191.255
   ( substr( $remote, 0, 10 ) == '66.249.65.' ) or    // Google has blocks 66.249.64.0 - 66.249.95.255
   ( substr( $remote, 0, 10 ) == '66.249.66.' ) or    // Google has blocks 72.14.192.0 - 72.14.207.255
   ( substr( $remote, 0, 9 )  == '216.239.3' )  or    // Google has blocks 216.239.32.0 - 216.239.63.255
   ( substr( $remote, 0, 9 )  == '216.239.4' )  or
   ( substr( $remote, 0, 9 )  == '216.239.5' )  or
   ( substr( $remote, 0, 10 ) == '65.54.188.' ) or    // MS has blocks 65.52.0.0 - 65.55.255.255
   ( substr( $remote, 0, 10 ) == '207.46.98.' ) or    // MS has blocks 207.46.0.0 - 207.46.255.255
   ( substr( $remote, 0, 13 ) == '66.194.55.242' )    // Ocelli
  ) {
    // let well-behaved bots through
} else {
    // block routine
}
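(If anyone wants to generalise those prefix tests, a CIDR-style check is one option. The sketch below is only illustrative: ipInCidr is a made-up helper, and the ranges are simply lifted from the comments above, so verify them before relying on them.)
// sketch: whitelist test using CIDR ranges instead of string prefixes
function ipInCidr( $ip, $cidr ) {
    list( $net, $bits ) = explode( '/', $cidr );
    $mask = -1 << ( 32 - ( int ) $bits );
    return ( ip2long( $ip ) & $mask ) == ( ip2long( $net ) & $mask );
}
$goodBotRanges = array(    // from the comments above - re-check before use
    '66.249.64.0/19', '216.239.32.0/19', '65.52.0.0/14', '207.46.0.0/16' );
$isGoodBot = false;
foreach( $goodBotRanges as $range ) {
    if( ipInCidr( $_SERVER['REMOTE_ADDR'], $range )) { $isGoodBot = true; break; }
}
// if( $isGoodBot ) let it through, else run the block routine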
$bTotVisit = 0 switches the slow-scraper code off.
Google Adsense + Slurp! seem to have re-programmed themselves, and the former problems no longer occur. MSNBot has stepped into the breach and gets blocked every 24 hours (for the last 7 days or more). That bot takes way too much bandwidth so, once again, I am happy with that.
I would be interested to hear any suggestions on making the block-value dynamic, rather than fixed.
No problem there. Should that document include the php start and end tags? I assumed the answer would be yes.
My log files are all zero byte files. Should I assume that no activity will be logged until the triggers are reached? And what cleans up those files after 24 hours? It looks like I need to write a cron job if I want to keep a clean directory.
I didn't look over the code closely enough to answer my own questions, opting instead to trust the code I found here and make the necessary changes for my setup.
grandpa:
Should that document include the php start and end tags?
Yes - the auto-prepended file is parsed as an ordinary PHP file, so it needs its own <?php ... ?> tags.
Should I assume that no activity will be logged until the triggers are reached?
And what cleans up those files after 24 hours? It looks like I need to write a cron job if I want to keep a clean directory.
The A-time (access time) and M-time (modification time) are used to track activity with each file. It is these that are reset, either when the 24-hour limit has passed (both) or when there is continual activity (just one). This is the reason that the routine is so quick and uses so few resources.
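(A rough sketch of that bookkeeping, as I understand it - not the exact code: M-time marks the start of the counting period and A-time is bumped by one second per visit, so both the visit count and the elapsed duration can be read straight back out of the two timestamps.)
// sketch of the A-time/M-time counter described above
if( !file_exists( $ipFile )) {
    touch( $ipFile, $time, $time + 1 );          // new period: M-time = now, one visit recorded
} else {
    $fileMTime = filemtime( $ipFile );
    $fileATime = fileatime( $ipFile ) + 1;       // +1 second == +1 visit
    if(( $time - $fileMTime ) > 86400 ) {        // 24-hr rollover: start the period again
        $fileMTime = $time;
        $fileATime = $time + 1;
    }
    touch( $ipFile, $fileMTime, $fileATime );
    $visits   = $fileATime - $fileMTime;
    $duration = max( 1, $time - $fileMTime );
}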
It may be a good idea to set up a little batch job to give yourself a simple way to delete the log-blocks file, should you ever want to (the routine auto-creates the log-blocks file if it does not exist, and rolls over when the line-limit is reached, so even that is not strictly necessary). A way to read it is always useful, though.
If otherwise-welcome 'bots are overrunning your site with requests and tripping the script, look into using the Crawl-delay directive in robots.txt. Yahoo, MSN, and Ask Jeeves/Teoma all support this directive, possibly others as well. Check the specific search engines' robot information page for up-to-date info.
Example robots.txt entry:
User-agent: msnbot
Crawl-delay: 90
Disallow: /cgi-bin
This may help with some of the problems described above.
Jim
I notice that both MSN & Yahoo specify it as "Crawl-delay" whilst Ask Jeeves (Teoma) has it as "Crawl-Delay". Let's hope that this will not cause issues although, to be fair, Teoma has never been an issue on my site, only the first two.
Continuing to be OT (sorry), I'd like to see a robots protocol 2.0 come out that would make some of the new directives part of a standard. Things like crawl-delay, noarchive and URL wildcards. Based on this thread, for example, Google does not recognize crawl-delay (but implements wildcards and noarchive). MSN has crawl-delay and noarchive, but not wildcards.
URL wildcards.
You're fine and your * for robots *is* part of the protocol and, in fact, Google uses it in their own robots.txt [google.com]. Google, however, has an extension that allows
Disallow: *.gif
That extension is not recognized, as far as I can see, by MSN. According to [google.com...]
Additionally, Google has introduced increased flexibility to the robots.txt file standard through the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a name....
To remove all files of a specific file type (for example, .gif), you'd use the following robots.txt entry:
User-agent: Googlebot
Disallow: /*.gif$
To remove dynamically generated pages, you'd use this robots.txt entry:
User-agent: Googlebot
Disallow: /*?
It's that extension that I don't believe is observed by MSN or others.
Dayo_UK, in the middle of one of the vast Google Jagger-update threads, reported how the G-Mozilla bot had brought his DB crashing to the ground with 20 page-requests/second across an extended period. This routine could have saved his site from going down.
Yesterday, I got a brand-new Japanese bot taking a page every 11 seconds for 18 hours. Here are the first few accesses:
133.9.238.95 - - [05/Nov/2005:10:47:03 +0000] "GET /robots.txt HTTP/1.1" 200 39 "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:04 +0000] "HEAD / HTTP/1.1" 200 - "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:04 +0000] "HEAD /index.html HTTP/1.1" 301 - "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:05 +0000] "HEAD /index.htm HTTP/1.1" 301 - "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:06 +0000] "HEAD /Default.htm HTTP/1.1" 404 - "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:16 +0000] "GET / HTTP/1.1" 200 38623 "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:28 +0000] "GET /favicon.ico HTTP/1.1" 200 1406 "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:38 +0000] "GET /style.css HTTP/1.1" 200 10325 "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
133.9.238.95 - - [05/Nov/2005:10:47:49 +0000] "GET /mfcs.php HTTP/1.1" 200 40931 "-" "e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)" In:- Out:-:-pct.
Found you while searching for data on the e-SocietyRobot ... it's crawled my site today. It did follow robots.txt, but it's still not polite to crawl ALL of a site, uninterrupted, at a rate of a page every 11 seconds.
I put the crawl-delay into my robots.txt file and will be adding your botblocking routine within the next few minutes.
Many thanks!
AlexK, in your newest version (2005-11-20), thanks for pointing out "// It is best if _B_DIRECTORY is not web-accessible", as mine was.
I'm not a programmer and was wondering about the difference from an earlier version (Oct., I think) which had the line $bTotBlock = 3600; // secs; period to block long-duration scrapers, but the 2005-11-20 version does not use $bTotBlock?
thanks.
So I used my older version and changed $bTotVisit to 1000 from 777. Could this be related to my "$bTotBlock" question above? Line 56 is:
echo "$visits visits from your IP-Address within the last ". (( int ) $duration / 3600 ) ." hours. Please wait ". (( int ) ( 86400 - $duration ) / 3600 ) ." hours before retrying.</p></body></html>";
<edit>
FYI I'm talking about the 2005-11-20 version
</edit>
dbar:
ops just uploaded and had an error
I've just done a file comparison on my live code and the code snippet, and there are no differences. So, my apologies, dbar, but you have made an oops when you copied/edited the new code.
was wondering about the difference from an earlier version ... the 2005-11-20 version does not use $bTotBlock?
The previous code blocked 2 slow scrapers (both "good" bots) and I then discovered that each bot's programmed behaviour was to make another GET request every 10 (or however many) secs. That was bad enough (8,400 wasted requests), but the previous behaviour was also to re-program $fileATime and $fileMTime in such a way that the slow scrapers would effectively have been blocked forever, or until $bTotBlock secs after they stopped.
So, I re-thought that bit out, and the new code now allows $bTotVisit visits from any IP within a 24 hour period. It does this by re-setting $fileMTime to the current time every 24 hours.
That works fine on my site - it is fairly busy (about 10,000 pages daily), and just one bot gets blocked occasionally:
Blocked IPs:
* 64.124.122.228 [ 64.124.122.228.gw.xigs.net ] 1000 line(s)
1000 total lines in log-file.
.
64.124.122.228 21/11/2005 02:59:26 RufusBot (Rufus Web Miner; [64.124.122.252...] (slow scraper) 1
64.124.122.228 21/11/2005 02:58:38 RufusBot (Rufus Web Miner; [64.124.122.252...] (slow scraper) 1
64.124.122.228 21/11/2005 02:57:50 RufusBot (Rufus Web Miner; [64.124.122.252...] (slow scraper) 1
I am content with that. You will likely want to adjust $bTotVisit to a suitable figure for your site. It would perhaps also be a good idea to allow the 24-hour period to be a variable; I shall change that soon-ish.
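(For anyone who wants to experiment before that change appears, the edit would presumably be as small as the following - $bPeriod is my own name, not one from the script:)
$bPeriod = 86400;    // secs; long-term counting period (currently hard-coded as 24 hours)
// ... rest of the routine unchanged ...
if( $duration > $bPeriod ) {    // period over; start the count again
    $fileMTime = $fileATime = $time;
    $duration  = $visits = 1;
}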
I've set $bInterval = 20 and $bMaxVisit = 4 and keep hitting the refresh button and jumping to different pages, but I'm not getting banned. Shouldn't I be restricted to 4 in 20 seconds with this configuration? Thanks.
dbar, two things to make sure of when testing the script.
The first is that you are refreshing your pages and not using cached ones - use Ctrl+F5 to refresh the page. Second, delete your ipfile at the start of each test. If you leave the ipfile sitting for a while, then the longer it is left, the less likely it is to meet the (( $visits / $duration ) > ( $bMaxVisit / $bInterval )) test.
I did wonder initially why I could not get the fast-scraper part to trip, until I looked at what was actually happening with the $fileATime & $fileMTime values during a fast scrape. I realised the reason it was not tripping was that I had left the ipfile for too long. I thought this might be a problem with the script if someone visited your site first and then came back later to scrape it, but decided that they would still get caught by the $bTotVisit page limit, which for me is acceptable.
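(To put numbers on that: suppose the ipfile already existed and was last reset ten minutes before the refresh test. The figures below are only illustrative, and the variable names are just for the worked example.)
// illustrative arithmetic, not part of the script
$visits   = 5 + 20;                      // 5 earlier visits plus 20 rapid refreshes just now
$duration = 600;                         // the counting period started ten minutes ago
$hitRate  = $visits / $duration;         // = 0.042 visits/sec
$tripRate = $bMaxVisit / $bInterval;     // = 4 / 20 = 0.2 visits/sec with the values quoted above
// $hitRate < $tripRate, so the fast-scraper test never fires until the ipfile
// is deleted or the 24-hour rollover resets the period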
(the routine updates $ipFile for each visit) ... In practice, most accesses to your site--unless it is very busy--will cause a rollover of the dates (due to $time - $fileMTime > 86400), causing $fileATime = $fileMTime = $time to be tripped, and thus a start-over.
The default trip-rate is 2 / sec ($bMaxVisit / $bInterval = 14 / 7), which means that, in the absolute worst-case scenario, a fast scraper will get 999 pages in 500 secs. That is a lot, but a reasonable compromise without using some form of whitelist-exclusions (my attitude is a-scraper-is-a-scraper whether they are wearing an admittance-badge or not; it is their behaviour which determines their status).
PS: This thread is very current to events on WebmasterWorld [webmasterworld.com] (the link, and following pages, have some excellent alternative suggestions for those with ultra-busy sites).
$bInterval = 3;    // secs; check interval (best < 30 secs)
$bMaxVisit = 2;    // Maximum visits allowed within $bInterval
After the ban was over I could refresh pages even 20 times a second, and the routine to stop fast bots does not seem to work any more. So the routine (fast bots) works once and then not again.
I wonder why, and could something be done so that this does not happen?
Thanks
So the routine (fast bots) works once and then not again
With your values ($bInterval= 3, $bMaxVisit= 2), the trip ratio is 1.5. After being tripped, $ipFile would be set to:
$fileMTime = refTime - 3
$fileATime = refTime + 97
$fileMTime = refTime - 3
$fileATime = refTime + 98
$visits = 101
$duration = 3
Visit 1:
$fileMTime = refTime - 3
$fileATime = refTime + 98
$visits = 101
$duration = 68
Visit 2:
$fileMTime = refTime - 3
$fileATime = refTime + 99
$visits = 102
$duration = 68
Visit 3:
$fileMTime = refTime - 3
$fileATime = refTime + 100
$visits = 103
$duration = 68
So, after being tripped, it would take just 3 more visits (in the same second) to trip the block again. I have no idea why you cannot do this on your system, although I would advise you that your value for $bMaxVisit is unreasonably low. $bInterval = 18 and $bMaxVisit = 12 might be better, else you may be triggering blocks continually.
Do you have the latest code? It is marked "2005-11-20" at the top.
Finally, the max any scraper can take is $bTotVisit (default 1,000 pages) in any 24 hour period, slow or fast.