Gorufu, littleman, Air, SugarKane? Do you guys see any errors or better ways to do this? Anybody got a bot to add before I stick this in every site I manage?
Feel free to use this on your own site and start blocking bots too.
(the top part is left out)
<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule !^http://[^/.]\.your-site.com.* - [F]
I test for these agents in global.asa in the Session_OnStart event and send them to an explanation page that has no links it can follow.
Then I use a browscap.ini file that you can get from my website that has a special section for website strippers and other nasties.
You can get this browscap.ini file and soon some sample code from my personal website.
Oh, this is wonderful. I've just been to your web site. I don't think that blocking by I.P. would work, since I want to block some e-mail harvesters. I suppose I could keep an eye on their I.P. address and, if it's always the same, block it in IIS. But I'm definitely going to look into the browscap.ini on your site and go from there. What a relief!
Thanks again,
Snark
Before I add this to my .htaccess I want to make sure this last entry from SuperMan is valid and that there are no valid search engine robots among them.
I am most interested in stopping any bad bots and email harvesters.
Does this list stop most of the major players? Does it also stop that atomic energy one, iaea? I also need to check that it does not stop any legitimate search engines, including Alexa.
[1]RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Siphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F]
I noticed today that 95.6% of my visitors are using Explorer and Netscape, and the Google bot accounts for about 2% (total 97.6%). I'm beginning to wonder if it would be easier to allow only folks who are using selected browsers to visit my site, instead of trying to block all the undesired ones. Maybe I would redirect users of unacceptable browsers to a page telling them I only support Explorer and Netscape.
Thoughts on this?
I don't think that's a good idea. I do understand the need to ban those email harvesters and offline browsers, but allowing only known browsers is not the way to go.
Could someone please provide a nice htaccess list and, let's say,
update it here every one or two months?
By the way, there are two other bots I'm concerned about.
One is called turnitinbot, from turnitin.com,
and the other one was also from one of these brand control bots.
I see them showing up more and more.
Shouldn't we include them as well?
Nick
A couple of points first, though...
The [1] at the beginning of the first line is spurious, and should be removed, leaving the line reading:
RewriteEngine On
As it stands, this example code will generate a 403-Forbidden response. You can also configure it to respond with other error codes, or with permanent or temporary redirects to other pages on your own site or elsewhere. I strongly encourage you to read the documentation [httpd.apache.org] on mod_rewrite, whether you plan to "tweak" this example code or not. As stated in the documentation, mod_rewrite is a powerful tool; and as such, it is also a dangerous tool. Some time spent "reading the fine manual" may save you a lot of grief in the future.
By changing the final line to:
RewriteRule ^.* - [F,L]
you can minimize interactions with following rewrite rulesets, and also minimize CPU overhead for processing. The "L" tells mod-rewrite that this is the last rule that needs to be processed in this case, and to stop rewriting as soon as it is processed.
You can customize the 403-Forbidden page returned to the bad-bot (keeping in mind that at some time, as you modify this, you might introduce an error and catch an innocent person instead) to explain what happened and what to do about it. To do this, add:
ErrorDocument 403 /my403.html
at the beginning of the example code, and then create a custom 403 error page (called "my403.html" in this example.)
All RewriteCond's in this example are case-sensitive. This leaves it open to a few more errors as you maintain the file. To make the pattern-match case-insensitive, change the [OR] flag at the end of each line to [NC,OR]. Note also that the [OR] must not be included on the very last RewriteCond - the one directly preceding the RewriteRule. If it is, you'll lock up your server, and you and your users will get 500-Server Error responses to all requests. (After changing anything in your .htaccess file, it's a very good idea to access your own site, and make sure it still works!)
All RewriteConds in this example assume that the user-agent starts with the pattern of characters shown (that's what the "^" character means). Some user-agent strings do not start with the "bad-bot" user agent string; they start with something common like "Mozilla/3.01" and then contain the bad-bot identification further on in the string. To catch these guys, you will need to remove the starting text anchor "^" from the pattern match string. This makes the pattern matching less efficient, and should only be done if necessary.
Here's one example that I know needs to be changed:
RewriteCond %{HTTP_USER_AGENT} Indy.Library [NC,OR]
Note that I removed the starting "^", so that it will ban any user-agent with "Indy Library" anywhere in its user-agent string, and that I will accept any character - including a space - after "Indy".
Again - Yes, you can cut 'n paste this into your .htaccess file - at your own risk. I recommend that you minimize this risk by reading the mod_rewrite documentation.
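Pulling the points above together, a minimal sketch might look like the following. It uses only a few agents from the list for brevity. The RewriteCond that exempts /my403.html is an addition not mentioned above, included on the assumption that the error page itself must stay reachable for the custom 403 page to be served.

```apache
# Custom page returned with the 403 response ("my403.html" as in the example above)
ErrorDocument 403 /my403.html

RewriteEngine On
# Assumption, not from the post above: let the error page itself through,
# otherwise the banned client is also forbidden from fetching it
RewriteCond %{REQUEST_URI} !^/my403\.html$
# Start-anchored, case-insensitive matches, OR-ed together
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [NC,OR]
# Unanchored: matches "Indy Library" anywhere in the user-agent string
RewriteCond %{HTTP_USER_AGENT} Indy.Library [NC,OR]
# Last condition before the rule: no [OR] flag
RewriteCond %{HTTP_USER_AGENT} ^Zeus [NC]
RewriteRule ^.* - [F,L]
```

After saving a change like this, request a page from your own site with a normal browser to confirm it still loads.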
Hope this helps,
Jim
I've posted this before, but just in case:
mod-rewrite (and many related Apache modules) depend on "regular expressions" for pattern-matching. You can find a short and useful tutorial here [etext.lib.virginia.edu] on the University of Virginia Library Web site.
This is a big help in figuring out ^(what\ all\ the\ strange\ characters\ in\ mod_rewrite\ directives\ mean|how\ to\ write\ them\ correctly)\.$
;)
Jim
This is a big help in figuring out ^(what\ all\ the\ strange\ characters\ in\ mod_rewrite\ directives\ mean|how\ to\ write\ them\ correctly)\.$
Shouldn't it read: This is a big help in figuring out (?:(?:^.*what\ all\ the\ strange\ characters\ in\ mod_rewrite\ directives\ mean.*how\ to\ write\ them\ correctly)|(?:^.*how\ to\ write\ them\ correctly.*what\ all\ the\ strange\ characters\ in\ mod_rewrite\ directives\ mean))\.$
Setup
Server: Apache/1.3.26 (Unix) mod_ssl/2.8.10
OpenSSL/0.9.6d PHP/4.2.1
Linux version 2.2.19-7.0.16
Detected 467741 kHz processor.
Memory: 257496k/262080k available (1076k kernel code,
416k reserved, 3020k data, 72k init, 0k bigmem)
128K L2 cache (4 way)
CPU: L2 Cache: 128K
CPU: Intel Celeron (Mendocino) stepping 05
#!/usr/bin/perl
use LWP::UserAgent;
use LWP::Simple;
use Time::HiRes qw(gettimeofday);

$url = "http://server/root/test.html";

# Time the same request under three different user-agent strings
foreach $agent (qw(BlackWidow Zeus AaronCarter)) {
    for ($j = 0; $j < 10; $j++) {
        $ua = new LWP::UserAgent;
        $ua->agent($agent);
        $t0 = gettimeofday;
        # Request document and parse it as it arrives
        # (99 requests per run; the empty callback discards the content)
        for (my $i = 1; $i < 100; $i++) {
            $res = $ua->request(HTTP::Request->new(GET => $url),
                                sub { });
        }
        $t{$agent} += gettimeofday() - $t0;
    }
    # $j is 10 after the loop, so the 10-run total is divided by 11
    $t{$agent} = $t{$agent} / ($j + 1);
}

print map { $_, ' needed ', $t{$_}, ' seconds.', "\n" } sort keys %t;
#!/usr/bin/perl
use LWP::UserAgent;

$url = "http://server/www.pension-schafspelz.de/";
$ua = new LWP::UserAgent;

# Request document and parse it as it arrives
# (endless loop that keeps requesting the same URL)
while (1) {
    $res = $ua->request(HTTP::Request->new(GET => $url),
                        sub { });
}
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [OR]
RewriteCond %{HTTP_USER_AGENT} Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} larbin [OR]
RewriteCond %{HTTP_USER_AGENT} LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} Siphon [OR]
RewriteCond %{HTTP_USER_AGENT} SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} Wget [OR]
RewriteCond %{HTTP_USER_AGENT} Widow [OR]
RewriteCond %{HTTP_USER_AGENT} Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} Zeus
RewriteRule .* - [F,L]
Results
[af@server SuchmaschinenTricks]$ ./bm
AaronCarter needed 2.03904145414179 seconds.
BlackWidow needed 1.89269917661493 seconds.
Zeus needed 1.90201771259308 seconds.
[af@server SuchmaschinenTricks]$ ./bm
AaronCarter needed 2.05427220734683 seconds.
BlackWidow needed 1.90449017828161 seconds.
Zeus needed 1.91795318776911 seconds.
[af@server SuchmaschinenTricks]$ ./bm
AaronCarter needed 2.04453534429724 seconds.
BlackWidow needed 1.89828474955125 seconds.
Zeus needed 1.90684572133151 seconds.
httpd.conf, multiple RewriteCond, idle server
[af@server SuchmaschinenTricks]$ ./bm
AaronCarter needed 1.48258856209842 seconds.
BlackWidow needed 1.41852938045155 seconds.
Zeus needed 1.4474944526499 seconds.
[af@server SuchmaschinenTricks]$ ./bm
AaronCarter needed 1.47204467383298 seconds.
BlackWidow needed 1.40937690301375 seconds.
Zeus needed 1.42638698491183 seconds.
[af@server SuchmaschinenTricks]$ ./bm
AaronCarter needed 1.49393899874254 seconds.
BlackWidow needed 1.42769226160916 seconds.
Zeus needed 1.44513262401928 seconds.
[af@server SuchmaschinenTricks]$ ./bm
AaronCarter needed 1.77475028688257 seconds.
BlackWidow needed 1.69021615115079 seconds.
Zeus needed 1.59830655834892 seconds.
[af@server SuchmaschinenTricks]$ ./bm
AaronCarter needed 1.76538353616541 seconds.
BlackWidow needed 1.68273590911518 seconds.
Zeus needed 1.5909228108146 seconds.
[af@server SuchmaschinenTricks]$ ./bm
AaronCarter needed 1.77456200122833 seconds.
BlackWidow needed 1.69423974644054 seconds.
Zeus needed 1.60414087772369 seconds.
[af@server SuchmaschinenTricks]$ ./bm
AaronCarter needed 1.50137218562039 seconds.
BlackWidow needed 1.43611000884663 seconds.
Zeus needed 1.45189526948062 seconds.
[af@server SuchmaschinenTricks]$ ./bm
AaronCarter needed 1.48307919502258 seconds.
BlackWidow needed 1.41925389116461 seconds.
Zeus needed 1.43655191768299 seconds.
[af@server SuchmaschinenTricks]$ ./bm
AaronCarter needed 1.49346754767678 seconds.
BlackWidow needed 1.43283208933744 seconds.
Zeus needed 1.44978855956684 seconds.
K:\SuchmaschinenTricks>perl bm
AaronCarter needed 6.25990908796137 seconds.
BlackWidow needed 5.3813636302948 seconds.
Zeus needed 5.76372727480802 seconds.
K:\SuchmaschinenTricks>perl bm
AaronCarter needed 5.58627272735943 seconds.
BlackWidow needed 5.23381818424572 seconds.
Zeus needed 5.10827272588556 seconds.
K:\SuchmaschinenTricks>perl bm
AaronCarter needed 6.03227272900668 seconds.
BlackWidow needed 5.1229090907357 seconds.
Zeus needed 5.46418181332675 seconds.
K:\SuchmaschinenTricks>perl bm
AaronCarter needed 5.32499999349768 seconds.
BlackWidow needed 4.55199999159033 seconds.
Zeus needed 5.22927273403515 seconds.
Conclusion
.htaccess files have to be read each and every time a request is made. It takes longer to parse and compile the multiple RewriteCond directives than the one with the longer, single regular expression. Parsing is the bottleneck in this scenario.
The httpd.conf file is read once, when the server starts up. The regular expressions are compiled once. Multiple short and simple REs execute faster than the one single, complex one. Execution time is the factor that matters most in this case.
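The single-expression variant used in the test is not shown here. As an illustration only, with three agents taken from the list above, the two forms being compared look roughly like this:

```apache
# Form 1: one RewriteCond per agent, OR-ed together
RewriteCond %{HTTP_USER_AGENT} BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} Zeus
RewriteRule .* - [F,L]

# Form 2: a single RewriteCond using regex alternation
# (use one form or the other, not both)
RewriteCond %{HTTP_USER_AGENT} (BlackWidow|ChinaClaw|Zeus)
RewriteRule .* - [F,L]
```

Both forms forbid the same three agents; they differ only in how many directives have to be parsed and how complex each expression is.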
[edited by: jatar_k at 4:35 pm (utc) on Sep. 22, 2002]
[edit reason] stopped side scroll [/edit]
Excellent data...
I wonder what the result would be performance-wise with four RewriteCond lines: One each for start-anchored, end-anchored, fully-anchored, and unanchored pattern strings. I use this method to keep the patterns organized by type, and to keep them neat and easy to maintain.
Thanks for the test - Very useful.
Jim
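The four-line layout described above might be sketched as follows. The grouping of agents is hypothetical and only shows the shape of each pattern type; check the real user-agent strings in your logs before deciding which anchor applies.

```apache
# Start-anchored: user-agent begins with one of these
RewriteCond %{HTTP_USER_AGENT} ^(BlackWidow|ChinaClaw|WebZIP) [NC,OR]
# End-anchored: user-agent ends with this (hypothetical example)
RewriteCond %{HTTP_USER_AGENT} (Indy.Library)$ [NC,OR]
# Fully anchored: user-agent is exactly this (hypothetical example)
RewriteCond %{HTTP_USER_AGENT} ^(Zeus)$ [NC,OR]
# Unanchored: appears anywhere in the user-agent
RewriteCond %{HTTP_USER_AGENT} (WebWhacker|larbin) [NC]
RewriteRule .* - [F,L]
```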
If I'm reading your code correctly, then you're demonstrating two things:
a) The response time difference between the two .htaccess files is roughly 10%.
b) Each call takes between 0.002 and 0.006 seconds, depending on the load of the machine.
Combining those two, we're talking about an average additional overhead caused by multiple RewriteConds of 0.0004 seconds (1/2500 seconds) per request, on a somewhat aged machine.
Since you're always fetching the same file from the disk cache, the remaining server side overhead is very low (in contrast to a real life situation, where each request may cause a different set of files to be loaded from disk first), so we can assume that the difference is indeed caused by the different rule sets. The comments in your code talk about parsing the HTML, but I don't understand enough Perl to see if that really happens.
In summary, I'd prefer the maintenance friendly multiple-rule version any time, if it only costs me such a small price in terms of response time. Of course, I'm not serving millions of requests per day, so your mileage may vary.
comments in your code talk about parsing the HTML, but I don't understand enough Perl to see if that really happens.
No, there is no parsing going on, since it's irrelevant for the benchmarking.
on a somewhat aged machine
Don't insult my trusty old Linux box. I cannot guarantee that it won't launch a DoS attack on its own ;)
Since you're always fetching the same file from the disk cache, the remaining server side overhead is very low [...], so we can assume that the difference is indeed caused by the different rule sets
That was the idea behind the admittedly artificial setup.
I'd prefer the maintenance friendly multiple-rule version any time
As is quite often the case, it's a tradeoff between speed and maintainability.
I followed this thread for quite some time, since our server gets harassed by those "evil bots" as well. However, I couldn't quite decide to take action until jdMorgan's post, which was almost a how-to.
However, having done this, I ran into my first problem, because my Apache 1.3 says:
Options FollowSymLinks or SymLinksIfOwnerMatch is off which implies that RewriteRule directive is forbidden
Simply adding a "FollowSymLinks on" on top of the htaccess doesn't work.
Any advice?
On a side note: I found some packages like "Sugarplum" and "robotcop" which promise to automate some of the functions intended by this htaccess. Any experiences with these?
Options [httpd.apache.org] +FollowSymLinks
For that to work, you need to have at least
AllowOverride [httpd.apache.org] Options privileges. Those are set in the server config, virtual host context.
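As a sketch of how the two pieces fit together (the directory path /www/example is a placeholder, and FileInfo is listed on the assumption that the rewrite directives themselves live in .htaccess):

```apache
# In httpd.conf (server config or inside a <VirtualHost>):
# allow .htaccess files under this directory to set Options
# and to use mod_rewrite directives (which need FileInfo)
<Directory "/www/example">
    AllowOverride Options FileInfo
</Directory>

# In the .htaccess file, before the rewrite rules:
Options +FollowSymLinks
RewriteEngine On
```

If you cannot edit the server config yourself, this is the part to ask your host about.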
when spam comes in, i check the actual company site to get their genuine email addresses then manually add their email addresses to a mysql database. this means only genuine mailboxes get listed on the page and not the yahoo or hotmail addresses the spam is often sent from.
i've also added a simple browser check to the php page so that if an IE / Netscape / Opera user visits the page they will only see a normal forbidden message .... well, that's the theory, but i've not been able to test it yet. just need a "browser" or something that will let me set the UA to whatever i want .....
Thanks for the browscap.ini & sample IIS code, which I got from your site! I want to make sure I understand their use. I implement the browscap.ini (or whatever parts of it I want), and then I implement the code in global.asa for each robot I want to ban? Or just the one block of code gets revised to include each robot to be banned? (Or -- is every robot in the browscap.ini banned, so I should only include those which I wish to ban?) My partner is much more of a web programmer than I am and would take care of this, but I want to make sure I understand what needs to be done first!
Thanks a lot,
Snark
ErrorDocument 404 /404.htm
ErrorDocument 400 /404.htm
ErrorDocument 403 /404.htm
ErrorDocument 501 /404.htm
ErrorDocument 502 /404.htm
ErrorDocument 503 /404.htm
<FilesMatch "htm([l])*$">
ForceType application/x-httpd-php
</FilesMatch>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^InternetSeer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*Indy [OR]
RewriteCond %{HTTP_USER_AGENT} ^MSFrontPage [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^Ping [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^webcollage [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Siphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule !^http://[^/.]\.your-site.com.* - [F]
Thanks
Anni
First post!
Since I run a virtual server with a number of different domains, it seems to me it would make more sense to put my list of forbidden UAs in the httpd.conf file, rather than try to replicate them in .htaccess on each domain's document root. Are there any caveats or special directions I should follow before I proceed?
Thanks!
Randy
[edited by: jatar_k at 12:04 am (utc) on Sep. 16, 2002]
[edit reason] no sigs please [/edit]
As shown in post #77 [webmasterworld.com] putting your RewriteRules in httpd.conf is indeed faster and the way to go when you have access to it.
However, this will not solve the problem of applying those rules to all virtual servers. You cannot just put the rewriting code in the main section and expect it to work for all virtual servers. For an explanation on this see API Phases [httpd.apache.org] in the mod_rewrite URL Rewriting Engine documentation.
So, after [...] Apache has determined the corresponding server (or virtual server) the rewriting engine starts processing of all mod_rewrite directives from the per-server configuration in the URL-to-filename phase.
my emphasis
There's also a thread How (and Where) best to control access [webmasterworld.com] that you might want to read on this topic. If you have mod_perl you might want to use the solution mentioned in this thread. Ask carfac [webmasterworld.com] for the modified version of BlockAgent.
And as a sidenote. Do not drop any URLs. Do not use a signature.
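One possible way to share a single list across virtual hosts, sketched on the assumption that mod_rewrite's RewriteOptions inherit suits your setup (the host name and paths are placeholders, and only two agents are shown):

```apache
# Main server section of httpd.conf: the shared ban list
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule .* - [F,L]

# Each virtual host switches the engine on and asks to inherit
# the main server's conditions and rules
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /www/example
    RewriteEngine On
    RewriteOptions inherit
</VirtualHost>
```

Test each virtual host with a banned and a normal user-agent afterwards, since the rules only apply where the engine is switched on.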
# -FrontPage-
IndexIgnore .htaccess */.?* *~ *# */HEADER* */README* */_vti*
DirectoryIndex index.html index.htm index.php index.phtml index.php3
# AddType application/x-httpd-php .phtml
# AddType application/x-httpd-php .php3
# AddType application/x-httpd-php .php
#
# Action application/x-httpd-php "/php/php.exe"
# Action application/x-httpd-php-source "/php/php.exe"
# AddType application/x-httpd-php-source .phps
<Limit GET POST>
order deny,allow
deny from all
allow from all
</Limit>
<Limit PUT DELETE>
order deny,allow
deny from all
</Limit>
AuthName www.XXXXXX.com
AuthUserFile /www/XXXXXX/_vti_pvt/service.pwd
AuthGroupFile /www/XXXXXX/_vti_pvt/service.grp
Should I remove this before pasting bans or simply add?
Thank you
Can I use this Rewrite stuff to block FrontPage from downloading my site? (I know the educators can still get my stuff from their browser's cache, etc, etc, but it would be nice to make them work at stealing, rather than having it be so easy, ya know?)
Thanks!
[edited by: jatar_k at 4:44 pm (utc) on Mar. 13, 2003]