Copy and paste this script, save it as ban_bot.cgi, and place it in your cgi-bin.
####################### ######################## ######################## ############
#!/usr/local/bin/perl
# Browser Agents Banned
@browser = ("Wget/1.6","Zeus","EmailSiphon"); # List of Banned Agents - Add as many as you like
# Get Browser Agent Info
$get_agent = $ENV{'HTTP_USER_AGENT'}; # Get Browser Agent - Requires SSI
# Check Against Ban List
foreach $ban (@browser) {
    if ($get_agent =~ /\Q$ban\E/) { # \Q...\E matches the entry literally, so "." is not a wildcard
        $punish = 1;
    }
}
if ($punish == 1) { # Banned agent is placed into an infinite loop
    while (1) {
        $x++;
    }
}
else {
    print "Content-type: text/html\015\012\015\012\n\n"; # The innocent are set free
}
# End Script
Place <!--#include virtual="/cgi-bin/ban_bot.cgi"--> at the top of each page on your site and that's it!
I've tested this with my best snooping software and it works great. You may freely use and modify this script as you see fit although I would ask that users post any improvements made so that the rest of us can benefit from them.
I inserted the script into my ax.pl file (AXS) and let it run overnight. I threw an e-mail auto responder into it so I could see the results. I got one hit from a bot:
The following session generated a banned browser agent error:
Host: 65.165.71.2
Agent: Wget/1.6
Referrer: [none]
Document: [mydomain.com...]
Time: Saturday, 09-Jun-2001 05:02:54 EDT
Notice that Wget gave up after one hit?
>>>How about replacing the infinite loop with a 'sleep' command to delay delivery of the page long enough so that the spider loses interest, but not so long as to leave too many processes hanging around? <<<
Definitely worth a shot. I'll test it out. Also, I let about three of my own spiders rip against this script simultaneously and the script did not inhibit real traffic in any noticeable way.
>>>it looks like you can with mod rewrite:<<<
Not on two servers I tried. I tested dozens of possibilities with .htaccess (I do know .htaccess commands) but to no avail. If the host server won't allow it, it ain't happening.
>>>How about redirecting the spider to, say ... Google ;)<<<
It's possible. The only drawback is that it will only work with a real browser and not the snooping software I was throwing at it. :)
The sleep() won't consume CPU cycles, but it keeps the connection alive, which means one of the httpd forks is used up needlessly.
Why not spit out "Server too busy" and exit? You free up your resources. Of course, you probably feel that this also frees up the bad bot to try another page. True, but I still think that overall, it consumes more of the bot's resources than your resources.
If you have any keywords in your URL, on some bots you still might get some benefit from being listed, even with a page that the bot thinks is only "Server too busy." If someone using the bot's index sees your URL and clicks on it, you still get the hit and they get the real page. If the bot shows the "Server too busy" under the URL on their SERP, it looks like it was the bot's fault, not yours, and if the URL is interesting a searcher might still click on it.
Here is an example of blocking by referrer:
SetEnvIfNoCase Referer "^http://www.iaea.org" spam_ref=1
<FilesMatch "(.*)">
Order Allow,Deny
Allow from all
Deny from env=spam_ref
</FilesMatch>
A User-Agent block:
SetEnvIfNoCase User-Agent "Wget" no_spam=1
<FilesMatch "(.*)">
Order Allow,Deny
Allow from all
Deny from env=no_spam
</FilesMatch>
Often I just give the bots I don't like a blank page with no links to follow. The logic behind that is that my server load is more critical than playing with the bad bots.
I specifically asked for it because it's so useful for many things.
I tried that, Littleman, but it wouldn't work. I also tried BrowserMatch and the rest. It gives me a 500 error.
Have you been hit by MSProxy/2.0 yet? It's a LEECH. It hits the same page over and over and over... well, you get the idea. Hundreds of times over the course of a day, and I'm tired of banning the IPs.
I've run searches on Google for new tricks to this problem and it appears I'm not alone. Others have the same problems with .htaccess modifications. I assume it's controlled by the host.
Leading angle brackets have been converted to braces for purposes of posting; I recommend a single space for the title:
Content-type: text/html
[blank line required here]
[a flush of standard output is recommended here; it seems to help in some cases with timing and keeps Apache happy on Linux]
{html>{head>{title> {/title>{/head>{body>
{br>Server too busy{/br>{/body>{/html>
[send a newline to stdout if you haven't already]
[exit the script]
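For anyone who wants to see those steps end to end, here is a minimal sketch in Python rather than Perl. The function names build_busy_page and send_busy_page are made up for illustration, and the real angle brackets are restored. It is only one way to arrange the steps listed above, not a tested drop-in for any particular server.

```python
import sys


def build_busy_page() -> str:
    """Return the full CGI response: header, required blank line, tiny HTML body."""
    # The header must be followed by an empty line before any HTML.
    header = "Content-type: text/html\r\n\r\n"
    # Title is a single space, as recommended above; body ends with a newline.
    body = (
        "<html><head><title> </title></head><body>\n"
        "<br>Server too busy</body></html>\n"
    )
    return header + body


def send_busy_page(out=sys.stdout) -> None:
    """Write the response and flush so nothing is left sitting in a buffer."""
    out.write(build_busy_page())
    out.flush()


if __name__ == "__main__":
    send_busy_page()  # the script ends right after this, freeing the server process
```

The point of returning immediately is the one made earlier in the thread: the bot gets a useless page and the httpd process is released right away.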
#!/usr/local/bin/perl
# Browser Agents Banned
@browser = ("Wget/1.6","Webster Pro V2.9 Win32","EmailSiphon","MSProxy/2.0"); # List of Banned Agents
# Allow E-mail
$email = 1; # 0 = E-mail off; 1 = E-mail on
# Get Browser Agent Info
$get_agent = $ENV{'HTTP_USER_AGENT'}; # Get Browser Agent - Requires SSI
# Check Against Ban List
foreach $ban (@browser) {
    if ($get_agent =~ /\Q$ban\E/) { # \Q...\E matches the entry literally
        $punish = 1; # Deny Speedy Access To Browser Agent
        if ($email == 1) {
            open (MAIL,"| /usr/lib/sendmail -t"); # Path to mail bin (may be different)
            print MAIL "To: webmaster\@mydomain.com\n";
            print MAIL "From: ban_bot\@mydomain.com\n";
            print MAIL "Subject: BANNED BROWSER AGENT ERROR\n";
            print MAIL "Reply-to: webmaster\@mydomain.com\n";
            print MAIL "X-Priority: 1 (Highest)\n\n";
            print MAIL "\n";
            print MAIL "The following session generated a banned browser agent error:\n";
            print MAIL "\n";
            print MAIL "Host: $ENV{'REMOTE_ADDR'}\n";
            print MAIL "Agent: $get_agent\n";
            print MAIL "Referrer: $ENV{'HTTP_REFERER'}\n";
            print MAIL "Document: $ENV{'SERVER_NAME'}$ENV{'DOCUMENT_URI'}\n";
            print MAIL "Time: $ENV{'DATE_LOCAL'}\n";
            print MAIL "\n";
            print MAIL "------------------------------------------------------\n";
            close (MAIL);
        }
    }
}
if ($punish == 1) { # Banned agent is put to sleep
    sleep(60);
}
else {
    print "Content-type: text/html\015\012\015\012\n\n"; # The innocent are set free
}
############
# End Script
The -t option must be included for either of these to work. Example: "/usr/lib/sendmail -t"
If you don't have a POP3 e-mail account, replace the To:, From:, and Reply-to: e-mail addresses with an e-mail address you are allowed to use.
Here is an update on the effectiveness of the script. I have been hit by each of the following agents in at least 5 different sessions from separate IP addresses. I now have a better understanding of how each bot behaves.
Wget gives up after one request.
lwp-trivial gives up after 3 requests.
Zeus Webster Pro V2.9 Win32 waits it out and will acquire the page. One solution I am trying is increasing the sleep(60) to sleep(120).
I'm still exploring different ways to punish a banned agent and will post improvements as I find them.
Be careful what you put in the banned browser list. This script uses string matching, which, if not used carefully, could easily affect innocent users. E.g., if you ban "Zeus" you may end up banning zeus.com employees. A better solution would be to ban "Zeus Webster Pro V". This way you can be reasonably assured that only the bot is being banned.
<Added> It's been a lot more quiet lately!:) </added>
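To illustrate the warning about overly broad ban strings, here is a small sketch in Python (not the Perl used in the script). The function is_banned and the sample agent "Mozilla/4.0 (compatible; ZeusBrowser)" are made up for the example. It assumes plain substring matching, which is what the script is described as doing.

```python
# Ban list entries are treated as literal substrings, not regular expressions.
BANNED_AGENTS = ["Wget/1.6", "Zeus Webster Pro V", "EmailSiphon"]


def is_banned(user_agent, banned=BANNED_AGENTS) -> bool:
    """Return True if any ban entry appears verbatim inside the agent string."""
    agent = user_agent or ""  # a missing User-Agent header is treated as empty
    return any(fragment in agent for fragment in banned)


if __name__ == "__main__":
    # Hypothetical innocent agent that merely contains the word "Zeus".
    innocent = "Mozilla/4.0 (compatible; ZeusBrowser)"
    print(is_banned(innocent, ["Zeus"]))          # broad entry catches it
    print(is_banned(innocent))                    # narrower entry does not
    print(is_banned("Zeus Webster Pro V2.9 Win32"))
```

The narrower the entry, the less chance of locking out a real visitor, at the cost of missing a bot that changes its version string.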
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Telesoft [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3.Mozilla/2.01 [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.*$ /badspammer.html [L]
The basic idea is ... redirect a harvester to a page like this. The page generates hundreds of random e-mail addresses, which the harvester has to waste time processing and adding. There are also links at the top of the page that lead to other dynamic pages with hundreds of bogus addresses, and so on, and so on...
The goal is to corrupt the harvester with lots of bogeys, and if you have a link to this at the top & bottom of your page, along with an .htaccess redirect for the appropriate agents, they may never actually get your real e-mail addresses.
There is a mild server load with this one, but not much, considering it's a small perl script with no file/db access, and no number crunching.
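A rough idea of what such a generator could look like, sketched in Python rather than Perl. This is not the actual script behind the page described above. The function names and the use of the reserved ".invalid" domain (so no real mailbox is ever produced) are my own assumptions, and the links to further dynamic pages are left out.

```python
import random
import string


def bogus_address(rng: random.Random) -> str:
    """Make one random address under the reserved .invalid top-level domain."""
    user = "".join(rng.choices(string.ascii_lowercase, k=rng.randint(5, 10)))
    host = "".join(rng.choices(string.ascii_lowercase, k=rng.randint(5, 10)))
    return f"{user}@{host}.invalid"


def bogus_page(count: int = 200, seed=None) -> str:
    """Build an HTML page holding `count` mailto links to bogus addresses."""
    rng = random.Random(seed)  # pass a seed only if you want repeatable output
    links = []
    for _ in range(count):
        address = bogus_address(rng)
        links.append(f'<a href="mailto:{address}">{address}</a><br>')
    return "<html><body>\n" + "\n".join(links) + "\n</body></html>\n"


if __name__ == "__main__":
    print("Content-type: text/html\r\n\r\n" + bogus_page(), end="")
```

Like the original, it needs no file or database access, so the load per request should stay small.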
2) Server redirect for your e-mail addresses. This one is _really_ elegant; I learned it from Julian Haight (SpamCop):
Look at:
[julianhaight.com...] and click on his e-mail address, notice the link before you click on it.
The link is a page link which generates a 301 redirect (Moved Permanently), and the new address is the mailto: e-mail link.
I don't think too many spam harvesters would be smart enough to figure this one out.
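Here is a minimal sketch of how a CGI script might send that kind of redirect, in Python rather than Perl. I have not seen the script behind the page mentioned above, so this is only a guess at the general shape. The function name mailto_redirect is made up, and the address is the placeholder used elsewhere in this thread.

```python
def mailto_redirect(address: str) -> str:
    """Return CGI headers that answer with a 301 pointing at a mailto: link."""
    # Refuse line breaks so the address cannot smuggle in extra headers.
    if "\r" in address or "\n" in address:
        raise ValueError("address must not contain line breaks")
    return (
        "Status: 301 Moved Permanently\r\n"
        f"Location: mailto:{address}\r\n"
        "\r\n"  # blank line ends the headers; no body is needed
    )


if __name__ == "__main__":
    print(mailto_redirect("webmaster@mydomain.com"), end="")
```

The page itself then links to the script's URL instead of a mailto: address, so the address never appears in the HTML a harvester scrapes.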
Keep in mind both agents WERE blocked...
Emailwolf - returned error code 302 - no e-mail
Wget - returned 200 (ok ) but did block it - did get email notification..odd...anyone?
Just to make sure, I decided to download EmailWolf and test it myself. I blocked agent "Emailwolf" and unleashed it on my site. I got the following e-mail:
The following session generated a banned browser agent error:
Host: [My IP Omitted]
Agent: EmailWolf 1.00
Referrer:
Document: [mydomain.com...]
Time: Sunday, 24-Jun-2001 10:46:26 EDT
The EmailWolf software was unable to retrieve any e-mail addresses and gave up after one attempt.
Were you hit by a different version of EmailWolf?