Ban User Agents With Perl/SSI -- Pesky Bots Will Go Away

Here's how to do it.

         

Froggyman

4:16 am on Jun 9, 2001 (gmt 0)



A lot of people have been asking how to block an agent name, so I decided to write my own script. For those of us who can't block an agent through .htaccess, here is a simple (yet efficient) way of doing it with SSI.

Cut and paste this file and save as ban_bot.cgi and place in your cgi-bin.


#!/usr/local/bin/perl

#######################
# Browser Agents Banned
@browser = ("Wget/1.6","Zeus","EmailSiphon"); # List of Banned Agents - Add as many as you like

########################
# Get Browser Agent Info
$get_agent = $ENV{'HTTP_USER_AGENT'}; # Get Browser Agent - Requires SSI

########################
# Check Against Ban List
$punish = 0;
foreach $ban (@browser) {
    if ($get_agent =~ /\Q$ban\E/) { # \Q...\E matches the entry literally, so "." is not a wildcard
        $punish = 1;
        last; # One match is enough
    }
}
if ($punish == 1) { # Banned agent is placed into an infinite loop
    while (1) {
        $x++;
    }
}
else {
    print "Content-type: text/html\n\n"; # The innocent are set free
}

############
# End Script

Place <!--#include virtual="/cgi-bin/ban_bot.cgi"--> at the top of each page on your site and that's it!

I've tested this with my best snooping software and it works great. You may freely use and modify this script as you see fit although I would ask that users post any improvements made so that the rest of us can benefit from them.

theperlyking

11:40 am on Jun 9, 2001 (gmt 0)

10+ Year Member



Wouldn't this clog up the web server? If you get a lot of hits from an aggressive agent (e.g. Wget, which tends to make many simultaneous requests), you could take down the server, since the Perl processes that are spawned never end.

sugarkane

1:26 pm on Jun 9, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>clog up the web server

How about replacing the infinite loop with a 'sleep' command to delay delivery of the page long enough so that the spider loses interest, but not so long as to leave too many processes hanging around?

Air

2:44 pm on Jun 9, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How about redirecting the spider to, say ... Google ;)

Just kidding of course

toolman

3:22 pm on Jun 9, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Air! That's a brilliant idea. Send them on their way to your competition's site ;)

theperlyking

3:28 pm on Jun 9, 2001 (gmt 0)

10+ Year Member



If it's running in an SSI you couldn't set a new location, but that would be fun.

Is it actually possible to block a user agent (i.e. instead of an IP) from .htaccess? Froggyman, you mention it, but I wasn't able to find out the syntax for it.

theperlyking

3:36 pm on Jun 9, 2001 (gmt 0)

10+ Year Member



Hm, answering my own question here - it looks like you can with mod_rewrite:

RewriteCond %{HTTP_USER_AGENT} ^(Wget|Zeus|EmailSiphon) [NC]
RewriteRule .* - [F]

Gives the UA a "403 forbidden"

Froggyman

4:23 pm on Jun 9, 2001 (gmt 0)



My current bot list:
@browser = ("Wget/1.6","Webster Pro V2.9 Win32","EmailSiphon","MSProxy/2.0");

I inserted the script into my ax.pl file (AXS) and let it run overnight. I threw an e-mail auto responder into it so I could see the results. I got one hit from a bot:

The following session generated a banned browser agent error:

Host: 65.165.71.2
Agent: Wget/1.6
Referrer: [none]
Document: [mydomain.com...]
Time: Saturday, 09-Jun-2001 05:02:54 EDT

Notice that Wget gave up after one hit?

>>>How about replacing the infinite loop with a 'sleep' command to delay delivery of the page long enough so that the spider loses interest, but not so long as to leave too many processes hanging around? <<<

Definitely worth a shot. I'll test it out. Also, I let about three of my own spiders rip against this script simultaneously and the script did not inhibit real traffic in any noticeable way.

>>>it looks like you can with mod rewrite:<<<

Not on two servers I tried. I tested dozens of possibilities with .htaccess (I do know .htaccess commands) but to no avail. If the host server won't allow it, it ain't happening.

>>>How about redirecting the spider to, say ... Google ;)<<<

It's possible. The only drawback is that it will only work with a real browser and not the snooping software I was throwing at it. :)

theperlyking

4:29 pm on Jun 9, 2001 (gmt 0)

10+ Year Member



The mod_rewrite solution does work if you have mod_rewrite available.

The loop of

while (1) {
$x++;
}

uses 100% CPU (or as much as it can get), which is not a good idea on a web server, so sleep would be a good idea :)

Everyman

4:50 pm on Jun 9, 2001 (gmt 0)



The while(1) loop is a disaster; it will drive up your server load substantially, and everything slows down (at least on Linux).

The sleep() won't consume CPU cycles, but it keeps the connection alive, which means one of the httpd forks is used up needlessly.

Why not spit out "Server too busy" and exit? You free up your resources. Of course, you probably feel that this also frees up the bad bot to try another page. True, but I still think that overall, it consumes more of the bot's resources than your resources.

If you have any keywords in your URL, on some bots you still might get some benefit from being listed, even with a page that the bot thinks is only "Server too busy." If someone using the bot's index sees your URL and clicks on it, you still get the hit and they get the real page. If the bot shows the "Server too busy" under the URL on their SERP, it looks like it was the bot's fault, not yours, and if the URL is interesting a searcher might still click on it.

Froggyman

4:58 pm on Jun 9, 2001 (gmt 0)



Everyman, how do I spit out a "Server too busy" and exit?

sugarkane

5:01 pm on Jun 9, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Assuming you mean a 'server too busy' HTTP status code, you can't via SSI - the headers will already have been passed to the browser by the time the SSI is called.

theperlyking

5:07 pm on Jun 9, 2001 (gmt 0)

10+ Year Member



You could reverse cloak the page: encase everything in conditional SSI directives, so bots get a blank page and normal users don't.
There's still a server hit, but probably no worse than putting a server busy message up via CGI.

littleman

5:08 pm on Jun 9, 2001 (gmt 0)



If you do have Apache but you do not have mod_rewrite, you will have mod_setenvif; it is compiled in by default. You can still do a lot with it, though it takes more lines of code.

Here is an example of blocking by referrer:
SetEnvIfNoCase Referer "^http://www.iaea.org" spam_ref=1
<FilesMatch "(.*)">
Order Allow,Deny
Allow from all
Deny from env=spam_ref
</FilesMatch>

A U_A block:

SetEnvIfNoCase User-Agent "Wget" no_spam=1
<FilesMatch "(.*)">
Order Allow,Deny
Allow from all
Deny from env=no_spam
</FilesMatch>

Often I just give the bots I don't like a blank page with no links to follow. The logic behind that is that my server load is more critical than playing with the bad bots.

toolman

5:13 pm on Jun 9, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Does mod_rewrite have to be installed on the server or can it be run like an rpm in your individual domain?

theperlyking

5:16 pm on Jun 9, 2001 (gmt 0)

10+ Year Member



It has to be installed on the server, I think. Having said that, the last host I asked said it was easy enough and doesn't require a recompile (I'd thought you had to compile Apache with it in). They had it available by the time my account was active an hour later.

I specifically asked for it because it's so useful for many things.

Froggyman

5:17 pm on Jun 9, 2001 (gmt 0)



>>>A U_A block:<<<

I tried that, Littleman, but it wouldn't work. I also tried BrowserMatch and the rest. It gives me a 500 error.

Have you been hit by MSProxy/2.0 yet? It's a LEECH. It hits the same page over and over and over... well, you get the idea. Hundreds of times over the course of a day, and I'm tired of banning the IPs.

I've run searches on Google for new tricks to this problem and it appears I'm not alone. Others have the same problems with .htaccess modifications. I assume it's controlled by the host.

Everyman

5:25 pm on Jun 9, 2001 (gmt 0)



You're already in a CGI script, so you can spit out a dynamic page to stdout (standard output):

Leading angle brackets have been converted to braces for purposes of posting; I recommend a single space for the title:

Content-type: text/html

[blank line required here]

[a flush of standard output is recommended here; it seems to help in some cases with timing and keeps Apache happy on Linux]

{html>{head>{title> {/title>{/head>{body>

{br>Server too busy{/br>{/body>{/html>

[send a newline to stdout if you haven't already]

[exit the script]
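
The outline above could be written in Perl roughly as follows. This is only a minimal sketch: the sub name busy_page is made up for illustration, and it assumes the script runs as a CGI. As noted earlier in the thread, when it is called through an SSI include it can only supply page content, not change the HTTP status code.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Build the complete CGI response: header, blank line, then a tiny page.
# (busy_page is a hypothetical name, not part of any library.)
sub busy_page {
    return "Content-type: text/html\n\n"
         . "<html><head><title> </title></head><body>\n"
         . "<br>Server too busy\n"
         . "</body></html>\n";
}

$| = 1;              # unbuffer STDOUT so the output is flushed right away
print busy_page();   # the script ends here, which frees the httpd child
```

Because nothing sleeps or loops, the process ends as soon as the page is printed.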

littleman

5:25 pm on Jun 9, 2001 (gmt 0)



That could be it Froggy, you could be being blocked from these modifications. I've had to deal with that in the past. That sucks.

Froggyman

5:28 pm on Jun 9, 2001 (gmt 0)



Updated Script


#!/usr/local/bin/perl

# Browser Agents Banned
@browser = ("Wget/1.6","Webster Pro V2.9 Win32","EmailSiphon","MSProxy/2.0"); # List of Banned Agents

# Allow E-mail
$email = 1; # 0 = E-mail off; 1 = E-mail on

# Get Browser Agent Info
$get_agent = $ENV{'HTTP_USER_AGENT'}; # Get Browser Agent - Requires SSI

# Check Against Ban List
$punish = 0;
foreach $ban (@browser) {
    if ($get_agent =~ /\Q$ban\E/) { # \Q...\E matches the entry literally
        $punish = 1; # Deny Speedy Access To Browser Agent
        if ($email == 1) {
            open (MAIL,"| /usr/lib/sendmail -t"); # Path to mail bin (may be different)
            print MAIL "To: webmaster\@mydomain.com\n";
            print MAIL "From: ban_bot\@mydomain.com\n";
            print MAIL "Subject: BANNED BROWSER AGENT ERROR\n";
            print MAIL "Reply-to: webmaster\@mydomain.com\n";
            print MAIL "X-Priority: 1 (Highest)\n\n";
            print MAIL "\n";
            print MAIL "The following session generated a banned browser agent error:\n";
            print MAIL "\n";
            print MAIL "Host: $ENV{'REMOTE_ADDR'}\n";
            print MAIL "Agent: $get_agent\n";
            print MAIL "Referrer: $ENV{'HTTP_REFERER'}\n";
            print MAIL "Document: $ENV{'SERVER_NAME'}$ENV{'DOCUMENT_URI'}\n";
            print MAIL "Time: $ENV{'DATE_LOCAL'}\n";
            print MAIL "\n";
            print MAIL "------------------------------------------------------\n";
            close (MAIL);
        }
        last; # Stop after the first match so only one e-mail is sent
    }
}
if ($punish == 1) { # Banned agent is put to sleep
    sleep(60);
}
else {
    print "Content-type: text/html\n\n"; # The innocent are set free
}
############
# End Script

Froggyman

7:08 pm on Jun 9, 2001 (gmt 0)



Any thoughts on using <STDIN> instead of sleep?

bartek

8:54 pm on Jun 16, 2001 (gmt 0)

10+ Year Member



Nice one, Froggyman... why wouldn't email work though? Path is good, any ideas? Relaying?

Froggyman

10:12 pm on Jun 16, 2001 (gmt 0)



Most servers will use: /usr/lib/sendmail
Some servers will use: /usr/sbin/sendmail

The -t option must be included for either of these to work. Example: "/usr/lib/sendmail -t"

If you don't have a POP3 email account, replace the To:, From:, and Reply-to: email addresses with an email address you are allowed to use.

Here is an update on the effectiveness of the script. I have been hit by each of the following agents in at least 5 different sessions from separate IP addresses. I now have a better understanding of how each bot behaves.

Wget gives up after one request.

lwp-trivial gives up after 3 requests.

Zeus Webster Pro V2.9 Win32 waits it out and will acquire the page. A solution I am trying is increasing the sleep(60) to sleep(120).

I'm still exploring different ways to punish a banned agent and will post improvements as I find them.

Be careful what you put in the banned browser list. This script uses string matching, which if not used carefully could easily affect innocent users. E.g. if you ban "Zeus" you may end up banning zeus.com employees. A better solution would be to ban "Zeus Webster Pro V". This way you can be reasonably assured that only the bot is being banned.
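
To make the matching caution concrete, here is a small sketch of a literal substring check. The sub name is_banned and the "Zeus Intranet Gateway" agent string are invented for illustration; only the ban list entries come from this thread.

```perl
use strict;
use warnings;

# Ban list; each entry is matched as a literal string, not as a regex.
my @banned = ("Wget/1.6", "Zeus Webster Pro V", "EmailSiphon");

# Return 1 if the agent string contains any banned entry, else 0.
# (is_banned is a hypothetical helper name.)
sub is_banned {
    my ($agent) = @_;
    return 0 unless defined $agent;
    foreach my $ban (@banned) {
        # index() does a plain substring search, so "." and "/" are not
        # treated as regex metacharacters
        return 1 if index($agent, $ban) >= 0;
    }
    return 0;
}

# The second agent string is made up: it contains "Zeus" but not the
# longer, more specific banned entry.
print is_banned("Wget/1.6") ? "banned\n" : "ok\n";
print is_banned("Mozilla/4.0 (compatible; Zeus Intranet Gateway)") ? "banned\n" : "ok\n";
```

The first call reports the agent as banned and the second lets it through, which is the point of listing "Zeus Webster Pro V" rather than just "Zeus".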

<Added> It's been a lot more quiet lately!:) </added>

john316

9:48 pm on Jun 17, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Could you send an email to the agent? Most domains have a catchall and I don't know if you can send an email to an IP as opposed to a domain, but that might be cool, for every unwanted hit, they get an email.

Froggyman

10:52 pm on Jun 17, 2001 (gmt 0)



Most of the pesky bots are the home use type. It wouldn't be possible to extract an e-mail address from the IP of a dial up user to send them a message.

icehousedesigns

1:26 am on Jun 18, 2001 (gmt 0)




RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Telesoft [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3.Mozilla/2.01 [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.*$ /badspammer.html [L]

kapow

3:56 pm on Jun 18, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A few ideas from a news group I use (this relates to email harvesting):
-------
1) HoneyPot
If you'd really like to screw up the e-mail harvesters without bringing down the web server, you might want to set up a honeypot.
You'll get the idea by looking here:
[awtrey.com...]

The basic idea is ... redirect a harvester to a page like this, the page generates hundreds of random e-mail addresses which the harvester has to waste time processing and adding. There are also links to other pages at the top of the page that lead to other dynamic pages with hundreds of bogus addresses, and so on, and so on...

The goal is to corrupt the harvester with lots of bogeys, and if you have a link to this at the top & bottom of your page, along with an .htaccess redirect for the appropriate agents, they may never actually get your real e-mail addresses.

There is a mild server load with this one, but not much, considering it's a small perl script with no file/db access, and no number crunching.
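
A honeypot page along those lines might look something like the sketch below. It is not the script from the linked site: the sub names random_address and honeypot_page are made up, and the reserved .example domain is an assumption chosen so the bogus addresses can never belong to a real person. The links to further dynamic pages described above are left out for brevity.

```perl
use strict;
use warnings;

# Return one random, made-up address under the reserved .example domain.
sub random_address {
    my @letters = ('a' .. 'z');
    my $user = join '', map { $letters[rand @letters] } 1 .. 5 + int rand 6;
    my $host = join '', map { $letters[rand @letters] } 1 .. 4 + int rand 6;
    return "$user\@$host.example";
}

# Build a CGI response holding $count bogus mailto: links.
sub honeypot_page {
    my ($count) = @_;
    my $html = "Content-type: text/html\n\n<html><body>\n";
    for (1 .. $count) {
        my $addr = random_address();
        $html .= "<a href=\"mailto:$addr\">$addr</a><br>\n";
    }
    return $html . "</body></html>\n";
}

print honeypot_page(200);
```

There is no file or database access here, just string building, which matches the "mild server load" point.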

2) Server redirect for your e-mail addresses. This one is _really_ elegant, I learned this one from Julian Haight (Spam Cop):
Look at :
[julianhaight.com...] and click on his e-mail address, notice the link before you click on it.

The link points to a page that returns a 301 redirect (moved permanently), and the new address is the mailto: e-mail link.

I don't think too many spam harvesters would be smart enough to figure this one out.
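
One way to try the redirect idea is with a tiny standalone CGI (it has to be linked to directly, since an SSI include cannot set headers). This is a sketch under assumptions, not the code used on the site mentioned above: the sub name mailto_redirect is made up and the address is a placeholder.

```perl
use strict;
use warnings;

# Build CGI headers that ask the web server to answer with a 301 whose
# new location is a mailto: URL. (mailto_redirect is a hypothetical name.)
sub mailto_redirect {
    my ($address) = @_;
    return "Status: 301 Moved Permanently\n"
         . "Location: mailto:$address\n\n";
}

# Placeholder address; substitute your own.
print mailto_redirect('webmaster@mydomain.com');
```

The page itself then only needs an ordinary link to this script, so no address appears in the HTML for a harvester to pick up.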

icehousedesigns

4:58 am on Jun 24, 2001 (gmt 0)



Froggy, excellent script. In regards to not getting mail notification..I've noticed you do...depending on the server error code returned...not sure why...here is the skinny:

Keep in mind both agents WERE blocked...

Emailwolf - returned error code 302 - no e-mail

Wget - returned 200 (ok ) but did block it - did get email notification..odd...anyone?

Froggyman

2:56 pm on Jun 24, 2001 (gmt 0)



A 302 is a moved temporarily (redirect). Maybe Emailwolf was requesting something that wasn't there.

Just to make sure, I decided to download EmailWolf and test it myself. I blocked agent "Emailwolf" and unleashed it on my site. I got the following e-mail:

The following session generated a banned browser agent error:

Host: [My IP Omitted]
Agent: EmailWolf 1.00
Referrer:
Document: [mydomain.com...]
Time: Sunday, 24-Jun-2001 10:46:26 EDT

The EmailWolf software was unable to retrieve any e-mail addresses and gave up on one attempt.

Were you hit by a different version of EmailWolf?

icehousedesigns

3:09 pm on Jun 24, 2001 (gmt 0)



Never mind froggy I got it to work. For some reason the contact page I was testing it on that had my e-mail address was causing the EmailWolf agent to return a 302..no idea why. ( it still doesn't harvest the address ). Other pages it works great..e-mail notification and all. :) Awesome script man.