Forum Moderators: phranque
Whee - what a great discussion.
[edited by: Marcia at 11:23 pm (utc) on Oct. 13, 2003]
[edited by: jdMorgan at 12:24 am (utc) on Nov. 19, 2003]
[edit reason] Corrected URL [/edit]
For example, if I use this, is there some way to pass to the email that the trap sends me what actually happened (the last few lines of the log)?
# Forbid requests for exploits & annoyances - and TRAP
#
# Bad requests
RewriteCond %{REQUEST_METHOD} !^(GET|HEAD|POST)$ [NC,OR]
# CodeRed
RewriteCond %{REQUEST_URI} ^/default\.(ida|idq) [NC,OR]
RewriteCond %{REQUEST_URI} ^/.*\.printer$ [NC,OR]
RewriteCond %{REQUEST_URI} (mail.?form|form|form.?mail|mail|mailto)\.(cgi|exe|pl|asp|php)$ [NC,OR]
# GuestBook
RewriteCond %{REQUEST_URI} (guestbook)\.(cgi|exe|pl|asp|php)$ [NC,OR]
# MSOffice
RewriteCond %{REQUEST_URI} ^/(MSOffice|_vti) [NC,OR]
# Nimda
RewriteCond %{REQUEST_URI} /(admin|cmd|httpodbc|nsiislog|root|shell)\.(dll|exe) [NC,OR]
# Various
RewriteCond %{REQUEST_URI} ^/(bin/|cgi/|cgi\-local/|sumthin) [NC,OR]
RewriteCond %{THE_REQUEST} ^GET\ http [NC,OR]
RewriteCond %{REQUEST_URI} /sensepost\.exe [NC]
# RewriteRule .* - [F]
RewriteRule .* /cgi-bin/trap.cgi [L]
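A defensive tweak worth considering if you adapt this block (a sketch, using the trap path from the rule above): add one un-OR'ed exclusion so that no pattern in the OR-group can ever rewrite a request for the trap script back onto itself.

```apache
# Placed before the OR'ed conditions: an un-OR'ed RewriteCond ANDs with
# the whole group, so the trap script itself is never sent to the trap
RewriteCond %{REQUEST_URI} !^/cgi-bin/trap\.cgi
# ... the OR'ed exploit/annoyance conditions go here, unchanged ...
RewriteRule .* /cgi-bin/trap.cgi [L]
```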
By all means it's a great thread and it has tons of good information. Go ahead and use it - just don't copy anything that you are not entirely sure about. Do research. All kinds of people have posted - they all face different issues, and they all have separate reasons for posting what they did. If you are not 100% sure about a thing - ask. Either here or in the other relevant forums.
This thread generally concentrates on making it work. Not many questions are asked about the motives and reasons. Some people will want to ban things that other people (competitors) make their living from - and not all bots are bad for everyone. A question is never stupid; a copy can easily become very stupid. Don't just copy and paste, unless:
(1)
you know exactly what you are doing, and what you're not doing [AND] you know that what you are doing is also the right thing to do [AND] you know that what you are *not* doing is also the right thing to avoid. Even if there's nothing questionable or unusual at all, you still need to know exactly what you are doing and why. Otherwise you could cause trouble for yourself and others. That is: every single line needs to have a reason, and you need to know that reason personally. You also need to know that the line applies to your specific situation.
There is no such thing as a one-size-fits-all.
-------------------------------
RewriteCond %{REQUEST_URI} ^/(bin/|cgi/|cgi\-local/|sumthin) [NC,OR]
The quite normal directory name "/cgi-bin/" is not matched here. Instead, the unusual name "/sumthin" is matched. What does this mean? Does it mean that whoever uses this line does not even know the name of his/her own cgi folder? No, it tells me that this is example code that has not been adapted to specific use. This ("sumthin") must be an example - just like writing "www.example.com" when referring to some domain. Don't copy. Adapt to your own specific use instead.
If you have a folder named "/sumthin/", or a file named "/sumthing-else.html", then it's pretty obvious that visitors requesting it will get banned. But there's something else. It's probably intentional here to match the special case of cgi folders that are not on this particular domain. So, if you copy this, and run valid scripts on your own server from a folder named "/cgi/", then you'll be putting legitimate visitors into the bot trap. And you don't want to do that. Further, if your bot trap is also located in that folder, then you'll be messing up seriously, and you seriously don't want to do that. Now, that was an innocent example. As this thread is now in three parts, you will find some that are worse, even much worse. So: don't copy blindly. Do research. Always adapt to your own needs.
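For instance (a hypothetical setup): if your own server runs legitimate scripts out of /cgi-local/, the adapted "Various" line would simply drop that alternative and keep the rest.

```apache
# Adapted for a server that legitimately serves scripts from /cgi-local/:
# only /bin/, /cgi/ and the "sumthin" probe remain forbidden
RewriteCond %{REQUEST_URI} ^/(bin/|cgi/|sumthin) [NC,OR]
```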
/claus
From this morning's "loser" log:
61.78.109.21 - - [18/Oct/2003:09:45:19 -0400] "GET /sumthin HTTP/1.0" 403 234 "-" "-"
...Served it a nice, tasty, low-calorie 403-Forbidden. :)
<added>Actually, to turn this post around to a more "reinforcing" direction, the code claus cites above *cannot* be used on one of my servers, because the "standard" location for user scripts is /cgi-local, which would match one of the patterns in the posted code. If I installed that line as originally posted, it might (depending on the contents and placement of the corresponding RewriteRule in my .htaccess file) actually disable one of my most important scripts! So, again, claus' advice is very sound: Don't just copy and paste this stuff if you don't know what each line means.</added>
Jim
So using these new great condensed rules to detect, block and optionally trap the bad bots, is there a way to pass to the trap.cgi what set of rules caused the trap to trip?
For example, if I use this, is there some way to pass to the email that the trap sends me what actually happened (the last few lines of the log)?
#!/usr/bin/perl -w
$remreq = $ENV{'REQUEST_URI'};
$remaddr = $ENV{'REMOTE_ADDR'};
$usragnt = $ENV{'HTTP_USER_AGENT'} || "The UA is blank";
$referer = $ENV{'HTTP_REFERER'} || "there is no referer";
$date = scalar localtime(time);
$remmeth = $ENV{'REQUEST_METHOD'};
$remhost = $ENV{'HTTP_HOST'};
open(MAIL, "|/usr/sbin/sendmail -t") || die "Content-type: text/plain\n\nCan't open /usr/sbin/sendmail!";
print MAIL "To: xxx\@yyy.zzz\n";
print MAIL "From: xxx\@yyy.zzz\n";
print MAIL "Subject: You caught another one!\n\n";
print MAIL "The following 'intruder' was caught by the \"Bot Trap\" and has been added to the ban env in .htaccess:\n\n";
print MAIL "The ip address: $remaddr was listed on $date \n";
print MAIL "The file requested was: $remreq\n";
print MAIL "The method used was: $remmeth\n";
print MAIL "The intruder's user agent was: $usragnt\n";
print MAIL "The document was referred by: $referer\n";
print MAIL "The host server was: $remhost\n";
close(MAIL);
# Send something back to the client so the CGI response is valid:
print "Content-type: text/html\n\n<html><body><h1>Forbidden</h1></body></html>\n";
exit;
This sends me an email as soon as the trap is sprung, which includes the date and time, the intruder's IP address, the name of the file requested, the method (GET, POST, CONNECT, etc.), the intruder's user agent (or a note if it was blank), the referrer (or a note if blank), and the host from which the email was sent.
I hope this helps. You may have different paths to Perl and sendmail. I also obfuscated my To and From email addresses in the example; you will need to put in your own.
Wiz
That was just one great example of the importance of research :)
>> is there some way to pass to the email that trap sends me what actually happened?
AFAIK, when you do an internal rewrite like the example you posted above, the Environment Variables for the request will get passed on. So you could just get the Environment Variables in the script.
Here's a little snippet that prints all of them out alphabetically as raw text, one variable-value pair per line. You could include this in the relevant section of the bot trap you are using, i.e. print these in the email:
--------------------------------------------------------------------------
foreach $key (sort keys(%ENV)) {print "$key: $ENV{$key}\n";}
/claus
$date = scalar localtime(time);
print MAIL "Timestamp: $date\n";
foreach $key (sort keys(%ENV)) {
print MAIL "$key: $ENV{$key}\n";
}
Before I used that script, not only did I search for "sumthin", but I also Googled for "cgi-local" here, and found they were valid terms to block (in my case ;) ).
What you both missed, but I saw and left in there, is "sensepost", which neither Google nor the internal search here can find anything about (in WebmasterWorld). I chose to leave it in because "it couldn't hurt".
In fact I did customize the script. The original script had nothing about "guestbook" in there, which many sites don't use but are scanned for, and now that I've posted it, I realize I can also add "|htm" to the guestbook line. Can I add "|htm?" or should I use "|htm|html"?
But keep up the great work and sharing. I still learned more even from the reaction!
By the way, I have found that $ENV{'REMOTE_HOST'} never seems to work on my server (I get a blank response), but I found that this code works for me:
use Socket;
$remote_addr = $ENV{'REMOTE_ADDR'};
$iaddr = inet_aton($remote_addr);
$remote_host = gethostbyaddr($iaddr, AF_INET) || $remote_addr;  # fall back to the IP if there's no reverse DNS
$remote_addr =~ s/\./\\\./g;
That way I get the reverse DNS for REMOTE_ADDR (before the last line escapes the dots for the .htaccess file).
None of what I wrote was directed at you, but rather at people who pick up information in this thread out of context; in that case, I believe it *is* important to make the point that while some of the user-agents in the list are "definitely bad," others must be viewed in perspective: They may be good or bad, depending on the specific Web site, the market segment it's in, etc.
Best,
Jim
>> guestbook... (htm/l)?
First, "htm" will not catch "html", as you have an end anchor ("$") in that line, so you would need both. Second, it might be better to use this line instead:
RewriteCond %{REQUEST_URI} guestbook [NC,OR]
Why? Because you intend to match requests for a URL containing the word "guestbook", so you should focus on the important part. Otherwise you would still miss the ".shtml" extension, and then the ".php4" extension, and then the ".jsp" extension, and then the ".cfm" extension, and then... This example matches "guestbook" anywhere in the URL; extensions do not matter.
For others: This line implies that if you request a guestbook url, then you will get on the "banned bad-bots" list. This might not be a good idea for everyone, especially not if you run a guestbook.
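The difference between the two patterns is easy to check outside of Apache. Here is a small Perl sketch, using Perl's regex engine as a stand-in for mod_rewrite's (the two are close enough for this comparison):

```perl
#!/usr/bin/perl -w
use strict;

# Compare the extension-anchored pattern with the bare "guestbook" pattern
my @uris = ('/guestbook.cgi', '/guestbook.shtml', '/guestbook/index.php');
for my $uri (@uris) {
    my $narrow = ($uri =~ /guestbook\.(?:cgi|exe|pl|asp|php)$/i) ? 'match' : 'miss';
    my $broad  = ($uri =~ /guestbook/i) ? 'match' : 'miss';
    print "$uri: anchored=$narrow, bare=$broad\n";
}
# Only /guestbook.cgi satisfies the anchored pattern;
# the bare pattern matches all three requests.
```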
>> mod_rewrite should actually make the original information available because it's not a real redirect (right?)
Right. See post #6 - you can even use the snippet I provided to check it.
Just add the Perl shebang line (#!/path/to/perl) before it, save it under a file name ending in ".pl" or ".cgi", and chmod it to 755. Then make an internal rewrite from some odd filename to this file. Enter the odd filename in your browser's address bar to get a list of all environment variables for this request.
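Put together, the test file described above is tiny (a sketch; the filename and the rewrite target are up to you):

```perl
#!/usr/bin/perl -w
# envdump.cgi - save under a name ending in .pl or .cgi, chmod 755,
# then internally rewrite some odd filename to it and request that name
use strict;

# A CGI response needs a header before the body
print "Content-type: text/plain\n\n";

# One "VARIABLE: value" pair per line, sorted alphabetically
foreach my $key (sort keys %ENV) {
    print "$key: $ENV{$key}\n";
}
```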
/claus
What I (re)did in the end was just extend the MSOffice line:
RewriteCond %{REQUEST_URI} ^/(MSOffice|_vti|guestbook) [NC,OR]
(btw, even if someone has a guestbook, they would be foolish to call the HTML page or the CGI "guestbook", because it's like keeping your formmail CGI named "formmail.cgi" and being surprised when you get spambot attacks)
congrats on 1000 posts, I bet at least 90% or more of them really helped folks... thanks! (is claus short for santa-claus? ;) )
REQUEST_URI is relative to the domain name, and it includes the leading slash. If you want a start anchor, you should include that slash, like this:
RewriteCond %{REQUEST_URI} ^/guestbook [NC,OR]
This will also match directories, like this: http://example.com/guestbook/index.php
If you just want to match "guestbook+dot+some extension" then you could include the dot in the rewrite condition:
RewriteCond %{REQUEST_URI} ^/guestbook\. [NC,OR]
/claus
I wasn't creative enough to find a nick, and i keep forgetting those anyway whenever i use them ;)
# Forbid if blank (or "-") Referer *and* UA
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule .* - [F]
where I need an exception, either for one specific CGI (not the preferred solution, but acceptable) or based on the rDNS of the visitor's IP (my host doesn't seem to provide the Remote_Host environment variable - is that common, because of the overhead of rDNS lookups, or am I doing something wrong?)
you can get the background on why I have this problem here [webmasterworld.com]
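For the first kind of exception, one approach (a sketch; /cgi-bin/special.cgi is a placeholder path) is an extra negated, un-OR'ed RewriteCond, which ANDs with the two blank-header tests:

```apache
# Let one specific CGI through even when Referer and User-Agent are blank;
# all three conditions must hold before the request is forbidden
RewriteCond %{REQUEST_URI} !^/cgi-bin/special\.cgi$
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule .* - [F]
```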
And the 404-error script can then decide what to do? I guess this should reduce server load a little... or am I wrong?
One question.
I have a problem with a bot that operates from an HTML page hosted on a GeoCities site. How do I block requests coming from that page?
For example, let's say the bot is placed on :
http://www.example.com/annoyingsite/bot.html
What would be the syntax in my htaccess to block any request coming from that html page?
Thank you,
DrJOnes
[EDIT] While I'm at it, I have another question:
If I place a general .htaccess file in my website root (where the main index.html file is located), do the rules apply to all subfolders, even password-protected folders? For instance, I have added a set of rules that block bots and site grabbers to the .htaccess in my root. I have another .htaccess file located in a member directory (/members/.htaccess). Do I need to re-insert all the block rules in that .htaccess file as well? Thanks.
[edited by: jdMorgan at 8:02 pm (utc) on Nov. 2, 2003]
[edit reason] Examplified URL and delinked [/edit]
For example, let's say the bot is placed on :
http://www.example.com/annoyingsite/bot.html
What would be the syntax in my htaccess to block any request coming from that html page?
I believe that would be something like this (but I could be mistaken):
RewriteCond %{HTTP_REFERER} ^http://www\.example\.com/annoyingsite/bot\.html [NC]
RewriteRule .* - [F]
(The referring page shows up in %{HTTP_REFERER}, not in the requested URL, so the test goes in a RewriteCond rather than in the RewriteRule pattern.)
Wiz
[edited by: jdMorgan at 8:04 pm (utc) on Nov. 2, 2003]
[edit reason] Examplified and delinked URL [/edit]
Just to make sure I understood correctly, the ban rules inserted in the root .htaccess will also be effective in all subdirectories, even user protected subdirectories that contain a new .htaccess, correct?
DrJOnes
Just to make sure I understood correctly, the ban rules inserted in the root .htaccess will also be effective in all subdirectories, even user protected subdirectories that contain a new .htaccess, correct?
DrJOnes
I usually experiment with rules in specially created subdirectories to test my brand-new rules, then add them to the root .htaccess once they have proved safe. Before I learned to do that, I blocked access to my own website with bad commands in .htaccess. An example of a simple mistake that can break your website is forgetting to escape blank spaces in RewriteCond patterns - the first line below is broken, the second is the corrected version:
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3\.0 (compatible)$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3\.0\ \(compatible\)$ [NC,OR]
Wiz
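The subdirectory staging described above can look like this (a sketch; the directory name is arbitrary): only requests under the test directory are affected while the pattern is being verified.

```apache
# /test-rules/.htaccess - hypothetical staging directory for new rules
RewriteEngine On
# Candidate rule under test; promote it to the root .htaccess once proven safe
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3\.0\ \(compatible\)$ [NC]
RewriteRule .* - [F]
```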
RewriteCond %{HTTP_USER_AGENT} ^[CDEFPRS](Browse|Eval|Surf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Demo|Full.?Web|Lite|Production|Franklin|Missauga|Missigua).?(Bot|Locat) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (efp@gmx\.net|hhjhj@yahoo\.com|lerly\.net|mapfeatures\.net|metacarta\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Industry|Internet|IUFW|Lincoln|Missouri|Program).?(Program|Explore|Web|State|College|Shareware) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Mac|Ram|Educate|WEP).?(Finder|Search) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Moz+illa|MSIE).?[0-9]?.?[0-9]?[0-9]?$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9]\.[0-9][0-9]?.\(compatible[\)\ ] [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NaverRobot [NC]
RewriteRule .* - [F]
Is it as simple as adding
RewriteCond %{REMOTE_ADDR} !^123\.123\.123\.123$
at the top of the list? (without any [OR] )
Is it as simple as adding
RewriteCond %{REMOTE_ADDR} !^123\.123\.123\.123$
at the top of the list? (without any [OR] )
Wiz
In other-other words,
NOT(someIP) AND (ua1 OR ua2 OR ua3...) is equivalent to
(ua1 OR ua2 OR ua3...) AND NOT(someIP)
I usually like to put RewriteCond exclusions in the order most likely to stop Rule processing soonest, so that unnecessary RewriteCond testing does not take place. I put the most "selective" RewriteConds first as a speed-up, in other words.
Jim
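As a config sketch of that equivalence (placeholder IP, and just two user-agent patterns for brevity): an un-OR'ed condition ANDs with the whole OR-group around it, so the whitelist test can sit at either end of the list.

```apache
# NOT(someIP) AND (ua1 OR ua2): the negated, un-OR'ed REMOTE_ADDR test
# ANDs with the OR-group of user-agent patterns that follows it
RewriteCond %{REMOTE_ADDR} !^123\.123\.123\.123$
RewriteCond %{HTTP_USER_AGENT} NaverRobot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^[CDEFPRS](Browse|Eval|Surf) [NC]
RewriteRule .* - [F]
```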
It made my brain hurt until I experimented with it and figured out the operator and per-line precedence... and I've worked with complex boolean logic on a daily basis for almost 30 years. :)
Jim
amznVibe quoted my .htaccess file verbatim, so I thought I'd point out a couple of things as well...
> /sumthin
As noted earlier, "sumthin" isn't an example but an exploit check. I was hit three times by folks looking for, well, something, and decided to add it to my "directories that don't exist" line in .htaccess. Subsequent WebmasterWorld research and discussion led to the opinion that someone was checking the server response headers for something to exploit...
> sensepost.exe
I remember that I had two different visitors drop by looking for this mystery file within the space of a couple of days, and since I didn't like the look of it (nor could I find anything about it), I added it to my .htaccess. That was some time ago, and I don't think anyone has come looking for it since... so there's no real reason why it should be left in (or added to) your, the reader's, .htaccess file.
I checked my stats and it blocked pretty much everything that needed blocking.
Except, I still get one UNKNOWN BROWSER in my server stats. I fear that this unknown browser could be a grabber.
Is there a safe way to block unknown browsers without blocking legit browsers?
Thanks,
DrJOnes666
MY BLOCK LIST IN HTACCESS:
--------------------------
(PS: if you copy/paste this block list into your .htaccess, check that the vertical bars survived the trip intact, and don't forget to adapt every line to your own site first!)
--------------------------
RewriteEngine On
# Forbid requests for exploits & annoyances
# Bad requests
RewriteCond %{REQUEST_METHOD} !^(GET|HEAD|POST)$ [NC,OR]
# CodeRed
RewriteCond %{REQUEST_URI} ^/default\.(ida|idq) [NC,OR]
RewriteCond %{REQUEST_URI} ^/.*\.printer$ [NC,OR]
# Email
RewriteCond %{REQUEST_URI} (mail.?form|form|form.?mail|mail|mailto)\.(cgi|exe|pl)$ [NC,OR]
# MSOffice
RewriteCond %{REQUEST_URI} ^/(MSOffice|_vti) [NC,OR]
# Nimda
RewriteCond %{REQUEST_URI} /(admin|cmd|httpodbc|nsiislog|root|shell)\.(dll|exe) [NC,OR]
# Various
RewriteCond %{REQUEST_URI} ^/(bin/|cgi/|cgi\-local/|sumthin) [NC,OR]
RewriteCond %{THE_REQUEST} ^GET\ http [NC,OR]
RewriteCond %{REQUEST_URI} /sensepost\.exe [NC]
RewriteRule .* - [F]
# Forbid if blank (or "-") Referer *and* UA
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule .* - [F]
# Banning bots below
# Address harvesters
RewriteCond %{HTTP_USER_AGENT} ^(autoemailspider|ExtractorPro) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^E?Mail.?(Collect|Harvest|Magnet|Reaper|Siphon|Sweeper|Wolf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (DTS.?Agent|Email.?Extrac) [NC,OR]
RewriteCond %{HTTP_REFERER} iaea\.org [NC,OR]
# Download managers
RewriteCond %{HTTP_USER_AGENT} ^(Alligator|DA.?[0-9]|DC\-Sakura|Download.?(Demon|Express|Master|Wonder)|FileHound) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Flash|Leech)Get [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Fresh|Lightning|Mass|Real|Smart|Speed|Star).?Download(er)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Gamespy|Go!Zilla|iGetter|JetCar|Net(Ants|Pumper)|SiteSnagger|Teleport.?Pro|WebReaper) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(My)?GetRight [NC,OR]
# Image-grabbers
RewriteCond %{HTTP_USER_AGENT} ^(AcoiRobot|FlickBot|webcollage) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Express|Mister|Web).?(Web|Pix|Image).?(Pictures|Collector)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image.?(fetch|Stripper|Sucker) [NC,OR]
# "Gray-hats"
RewriteCond %{HTTP_USER_AGENT} ^(Atomz|BlackWidow|BlogBot|EasyDL|Marketwave|Sqworm|SurveyBot|Webclipping\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (girafa\.com|gossamer\-threads\.com|grub\-client|Netcraft|Nutch) [NC,OR]
# Site-grabbers
RewriteCond %{HTTP_USER_AGENT} ^(eCatch|(Get|Super)Bot|Kapere|HTTrack|JOC|Offline|UtilMind|Xaldon) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(Auto|Cop|dup|Fetch|Filter|Gather|Go|Leach|Mine|Mirror|Pix|QL|RACE|Sauger) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(site.?(eXtractor|Quester)|Snake|ster|Strip|Suck|vac|walk|Whacker|ZIP) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCapture [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo\ Pump [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [NC,OR]
# Tools
RewriteCond %{HTTP_USER_AGENT} ^(curl|Dart.?Communications|Enfish|htdig|Java|larbin) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (FrontPage|Indy.?Library|RPT\-HTTPClient) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(libwww|lwp|PHP|Python|www\.thatrobotsite\.com|webbandit|Wget|Zeus) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Microsoft|MFC).(Data|Internet|URL|WebDAV|Foundation).(Access|Explorer|Control|MiniRedir|Class) [NC,OR]
# Unknown
RewriteCond %{HTTP_USER_AGENT} ^(Crawl_Application|Lachesis|Nutscrape) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^[CDEFPRS](Browse|Eval|Surf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Demo|Full.?Web|Lite|Production|Franklin|Missauga|Missigua).?(Bot|Locat) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (efp@gmx\.net|hhjhj@yahoo\.com|lerly\.net|mapfeatures\.net|metacarta\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Industry|Internet|IUFW|Lincoln|Missouri|Program).?(Program|Explore|Web|State|College|Shareware) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Mac|Ram|Educate|WEP).?(Finder|Search) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Moz+illa|MSIE).?[0-9]?.?[0-9]?[0-9]?$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9]\.[0-9][0-9]?.\(compatible[\)\ ] [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NaverRobot [NC]
RewriteRule .* - [F]