
Blocking Bad Bots.

This is not working for me.


Elgoog

12:55 pm on Dec 23, 2002 (gmt 0)



I have this in my .htaccess file, but it doesn't seem to be working. I still see this bot hitting my two hundred sites daily:

65.217.13.8 - - [23/Dec/2002:06:03:16 -0700] "GET / HTTP/1.1" 200 115932 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt; DTS Agent"

I don't want to block by IP because it changes from day to day.
##########################

RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible;\ MSIE\ 5\.0;\ Windows\ NT;\ DigExt;\ DTS\ Agent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DTS\ Agent$ [NC]
RewriteCond %{HTTP_USER_AGENT} Mozilla/4\.0\ \(compatible;\ MSIE\ 5\.0;\ Windows\ NT;\ DigExt;\ DTS\ Agent$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} DTS\ Agent$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Siphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.*$ http://fastsearchrs.net/... [L,R]

#RewriteCond %{HTTP_USER_AGENT} FAST-WebCrawler/3\.6\ Agent$ [OR]
#RewriteCond %{HTTP_USER_AGENT} DTS\ Agent$ [NC,OR]
#RewriteCond %{REMOTE_ADDR} ^218\.5\.77\.71$
#RewriteRule .* - [F]
#RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
#RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]

bird

2:44 pm on Dec 23, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



RewriteCond %{HTTP_USER_AGENT} ^DTS\ Agent$ [NC]

You forgot the OR flag on this one - and note that the second, unanchored DTS\ Agent$ condition a couple of lines further down has the same problem. The corrected entry would look like this:

RewriteCond %{HTTP_USER_AGENT} ^DTS\ Agent$ [NC,OR]
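
For what it's worth, here is a minimal sketch of the semantics, reusing patterns from your own list: a condition without [OR] ends an OR-group, and every group must be true before the rule fires. So an unflagged condition in the middle of a long list splits it into two groups that must BOTH match - which a single user agent almost never can:

# (A or B) AND (C or D) - the unflagged second line ends the first group
RewriteCond %{HTTP_USER_AGENT} ^DTS\ Agent$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule .* - [F]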

Orange_XL

4:51 pm on Dec 23, 2002 (gmt 0)

10+ Year Member



Does anyone have a more manageable solution for blocking bad IPs/agents? Having all those entries in my httpd.conf or vhosts.conf is a bit much (.htaccess is not an option, since I want it server-wide and without much performance loss).

I was looking into using RewriteMaps. Also, can you block IP ranges with RewriteMaps? Is there a website somewhere with a complete list of bad agents and IPs?

jdMorgan

5:21 pm on Dec 23, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Does anyone have a more manageable solution for blocking bad IPs/agents? Having all those entries in my httpd.conf or vhosts.conf is a bit much (.htaccess is not an option, since I want it server-wide and without much performance loss).

There are many solutions - mod_rewrite in .htaccess is just the most powerful one available to webmasters who do not have root privileges, i.e. the majority. The performance hit of using mod_rewrite is comparatively small, considering the number of sites using complex and lengthy scripts to serve pages.

mod_rewrite in httpd.conf is more efficient.

The above blocks can be made much more efficient by combining the patterns of several RewriteConds to save on overhead. That is, instead of:

RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]

use:

RewriteCond %{HTTP_USER_AGENT} ^Web(\ Image\ Collector|\ Sucker|Auto|Copier|Fetch) [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web(Reaper|Sauger|site\ eXtractor|Stripper|Whacker|ZIP) [OR]

The two lines above can be concatenated as well - I split them only to avoid a line-wrapping problem in this post. Some favor a single RewriteCond containing patterns for all of their "bad bots", while others prefer to organize them alphabetically and by required-anchor type to ease maintenance. In the example above, the patterns are alphabetical and all are start-anchored.
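
For reference, the fully concatenated version is just one (long) line:

RewriteCond %{HTTP_USER_AGENT} ^Web(\ Image\ Collector|\ Sucker|Auto|Copier|Fetch|Reaper|Sauger|site\ eXtractor|Stripper|Whacker|ZIP) [OR]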


Is there a website somewhere with a complete list of bad agents and IP's?

No, for two reasons: first, there is no universally agreed definition of "bad agents," and second, new ones appear every day, making a comprehensive list impossible. A good approach is to combine a static ban list like the one above with an active approach - a spider trap - that detects and blocks UAs and IPs which do not honor robots.txt. There are other traps you might implement as well, e.g. bandwidth-hog detectors.
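
As a bare-bones illustration of the trap idea (the /trap/ directory name is made up, and a real trap would also record the offending IP for permanent blocking):

# robots.txt - compliant robots will never request anything under /trap/
User-agent: *
Disallow: /trap/

# .htaccess - anything that asks for it anyway gets a 403
RewriteEngine On
RewriteRule ^trap/ - [F]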

I'll leave your question about IP-range blocking with rewrite maps for someone who's actually done it.

Jim

FoodPlaces

6:53 pm on Dec 23, 2002 (gmt 0)

10+ Year Member



You might want to add PHP to your sites and do some of this in PHP code. I am not all that familiar with PHP, but we do the same thing in REXX, which should also run under your OS (I'm guessing, since REXX seems to run on every platform known to computerkind).

The easiest route is to write a REXX script to handle all HTML requests. All it needs to do is read the file and send it - or rather, send whatever you want, based on IP address, user agent, or any other HTTP header info. This is very simple to do on any Apache, Domino, DominoGo, or WebSphere server; the entire script would probably be 10-15 lines and should be cross-platform across those web servers. (Don't ask me about IIS - we no longer touch it, not since a client ignored our advice NOT to use it and ended up needing a farm of nine dual-CPU servers to handle the traffic that one single-CPU DominoGo/WarpServer machine had been handling. So I have never tried a REXX implementation under IIS and don't know whether it works.)

PHP should also run on just about everything, including all the web servers I mentioned (and, I am pretty sure, IIS as well). It shouldn't require much scripting to do the same thing, either.

(For your own purposes, you can substitute PHP for REXX in this paragraph, since they should be equally capable in this respect.) I use REXX to do the dynamic page sending; none of the pages ever needs to look dynamic to a web browser (in the URL field) or to a bot/search engine (URL or links), including pages produced by other scripts and programs, unless I want them to. I can also log all such requests on the fly, including real-time updating of "should be blocked", "auto-blocked", and "normal" requests.

Because content on our sites is often served dynamically based on cookies or referrers, this matters even more to us for log analysis. No log tool I know of can do what we do, because none of them can tell what content was actually served when a script assembled the page differently based on any combination of: (1) cookies, (2) the referrer, (3) parsing of the referrer's search query (for instance, when people jump to our site from Google), (4) the browser's UA string, (5) time of day, (6) a specific "what to serve next" decision based on #1-5, or (7) affiliate content included dynamically based on a variety of factors. REXX (or PHP) can, since you determine what gets logged - how, where, when, or if at all.

Supposedly Apache will soon allow module loading (or already does in the newer releases - or is it platform-specific?), i.e. load a module once and use it for any request it handles. That could make something like REXX or PHP a much more suitable alternative for dynamic blocking than having Apache re-read the .htaccess file on every request.

Also, as I noted, this method will let you dynamically generate your pages for normal users AND handle attack requests too. (We pass all requests for formmail.pl, the latest-MS-IIS-hole-of-the-week exploit, and so on to a dynamic page - or, for logging purposes and some "fun", to a REXX script or executable designed to send neat, custom-chosen data back to the attacking machine.)

Just my one cent...

- Rob

Orange_XL

3:40 pm on Dec 24, 2002 (gmt 0)

10+ Year Member



Thanks for the suggestion, but PHP is not an option (it is a PHP-less server). Also, the overhead that PHP generates is too much for such a basic feature.

Compressing the RewriteRules is an option (I have used it already), but it is less manageable than something like a RewriteMap.
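
For the record, the sort of setup I have in mind looks roughly like this (untested, and the map file name is made up). RewriteMap is a server-config-only directive anyway, which suits me fine:

# httpd.conf - RewriteMap is not allowed in .htaccess
RewriteEngine On
RewriteMap badbots txt:/usr/local/apache/conf/badbots.txt

# Key on the leading word of the UA string; block if the map knows it
RewriteCond %{HTTP_USER_AGENT} ^(\w+)
RewriteCond ${badbots:%1|OK} !=OK
RewriteRule .* - [F]

The map file itself is just one "key value" pair per line, e.g.:

Wget deny
HTTrack deny

A plain txt: map is an exact-key lookup, though, so IP ranges would presumably need a prg: map with a small script doing the range math - which is really what I am asking about.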

FoodPlaces

9:16 pm on Dec 24, 2002 (gmt 0)

10+ Year Member



Hi Orange,

Perhaps REXX is the way to go, then. It has negligible overhead and runs on virtually every OS I can think of. I haven't used it much on other platforms, though I do know that on Warp and eComStation it is amazingly fast. On most platforms it can be compiled to Java/NetREXX or to an exe, and on many it can be compiled to a DLL that some web servers can call.

Robert

carfac

6:17 am on Dec 29, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Elgoog:

I do not know if this is an option for you or not, but it works for me. I took a basic Apache::Block_Agent and installed it in every virtual server; then I modded it to block IPs too, and installed that as well. Now I have one master bad_agent text file and one master bad_ip file for all my servers.
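
For anyone without mod_perl, a rough approximation of the same one-master-file idea is an Include plus SetEnvIf (all file and host names below are made up):

# httpd.conf - every vhost pulls in the same master list
<VirtualHost *>
    ServerName www.example.com
    Include conf/bad_agents.conf
</VirtualHost>

# conf/bad_agents.conf - edit once, takes effect everywhere on restart
SetEnvIfNoCase User-Agent "^HTTrack" bad_bot
SetEnvIfNoCase User-Agent "^WebZIP" bad_bot
<Location />
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Location>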

dave