Forum Moderators: phranque
# Forbid if blank (or "-") Referer *and* UA, except for HEAD requests from caching proxies (such as AOL)
RewriteCond %{REQUEST_METHOD} !^HEAD$
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteCond %{REQUEST_URI} !^.*robots\.txt$
RewriteCond %{REQUEST_URI} !^.*\.ico$
RewriteCond %{REQUEST_URI} !/getout\.php$
RewriteRule .* /getout.php [L]
Thank you very much in advance.
Sincerely,
Joe Belmaati
Copenhagen Denmark
# Forbid if blank (or "-") Referer *and* UA, except for HEAD requests from caching proxies (such as AOL)
RewriteCond %{REQUEST_METHOD} !^HEAD$
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_URI} !\.ico$
RewriteCond %{REQUEST_URI} !^/custom403\.html$
RewriteRule .* - [F]
# Forbid if blank (or "-") Referer *and* UA, except for HEAD requests from caching proxies (such as AOL)
RewriteCond %{REQUEST_METHOD} !^HEAD$
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteCond %{REQUEST_URI} !(^/robots\.txt|\.ico|^/custom403\.html)$
RewriteRule .* - [F]
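Mixing anchored and unanchored branches inside one alternation is easy to get wrong, so it is worth sanity-checking how this combined pattern behaves: the `^/robots\.txt` and `^/custom403\.html` branches only match at the start of the URI, while the `\.ico` branch exempts *any* URI ending in .ico. A short Python sketch illustrates this (Python's `re` module handles these constructs the same way as Apache's regex engine here):

```python
import re

# The negated pattern from the RewriteCond above; mod_rewrite matches it
# unanchored, so re.search() is the right analogue.
exempt = re.compile(r"(^/robots\.txt|\.ico|^/custom403\.html)$")

# The anchored branches match only at the start of the URI...
assert exempt.search("/robots.txt")
assert not exempt.search("/sub/robots.txt")

# ...but the \.ico branch is unanchored, so any URI ending in .ico is exempt.
assert exempt.search("/favicon.ico")
assert exempt.search("/deep/path/anything.ico")

# Everything else fails the exemption and falls through to the [F] rule.
assert not exempt.search("/index.html")
```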
Jim
Sincerely,
Joe Belmaati
Copenhagen Denmark
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
# Don't look in my htaccess file
SetEnvIf Request_URI "^\.ht" getout
# This ip can do what it wants (disregard the #'s - they are real numbers in my htaccess file)
SetEnvIf Remote_Addr "^##\.##\.###\.###$" allowsome
#
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>
Options +FollowSymLinks
RewriteEngine on
RewriteBase /
# Don't look in my htaccess file
RewriteRule ^\.ht - [F]
RewriteCond %{REMOTE_ADDR} ^80\.196\.101\.240$
RewriteRule .* - [L]
# Various bots
RewriteCond %{HTTP_USER_AGENT} ^WinHttp\.WinHttpRequest\.\d+ [NC,OR]
# Address harvesters
RewriteCond %{HTTP_USER_AGENT} ^(autoemailspider|ExtractorPro) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^E?Mail.?(Collect|Harvest|Magnet|Reaper|Siphon|Sweeper|Wolf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (DTS.?Agent|Email.?Extrac) [NC,OR]
RewriteCond %{HTTP_REFERER} iaea\.org [NC,OR]
# Download managers
RewriteCond %{HTTP_USER_AGENT} ^(Alligator|DA.?[0-9]|DC\-Sakura|Download.?(Demon|Express|Master|Wonder)|FileHound) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Flash|Leech)Get [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Fresh|Lightning|Mass|Real|Smart|Speed|Star).?Download(er)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Gamespy|Go!Zilla|iGetter|JetCar|Net(Ants|Pumper)|SiteSnagger|Teleport.?Pro|WebReaper) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(My)?GetRight [NC,OR]
# Image-grabbers
RewriteCond %{HTTP_USER_AGENT} ^(AcoiRobot|FlickBot|webcollage) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Express|Mister|Web).?(Web|Pix|Image).?(Pictures|Collector)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image.?(fetch|Stripper|Sucker) [NC,OR]
# "Gray-hats"
RewriteCond %{HTTP_USER_AGENT} ^(Atomz|BlackWidow|BlogBot|EasyDL|Marketwave|Sqworm|SurveyBot|Webclipping\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (girafa\.com|gossamer\-threads\.com|grub\-client|Netcraft|Nutch) [NC,OR]
# Site-grabbers
RewriteCond %{HTTP_USER_AGENT} ^(eCatch|(Get|Super)Bot|Kapere|HTTrack|JOC|Offline|UtilMind|Xaldon) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(Auto|Cop|dup|Fetch|Filter|Gather|Go|Leach|Mine|Mirror|Pix|QL|RACE|Sauger) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(site.?(eXtractor|Quester)|Snake|ster|Strip|Suck|vac|walk|Whacker|ZIP) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCapture [NC,OR]
# Tools
RewriteCond %{HTTP_USER_AGENT} ^(curl|Dart.?Communications|Enfish|htdig|Java|larbin) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (FrontPage|Indy.?Library|RPT\-HTTPClient) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (libwww|lwp|PHP|Python|www\.thatrobotsite\.com|webbandit|Wget|Zeus) [NC,OR]
# Unknown
RewriteCond %{HTTP_USER_AGENT} ^(Crawl_Application|Lachesis|Nutscrape) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^[CDEFPRS](Browse|Eval|Surf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Demo|Full.?Web|Lite|Production|Franklin|Missauga|Missigua).?(Bot|Locat) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (efp@gmx\.net|hhjhj@yahoo\.com|lerly\.net|mapfeatures\.net|metacarta\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Industry|Internet|IUFW|Lincoln|Missouri|Program).?(Program|Explore|Web|State|College|Shareware) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Mac|Ram|Educate|WEP).?(Finder|Search) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Moz+illa|MSIE).?[0-9]?.?[0-9]?[0-9]?$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9]\.[0-9][0-9]?.\(compatible[\)\ ] [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NaverRobot [NC,OR]
# Email
RewriteCond %{REQUEST_URI} (mail.?form|form|form.?mail|mail|mailto)\.(cgi|exe|pl)$ [NC,OR]
# Various
RewriteCond %{REQUEST_URI} ^/(bin/|cgi/|cgi\-local/|sumthin) [NC,OR]
RewriteCond %{THE_REQUEST} ^GET\ /?http [NC,OR]
# Forbid if UA is a single word - case-insensitive, A-Z only
RewriteCond %{HTTP_USER_AGENT} ^[a-z]+$ [NC]
RewriteCond %{REQUEST_URI} !/getout\.php$
RewriteRule .* /getout.php [L]
# Forbid if blank (or "-") Referer *and* UA, except for HEAD requests from caching proxies (such as AOL)
RewriteCond %{REQUEST_METHOD} !^HEAD$
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteCond %{REQUEST_URI} !(^/robots\.txt|\.ico|^/custom403\.html)$
RewriteRule .* - [F]
# FrontPage, Office, etc.
#RewriteCond %{REQUEST_URI} ^/(MSOffice|_vti) [NC]
#RewriteRule .* - [F]
Make sure you have a space between each "}" and "!", and make sure that any broken pipe "¦" characters (an artifact of this forum's software) have been changed back to solid pipe "|" characters.
This rule will block "MARTINI" which is a robot from LookSmart:
# Forbid if UA is a single word - case-insensitive, A-Z only
RewriteCond %{HTTP_USER_AGENT} ^[a-z]+$ [NC]
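To see why this catches MARTINI but leaves normal browsers alone: with [NC] the class [a-z] matches letters of either case, and the ^...$ anchors require the *entire* user-agent string to be one unbroken run of letters. A quick Python check of the same pattern (Python's `re` treats these constructs identically):

```python
import re

# The single-word test from the RewriteCond above; [NC] corresponds to
# re.IGNORECASE, and ^...$ anchors the whole user-agent string.
single_word = re.compile(r"^[a-z]+$", re.IGNORECASE)

# "MARTINI" is one run of letters, so LookSmart's robot is caught...
assert single_word.search("MARTINI")

# ...while any UA containing digits, slashes, or spaces passes through.
assert not single_word.search("Mozilla/4.0 (compatible; MSIE 6.0)")
assert not single_word.search("TulipChain/6.02")
```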
Jim
This rule will block "MARTINI" [...] If that is important to you, create an exception using a RewriteCond.
This is how I handle this situation:
# Forbid visitor if UA is a single word - case-insensitive, A-Z only
RewriteCond %{HTTP_USER_AGENT} ^[a-z]+$ [NC]
# ...some exemptions though...
RewriteCond %{HTTP_USER_AGENT} !^DeepIndex$
RewriteCond %{HTTP_USER_AGENT} !^FavOrg$
RewriteCond %{HTTP_USER_AGENT} !^MantraAgent$
RewriteCond %{HTTP_USER_AGENT} !^MARTINI$
RewriteRule !403\.html$ - [F]
I see that Joe is using an .htaccess that I posted some time ago... A nice little ego boost to see that "code" keeps popping up.
Always happy when I can offer some help. Here's another (not so?) little tidbit you'll want to know about if you are in, or are trying to get into, the DMOZ directory...
From what I've been able to gather, the editors of DMOZ have a custom-made link-checking program named "TulipChain" that they use to verify the existence of sites in the directory. It's written in Java and uses other "toolbox" software. Here's the UA (or a recent version thereof):
TulipChain/6.02 (http://ostermiller.org/tulipchain/) Java/1.4.0_03 (http://java.sun.com/) Windows_XP/5.1 RPT-HTTPClient/0.3-3
It's important to note the "RPT-HTTPClient/0.3-3" part of the UA, since RPT-HTTPClient is contained in the second RewriteCond of the "Tools" section in the .htaccess posted in message 5, above. Specifically:
RewriteCond %{HTTP_USER_AGENT} (FrontPage|Indy.?Library|RPT\-HTTPClient) [NC,OR]
I've had trouble with "visitors" using RPT-HTTPClient (which, to be honest, I can't quite remember what it is), but I don't want to ban DMOZ, so near the top of my .htaccess I have:
RewriteCond %{HTTP_USER_AGENT} ^TulipChain
RewriteRule (.*) - [L]
If you are concerned about this, I would add the above two lines after your "Don't look in my htaccess file" section and before your "Various bots" section.
Also worth noting is that "Java" is part of the first RewriteCond in the "Tools" section and it also appears in the TulipChain UA. Since that RewriteCond requires that the UA start with Java (or the other expressions it tests for), it will not stop TulipChain, but if that important caret (^) is ever removed and the extra two lines I offered above aren't added to your .htaccess, then that RewriteCond will ban TulipChain as well (or more accurately, first).
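The effect of that leading caret can be checked directly. Using the TulipChain UA quoted above and the "Tools" pattern as posted, the anchored version ignores TulipChain because "Java" appears mid-string, while the same pattern with the caret dropped would snag it (Python's `re.search` mirrors mod_rewrite's unanchored matching here):

```python
import re

tulip = ("TulipChain/6.02 (http://ostermiller.org/tulipchain/) "
         "Java/1.4.0_03 (http://java.sun.com/) Windows_XP/5.1 "
         "RPT-HTTPClient/0.3-3")

# The "Tools" condition as posted, anchored with ^:
anchored = re.compile(r"^(curl|Dart.?Communications|Enfish|htdig|Java|larbin)",
                      re.IGNORECASE)
assert not anchored.search(tulip)        # "Java" is mid-string: no match
assert anchored.search("Java/1.4.0_03")  # a bare Java client is still caught

# The same condition with the caret removed would also catch TulipChain:
unanchored = re.compile(r"(curl|Dart.?Communications|Enfish|htdig|Java|larbin)",
                        re.IGNORECASE)
assert unanchored.search(tulip)
```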
Jim
Thank you once again :)
Sincerely,
Joe Belmaati
Copenhagen Denmark
Sure enough, despite setting an off-limits area in the robots.txt file over a month ago, and despite waiting until this week to set up the trap, within the first 24 hours, three bots were snagged, two of which are related to blogs and/or PDA usage:
IP address: 64.157.224.100
Domain name: sync00.avantgo.com
User agent: Mozilla/4.0 (compatible; AvantGo 5.2; FreeBSD)
IP address: 198.87.83.123
Domain name: www.syndic8.com
User agent: Syndic8/1.0 (http://www.syndic8.com/ )
Lesson learned! Dumb bots, but not devious. Thus, I'll be sticking with manual htaccess banning, even though that's a little extra work.
You can simply add an exclusion to the script or to the htaccess code you use to redirect to the script, in order to avoid banning WAP requests or anything else you wish to permit.
As an example, let's assume you have cloaked the bad-bot.pl script using mod_rewrite. Instead of Disallowing bad-bot.pl in robots.txt and putting links to it in your pages, you use "all-private.html". Then the .htaccess code might look like this:
# Redirect bad-bot bait files to IP banning script. Exclusions are to avoid banning search engines and
# AvantGo WAP proxies, Google proxies, and WebTV. AvantGo, Google WAP proxy, and WebTV may display the
# link to the spider trap, so users may click on it. Search engines should not attempt to fetch files
# starting with "all_private" because this is disallowed in robots.txt. However, the following
# exclusion list is a "safety net."
RewriteCond %{HTTP_USER_AGENT} !(Ask\ Jeeves|FAST-.*WebCrawler/|Fluffy|GalaxyBot/|Gigabot/|Googlebot/) [NC]
RewriteCond %{HTTP_USER_AGENT} !(ia_archiver|MARTINI|Mercator|msnbot/|Overture-WebCrawler/|Robozilla) [NC]
RewriteCond %{HTTP_USER_AGENT} !(Scooter/|Scrubby/|Slurp|Steeler/|Submission\ Spider) [NC]
RewriteCond %{HTTP_USER_AGENT} !(Teoma|Vagabondo/|VoilaBot|Zealbot|ZyBorg/) [NC]
RewriteCond %{HTTP_USER_AGENT} !(AvantGo|Blazer|Google\ .*\ Proxy|Tulipchain|WebTV|Xenu) [NC]
RewriteRule ^all_private /cgi/bad-bot.pl [L]
# /private is an empty directory which is password-protected. User agents excluded above will get a 401
# authentication required response if they ignore robots.txt and attempt to fetch all-private.
RewriteRule ^all-private /private/login.html [L]
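The thread never shows bad-bot.pl itself. For readers wondering what such an IP-banning script might do, here is a minimal hypothetical sketch in Python; the filename, deny-file path, and response text are illustrative assumptions, not Jim's actual code. The idea is simply to record the trapped client's IP in a flat file that a Deny rule (or the script itself, on later requests) consults:

```python
# Hypothetical sketch of a spider-trap CGI script in the spirit of bad-bot.pl:
# log the offending client's IP to a flat deny file, then return a 403.
# The deny-file path and response format are assumptions for illustration.
from datetime import datetime, timezone

DENY_FILE = "/var/www/trap/denied-ips.txt"  # assumed location

def trap(remote_addr: str, user_agent: str, deny_file: str = DENY_FILE) -> str:
    """Append the client's IP to the deny list (once) and build a 403 response."""
    try:
        with open(deny_file, "r+", encoding="utf-8") as f:
            # First field of each line is the banned IP.
            known = {line.split()[0] for line in f if line.strip()}
            if remote_addr not in known:
                stamp = datetime.now(timezone.utc).isoformat()
                f.write(f"{remote_addr} {stamp} {user_agent}\n")
    except FileNotFoundError:
        with open(deny_file, "w", encoding="utf-8") as f:
            stamp = datetime.now(timezone.utc).isoformat()
            f.write(f"{remote_addr} {stamp} {user_agent}\n")
    # CGI-style response: header block, blank line, body.
    return "Status: 403 Forbidden\r\nContent-Type: text/plain\r\n\r\nForbidden."
```

A real deployment would wire this up via REMOTE_ADDR and HTTP_USER_AGENT from the CGI environment, and the exclusion RewriteConds above keep legitimate crawlers and WAP proxies from ever reaching it.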
Jim