Forum Moderators: coopster & phranque

A Close to perfect .htaccess ban list

         

toolman

3:30 am on Oct 23, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's the latest rendition of my favorite ongoing artwork....my beloved .htaccess file. I've become quite fond of my little buddy, the .htaccess file, and I love the power it gives me to exclude vermin, pestoids and undesirable entities from my web sites.

Gorufu, littleman, Air, SugarKane? You guys see any errors or better ways to do this....anybody got a bot to add....before I stick this in every site I manage.

Feel free to use this on your own site and start blocking bots too.

(the top part is left out)

<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule !^http://[^/.]\.your-site.com.* - [F]

SomeCallMeTim

12:56 pm on Dec 4, 2002 (gmt 0)

10+ Year Member



Is there a way to use upside's method of:

SetEnvIf Remote_Addr ^12\.40\.85\. getout
SetEnvIfNoCase User-Agent ^Microsoft.URL getout

<Limit GET POST>
order allow,deny
allow from all
deny from env=getout
</Limit>

but return something more ambiguous than a 403 so that the person trying to grab the site is confused...say a 304 for page not modified for example?

Is upside's method more expensive than using rewrite?

Thanks

webmasta

8:45 pm on Dec 6, 2002 (gmt 0)

10+ Year Member


back to the bot list ... i am new here (first post) and found this thread very interesting so i tried it on my site/

i downloaded the new version 4.37 of BlackWidow ... and used the last updated list that superman posted / somehow BlackWidow is getting past the htaccess file and downloading the site. However when i tried it from http://www.wannabrowser.com/ and changed the UA to BlackWidow it was blocked .. getting the 403 from there /

any takes on that? this new black widow comes with a ton of plugins that could get around almost anything including encrypted sites! and can decode script generated url's....

Next .. i noticed "Web Spider" in my logs. Any idea who this is?

Also .. while searching for some offline browsers i came across www.matuschek.net/software/jobo/index.html .. seems like this JoBo is a smartass / it gives u the option to change the UA to anything u want! of course when i added JoBo to the htaccess file it was blocked but when i changed the UA to Mozilla it downloaded the entire site!

This one is dangerous! guess we might as well start thinking that instead of blocking unwanted bots we only allow the good ones.... it's just as hard or even harder to keep on top of the bad ones and add them to the list as keeping on top of the good ones and add them /

webmasta

[edited by: jatar_k at 9:04 pm (utc) on Dec. 6, 2002]

jdMorgan

3:05 am on Dec 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Wow - Looks like this thread woke up again!

pmkpmk,
You'd be right - spoofing a known-harmless user-agent is a popular technique. Most site scrapers don't know how to change the UA though, or don't bother to do it. For those who do, there are other measures to dispatch them - see the last part of this post.

Andy_White,
Yes, the virtual account has to have sufficient "permissions" for .htaccess to do its job, so AllowOverride needs to be set up correctly. In addition, many setups will require the Options +FollowSymLinks directive to precede the RewriteEngine on directive in per-directory .htaccess.
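
As a minimal sketch of that ordering, the top of a per-directory .htaccess might begin as shown here. It assumes the host's AllowOverride setting permits both the Options and FileInfo overrides, which only the server's main configuration can grant.

```apache
# Assumption: AllowOverride for this directory includes Options and FileInfo.
# If it does not, the server typically answers every request with a 500 error.
Options +FollowSymLinks
RewriteEngine on
```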

okidata,
Banning by country doesn't work well for several reasons: First, the domain name for the requesting IP must be available in reverse DNS in order to get the country code as remote_host. That's not always true, and reverse DNS is slow. Second, you are banning an IP assigned to the ISP, not necessarily the user or a group of users. The user and ISP could be in different countries by virtue of the international nature of telecom. Even banning by IP numbers assigned to countries doesn't work all that well. There is no central "map" to tell you what address blocks go with what country - and they are assigned piecemeal and willy-nilly. There are, however, some nice subscription services to make the info available to you - big buck$, though...

Part two. Use order "deny,allow", "deny from" and "allow from" to get past your problem.

The following bans two IP addresses, but allows all to access 403.html and robots.txt:


SetEnvIf Remote_Addr ^12\.148\.209\.196$ banit
SetEnvIf Remote_Addr ^65\.80\.255\.116$ banit
SetEnvIf Request_URI "^(/403\.html|/robots\.txt)$" allowit
<Files *>
order deny,allow
deny from env=banit
allow from env=allowit
</Files>

upside,
You don't need mod_rewrite to serve a custom error page. All you need is ErrorDocument 403 /your403page.html and/or ErrorDocument 404 /your404page.html in your .htaccess file at web root.

SomeCallMe ... Tim?
Neet!
No, you can't return a bogus server code. You could redirect to a PERL script and start a very long delay, or just execute a die without creating an html response I suppose, but it's hardly worth it, IMHO. Think of 403-Forbidden responses as a raised middle finger, and take joy in sending them! You want to send the shortest response possible to bad-bots, while still giving enough info to an unintentionally-denied visitor to allow him/her to fix problems such as misconfigured Norton Internet Security settings.

Webmasta,
This anomaly is probably due to the fact that you are running blackwidow from inside your server.

JoBo leaves its url in your server log in order to get your attention so you'll buy it. Bad strategy with this group, eh?

Blocking by allowing only "known good" user-agents doesn't work well - I tried it myself. The problem is that major search engines and directories come up with new UA variations and new IP addresses all the time - too hard to keep up with and the penalty might be getting dropped from the search engine. Also, even legitimate browsers have thousands of variations of UA layout.

Even "Mozilla" can't download my site - a multi-defense solution is needed, though:

--

As some have pointed out in earlier posts, there is no perfect solution. My UA ban list is about four times larger than the ones posted here, and server workload is negligible because I have "compressed" the UA list. But my sites are small - hits in the hundreds or thousands per day, but rarely more. So your approach may need to be different than mine. I block by UA, http referer, request method, remote host, remote address (IP address), requested protocol, and combinations of several of these.
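
The actual "compressed" list is not shown in this post, so the following is only a guess at the general idea: agents that share a prefix are folded into one condition with regex alternation. The agent names are taken from the lists posted in this thread; the grouping itself is an assumption.

```apache
RewriteEngine on
# One condition per shared prefix instead of one condition per agent
RewriteCond %{HTTP_USER_AGENT} ^Email(Siphon|Wolf|Collector) [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web(Auto|Copier|Fetch|Leacher|Reaper|Sauger|Stripper|Whacker|ZIP)
# The last condition carries no [OR]; a match on either line returns 403
RewriteRule .* - [F]
```

Fewer conditions means fewer regex evaluations per request, which is presumably where the workload saving comes from.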

I have also implemented a version of the bad-bot banning script available in the archives here at WebmasterWorld. Search for "malicious robots PERL script" using WebmasterWorld site search for more info. This script and its associated traps tend to be very good at catching heretofore-unreported user-agents that come to your site and try to have their way with it. They'll get a few pages or objects, and then the door gets slammed in their face. Next stop - WebmasterWorld Search Engine Spider Identification forum to report them. :)

HTH,
Jim

webmasta

4:13 am on Dec 7, 2002 (gmt 0)

10+ Year Member



thnx jd..
>>This anomaly is probably due to the fact that you are running blackwidow from inside your server. /

but no .. BlackWidow is sitting on a normal machine with internet access .. just called up the url to the site and it shows up in browser mode... i would think that blackwidow would be sending its UA to get that url?/

something doesnt seem right... it grabbed the entire site like if the htaccess file wasnt there .. and when i tried to access the same site from wannabrowser with BlackWidow as the UA it was blocked .. could be that bw is spoofing when using it as a browser?

i checked my logs .. i see nothing there about blackwidow ..
hmmm

webmasta

webmasta

5:06 am on Dec 7, 2002 (gmt 0)

10+ Year Member



more..

i am getting this a lot in my logs "libwww perl" i tracked it to this website > www.linpro.no/lwp/

scroll down the page on the above site .. i see 2 bots there that i dont see in any of the htaccess list so far in this thread > webPluck and webMirror < but then again there are good bots based on that library also/

webmasta

webmasta

8:18 am on Dec 8, 2002 (gmt 0)

10+ Year Member



further to blackwidow 4.37/

i did a script to trap the UA for blackwidow when used in browser mode.. seems like the spidirt is using the UA from whatever default browser u have on board .. i couldnt tell the difference from the printout when i was using IE or black widow .. both UA strings were the same ...:o

and of course i was able to download the entire site while scanning... all packaged nicely and laid out like a picnic table .. i know it was bad before but now a predator in disguise.

Obviously creative thinking is needed ..

webmasta

58sniper

4:29 am on Dec 12, 2002 (gmt 0)

10+ Year Member


jdMorgan -

I'm attempting to use your correction of my code, and it doesn't seem to be blocking.

RewriteCond %{HTTP_REFERER} ^http://(www\.)?flipdog\.com [NC]
RewriteRule .* /simple.php?sid=robots [F,L]

Is still letting flipdog.com through. Also, I'm getting traffic from bsb.jobs.flipdog.com that I need to block as well.

I'm also using:
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mydomain.com.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(dev\.)?mydomain.com.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://localhost/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://myipaddress$ [NC]
RewriteRule \.(gif|jpg|zip|pdf)$ http://www.mydomain.com/dev/apology.gif [R,L]

And this does work when viewing the flipdog site. At least my images are not there. But the content, including formatting, is.

My legal staff is sending them a nasty gram today, but I need to do something ASAP.
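
For the subdomain referrers mentioned above (bsb.jobs.flipdog.com), one possible pattern is sketched here. It is an untested assumption rather than a confirmed fix, and it only helps when the request actually carries a flipdog.com referrer; it does nothing against a server-side fetch that sends no Referer header.

```apache
RewriteEngine on
# Matches flipdog.com and any subdomain, e.g. bsb.jobs.flipdog.com
RewriteCond %{HTTP_REFERER} ^http://([a-z0-9-]+\.)*flipdog\.com(/|$) [NC]
# "-" means no substitution; [F] returns 403 Forbidden
RewriteRule .* - [F]
```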

[edited by: jatar_k at 4:56 pm (utc) on Dec. 12, 2002]

Edge

5:49 pm on Dec 26, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The following browser entry clearly doesn't work for my website .htaccess,

"RewriteCond %{HTTP_USER_AGENT} ^MSIECrawler [OR] "

Is there an alternative entry for this browser within .htaccess? MSIECrawler actually shows as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461; .NET CLR 1.0.3705; MSIECrawler)"

MSIECrawler does check my robots.txt, so I disallowed it there.

Suggestions?

Thanks in advance!

webmasta

2:47 am on Dec 27, 2002 (gmt 0)

10+ Year Member



Edge....

>>>> RewriteCond %{HTTP_USER_AGENT} ^MSIECrawler [OR]

Seems like ure looking to match MSIECrawler at the start of the UA string when it appears at the end /

>>>MSIECrawler actually shows as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461; .NET CLR 1.0.3705; MSIECrawler)"

try - RewriteCond %{HTTP_USER_AGENT} MSIECrawler [OR] -
without ^... the ^ will try to match "MSIECrawler" at the start of the string and when it doesnt find it will just move on to the next rewrite .. it wouldnt look down the string to the end.
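
To illustrate the difference, here is a small sketch that blocks the agent string Edge quoted. The pattern is left unanchored so it can match at the end of the header; the [F] rule is only an assumed action, following the earlier lists in this thread.

```apache
RewriteEngine on
# No leading ^, so "MSIECrawler" may appear anywhere in the User-Agent,
# including at the end of "Mozilla/4.0 (compatible; MSIE 6.0; ...; MSIECrawler)"
RewriteCond %{HTTP_USER_AGENT} MSIECrawler
RewriteRule .* - [F]
```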

hope this helps...
webmasta

maxidrom11

1:51 pm on Jan 7, 2003 (gmt 0)



I have recently published the following .htaccess and the server gives me 500 error, could you check if something is not correct, please!

ErrorDocument 401 /custompage.html
ErrorDocument 403 /custompage.html
ErrorDocument 404 /custompage.html
ErrorDocument 500 /custompage.html
RewriteOptions +FollowSymLinks
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule !^custompage\.html$ /custompage.html [L]
RewriteCond %{REMOTE_ADDR} ^217\.113\.22[4-8]\. [OR]
RewriteCond %{REMOTE_ADDR} ^217\.113\.229\.([0-9]|[1-9][0-9]|1[01][0-9]|12[0-7])$
RewriteRule .* [somesite.com...]

[edited by: jatar_k at 6:42 pm (utc) on Jan. 7, 2003]
[edit reason] removed specifics [/edit]

This 243 message thread spans 25 pages.