A Close to perfect .htaccess ban list - Part 2
adriaant
11:46 pm on May 14, 2003 (gmt 0)

<modnote>
continued from [webmasterworld.com...]



Ugh, bad typo in my original post. Here's the better version (I wasn't able to re-edit the older post?):

I'm trying to ban sites by domain name, since there have recently been lots of referrer spammers.

I have, for example, the rule:

RewriteCond %{HTTP_REFERER} ^http://(www\.)?.*stuff.*\.com/.*$ [NC]
RewriteRule ^.*$ - [F,L]

which should ban any site whose domain contains the word "stuff":
www.stuff.com
www.whatkindofstuff.com
www.some-other-stuff.com

and so on.

However, it is not working, so I'm sure I did not set up a proper pattern match. Anyone care to advise?
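(For reference, a minimal sketch of an unanchored alternative - a RewriteCond pattern is a plain regex match against the referrer, so a bare substring catches "stuff" anywhere in the URL; "stuff" here is just a placeholder for the word being blocked:)

RewriteCond %{HTTP_REFERER} stuff [NC]
RewriteRule .* - [F,L]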

[edited by: jatar_k at 5:06 am (utc) on May 20, 2003]

 

photoman
1:11 pm on Jul 6, 2003 (gmt 0)

Hello Everyone,

I am new here, and have just recently started hosting my own site on a real server (i.e. with cgi-bin access and so on); I've been lurking for a few days and have learned a lot! So thank you to everyone.

I would like to ask 2 questions:

First, I would like to post my .htaccess contents and if someone knowledgeable could look it over and let me know if they see any errors or such, I would appreciate it. I've built this file using what I've found in these forums, and it includes a spider trap I found here which seems to be working great.

Second, by looking at the .htaccess contents, can anyone tell me why the spider semanticdiscovery/0.2 is being blocked from my site? I can't figure out why, since it is not included in my .htaccess, and it makes no sense to me. It's not that I really care about that spider; I'm just trying to understand. My thinking is: if it can ban that spider without me knowing I've banned it, what else will it ban without my knowledge or wishes?

BTW, I do not have access to mod_rewrite on my server.

And finally, before I post the .htaccess file, I don't believe I've violated any forum rules or etiquette here, but if I have, my apologies.

Here is the .htaccess contents:

SetEnvIf Remote_Addr ^66\.28\.139\.66$ getout # Sat Jul 5 13:44:59 2003 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)
SetEnvIf Remote_Addr ^218\.18\.32\.60$ getout # Sat Jul 5 11:17:30 2003 Mozilla/4.0 (compatible; MSIE 5.5;Windows NT 5.0)
SetEnvIf Remote_Addr ^68\.2\.119\.169$ getout # Fri Jul 4 02:06:54 2003 Mac Finder 1.0.22
SetEnvIf Remote_Addr ^212\.138\.47\.20$ getout
SetEnvIfNoCase User-Agent "Alexibot" getout
SetEnvIfNoCase User-Agent "asterias" getout
SetEnvIfNoCase User-Agent "autoemailspider" getout
SetEnvIfNoCase User-Agent "b2w 0.1" getout
SetEnvIfNoCase User-Agent "BackWeb" getout
SetEnvIfNoCase User-Agent "BackDoorBot 1.0" getout
SetEnvIfNoCase User-Agent "Black Hole" getout
SetEnvIfNoCase User-Agent "BlackWidow" getout
SetEnvIfNoCase User-Agent "BlowFish 1.0" getout
SetEnvIfNoCase User-Agent "CherryPicker 1.0" getout
SetEnvIfNoCase User-Agent "CherryPickerSE 1.0" getout
SetEnvIfNoCase User-Agent "CherryPickerElite 1.0" getout
SetEnvIfNoCase User-Agent "ChinaClaw" getout
SetEnvIfNoCase User-Agent "Collector" getout
SetEnvIfNoCase User-Agent "Copier" getout
SetEnvIfNoCase User-Agent "Crescent" getout
SetEnvIfNoCase User-Agent "Crescent Internet ToolPak HTTP OLE Control v.1.0" getout
SetEnvIfNoCase User-Agent "Custo" getout
SetEnvIfNoCase User-Agent "DISCo" getout
SetEnvIfNoCase User-Agent "DISCo Pump" getout
SetEnvIfNoCase User-Agent "DISCo Pump 3.1" getout
SetEnvIfNoCase User-Agent "Download Demon" getout
SetEnvIfNoCase User-Agent "Download Wonder" getout
SetEnvIfNoCase User-Agent "Downloader" getout
SetEnvIfNoCase User-Agent "Drip" getout
SetEnvIfNoCase User-Agent "eCatch" getout
SetEnvIfNoCase User-Agent "EirGrabber" getout
SetEnvIfNoCase User-Agent "EmailCollector" getout
SetEnvIfNoCase User-Agent "EmailCollector 1.0" getout
SetEnvIfNoCase User-Agent "EmailSiphon" getout
SetEnvIfNoCase User-Agent "EmailWolf" getout
SetEnvIfNoCase User-Agent "EmailWolf 1.00" getout
SetEnvIfNoCase User-Agent "Express WebPictures" getout
SetEnvIfNoCase User-Agent "ExtractorPro" getout
SetEnvIfNoCase User-Agent "EyeNetIE" getout
SetEnvIfNoCase User-Agent "FileHound" getout
SetEnvIfNoCase User-Agent "Flaming AttackBot" getout
SetEnvIfNoCase User-Agent "FlashGet" getout
SetEnvIfNoCase User-Agent "GetRight" getout
SetEnvIfNoCase User-Agent "GetSmart" getout
SetEnvIfNoCase User-Agent "GetWeb!" getout
SetEnvIfNoCase User-Agent "Go!Zilla" getout
SetEnvIfNoCase User-Agent "Go-Ahead-Got-It" getout
SetEnvIfNoCase User-Agent "gotit" getout
SetEnvIfNoCase User-Agent "Grabber" getout
SetEnvIfNoCase User-Agent "GrabNet" getout
SetEnvIfNoCase User-Agent "Grafula" getout
SetEnvIfNoCase User-Agent "Harvest 1.5" getout
SetEnvIfNoCase User-Agent "HMView" getout
SetEnvIfNoCase User-Agent "HTTrack" getout
SetEnvIfNoCase User-Agent "Image Stripper" getout
SetEnvIfNoCase User-Agent "Image Sucker" getout
SetEnvIfNoCase User-Agent "Indy Library" getout
SetEnvIfNoCase User-Agent "InterGET" getout
SetEnvIfNoCase User-Agent "Internet Ninja" getout
SetEnvIfNoCase User-Agent "Iria" getout
SetEnvIfNoCase User-Agent "JetCar" getout
SetEnvIfNoCase User-Agent "JOC Web Spider" getout
SetEnvIfNoCase User-Agent "JOC" getout
SetEnvIfNoCase User-Agent "JustView" getout
SetEnvIfNoCase User-Agent "larbin" getout
SetEnvIfNoCase User-Agent "lftp" getout
SetEnvIfNoCase User-Agent "LeechFTP" getout
SetEnvIfNoCase User-Agent "likse" getout
SetEnvIfNoCase User-Agent "Magnet" getout
SetEnvIfNoCase User-Agent "Mag-Net" getout
SetEnvIfNoCase User-Agent "Mass Downloader" getout
SetEnvIfNoCase User-Agent "Memo" getout
SetEnvIfNoCase User-Agent "MIDown tool" getout
SetEnvIfNoCase User-Agent "Mirror" getout
SetEnvIfNoCase User-Agent "Mister PiX" getout
SetEnvIfNoCase User-Agent "Navroad" getout
SetEnvIfNoCase User-Agent "NearSite" getout
SetEnvIfNoCase User-Agent "NetAnts" getout
SetEnvIfNoCase User-Agent "NetSpider" getout
SetEnvIfNoCase User-Agent "Net Vampire" getout
SetEnvIfNoCase User-Agent "NetZIP" getout
SetEnvIfNoCase User-Agent "NICErsPRO" getout
SetEnvIfNoCase User-Agent "Ninja" getout
SetEnvIfNoCase User-Agent "Octopus" getout
SetEnvIfNoCase User-Agent "Offline Explorer" getout
SetEnvIfNoCase User-Agent "Offline Navigator" getout
SetEnvIfNoCase User-Agent "PageGrabber" getout
SetEnvIfNoCase User-Agent "Papa Foto" getout
SetEnvIfNoCase User-Agent "pavuk" getout
SetEnvIfNoCase User-Agent "pcBrowser" getout
SetEnvIfNoCase User-Agent "Pump" getout
SetEnvIfNoCase User-Agent "RealDownload" getout
SetEnvIfNoCase User-Agent "Reaper" getout
SetEnvIfNoCase User-Agent "Recorder" getout
SetEnvIfNoCase User-Agent "ReGet" getout
SetEnvIfNoCase User-Agent "Siphon" getout
SetEnvIfNoCase User-Agent "SiteSnagger" getout
SetEnvIfNoCase User-Agent "SmartDownload" getout
SetEnvIfNoCase User-Agent "Snake" getout
SetEnvIfNoCase User-Agent "SpaceBison" getout
SetEnvIfNoCase User-Agent "Sucker" getout
SetEnvIfNoCase User-Agent "SuperBot" getout
SetEnvIfNoCase User-Agent "SuperHTTP" getout
SetEnvIfNoCase User-Agent "Surfbot" getout
SetEnvIfNoCase User-Agent "tAkeOut" getout
SetEnvIfNoCase User-Agent "Teleport" getout
SetEnvIfNoCase User-Agent "Teleport Pro" getout
SetEnvIfNoCase User-Agent "Teleport Pro/1.29.1718" getout
SetEnvIfNoCase User-Agent "Teleport Pro/1.29.1632" getout
SetEnvIfNoCase User-Agent "Teleport Pro/1.29.1590" getout
SetEnvIfNoCase User-Agent "Teleport Pro/1.29.1616" getout
SetEnvIfNoCase User-Agent "Vacuum" getout
SetEnvIfNoCase User-Agent "VoidEYE" getout
SetEnvIfNoCase User-Agent "WebAuto" getout
SetEnvIfNoCase User-Agent "WebBandit" getout
SetEnvIfNoCase User-Agent "WebBandit 2.1" getout
SetEnvIfNoCase User-Agent "WebBandit 3.50" getout
SetEnvIfNoCase User-Agent "webbandit 4.00.0" getout
SetEnvIfNoCase User-Agent "WebCapture 2.0" getout
SetEnvIfNoCase User-Agent "WebCopier v.2.2" getout
SetEnvIfNoCase User-Agent "WebCopier v3.2a" getout
SetEnvIfNoCase User-Agent "WebCopier" getout
SetEnvIfNoCase User-Agent "WebEMailExtractor 1.0B" getout
SetEnvIfNoCase User-Agent "WebFetch" getout
SetEnvIfNoCase User-Agent "WebGo IS" getout
SetEnvIfNoCase User-Agent "Web Image Collector" getout
SetEnvIfNoCase User-Agent "Web Sucker" getout
SetEnvIfNoCase User-Agent "WebLeacher" getout
SetEnvIfNoCase User-Agent "WebReaper" getout
SetEnvIfNoCase User-Agent "WebSauger" getout
SetEnvIfNoCase User-Agent "Website" getout
SetEnvIfNoCase User-Agent "Website eXtractor" getout
SetEnvIfNoCase User-Agent "Website Quester" getout
SetEnvIfNoCase User-Agent "Webster" getout
SetEnvIfNoCase User-Agent "WebStripper" getout
SetEnvIfNoCase User-Agent "WebWhacker" getout
SetEnvIfNoCase User-Agent "WebZIP" getout
SetEnvIfNoCase User-Agent "WebZip/4.0" getout
SetEnvIfNoCase User-Agent "WebZIP/4.21" getout
SetEnvIfNoCase User-Agent "WebZIP/5.0" getout
SetEnvIfNoCase User-Agent "Wget" getout
SetEnvIfNoCase User-Agent "Wget/1.5.3" getout
SetEnvIfNoCase User-Agent "Wget/1.6" getout
SetEnvIfNoCase User-Agent "Whacker" getout
SetEnvIfNoCase User-Agent "Widow" getout
SetEnvIfNoCase User-Agent "WWW-Collector-E" getout
SetEnvIfNoCase User-Agent "WWWOFFLE" getout
SetEnvIfNoCase User-Agent "Xaldon" getout
SetEnvIfNoCase User-Agent "Xaldon/WebSpider" getout
# Block bad-bots using lines written by bad_bot.pl script above
SetEnvIf Request_URI "^(/403.*\.shtml|/robots\.txt|/file_instead_of_what_they_want\.html)$" allowsome

# Deny anything tagged "getout" above, but keep the 403 page, robots.txt,
# and the spider-trap decoy ("allowsome") reachable, so banned clients
# still receive a proper error page
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>

<Files .htaccess>
order deny,allow
deny from all
</Files>

Thanks.

photoman
6:57 pm on Jul 6, 2003 (gmt 0)

Follow-up to my own post:

Interesting. Concerning the banning of semanticdiscovery/0.2 that I mentioned above: if I remove the entry for the agent 'DISCo', the Semantic DISCOvery agent is no longer banned. So either my code in .htaccess is not what it should be, or there is a better, more specific way to ban particular user agents (obviously, I am assuming it is the common occurrence of the contiguous word DISCO in both user agents that is causing the ban). Has anyone seen this before? Is this the way it should work? How can I ban SPECIFIC user agents through .htaccess without unwittingly banning other, potentially harmless or welcome user agents that share common parts of their UA names?

Any thoughts or help are appreciated.

Thanks.

claus
7:06 pm on Jul 6, 2003 (gmt 0)

photoman,

SetEnvIfNoCase has "NoCase" in it. If you use this instead:

SetEnvIf

then upper- and lowercase matter, so "DISCo" and "disco" are not treated as the same word.

You can also try this syntax:

SetEnvIf User-Agent "^DISCo" getout

The ^ character means that the string must start with "DISCo". The two other DISCo entries you have will also be caught by this one.
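(To see why the unanchored, case-insensitive form caught semanticdiscovery/0.2, compare the two - a sketch:)

# matches "disco" anywhere in the UA, any case - so
# "semanticdiscovery/0.2" is tagged too
SetEnvIfNoCase User-Agent "DISCo" getout
# anchored and case-sensitive: only UAs beginning with "DISCo",
# such as "DISCo Pump 3.1", are tagged
SetEnvIf User-Agent "^DISCo" getout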

/claus

photoman
7:39 pm on Jul 6, 2003 (gmt 0)

Thanks very much for those pointers, claus - that has done the trick. :)

pmkpmk
4:16 pm on Jul 17, 2003 (gmt 0)

Hi there,

I use an .htaccess file more or less copied from the previous thread, which is some 3 months old. I just found out that I blocked WebmasterWorld's Keyword Density measurer because I blocked empty UAs - probably not a good idea. After changing this, I ran some quick statistics and was a bit shocked/unsure whether some blocks might be a good idea after all. The reason for blocking is to keep address harvesters out of my site, but I blocked these fellows as well:

Mozilla/4.0 (compatible; grub-client-0.3.0; Crawl your own stuff with [grub.org)...]
Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; [inktomi.com...]
sitecheck.internetseer.com (For more info see: [sitecheck.internetseer.com)...]
SurveyBot/2.2 <a href='http://www.****'>Whois Source</a>
SurveyBot/2.3 (Whois Source)

Are those harmless or harmful?

jdMorgan
4:41 pm on Jul 17, 2003 (gmt 0)

pmkpmk,

You almost certainly do not wish to block Slurp, since that is Inktomi's spider. This feeds MSN and several others now.

Early versions of Grub do not fetch or obey robots.txt. No-one seems to know where the data collected by Grub is to be used. For these two reasons, many block Grub, or at least the early versions.
I can confirm that grub-client-1.4.3 seems to obey robots.txt, but 1.3.7 does not even check it.

Internet Seer is either good (if you use the service) or not (if you don't).

SurveyBot from Whois Source is OK IMO; one of our members here works there.

Opinions may vary widely (and wildly) on these user-agents. The opinions here are my own. YMMV.

HTH,
Jim

pmkpmk
4:50 pm on Jul 17, 2003 (gmt 0)

Hi jdMorgan,

I see you made it to the new thread as well? I valued your feedback in the old thread very much.

I almost thought as much about Inktomi... The problem is: I have looked the .htaccess up and down and I CAN'T FIND the condition that blocks Inktomi!

I'm kind of hesitant to post an outdated list here, and therefore took the liberty of sending it to you via stickymail. If you'd be kind enough to have a look... If not, that's OK as well.

Thanks!

Wizcrafts
6:05 pm on Jul 17, 2003 (gmt 0)

I have another User Agent to add to our blocklists, but the condition is iffy.

While reading and reviewing my weblogs I found several instances of this unique User Agent misspelling in my normally restricted spam-bait page logs. The UA is a typo of a common agent, and is used by two separate identifiable spam Domains.

Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)

Notice the space between "compatible" and the semicolon? The legit UA has no space there. Since my logs for the last 6 months show only harvesters using this UA, I am banning it. I think it might have been circulated among spammers with one of their bot programs, hence the same misspelling from a few IPs, traced to spam domains.

Here is my RewriteCond for it, tested with wannabrowser:

RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible\ ;\ MSIE\ 6\.0;\ Windows\ NT\ 5\.1\)$
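(A RewriteCond on its own does nothing until a RewriteRule follows it; paired with the generic forbidding rule used elsewhere in this thread, the block would be:)

RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible\ ;\ MSIE\ 6\.0;\ Windows\ NT\ 5\.1\)$
RewriteRule .* - [F]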

The current IP source using this UA to look for addresses and eat poison is at 68.59.94.40

Wiz Feinberg

wkitty42
10:51 pm on Jul 17, 2003 (gmt 0)

wizcrafts,

that's a Comcast IP down in Panama City, Florida... possibly a call and/or email with evidence to them will assist... you may also want to review their TOS and see if there's a violation that will help them in nuking your intruder... if nothing else, definitely start LARTing them about it...

jazzguy
7:19 pm on Jul 27, 2003 (gmt 0)

jdMorgan wrote:
I can confirm that grub-client-1.4.3 seems to obey robots.txt, but 1.3.7 does not even check it.

According to my logs (from multiple sites), grub-client-1.4.3 does not obey robots.txt. I've had "grub-client" disallowed in my robots.txt files for over a month but all versions of Grub disobey it. grub-client-1.4.3 does check robots.txt, but then proceeds to disobey it (as recently as today). My robots.txt file validates and uses the User-agent ("grub-client") given on Grub's robots FAQ page.

Even if they did decide to start obeying robots.txt (which they lie about doing in their FAQ), I would still ban them based on their opinion that banning their bot is a "Draconian approach to [their] presence" (also from their robots FAQ page).

Sparky365
4:33 am on Jul 29, 2003 (gmt 0)

Hello folks, I'm new here but have read this entire 20-page thread three times to learn more about restricting UAs. The information shared speaks volumes about the depth of knowledge members have.

I had been blocking some of the UAs listed, but not all of them. To update my list, I've copied and pasted from this board; the list below is current as of this post. I saw requests for people to share an updated list, so I hope mine will help. I do have some IP-tracking questions, but will hold off posting them for now as I prefer to give something back first.

# Alphabetized List Of Site Snaggers
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^Alligator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^attach [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BackWeb [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Bandit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Buddy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Collector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Copier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DA [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo\Pump [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DLExpert [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Master [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Ninja [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\Wonder [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Drip [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FileHound [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FlipDog [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FreshDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetSmart [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GornKer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^gotit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Grabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HiDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InternetSeer.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Iria [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Irvine [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^iwantmy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JustView [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^lftp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^likse [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Link [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Magnet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mag-Net [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Memo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MetaProducts\ Download\ Express [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mirror [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MyGetRight [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetButler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetPumper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Ninja [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Nitro\ Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Ping [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Pockey [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Pump [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PuxaRapido [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Reaper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Recorder [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Siphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Snake [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SpaceBison [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SpeedDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Stripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Vacuum [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Webdup [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Go [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebPictures\ Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Webster [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWasher [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Whacker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ZyBorg
RewriteRule !^http://[^/.]\.your-site.com.* - [F]

stevenha
1:36 am on Jul 30, 2003 (gmt 0)

Can I ask a question about the merits of using a long list of user agents versus just a small list?

I checked my logs for this month and didn't find any entries from prospector, AsiaNetBot, ASSORT, attache, ATHENS, autohttp, bew, BlackWidow, ^Bot, Bullseye, CherryPicker, ChinaClaw, Crescent, curl, devsoft's\ http\ component, or ^Deweb.

Because it was taking a long time to grep through my logs for each of these, I gave up grepping at the alphabetical "D"s... And by the way, these names came from a reply someplace earlier in this thread (but not the most recent list graciously posted by Sparky365).

So my question (coming from someone totally new to .htaccess) is:

At what point do people start working on making these lists shorter instead of longer? Might there be spiders and robots in these lists that come rarely enough to ignore? Should a site (like mine) with about 200 static pages and moderate user traffic just focus on the "top 10" or "top 20" problem spiders and leave the others alone?

Or is the CPU load created by long user-agent lists basically negligible in the big picture... and brute-force spider-blocking the way to go?

stevenha
6:49 am on Jul 30, 2003 (gmt 0)

A couple of typos, maybe... in Sparky365's list, I think.

^Download\ Wonder
^DISCo\ Pump
(In both of these, a space was missing after the "\", but you also seem to have a redundant ^DISCo preceding ^DISCo\ Pump.)

balam
8:40 pm on Jul 30, 2003 (gmt 0)

I'm not faulting anyone, but I think one reason these lists are so long is end users' incomplete grasp of regular expressions... Using regexps can really shorten a list yet still retain all the "power" of the long one.

Here's an example... What takes over a dozen lines in the latest posted .htaccess can be condensed to two lines (and catch more!)... (Actually, it can be condensed to one line, but it's two so I don't have to side-scroll so much.)


RewriteCond %{HTTP_USER_AGENT} ^Web.?(Auto|Cop|dup|Fetch|Filter|Gather|Go|Leach|Mine|Mirror|Pix|QL|RACE|Sauger) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(site.?(eXtractor|Quester)|Snake|ster|Strip|Suck|vac|walk|Whacker|ZIP) [NC,OR]

(Note that not every "starts with 'Web'" bot is covered by these two lines. These two lines are an excerpt from my "site-grabbers" section.)

What's happening in these regexps? They catch any UA that starts with "Web", "web" or "wEb" - the "[NC..." part at the end indicates that I want a case-INsensitive match. (But we all know that, because we've read this thread from the start, right? :)

So, the UA starts with "Web" and can, but need not, be followed by any character. (It catches "Web Auto", "webauto", "WEB-AUTO", "wEb#aUtO" - get the picture?)

The part within the parentheses catches all the different site-grabbing UAs. Some of the agents have a space between "Web" and "whatever" and some don't. Some UAs are similar enough to each other that you can kill two birds with one stone - for "Web Copier" and "WebCopy" you might as well just search for "Web[optional character]Cop" (or, as above, RewriteCond %{HTTP_USER_AGENT} ^Web.?Cop [NC,OR]). (Do keep in mind that this will also catch UAs like "Web Copolymerization v1.0". Not a likely browser, I'll admit, but since I don't have a $ at the end of the RewriteCond, you should be aware of this if you cut 'n' paste.)

The vertical pipes - | - separate the different "Web[whatever]" UAs - "WebAuto", "Web Copier", "Webdup"...

There you go... More than two dozen UAs caught in only two lines; a shorter list that is just as powerful - if not more so - than two dozen separate lines. (And since .htaccess files are processed line-by-line for each request, you've just shaved 22 lines of processing off each and every visit.)

Another example? Ok, who here likes Microsoft? That's what I thought...


RewriteCond %{HTTP_USER_AGENT} ^(Microsoft|MFC).(Data|Internet|URL|WebDAV|Foundation).(Access|Explorer|Control|MiniRedir|Class) [NC,OR]

That one line says bye-bye! to...

Microsoft Data Access
Microsoft Internet Explorer
Microsoft URL Control
Microsoft WebDAV MiniRedir
MFC Foundation Class

...as well as (currently) non-existent UAs like "Microsoft Internet Control". (Eeewww, that's a scary-sounding browser!)

Ok, so far I don't think I've directly answered stevenha's question, so I'll add this...

Who to start banning? The UAs that are visiting you now that you don't want. After that, READ AND LEARN about the other UAs we all seem so eager to ban. Decide if it's right for you to ban them.

And if you just cut 'n' paste the .htaccess files you see here, you must go over them with a fine-toothed comb to realize just who is being banned. I believe a couple of the files posted in this thread have banned the likes of Google and Inktomi. Some folks definitely don't want Google around, but are you going to leave that decision up to me, or make it for yourself?

Hmm, hope that helped someone...

claus
10:30 pm on Jul 30, 2003 (gmt 0)

Great post balam :):)

>> Decide if it's right for you to ban them.

That's right, it's an individual choice - and as we're now on part two of this wonderful thread, it's quite easy to get all excited and ban way too much. Don't. Some of the lists that have been posted here are really... hmm... sophisticated. They need some serious individual judgement and an experienced eye. Implementing such powerful stuff means you risk doing something you did not intend to. Please remember this. It could be slamming the door on search engines (there have been a few, yes), and it could be banning tools used by disabled people.

And even if you ban only 100% malicious, evil-minded bots, it's not even certain that they would all care to visit you; some operate on sites in certain languages, some even for certain types of content. Banning something that just isn't relevant for you means that your server has to process more than it needs to - for each and every file request, that is.

Consider starting with an empty list instead of a very long one, then add only the ones you definitely don't like (and know disrespect robots.txt). If you start with the long list, chances are you will accidentally ban something you don't really want to ban. It's always good to take a look at the threads in the "Search Engine Spider Identification" forum when in doubt.


I'd like to add this one: review your list regularly. Because a bot behaves badly now, it need not always do so.

Personally, I've done temporary bans on a few occasions because some bot was running wild - contacted the firm that owns the IP and asked them to fix it, then allowed them in again after the fix. Some bots are permanently bad, and of course you can do nothing about those, but occasionally some programmer just forgets to debug something.

I've never had a list with half as many banned UAs as the lists posted recently. Not even 25%, I think. I'm doing quite all right, and my sites are up and running as usual - apart from those I've closed myself, that is. And that was never because of too many different user agents visiting.

/claus

Oh, and I'd just like to say thanks to all. This thread is very impressive :)



Added:

stevenha:
>> just focus on the "top 10"

Personally I think this sounds OK; there's no need to ban a lot of stuff if it doesn't really bug you personally. I just checked my longest .htaccess - it had no more than (hold your breath, folks...) 3 IP ranges and 4 user agents banned (I just allowed grub again). Then again, for me it's a tool and I review it - it's not static; bots come and go - enter, leave, and sometimes re-enter.

Oh, and I don't run spider traps. If I did, I would catch quite a few, as I do have a lot of robots on that site (it's a "hub" for at least some of them), but the robots.txt in question is very liberal.

stevenha
1:44 am on Jul 31, 2003 (gmt 0)

To Claus, Balam and others,
Thanks for the reply. I was going to ask you to stickymail me, but then I realized that others are probably interested too... so let me ask this question of everyone: can you please share your minimalist list of IP ranges and user agents to ban?

This is to get the discussion seeded into exploring the Small is Beautiful aspect of ban lists.

As you've said, these lists deserve some individualization, depending on your priorities. My priorities are banning:
1) email harvesters
2) off-line browsers (if they cause error-log entries)
3) recently active, recurrently visiting bandwidth stealers (would that be image archivers?)
4) anything else that fills my error logs with garbage, including CodeRed, Nimda, etc.

irubin
6:02 pm on Jul 31, 2003 (gmt 0)


RewriteCond %{HTTP_USER_AGENT} ^WebStripper [NC,OR]
RewriteRule !^http://[^/.]\.your-site.com.* - [F]

What if my .htaccess file already has RewriteRule entries? Example:

RewriteRule home.html index.html
RewriteRule hour.html /cgi-bin/engine.pl?action=hour

Since I can't change these pre-existing rules, how can I make the ban list work?
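(One workable arrangement - a sketch, assuming the pre-existing rules must keep working: put the ban block first, since a request that draws the [F] never reaches the rules below it, and leave the existing rewrites unchanged after it.)

RewriteEngine On

# Ban list first: [F] ends processing for forbidden requests
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [NC]
RewriteRule .* - [F]

# Pre-existing rules, unchanged
RewriteRule home.html index.html
RewriteRule hour.html /cgi-bin/engine.pl?action=hour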

balam
4:12 am on Aug 2, 2003 (gmt 0)

...Small is Beautiful...

Hah! Not really, not when it's a punctuation-filled regexp! ;)

I'll quote some relevant sections from one of my commented .htaccess files; in a real-world application, I'd delete all comments. Some sections (maybe all - I'm not there yet! ;) could use some explanation, and there are some serious caveats that I'll also point out. There's a great quote by andreasfriedrich [webmasterworld.com] about permissiveness and restrictiveness, and this .htaccess file (is generally|can be) permissive, but at the same time it's not afraid to shoot first and ask questions later.

(You'll notice that virtually every RewriteCond ends with [NC,OR]. Even though I've added robots & whatnot in the (UPPER|lower|Mixed)case that we see in our logs, I've added the case-insensitive NC option because:

  1. I'll catch robots that visit using different cases (such as "lachesis" & "Lachesis"), with only one rule and not two.
  2. I'm not waiting for a "new" robot to show up that gets through the defences only because the operator changed the case of the UA. This is an example of me shooting first, strictly because I don't like your looks. Please remove the NC (and a following comma, if present) to change the behaviour of the RewriteCond to match case-sensitively only. (Err... ;)

You'll also notice that I rarely check for spaces, or other characters like slashes, and instead opt for (without quotes) ".?", which catches any single optional character - the "?" makes it optional. You won't fool me when you switch from "BadBot/1.0" to "BadBot 2.0".

Also, if a robot obeys robots.txt - whether you want it to crawl your site or not - then its place is in your robots.txt file. With only a few exceptions (such as Grub ("grub-client"), Nutch, and a couple of others), I believe these agents do not (read|obey) robots.txt. Exceptions are made because I don't trust those robots to do as they're told, and many threads & mentions of Grub & Nutch exist around here for you to read and make your own decision.

Someone once brought up the subject of the order of things in your .htaccess. I have a few things to say about that, but I'm going to save that for a later message. Or, maybe a few words near the end... We'll see. :) However, FYI, the order I present the following blocks in is the order they appear in the .htaccess file.

Lastly, all blocks are finished off with a generic RewriteRule, if you are cutting 'n' pasting. Please season to taste... and double-check the pipes and spaces before you rely on the copy.)

Ok, a little mod_rewrite voodoo?


Exploits & naughty behaviour

# Forbid requests for exploits & annoyances
#
# Bad requests
RewriteCond %{REQUEST_METHOD} !^(GET|HEAD|POST) [NC,OR]
# CodeRed
RewriteCond %{REQUEST_URI} ^/default\.(ida|idq) [NC,OR]
RewriteCond %{REQUEST_URI} ^/.*\.printer$ [NC,OR]
# Email
RewriteCond %{REQUEST_URI} (mail.?form|form|form.?mail|mail|mailto)\.(cgi|exe|pl)$ [NC,OR]
# MSOffice
RewriteCond %{REQUEST_URI} ^/(MSOffice|_vti) [NC,OR]
# Nimda
RewriteCond %{REQUEST_URI} /(admin|cmd|httpodbc|nsiislog|root|shell)\.(dll|exe) [NC,OR]
# Various
RewriteCond %{REQUEST_URI} ^/(bin/|cgi/|cgi\-local/|sumthin) [NC,OR]
RewriteCond %{THE_REQUEST} ^GET\ http [NC,OR]
RewriteCond %{REQUEST_URI} /sensepost\.exe [NC]
RewriteRule .* - [F]

"Bad requests" forbids requests like "OPTIONS" or "PROPFIND". "Email" covers a number of different probes for a form-to-mail exploit. The "CodeRed," "MSOffice," and "Nimda" sections cover obvious fluff... The first line of the "Various" section shooes away folks looking for things they shouldn't, the third line I've only seen once and when I went looking for info, couldn't find any, but...

The second line tests an environment variable you don't see as often as some others - "THE_REQUEST". This envar holds a value like (without the quotes) "GET / HTTP/1.0". If you've ever seen a bit in your access logs like "GET [webmasterworld.com...] HTTP/1.1", that's someone testing whether your server will proxy their request, allowing them to surf anonymously. There's a thread around here somewhere about this, but I can't find it. This RewriteCond forbids that kind of goofing around... Originally, I had this RewriteCond test "REQUEST_URI" ([...]_URI} ^http [...]), but it didn't work - no clue why. Testing "THE_REQUEST", however, does! :)
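(A sketch of the distinction, with illustrative values in the comments: THE_REQUEST holds the raw request line exactly as the client sent it, while REQUEST_URI holds the URI as the server has parsed it - which is one plausible reason the REQUEST_URI test missed.)

# For the proxy-probe request line "GET http://example.com/ HTTP/1.1":
#   THE_REQUEST holds "GET http://example.com/ HTTP/1.1" (raw request line)
#   REQUEST_URI holds the parsed URI, with no "http://" prefix left to match
RewriteCond %{THE_REQUEST} ^GET\ http [NC]
RewriteRule .* - [F]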


"Hot-link Protection"TM - It's not just for graphics anymore!

# Forbid hot-linking of specified file-types - *blank and* local referers are ok
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?webmasterworld\.com [NC]
RewriteRule \.(avi|bmp|css|doc|exe|gif|jpg|js|mdb|mid|mov|mp3|mpg|pdf|png|pps|ppt|ra|ram|swf|wav|wma|xls|zip)$ - [F]

We're all familiar with protecting our graphics, but why stop there? You can have my HTML, but that's it! Actually, that's not entirely correct... You just can't have any of the listed file-types. If you want to limit it to only your \.s?html? files, that's a good exercise for you to work on (see the sketch just below). ;) Do note that I did not include the file extensions "cgi" and "pl". Some SEs (that have visited me) have been allowed to spider URLs that end with those extensions, so I certainly don't want to block those referers! But - YMMV.
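(One sketch of that exercise - inverting the rule so foreign referers may fetch only \.s?html? files; "example.com" stands in for your own host:)

RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com [NC]
# Negated pattern: forbid anything that does NOT end in .htm/.html/.shtml
RewriteRule !\.s?html?$ - [F]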


Blank UAs & referers

# Forbid if blank (or "-") Referer *and* UA
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule .* - [F]

Pretty standard stuff, and such visitors usually mean no good, but... Did you know that the popular HomeSite HTML editor [webmasterworld.com] (v4.5), when used to verify the links in your project, uses a blank UA (and referer)? I use v4.5 and I hate this fact. >8/ Wanna know something I've just discovered about v5.x of HomeSite? That's near the bottom of this message... There may be other honest visitors - I can't recall off-hand, but there is an "honest" robot that grabs robots.txt with a blank UA & referer, too...


Single-word, text UAs

# Forbid if UA is a single word - case-insensitive, A-Z only
RewriteCond %{HTTP_USER_AGENT} ^[a-z]+$ [NC]
# Some exemptions though - all of these must fail to match for the ban to fire
RewriteCond %{HTTP_USER_AGENT} !^ColdFusion$
RewriteCond %{HTTP_USER_AGENT} !^DeepIndex$
RewriteCond %{HTTP_USER_AGENT} !^FavOrg$
RewriteCond %{HTTP_USER_AGENT} !^MantraAgent$
RewriteCond %{HTTP_USER_AGENT} !^MARTINI$
RewriteRule .* - [F]

I've seen enough jerks show up using a UA like "Generic" that I've banned any single-word, all-text UA. BUT - some exceptions, and there are probably more I should make. (Please let me know!) Who are these exceptions? Search this site for more info... (Also note this is one time where I DO want the exemptions matched case-sensitively.)


Banning IPs

# Forbid by IP address
# Cyveillance
RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.2(2[4-9]|[34][0-9]|5[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^63\.226\.3[34]\. [OR]
RewriteCond %{REMOTE_ADDR} ^63\.212\.171\.161$ [OR]
RewriteCond %{REMOTE_ADDR} ^65\.118\.41\.(19[2-9]|2[01][0-9]|22[0-3])$ [OR]
# NameProtect
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[89]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^NPBot [NC,OR]
# Web Content International
RewriteCond %{REMOTE_ADDR} ^65\.102\.12\.2(2[4-9]|3[01])$ [OR]
RewriteCond %{REMOTE_ADDR} ^65\.102\.17\.(3[2-9]|[4-6][0-9]|7[01]|8[89]|9[0-5]|10[4-9]|11[01])$ [OR]
RewriteCond %{REMOTE_ADDR} ^65\.102\.23\.1(5[2-9]|6[0-7])$ [OR]
# Wordtracker
RewriteCond %{REMOTE_ADDR} ^128\.242\.197\.101$ [OR]
# Unknown
# unknown.Level3.net
RewriteCond %{REMOTE_ADDR} ^64\.156\.198\.(6[89]|7[0-9]|80)$ [OR]
# host25x.keebler.com
RewriteCond %{REMOTE_ADDR} ^65\.223\.250\.25[0-3]$
RewriteRule .* - [F]
RewriteRule .* - [F]

If you don't recognize the names or IPs, search the site. Note the UA check for NameProtect's robot, just to be safe... (This is one robot that apparently obeys robots.txt, but I'm not trusting it to...)


"It's clobbering time!"

The real fun begins... The roughly 30 RewriteConds below forbid over 200 "base" robots, plus "a lot" of variants and version releases. And, as some have requested, they're sorted by type (as best as I've been able to identify them... corrections would be appreciated).

Most all of the below robots have been spoken about somewhere in these forums. Search around here or visit some of the UA/robot reference sites mentioned in this thread for more info...

A couple of other notes about my "power minimalist" style...

While it's not necessary, I like to "escape" my hyphens. See the first line after the "Download managers" comment? There's a UA in there I'm looking for called "DC-Sakura". I escape the hyphen with a backslash - DC\-Sakura - so that it's always plainly (! ;) obvious that I mean "C-dash-S" and not "from-C-to-S."

There are a few lines that match similar, multi-word robots. The last line of the "Tools" section, referencing Microsoft, is an example. If you're trying to figure out the name of the robot I'm trying to forbid, this is the pattern: (A1stword|B1stword[.....]).?(A2ndword|B2ndword[.....]) Get it? See near the bottom of Message #44, above, for a better example.

# Address harvesters
RewriteCond %{HTTP_USER_AGENT} ^(autoemailspider|ExtractorPro) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^E?Mail.?(Collect|Harvest|Magnet|Reaper|Siphon|Sweeper|Wolf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (DTS.?Agent|Email.?Extrac) [NC,OR]
RewriteCond %{HTTP_REFERER} iaea\.org [NC,OR]
# Download managers
RewriteCond %{HTTP_USER_AGENT} ^(Alligator|DA.?[0-9]|DC\-Sakura|Download.?(Demon|Express|Master|Wonder)|FileHound) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Flash|Leech)Get [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Fresh|Lightning|Mass|Real|Smart|Speed|Star).?Download(er)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Gamespy|Go!Zilla|iGetter|JetCar|Net(Ants|Pumper)|SiteSnagger|Teleport.?Pro|WebReaper) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(My)?GetRight [NC,OR]
# Image-grabbers
RewriteCond %{HTTP_USER_AGENT} ^(AcoiRobot|FlickBot|webcollage) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Express|Mister|Web).?(Web|Pix|Image).?(Pictures|Collector)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image.?(fetch|Stripper|Sucker) [NC,OR]
# "Gray-hats"
RewriteCond %{HTTP_USER_AGENT} ^(Atomz|BlackWidow|BlogBot|EasyDL|Marketwave|Sqworm|SurveyBot|Webclipping\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (girafa\.com|gossamer\-threads\.com|grub\-client|Netcraft|Nutch) [NC,OR]
# Site-grabbers
RewriteCond %{HTTP_USER_AGENT} ^(eCatch|(Get|Super)Bot|Kapere|HTTrack|JOC|Offline|UtilMind|Xaldon) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(Auto|Cop|dup|Fetch|Filter|Gather|Go|Leach|Mine|Mirror|Pix|QL|RACE|Sauger) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(site.?(eXtractor|Quester)|Snake|ster|Strip|Suck|vac|walk|Whacker|ZIP) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCapture [NC,OR]
# Tools
RewriteCond %{HTTP_USER_AGENT} ^(curl|Dart.?Communications|Enfish|htdig|Java|larbin) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (FrontPage|Indy.?Library|RPT\-HTTPClient) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(libwww|lwp|PHP|Python|www\.thatrobotsite\.com|webbandit|Wget|Zeus) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Microsoft|MFC).(Data|Internet|URL|WebDAV|Foundation).(Access|Explorer|Control|MiniRedir|Class) [NC,OR]
# Unknown
RewriteCond %{HTTP_USER_AGENT} ^(Crawl_Application|Lachesis|Nutscrape) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^[CDEFPRS](Browse|Eval|Surf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Demo|Full.?Web|Lite|Production|Franklin|Missauga|Missigua).?(Bot|Locat) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (efp@gmx\.net|hhjhj@yahoo\.com|lerly\.net|mapfeatures\.net|metacarta\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Industry|Internet|IUFW|Lincoln|Missouri|Program).?(Program|Explore|Web|State|College|Shareware) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Mac|Ram|Educate|WEP).?(Finder|Search) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Moz+illa|MSIE).?[0-9]?.?[0-9]?[0-9]?$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9]\.[0-9][0-9]?.\(compatible[\)\ ] [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NaverRobot [NC]
RewriteRule .* - [F]

Ok, now what needs clarifying? :)

Um, the first three sections, while maybe not entirely straightforward ;), don't need much comment. It's worth mentioning that, yes, the third word is optional at the end of the second RewriteCond under the "Image-grabbers" section (the UAs matched are "Express WebPictures", "Mister Pix" and the three-word "Web Image Collector"). Why did I write this condition like this? Ah, because that's how I wrote it... Maybe I'm thinking future expansion...

The "Gray-hats" section... You should research these robots to be sure you want to ban them. Some of these visitors may be in fact useful to you. Search the site. The answers are here.

Nothing to say about the "Site-grabbers" section...

The "Tools" section is a collection of various UAs for "home-rolled" robots or software packages for various purposes that are generally looked at with a jaded eye. Looking at it now, I'm thinking that if I'm such a minimalist I really should reduce it to...

RewriteCond %{HTTP_USER_AGENT} ^(Microsoft¦MFC) [NC,OR]

...and be done with it. Better still would be to add "Microsoft¦MFC" into the line above, and remove yet another RewriteCond! :) I'm not going to "correct" revelant section above, because this is a beautiful example of the minimalist in action, no? ;)

Lastly, the "Unknown" section is where I've stuck robots of unknown purpose or ones that I haven't "positively" figured out yet. I do suspect four of the RewriteConds are catching email harvesters (second, third, fifth & sixth)...

If you modify the second RewriteCond, keep the letter "I" out of the range - "IBrowse" is a valid Amiga(?) browser.

The seventh & eighth RewriteConds deserve some discussion... Here's the seventh again:

RewriteCond %{HTTP_USER_AGENT} ^(Moz+illa|MSIE).?[0-9]?.?[0-9]?[0-9]?$ [NC,OR]

Many of us have been visited by UAs like "Mozilla/2.0" or "MSIE 5.5". That's it - nothing else in the string. This condition matches any UA that starts with "Mozilla", "Mozzilla" (or "Mozzzzzzzzilla"), or "MSIE", followed by an optional character, an optional digit, another optional character and two more optional digits, and then the string must end. This catches variations like:

  • Mozzilla
  • Mozilla/4.78
  • Mozilla 5.0
  • MSIE5.5
  • MSIE 6.0

I've heard rumours around here that there may be an honest robot or two using Mozilla-like UAs like above. You are warned. (Confirmation of this would be appreciated by myself and others!)

The eighth condition...

RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9]\.[0-9][0-9]?.\(compatible[\)\ ] [NC,OR]

...is "designed" to catch two particular UAs:

  • Mozilla/3.0 (compatible)
  • Mozilla/4.0 (compatible ; MSIE 6.0[.......]

The first one is generally considered to be a cache or proxy server - or it would be, if it weren't missing a semi-colon after the "compatible". When the semi-colon is missing, this UA is generally considered evil. Remember I said I'd just discovered something about HomeSite v5.x? While v4.5 uses a blank UA (and referer) when verifying links, v5.x uses - you guessed it - "Mozilla/3.0 (compatible)" - no semi-colon. Maybe it's not such a "bad" UA after all... (Does any link validator check robots.txt? That wouldn't seem right... And if it is someone validating a link to your site, that could account for seemingly "random" hits on your site.) Yet another case where discretion is necessary.

The second UA - I don't think anyone is too sure what it is. (I've trimmed off the tail end in the example.) When I was visited, it seemed to be a semi-selective site-grabber... Notice the misplaced semi-colon...

This condition matches a UA that starts with "Mozilla/" followed by a digit, a literal period, a digit, an optional digit, any character, a literal left parenthesis, the word "compatible", then either a literal right parenthesis or a space. (This is probably the one place in the whole .htaccess file where I check for a literal space character, as opposed to just any ol' character...)

Two birds, one stone...


If you're with me still, then I've got a captive audience ;) so I might as well give my opinion on the order of a .htaccess file...

Since a .htaccess must be gone through line-by-line until...

  • a RewriteRule that [F]orbids is executed, or
  • a RewriteRule marked as [L]ast is executed, or
  • the end of the file is reached

...for every request made to your server, you want to process .htaccess as quickly as possible based on your priorities. So, for me that means I don't care what your UA is - if you're looking for an exploit, you're out of here. And if you're still around but stealing my files, then you're out of here. If you're still around after that, I check your UA and referer to see if they're blank... As I mentioned hours ago, these sections are shown in the order they appear in my .htaccess file.

If I have any temporary redirects, I always add them near the top of my .htaccess file - redirect first, then worry about who it is. If I have permanent redirects, I first add them near the beginning, but once most of the SEs (and regular visitors) have come by and noted the new URL - say, a month or two - I then move the redirects to the end of the file (at that point it's once again more important to ask who you are first, rather than tell you where to go first).


So there you have it - didn't think I'd ever shut up, eh? Too bad I'm not a "psycho minimalist"; I would have squeezed all the above into 4 RewriteConds, kept my mouth shut, and made this whole message shorter... :)

Sparky365
5:57 am on Aug 2, 2003 (gmt 0)

Balam, thank you for the most comprehensive explanation of .htaccess methodology I've ever read.

claus
1:14 pm on Aug 2, 2003 (gmt 0)

Balam, did you say "minimalist"? *lol*

1) blank UA / referers:

Link-checkers (homebrew) often use this, I've observed. If you like backlinks to your site, these might be allowed. In #Tools, libwww/lwp are also among these (and of course they may just as well be used to rip an entire site). I also seem to remember (jdMorgan?) saying that python(-urllib?) is used by Google for some of their experimental stuff.

BTW: Why not ^.?$ instead of ^-?$ ?

2) FormMail and The Moz UA's

Here's fresh input: [webmasterworld.com...]

/claus

jdMorgan
7:46 pm on Aug 2, 2003 (gmt 0)

Comments on "blank" referer and user-agent:

It's fairly common to see a blank referer, but blank user-agents are rare. Nevertheless, I have elected not to "ban" truly-blank user-agent+referer, partly because I use key_master's bad_bot.pl script to catch them later if they are up to no good.

However, the one case where I've never seen an innocent visitor is when the user-agent is a hyphen and the referer is a hyphen. This is an intentional ploy to get past blocks/bans on blank UA+referer. For these guys, I ban them by calling the script, which records their IP address and blocks all subsequent requests.

Note that in most server logs, blank referer and user-agent are displayed as "-" "-" and so these tricky user-agents using hyphens look identical in the logs to a blank referer/ua, because they are also displayed as "-" "-".

RewriteCond %{HTTP_REFERER} ^-$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^-$
RewriteRule .* /cgi-local/bad_bot.pl [L]

Jim

Wizcrafts
8:12 pm on Aug 2, 2003 (gmt 0)

JD, what about ANDing those two rules instead of ORing them? Wouldn't that make certain that only a bot with a blank or dash Referrer AND UserAgent gets poisoned/banned?

How would you rewrite the code if you want to AND them?

jdMorgan
8:38 pm on Aug 2, 2003 (gmt 0)

Wizcrafts,

That code blocks anyone who tries to use a hyphen for either request field in order to fake me out. As such, I intentionally [OR]ed them. To make it an AND condition, just omit the [OR] at the end of the first RewriteCond.
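(For reference, the ANDed form is the same block with the [OR] removed - both fields must then be a literal hyphen before the script is called:)

RewriteCond %{HTTP_REFERER} ^-$
RewriteCond %{HTTP_USER_AGENT} ^-$
RewriteRule .* /cgi-local/bad_bot.pl [L]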

Jim

Wizcrafts
8:41 pm on Aug 2, 2003 (gmt 0)

Thanks Jim, that's what I thought, but needed to know for sure. I'm used to JavaScript, where AND is && and OR is ||. Now I know the mod_rewrite way.

claus
5:37 pm on Aug 3, 2003 (gmt 0)

>> tricky user-agents using hyphens look identical in the logs

- ah, that explains why. I've always thought it was odd. Meaning: if they absolutely wanted something other than blank, then why use the hyphen when there's a whole character set to choose from?

So, they're actually betting on people banning blank strings and forgetting to ban hyphens. Good to know :)

However, some day they might start thinking that a character other than a hyphen may also be worth a try; that was the reason for my "BTW" comment in post #50.

/claus

jdMorgan
8:11 pm on Aug 3, 2003 (gmt 0)

They are counting on the common use of "-" in log files to represent a blank ua. No other character would look like a logged blank referer, so we need not be concerned about other characters.

This ploy was first reported by WebmasterWorld member guabito some time last year, IIRC.

Jim

viggen
6:17 pm on Aug 11, 2003 (gmt 0)

After reading this thread for about 2 hours straight, I implemented an .htaccess file.

I don't encounter any problems (that I am aware of); I checked with Wannabrowser that the bad bots are kept out (yes). However, I don't know how to check whether the IP banning works. Also, there is already a RewriteEngine On, so I have it twice - is it supposed to be like this?

Here is my .htaccess file, if anyone could check that it all looks OK, as I already had other stuff in it.


DirectoryIndex index.php

php_flag magic_quotes_gpc on

RewriteEngine On

RewriteRule ^news_archive-([0-9][0-9][0-9][0-9][0-9][0-9]*).* index.php?m=$1

# this will make register globals off in b2's directory
# just put a '#' sign before these three lines if you don't want that

#
#php_flag register_globals off
#

# this will set the error_reporting level to remove 'Notices'
#
# php_value error_reporting 247
#

# this is used to make b2 produce links like [example.com...]
# if you renamed the file 'archives' to another name, please change it here too

#
#ForceType application/x-httpd-php
#

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
<long list of more like those>
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F]

thanks

claus
9:18 pm on Aug 11, 2003 (gmt 0)

>> # this is used to make b2 produce links like ...

these two comment lines should probably go just after the: RewriteRule ^news_archive

>> reading like 2 hours straight

And it's getting longer still ;)

>> there is already an RewriteEngine On

You only need one; delete number two, and perhaps collect the Rewrite statements in one block for easy maintenance. As it is now, there's some PHP stuff in between, although it's commented out.

>> how to check if the IP banning works

You'll have to be able to spoof the IP address, but they seem quite all right to me. They ban:

12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])

- from 12.148.209.192 to 12.148.209.255

12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])

- from 12.148.196.128 to 12.148.196.255
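(A practical check that needs no IP spoofing, as a sketch: temporarily ban your own address - 203.0.113.7 below is a stand-in for your IP - visit the site, confirm the 403, then remove the lines.)

# Temporary self-test - substitute your own IP, expect a 403, then delete
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.7$
RewriteRule .* - [F]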

Extra:

^news_archive-([0-9][0-9][0-9][0-9][0-9][0-9]*).*

What you are saying here is: "news_archive-" followed by any number of digits as long as there are at least five, followed by any character any number of times, including zero. I suspect that this is not what you want; rather, I think you would like to catch a filename like this:

news_archive-200209.php

That is: exactly six digits, then a dot, and then "php"... or htm or asp, etc. Try this instead, and replace "php" with the relevant ending if needed:

^news_archive-(\d{6})\.php$

The six digits are still captured and handed over to $1 by means of the parentheses.
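(Dropped into viggen's file, the full rule would then read, keeping his substitution target as-is:)

RewriteRule ^news_archive-(\d{6})\.php$ index.php?m=$1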

/claus

berli
8:54 pm on Aug 16, 2003 (gmt 0)

Just wanted to share a problem I ran into:

I copied and modified a big list of bad bots that appeared months ago on this thread.

One of the lines was:

RewriteCond %{HTTP_USER_AGENT} MS\ FrontPage [OR]

I had to change that to:

RewriteCond %{HTTP_USER_AGENT} MS.?FrontPage [NC,OR]

The previous version was letting "MSFrontPage" through. (It was trying to POST. The request 404'd, fortunately, because I don't use FrontPage.)

IanKelley
4:42 am on Aug 31, 2003 (gmt 0)

I'm sure everyone here has been enjoying all of the virus spam lately.

Because the email addresses being used for these virus mass-mailings are coming from a web spider... I'm wondering if anyone here knows how that spider identifies itself. Does it look exactly like a legitimate IE browser, or is it catchable?

nancyb
7:36 pm on Sep 17, 2003 (gmt 0)

I have the following in my .htaccess:

# Block libwww-perl except from AltaVista, Inktomi, and IA Archiver
RewriteCond %{HTTP_USER_AGENT} ^libwww-perl/[0-9] [NC]
RewriteCond %{REMOTE_ADDR} !^209\.73\.(1[6-8][0-9]|19[01])\.
RewriteCond %{REMOTE_ADDR} !^209\.131\.(3[2-9]|[45][0-9]|6[0-3])\.
RewriteCond %{REMOTE_ADDR} !^209\.237\.23[2-5]\.
RewriteRule !^err403\.htm$ - [F]
# Block Java and Python URLlib except from Google
RewriteCond %{HTTP_USER_AGENT} ^(Python.urllib|Java/?[1-9]\.[0-9]) [NC]
RewriteCond %{REMOTE_ADDR} !^216\.239\.(3[2-9]|[45][0-9]|6[0-3])\.

Can anyone tell me why the first hit gets a 200 and the second a 404, and what I need to do to correct it so both are 404?

65.49.178.17 - - [17/Sep/2003:10:40:52 -0400] "GET /xxx.htm HTTP/1.1" 200 14724 "-" "xxxxxxxxx_xxxxxxxx/0.1 libwww-perl/5.65"
65.49.178.17 - - [17/Sep/2003:10:34:02 -0400] "GET /xxxxxx/- HTTP/1.1" 404 7550 "-" "xxxxxxxxx_xxxxxxxx/0.1 libwww-perl/5.65"

thanks
