Forum Moderators: phranque


A Close to perfect .htaccess ban list - Part 2


adriaant

11:46 pm on May 14, 2003 (gmt 0)

10+ Year Member



<modnote>
continued from [webmasterworld.com...]



UGH, bad typo in my original post. Here's the better version (I wasn't able to re-edit the older post?):

I'm trying to ban sites by domain name, since there have been a lot of referrer spammers recently.

I have, for example, the rule:

RewriteCond %{HTTP_REFERER} ^http://(www\.)?.*stuff.*\.com/.*$ [NC]
RewriteRule ^.*$ - [F,L]

which should ban any site whose domain contains the word "stuff", e.g.:
www.stuff.com
www.whatkindofstuff.com
www.some-other-stuff.com

and so on.

However, it is not working, so I am sure I did not set up a proper pattern-match rule. Anyone care to advise?
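For comparison, a simpler substring match would avoid the anchoring problem entirely. This is a sketch, not a tested rule; note in particular that the original pattern's trailing "\.com/.*$" requires a slash after ".com", so a bare referrer like "http://www.stuff.com" (no trailing slash) would not match it:

```apache
# Sketch: plain substring match on the referrer; [NC] makes it
# case-insensitive, so "Stuff" and "stuff" both match.
RewriteEngine On
RewriteCond %{HTTP_REFERER} stuff [NC]
RewriteRule .* - [F]
```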

[edited by: jatar_k at 5:06 am (utc) on May 20, 2003]

photoman

1:11 pm on Jul 6, 2003 (gmt 0)

10+ Year Member



Hello Everyone,

I am new here, and have just recently started hosting my own site on a real server (i.e. one with CGI-bin access and so on); I've been lurking for a few days and have learned a lot! So thank you, everyone.

I would like to ask two questions:

First, I would like to post my .htaccess contents and if someone knowledgeable could look it over and let me know if they see any errors or such, I would appreciate it. I've built this file using what I've found in these forums, and it includes a spider trap I found here which seems to be working great.

Second, by looking at the .htaccess contents, can anyone tell me why the spider semanticdiscovery/0.2 is being blocked from my site? I can't figure out why, since it is not listed in my .htaccess, and it makes no sense to me. It's not that I really care about that spider; I'm just trying to understand. My thinking is: if the file can ban that spider without my knowing I've banned it, what else will it ban without my knowledge or wishes?

BTW, I do not have access to mod_rewrite on my server.

And finally, before I post the .htaccess file, I don't believe I've violated any forum rules or etiquette here, but if I have, my apologies.

Here is the .htaccess contents:

SetEnvIf Remote_Addr ^66\.28\.139\.66$ getout # Sat Jul 5 13:44:59 2003 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)
SetEnvIf Remote_Addr ^218\.18\.32\.60$ getout # Sat Jul 5 11:17:30 2003 Mozilla/4.0 (compatible; MSIE 5.5;Windows NT 5.0)
SetEnvIf Remote_Addr ^68\.2\.119\.169$ getout # Fri Jul 4 02:06:54 2003 Mac Finder 1.0.22
SetEnvIf Remote_Addr ^212\.138\.47\.20$ getout
SetEnvIfNoCase User-Agent "Alexibot" getout
SetEnvIfNoCase User-Agent "asterias" getout
SetEnvIfNoCase User-Agent "autoemailspider" getout
SetEnvIfNoCase User-Agent "b2w 0.1" getout
SetEnvIfNoCase User-Agent "BackWeb" getout
SetEnvIfNoCase User-Agent "BackDoorBot 1.0" getout
SetEnvIfNoCase User-Agent "Black Hole" getout
SetEnvIfNoCase User-Agent "BlackWidow" getout
SetEnvIfNoCase User-Agent "BlowFish 1.0" getout
SetEnvIfNoCase User-Agent "CherryPicker 1.0" getout
SetEnvIfNoCase User-Agent "CherryPickerSE 1.0" getout
SetEnvIfNoCase User-Agent "CherryPickerElite 1.0" getout
SetEnvIfNoCase User-Agent "ChinaClaw" getout
SetEnvIfNoCase User-Agent "Collector" getout
SetEnvIfNoCase User-Agent "Copier" getout
SetEnvIfNoCase User-Agent "Crescent" getout
SetEnvIfNoCase User-Agent "Crescent Internet ToolPak HTTP OLE Control v.1.0" getout
SetEnvIfNoCase User-Agent "Custo" getout
SetEnvIfNoCase User-Agent "DISCo" getout
SetEnvIfNoCase User-Agent "DISCo Pump" getout
SetEnvIfNoCase User-Agent "DISCo Pump 3.1" getout
SetEnvIfNoCase User-Agent "Download Demon" getout
SetEnvIfNoCase User-Agent "Download Wonder" getout
SetEnvIfNoCase User-Agent "Downloader" getout
SetEnvIfNoCase User-Agent "Drip" getout
SetEnvIfNoCase User-Agent "eCatch" getout
SetEnvIfNoCase User-Agent "EirGrabber" getout
SetEnvIfNoCase User-Agent "EmailCollector" getout
SetEnvIfNoCase User-Agent "EmailCollector 1.0" getout
SetEnvIfNoCase User-Agent "EmailSiphon" getout
SetEnvIfNoCase User-Agent "EmailWolf" getout
SetEnvIfNoCase User-Agent "EmailWolf 1.00" getout
SetEnvIfNoCase User-Agent "Express WebPictures" getout
SetEnvIfNoCase User-Agent "ExtractorPro" getout
SetEnvIfNoCase User-Agent "EyeNetIE" getout
SetEnvIfNoCase User-Agent "FileHound" getout
SetEnvIfNoCase User-Agent "Flaming AttackBot" getout
SetEnvIfNoCase User-Agent "FlashGet" getout
SetEnvIfNoCase User-Agent "GetRight" getout
SetEnvIfNoCase User-Agent "GetSmart" getout
SetEnvIfNoCase User-Agent "GetWeb!" getout
SetEnvIfNoCase User-Agent "Go!Zilla" getout
SetEnvIfNoCase User-Agent "Go-Ahead-Got-It" getout
SetEnvIfNoCase User-Agent "gotit" getout
SetEnvIfNoCase User-Agent "Grabber" getout
SetEnvIfNoCase User-Agent "GrabNet" getout
SetEnvIfNoCase User-Agent "Grafula" getout
SetEnvIfNoCase User-Agent "Harvest 1.5" getout
SetEnvIfNoCase User-Agent "HMView" getout
SetEnvIfNoCase User-Agent "HTTrack" getout
SetEnvIfNoCase User-Agent "Image Stripper" getout
SetEnvIfNoCase User-Agent "Image Sucker" getout
SetEnvIfNoCase User-Agent "Indy Library" getout
SetEnvIfNoCase User-Agent "InterGET" getout
SetEnvIfNoCase User-Agent "Internet Ninja" getout
SetEnvIfNoCase User-Agent "Iria" getout
SetEnvIfNoCase User-Agent "JetCar" getout
SetEnvIfNoCase User-Agent "JOC Web Spider" getout
SetEnvIfNoCase User-Agent "JOC" getout
SetEnvIfNoCase User-Agent "JustView" getout
SetEnvIfNoCase User-Agent "larbin" getout
SetEnvIfNoCase User-Agent "lftp" getout
SetEnvIfNoCase User-Agent "LeechFTP" getout
SetEnvIfNoCase User-Agent "likse" getout
SetEnvIfNoCase User-Agent "Magnet" getout
SetEnvIfNoCase User-Agent "Mag-Net" getout
SetEnvIfNoCase User-Agent "Mass Downloader" getout
SetEnvIfNoCase User-Agent "Memo" getout
SetEnvIfNoCase User-Agent "MIDown tool" getout
SetEnvIfNoCase User-Agent "Mirror" getout
SetEnvIfNoCase User-Agent "Mister PiX" getout
SetEnvIfNoCase User-Agent "Navroad" getout
SetEnvIfNoCase User-Agent "NearSite" getout
SetEnvIfNoCase User-Agent "NetAnts" getout
SetEnvIfNoCase User-Agent "NetSpider" getout
SetEnvIfNoCase User-Agent "Net Vampire" getout
SetEnvIfNoCase User-Agent "NetZIP" getout
SetEnvIfNoCase User-Agent "NICErsPRO" getout
SetEnvIfNoCase User-Agent "Ninja" getout
SetEnvIfNoCase User-Agent "Octopus" getout
SetEnvIfNoCase User-Agent "Offline Explorer" getout
SetEnvIfNoCase User-Agent "Offline Navigator" getout
SetEnvIfNoCase User-Agent "PageGrabber" getout
SetEnvIfNoCase User-Agent "Papa Foto" getout
SetEnvIfNoCase User-Agent "pavuk" getout
SetEnvIfNoCase User-Agent "pcBrowser" getout
SetEnvIfNoCase User-Agent "Pump" getout
SetEnvIfNoCase User-Agent "RealDownload" getout
SetEnvIfNoCase User-Agent "Reaper" getout
SetEnvIfNoCase User-Agent "Recorder" getout
SetEnvIfNoCase User-Agent "ReGet" getout
SetEnvIfNoCase User-Agent "Siphon" getout
SetEnvIfNoCase User-Agent "SiteSnagger" getout
SetEnvIfNoCase User-Agent "SmartDownload" getout
SetEnvIfNoCase User-Agent "Snake" getout
SetEnvIfNoCase User-Agent "SpaceBison" getout
SetEnvIfNoCase User-Agent "Sucker" getout
SetEnvIfNoCase User-Agent "SuperBot" getout
SetEnvIfNoCase User-Agent "SuperHTTP" getout
SetEnvIfNoCase User-Agent "Surfbot" getout
SetEnvIfNoCase User-Agent "tAkeOut" getout
SetEnvIfNoCase User-Agent "Teleport" getout
SetEnvIfNoCase User-Agent "Teleport Pro" getout
SetEnvIfNoCase User-Agent "Teleport Pro/1.29.1718" getout
SetEnvIfNoCase User-Agent "Teleport Pro/1.29.1632" getout
SetEnvIfNoCase User-Agent "Teleport Pro/1.29.1590" getout
SetEnvIfNoCase User-Agent "Teleport Pro/1.29.1616" getout
SetEnvIfNoCase User-Agent "Vacuum" getout
SetEnvIfNoCase User-Agent "VoidEYE" getout
SetEnvIfNoCase User-Agent "WebAuto" getout
SetEnvIfNoCase User-Agent "WebBandit" getout
SetEnvIfNoCase User-Agent "WebBandit 2.1" getout
SetEnvIfNoCase User-Agent "WebBandit 3.50" getout
SetEnvIfNoCase User-Agent "webbandit 4.00.0" getout
SetEnvIfNoCase User-Agent "WebCapture 2.0" getout
SetEnvIfNoCase User-Agent "WebCopier v.2.2" getout
SetEnvIfNoCase User-Agent "WebCopier v3.2a" getout
SetEnvIfNoCase User-Agent "WebCopier" getout
SetEnvIfNoCase User-Agent "WebEMailExtractor 1.0B" getout
SetEnvIfNoCase User-Agent "WebFetch" getout
SetEnvIfNoCase User-Agent "WebGo IS" getout
SetEnvIfNoCase User-Agent "Web Image Collector" getout
SetEnvIfNoCase User-Agent "Web Sucker" getout
SetEnvIfNoCase User-Agent "WebLeacher" getout
SetEnvIfNoCase User-Agent "WebReaper" getout
SetEnvIfNoCase User-Agent "WebSauger" getout
SetEnvIfNoCase User-Agent "Website" getout
SetEnvIfNoCase User-Agent "Website eXtractor" getout
SetEnvIfNoCase User-Agent "Website Quester" getout
SetEnvIfNoCase User-Agent "Webster" getout
SetEnvIfNoCase User-Agent "WebStripper" getout
SetEnvIfNoCase User-Agent "WebWhacker" getout
SetEnvIfNoCase User-Agent "WebZIP" getout
SetEnvIfNoCase User-Agent "WebZip/4.0" getout
SetEnvIfNoCase User-Agent "WebZIP/4.21" getout
SetEnvIfNoCase User-Agent "WebZIP/5.0" getout
SetEnvIfNoCase User-Agent "Wget" getout
SetEnvIfNoCase User-Agent "Wget/1.5.3" getout
SetEnvIfNoCase User-Agent "Wget/1.6" getout
SetEnvIfNoCase User-Agent "Whacker" getout
SetEnvIfNoCase User-Agent "Widow" getout
SetEnvIfNoCase User-Agent "WWW-Collector-E" getout
SetEnvIfNoCase User-Agent "WWWOFFLE" getout
SetEnvIfNoCase User-Agent "Xaldon" getout
SetEnvIfNoCase User-Agent "Xaldon/WebSpider" getout
# Block bad-bots using lines written by bad_bot.pl script above
SetEnvIf Request_URI "^(/403.*\.shtml|/robots\.txt|/file_instead_of_what_they_want\.html)$" allowsome

<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>

<Files .htaccess>
order deny,allow
deny from all
</Files>

Thanks.

photoman

6:57 pm on Jul 6, 2003 (gmt 0)

10+ Year Member



Follow-up to my own post:

Interesting. Concerning the banning of semanticdiscovery/0.2 that I mentioned above: if I remove the entry for the agent 'DISCo', the Semantic DISCOvery agent is no longer banned. So either my code in .htaccess is not what it should be, or there is a better, more specific way to ban particular user agents (obviously, I am assuming it is the common occurrence of the contiguous word "disco" in both user-agent strings that is causing the ban). Has anyone seen this before? Is this the way it should work? How can I ban SPECIFIC user agents through .htaccess without unwittingly banning other, potentially harmless or welcome user agents that share common parts of their UA names?

Any thoughts or help are appreciated.

Thanks.

claus

7:06 pm on Jul 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



photoman,

SetEnvIfNoCase has "NoCase" in it. If you use this instead:

SetEnvIf

then upper- and lowercase matter, so that "DISCo" and "disco" are not treated as the same word.

You can also try this syntax:

SetEnvIf User-Agent "^DISCo" getout

The ^ character means that the string must start with "DISCo". The two other DISCo entries you have will also get caught by this one.
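Putting the two suggestions together, here is a sketch of the three DISCo lines collapsed into a single case-sensitive, anchored rule (untested; verify against your own logs before relying on it):

```apache
# Case-sensitive (SetEnvIf, not SetEnvIfNoCase) and anchored with ^,
# so "DISCo", "DISCo Pump" and "DISCo Pump 3.1" are all caught,
# while "semanticdiscovery/0.2" is not.
SetEnvIf User-Agent "^DISCo" getout
```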

/claus

photoman

7:39 pm on Jul 6, 2003 (gmt 0)

10+ Year Member



Thanks very much for those pointers claus, that has done the trick. :)

photoman

pmkpmk

4:16 pm on Jul 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi there,

I use an .htaccess file more or less copied from the previous thread, which is some three months old. I just found out that I had blocked WebmasterWorld's Keyword Density measurer because I blocked empty UAs; probably not a good idea. After changing this, I ran some quick statistics and was a bit shocked/unsure whether some blocks might be a good idea after all. The reason for blocking is to keep address harvesters out of my site, but I blocked these fellows as well:

Mozilla/4.0 (compatible; grub-client-0.3.0; Crawl your own stuff with [grub.org)...]
Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; [inktomi.com...]
sitecheck.internetseer.com (For more info see: [sitecheck.internetseer.com)...]
SurveyBot/2.2 <a href='http://www.****'>Whois Source</a>
SurveyBot/2.3 (Whois Source)

Are those harmless or harmful?

jdMorgan

4:41 pm on Jul 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



pmkpmk,

You almost certainly do not wish to block Slurp, since that is Inktomi's spider. This feeds MSN and several others now.

Early versions of Grub do not fetch or obey robots.txt, and no one seems to know where the data collected by Grub is used. For these two reasons, many block Grub, or at least the early versions.
I can confirm that grub-client-1.4.3 seems to obey robots.txt, but 1.3.7 does not even check it.

Internet Seer is either good (if you use the service) or not (if you don't).

SurveyBot from Whois Source is OK IMO; one of our members here works there.

Opinions may vary widely (and wildly) on these user-agents. The opinions here are my own. YMMV.

HTH,
Jim

pmkpmk

4:50 pm on Jul 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi jdMorgan,

I see you made it to the new thread as well? I valued your feedback in the old thread very much.

I almost thought as much about Inktomi... The problem is: I have looked the .htaccess file up and down and I CAN'T FIND the condition that blocks Inktomi!

I'm kind of hesitant to post an outdated list here, so I took the liberty of sending it to you via stickymail. If you'd be kind enough to have a look... If not, that's OK as well.

Thanks!

Wizcrafts

6:05 pm on Jul 17, 2003 (gmt 0)

10+ Year Member



I have another User Agent to add to our blocklists, but the condition is iffy.

While reading and reviewing my weblogs, I found several instances of this unique user-agent misspelling in my normally restricted spam-bait page logs. The UA is a typo of a common agent, and is used by two separate, identifiable spam domains.

Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)

Notice the space between "compatible" and the semicolon? The legit UA has no space there. Since my logs for the last six months show only harvesters using this UA, I am banning it. I think it might have been circulated among spammers with one of their bot programs, hence the same misspelling from a few IPs, traced to spam domains.

Here is my RewriteCond for it, tested with wannabrowser:

RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible\ ;\ MSIE\ 6\.0;\ Windows\ NT\ 5\.1\)$

The current IP source using this UA to look for addresses and eat poison is at 68.59.94.40
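For anyone without mod_rewrite access (as photoman mentioned earlier in the thread), the same ban can be sketched with SetEnvIf; the exact quoting below is an assumption and should be tested against your own server before use:

```apache
# SetEnvIf sketch of the same ban: the regex keys on the stray
# space before the semicolon, which the legitimate MSIE 6.0 UA lacks.
SetEnvIf User-Agent "^Mozilla/4\.0 \(compatible ; MSIE 6\.0; Windows NT 5\.1\)$" getout
```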

Wiz Feinberg

wkitty42

10:51 pm on Jul 17, 2003 (gmt 0)

10+ Year Member



wizcrafts,

that's a Comcast IP down in Panama City, Florida... possibly a call and/or email to them with evidence will help... you may also want to review their TOS and see if there's a violation that will help them nuke your intruder... if nothing else, definitely start LARTing them about it...

jazzguy

7:19 pm on Jul 27, 2003 (gmt 0)

10+ Year Member



jdMorgan wrote:
I can confirm that grub-client-1.4.3 seems to obey robots.txt, but 1.3.7 does not even check it.

According to my logs (from multiple sites), grub-client-1.4.3 does not obey robots.txt. I've had "grub-client" disallowed in my robots.txt files for over a month, but all versions of Grub disobey it. grub-client-1.4.3 does check robots.txt, but then proceeds to disobey it (as recently as today). My robots.txt file validates and uses the User-agent ("grub-client") given on Grub's robots FAQ page.

Even if they did decide to start obeying robots.txt (which they lie about doing in their FAQ), I would still ban them based on their opinion that banning their bot is a "Draconian approach to [their] presence" (also from their robots FAQ page).
