A Close to perfect .htaccess ban list - Part 4

Update?

         

slatz

7:55 pm on Oct 26, 2004 (gmt 0)

10+ Year Member



Moderator's note: This thread is a combination of two concurrent threads continuing from the A Close to perfect .htaccess ban list - Part 3 [webmasterworld.com] discussion. Some of the posts are out of date order for that reason.

Does anyone have an updated .htaccess file for bad spiders and bots etc....?

The last post to "A Close to perfect .htaccess ban list - Part 3" was in April of 2003... a tad outdated to say the least.

I know toolman and superman were very helpful in that thread... which was awesome.....

You can PM me if you like instead of posting the list here to save some bandwidth...

slats

[edited by: jdMorgan at 3:50 am (utc) on Nov. 4, 2004]

guitaristinus

8:01 pm on Oct 26, 2004 (gmt 0)

10+ Year Member



I'd like to see an update as well.

slatz

4:02 pm on Oct 27, 2004 (gmt 0)

10+ Year Member



Can anyone add anything to this list?

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [OR]
RewriteCond %{HTTP_USER_AGENT} Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} larbin [OR]
RewriteCond %{HTTP_USER_AGENT} LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} Siphon [OR]
RewriteCond %{HTTP_USER_AGENT} SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} Wget [OR]
RewriteCond %{HTTP_USER_AGENT} Widow [OR]
RewriteCond %{HTTP_USER_AGENT} Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} Zeus
RewriteRule .* - [F,L]

teasers

6:27 pm on Oct 29, 2004 (gmt 0)

10+ Year Member



I just noticed lots of DA's in Urchin; I think those represent 'Download Accelerator'. Is this something like HTTrack that should be banned?

teasers

6:44 pm on Oct 29, 2004 (gmt 0)

10+ Year Member



PuxaRapido v1.0

1.75 GB in two weeks. It is probably used by Brazilian users.

fish_eye

3:23 am on Oct 5, 2004 (gmt 0)

10+ Year Member



It's been some time since I updated my banned robots list and was wondering if much has changed in the last 12 months. I assume it has.

I appreciate all the effort that has gone into the previous discussions on this [webmasterworld.com] but I was wondering if someone can sticky me - or post if it's allowed - an up-to-date list of bad and useless bots?

I guess a more attractive alternative to the actual mod_rewrite code would be definitions / descriptions of currently active bots?

Umbra

9:13 am on Oct 12, 2004 (gmt 0)

10+ Year Member



Rather than slog through that massive thread... is there a one-stop comprehensive up-to-date source for this evolving ban list? (Either one particular message or perhaps a website?)

fish_eye

1:01 pm on Oct 12, 2004 (gmt 0)

10+ Year Member



It would be nice, yes, but it is prone to abuse I guess.

I did find more info in the robots.txt forum and also in the search engine spiders forum, but they only go part of the way (identifying the good guys, not the bad ones).

Wizcrafts

3:14 pm on Oct 21, 2004 (gmt 0)

10+ Year Member



This is what I am currently using, after studying my server logs and those of others:

RewriteCond %{HTTP_USER_AGENT} !EmailProtect [NC]

RewriteCond %{HTTP_USER_AGENT} ^(BlackWidow¦Crescent¦Disco.?¦ExtractorPro¦HTML.?Works¦Franklin.?Locator) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Green\ Research¦Harvest¦HLoader¦http.?generic¦Industry.?Program) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(IUPUI.?Research.?Bot¦Mac.?Finder¦NetZIP¦NICErsPRO¦NPBot¦PlantyNet_WebRobot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Production.?Bot¦Program.?Shareware¦Teleport.?Pro¦TurnitinBot¦TE¦VOBSUB¦VoidEYE) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(WebBandit¦WebCopier¦Websnatcher¦Website\ Extractor¦WEP.?Search¦Wget¦Zeus) [NC,OR]

RewriteCond %{HTTP_USER_AGENT} cherry.?picker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} e?mail.?(collector¦extractor¦magnet¦reaper¦search¦siphon¦sweeper¦harvest¦collect¦wolf) [NC,OR]

RewriteCond %{HTTP_USER_AGENT} \.\.\.\.\.\..?¦Educate.?Search¦Full.?Web.?Bot¦Indy.?Library¦IUFW.?Web [NC,OR]

RewriteCond %{HTTP_USER_AGENT} Cowbot¦Downloader¦httrack¦larbin¦NaverRobot¦QuepasaCreep¦Siphon [NC,OR]

RewriteCond %{HTTP_USER_AGENT} efp@gmx\.net [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^P\.Arthur\ 1\.1$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Miss.*g.*.?Locat.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.?URL.?Control [NC,OR]
# Phoney User_Agents used by email harvesters
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3\.0\ \(compatible\)$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible\ ;\ MSIE.? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible;\ MSIE\ 5\.0;\ Windows\ NT\)$ [NC,OR]

RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible;\ MSIE\ 5\.00;\ Windows\ 98$ [NC,OR]

RewriteCond %{HTTP_USER_AGENT} ^Mozilla/6\.0\ \(compatible;\ MSIE\ 6\.0;\ Windows\ NT\ 5\.2\)$ [NC,OR]

RewriteCond %{REQUEST_URI} (MSOffice/cltreq\.asp¦_vti_bin/owssvr\.dll¦_vti_bin/_vti_aut/fp30reg\.dll¦_mem_bin¦MSADC¦sumthin) [NC,OR]

# RewriteCond %{REQUEST_URI} ~\!\^~\!\^~\!\.html [OR]
RewriteCond %{HTTP_REFERER} q=guestbook [NC,OR]
RewriteCond %{HTTP_REFERER} iaea\.org [NC]
# Above is last condition ^
RewriteRule .* - [F]


There are a lot of user agents from the "Close To Perfect Ban List" that are not in my list, because they haven't visited my websites, or are captured by my wildcard terms, or which I don't consider to be a problem or threat. Conversely, there are some in my list that others may not wish to ban at all, such as NaverBot.

As has been mentioned before (in the aforementioned ban list threads), you will need to retype the broken pipes into solid pipes before posting these directives. All of the above directives are on their own continuous lines, but were word wrapped by this Forum. Carriage returns are only allowed when starting a new condition, rule, comment, or blank line. Comments begin with # and should be on separate lines from the directives, to avoid possible 500 server errors.
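
To illustrate the layout rules above, here is a small sketch with solid pipes typed in. The two agent names are just examples taken from the list, and the trailing-comment line is shown only as something to avoid:

```apache
# Each directive sits on one continuous line, with solid pipes (|)
# separating the alternatives. The comment gets a line of its own.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^(WebCopier|Wget) [NC]
RewriteRule .* - [F]

# Avoid a comment trailing the directive on the same line, for example:
#   RewriteCond %{HTTP_USER_AGENT} ^Wget [NC]   # block wget
# Apache treats the extra words as additional arguments to the directive,
# which in an .htaccess file typically shows up as a 500 server error.
```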

I have also left out my personal allowance for blocked agents to view my custom error pages and other files to which I might redirect them, such as poison or banning scripts. These allowances would go on the last line, before the - [F] command, as in:
RewriteRule !^(docs/403\.html¦robots\.txt¦other-allowed-files) - [F]
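
For context, a hedged sketch of how such an allowance might sit alongside a custom 403 page. The ErrorDocument path and the agent name BadBotOne are assumptions for illustration only, and the pattern assumes the .htaccess file lives in the site root:

```apache
# Assumed location of the custom error page; adjust to your own site.
ErrorDocument 403 /docs/403.html

RewriteEngine On
# BadBotOne is a hypothetical agent name standing in for the ban list.
RewriteCond %{HTTP_USER_AGENT} BadBotOne [NC]
# Forbid everything except the error page and robots.txt, so a banned
# agent can still be served the 403 page instead of hitting a second error.
RewriteRule !^(docs/403\.html|robots\.txt)$ - [F]
```

Without the exclusion, the request for the error page itself would also match the ban and be forbidden.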

Wiz

[edited by: jdMorgan at 3:38 pm (utc) on Oct. 21, 2004]
[edit reason] Fixed side-scroll [/edit]

guitaristinus

11:14 am on Oct 27, 2004 (gmt 0)

10+ Year Member



Wizcrafts,

RewriteCond %{HTTP_USER_AGENT}!EmailProtect [NC,OR]
wouldn't let me get to my site. Gives me a "Forbidden".

For some reason the space before "!" keeps getting deleted from the post. But it was there in the file.

Thanks for the update.

Wizcrafts

2:39 pm on Oct 27, 2004 (gmt 0)

10+ Year Member



Guitaristinus wrote:

RewriteCond %{HTTP_USER_AGENT}!EmailProtect [NC,OR]
wouldn't let me get to my site. Gives me a "Forbidden".

That is not what I listed in my code. I have this:
RewriteCond %{HTTP_USER_AGENT} !EmailProtect [NC]

Followed by the other expressions.
This line says (User Agent is) NOT EmailProtect (case insensitive) AND any of the following conditions occurs <snip the rest>. You have apparently added an OR to my condition, which is certainly wrong. It breaks the ruleset, because it now says: the User Agent is NOT EmailProtect, OR ........., so ban it. That means that ALL user agents will be banned, including yours! Get rid of the OR in that directive.
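
A minimal sketch of that grouping, with two placeholder agent names (BadBotOne and BadBotTwo are hypothetical and stand in for the full list). It relies on the documented mod_rewrite behavior that conditions are ANDed by default and that [OR] joins a condition to the one after it:

```apache
RewriteEngine On

# Intended form: the exemption carries no [OR], so it is ANDed with the
# OR-group that follows it.
# Reads as: NOT EmailProtect AND (BadBotOne OR BadBotTwo)
RewriteCond %{HTTP_USER_AGENT} !EmailProtect [NC]
RewriteCond %{HTTP_USER_AGENT} BadBotOne [NC,OR]
RewriteCond %{HTTP_USER_AGENT} BadBotTwo [NC]
RewriteRule .* - [F]

# Broken form (left commented out): adding [OR] to the exemption changes
# the chain to (NOT EmailProtect) OR BadBotOne OR BadBotTwo, which is true
# for any visitor whose user agent does not contain "EmailProtect".
#   RewriteCond %{HTTP_USER_AGENT} !EmailProtect [NC,OR]
```

Note that the last condition before the RewriteRule carries no [OR]; it closes the OR-group.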

Wiz

Umbra

7:26 am on Nov 2, 2004 (gmt 0)

10+ Year Member



# Phoney User_Agents used by email harvesters
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3\.0\ \(compatible\)$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible\ ;\ MSIE.? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible;\ MSIE\ 5\.0;\ Windows\ NT\)$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible;\ MSIE\ 5\.00;\ Windows\ 98$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/6\.0\ \(compatible;\ MSIE\ 6\.0;\ Windows\ NT\ 5\.2\)$ [NC,OR]

Are these user agents ALWAYS email harvesters? I read somewhere that these may be people who fake their browser user agent to get around what they consider to be annoying Javascript browser detection on websites. I haven't had any luck finding any good threads on this topic on WebmasterWorld.

Wizcrafts

3:26 pm on Nov 2, 2004 (gmt 0)

10+ Year Member



Umbra queried my positronic net with the following question:

# Phoney User_Agents used by email harvesters
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3\.0\ \(compatible\)$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible\ ;\ MSIE.? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible;\ MSIE\ 5\.0;\ Windows\ NT\)$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible;\ MSIE\ 5\.00;\ Windows\ 98$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/6\.0\ \(compatible;\ MSIE\ 6\.0;\ Windows\ NT\ 5\.2\)$ [NC,OR]

It has been my personal experience that people, or bots that have visited my websites, using any of those user agents, have always been up to no good. I especially find that the first two UAs, which are missing the semi-colon, are always harvesters, guestbook spammers, or formmail exploiters. The one ending in NT is usually a spybot conducting surveillance, ignoring robots.txt directives.

Most people do not modify their browser's ID string, or even know that this is possible. Those that possess this knowledge and use it are cloaking their activities, for some reason.

I added these UA's after analyzing my logs by user agent and what they requested. My conclusion is that if these are human visitors they have either purposely mis-identified their browser, or they are using a program that has such a UA. In either case, I don't want them wasting my bandwidth.

Wiz