A Close to perfect .htaccess ban list

         

toolman

3:30 am on Oct 23, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's the latest rendition of my favorite ongoing artwork....my beloved .htaccess file. I've become quite fond of my little buddy, the .htaccess file, and I love the power it gives me to exclude vermin, pestoids and undesirable entities from my web sites.

Gorufu, littleman, Air, SugarKane? You guys see any errors or better ways to do this....anybody got a bot to add....before I stick this in every site I manage.

Feel free to use this on your own site and start blocking bots too.

(the top part is left out)

<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
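RewriteRule .* - [F]
# Also refuse requests referred from this one site. A RewriteRule pattern is
# matched against the URL path, never a full URL, so a negated "!^http://..."
# pattern is always true; a plain .* does the same job.
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org$
RewriteRule .* - [F]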
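
For anyone who finds a long list like the one above hard to maintain, the same kind of ban can probably be written more compactly with regex alternation. This is only a sketch using a handful of the agent names from the list; the split into two conditions is arbitrary, and it has not been tested on a live server.

```apache
RewriteEngine on
# Each condition matches any one of the names at the start of the User-Agent.
RewriteCond %{HTTP_USER_AGENT} ^(EmailSiphon|EmailWolf|ExtractorPro|CherryPicker|EmailCollector) [OR]
RewriteCond %{HTTP_USER_AGENT} ^(Wget|Teleport|LinkWalker|psbot)
# Answer every matching request with 403 Forbidden.
RewriteRule .* - [F]
```

Fewer lines means fewer [OR] flags to get wrong; a missing [OR] in the middle of a long list silently changes the logic.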

Panicschat

10:17 am on Feb 15, 2003 (gmt 0)

10+ Year Member


Oh yes, a couple of things I forgot to mention. I know I don't really need the index.html in the hotlink code, but for a while I did and just left it there.

I have also tried the main .htaccess with the full path to the 403 document, e.g. ErrorDocument 403 http://www.mydomain.org/403.shtml. Still no deal. I know I'm doing something wrong. I know this has to be rewritten, but I'm stuffed if I know how. :) I have an idea from the code in this discussion, but I'd prefer not to 500 my entire site by stuffing up my .htaccess. ;) Ideas and suggestions welcome.
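
One thing that may be relevant here: per the Apache documentation, when ErrorDocument is given a full URL, the server answers with a redirect to that URL, so the client never receives the 403 status itself. A local path keeps the original status code. A minimal sketch, assuming the page sits at /403.shtml in the document root (the path is a guess based on the URL quoted above):

```apache
# Local path: Apache serves the page internally and still returns status 403.
ErrorDocument 403 /403.shtml
```

If the ban rules also cover /403.shtml, the error page itself will be forbidden, so it would need to be exempted from them.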

roelbaz

8:27 pm on Feb 16, 2003 (gmt 0)



Hello all,

I've read this thread, but I still have a problem. I have:

ErrorDocument 401 /error/errorbot.php3?error=401
ErrorDocument 403 /error/errorbot.php3?error=403
ErrorDocument 404 /error/errorbot.php3?error=404
ErrorDocument 500 /error/errorbot.php3?error=500

How do I call the 403 document, i.e. errorbot.php3?error=403, from the RewriteRule:

{....}
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule ^.* - [F,L]

I tried some of the suggestions, but I still get:

"Additionally, a 403 Forbidden
error was encountered while trying to use an ErrorDocument to handle the request."

Can I change the RewriteRule to incorporate the ErrorDocument, and if so, how?

grz, roel

beachbum

11:17 pm on Feb 17, 2003 (gmt 0)



Panicschat; roelbaz

Think about this, you've decided to ban these bots or IPs from viewing ANY file on your site...now you want to serve them your custom error file...a file they are banned from seeing.

It can be done, and 'how' was discussed elsewhere in this thread (I believe...although I can't find it now).

BUT....the theme of this thread has been how to get bad bots off of your site as quickly and efficiently as possible...minimizing the load on your server and your bandwidth.

SO....why do you want to serve up a custom error page? I also have custom error pages (pretty ones....complete with my navigation links) which I serve up to misguided humans who may need and benefit from a little help. But who thinks that bad bots are actually reading their 'helpful' error pages? :-) Why be 'friendly' to them, and waste YOUR resources? Why not dispatch them as quickly as possible? :-)
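
For those who do want the custom page anyway, the "it can be done" part usually comes down to exempting the error document from the ban, so that the 403 handler is not forbidden in turn. A rough, untested sketch, assuming the handler lives at /error/errorbot.php3 as in roelbaz's post:

```apache
RewriteEngine on
ErrorDocument 403 /error/errorbot.php3?error=403

# Exemption first: never apply the ban to the error document itself.
# This condition has no [OR], so it is ANDed with the OR group below.
RewriteCond %{REQUEST_URI} !^/error/errorbot\.php3
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org$
RewriteRule .* - [F,L]
```

The ordering matters: mod_rewrite groups consecutive [OR] conditions together and ANDs the group with the rest, so the exemption has to be a separate condition without the [OR] flag.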

Hello ALL!

Very helpful thread! I did much of this before ever discovering this forum....so, naturally I did a few things a little differently. I'll give some examples of what I've done, and perhaps you'll give me some feedback on doing things one way vs. another.

I've seen this condition frequently, in this forum:
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]

I use this instead:
RewriteCond %{HTTP_USER_AGENT} ^(.*)WebBandit(.*) [NC,OR]

The (.*) says I don't care where in the UA string it appears, it's gone! The [NC] says nothing in the string is 'case-sensitive'.
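
A side note on the pattern above: a RewriteCond pattern with no ^ or $ anchor already matches anywhere in the string, so the surrounding (.*) groups are probably not needed. A minimal sketch of what should be the equivalent form; the second agent name is only there to complete the [OR] chain:

```apache
RewriteEngine on
# Unanchored pattern: matches "WebBandit" anywhere in the User-Agent.
# [NC] makes the comparison case-insensitive.
RewriteCond %{HTTP_USER_AGENT} WebBandit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EmailSiphon [NC]
RewriteRule .* - [F]
```

Dropping the capturing groups also spares the regex engine some unnecessary work on every request.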

I'm currently using this rule:
RewriteRule ^(.*) - [F]

But, I have played around with this one:
RewriteRule ^(.*) [127.1.1.1...] [R=permanent,L]

I know this last rule will take longer to return an error to a browser, but will either rule move the load off my server quicker than the other? I realize that if the banned UA came from a machine running a server, this rule 'might' create a problem on that machine....but, gee, that's their problem, right? :-)

Any thoughts, pro or con would be appreciated!

thermoman

12:33 am on Feb 18, 2003 (gmt 0)

10+ Year Member



Anybody already tried SugarPlum? www.devin.com/sugarplum/

"Sugarplum employs a combination of Apache's mod_rewrite URL rewriting rules and perl code. It combines several anti-spambot tactics, including fictitious (but RFC822-compliant) email address poisoning, injection with the addresses of known spammers (let them all spam each other), deterministic output, and "teergrube" spamtrap addressing."

Hi,

I'm using another way to fool spammers' spiders:

<snip>

This is _not_ a guestbook - you have been warned ;-)

greetings from germany,
Marcel.

richard

1:02 am on Feb 27, 2003 (gmt 0)

10+ Year Member



A very impressive thread. My 2 bits of mod_rewrite knowledge are not worth adding to it, except to reiterate what others have said: read the Apache documentation [httpd.apache.org].

A minor aside: some time ago, as I was first tackling mod_rewrite, I thought I'd discovered a minor bug (can't remember what it was, except that it was a bug in my logic). In total naivety I sent the good man Ralf S. Engelschall an email, only to get a "Mail delivery failed: returning message to sender". The reason being <sigh>Mr Engelschall was getting too much spam</sigh>.

P.S.
I liked andreasfriedrich's "If you care about freedom be permissive, if you are paranoid be restrictive."

And a tiny bit of irony, including self irony, the definition of an expert: x is an uncertain quantity and spurt is a drip under pressure ;-).

Another post closer to being a "preferred member".

Brett_Tabke

2:06 am on Feb 28, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



A nice article I found in the referrer logs from Mark Pilgrim's Diveintomark.org site:

Mouse over it - this crowd will just love this url [diveintomark.org]! Thanks Mark.

lorax

2:35 am on Feb 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That's a beaut Brett - Thanks!

Hester

12:12 pm on Feb 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have these questions:

1) What are the bot companies doing with all the data they take?

2) Isn't it illegal?

3) I've checked my webstats but all I see are lists of IP addresses and unknown names. How can I tell what is good and what is bad? None of the names published here seem to be in my list. (Some obvious search engines are though.)

4) To reiterate a previous post, what's to stop all robots announcing themselves as legitimate browsers? (See point 2!)

5) It's only a matter of time before bots can decipher JavaScript URLs too. Is it even worth trying to fight them when extra bandwidth is fairly cheap, so long as it doesn't impact the genuine user?

Brett_Tabke

4:48 pm on Feb 28, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month


Let's try to stay on topic here in this mega multi-year thread. Hester, please start a new thread in the spider ID forum for the side-topic issues.

Oaf357

1:01 am on Mar 7, 2003 (gmt 0)

10+ Year Member



So you only need to put the .htaccess in your root directory. What if the robots enter from another area?
This 243 message thread spans 25 pages.