Forum Moderators: coopster & phranque


A Close to perfect .htaccess ban list

         

toolman

3:30 am on Oct 23, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's the latest rendition of my favorite ongoing artwork... my beloved .htaccess file. I've become quite fond of my little buddy, the .htaccess file, and I love the power it gives me to exclude vermin, pestoids and undesirable entities from my web sites.

Gorufu, littleman, Air, SugarKane? Do you guys see any errors or better ways to do this? Anybody got a bot to add before I stick this in every site I manage?

Feel free to use this on your own site and start blocking bots too.

(the top part is left out)

<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck\.internetseer\.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule !^http://[^/.]\.your-site.com.* - [F]
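
One possible tidier form of the list above, offered as a sketch rather than a drop-in replacement: several user-agent prefixes are folded into one condition with alternation, and the referrer block gets its own rule. Only a few of the agent names are repeated here to keep it short. Because a RewriteRule pattern is tested against the requested path rather than a full URL, a plain catch-all pattern is used in both rules.

```apache
RewriteEngine on

# Several agent names folded into one condition with alternation.
# Only some names from the list are shown, for brevity.
RewriteCond %{HTTP_USER_AGENT} ^(EmailSiphon|EmailWolf|ExtractorPro|CherryPicker|EmailCollector) [OR]
RewriteCond %{HTTP_USER_AGENT} ^(Teleport|Wget|LinkWalker|psbot)
RewriteRule .* - [F]

# Referrer block as a separate rule. The pattern on the RewriteRule line
# is matched against the requested path, so it does not contain a URL.
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org
RewriteRule .* - [F]
```

The last condition before each RewriteRule carries no [OR]; an [OR] there would chain it to a condition that does not exist.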

Superman

6:04 am on Oct 5, 2002 (gmt 0)

10+ Year Member




1. I use FrontPage to build and maintain my site.

2. I've never had a problem with people ripping my site with it.

I don't really see how FrontPage can be used similarly to something like Teleport Pro to rip a site ... it seems like it would be a very slow way to do it.

Anyway, it's easy to block if you want. Just add this line:

RewriteCond %{HTTP_USER_AGENT} FrontPage [NC,OR]

-Superman-

Annii

9:44 am on Oct 5, 2002 (gmt 0)

10+ Year Member



Hi JDMorgan
I uploaded the .htaccess
ErrorDocument 404 /404.htm
ErrorDocument 403 /403.htm
ErrorDocument 501 /404.htm
ErrorDocument 502 /404.htm
ErrorDocument 503 /404.htm

<FilesMatch "\.htm([l])*$">

ForceType application/x-httpd-php

</FilesMatch>

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
............
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org
RewriteRule !^http://www.mydomain.com/403.htm$ - [F,L]

My 404 works fine, though I didn't really understand how to do the 500 tests.
My FilesMatch works fine,
but the rewrite doesn't work. I actually downloaded a trial of WebZip to test it, and it allowed me to download my whole site.
I also used [wannabrowser.com...], typing in examples from the .htaccess file such as WebZIP or WebCopier, and it didn't show my 403.htm file.

Any ideas anyone?

Thanks

Anni

carfac

3:46 pm on Oct 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Annii:

I THINK your problem is:

RewriteRule !^http://www.mydomain.com/403.htm$ - [F,L]

Do not put the URL there, just the page.... like this:

RewriteRule !^403.htm$ - [F,L]

dave

Annii

4:48 pm on Oct 5, 2002 (gmt 0)

10+ Year Member


Dave,
Thanks, but unfortunately, that didn't seem to make any difference...

When I use wannabrowser.com I just type in HTTP User Agent: WebSauger
and location: http://www.mydomain.com

right?

Anyway, that's what I did to try to test it, but I get the HTML of my index page instead of my 403.htm page...

any other thoughts?

Anni

Annii

6:19 pm on Oct 5, 2002 (gmt 0)

10+ Year Member



Hi
Just to let you all know, it's working now. I contacted my host thinking it could be a server problem; they checked it and explained that I had left an OR off one line. Duhhh...

I was obviously having one of my blonde at the roots days!

Thanks for all your help

Anni
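
For anyone who hits the same symptom, here is a small sketch of why a single missing [OR] can silently disable a whole ban list. Conditions without [OR] are ANDed with what follows, so the broken version below asks for a user-agent that starts with two different names at once. The agent names are ones mentioned in this thread; the 403.htm exclusion follows the rule discussed above.

```apache
# Broken: the first condition has no [OR], so the logic becomes
#   WebZIP AND (WebCopier OR WebSauger)
# No user-agent can begin with two different names, so nothing is blocked.
RewriteCond %{HTTP_USER_AGENT} ^WebZIP
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger
RewriteRule !^403\.htm$ - [F,L]

# Fixed: every condition except the last one carries [OR], giving
#   WebZIP OR WebCopier OR WebSauger
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger
RewriteRule !^403\.htm$ - [F,L]
```

Only one of the two versions would go in a real file; they are shown together for comparison.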

carfac

9:54 pm on Oct 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Annii:

Glad it is working. As I think you figured out, yes, that is how to do wannabrowser!

good luck

dave

jdMorgan

12:57 am on Oct 6, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Annii,

I'm glad you got it working too - I was away from WebmasterWorld today. But, as you found out, there's a lot of knowledge here among many members, and now you too are an experienced htaccess debugger!

carfac's comment on your rewrite rule above is correct: The rule's pattern (on the left) uses only a path, and the substitution (on the right) uses either a canonical URL (http://www.mydomain.com/substitutefile.html) or a path (/substitutefile.html). So as carfac said, the rule must read:

RewriteRule !^403.htm$ - [F,L]
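
Putting those pieces together, a minimal sketch of a root-level .htaccess might look like the following. It assumes the custom error page lives at /403.htm, as in Annii's file, and the two agent names are just examples. In a per-directory .htaccess the pattern sees the path without its leading slash, which is why it reads 403.htm and not /403.htm.

```apache
ErrorDocument 403 /403.htm

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier
# Forbid everything except the error page itself, so that a blocked
# agent can still be served the custom 403 page.
RewriteRule !^403\.htm$ - [F,L]
```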

mod_rewrite is like nitroglycerin - Very powerful, but don't drop it! One little typo can blow the whole thing up - a fact I was personally reminded of just two hours ago when I found I'd inadvertently introduced a syntax error into my spambot block RewriteRules, and essentially disabled the whole lot!

Jim

Natashka

12:16 am on Oct 8, 2002 (gmt 0)

10+ Year Member



I don't understand something here, maybe somebody could clear things up for me.
I have an .htaccess on my site, just like the one you are discussing here, but it doesn't work for a simple reason: most of the offline browsers don't identify themselves as such! My logs show that all user-agents are either MSIE, Netscape, or AOL! However, my site is constantly downloaded by those grabbers; some days ago there were 40,000 (!) hits coming from the same IP in a one-hour period. It looks like that browser got stuck on one page or something. Even my webhost said it was probably an offline browser, but all logs indicated that it was MSIE. Yesterday I even downloaded WebZIP myself, turned off MSIE, entered my site with WebZIP and... the logs showed I was using MSIE, and not WebZIP! What's the point of this htaccess if offline browsers "mask" themselves?
I am really frustrated, because not only does it suck up all my bandwidth, but my CTR is dropping dramatically, and I am afraid my advertisers will just kick me out. Maybe somebody can help and explain to me what can be done? I heard there is a way to restrict the number of hits coming from the same IP; does anybody know how to do it? It may be more efficient, since offline browsers "pretend" to be MSIE.

jdMorgan

12:37 am on Oct 8, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Natashka,

Welcome to WebmasterWorld!

We are discussing one aspect of htaccess here - banning bad user-agents by name. The other aspect (the one we are not addressing directly here) is banning by IP addresses, or by range of IP addresses.

Where banning by IP is concerned, two approaches are often used - blocking the IP address manually in htaccess, and a script-based approach used to trap bad bots.

In the script-based approach, you put a link to a "trap file" on one or more of your pages. Then you Disallow that "trap file" in robots.txt. This "trap file" doesn't really exist - accessing it in defiance of the robots.txt actually invokes the script, which then automatically adds an IP-address-based block to your htaccess file.
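
The htaccess side of such a trap could be sketched roughly as follows. All names here are hypothetical: /trap.html stands for the linked "trap file" and /cgi-bin/trap.pl for the script, neither of which comes from this thread, and the blocked address is a placeholder from the documentation range. The script itself is not shown.

```apache
# robots.txt (a separate file) would carry a line such as:
#   Disallow: /trap.html
# Compliant robots never request that path; anything that does is suspect.

RewriteEngine On

# Hand requests for the nonexistent trap file to the script.
# /trap.html and /cgi-bin/trap.pl are hypothetical names.
RewriteRule ^trap\.html$ /cgi-bin/trap.pl [L]

# The kind of entry a script might append for each caught address.
# 192.0.2.10 is a placeholder, not a real offender.
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.10$
RewriteRule .* - [F]
```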

To get you started blocking the more egregious abusers, here's a way to block an IP address manually in htaccess:

# Troublesome AT&T Broadband user
RewriteCond %{REMOTE_ADDR} ^65\.97\.14\.251$
RewriteRule .* - [F,L]

However, more often the real bad guys use entire ranges of addresses, making the blocking more complex. If the patterns below are not clear, you'll need to study up on the regular expressions used for pattern matching in mod_rewrite:

# Cyveillance
RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.2(2[4-9]|[34][0-9]|5[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^63\.226\.3[34]\. [OR]
RewriteCond %{REMOTE_ADDR} ^63\.212\.171\.161$ [OR]
# Webcontent International
RewriteCond %{REMOTE_ADDR} ^65\.102\.12\.2(2[4-9]|3[01])$ [OR]
RewriteCond %{REMOTE_ADDR} ^65\.102\.17\.(3[2-9]|[4-6][0-9]|7[0-1]|8[89]|9[0-5]|10[4-9]|11[01])$ [OR]
RewriteCond %{REMOTE_ADDR} ^65\.102\.23\.1(5[2-9]|6[0-7])$
RewriteRule .* - [F,L]
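
As a reading aid, here is the first Cyveillance condition on its own, with the last-octet pattern unpacked in comments. It is only a sketch of how the range 224 to 255 falls out of the three alternatives; the address is the one posted above, and the alternation is written with the plain vertical bar.

```apache
# Last octet: a literal "2" followed by one of three alternatives
#   2[4-9]     -> 224 to 229
#   [34][0-9]  -> 230 to 249
#   5[0-5]     -> 250 to 255
# Together: 63.148.99.224 through 63.148.99.255
RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.2(2[4-9]|[34][0-9]|5[0-5])$
RewriteRule .* - [F,L]
```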

Hope this helps,
Jim

Superman

1:13 am on Oct 8, 2002 (gmt 0)

10+ Year Member



Natashka,

Virtually all the Offline Browsers have the ability to cloak. However, most of them use their default UA unless it is changed by the user. It's safe to assume most users don't bother. If there is someone who takes the time, there is nothing you can really do about it.

Do you have access to your "raw" logs? If you do, I suspect you will see many of the Offline Browsers listed there. Some stat programs, such as OpenWebScope, usually only list things like IE, Netscape, etc. for some reason. Also, many webhosts that provide stats only list the Top 20 or some other finite # in the stat logs. Obviously the top of the list is going to be dominated by the popular browsers, while things that hit you less often are going to get left off.

Anyway, blocking by IP is easy:

<Limit GET>
order allow,deny
deny from 12.101.35.172
deny from 12.108.37.2
allow from all
</Limit>

In that example, anybody coming from IP 12.101.35.172 or 12.108.37.2 will get a 403 Forbidden page. You can also put in partial IPs to block an entire group. For example, 12.101 will block everything beginning with 12.101.
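
A short sketch of the partial-address form, using the same Order/Deny syntax as the example above. The network/netmask and CIDR lines are equivalent ways of writing the 12.101 range and are shown only for comparison; the /24 range is an illustrative assumption, not an address block anyone in this thread reported.

```apache
<Limit GET POST>
order allow,deny
allow from all
# Partial address: everything beginning with 12.101
deny from 12.101
# The same range written as network/netmask and as CIDR
deny from 12.101.0.0/255.255.0.0
deny from 12.101.0.0/16
# A smaller block (illustrative): 12.108.37.0 through 12.108.37.255
deny from 12.108.37.0/24
</Limit>
```

In practice one of the three equivalent lines is enough. On Apache 2.4 these directives are provided by mod_access_compat.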

Be careful when blocking IPs, because you don't want to accidentally block something like AOL, for example.

I keep this .htaccess only in my "members" directory, since that's where 99 percent of my site content is. I tend to get abused by Access Diver users, and they virtually always use multiple proxies to try and brute force into my members area ... I take the proxies from my logs, verify them using the very same program, and then add them to my .htaccess.

-Superman-
