Perl Server Side CGI Scripting Forum

A Close to perfect .htaccess ban list
toolman

Msg#: 687 posted 3:30 am on Oct 23, 2001 (gmt 0)

Here's the latest rendition of my favorite ongoing artwork....my beloved .htaccess file. I've become quite fond of my little buddy, the .htaccess file, and I love the power it gives me to exclude vermin, pestoids and undesirable entities from my web sites.

Gorufu, littleman, Air, SugarKane? You guys see any errors or better ways to do this....anybody got a bot to add....before I stick this in every site I manage?

Feel free to use this on your own site and start blocking bots too.

(the top part is left out)

<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule .* - [F]

 

jdMorgan

Msg#: 687 posted 10:49 pm on Oct 4, 2002 (gmt 0)

58sniper,

But since it's already there...

That first UA you've got commented out may have been meant to block "^Mozzilla*" - a misspelled and bogus user-agent. Blocking the "two z's" version is a good idea; blocking the common version certainly isn't! :)

Jim

Annii

Msg#: 687 posted 11:09 pm on Oct 4, 2002 (gmt 0)

JDMorgan,
Thanks very much for your help.
So.. I've removed the 400 error doc from part 1 and changed the 403 error document to 403.htm

I've left part 2 as is because I need to exclude shtml and shtm and this was the only combination I could get to work at the time...

I've changed the last bit of Part 3 as you suggested.

Just 3 quick questions if I can, what does the
RewriteRule !^403.htm$ - [F,L] (in particular the F,L bit) mean at the end?

Is it better to use absolute or relative paths for the error documents?

How can I check that it's working once I've uploaded it?

Thanks again

Anni

carfac

Msg#: 687 posted 11:44 pm on Oct 4, 2002 (gmt 0)

Annii:

RewriteRule !^403.htm$ - [F,L] (in particular the F,L bit) mean at the end?

RewriteRule - the conclusion of all those conditions, and the terminator of that block. It means that if any of those conditions match (any, since you use [OR]), do this.

!- this means not
^- means the request string BEGINS EXACTLY
403.htm- the name of your 403 file
$- ends exactly (so the above will not match 403.html!)

- means do nothing (make no substitution)

F means this is forbidden- return FORBIDDEN (403)
L means LAST- do nothing further and end all rewrite rules for any request affected by this block

Is it better to use absolute or relative paths for the error documents?

for the first part of the RewriteRule, use a URI from the root directory
for the second part, you HAVE to use a full URL (http://www.domain.com/)

How can I check that it's working once I've uploaded it?

I would recommend going to [wannabrowser.com...]

and spoofing your UA!

Good Luck!

dave

jdMorgan

Msg#: 687 posted 12:39 am on Oct 5, 2002 (gmt 0)

I see that carfac has already replied, but I was called away while writing this, and since it took a while to write, I'm gonna post it anyway... :)

I've left part 2 as is because I need to exclude shtml and shtm and this was the only combination I could get to work at the time...

The change I suggested will do the same. You just had a somewhat complicated and inefficient regex pattern saying, "match anything that ends with htm or html". The two methods are equivalent, except that the original method would match htm, html, htmll, htmlll, or htmlllllllllllllllllllll, etc.
The new method will match anything that ends with htm or html only.

If you want to exclude shtm and shtml files from the match, use <FilesMatch "\.html?$"> which will require the path to end with ".htm" or ".html".
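Applied to that Part 2 block, the tighter version would look something like this (a sketch, assuming the same ForceType use that appears later in the thread):

<FilesMatch "\.html?$">
ForceType application/x-httpd-php
</FilesMatch>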

Just 3 quick questions if I can, what does the
RewriteRule !^403.htm$ - [F,L] (in particular the F,L bit) mean at the end?

If the conditions match:
Rewrite any requested URL except 403.htm to (blank URL), return a Forbidden server status code, and stop processing rewrite rules, this is the Last one to process. The result is that any banned User-Agent will receive a 403-Forbidden server response, and it will be redirected to your custom 403 error page, 403.htm (which is why you don't want to rewrite that URL if it is subsequently requested). Most bad-bots will not follow this redirect, but that's OK.
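Put together, the pieces described here look roughly like this (a condensed sketch rather than anyone's actual file; the two bot names are just examples taken from the lists in this thread):

ErrorDocument 403 /403.htm
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier
RewriteRule !^403\.htm$ - [F,L]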

Is it better to use absolute or relative paths for the error documents?

ErrorDocument paths should be local paths (like /403.htm) rather than full URLs; otherwise a 302-Moved Temporarily response code will be sent to the requesting client, masking the correct error code.

How can I check that it's working once I've uploaded it?

That's tricky... The first thing to check is whether you can still access your web site. Various errors in .htaccess can result in a 500-Server Error code being returned, and your site will be inaccessible. Be ready to remove your new .htaccess and replace it with a known-good backup if this happens! Then view your server error log to find out what caused the server error.

The next part can be done several ways. Checking whether your User-agent blocks work can be accomplished by modifying your registry entries for Internet Explorer to make it send a blocked User-agent string. Do this only if you are familiar with registry backups and editing! Otherwise, you can simply check your log files once in a while to confirm that bad bots are being blocked as expected.
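If you'd rather not touch the registry, a tiny Perl script (this being the Perl forum) can do the spoofing for you; the banned UA string and the URL below are placeholders, not anything from this thread:

#!/usr/bin/perl
# Quick test: request a page while claiming to be a banned user-agent,
# then check whether the server answers 403 Forbidden.
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new( agent => 'WebZIP/4.0' );   # a UA from the ban list
my $response = $ua->get('http://www.mydomain.com/');

print $response->status_line, "\n";   # expect "403 Forbidden" if the block works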

Testing the custom 404 error document is easy, just request a non-existent page from your site. Testing the 500-series codes is more difficult, since you will need to create redirects for several non-existent files and then request those files in order to test the custom handlers:

RewriteRule ^test501.htm$ - [R=501,L]
RewriteRule ^test502.htm$ - [R=502,L]
etc.

Also, unless you are handling password logins with a custom script, I suggest that you do not redirect 401s to a custom error document.

Again, spending some time reviewing the Apache server documentation [httpd.apache.org] will clear up many questions. I print it out once a year, or when my current copy is worn out, whichever comes first! :)

Jim

[edited by: jdMorgan at 12:59 am (utc) on Oct. 5, 2002]

Annii

Msg#: 687 posted 12:53 am on Oct 5, 2002 (gmt 0)

Dave and JD Morgan
Thank you both so much for explaining this and taking so much time to do so... I really appreciate it. It's late here (UK) so I'll give it all a go tomorrow and see how I get on :)

Thanks again

Anni

Superman

Msg#: 687 posted 2:52 am on Oct 5, 2002 (gmt 0)
Wow, this thread has gotten a lot of activity lately after being dead for a long time.

.htaccess rocks, and there are many other things I use it for. Here is a great one for preventing people from hotlinking your files:

RewriteEngine On
RewriteCond %{HTTP_REFERER} !^http://([a-z0-9-]+\.)*yourdomain\.com/ [NC]
RewriteCond %{HTTP_REFERER} !^http://([a-z0-9-]+\.)*12\.345\.67\.890/ [NC]
RewriteRule .* http://yourdomain.com [L,R]

Obviously "yourdomain.com" would be your domain name, and the "12.345.67.890" would be your site's domain #.

For example, I use this one in my "images" directory to prevent people from hotlinking my images on their sites. I also have it in my "logs" folder so they can't view my site logs.
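One caveat worth noting: some browsers and privacy proxies send no Referer header at all, and the rules above would lock those visitors out of the images as well. A common variant (a sketch, not part of the setup described above) lets blank referers through:

RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://([a-z0-9-]+\.)*yourdomain\.com/ [NC]
RewriteRule .* http://yourdomain.com/ [L,R]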

-Superman-

Superman

Msg#: 687 posted 3:01 am on Oct 5, 2002 (gmt 0)

Another way to shorten your list is to put those bots that read and respect robots.txt there instead. I have moved "ia_archiver", "psbot", and "SlySearch" to my robots.txt file. "internetseer.com" also reads and respects the robots.txt file, although I actually signed up for their service so I don't block them anymore.
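For reference, moving a well-behaved bot into robots.txt looks like this (using two of the bots named above):

User-agent: ia_archiver
Disallow: /

User-agent: psbot
Disallow: /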

-Superman-

Superman

Msg#: 687 posted 3:07 am on Oct 5, 2002 (gmt 0)

Sniper, you are blocking Googlebot by blocking everything with "bot" in it. Probably a bunch of other good bots as well.

-Superman-

Superman

Msg#: 687 posted 3:27 am on Oct 5, 2002 (gmt 0)

Here is my latest, thoroughly researched .htaccess file to block evil bots and site downloaders ... with some new tricks integrated from the recent posts:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

Some notes:

1. The [1] at the beginning and the [2] at the end of my original file were added sometime after this site's format changed (something to do with the formatting codes). Anyway, they did not belong there.

2. All the things in my list have been thoroughly researched. 90 percent of them are Site Downloaders. There are also some email harvesters and other evil things (like VoidEye).

3. If you know a bot respects robots.txt, put it there. It will shorten your list (see my post above). If anybody sees something in my list that definitely obeys robots.txt, please let me know.

4. Adding [NC,OR] to all of your entries will only make your file that much bigger. 99 percent of these things always use the exact user-agent name. If there are anomalies (like HTTrack), then by all means make that entry case-insensitive. The same goes for the ^ character - they always start the same way.

-Superman-

stapel

Msg#: 687 posted 5:04 am on Oct 5, 2002 (gmt 0)

I note that you do not include FrontPage in your list. If you don't mind my asking: Why not?

Eliz.

Superman

Msg#: 687 posted 6:04 am on Oct 5, 2002 (gmt 0)


1. I use FrontPage to build and maintain my site.

2. I've never had a problem with people ripping my site with it.

I don't really see how FrontPage can be used similarly to something like Teleport Pro to rip a site ... it seems like it would be a very slow way to do it.

Anyway, it's easy to block if you want. Just add this line:

RewriteCond %{HTTP_USER_AGENT} FrontPage [NC,OR]

-Superman-

Annii

Msg#: 687 posted 9:44 am on Oct 5, 2002 (gmt 0)

Hi JDMorgan
I uploaded the .htaccess
ErrorDocument 404 /404.htm
ErrorDocument 403 /403.htm
ErrorDocument 501 /404.htm
ErrorDocument 502 /404.htm
ErrorDocument 503 /404.htm

<FilesMatch "\.htm([l])*$">

ForceType application/x-httpd-php

</FilesMatch>

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
............
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org
RewriteRule !^http://www.mydomain.com/403.htm$ - [F,L]

My 404 works fine (I didn't really understand how to do the 500 tests),
my FilesMatch works fine,
but the rewrite doesn't work. I actually downloaded a trial of WebZIP to test it, and it allowed me to download my whole site...
I also used [wannabrowser.com...], typing in examples from the .htaccess file such as WebZIP or WebCopier, and it didn't show my 403.htm file.

Any ideas anyone?

Thanks

Anni

carfac

Msg#: 687 posted 3:46 pm on Oct 5, 2002 (gmt 0)

Annii:

I THINK your problem is:

RewriteRule !^http://www.mydomain.com/403.htm$ - [F,L]

Do not put the URL there, just the page.... like this:

RewriteRule !^403.htm$ - [F,L]

dave

Annii

Msg#: 687 posted 4:48 pm on Oct 5, 2002 (gmt 0)
Dave,
Thanks, but unfortunately, that didn't seem to make any difference...

When I use wannabrowser.com I just type in HTTP User Agent: WebSauger
and location: http://www.mydomain.com

right?

Anyway that's what I did to try to test it, but I get the HTML of my index page instead of my 403.htm page...

any other thoughts?

Anni

Annii

Msg#: 687 posted 6:19 pm on Oct 5, 2002 (gmt 0)

Hi
Just to let you all know, it's working now. I contacted my host thinking it could be a server problem, and they checked it and explained that I had left an [OR] off one line. Duhhh...

I was obviously having one of my blonde at the roots days!

Thanks for all your help

Anni

carfac

Msg#: 687 posted 9:54 pm on Oct 5, 2002 (gmt 0)

Annii:

Glad it is working. As I think you figured out, yes, that is how to do wannabrowser!

good luck

dave

jdMorgan

Msg#: 687 posted 12:57 am on Oct 6, 2002 (gmt 0)

Annii,

I'm glad you got it working too - I was away from WebmasterWorld today. But, as you found out, there's a lot of knowledge here among many members, and now you too are an experienced htaccess debugger!

carfac's comment on your rewrite rule above is correct: The rule's pattern (on the left) uses only a path, and the substitution (on the right) uses either a canonical URL (http://www.mydomain.com/substitutefile.html) or a path (/substitutefile.html). So as carfac said, the rule must read:

RewriteRule !^403.htm$ - [F,L]
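As a concrete illustration of pattern versus substitution (placeholder filenames, not rules from this thread):

# the pattern on the left matches a path; the substitution may be a path...
RewriteRule ^oldpage\.htm$ /newpage.htm [L]
# ...or a full canonical URL
RewriteRule ^oldpage\.htm$ http://www.mydomain.com/newpage.htm [R=301,L]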

mod_rewrite is like nitroglycerin - Very powerful, but don't drop it! One little typo can blow the whole thing up - a fact I was personally reminded of just two hours ago when I found I'd inadvertently introduced a syntax error into my spambot block RewriteRules, and essentially disabled the whole lot!

Jim

Natashka

Msg#: 687 posted 12:16 am on Oct 8, 2002 (gmt 0)

I don't understand something here, maybe somebody could clear things up for me.
I have an .htaccess on my site, just like the one you are discussing here, but it doesn't work for a simple reason: most of the offline browsers don't identify themselves as such! My logs show that all user-agents are either MSIE, or Netscape, or AOL! However, my site is constantly downloaded by those grabbers; some days ago there were 40,000 (!) hits coming from the same IP in a one-hour period. It looks like that browser got stuck on one page or something. Even my webhost said it was probably an offline browser, but all logs indicated that it was MSIE. Yesterday I even downloaded WebZIP myself, turned off MSIE, entered my site with WebZIP and... the logs showed I was using MSIE, and not WebZIP! What's the point of this .htaccess if offline browsers "mask" themselves?
I am really frustrated, because not only does it suck up all my bandwidth, but my CTR is dropping dramatically, and I am afraid my advertisers will just kick me out. Maybe somebody can help and explain to me what can be done? I heard there is a way to restrict the number of hits coming from the same IP; does anybody know how to do it? It may be more efficient, since offline browsers "pretend" to be MSIE.

jdMorgan

Msg#: 687 posted 12:37 am on Oct 8, 2002 (gmt 0)

Natashka,

Welcome to WebmasterWorld!

We are discussing one aspect of htaccess here - banning bad user-agents by name. The other aspect (the one we are not addressing directly here) is banning by IP addresses, or by range of IP addresses.

Where banning by IP is concerned, two approaches are often used - blocking the IP address manually in .htaccess, and a script-based approach used to trap bad bots.

In the script-based approach, you put a link to a "trap file" on one or more of your pages. Then you Disallow that "trap file" in robots.txt. This "trap file" doesn't really exist - accessing it in defiance of the robots.txt actually invokes the script, which then automatically adds an IP-address-based block to your .htaccess file.
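For anyone who wants to roll their own, here is a bare-bones sketch of such a trap script in Perl. The script name, the .htaccess path, and the deny-line format are assumptions for illustration only - this is not the script referred to above:

#!/usr/bin/perl
# trap.cgi - bare-bones bot trap sketch.
# Link this script from a page, Disallow it in robots.txt, and any client
# that requests it anyway gets its IP appended to .htaccess as a deny line.
# (The .htaccess file must be writable by the web server user.)
use strict;
use warnings;

my $ip       = $ENV{'REMOTE_ADDR'} || '';
my $htaccess = '/path/to/docroot/.htaccess';    # adjust for your server

if ($ip =~ /^\d{1,3}(\.\d{1,3}){3}$/) {
    open my $fh, '>>', $htaccess or die "Cannot append to $htaccess: $!";
    print $fh "deny from $ip\n";
    close $fh;
}

# Send the trapped client a 403 so it gets nothing useful back.
print "Status: 403 Forbidden\n";
print "Content-type: text/plain\n\n";
print "Forbidden.\n";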

To get you started blocking the more egregious abusers, here's a way to block an IP address manually in .htaccess:

# Troublesome AT&T Broadband user
RewriteCond %{REMOTE_ADDR} ^65.97.14.251$

However, more often the real bad guys use entire ranges of addresses, making the blocking more complex. If these are not clear, you'll need to study up on the regular expressions used for pattern matching in mod_rewrite:

# Cyveillance
RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.2(2[4-9]|[34][0-9]|5[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^63\.226\.3[34]\. [OR]
RewriteCond %{REMOTE_ADDR} ^63\.212\.171\.161$ [OR]
# Webcontent International
RewriteCond %{REMOTE_ADDR} ^65\.102\.12\.2(2[4-9]|3[01])$ [OR]
RewriteCond %{REMOTE_ADDR} ^65\.102\.17\.(3[2-9]|[4-6][0-9]|7[0-1]|8[89]|9[0-5]|10[4-9]|11[01])$ [OR]
RewriteCond %{REMOTE_ADDR} ^65\.102\.23\.1(5[2-9]|6[0-7])$
RewriteRule .* - [F,L]

Hope this helps,
Jim

Superman

Msg#: 687 posted 1:13 am on Oct 8, 2002 (gmt 0)

Natashka,

Virtually all the Offline Browsers have the ability to cloak. However, most of them use their default UA unless it is changed by the user. It's safe to assume most users don't bother. If there is someone who takes the time, there is nothing you can really do about it.

Do you have access to your "raw" logs? If you do, I suspect you will see many of the Offline Browsers listed there. Some stat programs, such as OpenWebScope, usually only list things like IE, Netscape, etc. for some reason. Also, many webhosts that provide stats only list the Top 20 or some other finite # in the stat logs. Obviously the top of the list is going to be dominated by the popular browsers, while things that hit you less often are going to get left off.

Anyway, blocking by IP is easy:

<Limit GET>
order allow,deny
deny from 12.101.35.172
deny from 12.108.37.2
allow from all
</Limit>

In that example, anybody coming from IP 12.101.35.172 or 12.108.37.2 will get a 403 Forbidden page. You can also put in partial IP's to block an entire group. For example, 12.101 will block everything beginning with 12.101.
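In .htaccess form, a partial-IP block of that kind would look like this (a sketch using the 12.101 example):

<Limit GET>
order allow,deny
deny from 12.101
allow from all
</Limit>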

Be careful when blocking IP's, because you don't want to accidentally block something like AOL for example.

I keep this .htaccess only in my "members" directory, since that's where 99 percent of my site content is. I tend to get abused by Access Diver users, and they virtually always use multiple proxies to try and brute-force their way into my members area ... I take the proxies from my logs, verify them using the very same program, and then add them to my .htaccess.

-Superman-

Natashka

Msg#: 687 posted 1:40 am on Oct 8, 2002 (gmt 0)

Thanks guys for your replies,

I will look at my "raw" logs; that's true, I didn't look at those. But I got your point: if a user *hides* his real user_agent, nothing can be done :(
I have IP blocking on my site too, and I did block all the past abusers, but it only works against those who have already done something wrong and whose IPs I know. And what about all those potential abusers who *hide* their user_agent and whose IPs I don't know yet?

I was talking about another type of .htaccess, not one keyed to a particular IP address. It just restricts the amount of bandwidth (hits) that comes from ANY IP, not just from certain listed IPs. But I couldn't find a sample of that .htaccess file anywhere on the Internet. I've only heard people saying it's hard to do... Hard but possible! I thought maybe somebody on this board knows something about it.

I've still probably not made myself clear enough... This is an .htaccess that limits EVERYBODY (people, robots, etc.) to, let's say, a maximum of 1000-2000 hits on my site. But not 40,000 hits! :)

jdMorgan

Msg#: 687 posted 2:04 am on Oct 8, 2002 (gmt 0)

an .htaccess that limits EVERYBODY (people, robots, etc.) to, let's say, a maximum of 1000-2000 hits on my site. But not 40,000 hits!

Per-IP, quota-based blocking will require a script and possibly a database function to implement. If anyone here has already done it, I'd like to know, too. I'm pretty sure you can buy a solution, but not sure I can afford it.
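For what it's worth, here is a rough sketch of how a script-based per-IP hourly quota might look in Perl. This is only an illustration of the idea, not a solution anyone here has posted; the DBM file path and the limit are placeholders:

#!/usr/bin/perl
# Per-IP quota sketch: count hits per IP per hour in a DBM file
# and answer 403 once a client goes over the limit.
use strict;
use warnings;
use Fcntl;
use SDBM_File;

my $limit  = 2000;                              # max hits per IP per hour
my $ip     = $ENV{'REMOTE_ADDR'} || 'unknown';
my $bucket = $ip . ':' . int(time() / 3600);    # key for the current hour

tie my %hits, 'SDBM_File', '/tmp/hitcount', O_RDWR|O_CREAT, 0644
    or die "Cannot tie hit counter: $!";

if (++$hits{$bucket} > $limit) {
    untie %hits;
    print "Status: 403 Forbidden\nContent-type: text/plain\n\nQuota exceeded.\n";
    exit;
}
untie %hits;

# ...fall through to normal page generation here...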

However, the approach I described above based on a robots.txt trap and script-based block will probably work for you. The script I described has been posted here on WebmasterWorld.

Jim

carfac

Msg#: 687 posted 4:05 pm on Oct 8, 2002 (gmt 0)

Natashka:

I have a method I have found very effective, if you have root access. I use a spider trap, and have modified that to work with Apache::BlockIP- that way it is somewhat faster than .htaccess. This does seem to pick up most cloaked UA's that try to d/l a lot.

Then I modified BlockIP to BlockAgent (to get UA blocking into Mod_perl from .htaccess)

I would LIKE to add Apache::SpeedLimit, but I can't seem to get that working on my system. But that would be a great addition.

Sticky me if you want details.

dave

Evoken

Msg#: 687 posted 1:40 am on Oct 9, 2002 (gmt 0)
Okay, I've been looking around here for a couple of days trying to figure this out, and I'm to the point where I am desperate! You guys here seem like experts, so hopefully you can help!

I have implemented the .htaccess block files for a few months now, and have no problem blocking things like Teleport Pro, FrontPage, etc.

However, I am having a problem blocking this program "Access Diver" Superman mentioned, so I hope Superman or someone can help me block it.

For those that don't know, this program tries to hack into your password protected members directory using a list of usernames and passwords. It tries the combinations over and over again, until it finds a match. If there is no match for a particular combo, it gives a 401 error and goes on to the next one. Some of these lists can be 100,000 combos long, and not only does it rape my bandwidth, it also gives me a massive error log with all the 401 errors in it!

This program leaves multiple useragents, and apparently rotates them. My Webalizer logs show different useragents than my other stats program. For example, my stats program might show "TWRAITH" while the exact same hit in Webalizer will show "[jp]", so obviously my stat programs can't even figure out the correct useragent.

Before you give me the obvious answer, yes I've tested all the useragents that show up in both my logs. I can easily put them in my .htaccess, and when I test them on wannabrowser, it says they are blocked. Unfortunately, when I run the program (which I downloaded to test against my site), it never sends them to the 403 page, it just tries and tries the combos seemingly oblivious to my .htaccess!

I really want to send this evil program to a 403 error page every time they try and run it, but I've had absolutely no luck and I am ready to pull my hair out.

I am posting my logs from last night (with my IP changed), when I did some repeated tests with this program against my site. You can see how the useragent, OS, etc. are randomly rotated over and over again, but there has to be a way to block these things!

123.000.00.123- amphatamines [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/3.01 ( compatible; [dk]; Windows 95; DigiExt )"

123.000.00.123- blackhawks [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.73 ( compatible; [en]; Windows NT4.0; DigiExt )"

123.000.00.123- letmeenjoy [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.73 ( compatible; [dk]; Windows 95; DigiExt )"

123.000.00.123- fargifiction [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.0 ( compatible; MSIE 4.01; Windows NT5.0; win9x/NT 4.90 )"

123.000.00.123- blackhawks [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.7 ( compatible; MSIE 5.01; AOL 5.0; DigiExt )"

123.000.00.123- srinivassrinivas [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.6 ( compatible; MSIE 5.01; AOL 5.0; FREEI v2.53 )"

123.000.00.123- hhhhhaaaaa [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/3.01 ( compatible; MSIE 4.0; Windows 95; DigiExt )"

123.000.00.123- vvvvvppppp [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.72 ( compatible; MSIE 5.0; Windows 98; athome020 )"

123.000.00.123- sharontaylor [08/Oct/2002:00:55:36 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.7 ( compatible; MSIE 4.0; Windows 95; ezn IE )"

123.000.00.123- Albuquerque [08/Oct/2002:00:55:36 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.72 ( compatible; MSIE 4.01; Windows NT4.0; DigiExt )"

123.000.00.123- Greensboro [08/Oct/2002:00:55:36 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.73 ( compatible; [de]; AOL 5.0; DigiExt )"

<trim>
If anyone has any ideas, I would really appreciate it!

Natashka

Msg#: 687 posted 1:48 am on Oct 9, 2002 (gmt 0)

jdMorgan,
thanks, I'll probably try the robots.txt method. The thing is that I really don't want to block IPs, because, just like Superman said, you can block too many "innocent" people that way. So far I was *lucky* and my abusers were not using any major ISPs. But what if one day an abuser comes from AOL? I cannot block the entire AOL :) That's why I am searching for alternative ways, not based on a particular IP, for when the user_agent is cloaked.

carfac,
thanks, but unfortunately I don't have a root access :(

And one more comment about the .htaccess I saw in this thread. I've noticed "Scooter" on the list. Isn't AltaVista's crawler called "Scooter"? I remember it from the good old days, before they re-designed their site; there was even a cartoon picture of that "Scooter" on their site... That was a looong time ago though, maybe they've changed the crawler's name.

jdMorgan

Msg#: 687 posted 3:30 am on Oct 9, 2002 (gmt 0)

Evoken,

Before you give me the obvious answer, yes I've tested all the useragents that show up in both my logs. I can easily put them in my .htaccess, and when I test them on wannabrowser, it says they are blocked. Unfortunately, when I run the program (which I downloaded to test against my site), it never sends them to the 403 page, it just tries and tries the combos seemingly oblivious to my .htaccess!

It is likely that your server is set up to give higher priority to processing password protection than to processing per-directory (.htaccess) mod-rewrite blocking. "Built-in" access restrictions such as password protection are executed by the server before "custom" methods such as those implemented by the webmaster in .htaccess. Therefore, you see the effect you describe above. Should access diver ever hit a valid password combination, then your .htaccess would be processed, and your 403-Forbidden block would be invoked.

One thing I should point out for the casual reader is that all of these blocking methods work only to restrict access to content. They do not block requests to your site. The main purposes of .htaccess blocking are to restrict access to content and to reduce server bandwidth. The latter assumes that the size of the server error response (401 or 403) is smaller than the size of the object requested; unless security or intellectual property issues are involved, it is counter-productive to block access to a 1kB file and return a 2kB custom error page in its place.

Evoken, because you changed the IP address, I cannot tell whether a block by IP address of the form:
RewriteCond %{REMOTE_ADDR} ^128\.242\.197\.101$
would be effective in your case. This only works if your attacker always uses the same IP address or a small group of addresses. Even if you did block by IP, you would still have to count on the user-agent to "give up" after a certain period of time. The only alternatives if it doesn't give up are to ask your hosting company to "black hole" the offending IP at their firewall/router, or to chase down this IP address and report him to his ISP or host, or to law enforcement (depending on your country). If you can show that these requests are coming from the same source, you could construe them as a denial-of-service attack, and thus possibly get more help than is usual from your hosting company. If dealing with law enforcement, you may have to point out the obvious - that bandwidth costs money, and that therefore this kind of attack is equivalent to theft of goods and services.

I share Natashka's frustration with the limitations of blocking, but you have to decide whether the effort is worthwhile or not. Certainly, you should insist on a hosting company that allows you to use .htaccess and CGI (or other) scripts to protect your site.

Hope this helps,
Jim

Superman

Msg#: 687 posted 3:49 am on Oct 9, 2002 (gmt 0)

Evoken,

I share your frustration with that program! I've never been able to block it either. If what jd said works, you could block the UA's and if the program hits a good combination, it will return a 403 instead of letting them into your members area! Then again, if they try 1000 combos and only get one 403 when the rest are 401's they might be smart enough to figure that out ... I might try that out tonight just to see. I used to list all the UA's in my .htaccess but took them out after similar frustration.

Jd,

Users of that program generally use a large number of proxies. I've seen up to 1000 different ones, all rotated after one or two attempts (to circumvent things like PennyWize). If you have a particular IP blocked, the program quits using that particular one, and keeps trying the ones that work. No matter how many proxies you have blocked, you can never get all of them. My .htaccess to block IPs is huge ... 53K. AccessDiver and others like it really suck because they eat up a huge amount of bandwidth.

-Superman-

jdMorgan

Msg#: 687 posted 4:48 am on Oct 9, 2002 (gmt 0)

Natashka,

I've noticed "Scooter" on the list. Isn't the Altavista's crawler called "Scooter"? I remember it from the old good days, before they've re-designed their site and there was even a cartoon picture of that "Scooter" on their site... That was looong time ago though, maybe they've changed the crawler's name.

Yes, that's AltaVista's spider. Some have added access blocks for Scooter/1.0 because it has recently been ignoring robots.txt. AV is apparently using Scooter/1.0 to index graphics. Scooter/3.2 is the one they usually use to index html pages. I suspect someone added "Scooter" to the list without making this distinction.

It does seem like a long time ago, but it wasn't really, when AV enjoyed all the praise (and criticism) that Google enjoys today. That's the reason we should never target just one search engine when working on our sites, even if that search engine drives 90% of our traffic today. Things change.

Jim

pmkpmk

Msg#: 687 posted 5:40 pm on Oct 9, 2002 (gmt 0)

Hi JDMorgan,

you write:
> One thing I should point out for the casual reader is that all of
> these blocking methods work only to restrict access to content.

Actually my primary goal is to block address harvesters. I don't care (yet) about people downloading the whole site. But we really need to get a lid on this spam.

Unfortunately, I never got the .htaccess stuff working so far. The curse of a part-time webmaster...

carfac

Msg#: 687 posted 6:23 pm on Oct 9, 2002 (gmt 0)

pmkpmk:

Unfortunately, I never got the .htaccess stuff working so far

Any idea why it is not working? Are you on an ISP webspace? Do they allow .htaccess, and do they have mod_rewrite enabled?

dave

pmkpmk

Msg#: 687 posted 10:04 am on Oct 11, 2002 (gmt 0)

Hi Dave,

no, I have root access on my own server, which physically resides about 7m from where I am right now :-)

Excerpts from httpd.conf:

LoadModule rewrite_module /usr/lib/apache/mod_rewrite.so
AddModule mod_access.c
<VirtualHost a.b.c.d>
Options +FollowSymLinks
</VirtualHost>

Excerpts of .htaccess

XBitHack on
Options +FollowSymLinks
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

Error messages in Logfile:

error.log:[Wed Sep 11 11:23:27 2002] [error] [client x.y.z.z] Options FollowSymLinks or SymLinksIfOwnerMatch is off which implies that RewriteRule directive is forbidden: /usr/local/httpd/virtual/....
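One common cause of that error is a <Directory> block in httpd.conf whose Options and AllowOverride settings override the VirtualHost for the document root. A sketch of settings that would allow the .htaccess rules above to run (the directory path here is a placeholder, not the real one):

<Directory "/usr/local/httpd/virtual/your-docroot">
Options +FollowSymLinks
AllowOverride FileInfo Options
</Directory>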
