Gorufu, littleman, Air, SugarKane? You guys see any errors or better ways to do this... anybody got a bot to add... before I stick this in every site I manage?
Feel free to use this on your own site and start blocking bots too.
(the top part is left out)
<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck\.internetseer\.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
# Refuse any request whose referer is exactly http://www.iaea.org
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org$
RewriteRule .* - [F]
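On the "better ways to do this" question, one possible tidy-up is to fold several user-agent conditions into a single alternation and let the [NC] flag handle upper/lower case, instead of writing character classes like [Ww]. This is only a sketch using a handful of names from the list above, not a drop-in replacement for the full rule set. Note that [NC] widens the match to every capitalization of each name.

```apache
RewriteEngine on
# One alternation replaces several [OR] lines.
# [NC] makes the comparison case-insensitive, so "webbandit" and "WebBandit" both match.
RewriteCond %{HTTP_USER_AGENT} ^(EmailSiphon|EmailWolf|ExtractorPro|CherryPicker|WebBandit|EmailCollector) [NC]
# Answer every matching request with 403 Forbidden.
RewriteRule .* - [F]
```

The last condition before the RewriteRule must not carry [OR], the same as in the long form.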
I will look at my "raw" logs, that's true, I didn't look at those. But I got your point, if a user *hides* his real user_agent, nothing can be done :(
I have IP blocking on my site too, and I did block all the past abusers, but it only works against those who have already done something wrong and whose IP I know. And what about all those potential abusers who *hide* their user_agent and whose IPs I don't know yet?
I was talking about another type of .htaccess, one that doesn't block by particular IP address. It just restricts the amount of bandwidth (hits) that comes from ANY IP, not just from certain listed IPs. But I couldn't find a sample of that kind of .htaccess file anywhere on the Internet. I've only heard people saying it's hard to do... Hard but possible! I thought maybe somebody on this board knows something about it.
I've probably still not made myself clear enough... This is an .htaccess that limits EVERYBODY (people, robots etc...) to let's say a maximum of 1000-2000 hits on my site. But not 40,000 hits! :)
an .htaccess that limits EVERYBODY (people, robots etc...) to let's say a maximum of 1000-2000 hits on my site. But not 40,000 hits!
Per-IP, quota-based blocking will require a script and possibly a database function to implement. If anyone here has already done it, I'd like to know, too. I'm pretty sure you can buy a solution, but not sure I can afford it.
However, the approach I described above based on a robots.txt trap and script-based block will probably work for you. The script I described has been posted here on WebmasterWorld.
Jim
I have a method I have found very effective, if you have root access. I use a spider trap, and have modified that to work with Apache::BlockIP; that way it is somewhat faster than .htaccess. This does seem to pick up most cloaked UAs that try to d/l a lot.
Then I modified BlockIP to BlockAgent (to move UA blocking from .htaccess into mod_perl).
I would LIKE to add Apache::SpeedLimit, but I can't seem to get that working on my system. But that would be a great addition.
Sticky me if you want details.
dave
I have implemented the .htaccess block files for a few months now, and have no problem blocking things like Teleport Pro, FrontPage, etc.
However, I am having a problem blocking this program "Access Diver" Superman mentioned, so I hope Superman or someone can help me block it.
For those who don't know, this program tries to hack into your password-protected members directory using a list of usernames and passwords. It tries the combinations over and over again, until it finds a match. If there is no match for a particular combo, it gets a 401 error and goes on to the next one. Some of these lists can be 100,000 combos long, and not only does it devour my bandwidth, it also gives me a massive error log with all the 401 errors in it!
This program sends multiple useragents, and apparently rotates them. My Webalizer logs show different useragents than my other stats program. For example, my stats program might show "TWRAITH" while the exact same hit in Webalizer will show "[jp]", so obviously my stats programs can't even figure out the correct useragent.
Before you give me the obvious answer, yes I've tested all the useragents that show up in both my logs. I can easily put them in my .htaccess, and when I test them on wannabrowser, it says they are blocked. Unfortunately, when I run the program (which I downloaded to test against my site), it never sends them to the 403 page, it just tries and tries the combos seemingly oblivious to my .htaccess!
I really want to send this evil program to a 403 error page every time they try and run it, but I've had absolutely no luck and I am ready to pull my hair out.
I am posting my logs from last night (with my IP changed), when I did some repeated tests with this program against my site. You can see how the useragent, OS, etc. are randomly rotated over and over again, but there has to be a way to block these things!
123.000.00.123- amphatamines [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/3.01 ( compatible; [dk]; Windows 95; DigiExt )"
123.000.00.123- blackhawks [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.73 ( compatible; [en]; Windows NT4.0; DigiExt )"
123.000.00.123- letmeenjoy [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.73 ( compatible; [dk]; Windows 95; DigiExt )"
123.000.00.123- fargifiction [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.0 ( compatible; MSIE 4.01; Windows NT5.0; win9x/NT 4.90 )"
123.000.00.123- blackhawks [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.7 ( compatible; MSIE 5.01; AOL 5.0; DigiExt )"
123.000.00.123- srinivassrinivas [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.6 ( compatible; MSIE 5.01; AOL 5.0; FREEI v2.53 )"
123.000.00.123- hhhhhaaaaa [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/3.01 ( compatible; MSIE 4.0; Windows 95; DigiExt )"
123.000.00.123- vvvvvppppp [08/Oct/2002:00:55:35 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.72 ( compatible; MSIE 5.0; Windows 98; athome020 )"
123.000.00.123- sharontaylor [08/Oct/2002:00:55:36 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.7 ( compatible; MSIE 4.0; Windows 95; ezn IE )"
123.000.00.123- Albuquerque [08/Oct/2002:00:55:36 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.72 ( compatible; MSIE 4.01; Windows NT4.0; DigiExt )"
123.000.00.123- Greensboro [08/Oct/2002:00:55:36 -0400] "HEAD /members/ HTTP/1.0" 401 0 "http://www.mysite.com/members/" "Mozilla/4.73 ( compatible; [de]; AOL 5.0; DigiExt )"
<trim>
If anyone has any ideas, I would really appreciate it!
carfac,
thanks, but unfortunately I don't have a root access :(
And one more comment about the .htaccess I saw in this thread. I've noticed "Scooter" on the list. Isn't AltaVista's crawler called "Scooter"? I remember it from the good old days, before they redesigned their site, when there was even a cartoon picture of that "Scooter" on their site... That was a looong time ago though, maybe they've changed the crawler's name.
Before you give me the obvious answer, yes I've tested all the useragents that show up in both my logs. I can easily put them in my .htaccess, and when I test them on wannabrowser, it says they are blocked. Unfortunately, when I run the program (which I downloaded to test against my site), it never sends them to the 403 page, it just tries and tries the combos seemingly oblivious to my .htaccess!
It is likely that your server is set up to give higher priority to processing password protection than to processing per-directory (.htaccess) mod_rewrite blocking. "Built-in" access restrictions such as password protection are executed by the server before "custom" methods such as those implemented by the webmaster in .htaccess. That is why you see the effect you describe above. Should Access Diver ever hit a valid password combination, then your .htaccess would be processed, and your 403-Forbidden block would be invoked.
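One thing that might be worth trying, given that ordering: Apache's own allow/deny access control is normally checked before the password check (with Satisfy at its default of All), so a block expressed that way may produce the 403 before any 401 is issued. The following is an untested sketch, not something confirmed against Access Diver. It uses the older Order/Deny syntax (Apache 1.3 through 2.2; Apache 2.4 replaced it with Require), assumes mod_setenvif is available and that your host permits these directives in .htaccess, and the variable name block_ua is made up for illustration.

```apache
# Flag requests whose User-Agent contains a token seen in the posted log excerpt.
# "DigiExt" appears in many, but not all, of those lines and is also sent by some
# ordinary browsers, so it is an illustration only, not a reliable signature.
SetEnvIf User-Agent "DigiExt" block_ua

# Deny flagged requests at the access-control stage; everyone else
# continues on to the normal password check.
Order Allow,Deny
Allow from all
Deny from env=block_ua
```

Because the program rotates its user-agents, a rule like this would at best catch part of the traffic.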
One thing I should point out for the casual reader is that all of these blocking methods work only to restrict access to content. They do not block requests to your site. The main purposes of .htaccess blocking are to restrict access to content and to reduce server bandwidth. The latter assumes that the size of the server error response (401 or 403) is smaller than the size of the object requested. Unless security or intellectual property issues are involved, it is counter-productive to block access to a 1kB file and return a 2kB custom error page in its place.
Evoken, because you changed the IP address, I cannot tell whether a block by IP address of the form:
RewriteCond %{REMOTE_ADDR} ^128\.242\.197\.101$
would be effective in your case. This only works if your attacker always uses the same IP address or a small group of addresses. Even if you did block by IP, you would still have to count on the user-agent to "give up" after a certain period of time. The only alternatives if it doesn't give up are to ask your hosting company to "black hole" the offending IP at their firewall/router, or to chase down this IP address and report him to his ISP or host, or to law enforcement (depending on your country). If you can show that these requests are coming from the same source, you could construe them as a denial-of-service attack, and thus possibly get more help than is usual from your hosting company. If dealing with law enforcement, you may have to point out the obvious: that bandwidth costs money, and that therefore this kind of attack is equivalent to theft of goods and services.
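For reference, a condition like the one above needs a RewriteRule after it to take effect. A minimal sketch of a complete IP block covering one exact address and one small range might look like this; the addresses are placeholder documentation ranges, not real offenders.

```apache
RewriteEngine on
# Exact match on a single address (placeholder value).
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.10$ [OR]
# Prefix match on a whole /24 block (placeholder value); note there is no trailing "$".
RewriteCond %{REMOTE_ADDR} ^198\.51\.100\.
# Answer every request from those addresses with 403 Forbidden.
RewriteRule .* - [F]
```

The dots are escaped so that each one matches a literal "." rather than any character.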
I share Natashka's frustration with the limitations of blocking, but you have to decide whether the effort is worthwhile or not. Certainly, you should insist on a hosting company that allows you to use .htaccess and CGI (or other) scripts to protect your site.
Hope this helps,
Jim
I share your frustration with that program! I've never been able to block it either. If what jd said works, you could block the UAs and if the program hits a good combination, it will return a 403 instead of letting them into your members area! Then again, if they try 1000 combos and only get one 403 when the rest are 401s they might be smart enough to figure that out ... I might try that out tonight just to see. I used to list all the UAs in my .htaccess but took them out after similar frustration.
Jd,
Users of that program generally use a large number of proxies. I've seen up to 1000 different ones, all rotated after one or two attempts (to circumvent things like PennyWize). If you have a particular IP blocked, the program quits using that particular one, and keeps trying the ones that work. No matter how many proxies you have blocked, you can never get all of them. My .htaccess to block IPs is huge ... 53K. AccessDiver and others like it really suck because they eat up a huge amount of bandwidth.
-Superman-
I've noticed "Scooter" on the list. Isn't AltaVista's crawler called "Scooter"? I remember it from the good old days, before they redesigned their site, when there was even a cartoon picture of that "Scooter" on their site... That was a looong time ago though, maybe they've changed the crawler's name.
Yes, that's AltaVista's spider. Some have added access blocks for Scooter/1.0 because it has recently been ignoring robots.txt. AV is apparently using Scooter/1.0 to index graphics. Scooter/3.2 is the one they usually use to index HTML pages. I suspect someone added "Scooter" to the list without making this distinction.
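If you want to make that distinction in .htaccess, a sketch along these lines should block only the 1.0 version. It assumes the user-agent string really begins with "Scooter/1.0" as named above, so check your own logs for the exact string before relying on it.

```apache
RewriteEngine on
# Matches user-agents that begin with "Scooter/1.0" (assumed form of the string).
# "Scooter/3.2" does not match, so the regular HTML spider is left alone.
RewriteCond %{HTTP_USER_AGENT} ^Scooter/1\.0
RewriteRule .* - [F]
```

A bare ^Scooter pattern, by contrast, would shut out both versions.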
It does seem like a long time ago, but it wasn't really, when AV enjoyed all the praise (and criticism) that Google enjoys today. That's the reason we should never target just one search engine when working on our sites, even if that search engine drives 90% of our traffic today. Things change.
Jim
you write:
> One thing I should point out for the casual reader is that all of
> these blocking methods work only to restrict access to content.
Actually my primary goal is to block address harvesters. I don't care (yet) about people downloading the whole site. But we really need to get a lid on this spam.
Unfortunately, I haven't gotten the .htaccess stuff working so far. The curse of a part-time webmaster...