|Getting up to date with htaccess user-agent blocking|
I have a block of bog-standard htaccess stuff that I cut and paste to new websites, and that includes blocking dodgy user-agents.
It seems to work okay, but it's been a long time since I looked into this in detail, and I sometimes wonder whether it's effective enough, or maybe even causing problems I'm not aware of.
So, I wondered if anyone could point me in the direction of accepted current wisdom on the subject. I've already googled, but most of the stuff I found is years old.
I can provide my working code, if required, but I'm not sure if it's bad form, so I'll wait until asked.
Thanks for your help. :-)
When I saw the subject header, my first thought was that it might be useful if a few experienced oldtimers posted their current lists. User-Agents come and User-Agents go. But like the man said, the Ukrainians you will always have with you. And if you remove someone from the list on the grounds that you haven't set eyes on them in several years, you can bet your britches they will show up again next week.
An equally important question is how you're blocking them. For simple text matches my preference is for mod_setenvif in conjunction with mod_authz-whatever:
BrowserMatch ^-?$ keep_out
BrowserMatch blahblah-here keep_out
BrowserMatch other-blahblah-here keep_out
leading up to a single sweeping
Deny from env=keep_out
Some folks use mod_rewrite instead.
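For comparison, a rough sketch of what the same blacklist looks like in mod_rewrite (the blahblah patterns are placeholders, same as above, not real user-agent names):

```apache
# mod_rewrite version of the same blacklist.
# Every condition except the last carries [OR].
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^-?$ [OR]
RewriteCond %{HTTP_USER_AGENT} blahblah-here [OR]
RewriteCond %{HTTP_USER_AGENT} other-blahblah-here
RewriteRule ^ - [F]
```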
That's assuming you blacklist: allow everyone except the folks you explicitly lock out. A few brave souls use whitelisting instead: lock out everyone except some specifically authorized user-agents. It depends on the purpose of your site and the nature of your target audience.
It's all on two lines. The first line starts with
RewriteCond %{HTTP_USER_AGENT}
and is followed by a long (311) list of regexes, separated by |s. At the end is an [NC], and the second line is:
RewriteRule ^.* - [F,L]
I just noticed when I came in here that there's an issue using [NC] with user-agents, so that's a bad start.
|When I saw the subject header, my first thought was that it might be useful if a few experienced oldtimers posted their current lists. |
Banning bots [webmasterworld.com]
Awesome. Thanks, wilderness.
|and is followed by a long (311) list of regexes, separated by |s. |
Urk. Is this happening in htaccess or the config file? I think you said htaccess. If so, the rule will run much more efficiently if you replace all the pipes with separate RewriteConds ending in [OR] (every condition except the last one gets the [OR]). (Put it in a text editor and you can do this bit with a single global replace.)
You've now got a rock-and-a-hard-place dilemma, because
(a) long lists are vastly easier to keep organized if you maintain them in alphabetical order
(b) when there's a long string of RewriteConds, you should list them in order of most-likely-to-succeed. (Or most-likely-to-fail, in the case of an ordinary AND-delimited list.)
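To illustrate the split, with made-up patterns (note the last RewriteCond carries no [OR]):

```apache
# Before: one giant alternation
#   RewriteCond %{HTTP_USER_AGENT} (badbot|evilcrawler|scraper) [NC]
# After: one condition per pattern, [OR] on all but the last
RewriteCond %{HTTP_USER_AGENT} badbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} evilcrawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} scraper [NC]
RewriteRule ^ - [F]
```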
Ok. I split it up into separate RewriteConds. :)
Short of reading through tons of server logs, I'm not sure what the best way of obtaining an up-to-date list of bad user-agents would be, or even whether this is the best way to tackle spammers and scrapers these days. There's nothing useful in my host's AWStats other than a few host names/IPs. Am I better off blocking by IP?
Is there a useful list of known offenders anywhere? ... a good cut-and-paste htaccess code block anyone can recommend? Do I really have to re-invent the wheel and make my own list?
Since folks using "Buster Browser" can switch to "Botser Browser", I like to focus on what bots do rather than who they say they are.
SetEnvIfNoCase Request_URI /timthumb.php blocked
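For completeness, one way that variable might be acted on (2.2-style syntax; the timthumb.php line is from the post above, the rest is an assumed sketch):

```apache
# Flag requests for a script this site doesn't even have --
# only probing robots ask for it -- then deny anything flagged.
SetEnvIfNoCase Request_URI /timthumb\.php blocked
Order Allow,Deny
Allow from all
Deny from env=blocked
```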
So, does that mean I should concentrate on another method, like tar-pitting? Any practical advice would be very welcome, as I've got half a mind to ditch the user-agent screening altogether.
User-agent blocking is one component of a good arsenal. Some UAs will never be up to any good, so it is easiest to block them globally by name. You can't do much about intelligent robots that pretend to be someone else-- unless it's something obvious like a mismatch between IP and UA. But fortunately most robots are quite stupid; it never occurs to them to claim to be something other than libwww-perl or what have you.
311 seems excessive, though. You can almost certainly make the list a lot shorter simply by merging similar UAs and by matching against shorter pieces of the name.
:: shuffling papers ::
My current list, using BrowserMatch or BrowserMatchNoCase, is less than 25 items. The IP list is vastly longer. I've always worked on the assumption that a straight "Deny from aa.bb.cc.dd" places less strain on the server than any other approach you can use.
Well, since my OP, I have deleted about 30 for that very reason (apparently didn't occur to me at the time), but 25 is impressive. I don't suppose you'd care to share the method, would you?
No mystery at my end. It's simply that I only use BrowserMatch for elements that come down to one or two words. The current list-- it's so short, I may as well post it in full-- goes
BrowserMatch ^-?$ keep_out
BrowserMatch Ahrefs keep_out
BrowserMatch "America Online Browser" keep_out
BrowserMatch AppEngine keep_out
BrowserMatch "Bork-edition \[en\]" keep_out
BrowserMatch Clipish keep_out
BrowserMatch Covario keep_out
BrowserMatch CoverScout keep_out
BrowserMatch "Extreme Picture Finder" keep_out
BrowserMatch FairShare keep_out
BrowserMatchNoCase HTTrack keep_out
BrowserMatch "Jakarta Commons-HttpClient/3\.1" keep_out
BrowserMatch "Java/" keep_out
BrowserMatch kalooga keep_out
BrowserMatchNoCase libcurl keep_out
# comment-out following for link checker
BrowserMatchNoCase libwww-perl keep_out
BrowserMatch "Mozilla/[0-3]" keep_out
BrowserMatch "MSIE [1-4]\." keep_out
BrowserMatch NativeHost keep_out
BrowserMatchNoCase Python keep_out
BrowserMatch "rarely used" keep_out
BrowserMatchNoCase scanner keep_out
BrowserMatch Synapse\) keep_out
BrowserMatch TencentTraveler keep_out
BrowserMatch vcbot keep_out
BrowserMatch webcollage keep_out
BrowserMatch WebReaper keep_out
BrowserMatch web/snippet keep_out
BrowserMatchNoCase Wget keep_out
BrowserMatch Wikimpress keep_out
BrowserMatch Yahoo keep_out
The emphasis is on user-agents that show up in browser addons or low-budget robots so an IP block is no use. Well-known search engines get supplementary checks in mod_rewrite. Currently it's a toggle: "claims to be googlebot but is not from known google IP" b/w "comes from bing IP but does not identify self as bingbot or msn-media"
... which reminds me that just recently I've been getting image requests from the bingbot by that name, but that's for another thread
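A sketch of the first half of that toggle (the 66.249 prefix is assumed here as Google's crawl range; check the published ranges before relying on it):

```apache
# Fake googlebot: claims the Googlebot UA but doesn't come
# from a 66.249.* address (assumed Google crawl range -- verify).
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule ^ - [F]
```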
Thanks, lucy24. I now have a few up to date sources, so I'll compile a list and get it installed today. I was using mod_rewrite, but I like the Apache alternatives that read more like English, so I will give BrowserMatch/Deny a bash this time.
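For anyone following along, the pattern boils down to something like this (a couple of UAs borrowed from lucy24's list; 2.2 syntax, with what I understand to be the 2.4 equivalent in comments):

```apache
# Tag the offenders...
BrowserMatchNoCase HTTrack keep_out
BrowserMatchNoCase libwww-perl keep_out
# ...then deny anything tagged. Apache 2.2:
Order Allow,Deny
Allow from all
Deny from env=keep_out
# Apache 2.4 equivalent:
# <RequireAll>
#   Require all granted
#   Require not env keep_out
# </RequireAll>
```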