w(eb(Account|Capt|Copier|rank|Whack|Strip|Zip|ster|bandit|\ services\ client\ proto)|get)
What's a bad agent for one site may be beneficial for another, so a universal block list is pointless.
This is quite clever and potentially speeds up the scanning vs. big long monolithic lists.
While I find block lists problematic in general, the OP's list uses some interesting techniques that deserve to be explored in detail, since most people don't fully understand what's going on in the Apache code being used.
# Compressed, alphabetised user-agent blocklist: one RewriteCond per initial
# letter keeps each pattern short and the scan fast
RewriteCond %{HTTP_USER_AGENT} a(ccess|ds|pp(engine|id)) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} c(a(che|pture)|heckp|law|o(llect|pi|py)|url) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} d(ata|evs|ns|o(main|wn)) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} e(ngine|ezooms) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} f(etch|i(lter|nd)|tp) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} g(enieo|grab) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} harvest [NC,OR]
RewriteCond %{HTTP_USER_AGENT} i(mage|ps) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} j(a(karta|va)) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} l(arbin|i(b(rary|www)|nk)|oad) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} m(icro|j12bot|mcrawl) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} openany [NC,OR]
RewriteCond %{HTTP_USER_AGENT} p(age_test|erl|hpcrawl|ic|pid|review|ython) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} r(everse|g(ana|et)) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} s(bider|c(an|rape|reen)|iph|noop|trip|u(ck|rvey)|ymantec) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} trend [NC,OR]
RewriteCond %{HTTP_USER_AGENT} video [NC,OR]
RewriteCond %{HTTP_USER_AGENT} w(eb-sniffer|get|in(32|http)|otbox) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} yandexmedia [NC,OR]
RewriteCond %{HTTP_USER_AGENT} zoom [NC,OR]
# ...or a blank/absent User-Agent ([NC] would be meaningless on this pattern)
RewriteCond %{HTTP_USER_AGENT} ^-?$
# Never block the error document itself or robots.txt (note the escaped dots)
RewriteCond %{REQUEST_URI} !410\.shtml$ [NC]
RewriteCond %{REQUEST_URI} !robots\.txt$ [NC]
RewriteRule .? - [G,L]
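Note the REQUEST_URI exclusions: when the server returns the 410, it has to be able to serve the error document itself, or the rule would block its own error page in a loop. That implies custom error documents are defined somewhere, presumably along these lines (paths inferred from the regexes, not quoted from anyone's post):

# Assumed companion directives: error pages matching the exclusions above,
# so a blocked robot can still be handed the 403/410 body
ErrorDocument 403 /403.shtml
ErrorDocument 410 /410.shtml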
# "bot" in the UA gets a 410, unless the host resolves to google
# or the UA names bing/msn/yandex
RewriteCond %{HTTP_USER_AGENT} bot [NC]
RewriteCond %{REMOTE_HOST} !google [NC]
RewriteCond %{HTTP_USER_AGENT} !(bing|msn|yandex) [NC]
RewriteCond %{REQUEST_URI} !(403|410)\.shtml$ [NC]
RewriteCond %{REQUEST_URI} !robots\.txt$ [NC]
RewriteRule .? - [G,L]
# "crawl" in the UA, with a pass for sistrix (but see the [NC] caveat below)
RewriteCond %{HTTP_USER_AGENT} crawl [NC]
RewriteCond %{HTTP_USER_AGENT} !sistrix [NC]
RewriteCond %{REQUEST_URI} !(403|410)\.shtml$ [NC]
RewriteCond %{REQUEST_URI} !robots\.txt$ [NC]
RewriteRule .? - [G,L]
# "spider" in the UA, no exemptions
RewriteCond %{HTTP_USER_AGENT} spider [NC]
RewriteCond %{REQUEST_URI} !(403|410)\.shtml$ [NC]
RewriteCond %{REQUEST_URI} !robots\.txt$ [NC]
RewriteRule .? - [G,L]
A good place for this to be *explored*, IMO, is the Apache Web Server forum, or possibly even the Webmaster General forum. If I were looking for htaccess syntax tips, I certainly wouldn't look in the Search Engine Spider and User Agent Identification forum.
Any given robot has one correctly cased form of its name. Only that form should be given a pass.
The [G] flag carries an implied [L].
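So the ,L in the rules quoted above is harmless but redundant; this alone does the job:

# [G] sends 410 Gone and ends this pass through the ruleset by itself
RewriteRule .? - [G]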
I prefer to constrain my access-control RewriteRules to requests in the form
(\.html|/|^)$
Cases of robots walking in off the street and making "cold" requests for non-page files when they haven't already got the page are so rare that it isn't worth making the server stop and evaluate every single request.
Just start your RewriteRules with an all-encompassing
RewriteRule[snip]
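The snipped rule isn't recoverable from the post, but given the stated preference it presumably looked something like this (a sketch only: I'm assuming the elided pattern is the page-request form quoted above, and "badbot" is a hypothetical user-agent fragment, not anyone's real pattern):

# Only page requests (.html, directory, or domain root) reach the conditions
RewriteCond %{HTTP_USER_AGENT} badbot [NC]
RewriteRule (\.html|/|^)$ - [G]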
But in practice I hardly ever use mod_rewrite for access control. Flies-with-an-elephant-rifle sort of thing. Instead it's mod_authz-thingummy alone for IP-based blocks; mod_setenvif leading to "Deny from" for simple UA checks.
I've learned everything that way; never read the documentation.
BrowserMatch Yukkybot keep_out
BrowserMatch ZeroSum keep_out
Deny from env=keep_out
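For anyone on Apache 2.4, where Order/Allow/Deny are deprecated, the equivalent of that Deny-from-env pattern would be (my translation, same env variable assumed):

# Apache 2.4 (mod_authz_core) version of the block above
<RequireAll>
    Require all granted
    Require not env keep_out
</RequireAll>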
RewriteCond %{HTTP_USER_AGENT} Yukkybot [OR]
RewriteCond %{HTTP_USER_AGENT} ZeroSum
RewriteRule . - [F]
Isn't .? the most efficient form?
# Keep robots.txt reachable regardless of any Deny rules elsewhere
<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>
RewriteEngine On
# Pass the request through unchanged and stop processing rewrite rules here
RewriteRule . - [L]
I've got one shared htaccess that's primarily for access control [...] say something once and then forget about it.
I've always assumed that the single most resource-efficient rule is Allow/Deny using an IP address in CIDR form [...] the IP address is the very first thing in a request, which ought to count for something.
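For example (2.2 syntax, as elsewhere in this thread; 192.0.2.0/24 is a documentation range standing in for a real offender):

# CIDR block: tested against the client IP before any header is consulted
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24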
Edit: Isn't .? the most efficient form?
Here we're talking strictly about access-control rules, meaning that (a) nothing gets captured and (b) if it's happening in mod_rewrite, conditions have to be evaluated. My assumption is: the extra work of looking at the specific content of a requested URI is outweighed by the savings in not having to look at conditions at all if it turns out to be a non-page request.
# Version 1: catch-all pattern; the URI is examined in the conditions
# (the two URI conditions need [OR]; ANDed, no request could match both)
RewriteCond %{HTTP_HOST} example.com
RewriteCond %{REQUEST_URI} xxyyzz1 [OR]
RewriteCond %{REQUEST_URI} xxyyzz2
RewriteRule .? - [G]

# Version 2: the URI is examined in the rule's own pattern
RewriteCond %{HTTP_HOST} example.com
RewriteRule ^xxyyzz(1|2)$ - [G]
What I learned is that a given request is matched against the RewriteRule pattern before the associated RewriteCond conditions are checked, so version 2 never evaluates its condition at all for a non-matching URI.
RewriteCond %{HTTP_USER_AGENT} crawl [NC]
RewriteCond %{HTTP_USER_AGENT} !sistrix [NC]
RewriteRule .? - [G]
This is a textbook case of an inappropriate [NC] flag.
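Something like this instead; for illustration I'm assuming the genuine crawler spells itself "SISTRIX" (check the exact casing of the real UA before relying on this):

RewriteCond %{HTTP_USER_AGENT} crawl [NC]
# exemption deliberately case-sensitive: only the correct casing gets a pass
RewriteCond %{HTTP_USER_AGENT} !SISTRIX
RewriteRule .? - [G]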
When there's more than one condition, list them in order of most-likely-to-fail: conditions are ANDed by default and evaluation stops at the first failure, so the arrangement that bails out soonest for ordinary visitors is the cheapest.
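For instance, in the "spider" block the UA test fails for the overwhelming majority of visitors, so it belongs first; the URI exclusion nearly always succeeds and would otherwise be evaluated on every single request:

# Fails fast for ordinary visitors: most UAs don't contain "spider",
# so the URI exclusion below is rarely reached
RewriteCond %{HTTP_USER_AGENT} spider [NC]
RewriteCond %{REQUEST_URI} !(403|410)\.shtml$
RewriteRule .? - [G]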