RewriteRule . - [F,L]
> It is enclosed in:
Why? If you have access to the config file, you already know that you have mod_rewrite.
> The idea is to reduce the chance of missing an htaccess file when adding a new bot etc to the server.
What does this mean? Is htaccess enabled (Override settings) throughout the server, or isn't it? If it is, is mod_rewrite consistently set to inherit? Wouldn't it be easier simply to turn off all overrides, so there is no possibility of an htaccess file interfering?
^.*(winhttp|libwww|perl|curl|wget|harvest|scan|grab|extract).*
The only time you ever need to say ^.* is when you're capturing. (A trailing .* with or without $ is doubly superfluous.) Otherwise it's just more work for the server. Leave off the anchors and the .* entirely. Anchors are most useful when a particular element comes at the very beginning of the UA string: if it isn't right there, stop looking.
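For illustration, a minimal sketch of the unanchored form. The [NC] flag and the RewriteRule lines are assumptions, since the directive around the quoted pattern isn't shown, and the curl/wget split in the second rule is only an example of where an anchor earns its keep:

```apache
# Unanchored: the regex engine finds the token anywhere in the UA string,
# so a leading ^.* and a trailing .* add nothing but extra work.
RewriteCond %{HTTP_USER_AGENT} (winhttp|libwww|perl|harvest|scan|grab|extract) [NC]
RewriteRule ^ - [F]

# Anchored: worth it only when the token sits at the very start of the UA
# string, so a non-match is rejected immediately.
RewriteCond %{HTTP_USER_AGENT} ^(curl|wget) [NC]
RewriteRule ^ - [F]
```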
> As I understand it I cannot have a common htaccess file across several sites.
You can if the sites are grouped in the same physical directory. The shared-hosting version works best if the host uses the “userspace” setup rather than the “primary/addon” setup. (Mine does. In primary/addon setups it gets more convoluted, since the sites aren’t all parallel.) This lets me have a shared htaccess file governing access controls for all sites.* Site-specific stuff--including a couple of things that are the same for all sites but don't work in the shared file--goes in individual sites' htaccess. Notably, mod_rewrite only happens in the individual files. The shared file is mostly mod_setenvif + mod_authwhatsit (I'm on 2.2, but it will transition easily to 2.4).
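Roughly what that split can look like, as a sketch only. The /sites/ and example.com paths and the bad_bot variable name are made up for illustration, the access rules use 2.2 syntax, and it assumes AllowOverride permits at least FileInfo and Limit:

```apache
# /sites/.htaccess  (hypothetical shared parent directory)
# Access control only: mod_setenvif plus the 2.2 Order/Allow/Deny directives.
BrowserMatchNoCase (libwww|harvest) bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot

# /sites/example.com/.htaccess  (hypothetical per-site file)
# mod_rewrite stays here: rewrite rules in a child htaccess replace the
# parent's rather than adding to them, unless inheritance is configured.
RewriteEngine On
# site-specific RewriteRules go here
```

The environment variable and the Deny apply to every site under the shared directory, while each site keeps its own rewrite rules.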
...
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\s\(compatible;\sYandexBot/3\.0;\s\+http://yandex\.com/bots\)$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\s\(compatible;\sbingbot/2\.0;\s\+http://www\.bing\.com/bingbot\.htm\)$ [OR]
...
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\s\(compatible;\sDuckDuckBot-Https/1\.1;\shttps://duckduckgo\.com/duckduckbot\)$
RewriteRule .* - [L]
# common bad user-agents
RewriteCond %{HTTP_USER_AGENT} (agent|analy[sz]|anonymous|bandit|bot|brand|cherrypicker|collector|compatible;[a-z]|craftbot|crawl|deepnet|discover|download|explorer|file|greasemonkey|indy\slibrary|java|larbin|le[ae]ch|legs|link|lynx|mail|netcraft|ninja|n[-_\s]?u[-_\s]?t[-_\s]?c[-_\s]?h|open|php|proxy|ripper|script|search|seo|shodan|sitemap|snoop|sph?ider|stripper|sucker|survey|sweep|torrent|webpictures|webspider|worm) [NC]
RewriteRule . - [F,L]
> IP testing for (eg) googlebot etc, either in htaccess or setenv
Are you considering actual on-the-fly IP lookups, or just a quick test to verify that a crawler that claims to be SearchBot is coming from an attested SearchBot range?
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteCond %{REMOTE_ADDR} !^66\.249
RewriteRule .? - [F]
or
BrowserMatch Googlebot fake_google
SetEnvIf Remote_Addr ^66\.249 !fake_google
Deny from env=fake_google
Replacing “Deny from” with whatever is appropriate for 2.4. The IP can of course be more narrowly constrained (it's really 66.249.64-79) if you find it necessary; I simply don't see fake Googlebots from elsewhere in the /16 (including 80-95) so it isn’t worth the trouble.
> Syntax error on line 11 of /etc/apache2/mods-enabled/setenvif.conf:
> deny not allowed here
> The Deny directive is only permitted in a directory context
File under: Today I Learned :)
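For anyone on 2.4, a sketch of the same fake-Googlebot test with the narrower range and the mod_authz_core equivalent of “Deny from”. Like Deny, the Require lines belong in a <Directory> section or an htaccess file, not in setenvif.conf; where exactly you put them is up to your own layout:

```apache
# Flag anything claiming to be Googlebot, then clear the flag for the
# expected range.
BrowserMatch Googlebot fake_google
# Narrower than ^66\.249: matches 66.249.64.x through 66.249.79.x only.
SetEnvIf Remote_Addr ^66\.249\.(6[4-9]|7[0-9])\. !fake_google

# Apache 2.4 replacement for "Deny from env=fake_google".
<RequireAll>
    Require all granted
    Require not env fake_google
</RequireAll>
```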
> could I use
I should think so. In fact that's probably what most server administrators do if they have more than one site living on the same server. Gather them all in one directory, such as /users/ or /sites/, and then any rules that should apply to everyone all the time go in that directory. (Tangent: On an individual-site level, it’s clever to give your boilerplate directories non-standard names, to stump robots coming in asking for /includes/ and the like. But on the server level it doesn’t matter, since nobody but you will ever see the directory names--unless you’ve made a serious blunder in coding.)
BrowserMatch Googlebot
I think whitespace was asking about the extra bit after the quotation mark (third line of your quoted material). It certainly looks like an artifact of posting, not something that actually occurs in your site code, or else you’d have got a different error.
blocking the term "bot" whilst allowing "bingbot", "googlebot" etc.
RewriteCond %{HTTPS} off
RewriteRule . - [F]
Wouldn't a redirect be better? Legitimate robots will continue requesting http for years after you've changed. (An interesting exception is Yandex: once it has learned that you're accessible at https, it will make all its requests to https only, even for URLs that were redirected before you made the change and therefore never existed at https.)
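The redirect version, as a minimal sketch. It assumes the hostname in the request is the one you want to keep and that TLS terminates on this server rather than on a proxy in front of it:

```apache
# Send plain-http requests to the same host and path over https
# instead of answering 403.
RewriteCond %{HTTPS} off
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
```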