Forum Moderators: phranque

Rewrite in conf files

Any special limitations?

         

dstiles

2:18 pm on Mar 19, 2019 (gmt 0)

I am trying to create a general rewrite conf file in order to address common issues across several web sites (currently 3 and counting). The idea is to reduce the chance of missing an htaccess file when adding a new bot etc to the server.

Are there any gotchas in doing this? I'm particularly concerned about blocking baddies and letting in known bots. Snippets from the conf file are below (ellipses used to denote more lines of code). The whole lot is in a single conf file in /etc/apache2 and loaded from apache2.conf as:
include rewrite.conf

It is enclosed in:
<IfModule mod_rewrite.c>
</IfModule>

RewriteEngine On
...
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\s\(compatible;\sYandexBot/3\.0;\s\+http://yandex\.com/bots\)$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\s\(compatible;\sbingbot/2\.0;\s\+http://www\.bing\.com/bingbot\.htm\)$ [OR]
...
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\s\(compatible;\sDuckDuckBot-Https/1\.1;\shttps://duckduckgo\.com/duckduckbot\)$
RewriteRule .* - [L]

The first line, for Yandex, does not work: Yandex has repeatedly hit one site over the past two or three days and always received a 403. Bing and Google (not shown) receive 200. I'm guessing Yandex is stuck in a groove after being blocked before I set up the above; I've had to block its IP ranges in iptables for now.

# kill bad user-agents - this is only one of several rewrite UA blocks that follow
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^[\"\'$%&*()-=+_@~#{}[]<>,.?/|\\\!] [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(<|>|'|%0A|%0D|%27|%3C|%3E|%00).* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ht[tm][lpr] [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(HTTrack|clshttp|archive|loader|email|nikto|miner|python).* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(winhttp|libwww|perl|curl|wget|harvest|scan|grab|extract).* [NC]
RewriteRule . - [F,L]

The above (apparently) blocks python but not zgrab (should be trapped by grab).

# common bad user-agents
RewriteCond %{HTTP_USER_AGENT} (agent|analy[sz]|anonymous|bandit|bot|brand|cherrypicker|collector|compatible;[a-z]|craftbot|crawl|deepnet|discover|download|explorer|file|greasemonkey|indy\slibrary|java|larbin|le[ae]ch|legs|link|lynx|mail|netcraft|ninja|n[-_\s]?u[-_\s]?t[-_\s]?c[-_\s]?h|open|php|proxy|ripper|script|search|seo|shodan|sitemap|snoop|sph?ider|stripper|sucker|survey|sweep|torrent|webpictures|webspider|worm) [NC]
RewriteRule . - [F,L]

This should block anything-[Bb]ot (except bing etc above) but does not. It used to when it was in htaccess. And yes, I restart apache after changes.

Is there something I should look out for when dealing with conf files or is there something stupid in the above code?

ClosedForLunch

3:03 pm on Mar 19, 2019 (gmt 0)

RewriteRule . - [F,L]

Missing asterisk.

Use this instead:

RewriteRule ^ - [F]

As for trapping search engine bots you can simplify it like this:

RewriteCond %{HTTP_USER_AGENT} YandexBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bingbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} DuckDuckBot [NC,OR]
etc

If your rules are now in a global conf file, remove those rules from individual .htaccess files.
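
Put together, a minimal sketch of that whitelist block might look like this (note that the last condition must not carry [OR]):

RewriteCond %{HTTP_USER_AGENT} YandexBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bingbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} DuckDuckBot [NC]
RewriteRule ^ - [L]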

whitespace

4:15 pm on Mar 19, 2019 (gmt 0)

Missing asterisk.


The asterisk is not required - a single dot is sufficient. These directives would seem to be included in a server context (as opposed to a directory context), so the URL-path is always at least a slash.

dstiles

4:23 pm on Mar 19, 2019 (gmt 0)

The file is included directly with no directory or other tags around it, so dot it is.

I know I could simplify the SE traps but at the moment prefer to use the full UA. My next move is to add IP ranges to each engine where known, so I could then allow some leeway; I'm working out the syntax for it.

lucy24

6:16 pm on Mar 19, 2019 (gmt 0)

It is enclosed in:
Why? If you have access to the config file, you already know that you have mod_rewrite.

The idea is to reduce the chance of missing an htaccess file when adding a new bot etc to the server.
What does this mean? Is htaccess enabled (Override settings) throughout the server, or isn't it? If it is, is mod_rewrite consistently set to inherit? Wouldn't it be easier simply to turn off all overrides, so there is no possibility of an htaccess file interfering?

^.*(winhttp|libwww|perl|curl|wget|harvest|scan|grab|extract).*
The only time you ever need to say ^.* is when you're capturing. (A trailing .* with or without $ is doubly superfluous.) Otherwise it's just more work for the server. Leave off the anchors and the .*. Anchors are most useful when a particular element comes at the very beginning of the UA string: if it isn't right there, stop looking.

When matching search-engine user-agents, I don't think you need to supply the exact string at all. Just give the significant part, like YandexBot or bingbot, unanchored--and then, elsewhere, make a separate rule if needed to match UA against IP. (In practice, only Google and Baidu get a lot of fakers--and even Googlebot is going out of fashion.) Otherwise everyone gets locked out every time they update their crawler.

Personally I'd consider mod_rewrite as a last resort for access control, both because of the inheritance issues and the more generalized killing-flies-with-an-elephant-rifle nature of the thing. You can do a lot of the same thing with mod_setenvif, and then it inherits straight down the line.
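
A rough sketch of that approach, assuming 2.2 syntax in an htaccess or <Directory> context (pattern list borrowed from the rules above):

SetEnvIfNoCase User-Agent (HTTrack|nikto|python|zgrab|wget|curl) bad_bot

Order Allow,Deny
Allow from all
Deny from env=bad_bot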

dstiles

8:18 pm on Mar 19, 2019 (gmt 0)

> you have mod_rewrite.

Early days caution. I was pretty sure I could get away with it but it seemed best to include it for now.

All htaccess files are enabled (override) because I'm still developing this and at the moment some things won't work in my rewrite conf (deny from ip, geoip and a few others). As I understand it I cannot have a common htaccess file across several sites. Although whilst writing this it occurred to me: could I use a symlink to a single copy? But that would kill the possibility of adding per-site htaccess code. :(

I agree about the anchors. Originally they were omitted; I added them in an effort to get "bot" rejected, though I didn't really believe that would make a difference. I then forgot to remove them.

Also agree about the SE strings, as noted in the previous posting, but as I said, still working on proper enabling of SE UAs. This is a short-term effort to kill as many fakers as possible whilst I work on perfecting everything. The Yandex behaviour was a puzzler, though, as I got a few hundred hits from them, all to the home page. I think something may have jammed somewhere, either with their bot or my site. From another incident I wonder if an editor backup file (.conf~ or even .htaccess~) may have been included in the Apache restart. That might explain it. I've never liked saving backup files to working directories. :(

More to the point, "bot" and "grab" are not being trapped. Those visits seem to be nightly so I'll have to wait until the morning now.

I've been looking at setenv. As I said, I have a number of IP denies and some geoips still in htaccess, mainly because I could not get them working in rewrite conf; Require always caused an Apache restart fault. I'll be looking at that when rewrite is working, but reading your comment on using setenv I'll try that as well. Something new to try, anyway!

Thanks for the help, lucy! :)

lucy24

8:56 pm on Mar 19, 2019 (gmt 0)

As I understand it I cannot have a common htaccess file across several sites.
You can if the sites are grouped in the same physical directory. The shared-hosting version works best if the host uses the “userspace” setup rather than the “primary/addon” setup. (Mine does. In primary/addon setups it gets more convoluted, since the sites aren’t all parallel.) This lets me have a shared htaccess file governing access controls for all sites.* Site-specific stuff--including a couple of things that are the same for all sites but don't work in the shared file--goes in individual sites' htaccess. Notably, mod_rewrite only happens in the individual files. The shared file is mostly mod_setenvif + mod_authwhatsit (I'm on 2.2, but it will transition easily to 2.4).


* It also lets me set a couple of environmental variables that are not directly used for access control, but are used by some sites to determine which version of robots.txt a given visitor sees.
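
(Purely as an illustration, with directory names invented, the userspace layout I mean is roughly:

/home/account/.htaccess - shared access controls (mod_setenvif + mod_authwhatsit), inherited by everything below
/home/account/site1.example/.htaccess - site-specific rules, including mod_rewrite
/home/account/site2.example/.htaccess - ditto)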

dstiles

4:29 pm on Mar 21, 2019 (gmt 0)

Thanks for the information, Lucy. Much appreciated!

I think I understand what you're saying but I need some time to work it out - time is short at present.

The zgrab bot is still getting a 200. Not sure where I'm going wrong with that. :(

ClosedForLunch

5:20 pm on Mar 21, 2019 (gmt 0)

...
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\s\(compatible;\sYandexBot/3\.0;\s\+http://yandex\.com/bots\)$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\s\(compatible;\sbingbot/2\.0;\s\+http://www\.bing\.com/bingbot\.htm\)$ [OR]
...
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\s\(compatible;\sDuckDuckBot-Https/1\.1;\shttps://duckduckgo\.com/duckduckbot\)$
RewriteRule .* - [L]

If you're going to use the full UA, add the mobile UA too.

# common bad user-agents
RewriteCond %{HTTP_USER_AGENT} (agent|analy[sz]|anonymous|bandit|bot|brand|cherrypicker|collector|compatible;[a-z]|craftbot|crawl|deepnet|discover|download|explorer|file|greasemonkey|indy\slibrary|java|larbin|le[ae]ch|legs|link|lynx|mail|netcraft|ninja|n[-_\s]?u[-_\s]?t[-_\s]?c[-_\s]?h|open|php|proxy|ripper|script|search|seo|shodan|sitemap|snoop|sph?ider|stripper|sucker|survey|sweep|torrent|webpictures|webspider|worm) [NC]
RewriteRule . - [F,L]

'bot' will also deny access to many helpful / useful bots (such as other search engines / advertisers / SEO tools etc).

lucy24

5:24 pm on Mar 21, 2019 (gmt 0)

The zgrab bot is still getting a 200.
Just for ###s and giggles you could try changing the rule to say |z?grab|

The above counts as “sparring for time”.

dstiles

11:40 am on Mar 22, 2019 (gmt 0)

ClosedForLunch - yes, I know. I've been running ASP sites for a couple of decades. But thanks anyway. :)

Lucy, I added zgrab into another string but it still got through.

I found another scuz getting through last night that was supposedly blocked. I think there may be a "spelling mistake" further back in the file that's clobbering it. I'll go through the file again.

dstiles

11:58 am on Mar 23, 2019 (gmt 0)

Changed a few things in the file and zgrab no longer grabs. Getting there slowly. Thanks, all, for the help!

In a few weeks I'll start on converting to setenv.

Meanwhile, has anyone any suggestions as to setting up IP testing for (eg) googlebot etc, either in htaccess or setenv?

lucy24

6:39 pm on Mar 23, 2019 (gmt 0)

IP testing for (eg) googlebot etc, either in htaccess or setenv
Are you considering actual on-the-fly IP lookups, or just a quick test to verify that a crawler that claims to be SearchBot is coming from an attested SearchBot range?

The easy way is
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteCond %{REMOTE_ADDR} !^66\.249
RewriteRule .? - [F]
or
BrowserMatch Googlebot fake_google
SetEnvIf Remote_Addr ^66\.249 !fake_google

Deny from env=fake_google
Replacing “Deny from” with whatever is appropriate for 2.4. The IP can of course be more narrowly constrained (it's really 66.249.64-79) if you find it necessary; I simply don't see fake Googlebots from elsewhere in the /16 (including 80-95) so it isn’t worth the trouble.
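
For 2.4 the shape would presumably be something like this (untested sketch, same variable name):

<RequireAll>
Require all granted
Require not env fake_google
</RequireAll>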

The only other faker I see routinely is Baidu. I can't help with exact numbers there because I block China anyway, whether real or fake, but it’s the same syntax.

The reverse situation--non-SearchBot from a SearchBot crawl range--seems to be most common with bing. Then it's the same pattern, only reversed: within suchandsuch IP, check whether UA doesn’t include the expected string. (Currently I don’t block them, though I have done so in the past.)
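
If you did want to block it, the same 2.2 pattern reversed might look like this (sketch only; one bing prefix assumed as an example):

SetEnvIf Remote_Addr ^157\.55\. suspect_bing_range
SetEnvIf User-Agent bingbot !suspect_bing_range
Deny from env=suspect_bing_range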

dstiles

12:00 pm on Mar 24, 2019 (gmt 0)

On-the-fly lookups may be suitable for bing - they seem to have a policy of "hey, it's my IP, let's use it for a bot" which results in a lot of IPs to list - though allowing the bing UA on certain ranges only would probably suffice, avoiding their cloud etc. Same goes for Yandex. Google isn't so much of a problem and your solution certainly looks ok for that; I'll probably go straight for the setenv version. Thanks.

Duckduckgo uses Amazon and has no discernible IP range; one has to take a chance on that one. Baidu isn't a priority with me but it seems to be constrained to about 9 ranges from China, Europe, USA, Japan and Brazil. Others I have seen on the site so far are seznam and applebot, both of which seem to use limited ranges, so same solution as for Google.

I re-enabled Yandex a couple of days ago and it's been hammering the home page of one of the two sites ever since, as it did before. Disabled it again today using robots.txt and it's so far honoured it.

Thanks again for the help, Lucy.

dstiles

5:53 pm on Apr 7, 2019 (gmt 0)

Starting on setenv conversion. The following example (from Lucy) is in setenvif.conf ...

> BrowserMatch Googlebot fake_google
> SetEnvIf Remote_Addr ^66\.249 !fake_google
> Deny from env=fake_google "BrowserMatch Googlebot fake_google

... results in

> Syntax error on line 11 of /etc/apache2/mods-enabled/setenvif.conf:
> deny not allowed here

I have a similar response when attempting to run geoip in a conf file.

I have tried variations on the deny theme, including some Require forms, but no luck.

whitespace

6:11 pm on Apr 7, 2019 (gmt 0)

> Syntax error on line 11 of /etc/apache2/mods-enabled/setenvif.conf:
> deny not allowed here


The Deny directive is only permitted in a directory context, i.e. if you are using this in your server config then you need to surround it with the appropriate <Directory> container.

But if you are on Apache 2.4 then you should be using the corresponding Require directive, not Deny (Allow, Order).
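
i.e. roughly this (path assumed):

<Directory "/srv">
Deny from env=fake_google
</Directory>

or, on 2.4, a Require-based equivalent inside the same container.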

What's the "BrowserMatch Googlebot fake_google" that appears to be following that directive on the same line? Is that a typo?

lucy24

6:38 pm on Apr 7, 2019 (gmt 0)

The Deny directive is only permitted in a directory context
File under: Today I Learned :)

I tend to assume that Apache directives work on a cascade: anything that can be used in htaccess can also be used in <Directory>, vhost, or loose in config; anything that can be used in <Directory> can also be used in vhost or loose in config ... and so on. Apparently no. The same Directory-or-htaccess limitation applies, predictably, to Allow and Order--which between them cover all of mod_authz_host in 2.2.

The 2.4 version also lets you put the Require directives in <Files> or <Location> envelopes--but still not in vhost or loose in config.

This is probably not a problem if all of your outward-facing files are ultimately in the same directory. Just put all your rules in that directory. I guess it can lead to chaos if you've got RewriteRules there and also in your vhost envelopes--unless you're in 2.4 (did you say at some point?) and can fine-tune inheritance.

dstiles

3:06 pm on Apr 8, 2019 (gmt 0)

whitespace - I tried require (which I use elsewhere) but that failed as well.

> BrowserMatch Googlebot

See Lucy's earlier post.

Lucy - I've looked at <directory>, <location> etc before and am not sure they would do what I want. Given a vhost structure of...
/srv/site1root/
/srv/site2root/
...etc...

... could I use...
<directory "/srv">
setenv ...
require...
</directory>

...to apply the contents to ALL sites?

lucy24

3:57 pm on Apr 8, 2019 (gmt 0)

could I use
I should think so. In fact that's probably what most server administrators do if they have more than one site living on the same server. Gather them all in one directory, such as /users/ or /sites/, and then any rules that should apply to everyone all the time go in that directory. (Tangent: On an individual-site level, it’s clever to give your boilerplate directories non-standard names, to stump robots coming in asking for /includes/ and the like. But on the server level it doesn’t matter, since nobody but you will ever see the directory names--unless you’ve made a serious blunder in coding.)

Either Deny/Allow or Require, depending on Apache version. If you’ve used Require elsewhere, that’s 2.4.

:: pause to direct dirty look in direction of hosts, who are being dilatory ::

BrowserMatch Googlebot
I think whitespace was asking about the extra bit after the quotation mark (third line of your quoted material). It certainly looks like an artifact of posting, not something that actually occurs in your site code, or else you’d have got a different error.

dstiles

6:53 pm on Apr 13, 2019 (gmt 0)

Ok, it's taken a while as this is a secondary project. Thanks for staying with me and rendering help!

I have now moved several directives to a new conf file named use-setenv.conf in /etc/apache2. Before this my rewrite.conf detailed above did not seem to work - at least in many respects. The new setenv conf seems to be no better. :(

The new file is shown below. The only thing that seems to work is the first line (urlwatch), whose dontlog variable is referenced by the CustomLog directive in each site's VirtualHost as env=!dontlog. Request_Method, BrowserMatch etc seem not to work. I've accessed a site using a browser with a Google UA and can gain access, though it should be tied to IP. I know it's not a fatal syntax error as those are reported at restart time. I suspect it's not engaging somehow, but equally I know the file is being used as I sometimes get a syntax error when I make a typo.

The final lines of apache2.conf are...

--------------
Include vhosts.conf
IncludeOptional conf-enabled/*.conf
IncludeOptional sites-enabled/*.conf
Include use-setenv.conf
Include rewrite.conf
--------------

use-setenv.conf is...

#========
# special version of setenvif for all sites
<directory "/srv">

SetEnvIfNoCase User-Agent urlwatch dontlog

# protocol limits
SetEnvIf Request_Protocol HTTP/0.9 too_low_proto
SetEnvIf Request_Protocol HTTP/1.0 too_low_proto
Require env too_low_proto denied

# request type - only allow get/post/head
SetEnvIfNoCase Request_Method ^(delete|options|trace|track) getpost_inhibit
SetEnvIfNoCase Request_Method ^(GET|POST|HEAD) !getpost_inhibit
Require env getpost_inhibit denied

#====== bot UAs
# unwanted bot UAs
BrowserMatch Googlebot-Image unwanted-goodbot
Require env unwanted-goodbot denied

#====== good bot UAs
BrowserMatch bingbot fake_bing
SetEnvIf Remote_Addr ^(40\.77\.167|64\.4\.13|64\.4\.50|64\.4\.54|65\.54\.16[45]|65\.54\.247|65\.55\.25|131\.253\.[2-4][468]|157\.55\.2|157\.55\.39|157\.55\.154|199\.30\.[1-3][0-9])\. !fake_bing
Require env fake_bing denied

# google
BrowserMatch Googlebot fake_google
SetEnvIf Remote_Addr ^66\.249 !fake_google
Require env fake_google denied

# duckduckgo - 50.16.247.234
BrowserMatch DuckDuckBot fake_duck
SetEnvIf Remote_Addr ^50\.16\.247\. !fake_duck
Require env fake_duck denied

# seznam - 77.75.72-79
BrowserMatch SeznamBot fake_seznam
SetEnvIf Remote_Addr ^77\.75\.7[29]\. !fake_seznam
Require env fake_seznam denied

# applebot 17.58.97.28-17.58.97.31
BrowserMatch Applebot fake_apple
SetEnvIf Remote_Addr ^17\.58\.97\. !fake_apple
Require env fake_apple denied

# yandex...
BrowserMatch YandexBot fake_yandex
SetEnvIf Remote_Addr ^(5\.45\.(19[2-9]|2\d\d)|5\.255\.(19[2-9]|2\d\d)|87\.250\.2(2[4-9]|[3-5]\d)|95\.108\.(12[89]|1[3-9]\d|2\d\d)|178\.154\.(12[89]|1[3-9]\d|2\d\d)|141\.8\.1(2[89]|[3-8]\d|9[01])|213\.180\.(19[2-9]|2([01]\d|2[0-3])))\. !fake_yandex
Require env fake_yandex denied

#====== bad bot UAs
BrowserMatchNoCase (\(\)|\{|\}|__|test|base64_decode|bash|disconnectHandlers|echo) bad_bot
BrowserMatchNoCase (chr\(|eval\() bad_bot
BrowserMatchNoCase (EXEC\(\@S\)|JDatabaseDriverMysqli|JSimplepieFactory|\$_REQUEST|JFactory|getConfig) bad_bot

BrowserMatch ^$ bad_bot
BrowserMatch ^[\"\'$%&*()-=+_@~#{}[]<>,.?/|\\\!] bad_bot
BrowserMatchNoCase (<|>|'|%0A|%0D|%27|%3C|%3E|%00) bad_bot
BrowserMatchNoCase ^ht[tm][lpr] bad_bot
BrowserMatchNoCase (HTTrack|clshttp|archive|loader|email|nikto|miner|python|zgrab) bad_bot
BrowserMatchNoCase (winhttp|libwww|perl|curl|wget|harvest|scan|grab|extract) bad_bot

BrowserMatchNoCase (agent|analy[sz]|anonymous|bandit|bot|brand|cherrypicker|collector|compatible;[a-z]|craftbot|crawl|deepnet|discover|download|explorer|file|greasemonkey|http-?client|indy\slibrary|java|kube|larbin|le[ae]ch|legs|link|lynx|mail|netcraft|ninja|n[-_\s]?u[-_\s]?t[-_\s]?c[-_\s]?h|open|php|proxy|ripper|script|search|seo|shodan|sitemap|snoop|sph?[iy]der|stripper|sucker|survey|sweep|torrent|webpictures|worm) bad_bot

Require env bad_bot denied

# only allow mozilla 5
BrowserMatchNoCase ^Mozilla/[0-46-9]\.\d bad_moz
Require env bad_moz denied

# not old firefox or windows
BrowserMatchNoCase (Firefox/[0-9]\.|Firefox/[1-5][0-9]\.|Windows\sNT\s[0-5]) bad_ffwin
Require env bad_ffwin denied

</directory>

lucy24

7:50 pm on Apr 13, 2019 (gmt 0)

Although this is tangential to your main issue, I would strongly recommend keeping each mod's directives together. That is, group all the mod_setenvif directives in one place, and then all the mod_auththingy directives in another place, and so on. The server doesn’t care, but in the long run you will find it conducive to your own sanity.

And then you can put all the Require directives inside a further envelope, like

:: pause to look up syntax, with further dirty look aimed hostward ::

Oh, right, a <RequireNone> envelope, and then list all your environmental variables, minus the “denied” element. (Does the env + denied syntax even work? Or will the server think you’re looking for an environmental variable whose name happens to be “denied”?)

:: noting happily that <Require> envelopes can be nested, allowing for much more fine-tuning than the former Allow/Deny which is only a single toggle ::
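
Something of this shape, I mean (untested, variable names taken from your conf):

<RequireAll>
Require all granted
<RequireNone>
Require env bad_bot
Require env bad_moz
Require env bad_ffwin
</RequireNone>
</RequireAll>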

dstiles

6:13 pm on Apr 14, 2019 (gmt 0)

Suggestion noted, Lucy. Thanks. I need to get it working first, though. :)

> looking for an environmental variable whose name happens to be “denied”

I'll go through the file with this in mind. Thanks.

And having looked at some more of the Apache site with that in mind, I get the impression that require ... denied is not valid for env, so I'll try RequireNone - or whatever is needed.

The problems with the Apache site: far too many pages, and the links are in stupid colours that get lost against a white page background. Persevere, Dave!

dstiles

2:32 pm on Apr 18, 2019 (gmt 0)

I think I have it working now. I have again read through dozens of Apache pages and Apache-related forums and almost nothing relevant has been published. Lots of "This does this" but no "This is how to do it" - at least, not for 2.4. The following is what I have, in some cases abbreviated to save over-posting repetitive code. I can emulate (eg) googlebot and be blocked, and I can add my own IP to the IP blocklist and be banned, so I'm fairly well on the way. The only thing I'm currently stuck on is blocking the term "bot" whilst allowing "bingbot", "googlebot" etc. There is probably a processing order that will allow this within the "require" tree but I need either guidance or trial/error to find it. To begin with I've forsaken the fake_bing type of test and replaced it with its_bing and its_ip_bing with an associated <RequireAll> block.

Putting includes into apache2.conf for the setenv etc files did not work. I have worked out that the includes need to go into the VirtualHost section of each site as (eg)...

========
<VirtualHost ddd.dd.ddd.ddd:443>
ServerAdmin alert@example.co.uk
DocumentRoot /srv/site1
ServerName www.example.co.uk
ServerAlias example.co.uk
<Directory "/">
AllowOverride None
Require all denied
</Directory>
<Directory "/srv/site1">
DirectoryIndex index.php
AllowOverride All
Include /etc/apache2/use-setenv.conf
Include /etc/apache2/rewrite.conf
<RequireAll>
Require all granted
Include /etc/apache2/ban-ips.conf
</RequireAll>
</Directory>
CustomLog ${APACHE_LOG_DIR}/site1/access.log combined env=!dontlog
...etc...
========

use-setenv.conf...
========
SetEnvIfNoCase User-Agent urlwatch dontlog

# protocol limits
SetEnvIf Request_Protocol HTTP/0.9 too_low_proto
SetEnvIf Request_Protocol HTTP/1.0 too_low_proto

# real search engines
# unwanted bot UAs
BrowserMatch Googlebot-Image unwanted-goodbot

# good bot UAs
# bing
BrowserMatch bingbot its_bing
SetEnvIf Remote_Addr ^(40\.77\.167|64\.4\.13|64\.4\.50|64\.4\.54|65\.54\.16[45]|65\.54\.247|65\.55\.25|131\.253\.[2-4][468]|157\.55\.2|157\.55\.39|157\.55\.154|199\.30\.[1-3][0-9])\. its_ip_bing

# google, duckduck etc are similar

#====== bad bot UAs
BrowserMatchNoCase (\(\)|\{|\}|__|test|base64_decode|bash|disconnectHandlers|echo) bad_ua
BrowserMatchNoCase (chr\(|eval\() bad_ua
BrowserMatchNoCase (EXEC\(\@S\)|JDatabaseDriverMysqli|JSimplepieFactory|\$_REQUEST|JFactory|getConfig) bad_ua
BrowserMatch ^$ bad_ua
BrowserMatchNoCase (HTTrack|clshttp|archive|loader|email|nikto|miner|python|zgrab) bad_ua
... etc ...
BrowserMatchNoCase bot bad_bot

# only allow mozilla 5
BrowserMatchNoCase ^Mozilla/[0-46-9]\.\d bad_moz
# not old firefox or windows
BrowserMatchNoCase (Firefox/[0-9]\.|Firefox/[1-5][0-9]\.|Windows\sNT\s[0-5]) bad_ffwin
#==========
<RequireAll>
Require method GET POST HEAD
<RequireAny>
<RequireAll>
Require env its_bing
Require env its_ip_bing
</RequireAll>
... etc ...
</RequireAny>
<RequireNone>
Require env too_low_proto
Require env unwanted-goodbot
Require env bad_ua
Require env bad_moz
Require env bad_ffwin
</RequireNone>
</RequireAll>
========

rewrite.conf has some things I haven't worked out how to do with setenv and will eventually, I hope, be dropped.
========

========
ban-ips.conf
========
# /10+
Require not ip 3.0.0.0/8
Require not ip 34.192.0.0/10
Require not ip 54.64.0.0/8
# /14+
Require not ip 13.52.0.0/14
Require not ip 13.56.0.0/14
Require not ip 23.96.0.0/13
... etc ...
========

Individual htaccess files are now defined for site-specifics only...
========
Header set Strict-Transport-Security "max-age=15552001; includeSubDomains; preload"
Header set X-Frame-Options "SAMEORIGIN"
Header set X-Xss-Protection "1; mode=block"
Header set X-Content-Type-Options "nosniff"
Header set X-Permitted-Cross-Domain-Policies "none"
Header set Referrer-Policy "strict-origin-when-cross-origin"
Header set Content-Security-Policy "default-src 'none'; style-src 'self'; child-src 'self'; frame-ancestors 'self'; base-uri 'self'; script-src 'self'; form-action 'self'; img-src 'self'; object-src 'none'; block-all-mixed-content;"
Header set Expect-CT "enforce,max-age=30"

#========
RewriteEngine on
#========
# only allow https
RewriteCond %{HTTPS} off
RewriteRule . - [F]
========

Any comments on the above welcomed, and thanks for all the help in pushing me in the right direction. :)

lucy24

6:23 pm on Apr 18, 2019 (gmt 0)

blocking the term "bot" whilst allowing "bingbot", "googlebot" etc.

Option A, using a fancy Regular Expression (can't remember if mod_setenvif supports lookbehinds)
(?<!bing|Google|Yandex)bot

Option B, which is probably safer and easier:
BrowserMatch [Bb]ot evil_robot
BrowserMatch (bing|Google|Yandex)bot !evil_robot

I tend to avoid case-insensitive matches except as an absolutely last resort. You won't see a lot of BOT, let alone bOt--and I did once meet a malign robot calling itself GoogleBot, cased like that. Only the correct casing should be accepted.

RewriteCond %{HTTPS} off
RewriteRule . - [F]
Wouldn't a redirect be better? Legitimate robots will continue requesting http for years after you've changed. (An interesting exception is Yandex: once it has learned that you're accessible at https, it will make all its requests to https only, even for URLs that were redirected before you made the change and therefore never existed at https.)

ymmv, but I like to make an exception for robots.txt on http, as I found that some legitimate robots seemed to get confused if a robots.txt request is redirected.
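
Combined, a sketch of the redirect with that exception (the standard pattern, not specific to your setup):

RewriteCond %{HTTPS} off
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]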

Continuing ymmv: If you've got a RequireNone envelope, wouldn't it be less confusing (for you, not the server) to take all the IP ranges you currently have as “Require not” and put them in the same envelope instead?

dstiles

4:13 pm on Apr 20, 2019 (gmt 0)

Thanks, Lucy, helpful as ever!

I used Option B but with [Bb]ot, as a couple of the bots I allow have a capital B. I understand what you're saying about the case of UAs. I think I will modify it later to define the Bots and bots separately - I know how to do that now. :)

I reinstated your fake_google type of tests as well, dropping my more intensive coding once I knew what was happening with "evil_robot". All that seems to be currently working.

I have had to remove the ban-ips file for now as for some reason it negates the actions of the setenv file, letting in all the bots. Possibly due to "Require all granted" just above the inclusion, but I couldn't make it work without that. RequireNone insists on having a "granted" included outside the RequireNone group (you can't have only a RequireNone inside RequireAny, which is the default group), and I haven't yet discovered a way around it - possibly I will include something innocuous, as I did in the setenv RequireNone group (Require method...).

> Wouldn't a redirect be better?

Yes, now you mention it. Rewrite is so alluring when most of the forums give it as the solution to everything. I have added a VirtualHost :80 section that includes "Redirect permanent / https://www.example.co.uk/" (the trailing / seems important to redirect all pages). I'll keep in mind your comment re: robots.txt.

I'm attempting to redirect a non-www to www (80 and 443) using the same mechanism but I need more time to find out why it fails - if it does and it's not just browser caching.
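
For reference, the shape I'm aiming at for the non-www half is roughly this (sketch only; the bare domain would have to come out of the main vhost's ServerAlias, and the certificate has to cover it):

<VirtualHost ddd.dd.ddd.ddd:443>
ServerName example.co.uk
Redirect permanent / https://www.example.co.uk/
</VirtualHost>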

Back to blocking IPs - I was working on the method given on a couple of web sites that purported to know what they are doing. As noted above, I have to return to that as it does not currently work anyway.

I've had a brief look at mod_security for blocking IPs but not sure if it will be worth the effort. Also mod_evasive, but for a different reason. I'll pass on them at the moment, I think, and concentrate on another site I want to transfer from IIS. Several, in fact: my company sites are old and non-SSL, for one thing. :(