SetEnv not behaving as expected

Some HTTP headers work, others do not

         

dstiles

2:38 pm on Jan 5, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Either I have misunderstood the method somehow or I'm doing something wrong.

For some time I have been gradually implementing SetEnv as a means of blocking the more obvious nasties to a handful of minor, low-traffic sites I have moved to linux apache from IIS. These are sites I can afford to make mistakes on. I have discussed my approach to this in the past and have received encouraging responses, many of which I've implemented or adapted.

The blocking file contains a number of SetEnv(If/etc) directives and some Require statements. I also use a few <if> statements. What I have at the moment seems to have worked well for several months, with a few updates and additions. I added a logging mechanism in php for 200-response files and have now adapted the server to use a php ErrorDocument which can then log the non-200 responses.

The logging is a multi-line record per hit. The format of the records is:

IP Date ResponseCode
Host Page Pre-redirectedPage(if relevant)
EnvVars (Name: Value:regex) (one per line)
HTTP headers (one per line)

A typical EnvVars section, obtained using PHP's apache_getenv(), is:

accept: none
accept_lang: none
badua: old_browser:Chrome/66.
BlockCountry: 1
browser: chrome:Chrome/66
ips: amazon:18.237.
proto: too_low:HTTP/1.0
useragent: scrape:iodc

All of the above are trapped by a variety of SetEnvIf or BrowserMatch and are reported correctly, as are a few more such as bot. Ones that are not reported are things like:

SetEnvIf Remote_Host 10.0.1.21 host=us:$0 (our IP, obfuscated)

<if " %{Remote_Host} =~ m#\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}# ">
SetEnv host=numbers:$0
</if>
SetEnvIf Request_URI "\/\.?wp-" badrequri=wp:$0
...or its alternative...
<if " %{Request_URI} =~ m#wp-#i ">
SetEnv badrequri=wp:$0
</if>

In fact, none of the REQUEST_URI, REMOTE_HOST, QUERY_STRING, HTTP_REFERER, HTTP_COOKIE are reported, although they do work in trapping bad requests (as logged in the apache error log and site logs). As far as I can tell, my methods are as documented by apache.

So, what am I doing wrong or misunderstanding, please?

lucy24

6:58 pm on Jan 5, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: detour to Apache docs ::

Oh, cool, SetEnv without the If. New one on me. In fact I think it's a whole new module: mod_env alongside mod_setenvif.

We may need to do some more fine-tuning.

First: Have you ever, in any circumstances, done something involving an <If> envelope that worked as intended?

If yes, we proceed to more details.

Are you really using extra spaces inside the quotation marks in those If expressions, or is that an artifact of posting? Are all these variables case-insensitive (for example Remote_Host vs. REMOTE_HOST)? Not the operators, the variables themselves. I’d play it safe and use conventional casing just to eliminate all possible variations.

EnvVars (Name: Value:regex) (one per line)
Oh, that's a good idea. I should add that to my own header logging so I can see exactly why a given request got blocked.

:: further poring over docs ::

If the environment variable you're setting is meant as input into this early phase of processing such as the RewriteRule directive, you should instead set the environment variable with SetEnvIf.
This may or may not matter, depending on site. But using the ordinary reverse-alphabetical-order rule, mod_env unlike mod_setenvif would execute after mod_rewrite (though still before mod_authzwhatever).

dstiles

4:43 pm on Jan 6, 2020 (gmt 0)




Thanks for the reply, Lucy.

First: Have you ever, in any circumstances, done something involving an <If> envelope that worked as intended?

Yes. For example, I have converted bot detection from previously discussed browsermatch to:

<if "-R '13.64.0.0/11' || -R '13.96.0.0/13' || -R '13.104.0.0/14' || -R '40.77.167.0/24' || -R '52.145.0.0/16' || -R '64.4.0.0/18' || -R '65.52.0.0/16' || -R '65.54.0.0/15' || -R '131.253.24.0/21' || -R '131.253.32.0/20' || -R '157.55.0.0/15' || -R '191.232.0.0/16' || -R '199.30.16.0/20' || -R '207.46.0.0/16' ">
BrowserMatch bingbot bing bot=bing
Require env bing
</if>

and

<if " ! %{HTTP_USER_AGENT} =~ m#((Apple|bing|Clara|Cliqz|Exa|Google|istella)bot|(Mojeek|Seznam|Yandex)Bot|BingPreview|DuckDuck|facebook|Qwantify|Vagabondo|Yeti)# && ! %{REQUEST_URI} =~ m#/robots\.txt#">
BrowserMatch [Bb]ot|crawler|spider bot_is=evil_robot:$0
</if>


Are you really using extra spaces inside the quotation marks in those If expressions

Yes, but I've tried with/without and see my bing example above. It helps me clarify the syntax.

Are all these variables case-insensitive

As far as I know, yes; but in any case for most instances that fail I'm using the casing that apache shows in the docs - eg Request_URI (which "fails" with SetEnvIf) - although from the above example, %{REQUEST_URI} "works" in an IF statement. (NOTE: Quotes because although they work in apache terms they do not transfer the env var to PHP with apache_getenv().) I have altered them to caps to see if that makes a difference but I can't see it altering the availability of env vars in PHP - either it works (which it does in blocking terms) or it doesn't.

that's a good idea.

:) I sometimes get them.

Your comment re: early/late operation was one I'd considered but does not really make sense. The failing vars are set in several places by SetEnvIf and IF - I've tried both at different times - and the relevant Require statements are mostly in one place, at the end of the script (I know that does not affect early/late).

Most notable failure is REQUEST_URI, which is set to trap (among other things) wp- (of which there are usually many) and to set an env var badrequri. New logs today show only a single instance of badrequri in the apache error log but none in the header log.

In fact, this may be an indicator: I have just checked and there are several instances of wp- in the header log, so badrequri is not being set properly anyway. Likewise there is a trap for the referer version of wp- and that isn't being seen anywhere.

I'll look further into this but it looks as if certain header variables are not being acted upon.

w3dk

1:37 am on Jan 7, 2020 (gmt 0)

10+ Year Member Top Contributors Of The Month



... to use a php ErrorDocument which can then log the non-200 responses.


The ErrorDocument is triggered via an internal redirect, so any environment variables you've set on the initial request are renamed with a "REDIRECT_" prefix by the time the error document is called (and new vars without the "REDIRECT_" prefix might be set, depending on your criteria). E.g. you'll need to check "REDIRECT_badrequri", not "badrequri", in your 404 error document.

In your PHP ErrorDocument dump the contents of the $_SERVER superglobal to see what you have available.
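
As a rough illustration of the renaming, here is a minimal sketch assuming a Request_URI rule like the badrequri one from the opening post and the /errdoc.php handler mentioned later in the thread. The pattern and the access block are illustrative assumptions, not the original poster's actual configuration, and the fragment is meant to sit inside a <Directory> block.

```apache
# Set badrequri on the original request when the path contains "wp-".
# The parentheses make the capture explicit.
SetEnvIf Request_URI "(wp-[a-z]*)" badrequri=wp:$1

<RequireAll>
    Require all granted
    # Reject any request on which badrequri was set.
    Require not env badrequri
</RequireAll>

# The 403 response is produced by an internal redirect to this script.
ErrorDocument 403 /errdoc.php

# In /errdoc.php the value from the original request is expected under
# REDIRECT_badrequri. A plain badrequri exists there only if a directive
# matched again while the redirected request was being processed.
```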


SetEnv host=numbers:$0


I don't think you can use backreferences with the SetEnv directive (only with SetEnvIf). But I'm not sure that mod_setenvif will use backreferences from an enclosed Expression (or maybe it does in later versions of Apache than what I have here)?

I'm using the casing that apache shows in the docs - eg Request_URI (which "fails" with SetEnvIf) - although from the above example, %{REQUEST_URI} "works" in an IF statement


It's not case-sensitive. However, mod_setenvif uses its own set of variables/arguments, which use camel case by convention. The variables used with Apache Expressions (and mod_rewrite etc) are server-variables (not Apache "environment" variables) and are uppercase by convention. There is some overlap between the two, but there are differences. E.g. SetEnvIf uses Request_Protocol, but Apache Expressions use SERVER_PROTOCOL. SetEnvIf allows you to reference HTTP request headers directly, e.g. "Referer". Otherwise you need to use the corresponding server variable (e.g. HTTP_REFERER).
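
To make the two naming schemes concrete, a small side-by-side sketch follows. The variable names referer and wp_seen and the "wp-" patterns are illustrative assumptions; the point is only which names each syntax accepts.

```apache
# mod_setenvif: the first argument is a request attribute (Request_URI,
# Remote_Addr, Request_Protocol, ...) or, failing that, a header name.
SetEnvIf Request_URI "(wp-[a-z]*)" badrequri=wp:$1
SetEnvIf Referer "(wp-[a-z]*)" referer=wp:$1

# Expression syntax (<If>, Require expr): upper-case server variables,
# with headers exposed as HTTP_REFERER and similar.
<If "%{REQUEST_URI} =~ m#wp-# || %{HTTP_REFERER} =~ m#wp-#">
    # SetEnv takes a name and a literal value, separated by a space.
    # It has no regex of its own, so no $0 is available here.
    SetEnv wp_seen yes
</If>
```

Note that SetEnv (mod_env) is applied late in request processing, so a variable set this way is more likely to be useful for logging than for "Require env".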

dstiles

2:05 pm on Jan 7, 2020 (gmt 0)




w3dk:

request are renamed with a "REDIRECT_" prefix

Thanks, but I think that only applies to header variables, not env vars, many of which are working anyway. I test for REDIRECT_ and REDIRECT_REDIRECT_ of Status Code in the ErrorDocument but have no need for others at present. I have used getenv() (no arg gets all) and no sign of REDIRECT_.

I don't think you can use backreferences with the SetEnv directive (only with SetEnvIf).

A good point. I only have a few SetEnv - most are SetEnvIf - but I'm not seeing ANY reporting for the SetEnv ones, possibly because they are rare. I would not have expected the reporting to die completely just for a $0 but I'll bear it in mind when reappraising it.

But I'm not sure that mod_setenvif will use backreferences from an enclosed Expression (or maybe it does in later versions of Apache than what I have here)?

Do you mean "enclosed in quotes"? I have both enclosed and not enclosed and for badrequri they ALL fail. This is Apache 2.4.18 (on Mint/Ubuntu) and PHP 7.0.33.

Thanks for the clarification concerning case. And for the request header Referer - is that ONLY a shorthand or is HTTP_REFERER a complete no-no? But in any case, I'm using <if> on the Referer header tests.

lucy24

6:45 pm on Jan 7, 2020 (gmt 0)




And for the request header Referer - is that ONLY a shorthand or is HTTP_REFERER a complete no-no?
Depends on the module.

SetEnvIf accepts any and all header names, including ones in X- or the misspelled header names so popular with inept robots (which in my case sometimes leads to setting a variable called botheader).

Rewrite has predefined terms for a handful of header fields such as HTTP_REFERER and HTTP_ACCEPT. Otherwise you have to shift to the form HTTP:header-name replacing _ lowline with : colon.

Yes, that means some common header fields can be expressed in different ways. HTTP_REFERER means the same as HTTP:Referer; “SetEnvIf User-Agent” means the same thing as “BrowserMatch”.

Edit: Oops, sorry, did you mean specifically in an <If> expression? The docs [httpd.apache.org] give a list of what you can use. It seems to be basically the same list as mod_rewrite uses; scroll way down to the Example Expressions and it appears that you can again say HTTP:header-name for the others though they don’t mention it in the body of the document.

dstiles

3:46 pm on Jan 10, 2020 (gmt 0)




Part of the solution does indeed lie in the SetEnv. I have converted a couple using a variation on SetEnvIf instead - eg:

SetEnvIf Remote_Addr .* varname=value:$0

Anyone know how to use something else as an empty SetEnvIf header/value pair? Or have an alternative to SetEnv that does the same job?

I think I've solved the failure of badrequri (env var set from Request_URI). It does indeed work as a blocker, but when reporting it via getenv() it reports an incorrect value (if anything); it is necessary to use:

$value=getenv($name,true);

The "true" refers to the optional parameter "walk_to_top", which reports the env var for the top-most setting. Without that, for Request_URI, $value is the error document URL. Eg: with no walk_to_top the returned request is always /errdoc.php; with it the result is (eg) /wp-login/.

I'm hoping this will apply to the other header reports as well. I'm making gradual changes and reviewing the results. So far I have changed <if QUERYSTRING...>...SetEnv name and HTTP_REFERER to the appropriate SetEnvIf statement but nothing has triggered them yet so I don't know if they will work.

It would be useful to allow several env var values to be set for a single name but this seems to be something apache does not cater for. A trivial example might be:

BrowserMatch curl|libwww|perl useragent=scrape:$0
BrowserMatch bitcoin|miner useragent=miner:$0

...resulting in

useragent=scrape:libwww
useragent=miner:bitcoin

I do not want to use different variable names, just values.

For further information on the general trapping process on this server, if the trapping script lets through a "valid" hit almost the first thing the php script does, before any page code is generated, is to look at a MySQL database table to see if the IP has been previously registered as a server farm or other "bad neighbourhood". This rarely happens on this server as most of the nasties are trapped by the SetEnv script, but when it does happen the php code issues a 403 and terminates. Exceptions are Amazon IPs, many of which are detected in the SetEnv script in order to allow amazon-based bots such as duckduckgo and cliqz; if it's not a proper bot then the amazon IP is blocked (amazon is responsible for a large number of blocked hits, obviously a major source of nastiness - yes, I know about the "good" amazon hits but seldom see them).

I have some traps in my published pages that I use to detect certain kinds of access and those I intend to add to the database when I have a moment to implement the code. This mechanism (and db) overall is based on my long-standing IIS system, though that does not have a SetEnv type script for the initial blocking.

lucy24

7:54 pm on Jan 10, 2020 (gmt 0)




Seems like what you'd want instead is, say,

BrowserMatch (curl|libwww|perl) useragent=$1
or
...useragent=myname:$1

There may not technically be any difference between $0 without parentheses and $1 with parentheses, but using the parentheses makes it unambiguous that you want to capture this specific bit. You'd need to experiment to see which one comes up if the UA string contains more than one of the specified expressions: does it pick the first one in your pipe-separated list, or the first one that happens to come up? It probably makes no difference for our purposes, but it's useful to know.

I'm going to experiment with this on my test site and see what I learn.

lucy24

9:54 pm on Jan 10, 2020 (gmt 0)




Update: I got apache_getenv to work locally (Apache 2.2, php 5.something), but it broke on the live site (2.4, 7.something). Instead I tried plain getenv, which worked as intended.

Is there a way to constrain it to environmental variables you've set yourself, or do you have to read specific names out of an array (my stopgap solution to avoid a major information dump)?

Now that I've got it to work I will continue experimenting--but not right away, because as we all know, sitting down to the computer for a quick five minutes of programming means that you will shortly look up and see the sun rising ... and I have to leave the house in half an hour.

dstiles

11:57 am on Jan 11, 2020 (gmt 0)




Thanks for looking into this, Lucy.

I'm not worried about more than one value from a single SetEnvIf, rather from several SetEnvIf with the same name but different value (see example above). The first found value in a regex is reported per SetEnvIf but is over-ridden by a later successful match using the same name in a different SetEnvIf, resulting in only the last matched SetEnvIf per name.

I tried $1 (can't recall the context) but it returned empty. Probably no parentheses or something.

My reporting function holds an array of reportable env var names $envname which is used as foreach ($envname as $name) {... }. I suppose it's possible to read the $_ENV array but I only tried it briefly and gave up.

I'm surprised 2.4 failed. I'm using 2.4.18 with php 7.0.33 and that works fine. But looking back at my previous postings, I must apologise for an error which may be responsible. I should have written
$value=apache_getenv($name,true);

not
$value=getenv($name,true);


The getenv() function has to be supplied with a valid env var name - you can't use a wild card. Unless you are using php 7.1.0 or later, when getenv() (NOT apache_getenv()) can be used with no name at all to return an array of all names. It's too much hassle to upgrade php just for that - my apache server shares a vps with a mail server and it's messy to update.

I use apache_getenv() instead of getenv() because, when I started, the latter seemed to give incorrect/no results at an early testing stage so I left it at apache_getenv(). The walk_to_top argument is only available in apache_getenv() so that function now seems necessary.

lucy24

6:45 pm on Jan 11, 2020 (gmt 0)




Funny, I had the opposite experience: apache_getenv didn't work; getenv did--with otherwise identical code.

Yup, you need parentheses if you're specifying $1 $2 and so on.

The one thing I did learn was that if you say, for example,
blahblah (Firefox|Mozilla)
w/r/t the UA, the value will come through as “Mozilla”, i.e. the first term the function finds, not the first term in a pipe-separated list.

By and by I'll change most of my
SetEnvIf something-here envname
to
SetEnvIf something-here envname=$0
--extra information that's only useful when I'm actually looking at the environmental variables in logs.

The walk_to_top argument is only available in apache_getenv()
Oh, ###, I forgot to change that. getenv uses $local_only instead. I'm surprised it didn't break.

The getenv() function has to be supplied with a valid env var name - you can't use a wild card.
In my case, I supplied names anyway, because otherwise--this was in early testing with apache_getenv--it dumped a bunch of stuff that I guess the server counts as environmental variables but which are of no use to me, often because I'm already getting the same information in more concise form. And then I added a further
if ($value)
condition so it only logs environmental variables that have actually been defined, whether to default 1 or some specific value.

Next step: See if I can relocate the logheaders function to the shared userspace (where my generic robots list already lives) so I don't have to keep uploading a separate file for every site whenever I change something.

But we digress :)

w3dk

6:51 pm on Jan 11, 2020 (gmt 0)




request are renamed with a "REDIRECT_" prefix

Thanks, but I think that only applies to header variables, not env vars


Not sure what you mean specifically by "header variables", but it certainly does apply to "env vars" (more so than anything else). (?)


But I'm not sure that mod_setenvif will use backreferences from an enclosed Expression (or maybe it does in later versions of Apache than what I have here)?


Do you mean "enclosed in quotes"? I have both enclosed and not enclosed and for badrequri they ALL fail.


In your earlier rule blocks, the $0 backreference was seemingly referring back to the preceding IF expression (the enclosed Expression). The $0 backreference only refers back to the regex within the SetEnvIf directive itself AFAIK.

Anyone know how to use something else as an empty SetEnvIf header/value pair?


Not quite sure what you mean by "empty" but you could use something minimal like:


SetEnvIf ^ ^ MY_VAR=value


The "^ ^" is basically "any header" and "any value" so is always successful. Although if you are only using the environment variable later in your PHP script then you should be able to use SetEnv as before. You need to use SetEnvIf if you want to capture the matched value with a backreference, or use mod_rewrite:


RewriteRule ^ - [E=MY_VAR:%{HTTP:Some-Header}]


Although most values are accessible directly from within PHP - I don't think you necessarily need to set another environment variable?


$value=apache_getenv($name,true);   [EDITED]


The "true" refers to the optional parameter "walk_to_top", which reports the env var for the top-most setting. Without that, for Request_URI, $value is the error document URL. Eg: with no walk_to_top the returned request is always /errdoc.php; with it the result is (eg) /wp-login/.


Passing "true" as the 2nd argument is the same as calling [apache_]getenv('REDIRECT_'.$name); - or however many levels you need to go up the tree. Specifying "true" just does it for you.

If you are in a PHP ErrorDocument then you can use the "REDIRECT_URL", "REDIRECT_QUERY_STRING", etc. elements of the $_SERVER superglobal array. Or the $_SERVER element "REQUEST_URI" (which is different from the Apache server variable of the same name - a little confusing!)


It would be useful to allow several env var values to be set for a single name but this seems to be something apache does not cater for. A trivial example might be:

BrowserMatch curl|libwww|perl useragent=scrape:$0
BrowserMatch bitcoin|miner useragent=miner:$0

...resulting in

useragent=scrape:libwww
useragent=miner:bitcoin


How would you envisage reading the values back from the receiving script? Like an "array"? This is a limitation of env vars in general, not just Apache. The latter will simply overwrite the former.
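
One possible way to keep both pieces of information without an array is sketched below. It assumes a single flag name is enough for blocking and that extra per-category names are acceptable for logging; useragent_scrape and useragent_miner are made-up names, not anything from the original setup.

```apache
# Each directive sets the shared flag plus its own detail variable,
# so a later match no longer overwrites the detail of an earlier one.
BrowserMatch (curl|libwww|perl) useragent useragent_scrape=$1
BrowserMatch (bitcoin|miner) useragent useragent_miner=$1

# Blocking can still key off the single shared name, for example:
#   Require not env useragent
```

With this layout the shared useragent variable simply carries the default value 1, and the logging script would need the two extra names added to its list.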

Update: I got apache_getenv to work locally (Apache 2.2, php 5.something), but it broke on the live site (2.4, 7.something). Instead I tried plain getenv, which worked as intended.


Whether apache_getenv() is available will depend on how PHP is installed on the server... Apache module vs FastCGI, etc.


Is there a way to constrain it to environmental variables you've set yourself, or do you have to read specific names out of an array (my stopgap solution to avoid a major information dump)?


You could prefix all your own env vars with your own unique prefix, use getenv() (or $_ENV or even just $_SERVER) and apply PHP's array_filter() to extract just the elements you want (accounting for any "REDIRECT_" prefix)?

dstiles

7:55 pm on Jan 11, 2020 (gmt 0)




Lucy:

keep uploading a separate file for every site

As noted in a previous posting of mine, I put all my SetEnv(etc) into a single file which is included in the virtual host...

<VirtualHost nnn.nn.nnn.nnn:443>
ServerAdmin alert@mydomain.co.uk
ServerName www.example.com
DocumentRoot /srv/example
Header edit Set-Cookie ^(.*)$ __Host-$1;HttpOnly;Secure;SameSite=Strict
<Directory "/">
AllowOverride None
Require all denied
</Directory>
<Directory "/srv/example">
DirectoryIndex index.php
AllowOverride All
# this one here!
Include /etc/apache2/use-setenv.conf
Include /etc/apache2/rewrite.conf
</Directory>
(etc)
</VirtualHost>

Most of the header security stuff (Strict-Transport-Security etc) could be in the included rewrite.conf file but the cookie one varies slightly on a couple of sites.

w3dk:

$0 backreference only refers back to the regex within the SetEnvIf directive itself

Agreed. I wasn't thinking clearly. :(

SetEnvIf ^ ^ MY_VAR=value

Thanks. That's a good idea. But I'm not sure I need it quite that wild - the env var would be weird?

REDIRECT_ and walk_to_top - yes, that makes sense.

How would you envisage reading the values back from the receiving script? Like an "array"?

I guess it would have to be.

prefix all your own env vars with your own unique prefix

In general I name vars for their source - badrequri for Request_URI, for example. The value refines the actual trapping for reporting. The minimal number of env var names makes it easy to keep under control, to act upon them (Require env badrequri) and to report them through a logging function.

Lucy, if it helps, my logging function is below. Log names are (bot-)?header-date-statuscode.log.
Note: this is intended for development and very low transaction sites. The logs can fill rapidly and take a lot of resources to generate.

function logHeaders() {
$envmde=array("TrapIP","ips","BlockCountry","host","accept","accept_lang","badua","badrequri","bot","bot_is","browser","query","referer","useragent"); # TrapIP set by db Bad IP Found
$ip=$_SERVER['REMOTE_ADDR'];
$hst=$_SERVER['HTTP_HOST'];
$fh; $str; $stat = http_response_code(); # name log for error code - easier to view
$dt = date('Ymd'); $tm = date('G:i:s'); $fn="header";
if (isset($_SERVER['REDIRECT_URL']) ) $url=$_SERVER['REDIRECT_URL']; else $url="";
if ($ip=="nn.nn.nnn.nn") { $fn="our-$fn"; } # my own ip to log away from general hits
else {
$value=apache_getenv('bot'); # log bots into separate files
if(!empty($value)) { $fn="bot-$fn"; }
else {
$value=apache_getenv('bot_is'); # bot_is resolves bad things named bot, crawl etc
if(!empty($value)) { $fn="bot-$fn"; }
}
}
$str="IP: $ip\t$dt $tm\t$stat\n";
$str .= "Host: $hst\tPage: ".$_SERVER["PHP_SELF"]; # page read if 200, else /errdoc.php
if (!empty($url)) { $str .= "\tURL: ".$url; } # empty if 200, else set to original page
$str .= "\n";
# report setenv values
foreach ($envmde as $name) { $value=apache_getenv($name,true); if(!empty($value)) { $str .= "$name: $value\n"; } }
# report headers (except Host, dealt with above)
foreach (getallheaders() as $name => $value) { if ($name!="Host") { $str .= "$name: $value\n"; } }
$str .= "----\n\n"; # end of record separator
# write to log file
$fh = fopen("/srv/0logs/0headers/$fn-$dt-$stat.log","a");
fwrite($fh, $str);
fclose($fh);
}

lucy24

8:25 pm on Jan 11, 2020 (gmt 0)




<tangent>
I took a closer look and realized I'd never be able to put logheaders in my userspace (shared by all sites) because most places that call the logheader function do so via an SSI "include virtual" line, which tops off at the site root. (I can do it with the shared robots file, because that's done with a php include using the physical filepath.)

Darn.

I guess I could do a two-step process where "logheaders.php" in its turn invokes another file--which can live anywhere it wants to, now that it's php--but it seems a bit convoluted. And will DOCUMENT_ROOT even work if you're in a php file that isn't located within the site's individual directory?
</tangent>

I also looked closer and found that the "Require env" directive only looks at whether a variable has been set at all; it can't look at its value. (I suppose an <If> statement could, but it wouldn't be worth the trouble.) Double darn, as I was thinking it could be useful to set selected variables to something like -1 for authorized robots, and then use >0 values for access control.
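
For what it's worth, a value test is possible with "Require expr" rather than a full <If> section. The following is a minimal sketch only: bot_score and its -1/1 values are hypothetical, and the fragment assumes Apache 2.4 with the block placed inside a <Directory> section.

```apache
# Hypothetical scores: negative for an authorised robot, positive for a bad one.
BrowserMatch bingbot bot_score=-1
BrowserMatch (curl|libwww) bot_score=1

<RequireAll>
    Require all granted
    # "Require env" could only test that bot_score exists; an expression
    # can test its value. Deny when it starts with a digit from 1 to 9.
    Require not expr "%{reqenv:bot_score} =~ /^[1-9]/"
</RequireAll>
```

A regex test is used here instead of a numeric comparison so that an unset (empty) variable and a negative value both fall through without being denied.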

dstiles

11:56 am on Jan 12, 2020 (gmt 0)




My logging function is called from /errdoc.php, which resides in every site's root and in a folder common to all sites (necessary under certain redirect circumstances, forget which).

The errdoc.php files contain two lines: a require_once to pull in the error parser (which itself has a require_once to load the logging file) and a call to the logging function (adapted from one commonly found online). The functions reside in a common "library" folder just outside the document roots, accessed with an absolute path to the lib folder.

This folder contains most of the actual code for all the sites, such as page building, db access, form parsing etc. If there is no error (200) then the logging function is called from a file which is always loaded early in the php code, specifically the one which tests for the IP being in the database.

I also found the lack of env var granularity annoying, but I got around it by assigning values to named env vars that would always perform a common function: eg useragent=fetch$0 (curl etc) and useragent=seo$0 (seo scrapers). The inclusion of "Require env useragent" in a <RequireNone> block rejects all instances of useragent and, incidentally, reports the trapping name/value in the log.

My bot trapping is on the lines of:

<if "-R '77.75.72.0/21' ">
BrowserMatch SeznamBot seznam bot=seznam
Require env seznam
</if>

...with a common test for amazon IPs for duck and cliqz (and rejection if it's neither).

lucy24

6:56 pm on Jan 12, 2020 (gmt 0)




After trying a php include on my test site, and finding that it worked perfectly, I had a belated “D’oh!” moment and realized that of course it works ... because the program isn’t actually running from the userspace, it’s running at the site level via the one-line program that includes it.

All is well :)

dstiles

3:31 pm on Jan 13, 2020 (gmt 0)




I often get those moments! :) Glad it worked!

lucy24

7:23 pm on Jan 16, 2020 (gmt 0)




:: bump ::

I just figured out something unexpected, which I'm passing along in case it turns out to apply in your site as well.

BrowserMatch blahblah bad_agent=$0

If blahblah is plain text (with or without quotation marks), like "Firefox" or "Opera 8", the value will be set as the literal string "$0".

Only if blahblah can be construed as a regular expression--say, by containing a . unescaped period--does the value get set as blahblah.

BrowserMatch Firefox/ bad_agent=$0
>> logs record bad_agent=$0

BrowserMatch Firefox. bad_agent=$0
>> logs record bad_agent=Firefox/

Color me puzzled, but this explains a good bit.

dstiles

4:38 pm on Jan 17, 2020 (gmt 0)




Could explain some of the results I'm seeing. In order to reduce the number of env vars, I actually add a value:$0 to mine, as in...
BrowserMatch Firefox/ bad_agent=firefox:$0

An example of my logging an actual $0 is...
SetEnvIfNoCase Accept-Language zh-cn accept_lang=foreign:$0
(logged as accept_lang=foreign:$0)

I've now made it into zh-cn|^en-US$ - adding a common bad-bot accept-lang - to see what happens. I suspect you are correct, though, since $0 is a regex value, which plain text isn't. And which I hadn't considered. :(

I also notice that the following reports correctly (there are many actual -R terms, removed for clarity)...
<if "-R '3.0.0.0/8' || -R '34.192.0.0/10' ">
SetEnvIf Remote_Addr .* amazon ips=amazon:$0
</if>
(which logs ips: amazon:34.197.76.213)

whereas I suspect replacing .* with ^ does not - not tested, though, so I could be wrong.

w3dk

9:03 pm on Jan 17, 2020 (gmt 0)




If blahblah is plain text (with or without quotation marks), like "Firefox" or "Opera 8", the value will be set as the literal string "$0".


Wow, that is interesting! Confirmed over here too. It still matches "like" a regex, in that it is matched anywhere in the target string, just no backreferences are captured (no attempt to capture backreferences). It looks like it might be an internal optimisation... if it doesn't look like a regex then perform a simple/faster substring() instead (no backreferences are captured). However, this does look like a bug.

Going forward then... it would seem wise to be explicit if you want to capture the match and use parentheses (which appears to work OK - as expected). For example:


# Results in bad_agent=Firefox
# Can use $0 or $1 since the entire pattern is a capturing group
BrowserMatch (Firefox) bad_agent=$1
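
Applying the same precaution to two directives quoted earlier in the thread might look like this. It is a sketch only, with the variable names and values taken from those posts:

```apache
# Plain-text patterns left $0 unexpanded in the tests above, so wrap the
# part to be reported in parentheses and refer to it as $1.
BrowserMatch (Firefox/) bad_agent=firefox:$1
SetEnvIfNoCase Accept-Language (zh-cn) accept_lang=foreign:$1
```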


I suspect replacing .* with ^ does not - not tested, though


The ^ (start-of-string-anchor) doesn't actually match anything, so cannot capture anything. So if you used ^ (instead of .*) then any backreferences will be empty.

dstiles

10:54 am on Jan 18, 2020 (gmt 0)




w3dk: The ^ (start-of-string-anchor) doesn't actually match anything, so cannot capture anything

Thanks for the confirmation and the explanation. :)

lucy24

6:27 pm on Jan 18, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The ^ (start-of-string-anchor) doesn't actually match anything, so cannot capture anything. So if you used ^ (instead of .*) then any backreferences will be empty.
Which, in turn, means that the variable's value would be nothing, blank or empty if you used a $0 construction--probably the exact opposite of the intended result if you're using it in access controls involving "Require env". But if you simply said name-of-variable without the =$0 bit, it would be set to the default 1. (An interesting case of Less Is More.)
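
A minimal sketch of the three cases described above. It is untested and based only on the behaviour reported in this thread; the variable names ips_all, ips_empty and ips_flag are made up for illustration.

```apache
# .* matches the whole address, so $0 holds the full IP
SetEnvIf Remote_Addr .* ips_all=$0

# ^ is a zero-width anchor: it matches, but consumes no characters,
# so $0 is expected to be empty and the variable gets an empty value
SetEnvIf Remote_Addr ^ ips_empty=$0

# No =value part: the variable is set to the default value "1"
SetEnvIf Remote_Addr ^ ips_flag
```

If the variable only exists to be tested later (for example with "Require env"), the third form avoids depending on a backreference at all.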

Now, can anyone remember HOW to post a comment in Apache docs? I’ve done it once or twice in the past, but now I’m ### if I can figure out how to start :(

w3dk

8:15 pm on Jan 18, 2020 (gmt 0)

10+ Year Member Top Contributors Of The Month



...can anyone remember HOW to post a comment in Apache docs?


Curious, I don't actually see any comments in the Apache docs - where previously I thought I had seen them - just an empty "Comments" section. Maybe they have all been removed?

Maybe consider posting a bug report instead?
[bz.apache.org...]

dstiles

4:29 pm on Jan 20, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Almost all working now - thanks for all the help! :)

One thing still not working fully in the report:
<if " %{QUERY_STRING} =~ m#[a-z]+#i ">
SetEnvIf QUERY_STRING .* query=any:$0
</if>

This (probably) blocks on any a-z character in the querystring - "probably" because it's almost always accompanied by other types of block, e.g. UA, Accept, etc. In the log it reports:
query: any:

In other words, no actual $0 value. I know there is a value from the site logs. Any ideas, please? There are half a dozen querystring tests, the above is just one, but all give the same result: no $0 value.

lucy24

5:49 pm on Jan 20, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Wouldn't it have to be
SetEnvIfExpr QUERY_STRING=~.*
(instead of SetEnvIf QUERY_STRING .*)
if you're doing an “expression”, not a named header field?

dstiles

12:00 pm on Jan 21, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't think so, though I could be wrong. As I said, the expression triggers sufficiently to report in the log, which it wouldn't if the trap failed; it just does not supply the $0 value. Some similar constructs with other header names work, e.g. (as noted above):
<if "-R '3.0.0.0/8' || -R '34.192.0.0/10' ">
SetEnvIf Remote_Addr .* amazon ips=amazon:$0
</if>
(which logs ips: amazon:34.197.76.213)

but one for HTTP_COOKIE only reports the env var value, not the $0 value.

I did try your suggestion, with variations, but configtest always responds "Could not parse...unexpected $end... expecting '('". In fact, I have never been able to get SetEnvIfExpr to work. I probably have a blind spot in understanding the syntax. Or something.

lucy24

7:23 pm on Jan 21, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So, in your findings, SetEnvIf can use the $0 etc. captures from the <If> envelope? I’ll be darned.

:: detour for much experimentation in test site, which earns its keep by throwing repeated 500 errors without imperiling “real” sites on same userspace ::

Using the query string "question", and omitting the ones that threw a 500 error:

1.
SetEnvIfExpr "%{QUERY_STRING} =~ /.+/" thisworks=$0
result "thisworks: 2"
(Feel free to join me in “wtf?” in two-part harmony.)

2.
SetEnvIfExpr "%{QUERY_STRING} =~ /(.+)/" thisworks=$0
SetEnvIfExpr "%{QUERY_STRING} =~ /(.+)/" thisworks=$1
result for both: "thisworks: question"

3.
SetEnvIfExpr "%{QUERY_STRING} =~ /(.+)/" thisworks=$2
result: variable "thisworks" NOT SET (technically I guess it is set to a value of "", which counts as not set)

Oh, and for completeness:

4.
SetEnvIfExpr "%{QUERY_STRING} == 'question'" thisworks
result: "thisworks: 1" (i.e. variable is set, using default value)

w3dk

9:12 pm on Jan 21, 2020 (gmt 0)

10+ Year Member Top Contributors Of The Month



SetEnvIf QUERY_STRING .* query=any:$0


QUERY_STRING is a server variable (not an environment variable) - you can't reference server variables directly in the SetEnvIf directive. As mentioned in my first post above, SetEnvIf uses its own set of variables (although there is some overlap in the names used). I think you will need to use SetEnvIfExpr as lucy24 suggests above, or use mod_rewrite.

This is possibly why you are also having problems with HTTP_COOKIE (also a server variable). For this you could reference the "Cookie" header directly in the SetEnvIf directive.
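
A hedged sketch of the two alternatives suggested here, untested and assuming Apache 2.4 with mod_setenvif. As far as I know, SetEnvIf treats an unrecognised attribute name such as QUERY_STRING as an environment variable and compares a missing one as an empty string, which would explain why .* matched but $0 came out empty. The variable names query and cookie are illustrative only. The explicit parentheses follow lucy24's test 2 above, where a capturing group was needed to get a usable backreference.

```apache
# Query string: test it through an expression, with an explicit capturing group
SetEnvIfExpr "%{QUERY_STRING} =~ /([a-z]+)/i" query=any:$1

# Cookie: name the request header itself rather than the HTTP_COOKIE server variable
SetEnvIf Cookie "(.+)" cookie=present:$1
```

Note that the whole expression is one quoted argument; leaving the quotes off is a likely source of "Could not parse" errors from configtest.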

lucy24

11:12 pm on Jan 21, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Darn, w3dk, I was hoping you were going to shed some light on the "thisworks: 2" headscratcher :)

w3dk

12:17 am on Jan 22, 2020 (gmt 0)

10+ Year Member Top Contributors Of The Month



1.
SetEnvIfExpr "%{QUERY_STRING} =~ /.+/" thisworks=$0

result "thisworks: 2"
(Feel free to join me in “wtf?” in two-part harmony.)


Ooo, I get a different result....
"thisworks: " ($0 appears to be "empty" - a bit more expected)

2.


Same result.

3.
SetEnvIfExpr "%{QUERY_STRING} =~ /(.+)/" thisworks=$2

result: variable "thisworks" NOT SET (technically I guess it is set to a value of "", which counts as not set)


I appear to get a different result....
"thisworks: " (the variable IS SET, but value is "empty" - again, more expected IMO)

How are you checking the environment var? I'm simply dumping the contents of the $_SERVER superglobal in PHP and there it is.

4.


Same result.

Testing this on Apache 2.4.7