Logging bot activity

All bots work correctly except one

         

dstiles

10:16 am on Feb 5, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have a common file included in all sites that checks for various bot and other activity using BrowserMatch, <If>, SetEnvIf etc. A PHP script takes the Env results and parses them into different logs dependent on the Env value and the Status (200, 404 etc). Good and bad bot activity is logged into files b-(status)-(date). This has worked fine since I put it together a year ago. Except for the letsencrypt bot. This persistently ends up in the Header 404 file. 404, fine, but it should be in bot 404. Not critical, but it's annoying.

The code for a working bot, typified for mojeek, is:
<if "-R '5.102.173.64/28' ">
SetEnvIfExpr "%{REMOTE_ADDR} =~ /(.+)/" ips=mojeek:$0
BrowserMatch MojeekBot mojeek bot=mojeek
Require env mojeek
</if>

The env "bot" is used in the logging script to direct the result into the appropriate log, in this case b-200-(date) and "Require env mojeek" permits the bot access.

Letsencrypt is slightly different in that it does not adhere to known IPs. It can spin up another one at any time. And although it usually hits the well-known folder I hedge bets on that, though it invariably triggers ONLY on REQUEST_URI, never the others. So my code for letsencrypt is:
SetEnvIf HTTP_REFERER ".well-known/acme-challenge/" letsencrypt bot=letsencryptr
SetEnvIf REQUEST_URI ".well-known/acme-challenge/" letsencrypt bot=letsencryptu
BrowserMatch "letsencrypt.org/" letsencrypt bot=letsencryptb
Require env letsencrypt

where the bot env value has a suffix r, u or b depending on what actually triggered it.

In desperation I have tried:
<if " -R '0.0.0.0/2' ">
SetEnvIf REQUEST_URI ".well-known/acme-challenge/" letsencrypt bot=letsencryptu0
</if>
<if " -R '64.0.0.0/2' ">
SetEnvIf REQUEST_URI ".well-known/acme-challenge/" letsencrypt bot=letsencryptu64
</if>
<if " -R '128.0.0.0/2' ">
SetEnvIf REQUEST_URI ".well-known/acme-challenge/" letsencrypt bot=letsencryptu128
</if>
<if " -R '192.0.0.0/2' ">
SetEnvIf REQUEST_URI ".well-known/acme-challenge/" letsencrypt bot=letsencryptu192
</if>
Require env letsencrypt

(Apache complains if I go below /2 so it has to be split up to cover the complete ipv4 range).
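
The four /2 blocks above do tile the whole IPv4 space between them. A quick check outside Apache, sketched here in Python with the standard ipaddress module; this is only arithmetic on the prefixes, not a test of how Apache's -R operator parses them, and the block_for helper is made up for illustration:

```python
import ipaddress

# The four /2 prefixes used in the <if> blocks above.
blocks = [ipaddress.ip_network(n) for n in
          ("0.0.0.0/2", "64.0.0.0/2", "128.0.0.0/2", "192.0.0.0/2")]

# collapse_addresses() merges adjacent networks into the smallest covering set.
merged = list(ipaddress.collapse_addresses(blocks))

def block_for(ip):
    """Return the /2 block (as a string) that contains the given IPv4 address."""
    addr = ipaddress.ip_address(ip)
    return next(str(b) for b in blocks if addr in b)

print(merged)
print(block_for("5.102.173.64"))
```

Every IPv4 address falls into exactly one of the four blocks, so each request would get exactly one of the u0/u64/u128/u192 suffixes.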

The relevant part of the logging code that determines the logfile prefix is:
if(!empty(apache_getenv('bot'))) { $fn="b"; }
else { $fn="h"; }

followed by code to determine status, values etc.
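
To make the selection rule concrete, here is a rough Python model of it (not the actual PHP script). The function name, the dict standing in for the request environment, and the ISO date format are assumptions for illustration; only the b/h prefix and the (prefix)-(status)-(date) shape come from the description above.

```python
from datetime import date

def log_name(env, status, day):
    """Build a log file name of the form (prefix)-(status)-(date).

    env    -- dict standing in for the Apache environment of one request
    status -- HTTP status code, e.g. 200 or 404
    day    -- datetime.date of the request (ISO format is an assumption)
    """
    # "b" when the bot variable is set and non-empty, "h" otherwise.
    prefix = "b" if env.get("bot") else "h"
    return f"{prefix}-{status}-{day.isoformat()}"

print(log_name({"bot": "letsencryptu"}, 404, date(2021, 2, 5)))
print(log_name({}, 404, date(2021, 2, 5)))
```

In this model a request lands in an h file only when the bot variable is missing or empty by the time the script reads it, whichever test set it.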

Does anyone know what I'm doing wrong here? Or can throw any light on it?

w3dk

1:41 pm on Feb 5, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



SetEnvIf HTTP_REFERER ".well-known/acme-challenge/" letsencrypt bot=letsencryptr


This will never match since it should be Referer as the second argument, not HTTP_REFERER, in order to match against the Referer HTTP request header. HTTP_REFERER is the Apache server variable (as used by other modules), but that cannot be used here (it will simply be seen as an empty string). SetEnvIf uses its own (limited) syntax.

However, I wouldn't necessarily expect the Referer header to be set anyway on such requests, so this probably won't make a difference.

SetEnvIf REQUEST_URI ".well-known/acme-challenge/" letsencrypt bot=letsencryptu


Likewise, strictly speaking, it should be Request_URI, not REQUEST_URI (the Apache server variable) - although the argument is case-insensitive, so it doesn't actually matter.

Not that it will really make a difference here, but the 3rd argument is a regex and ".well-known" is always requested from the root, so this should strictly be "^/\.well-known/acme-challenge/" (which is also more efficient). Backslash-escape the dots; otherwise you are potentially matching too much.
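
A quick way to see the difference is to try both patterns on a few paths. The sketch below uses Python's re module as a stand-in for Apache's regex engine; for patterns this simple the two should agree, but that is an assumption, and the sample paths are made up.

```python
import re

loose = re.compile(r".well-known/acme-challenge/")
strict = re.compile(r"^/\.well-known/acme-challenge/")

paths = [
    "/.well-known/acme-challenge/token123",      # genuine challenge path
    "/xwell-known/acme-challenge/token123",      # "." matched an arbitrary character
    "/foo/.well-known/acme-challenge/token123",  # not requested from the root
]

# search() looks for a match anywhere in the string, like an unanchored regex.
results = {p: (bool(loose.search(p)), bool(strict.search(p))) for p in paths}

for path, (loose_hit, strict_hit) in results.items():
    print(f"{path}: loose={loose_hit} strict={strict_hit}")
```

The loose pattern accepts all three paths; the anchored, escaped one accepts only the first.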


BrowserMatch "letsencrypt.org/" letsencrypt bot=letsencryptb


I don't think the LetsEncrypt user-agent string contains a slash after the hostname, so this probably won't match either. Again, this is a regex, so it should be "letsencrypt\.org".

However, I would expect the bot to get caught by the "SetEnvIf Request_URI" check, so the problem looks like it's "somewhere else". Have you confirmed the URL-path that is logged? Is it as expected?

Is it possible that something else can overwrite the "bot" env var?

If none of those directives were successful then you'd presumably get a 403 (by "Require env letsencrypt")? OR, the directives aren't being processed at all?!

Since you are setting an env var yourself with SetEnvIf, you shouldn't have to resort to calling the apache_getenv() function (which isn't portable - if that is a concern). You could simply check the value in the $_SERVER['bot'] superglobal.

Have you tried logging the value of the "bot" env var itself?

Aside:

SetEnvIfExpr "%{REMOTE_ADDR} =~ /(.+)/" ips=mojeek:$0


You don't need to resort to using an Apache Expression here, you could use the simpler SetEnvIf. For example:


SetEnvIf Remote_Addr "(.*)" ips=mojeek:$1


(If you are explicitly capturing a backreference then it would be clearer to use $1 instead.)

w3dk

1:57 pm on Feb 5, 2021 (gmt 0)





SetEnvIf HTTP_REFERER ".well-known/acme-challenge/" letsencrypt bot=letsencryptr
SetEnvIf REQUEST_URI ".well-known/acme-challenge/" letsencrypt bot=letsencryptu
BrowserMatch "letsencrypt.org/" letsencrypt bot=letsencryptb
Require env letsencrypt


Just taking a step back... what determines whether this code should be executed in the first place?

dstiles

4:23 pm on Feb 5, 2021 (gmt 0)




w3dk: thanks for the response!

Good call on the referer syntax! :) I have, in fact, seen referers of that value but never caught one in the log - surprise!

> ".well-known" regex

Good point, although I've changed this so many times I'm sure I must have tried the escapes.

> slash after the hostname

Right again! How do I miss these? :(

The path is correct and the bot value is logged correctly, just not in the correct log file. The bot IS being logged to a 403 log: the actual path does not exist during this bot's visit. I'm happy with that. The Require env? I suppose that's a possibility, but why? And remember the "bot" value is being logged correctly apart from the actual file. And as I said, the bot-selector for the logfile name works for everything else, wanted bots such as mojeek and 403'd bots such as semrush. And the bot env IS logged (eg letsencryptu).

> use the simpler SetEnvIf

Reasoning there is to log the actual value of (eg) the IP. I never considered Remote_Addr as the code originated with reporting trapped words in other places, not IPs.

> what determines whether this code should be executed

It's always executed. It's that way because I couldn't devise a simple test such as the IP tests used for all the other bots.

lucy24

5:32 pm on Feb 5, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<if " -R '0.0.0.0/2' ">
...
(Apache complains if I go below /2 so it has to be split up to cover the complete ipv4 range).

If this directive is meant to cover the whole IPv4 spectrum--and presumably also IPv6, unless you’ve got a particular reason for treating it differently--why do you need the <If> envelopes at all? Do you actually use the various letsencryptblahblah values later on?

dstiles

10:30 pm on Feb 5, 2021 (gmt 0)




Lucy, I tried that in desperation, simply to prove to myself that a precondition which emulated the other bots would make no difference. It was short lived, just long enough to prove the point.

dstiles

10:20 am on Feb 9, 2021 (gmt 0)




Sorry for the delay. Letsencrypt only triggers once a day so it's taken a while to assess changes.

In the order as given below the log records (in the correct b-404-(date) log) the fact that the referer triggered the hit. Moving BrowserMatch to the second line shows that the browser now triggers. Adding the anchor as suggested above has prevented the URI value from triggering. Removing the anchor and with the URI test in the position shown, the URI triggers. So, I now have all of them working as required. Many thanks w3dk and Lucy.
BrowserMatch "letsencrypt.org" letsencrypt bot=letsencryptb
SetEnvIf Referer ".well-known/acme-challenge/" letsencrypt bot=letsencryptr
SetEnvIf REQUEST_URI "^\.well-known/acme-challenge/" letsencrypt bot=letsencryptu
Require env letsencrypt

w3dk

12:35 pm on Feb 9, 2021 (gmt 0)




> The bot IS being logged to a 403 log: the actual path does not exist during this bot's visit.


403 or 404 as stated in the OP?

> what determines whether this code should be executed

> It's always executed.


Hhhmm, but if it's "always executed" then "Require env letsencrypt" will block the request when it's not the LetsEncrypt bot? ie. Every other request is blocked - which I assume is not the case - so you must have some conditional or "something"? (Your "mojeek" section is inside an "<If>" expression.)

However, isn't all your logging in PHP? If you block the request at all in Apache then the request won't be logged by PHP - or is that the intention?


> Moving BrowserMatch to the second line shows that the browser now triggers. Adding the anchor as suggested above has prevented the URI value from triggering.
>
> SetEnvIf REQUEST_URI "^\.well-known/acme-challenge/" letsencrypt bot=letsencryptu



Since you overwrite the "bot" env var then any later match will overwrite an earlier match. (Have you considered concatenating the "bot" var in order to preserve all reasons for the match? Although that would require some changes to your existing code.)

You are missing the (forward) slash prefix on the regex that matches against the URL-path, so the above will never match. (Which is why removing the anchor "^" allowed it to match.) It should read:

SetEnvIf Request_URI "^/\.well-known/acme-challenge/" letsencrypt bot=letsencryptu


(The backslash is escaping the literal dot in the regex.)
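
The effect of the missing slash can be reproduced outside Apache. A minimal sketch in Python, assuming its re module behaves like Apache's regex engine for these patterns; the token in the sample path is a placeholder. For an ordinary request the URL-path begins with a slash, so that is what an anchored regex has to match first:

```python
import re

uri = "/.well-known/acme-challenge/example-token"

patterns = {
    "anchored, no slash": r"^\.well-known/acme-challenge/",
    "unanchored":         r"\.well-known/acme-challenge/",
    "anchored, slash":    r"^/\.well-known/acme-challenge/",
}

matches = {label: re.search(rx, uri) is not None for label, rx in patterns.items()}

for label, hit in matches.items():
    print(f"{label}: {hit}")
```

Only the first pattern fails, which fits the observation that removing the "^" made the URI test trigger again.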

lucy24

5:34 pm on Feb 9, 2021 (gmt 0)




Incidentally ... you don't need to put SetEnvIf patterns in quotation marks. Does no harm, but isn't needed. The only exception is when the RegEx contains spaces: unlike some environments (such as RewriteCond), you can't escape the space, so then you do have to put the whole pattern into quotation marks.

dstiles

11:21 am on Feb 10, 2021 (gmt 0)




> 403 or 404 as stated in the OP?
Sorry, typo. 404, of course.

> always executed
Should have been "always evaluated", of course. Everything works as expected except (previously) that one "clause".

> any later match will overwrite an earlier match
Yes, I'm aware of that. I was making the point that all "bot=" tests triggered if in the correct place (ie last) depending on the UA/URI/etc. In practice it doesn't matter what order they trigger: it's just a test of any of the cases and makes no difference to the outcome, that one or another is logged correctly. I would not normally bother with all of them but I have seen unique cases of one without the others for this bot.
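
That ordering behaviour can be modelled in a few lines. The Python sketch below is only a toy model of sequential rule evaluation (it is not how mod_setenvif is implemented); the rule list, the sample request values and the apply_rules name are made up for illustration. Each matching rule assigns bot in turn, so the last match is the one that gets logged, and the second return value shows the concatenating alternative w3dk mentioned.

```python
import re

# (request attribute, regex, value assigned to "bot"), in config order.
RULES = [
    ("user_agent",  r"letsencrypt\.org",               "letsencryptb"),
    ("referer",     r"\.well-known/acme-challenge/",   "letsencryptr"),
    ("request_uri", r"^/\.well-known/acme-challenge/", "letsencryptu"),
]

def apply_rules(request, rules=RULES):
    """Return (last matching value, all matching values joined with commas)."""
    bot = None
    reasons = []
    for attribute, pattern, value in rules:
        if re.search(pattern, request.get(attribute, "")):
            bot = value            # later matches overwrite earlier ones
            reasons.append(value)  # concatenation keeps every reason
    return bot, ",".join(reasons)

# Sample request: user-agent and path match, no Referer header sent.
request = {
    "user_agent": "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)",
    "request_uri": "/.well-known/acme-challenge/example-token",
}

print(apply_rules(request))
print(apply_rules(request, list(reversed(RULES))))
```

With the rules in the order shown the logged value is letsencryptu; reversing the order logs letsencryptb for the same request, while the concatenated form records both reasons either way.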

> missing the (forward) slash prefix
Ah! I wish there was some consistency in Apache. Some take a preceding / and others don't. :(

Lucy - yes. A remnant of trying anything to get it working. :)

lucy24

5:02 pm on Feb 10, 2021 (gmt 0)




<aside>
Yesterday, in the course of looking up something unrelated, I discovered another benefit to quotation marks in a SetEnvIf pattern: if you say something like
BrowserMatch teststring bad_agent=$0
the value will be set to the literal string $0, instead of the desired “teststring” ... unless “teststring” contains literal quotation marks. (Or certain other characters, which may have changed from one 2.4 version* to the next, because I had different results the last time I looked into this. Anchors and grouping brackets still work; periods--whether escaped or not--no longer do.)

* This, in turn, suggests that some aspect of my server was changed in August 2020, as that’s when the $0 starts showing up in logged headers again. And THIS, in turn, suggests that it's never safe to assume such-and-such has now been fixed and will continue to work in perpetuity.
</aside>

dstiles

9:40 am on Feb 11, 2021 (gmt 0)




I think it may be that the quotes are interpreted as a regex rather than plain text? $0 only returns the hit part of a regex but generally nothing (or the literal) for a plain text.

lucy24

4:47 pm on Feb 11, 2021 (gmt 0)




> $0 only returns the hit part of a regex but generally nothing (or the literal) for a plain text.
Yes, that's what changed (on my specific server) in August. Before then, any . (period) was also interpreted as a potential RegEx, so $0 returned the matched string.

The previous discussion was a bit longer ago than I thought, January 2020 [webmasterworld.com]. I'd actually forgotten what a long thread we had about logging headers. Turns out, handling of quotation marks has also changed. (Before: pattern containing . will be treated as RegEx, so $0 returns matched string, while "quotation marks" doesn't. Now: other way around. Thanks a bunch, Apache.)