Forum Moderators: phranque


Please Check my Attempt to Block Various User Agents

Groping around in the dark, trying to not offend the grapefruit


Webwork

8:33 pm on Feb 23, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Today, via WHM, I inserted the section entitled #BLOCK BAD USER AGENTS here -> Home »Service Configuration »Apache Configuration »Include Editor >> Pre VirtualHost Include.

# BLOCK BAD USER AGENTS
SetEnvIfNoCase User-Agent (archive.org|ahrefsbot|baiduspider|binlar|casper|checkpriv|choppy|clshttp|cmsworld|cukbot|diavol|domainappender|dotbot|extract|feedfinder|flicky|getintentcrawler|g00g1e|grapeshotcrawler|harvest|heritrix|httrack|kmccrew|loader|maxpoint|maxpointcrawler|miner|mj12bot|naver|netseer|nikto|nutch|paperlibot|planetwork|plukkie|postrank|proximic|purebot|pycurl|python|qwantify|seekerspider|semrushbot|seznambot|siclab|skygrid|sogou|sqlmap|sucker|turnit|vikspider|w3c-checklink|winhttp|wotbox|xxxyy|yandexbot|youda|zmeu|zune) bad_bot
# BAD USER AGENTS

# JAL Sets Files for Mod Deflate January 29 2016
AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css text/javascript application/javascript application/rss+xml application/xml application/json image/x-icon
# JAL Mod Deflate

# JAL Sets Header Caching January 29 2016
<FilesMatch "\.(gif|jpg|jpeg|png|ico|html|css|txt|xml|javascript|js)$">
Header set Cache-Control "max-age=2592000, public"
</FilesMatch>
# JAL Sets Header Caching

# JAL Sets Error Documents January 29 2016
ErrorDocument 500 /errors/errmaintenance.html
ErrorDocument 404 /errors/errnotfound.html
ErrorDocument 403 /errors/errforbidden.html
ErrorDocument 401 /errors/errunauthorized.html
ErrorDocument 400 /errors/errbadrequest.html
# JAL Error Documents

# JAL Sets Expires Defaults January 29 2016
ExpiresDefault A172800
ExpiresByType text/css A31536000
ExpiresByType application/x-javascript A31536000
ExpiresByType text/x-component A31536000
ExpiresByType text/html A31536000
ExpiresByType text/plain A31536000
ExpiresByType text/xml A31536000
ExpiresByType image/bmp A31536000
ExpiresByType image/gif A31536000
ExpiresByType image/x-icon A31536000
ExpiresByType image/jpeg A31536000
ExpiresByType application/pdf A31536000
ExpiresByType image/png A31536000
# JAL Sets Expires Defaults



NEXT I added the following to the .htaccess file for one site - a WordPress site - located in public_html:

# BAD USER AGENTS
<Limit GET POST>
Order Allow,Deny
Allow from All
Deny from env=bad_bot
</Limit>
# BAD USER AGENTS


Nothing has gone "BOOM" but . . I'm uncertain if my regex will actually start sniping the strings/agents/bots that I'm looking to block.

I thought I'd ask y'all to take a look at my code(?) and tell me if I'm near the mark. (Yes, I'm attempting to "speak the language" of regex, Apache, and .htaccess, and I'm not even certain what to call "the language". Argh)

It's at moments like this, when I just can't wait (to see if things are working), that my father would refer to me as an "impatient virgin".

Whatever that means.

tangor

9:56 pm on Feb 23, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Check your site log(s). These should show up as 403 entries.
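One quick way to do that check from a shell (a sketch: the log path and the combined log format are assumptions - adjust both for your server):

```shell
# Count 403 responses per user-agent in a combined-format access log.
# Splitting on '"' puts the status/size field in $3 and the UA in $6.
# /var/log/apache2/access.log is a guess; your path may differ.
awk -F'"' '$3 ~ /^ 403 / {print $6}' /var/log/apache2/access.log \
  | sort | uniq -c | sort -rn | head
```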

lucy24

11:15 pm on Feb 23, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<tangent>
SetEnvIfNoCase User-Agent

mod_setenvif includes the "BrowserMatch" (and BrowserMatchNoCase) shorthand for exactly this situation.

Try not to use NoCase if there's any alternative, since it basically means the server has to do twice as much work. For example "www" no-case doesn't just mean "WWW" and "www"; it means all eight combinations of [Ww][Ww][Ww]. Learn the correct casing, and use it.

</tangent>
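For reference, the BrowserMatch shorthand mentioned above is just a condensed SetEnvIf against the User-Agent header - something like this (untested sketch, with "sqlmap" standing in for a real list):

```apache
# These two lines are equivalent: BrowserMatchNoCase is shorthand for
# SetEnvIfNoCase User-Agent with the same regex and variable name.
SetEnvIfNoCase User-Agent "sqlmap" bad_bot
BrowserMatchNoCase "sqlmap" bad_bot
```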

I added the following to the htaccess file for one site

I hope this is a 2.2 site, not the 2.4 you've been lamenting elsewhere ;) In 2.2, the mod_setenvif-plus-mod_auththingy combination works exactly that way. I use the identical formulation myself, except that I call it "keep_out". (Also bad_agent and bad_ref and a clutch of others, so I can fine-tune the conditions and exceptions.)

What's the reason for the <Limit> envelope? Most of the time, if you're blocking someone, you'd want to block them regardless. If it's on shared hosting they probably block PUT by other means, but do you want your really malign visitors even to succeed in a HEAD or OPTIONS request?

Tip: If you find some of your unwanted visitors requesting robots.txt (you will, of course, have a Files exemption to let them all have it), make sure you also deny them in robots.txt. That way, if they really do obey, they'll never even make a page request and your server doesn't even have to expend another nanosecond sending out the 403s. The only thing better than a blocked request is no request at all.
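That Files exemption might look like this under the 2.2-style setup in this thread (a sketch, untested; it assumes the same bad_bot blocking elsewhere in the file):

```apache
# Let everyone fetch robots.txt, even agents denied elsewhere, so
# well-behaved bots can still read the rules telling them to go away.
<Files "robots.txt">
    Order Allow,Deny
    Allow from all
</Files>
```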

Webwork

1:59 pm on Feb 26, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



@ tangor - I'm learning that's part of the fun. Like seeing flies caught on flypaper . . if you're old enough to remember that modern wonder.

@ Lucy - Argh. I recently upgraded to 2.4. I'm aware there's a mod to aid transition but, since there are plans to deprecate much, I'm attempting to learn the new ways. Argh.

Here's my latest attempt. I'm sure it's wrong, because I couldn't find a clear explanation of how to wrap the logic of the <If> conditional and "granted" or "not". After a night's sleep I'm thinking some version of a "not" flag might be the trick, but is that akin to saying "all user agents are okay, just 'not' those defined in the string"? Nothing has blown up. I've been searching far and wide for decent, in-depth 2.4 tutorials. Haven't found them yet. So, here's JAL user-agent block V1.4:

 <RequireAll>
SetEnvIfNoCase User-Agent "^(archive.org_bot|ia_archiver|ahrefsbot|baiduspiker|cukbot|dotbot|domainappender|feedfinder|extract|getintentcrawler|getintent|g00gle|grapeshotcrawler|harvest|meritrix|maxpoint|maxpointcrawler|miner|mj12bot|naver|netseer|nikto|nutch|oBot|paperlibot|planetwork|plukkie|postrank|proxic|purebot|pycurl|python|qwantify|seekspider|semrushbot|seznambot|siclab|skygrid|sogou|sqlmap|sucker|turnit|w3c-checklink|winhttp|wotbox|xxxyy|yandexbot|youda|zmeu|zune)" bad_bot
<If "%{HTTP_USER_AGENT} =='bad_bot'">
Require all denied
</If>
<Else>
Require all granted
</Else>
</RequireAll>

Webwork

7:08 pm on Feb 26, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Okay, I "think" I got this:

 SetEnvIfNoCase User-Agent "^(archive.org_bot|ia_archiver|ahrefsbot|baiduspiker|cukbot|dotbot|domainappender|feedfinder|extract|getintentcrawler|getintent|g00gle|grapeshotcrawler|harvest|meritrix|maxpoint|maxpointcrawler|miner|mj12bot|naver|netseer|nikto|nutch|oBot|paperlibot|planetwork|plukkie|postrank|proxic|purebot|pycurl|python|qwantify|seekspider|semrushbot|seznambot|siclab|skygrid|sogou|sqlmap|sucker|turnit|w3c-checklink|winhttp|wotbox|xxxyy|yandexbot|youda|zmeu|zune)" bad_bot
<If "%{HTTP_USER_AGENT} =='bad_bot'">
Require all denied
</If>


To borrow a line from Lucy24 . . . Tapping finger . . Waiting for thunderous round of applause. :) ;) :p :-/

lucy24

8:04 pm on Feb 26, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Every time I see this thread, I wonder all over again what Japan Airlines has got to do with anything.
I'm sure it's wrong because I couldn't find a clear explanation of "how to wrap the logic"

The terminology is definitely confusing, because the single word "Require" is used in two different ways. In envelopes, it might be
<RequireAll>
<RequireAny>
<RequireNone>
and then inside the envelope it says "require" all over again. So you can have
Require all denied
<RequireAny>
Require one
Require two
Require three
</RequireAny>
vs.
Require all granted
<RequireNone>
Require one
Require two
Require three
</RequireNone>
(corresponding loosely to whitelisting and blacklisting)
vs.
<RequireNone>
Require env keep_out
Require env bad_agent
Require env bad_ref
</RequireNone>
et cetera et cetera. (This last form is probably the most straightforward conversion of a 2.2 "Deny from" list.)
...
SetEnvIfNoCase User-Agent "^(archive.org_bot|ia_archiver|ahrefsbot|baiduspiker|cukbot|dotbot|domainappender|feedfinder|extract|getintentcrawler|getintent|g00gle|grapeshotcrawler|harvest|meritrix|maxpoint|maxpointcrawler|miner|mj12bot|naver|netseer|nikto|nutch|oBot|paperlibot|planetwork|plukkie|postrank|proxic|purebot|pycurl|python|qwantify|seekspider|semrushbot|seznambot|siclab|skygrid|sogou|sqlmap|sucker|turnit|w3c-checklink|winhttp|wotbox|xxxyy|yandexbot|youda|zmeu|zune)" bad_bot

Uh-oh, now this is wrong and it will fail, where "fail" means "will not have the intended effect", not "will crash the server". You've got a spurious ^ at the beginning, before the parentheses, meaning that the environmental variable will only be set if the user-agent begins with the specified string.

Incidentally, the quotation marks aren't really needed, though they will do no harm. In mod_setenvif the main use of a quotation mark is to "protect" a literal space, as an alternative to escaping it. (You could also say \s which, in a request, could hardly mean anything but an ordinary space.)
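Putting those two points together, the corrected directive drops the leading anchor (list shortened here for illustration; the full list from the thread works the same way):

```apache
# No leading ^: the variable is now set if any alternative appears
# anywhere in the User-Agent string, not only at its very beginning.
SetEnvIfNoCase User-Agent (sqlmap|httrack|nikto) bad_bot
```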

:: idly wondering what you've got against Seznam*, a perfectly legitimate search engine which obeys robots.txt ::


* I went over and checked logs to make sure I'm not talking through my hat. A couple of exceptions in May 2015, and some more in the early part of 2014. Based on their crawl frequency, that has to be some kind of hiccup. I've found the same thing with Yandex, which once in a blue moon seems to forget all about robots.txt.

Webwork

8:22 pm on Feb 26, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I was mildly amused when, in my youth, I first saw an ad for Japan Airlines.

Sezwho I've got anything against Seznam? (My best friend in H.S. was Slovak and that's all I'm gonna offer as a . . ahem . . spurious reason for blocking Sezthem, the Czech SE.)

Thx for pointing out the error of my spurious ways. Very much so.

whitespace

8:47 pm on Feb 26, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month





<If "%{HTTP_USER_AGENT} =='bad_bot'">



%{HTTP_USER_AGENT} is the full user-agent as sent in the request and 'bad_bot' is the literal string bad_bot (not the value ("1") of the bad_bot environment variable set with SetEnvIfNoCase). So, this expression is never going to work.

I think (looking at the docs [httpd.apache.org]) it would need to be something like:


<If "-T reqenv('bad_bot')">


Which tests if the environment variable is true. In this sense the string "1" is considered true. (Very much untested)

whitespace

12:06 am on Feb 27, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



<If "-T reqenv('bad_bot')">


Well, I can't get this to work. I suspect reqenv() is the wrong function anyway, but crucially I don't seem to be able to access the environment variable set with SetEnvIf[NoCase]. SetEnvIf should be running early in the request and <If> runs late as far as I can tell. However, none of the following appear to work:


<If "-T reqenv('bad_bot')">
<If "-T env('bad_bot')">
<If "-T %{ENV:bad_bot}">
<If "-T %{ENV:bad_bot} == '1'">


Despite seeing some examples that suggest that all the above should work!? My basic example, which should set an HTTP response header, but doesn't:


SetEnvIf User-Agent . bad_bot
<If "-T %{ENV:bad_bot}">
Header set X-Blocked Yes
</If>


However, you don't need to test the condition with SetEnvIf, you can do it all inside the <If> construct (and optionally set the environment variable inside that?)


<If "%{HTTP_USER_AGENT} =~ /(archive.org_bot|ia_archiver|ahrefsbot|baiduspiker|cukbot|dotbot|domainappender|feedfinder|extract|getintentcrawler|getintent|g00gle|grapeshotcrawler|harvest|meritrix|maxpoint|maxpointcrawler|miner|mj12bot|naver|netseer|nikto|nutch|oBot|paperlibot|planetwork|plukkie|postrank|proxic|purebot|pycurl|python|qwantify|seekspider|semrushbot|seznambot|siclab|skygrid|sogou|sqlmap|sucker|turnit|w3c-checklink|winhttp|wotbox|xxxyy|yandexbot|youda|zmeu|zune)/i">
Header set X-Blocked Yes
SetEnv bad_bot 1
</If>


The above at least "works". (Looks a bit messy though.)

lucy24

12:56 am on Feb 27, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, dear. And here I was thinking I could do a direct conversion, where 2.2 says
Order Allow,Deny
Allow from all
Deny from env=blahblahblah
Deny from env=blahblahblah (long list of environmental variables)
and then 2.4 would be
Require all granted
<RequireNone>
Require env blahblah
Require env blahblah (same list again, globally converted)
</RequireNone>
:: wandering off to pore over docs again ::
The Allow, Deny, and Order directives, provided by mod_access_compat, are deprecated and will go away in a future version. You should avoid using them, and avoid outdated tutorials recommending their use.

Well, that's reassuring. The familiar Allow/Deny/Order won't out-and-out go away; we've got time to change them.

About every other week my host sends out a form letter about server maintenance. Each time I get all excited thinking we're moving on up to 2.4 ... and then it's just routine housekeeping, darn it.

Webwork

3:17 am on Feb 27, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I'm a bit miffed. I followed the guidance given @ Apache.org here [httpd.apache.org]

Under the section "Blocking Robots", which states
"Rather than using mod_rewrite for this, you can accomplish the same end using alternate means, as illustrated here:"


SetEnvIfNoCase User-Agent "^NameOfBadRobot" goaway
<Location "/secret/files">
<RequireAll>
Require all granted
Require not env goaway
</RequireAll>
</Location>


Argh. I'm not at all familiar with <Location>. I want to boot the bots from every virtual host on the entire VPS by placing the directives in the Pre VirtualHost config file.

lucy24

5:40 am on Feb 27, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<tangent>
"^NameOfBadRobot"
I am beginning to suspect that you're one of the many, many people who have been fed a wildly mistaken idea of what the symbol ^ means. It's got nothing to do with Apache syntax; it's a RegEx anchor meaning "beginning of the test string". In defining User-Agents, it's used when the target text has to come at the beginning of the UA string. It's also useful when the target text, if it occurs, will de facto always happen to come at the very beginning; if it isn't there, the server can stop looking right away.
</tangent>

I'm not at all familiar with <Location>.

Are we in the config file? Yes, we are. Like <Directory>, but unlike <Files> and <If>, <Location> can only be used in the config file. (99 times out of 100, "in config" includes "in a virtual hosts section".) The difference between "Location" and "Directory" is that "Directory" refers to a real, physical directory* on the server, while "Location" refers to a part of an URL.

If you're not sure you need it, it's pretty certain that you don't need it.


* It is inordinately difficult to explain this without inadvertently saying "physical location", which is exactly what Location doesn't mean :(
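For the server-wide goal upthread, one approach might be to keep the docs' example but widen the envelope to "/", which as a URL-path prefix matches every request (untested sketch; Require needs a directory-type context, so the envelope can't simply be dropped in the global config):

```apache
# Placed in the global (pre-VirtualHost) configuration, <Location "/">
# covers every URL on every vhost that doesn't override it.
SetEnvIfNoCase User-Agent "^NameOfBadRobot" goaway
<Location "/">
    <RequireAll>
        Require all granted
        Require not env goaway
    </RequireAll>
</Location>
```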