
Forum Moderators: Ocean10000 & incrediBILL & phranque

Apache 2.2 / CentOS - Blocking bad bots

Unable to get bad-bots blocked properly

     
3:53 pm on Nov 25, 2017 (gmt 0)

New User

joined:Nov 25, 2017
posts: 7
votes: 0


I'm having all kinds of problems trying to restrict bad bots on my Apache 2.2 server, and am hoping somebody can assist.

I have banged my head on the wall for days trying to get this working, and used several different methods, but none seem to work properly.

I have several sites on one machine, and I could, of course, deny bad bots in individual .htaccess files for each site - but that's a pain to maintain. So, I want to put the restrictions in httpd.conf.

The first method I was using (which I thought was working) was to use a <Location "/"> section, e.g.
 <Location "/"> 
SetEnvIfNoCase User-Agent "lwp-trivial" bad_bot
SetEnvIfNoCase User-Agent "libwww" bad_bot
SetEnvIfNoCase User-Agent "Wget" bad_bot
Deny from env=bad_bot
</Location>

However, I found that although this did block the bots, there was an issue: it then allowed hidden files, such as .htaccess and .htpasswd, to be served up, even though there is code in httpd.conf to disallow that. I played around with the order of the <Files ...> block (which does the stuff blocking file access) and the <Location ...> block, but no matter which one had precedence it still allowed hidden files to be served. If I take out the <Location ...> block then the server prevents the hidden files from being served, as it should.

I've also tried doing rewrites in httpd.conf, but that doesn't seem to work either (the block is at the foot of the file, but I've tried it above the virtual hosts section too), e.g.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} AhrefsBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} AlphaBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteRule ^(.*)$ - [L,R=403]
</IfModule>

I get no errors using either method, but they're not doing what I want. This second method simply doesn't appear to block the bots.

I've also tried stuff like the following, also without success:
<Location "/var/www/sites/">
SetEnvIf User-Agent BLEXBot GoAway
Order allow,deny
Allow from all
Deny from env=GoAway
</Location>

... and
RewriteCond %{HTTP_USER_AGENT} "blexbot" [nocase]
RewriteRule ^.*$ - [forbidden,last]

... and seemingly every other possible combination of things. But I can still only block bots with individual .htaccess files, or with the <Location "/"> section (which exposes the hidden files).

As can be seen, one of the user-agent strings I'm testing with is "Blexbot" and variations of it, and so the last thing I've tried is with modsecurity.

However, I don't seem able to get that working properly either: here are a couple of examples which I've tried:
SecRule REQUEST_HEADERS:User-Agent "BLEXBot" "deny,status:403,id:5000218,msg:'Badbot test for Blexbot'"
SecRule REQUEST_HEADERS:User-Agent "@pmFromFile badbots.txt" "id:350001,rev:1,severity:2,log,msg:'BAD BOT - Detected and Blocked. '"

If I look in /var/log/modsec_audit.log then I can see that modsecurity does identify the user-agent, and provides a log entry to that effect, but it doesn't actually prevent the pages from being served (which is kinda the whole point).

I do note that the modsec_audit.log has entries of `Engine-Mode: "DETECTION_ONLY"`, which might explain the pages still being served, but I'm not familiar with much of modsecurity at all, so I'm not really sure about what it's doing.
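For what it's worth, here is a minimal sketch of what a blocking ModSecurity setup might look like, assuming ModSecurity 2.x; the config file path varies by distribution, and note that the second SecRule quoted above only specifies log, with no deny action, so even with the engine switched on it would log rather than block:

# Switch the engine from DetectionOnly to blocking mode. On CentOS this
# often lives in /etc/httpd/conf.d/mod_security.conf, but the path varies.
SecRuleEngine On

# Deny (403) any request whose User-Agent matches an entry in badbots.txt
# (a full path to the file is safest).
SecRule REQUEST_HEADERS:User-Agent "@pmFromFile badbots.txt" \
"id:350001,phase:1,deny,status:403,log,msg:'BAD BOT - Detected and Blocked.'"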

If anyone can assist it would be truly appreciated! I just need a single method to work, but I kind of like the idea of using modsecurity if I can, as it seems I can just put any bad-bot entries in a single separate file.

Disclaimer: I have asked this same question elsewhere, but have gotten no response.
8:25 pm on Nov 25, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14316
votes: 560


<Location "/var/www/sites/">
That's not a location. It's a directory.

There are situations where <Location> is appropriate--obviously, or the form wouldn't exist in Apache in the first place--but they are exceedingly rare. In fact, I cannot remember seeing a post in this subforum that involved <Location>. What you need is <Directory>. Start with a universal
<Directory />
Order Allow,Deny
Deny from all
AllowOverride none
</Directory>
--it should already be there, unless you've accidentally trashed it--and then put in exceptions for specific directories.
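As an illustration only, an exception for the directory that holds the sites might look something like this (using the /var/www/sites path that appears elsewhere in this thread):

<Directory /var/www/sites>
# Re-open access for the actual site content.
Order Allow,Deny
Allow from all
</Directory>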

In the default config file that came with your Apache, there should be a package something like this:
<FilesMatch "^\.ht">
Order allow,deny
Deny from all
Satisfy All
</FilesMatch>
Do not change this. It is what prevents all visitors, everywhere, all the time, from seeing your .htaccess or .htpasswd files. Similarly, never ever ever say <Files *> or similar in htaccess, because it will cancel this barrier.

To counterbalance the never never never directives: Always always always make a copy of the config file before you start mucking about with it.
8:44 pm on Nov 25, 2017 (gmt 0)

Junior Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 188
votes: 13


Bot killing is an escalating game of wits, so to ask for a definitive solution is to ask the impossible. They attack, you parry, they change names, you add entries, they move servers to another country and IP range, you find this out and counter...The world is very large, with ~7B+ people, and bot software is freely available on Git and other places, so there are too many of them against too few of us.

Here is my way to protect numerous web sites using one htaccess. I use a shared service, so I don't have access to httpd.conf, I don't think.

Put all your web sites into individual directories. Put the htaccess in the root. Use SetEnvIf statements and not RewriteCond. The SetEnvIf has some inheritance that will take care of all your sites if they are in subdirectories. RewriteCond does not reliably inherit. Check to ensure your Apache level does this, as I found there are variations. Over time, given hard work, your bot traffic will go down. There are some very bad companies out there, such as CoCro/HVH, B2N, Azn, but we cannot mention them here by name, according to the TOS. For these I ban all their IP ranges.

SetEnvIf User-Agent ZoominfoBot keep_out

order allow,deny
allow from all
deny from env=keep_out


If you don't believe you have readers in China or Russia, then ban all their IP ranges. Africa, the Middle East? Turkey? Thailand? These decisions vary with web site and target audience, so no one can make these decisions but you.
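To make the range-banning concrete: in Apache 2.2, Deny lines in the same order allow,deny block accept CIDR ranges. The ranges below are documentation placeholders, not real bot networks:

# Placeholder ranges: substitute the networks you have actually identified.
deny from 192.0.2.0/24
deny from 198.51.100.0/24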

Others wiser than I will come by. Soon.

[edited by: phranque at 12:45 am (utc) on Nov 26, 2017]
[edit reason] assuming "SetEndif" was a typo [/edit]

9:19 pm on Nov 25, 2017 (gmt 0)

New User

joined:Nov 25, 2017
posts: 7
votes: 0


Lucy24:
-----------
Thanks for your message. <Location "/var/www/sites/"> was just an example to demonstrate having tried everything I could think of. The unmodified example, with <Location "/">, was the original code I used (which did block bots, but made hidden files visible).

Yes, I understand the <FilesMatch ...> stuff.

However, having tried stuff like:
SetEnvIfNoCase User-Agent "blexbot" badbot
<Directory />
Order Allow,Deny
Deny from env=badbot
AllowOverride none
</Directory>

it makes no difference. I'm guessing your suggestion is that the <Directory /> block goes inside a virtual host, but I've tried it inside and outside and the result is the same. However, whatever solution I end up with, I want it to apply to all sites (i.e. the whole idea is that there's a single place to edit, rather than multiple instances that need updating every time something changes).

TorontoBoy
-----------------
Thanks for your comment. I have root access to my own server, so I can do what I want. The idea with putting bot rules in httpd.conf is to avoid making duplicate copies of everything for .htaccess files and updating them individually, so I'd like to avoid .htaccess solutions if possible, as I know it's possible to do it from within httpd.conf - I just can't seem to get it working. Each of my sites is in a separate directory under /var/www/sites/. Alternatively, if I could get the modsec bits working then that too would be helpful, because then I can just update a single file of bad-robots.txt without even needing to modify httpd.conf to do it.
10:44 pm on Nov 25, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14316
votes: 560


They attack, you parry, they change names
I read this too fast and thought you were saying that robots get married and change their names, which creates an interesting mental picture.

Each of my sites is in a separate directory under /var/www/sites/.
In that case, all your access-control rules should go in a <Directory> section for that overall directory. And as TorontoBoy said, don't use RewriteRules. Because of mod_rewrite's wonky inheritance, it should be reserved for individual sites--whether those end up being .htaccess, or site-specific <Directory> sections. A lot can be done with mod_setenvif, since it works nicely in combination with mod_authzthingwhatsit (exact name changes from one Apache version to the next), and inherits consistently.
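Spelled out as a sketch, using bot names and the /var/www/sites path already mentioned in this thread:

# In httpd.conf, outside the <VirtualHost> sections:
SetEnvIfNoCase User-Agent "BLEXBot" bad_bot
SetEnvIfNoCase User-Agent "lwp-trivial" bad_bot

<Directory /var/www/sites>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Directory>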

By the usual yawn-provoking coincidence I was only just yesterday trying to figure out whether site-specific rules are better placed in a <VirtualHost> section or in a <Directory> section. For this we need someone who speaks Apache.

:: looking around vaguely before wandering off to Apache docs ::
12:57 am on Nov 26, 2017 (gmt 0)

New User

joined:Nov 25, 2017
posts: 7
votes: 0


It was actually the other respondent, TorontoBoy, who was talking about bots changing names, etc, ... but yeah.

Still, the fact remains that of everything I've tried so far, the only setup that actually blocks bots is the <Location "/"> section, but, for whatever reason, it negates the <FilesMatch ...> section, no matter which one has precedence.

As you can see, I've tried several methods, and I test each one after an Apache restart with my own script, which sets the user-agent to whatever I want, but no dice. Nothing so far is working the way I need it to (or even at all).
2:32 am on Nov 26, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11115
votes: 110


welcome to WebmasterWorld [webmasterworld.com], Cheddar!

you should read carefully apache's Configuration Sections [httpd.apache.org] documentation, especially the sections on "Filesystem and Webspace" and "How the sections are merged".
2:34 am on Nov 26, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11115
votes: 110


lucy24 was probably looking for:
What to use When

Choosing between filesystem containers and webspace containers is actually quite easy. When applying directives to objects that reside in the filesystem always use <Directory> or <Files>. When applying directives to objects that do not reside in the filesystem (such as a webpage generated from a database), use <Location>.

It is important to never use <Location> when trying to restrict access to objects in the filesystem. This is because many different webspace locations (URLs) could map to the same filesystem location, allowing your restrictions to be circumvented.

https://httpd.apache.org/docs/2.2/sections.html#whichwhen

and...
Sections inside <VirtualHost> sections are applied after the corresponding sections outside the virtual host definition. This allows virtual hosts to override the main server configuration.

https://httpd.apache.org/docs/2.2/sections.html#mergin
5:34 am on Nov 26, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14316
votes: 560


Sections inside <VirtualHost> sections are applied after the corresponding sections outside the virtual host definition.
Ah thanks, phranque, it was the before-or-after that I couldn't find. I did find the bit about not using <Location> for access control. (It now occurs to me that the reason you almost never see <Location> in the present subforum is that it can't be used in htaccess, which is what most CMS users are limited to, and it's rarely meaningful outside of database-driven sites.)

Now, if it's the same person running all the sites on the server, it makes most sense to put your access-control rules in the appropriate <Directory> section. As always, watch out for leading and trailing slashes, because having or omitting the wrong one can make or break a rule. For example, it would be
<Directory /var/www/sites>
with leading, without trailing slash.
2:42 pm on Nov 26, 2017 (gmt 0)

New User

joined:Nov 25, 2017
posts: 7
votes: 0


Okay, I'm making a bit of progress, but I do note the Apache 2.2 docs say ".... Use <Location> to apply directives to content that lives outside the filesystem. For content that lives in the filesystem, use <Directory> and <Files>. An exception is <Location />, which is an easy way to apply a configuration to the entire server." - which is what I was originally doing.

Anyway, as I originally said, any single method which works the way I want is fine, so I've played again with the <Directory ...> section and made some progress, though I'm still a slight distance from the finish line. Thanks very much for the <Directory ...> comment; that pointed me in the right direction.

So, I'm able to use the following syntax within each virtual host block, while the bad bots are defined outside the virtual host blocks (e.g. SetEnvIfNoCase User-Agent "lwp-trivial" badbot):

This works:
<Directory "/var/www/sites/mysite1">
Deny from env=badbot
</Directory>

... and this now works, which is excellent since I need only define the bad bots in a single place (see the foot of this post: they are now included in a separate file).

I don't quite understand why, but using the same "Deny from env=badbot" line in the original, existing <Directory ...> block doesn't work:

This doesn't work:
<Directory "/var/www/sites/mysite1">
Allow from all
AllowOverride AuthConfig Indexes Limit
Options +FollowSymLinks
Deny from env=badbot
</Directory>

Only when I have both blocks (and I opted to put the smaller one below the original one) does it work. Of course, I took the "Deny" line out of the top block. I'm curious why I can't simply have the one Directory block and put the Deny directive in there. It's no big deal, but I don't understand the logic of why it doesn't work (see the note after the blocks below). So, what I have now is the following, which works:

This works:
<Directory "/var/www/sites/mysite1">
Allow from all
AllowOverride AuthConfig Indexes Limit
Options +FollowSymLinks
</Directory>

<Directory "/var/www/sites/mysite1">
Deny from env=badbot
</Directory>
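A plausible explanation for the two-block puzzle, assuming the "doesn't work" block above is shown in full: with no explicit Order directive, Apache 2.2 falls back to Order Deny,Allow, and in that mode the Allow directives are evaluated last, so "Allow from all" wins over the Deny. Adding an explicit Order should, in principle, let a single merged block behave like the pair above (a sketch only):

<Directory "/var/www/sites/mysite1">
Options +FollowSymLinks
AllowOverride AuthConfig Indexes Limit
# With Allow,Deny ordering, Deny is evaluated after Allow and overrides it.
Order Allow,Deny
Allow from all
Deny from env=badbot
</Directory>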

I'm still marginally curious about a couple of things, though. Firstly, even though the 403 is produced, as noted in the log file and as determined using my own script, the server still defaults to its own noindex.html page (/var/www/error/noindex.html), even though I do have a 403 error document set up in the virtual host I'm testing. This sort of makes sense in a way, as I've already told the server not to serve anything from that virtual host to the bad bot requesting it. I can live with that, since I can easily modify the default Apache error pages if need be. However, is there a way, using the SetEnvIf approach I'm now using, to redirect the bad bots, whether with rewrites or any other method, from within the <Directory ...> section?

Also, I still don't know why, when using the <Location "/"> block idea, the <FilesMatch "^\.ht"> section gets ignored. Remember, this blocked the bots effectively, but allowed access to the hidden files. Quite literally, you can use any other parameter with <Location ...> other than "/", and the <FilesMatch "^\.ht"> section will work, but the blocking of bad bots will not. I find this very curious. Although <Location ...> blocks get processed after <FilesMatch ...> blocks, there seems to be nothing about the contents of the <Location ...> block that should compromise anything elsewhere. After all, it just contained a bunch of SetEnvIf directives for bad bots.

If anyone is able to shed light on the bits I'm still curious about, that would be good. Especially, if there is a way to produce a redirect from within the <Directory ...> block, to send the bots into the barren wilds of the web, that would be good too. An advantage of redirecting them would be that I don't need to use any bandwidth or resources serving up error pages.

Slightly tangential, I was looking to see if there is a way to "include" a list of SetEnvIf directives from an external file, into httpd.conf, rather than directly editing httpd.conf itself each time some change to the bot list is required. So, I did solve this by simply adding:
Include /etc/httpd/conf/badbots.conf
to httpd.conf (above the virtual hosts section), and then having a big list of SetEnvIfNoCase directives in the badbots.conf file, in the location noted.
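For reference, the included file is then just a plain list of directives; a sketch using names already mentioned in this thread:

# /etc/httpd/conf/badbots.conf: one SetEnvIfNoCase line per unwanted robot.
SetEnvIfNoCase User-Agent "lwp-trivial" badbot
SetEnvIfNoCase User-Agent "libwww" badbot
SetEnvIfNoCase User-Agent "Wget" badbot
SetEnvIfNoCase User-Agent "BLEXBot" badbot
SetEnvIfNoCase User-Agent "AhrefsBot" badbot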
7:26 pm on Nov 26, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14316
votes: 560


This sort of makes sense in a way, as I've already told the server not to serve anything from that virtual host to the bad bot requesting it.

That's exactly right. Therefore you need a <Files> or <FilesMatch> section that names your custom error document and says Allow from all. In fact, if you look at your error logs you will see the original denied request followed by a cascade of requests for the error document itself, with the response "Request denied by server configuration", until the internal-redirect limit runs out. (The limit is set by the server's LimitInternalRecursion directive, which defaults to 10.)

This, incidentally, applies to each module that is capable of issuing a 403. You always need an exemption for the error document itself. While you're at it, make sure you poke a similar hole for robots.txt. Sure, most malign robots aren't compliant, but some are, and the only thing better than a blocked request is a request that isn't made in the first place.
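A sketch of such an exemption, assuming the custom error document is called 403.html; adjust the filename to whatever is actually configured:

# Let everyone, including denied bots, fetch the error page and robots.txt.
<FilesMatch "^(403\.html|robots\.txt)$">
Order Allow,Deny
Allow from all
</FilesMatch>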
8:24 pm on Nov 26, 2017 (gmt 0)

New User from RU 

joined:Feb 17, 2016
posts:12
votes: 0


Well, what I tend to do is simply:

BrowserMatch "(lwp-trivial|libwww|Wget)" STOP=yes
RewriteEngine on
RewriteCond "%{ENV:STOP}" "yes"
RewriteRule ^ - [R=404]

Which seems to work fine.

And really nice along with:
ErrorDocument 404 "404"
8:40 pm on Nov 26, 2017 (gmt 0)

New User

joined:Nov 25, 2017
posts: 7
votes: 0


Lucy:
Okay, I'll look at doing the <Files> section for error docs later on. I've been playing with this stuff for hours now and I'm a bit tired. Good tip; thx.

Andova:
Thanks for the suggestion. I'm curious that it works, because I thought BrowserMatch behaved pretty much identically to SetEnvIfNoCase User-Agent, and I didn't think that Apache environment variables set by SetEnv were available to the rewrite module, because of the order of processing. I did already try something similar, and figured - eventually - that it didn't work for that reason. Maybe there was something else instead.

In actual fact, I set up one of the sites with a custom 403 page (related to the bit that Lucy mentions), and the other one, partly as a test, I set up with:
ErrorDocument 403 http://www.baidu.com
which conveniently redirects 403 recipients to baidu.com (or wherever). The only thing with doing the external redirect - and I suppose the same would be true if it were an internal redirect - is that the log files will log it as a 302, rather than a 403.
9:21 pm on Nov 26, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14316
votes: 560


I thought BrowserMatch pretty much behaved identically to SetEnvIfNoCase UserAgent

BrowserMatch and its counterpart BrowserMatchNoCase are shortcuts in mod_setenvif. But don't use NoCase--in any module--unless it is absolutely necessary to trap all possible configurations. Although it's easy from our end, like clicking “Ignore Case” in a text editor, it really means that the server has to check for [Bb][Rr][Oo][Ww][Ss][Ee][Rr] when at most the options are (browser|Browser|BROWSER)--and even those three are very unlikely. For most purposes, "B" is as different from "b" as it is from "C".

In general, mod_setenvif executes before mod_rewrite, so you may be able to make use of environmental variables. But because of its inheritance issues, mod_rewrite should be a last resort in access control.

PLEASE resist the temptation to redirect unwanted visitors to some other site, regardless of your feelings about that other site. You wouldn't redirect spammers to fbi.gov either. If you really need the visceral satisfaction, you can make do with 127.0.0.1 or similar. A redirect to the originating IP, if you're sure it isn't spoofed, may also work, because then eventually the unwanted robot will get kicked off their server. That's assuming they follow the redirect, which malign robots rarely do.

Yes, whenever you give a full protocol-plus-domain URL in an ErrorDocument directive, whether the target is your own site or someone else's, the response will turn into a 302.
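Illustrative only, with made-up paths:

# A full protocol-plus-domain URL makes Apache answer with a 302 redirect:
#   ErrorDocument 403 http://www.example.com/blocked.html
# A local path keeps the 403 status and serves the page directly:
ErrorDocument 403 /403.html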

ErrorDocument 404 "404"

Well, sure, if you never ever in the entire history of your site ever get a human requesting a nonexistent page or malformed URL--as can happen very easily when a link is auto-generated, for example by forums software. Error documents are for humans. Even 403s can arise from legitimate error.
12:33 pm on Nov 27, 2017 (gmt 0)

New User

joined:Nov 25, 2017
posts: 7
votes: 0


I'm sure the BrowserMatchNoCase and SetEnvIfNoCase variants are marginally slower, but I imagine their method of checking strings is simply to convert both strings to the same case and then compare, rather than testing each character against both cases. Still, it's a bit quicker not to need any conversion at all.

Not quite sure what your reference to "ErrorDocument 404" is about. I've not mentioned any 404-things.

I'm sure I'll be back on here again sometime, but thank you very much for your assistance; much appreciated. I was getting pretty frustrated, but everything is working properly now (I think).
9:22 pm on Nov 27, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14316
votes: 560


Not quite sure what your reference to "ErrorDocument 404" is about.
Other post, between yours and mine I think.

I imagine their method of checking strings will likely be to simply convert both strings to the same case and compare
Come to think of it, I imagine you are right--but that's still a conversion that has to take place. To our human brains, a > A and b > B is obvious, but to a computer, it's a matter of adding some number* to selected codepoints in the original string ... and if you're outside of plain ASCII, it won't always be the same number.


* I looked it up. For vanilla ASCII “typewriter” characters in the Big Three cased scripts--Roman (including Latin-1 diacritics), Cyrillic, Greek--it's hex 20, i.e. 32. But as soon as you're in Latin-Extended-anything, the pattern changes entirely.
9:32 pm on Nov 27, 2017 (gmt 0)

New User

joined:Nov 25, 2017
posts: 7
votes: 0


Oh, yeah, right. I see; I didn't realize what you were referring to with the 404 bit.

Indeed, best to avoid any unnecessary conversions if possible.

Thanks again for the help.
 
