

Banning by User Agent but Allowing UA Access to Robots.txt

Bot ban working via SetEnvIf but now I wish to allow access to robots.txt

         

Webwork

2:08 am on Mar 8, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



HTTP Server is Apache 2.4.x

I believe I've got the following working on Apache 2.4 via an include file loaded at Pre VirtualHost Include:
<Location />
SetEnvIfNoCase User-Agent ".*ahrefsbot" badbot
SetEnvIfNoCase User-Agent ".*Alexibot" badbot
SetEnvIfNoCase User-Agent ".*archive.org_bot" badbot
SetEnvIfNoCase User-Agent ".*BlackWidow" badbot
etc etc etc
Require expr %{HTTP_USER_AGENT} != 'badbot'
</Location>


At least [httpd.apache.org] tells me this is the right code AND my VPS didn't kick it out when I wrapped the directives in a <Location /> container.

I now see the error of my ways in not allowing certain not-so-bad bots access to my robots.txt file. I'm sure there's a simple way to grant access to robots.txt only, but after 5 hours of searching and reading, that solution (directive / syntax) has escaped me.

How do I maintain the ban whilst still allowing some bots, i.e., the presumably nicer ones, access to robots.txt and nothing else?

lucy24

2:47 am on Mar 8, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The Apache 2.2 version is:
<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>

so in 2.4 the inner two lines would be
:: shuffling papers ::
Require all granted
and that's all, isn't it? afaik there hasn't been any change to the <Files> locution as such.
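
Spelled out in full, the 2.4 form of that block would presumably look like the sketch below. It is untested, and it assumes nothing applied later in the merge (a <Location> section, for instance) sets its own Require for the same request.

```apache
# Sketch of the 2.4 equivalent of the 2.2 block above (untested)
<Files "robots.txt">
    # Replaces "Order Allow,Deny" + "Allow from all"
    Require all granted
</Files>
```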

User-Agent ".*blahblah"

I know Apache is fond of the non-final .* locution, but honestly, you do not need to do this. Since there's no anchor and you're not capturing, just give the text you want to match. (The quotation marks are also not essential unless you're enclosing literal spaces, though I admit they will do no harm. It's just another 2 bytes--per line--of overhead.)
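
As a rough illustration, the first few lines from the original post could be trimmed as follows. This is a sketch only; the backslash before the dot in archive.org_bot is an addition not in the original, so that the dot matches a literal period rather than any character.

```apache
# Unanchored, case-insensitive substring matches; no leading .* or quotes needed
SetEnvIfNoCase User-Agent ahrefsbot badbot
SetEnvIfNoCase User-Agent Alexibot badbot
SetEnvIfNoCase User-Agent archive\.org_bot badbot
SetEnvIfNoCase User-Agent BlackWidow badbot
```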

:: back to envious sulking because I want to play with 2.4 too, dammit :( ::

Webwork

3:17 am on Mar 8, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Lucy, thanks for pitching in.

When I omit/remove the <Location /> </Location> wrapper my VPS rejects the directive as follows:

Sorry, your changes have introduced a syntax error in pre_virtualhost_global.conf. Please correct the issue.

Error:
Configuration problem detected on line 554 of file /usr/local/apache/conf/includes/pre_virtualhost_global.conf.tmp:
Require not allowed here

--- /usr/local/apache/conf/includes/pre_virtualhost_global.conf.tmp ---
548 SetEnvIfNoCase User-Agent ".*wotbox" badbot
549 SetEnvIfNoCase User-Agent ".*xxxyy" badbot
550 SetEnvIfNoCase User-Agent ".*yandexbot" badbot
551 SetEnvIfNoCase User-Agent ".*youda" badbot
552 SetEnvIfNoCase User-Agent ".*zmeu" badbot
553 SetEnvIfNoCase User-Agent ".*zune" badbot
554 ===> Require expr %{HTTP_USER_AGENT} != 'badbot' <===
555
556 # JAL Sets Files for Mod Deflate January 29 2016
557
558 SetOutputFilter DEFLATE
559
560 # JAL Mod Deflate
--- /usr/local/apache/conf/includes/pre_virtualhost_global.conf.tmp ---


Since Apache is planning to deprecate "allow, deny" directives I'm doing my best to stick with "Require all denied/granted".

lucy24

5:19 am on Mar 8, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: detour to look something up ::

<Location> sections are processed in the order they appear in the configuration file, after the <Directory> sections and .htaccess files are read, and after the <Files> sections.

Urk. I've never personally worked with <Location>, since it's not allowed in htaccess and I'm not sure I'd want to anyway. Are you absolutely positive the error is triggered by omitting the <Location> envelope? Seems odd. In fact, it seems backward, since the docs [httpd.apache.org] say that the "Require" directive is only allowed inside <Directory> sections (including htaccess).

You're putting the <Files> envelope inside a <Directory> section, right? Not loose in config?
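
For reference, "Require not allowed here" is the message Apache gives when Require sits loose in the server config. The Require documentation lists its context as directory and .htaccess, and directory context covers <Location> and <Files> containers as well as <Directory>. A minimal sketch of a wrapped version follows. Note that it swaps the posted Require expr line for Require not env: as far as I can tell, the posted expression compares the User-Agent string itself with the literal text 'badbot' instead of testing the badbot variable. The single SetEnvIfNoCase line stands in for the full list.

```apache
<Location "/">
    # Stand-in for the full list of SetEnvIfNoCase lines from the original post
    SetEnvIfNoCase User-Agent ahrefsbot badbot

    # A negated Require needs a positive one alongside it inside <RequireAll>
    <RequireAll>
        Require all granted
        Require not env badbot
    </RequireAll>
</Location>
```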

:: further detour to confirm that <Files> can be nested within <Directory> -- it would be weird if it couldn't, since <Files> can be used in htaccess -- although it doesn't have to be ::
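
Given the processing order quoted above, one option that avoids <Files> altogether is a second, narrower <Location> section placed after the <Location /> block. <Location> sections are applied in the order they appear, and authorization settings from an earlier section are replaced rather than merged by default (AuthMerging Off), so the later section should win for that one URL. A sketch, untested, assuming robots.txt is served from the site root:

```apache
# Must come after the <Location /> block that holds the ban
<Location "/robots.txt">
    Require all granted
</Location>
```

This opens robots.txt to every user agent, not just the nicer bots, which is usually harmless for that file.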

Yes, Allow/Deny will be deprecated and technically isn't even part of 2.4. But right now, as we speak, with today's robots, do you have mod_thingamajig installed? The one that allows backward compatibility with old Allow/Deny statements?

:: further lookup ::

mod_access_compat. I should be able to remember that.
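
If mod_access_compat is loaded, the 2.2-style block should still be accepted. A hedged sketch that only applies it when the module is present is below. The Apache upgrade notes discourage mixing the old Order/Allow/Deny directives with Require, so this is a stopgap at best.

```apache
# Only takes effect when mod_access_compat is available
<IfModule mod_access_compat.c>
    <Files "robots.txt">
        Order Allow,Deny
        Allow from all
    </Files>
</IfModule>
```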