SetEnv and Modrewrite mix and match

Should this be done in htaccess?

         

grandma genie

5:34 pm on Dec 6, 2010 (gmt 0)

10+ Year Member



Hi,
I just posted a question in the Spider User Agent ID forum but think it belongs here. My htaccess file is using modrewrite, but I want to allow all bots to see my robots.txt file, but I'm not sure how to do that. So I posted this:

Can you mix and match SetEnv and Mod-rewrite? My htaccess file uses this to block bad bots:
RewriteEngine on
#
# Return 403-Forbidden to unwelcome/malicious user-agents
RewriteCond %{HTTP_USER_AGENT} ZmEu [NC]
RewriteRule ^ - [F]

So, if I want to allow all bots to see my robots.txt file, I need to do it with rewrite, I assume. So, would I add the

RewriteRule ^/robots\.txt$ - [L]

to the bottom of the htaccess file? Or will that cause all kinds of issues I can't imagine?

Also, if I have a bot blocked in robots.txt, I assume I must not also have it blocked in htaccess. Is that true? And if I determine that the bot does not obey robots.txt, I should remove the robots.txt entry and then include it in htaccess. Would having the bot in both places cause a "self-inflicted denial of service attack?"

I've had some suggestions to use SetEnv, but don't think I can mix and match.
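
On the question of where the robots.txt rule belongs, here is a minimal sketch. It assumes the rules live in the .htaccess file at the site root and reuses the single ZmEu block rule from the post. Two things differ from the line quoted above: in .htaccess context mod_rewrite matches the path without its leading slash, and the exemption has to come before the blocking rules, not at the bottom of the file, so that it is evaluated first.

```apache
RewriteEngine on
#
# Let every client fetch robots.txt: stop rule processing here,
# before any of the blocking rules below are evaluated.
# In .htaccess the pattern sees "robots.txt", not "/robots.txt".
RewriteRule ^robots\.txt$ - [L]
#
# Return 403-Forbidden to unwelcome/malicious user-agents
RewriteCond %{HTTP_USER_AGENT} ZmEu [NC]
RewriteRule ^ - [F]
```

The "-" substitution leaves the URL untouched, so the request for robots.txt is served normally and everything else still falls through to the user-agent checks.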

wilderness

6:27 pm on Dec 6, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There are literally thousands of examples of mod_setenvif and mod_rewrite mixed together within the same .htaccess file in the forum archives.

You CANNOT, however, mix the syntax of the two in the same directive.
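
A hedged sketch of what that coexistence can look like: each directive keeps its own module's syntax, and the only thing the two share is an environment variable. The variable name "pass" is arbitrary, and the example assumes a root-level .htaccess where variables set by SetEnvIf are visible to mod_rewrite through %{ENV:...}.

```apache
# mod_setenvif: flag requests for robots.txt.
# With no explicit value, SetEnvIf sets the variable to "1".
SetEnvIf Request_URI "^/robots\.txt$" pass
#
RewriteEngine on
# mod_rewrite: refuse the bad agent unless the request was flagged above.
RewriteCond %{ENV:pass} !=1
RewriteCond %{HTTP_USER_AGENT} ZmEu [NC]
RewriteRule ^ - [F]
```

What does not work is borrowing one module's syntax inside the other's directive, for example writing a RewriteCond-style test inside a SetEnvIf line.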

g1smd

6:45 pm on Dec 6, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you deny a user agent in robots.txt, and if the user agent can physically read the file, then the user agent can choose whether to obey the instruction or not.

If you deny the user agent in .htaccess, the server refuses its requests (normally with a 403 response), or serves it only what you specify. The user agent gets no choice in the matter.

grandma genie

11:43 pm on Dec 6, 2010 (gmt 0)

10+ Year Member



After much searching, I came up with this. What I am attempting to do is allow bots to see the robots.txt file with a pass in htaccess. Please let me know if this coding is correct:

php_flag display_errors 1
#
SetEnvIf Request_URI "(robots\.txt)$" pass
#
order allow,deny
deny from 38.0.0.0/8
allow from all

ErrorDocument 404 /notfound.html

RewriteEngine on
#
# Return 403-Forbidden to unwelcome/malicious user-agents
RewriteCond %{HTTP_USER_AGENT} ^Allrati [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ZmEu [NC]
RewriteRule ^ - [F]

I also have bans on hotlinking and some 301 redirects.

If I have a certain bot's IP range blocked in htaccess, like the 38 range, will the SetEnvIf Request still allow it to see the robots.txt file? I was attempting to block PSI Cogentco, but it looks like Discovery Bot is in that range, too. And only G*d knows who else.

wilderness

3:23 am on Dec 7, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Please note: you're NOT required to use mod_setenvif, as tangor advised you in this thread [webmasterworld.com].
There is a similar facility available in mod_rewrite.


Your SetEnvIf sets the "pass" environment variable, but nothing ever tests it (the env= part is missing), and thus your "pass" for robots.txt would never take effect.

It should read as follows, at least with the word "pass" that you've copied and pasted. (Note: you may use any word, as long as you are consistent in its use.)

Allow from env=pass

Also, the Order generally used with this SetEnvIf technique is Deny,Allow, so that a matching Allow overrides a matching Deny.

Thus your modified lines:

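# Request_URI always starts with a slash, so the anchored patterns need one too
SetEnvIf Request_URI ^/path-to-your-custom-403-page\.html$ pass
SetEnvIf Request_URI ^/robots\.txt$ pass
Order Deny,Allow
Deny from 38.0.0.0/8
# Allow is evaluated last, so flagged requests get through even from a denied range
Allow from env=pass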

Or combine the two SetEnvIf lines above into one:

SetEnvIf Request_URI "/(custom-403-page\.html|robots\.txt)$" pass