Forum Moderators: coopster
[2]SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>[/2]
What exactly is the string inside the SetEnvIf meant to be doing?
It looks to me like "If the user is requesting a file called '403<#*$!>.htm' or 'robots.txt', set the env to allowsome." I'm kind of confused, because it doesn't look like regular RegEx to me (the grouping and the forward slashes look odd to me).
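For what it's worth, here is my reading of that line with the pattern annotated piece by piece (just a sketch of how I understand it; the leading slashes are literal parts of the request path, not regex delimiters, since SetEnvIf takes a bare regex with no /.../ wrappers):
[2]# The pattern is matched against the request URI and anchored at both ends:
#   ^(            start of the URI, open an alternation group
#   /403.*\.htm   "/403", then anything, ending in ".htm"
#                 (e.g. /403.htm or /403-custom.htm)
#   |             or
#   /robots\.txt  literally "/robots.txt" (the dot is escaped)
#   )$            close the group, end of the URI
# If the URI matches, the environment variable "allowsome" is set.
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome[/2]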
Before I ask a load (more?) of silly questions, am I reading this correctly?
More to follow :)
It looks to me like "If the user is requesting a file called "403SomethingOrNothing.htm" or "robots.txt" set the env to allowsome."
I'm wondering why this statement is here at all?
Doesn't the creation of the line "SetEnvIf Remote_Addr ^99.99.99.99$ getout" cover it all?
I guess you want to feed the robots.txt to the nice bots and let them at it so they will just stop rather than hammering your site.
As for the 403, I don't know, but I assume it refers to the custom 403 file, and is there to prevent an infinite loop where they keep getting sent to the 403 page, denied again, and sent there again?
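Something like this is what I'm picturing (a guess on my part, assuming the custom error page is /403.htm):
[2]ErrorDocument 403 /403.htm
# A denied request is internally redirected to /403.htm.
# That URI matches "^(/403.*\.htm|/robots\.txt)$", so allowsome gets set
# for the redirected request and the error page itself can be served,
# rather than being denied again.[/2]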
Anybody who's used the script want to give a REAL answer?
> I guess you want to feed the robots.txt to the nice bots and let them at it so they will just stop rather than hammering your site.
But you've made me think... does that mean that:
[2]SetEnvIf Remote_Addr ^99.99.99.99$ getout[/2] ... sets env to getout for the badbot and then
[2]SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome[/2] ... "unsets" it to allowsome if I've sent them with my 'deny' to a custom 403 page, or they have decided to behave
... and everything else (no env set) happily just drops through?
If that's so... the reformed bots should be sent to a script that deletes the line
[2]SetEnvIf Remote_Addr ^99.99.99.99$ getout[/2]
In other words - and if you're correct about the 403 loop - the pseudo-code is (sorry, haven't worked out the .htaccess code yet, but see the sketch after the pseudo-code below):
if env=getout then
    if they've requested robots.txt then
        (delete the getout setenv) and
        (send them to robots.txt, or just reprocess the request)
    else
        if I've sent them to my custom 403 then
            give it to them
        else
            deny
ErrorDocument 403 /403.htm
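For comparison, here's how (as far as I can tell) that branching already falls out of the static directives on each request, with no script needed to delete anything, since the variables are re-evaluated per request. This is only a sketch, assuming the banned address is 99.99.99.99 and the custom page is /403.htm; I've escaped the dots in the IP pattern, since a bare dot matches any character:
[2]SetEnvIf Remote_Addr "^99\.99\.99\.99$" getout
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
ErrorDocument 403 /403.htm
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>
# /robots.txt requested from 99.99.99.99: both getout and allowsome are set;
#   with "order deny,allow" the allow wins, so robots.txt is served.
# Any other page requested from 99.99.99.99: only getout is set, so 403;
#   the internal redirect to /403.htm then matches allowsome and is served.
# Any request from any other address: neither variable is set, and the
#   default for "order deny,allow" is to allow, so it passes through normally.[/2]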
To re-word / re-phrase my misunderstanding:
Why do you allow a banned robot to read your robots.txt if you continue to ban it from every other page on your site (except your 403)?
Am I missing something?
I wrote the code above and posted it in the Apache forum. The purpose is to allow universal and unrestricted access to robots.txt and the 403 error page, as you surmised.
> ... sets env to getout for the badbot and then
>SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
> ... "unsets" it to allowsome if I've sent them with my 'deny' to a custom 403 page, or they have decided to behave.
There are two variables, getout and allowsome. They are independent. They are combined using the precedence rules specified by the Order directive, which determines whether "allow" or "deny" takes priority. As written, any access denied by 'getout' can be overridden and allowed by 'allowsome'. So getout blocks the access unless the request is for a few special files/pages, in which case we "allowsome" access by that user-agent/IP-address, whatever.
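To make the precedence concrete, a quick sketch of the two orderings:
[2]order deny,allow
# Deny directives are checked first, then Allow directives; a request that
# matches both is allowed, and a request that matches neither is also allowed.
# That is what lets "allow from env=allowsome" punch a hole through
# "deny from env=getout" for robots.txt and the 403 page.

# The reverse ordering behaves differently:
# order allow,deny
# Allow is checked first, then Deny; a request matching both is denied,
# and a request matching neither is denied as well.[/2]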
> Why do you allow a banned robot to read your robots.txt if you continue to ban it from every other page on your site (except your 403)?
The simple answer is, "Because it's the right thing to do." Some screwed-up robots will come back after being denied access and try to read robots.txt, and *then* find out they are Disallowed. Basically, this helps the robot authors debug their code.
The real answer is that if you allow universal access to robots.txt (and your 403 error page), you'll save yourself a lot of trouble with 'mis-implemented' robots of all kinds. For example, some banned robots will go into a loop, trying to read robots.txt and getting denied repeatedly, eating up your bandwidth; if you don't let them read it, they just keep trying, which rather defeats the purpose of the scripts and access controls. Basically, allowing universal robots.txt access makes your site more robust.
Think of robots.txt as a sign on the door that says "No admittance." The script and .htaccess code are then the guy standing inside and behind the door with a bludgeon.
Jim
In other words, there's not much point in letting them back in just because they seem to start "behaving".
I was (incorrectly) assuming that because they read your robots.txt this meant they were not so bad. The fact that they read your robots.txt is irrelevant. They read it and ignore it. Perhaps, even, they read it to see precisely where they are not meant to go - a red rag to a bull.
Thanks again, Sam.