Selective SetEnvIf rules

keyplyr

9:44 am on Nov 21, 2003 (gmt 0)




The trouble with learning regex, and how to implement conditions and rules, is that we never get to see the .htaccess files from other websites, and I often find the Apache.org pages hard to digest.

There is one particular research bot (X1) that redundantly requests my robots.txt hundreds of times a day. I've decided to ban it until its developers (we have spoken) rewrite it to behave. I have also decided that letting it load any file other than a default 403 serves no productive purpose; however, I do wish to offer other offenders (X2, X3) my robots.txt or custom 403 page, respectively.

Is this an appropriate way to do the above?


SetEnvIf Remote_Addr ^XXX\.XX\.XX\.X1$ ban
<Files *>
Order Deny,Allow
Deny from env=ban
</Files>
SetEnvIf Referer ^XXX\.XX\.XX\.X2$ ban
SetEnvIf Remote_Addr ^XXX\.XX\.XX\.X3$ ban
SetEnvIf Request_URI ^/(robots\.txt|custom403\.html)$ allowit
<Files *>
Order Deny,Allow
Deny from env=ban
Allow from env=allowit
</Files>

keyplyr

11:32 pm on Nov 21, 2003 (gmt 0)




anyone?

jdMorgan

6:03 am on Nov 22, 2003 (gmt 0)




key,

Just combine all of it:


SetEnvIf Remote_Addr ^XXX\.XX\.XX\.X1$ ban
SetEnvIf Referer ^XXX\.XX\.XX\.X2$ ban
SetEnvIf Remote_Addr ^XXX\.XX\.XX\.X3$ ban
SetEnvIf Request_URI ^/(robots\.txt|custom403\.html)$ allowit
<Files *>
Order Deny,Allow
Deny from env=ban
Allow from env=allowit
</Files>

The Order directive (which has a misleading name - it should probably be called something like "precedence" or "priority") causes the Allow directive(s) to have precedence over the Deny directive(s), and so accomplishes what you want.
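
To make the difference concrete, here's a side-by-side sketch (the two containers are for comparison only, not meant to be used together in one file - the env variable names are the ones from the snippet above):

# Order Deny,Allow: a matching Allow overrides a matching Deny,
# so a request that sets both "ban" and "allowit" is let through.
<Files *>
Order Deny,Allow
Deny from env=ban
Allow from env=allowit
</Files>

# Order Allow,Deny: a matching Deny overrides a matching Allow,
# so the same request would get a 403 despite "allowit" being set.
<Files *>
Order Allow,Deny
Allow from env=allowit
Deny from env=ban
</Files>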

Jim

keyplyr

6:19 am on Nov 22, 2003 (gmt 0)




I guess I wasn't very clear Jim.

While I want even those UAs I ban from my site (X2, X3) to get my custom 403 page, or robots.txt if they request it, I do not wish to let X1 access any file at all. It is an experimental bot, poorly written by some science-lab students at a university, that requests robots.txt redundantly dozens of times per visit. I want it to go away - LOL

jdMorgan

6:47 am on Nov 22, 2003 (gmt 0)




Ah, I see...

I would avoid having multiple Order statements in one .htaccess file - that seems to cause the second group of denies to be ignored (you can try it, though... I could use a second test case.)

So that leaves either a stand-alone Deny from in a <Files> container (with no Order or Allow from directives in that container), or you could use a mod_rewrite deny to whack that 'bot separately:


RewriteCond %{REMOTE_ADDR} ^x\.x\.x\.x$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot1$
RewriteRule .* - [F]
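
For completeness, a self-contained version of that rewrite block might look like this (assuming RewriteEngine hasn't already been switched on earlier in the file, and with the IP address and User-Agent string as placeholders):

RewriteEngine On
# Match the bot by its IP address...
RewriteCond %{REMOTE_ADDR} ^x\.x\.x\.x$ [OR]
# ...or by its User-Agent string
RewriteCond %{HTTP_USER_AGENT} ^bad_bot1$
# Serve a 403 Forbidden for every request it makes
RewriteRule .* - [F]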

Jim

keyplyr

6:56 am on Nov 22, 2003 (gmt 0)




Great - If I used a 'mod_rewrite deny' for this bot, that would let me safely use 'Order Deny' directives for other temporary pests. Trouble is, I use a custom 403:

ErrorDocument 403 /forbidden.html

So every error that this bot creates by requesting robots.txt is doubled since it's also denied forbidden.html.

What could I add to this code to stop that?


RewriteCond %{REMOTE_ADDR} ^x\.x\.x\.x$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot1$
RewriteRule .* - [F]

Thanks

jdMorgan

7:00 am on Nov 22, 2003 (gmt 0)




Nothing you can do about that - you said you wanted to deny ALL files! :)

Well OK, allow the 403 page but not robots.txt by changing the Rule:


RewriteRule !^custom403\.html$ - [F]
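
Since the ErrorDocument in this case actually points at /forbidden.html rather than custom403.html, the full block for X1 might end up reading something like this (same placeholder IP and User-Agent as before; the file named in the negated pattern just has to match whatever ErrorDocument 403 serves):

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^x\.x\.x\.x$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^bad_bot1$
# Forbid everything except the custom 403 document itself, so the
# denial doesn't trigger a second 403 when the error page is served
RewriteRule !^forbidden\.html$ - [F]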

Jim

keyplyr

7:09 am on Nov 22, 2003 (gmt 0)




Hits for robots.txt have increased to over two hundred a day. If I allow forbidden.html, that's a lot of useless page loads, but by not allowing the forbidden.html page, the error logs get bigger. Ah, the bitter taste of irony.

jdMorgan

7:14 am on Nov 22, 2003 (gmt 0)




Yeah, about the only thing you can do is to ask your host to 'black hole' their IP address at your router. If you can get them to do it, you won't see any requests at all... but then you won't know if they ever fix their 'bot, so there's a second helping of irony.

Jim

keyplyr

7:25 am on Nov 22, 2003 (gmt 0)





about the only thing you can do is to ask your host to...

Oh no - I've learned to never, ever ask my host to do anything; they'll screw it up! I called them once, and I'd swear the guy posted here to find out how to do it!