htaccess rule condensing SetEnvIfNoCase

         

cyberdyne

1:48 pm on Feb 22, 2012 (gmt 0)

10+ Year Member



I'm trying to reduce the size of my htaccess by firstly condensing some 'SetEnvIfNoCase User-Agent' block rules.

Can someone please confirm that if I have 2 similar rules, eg:

SetEnvIfNoCase User-Agent ^Python u_a
SetEnvIfNoCase User-Agent ^Python-urllib/2.7 u_a
<Limit GET POST PUT HEAD>
Order Deny,Allow
Deny from env=u_a
</Limit>


  • The first rule will block the second U-A due to the '^'?
  • As I've used SetEnvIfNoCase, I do not need to end the lines with [NC].

Thank you as always.

wilderness

2:07 pm on Feb 22, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In this instance, the following covers both of your examples.

SetEnvIfNoCase User-Agent ^Python u_a
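
For anyone who wants to check this offline, here is a minimal Python sketch. It assumes Python's re module behaves like Apache's regex engine for a pattern this simple, and the sample User-Agent strings are made up for illustration.

```python
import re

# Hypothetical sample User-Agent strings, for illustration only.
user_agents = [
    "Python-urllib/2.7",
    "python-requests/2.0",
    "Mozilla/5.0 (compatible)",
]

# Rough equivalent of: SetEnvIfNoCase User-Agent ^Python u_a
pattern = re.compile(r"^Python", re.IGNORECASE)

# re.search tests the pattern anywhere in the string; the ^ is what
# restricts it to "begins with".
blocked = [ua for ua in user_agents if pattern.search(ua)]
print(blocked)
```

Both Python-style strings end up in the blocked list, so a separate rule for ^Python-urllib/2.7 adds nothing.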

cyberdyne

2:13 pm on Feb 22, 2012 (gmt 0)

10+ Year Member



Many thanks wilderness.

Presumably if I've used SetEnvIfNoCase then the case, anywhere, in the u-a string is irrelevant ?

wilderness

2:54 pm on Feb 22, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Presumably if I've used SetEnvIfNoCase then the case, anywhere, in the u-a string is irrelevant ?


Yes, as far as letter case is concerned.

That is NOT to be confused with the leading anchor (i.e., "begins with"), which governs where in the UA the keyword has to appear.
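
The difference between case and position can be sketched in Python, assuming its re module is close enough to Apache's regex engine for this purpose. The UA strings are hypothetical.

```python
import re

anchored = re.compile(r"^python", re.IGNORECASE)  # begins with, any case
contains = re.compile(r"python", re.IGNORECASE)   # anywhere, any case

# Hypothetical UA strings for illustration.
starts_upper = "PYTHON-urllib/2.7"
buried = "SomeTool/1.0 (Python)"

print(bool(anchored.search(starts_upper)))  # case differs, position fits
print(bool(anchored.search(buried)))        # case is fine, position does not fit
print(bool(contains.search(buried)))        # no anchor, so position is irrelevant
```

Ignoring case never relaxes the anchor: the keyword still has to sit at the start of the string for the anchored pattern to match.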

cyberdyne

3:04 pm on Feb 22, 2012 (gmt 0)

10+ Year Member



Thank you.

Are any character escapes '\' necessary ? eg:

SetEnvIfNoCase User-Agent ^C(CBot/1.0|egbfeieh|fetch|yberian) u_a


Should the forward slash and dot be escaped:

SetEnvIfNoCase User-Agent ^C(CBot\/1\.0|egbfeieh|fetch|yberian) u_a

wilderness

3:17 pm on Feb 22, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Are any character escapes '\' necessary ? eg:


Generally speaking, NO: escapes are not needed in mod_setenvif. However, there are some very rare and confusing exceptions that are just not consistent.

Less is better in this instance.

Personally, I prefer long-term black-listing of UA's in mod_rewrite and only use mod_setenvif for temporary UA's.

You have some issues here:
SetEnvIfNoCase User-Agent ^C(CBot/1.0|egbfeieh|fetch|yberian) u_a 


1) What's the leading C for?
2) You've used the caret ("begins with") anchor, and most of these UAs DO NOT begin with the keyword.

Change to "contains" (omitting any leading or trailing anchors):

SetEnvIfNoCase User-Agent (CBot/1.0|egbfeieh|fetch|yberian) u_a
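
On the escaping question, a small Python sketch shows what an unescaped dot does at the regex level. It assumes Python's re module treats "/" and "." the same way Apache's regex engine does, and the lookalike string is invented for the test.

```python
import re

# "/" is an ordinary character in the regex; only the dot is special.
unescaped = re.compile(r"CBot/1.0", re.IGNORECASE)
escaped = re.compile(r"CBot/1\.0", re.IGNORECASE)

real_ua = "CCBot/1.0"
lookalike = "CBot/1x0"  # made-up string, not a real UA

results = {
    "unescaped_real": bool(unescaped.search(real_ua)),
    "escaped_real": bool(escaped.search(real_ua)),
    "unescaped_lookalike": bool(unescaped.search(lookalike)),
    "escaped_lookalike": bool(escaped.search(lookalike)),
}
print(results)
```

The unescaped dot still matches the real UA; it is just slightly looser, because it also accepts any other single character in that position.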

wilderness

3:20 pm on Feb 22, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The more universal and cross-applicable you're able to make your blacklisted UA patterns, the fewer of them you'll need.

Always look for a single word, or a portion of a word, that is useful for multiple UAs.

cyberdyne

3:26 pm on Feb 22, 2012 (gmt 0)

10+ Year Member



I've just read elsewhere on the site and realised my error.

I was trying to condense by using "^ but left out the quotation marks:

"^C(CBot/1.0|egbfeieh|fetch|yberian)" u_a


This rule is for: CCBot/1.0, Cegbfeieh, Cfetch and Cyberian

I found the info here:
[webmasterworld.com...]

The page I've linked to also says I should escape spaces, which I've also not done, yet.

Hope that makes more sense.
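
A quick way to confirm that the factored pattern covers all four names is a Python sketch like the one below, assuming Python's re module stands in reasonably for Apache's regex engine. The two non-matching strings are hypothetical.

```python
import re

# The condensed rule: a leading C followed by one of four alternatives.
factored = re.compile(r"^C(CBot/1.0|egbfeieh|fetch|yberian)", re.IGNORECASE)

names = ["CCBot/1.0", "Cegbfeieh", "Cfetch", "Cyberian"]
matched = [n for n in names if factored.search(n)]
print(matched)

# Hypothetical strings that should NOT match the anchored, factored rule.
print(bool(factored.search("Curl/7.0")))   # starts with C, but "url" is not listed
print(bool(factored.search("fetch/1.0")))  # listed word, but no leading C
```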

wilderness

3:43 pm on Feb 22, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you simply reduce Cfetch to fetch, you'll catch some other pests using the latter in their UA.

The page I've linked to also says I should escape spaces, which I've also not done, yet.


I would advise doing so with caution.
You can simply avoid the entire issue in mod_setenvif by enclosing the pattern in quotes.
Ex:
"Cfetch 1.0 is here and gone"

does NOT require the use of escapes, and the quote anchors are known as the fourth and little-known container, "exactly as".

Begins with
ends with
contains
exactly as

are minimum comprehension for the beginner.
Anchors are the key to simplicity.
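
The four kinds of match can be sketched in Python as below. This assumes Python's re module approximates Apache's regex engine, and it assumes that "exactly as" corresponds to anchoring both ends of the pattern; the sketch does not model Apache's handling of quotation marks. The sample strings are hypothetical.

```python
import re

keyword = "fetch"

patterns = {
    "begins with": rf"^{keyword}",
    "ends with": rf"{keyword}$",
    "contains": keyword,
    "exactly as": rf"^{keyword}$",  # assumption: anchored at both ends
}

def classify(ua):
    """Return the names of all pattern types that match the given UA."""
    return [name for name, pat in patterns.items()
            if re.search(pat, ua, re.IGNORECASE)]

# Hypothetical sample strings.
for ua in ["fetch", "fetch/1.0", "Cfetch", "my fetcher bot"]:
    print(ua, "->", classify(ua))
```

The unanchored "contains" form matches every sample, which is why dropping anchors catches the widest set of UAs.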

cyberdyne

3:56 pm on Feb 22, 2012 (gmt 0)

10+ Year Member



Ah, I see, seems very useful indeed then.

Definitely worth looking into so I'll put it onto my 'very-soon-todo' list.

Many thanks for your help as always.

(edited my confusing reply) lol

lucy24

9:33 pm on Feb 22, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In some cases, escaping and quotation marks do the same job. Above everything else, this applies to spaces, which must be Dealt With in some way.

You can say
"widget foobar"
with quotation marks

or you can say
widget\ foobar
with an escape

Both of those mean "the space in the middle is a literal character that's part of the text I am looking at".

But if you say
widget foobar

then the space reverts to its normal Apache meaning: separating two parts of a statement. Depending on where you do it, you'll either get a very very unintended result, or your server will give up and serve nothing but 500 errors. This applies even in situations where you don't normally have to escape, like parentheses or grouping brackets. A dot [.] in brackets is a literal dot, but a space [ ] in brackets is still an Apache syntactical space.
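
At the regex level alone, an escaped space and a plain space are the same thing, and a bracketed dot is a literal dot. The Python sketch below illustrates only that part; it assumes Python's re module is a fair stand-in for Apache's regex engine and does not model how Apache splits a configuration line into arguments, which is where the unescaped space causes trouble.

```python
import re

text = "widget foobar"

escaped_space = re.compile(r"widget\ foobar")  # backslash-space: literal space
plain_space = re.compile(r"widget foobar")     # same regex once it reaches the engine

bracket_dot = re.compile(r"v1[.]0")            # [.] is a literal dot

print(bool(escaped_space.search(text)))
print(bool(plain_space.search(text)))
print(bool(bracket_dot.search("v1.0")))
print(bool(bracket_dot.search("v1x0")))
```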

cyberdyne

9:40 pm on Feb 22, 2012 (gmt 0)

10+ Year Member



Thank you Lucy, understood and noted ;-)

wilderness

3:23 am on Feb 23, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In some cases, escaping and quotation marks do the same job. Above everything else, this applies to spaces, which must be Dealt With in some way.


...but with different effects on your logs and how they display, so why risk it simply to be politically correct?

See my explanation near the bottom of this page [webmasterworld.com]

lucy24

6:30 am on Feb 23, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



cyberdyne, feel free to zone out. What follows is an information dump that has nothing directly to do with your question.

with different effects on your logs and their view

Depends where the escapes and/or quotation marks are. I've never managed to do anything within mod_setenvif or mod_rewrite that affected log format. Only the core directives. Someone in another thread came up with a fairly plausible explanation of why this would happen.

I took some time off here to experiment in my art studio's site, which never gets any human visitors so it doesn't matter if I make a mistake that results in 97 more errors than I intended. Hahaha.

Interesting discovery: The quotation marks have no effect on anything except spaces. I tested, first in mod_rewrite, by locking myself out using a piece of my UA that contains literal spaces, literal periods, literal parentheses-- and a very uncommon browser name.

I can say
Mybrowser/2\.1\.1\ \(like\ Otherbrowser/3\.6\.27\)
or I can say
"Mybrowser/2\.1\.1 \(like Otherbrowser/3\.6\.27\)"

Both work to lock me out, and neither has any effect on logs.

But if I accidentally say
Mybrowser/2\.1\.1\ \(like Otherbrowser/3\.6\.27\)
forgetting to escape one space, I get whapped with a 500 error-- one that's so severe, it can't even display the custom 500 document.

And if I say
"Mybrowser/2\.1\.1 (like Otherbrowser/3\.6\.27)"
trusting the quotation marks to preserve the literal parentheses, it doesn't work. That is, I am not locked out, because mod_rewrite thinks they are capture-parentheses, not part of the string.

Moving on to mod_setenvif, I can escape spaces and say
BrowserMatch Mybrowser/2\.1\.1\ \(like\ Otherbrowser/3\.6\.27\) keep_out

or I can use quotation marks and say
BrowserMatch "Mybrowser/2\.1\.1 \(like Otherbrowser/3\.6\.27\)" keep_out

but if I say
BrowserMatch Mybrowser/2\.1\.1 \(like Otherbrowser/3\.6\.27\) keep_out
... nope, guess again. It does lock me out; it doesn't yield a 500 error. Hint: You can randomly change anything after the "Mybrowser" part, for example to say Otherbrowser/3\.7\.27 with a different number, and nothing will change. (Matter of fact I'm pretty pleased with how fast I figured this out. Normally I have to throw myself on someone else's mercy for an explanation.)

Finally moving to the core directives. If I change the minor robotic block
Deny from 38.101.148.96/27

to
Deny from 38\.101.148.96/27

I get walloped with another 500 error. It's been hit with two mutually exclusive ways of representing the same numerical range.

If I cut off the /27 and say only
Deny from 38\.101.148.96

...that's when the logs switch over to resolved-IP mode. Instead of
aa.bbb.ccc.ddd
I am now
adsl-aa-bbb-ccc-ddd.dsl.snfc21.pacbell.net

... but only for the duration of the change. When I set everything back the way it was, logs go back to normal too.
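
The parentheses result from the mod_rewrite test can be reproduced at the regex level with a short Python sketch. It assumes Python's re module treats escaped and unescaped parentheses the way Apache's regex engine does, and it says nothing about quoting or log formats.

```python
import re

ua = "Mybrowser/2.1.1 (like Otherbrowser/3.6.27)"

# Escaped parentheses are literal characters.
literal_parens = re.compile(r"Mybrowser/2\.1\.1 \(like Otherbrowser/3\.6\.27\)")

# Unescaped parentheses form a capture group and match no characters themselves.
group_parens = re.compile(r"Mybrowser/2\.1\.1 (like Otherbrowser/3\.6\.27)")

print(bool(literal_parens.search(ua)))
print(bool(group_parens.search(ua)))

# The grouped pattern only matches a string without the parentheses.
no_parens = "Mybrowser/2.1.1 like Otherbrowser/3.6.27"
print(bool(group_parens.search(no_parens)))
```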

wilderness

7:03 am on Feb 23, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



lucy, lucy, lucy. . . .

As I explained in the other thread, . . .

Are you able to provide a URL for Apache or PCRE documentation that covers the designation of special characters as applied to mod_setenvif and mod_access?

I'd certainly be interested in reading it.

cyberdyne

4:00 pm on Feb 23, 2012 (gmt 0)

10+ Year Member



Unfortunately, the following rule in my htaccess did not seem to work last night for CCBot/1.0 ('CCBot' is also disallowed in robots.txt, but as expected it completely ignored that).

U-A: CCBot/1.0 (+http://www.commoncrawl.org/bot.html)

SetEnvIfNoCase User-Ugent "c(cbot/1.0|egbfeieh|fetch|yberian)" bad_bot
<Limit GET POST PUT HEAD>
Order Deny,Allow
Deny from env=bad_bot
</Limit>


Can anyone please advise as to why this might have happened?

I was under the impression IfNoCase negated the need for the case to match.

Thanks in advance

wilderness

4:22 pm on Feb 23, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How do you spell Agent?

"User-Ugent"
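
Why the misspelled header name makes the rule silently do nothing can be simulated in Python. This is only a sketch: the headers dictionary and the would_set_env helper are hypothetical, and it assumes that a header which is not present gets tested as an empty string.

```python
import re

# Simulated request headers; the UA string is the one quoted in the thread.
headers = {"User-Agent": "CCBot/1.0 (+http://www.commoncrawl.org/bot.html)"}

pattern = re.compile(r"c(cbot/1.0|egbfeieh|fetch|yberian)", re.IGNORECASE)

def would_set_env(header_name):
    """Return True if the pattern matches the named header's value."""
    # Assumption: a missing header is treated as an empty string.
    value = next(
        (v for k, v in headers.items() if k.lower() == header_name.lower()),
        "",
    )
    return bool(pattern.search(value))

print(would_set_env("User-Ugent"))  # misspelled: nothing to match against
print(would_set_env("User-Agent"))  # correct spelling: pattern matches
```

The pattern itself is fine, and ignoring case works as expected; only the header name was wrong.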

cyberdyne

4:24 pm on Feb 23, 2012 (gmt 0)

10+ Year Member



Haha, what a fool I am!

User-Ugent

Thanks wilderness ;)

lucy24

12:08 am on Feb 24, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As I explained in the other thread, . . .

The other thread is nothing but explanation from bottom to top, except when it digresses to quote other threads. I can't figure out which bit you're referring to :( But it feels good to spell "refer" with the correct number of r's.

Anyway, my post was about actual real-life experimental results derived from changing one thing at a time.

I found the explanation I was thinking of. Over here [webmasterworld.com] phranque said:
my guess is that anything in an Allow or Deny directive that isn't obviously a simple IP address including comments and regular expressions may look like a possible hostname and the double reverse DNS lookup is in effect.
once you have the remote hostname it uses that for the %h value in the default common log format.

i would assume you could have it both ways by using a custom log format that specifies %a (Remote IP-address) in the first column.

(When you're on shared hosting, I in turn would assume that setting a custom log format isn't an option.) That was referring to something that was correctly formatted, so it didn't spit up a 500 error, it just changed the log format.

I think we're going at it backward anyway. afaik, the Deny from / Allow from directives are the only place where you can use CIDR format; everywhere else it's either literal text or a Regular Expression. But I can go test that too.
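
For reference, the CIDR range from the Deny from line earlier in the thread can be expanded with Python's standard ipaddress module. This is a sketch for checking which addresses the /27 covers; the two test addresses are picked arbitrarily.

```python
import ipaddress

# The range used in: Deny from 38.101.148.96/27
block = ipaddress.ip_network("38.101.148.96/27")

first, last = block[0], block[-1]
print(first, "to", last, "=", block.num_addresses, "addresses")

# Arbitrary addresses to test membership.
inside = ipaddress.ip_address("38.101.148.100") in block
outside = ipaddress.ip_address("38.101.148.128") in block
print(inside, outside)
```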