"Nasty-User-Agent Firewall" .htaccess file

Could someone help explain this person's usage or regex?

         

MickeyRoush

8:51 am on Mar 8, 2012 (gmt 0)

10+ Year Member



I've learned a lot on the forums here, but I was hoping someone could explain the regex syntax that this person is using to block user agents via .htaccess.

Here is a snippet:

RewriteCond %{HTTP_USER_AGENT} \bj(?:a(?:karta.commons|va)|e(?:nnybot|tcar)|ikespider|oc.web.spider) [NC,OR]

The whole file can be seen here:
[zipsbazaar.co.uk...]

\bj
I'm assuming that's some type of word boundary?

?:a
What does this do? Is it some type of back reference?

Thanks for anyone's input.

lucy24

11:55 am on Mar 8, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, \b is a word break. In front of a letter, as here, it means that the preceding character is anything other than an alphanumeric or _ (or that there is no preceding character at all). In most Regular Expressions, a hyphen - or apostrophe ' would count as a word break. This may or may not be desirable in mod_rewrite.

(?:blahblah) is simply a non-capturing group. It saves you from having to keep track of which ones you're using so you don't end up with a rule that says

/$3/directory/file$1.php?%4%2%7

But if they're not doing anything with the parentheses anyway-- I assume this all leads to a bald [F] --it doesn't really make any difference. [Edit: After looking at the file I see the rule ends up with [G]. I think there's a sort of meta-rule against lying to robots, isn't there? You can refuse to let them in the store, but you can't say you don't carry the item they want.]
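To make the capturing versus non-capturing distinction concrete, here is a minimal sketch in Python. Python's `re` module is not the PCRE engine Apache uses, but it treats `(?:...)` and group numbering the same way for this pattern. The sample user-agent string and the `archive/42` path are made up for illustration.

```python
import re

# Same alternation written twice: once with plain (capturing) parentheses,
# once with (?:...) non-capturing groups. IGNORECASE stands in for [NC].
capturing = re.compile(r"\bj(a(karta.commons|va)|e(nnybot|tcar))", re.IGNORECASE)
non_capturing = re.compile(r"\bj(?:a(?:karta.commons|va)|e(?:nnybot|tcar))", re.IGNORECASE)

ua = "Java/1.6.0_26"  # hypothetical user-agent string
m1 = capturing.search(ua)
m2 = non_capturing.search(ua)

# Both forms match exactly the same text.
print(m1.group(0), m2.group(0))

# The capturing form stores three numbered groups; the other stores none.
print(capturing.groups, m1.groups())
print(non_capturing.groups, m2.groups())

# Mixing the two keeps numbering simple: only the group you want is $1.
wanted = re.search(r"^(?:old|archive)/(\d+)$", "archive/42")
print(wanted.group(1))
```

In a rule that ends in a bare [F] or [G] nothing reads the numbered groups, so the two forms block the same requests; the difference only shows up once a substitution refers to $1, $2 and so on.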

Lessee now...

\bj(?:a(?:karta.commons|va)|e(?:nnybot|tcar)|ikespider|oc.web.spider) [NC,OR]

\bj(nnn)
= contains a word that starts with j

(a(nnn)|e(nnn)|ikespider|oc.web.spider)
= next bit is "a" plus some stuff OR "e" plus some stuff OR "ikespider" (hah! I've blocked that one by IP) OR "ocNwebNspider" where N can be any single character including a space or punctuation.

a(karta.commons|va)
= "a" can be followed by "kartaNcommons" OR "va" with N as above

e(nnybot|tcar)
= "e" can be followed by "nnybot" OR "tcar".

So the options in this line are:

jakartaNcommons
java
jennybot
jetcar
jikespider
jocNwebNspider
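The expansion above can be checked mechanically. This is a rough sketch in Python rather than in Apache itself, with `re.IGNORECASE` standing in for the [NC] flag; the user-agent strings are invented examples, one per branch, plus two that should pass through.

```python
import re

# The pattern from the RewriteCond line, unchanged.
PATTERN = re.compile(
    r"\bj(?:a(?:karta.commons|va)|e(?:nnybot|tcar)|ikespider|oc.web.spider)",
    re.IGNORECASE,
)

def is_blocked(user_agent):
    """Return True if the pattern matches anywhere in the user-agent string."""
    return PATTERN.search(user_agent) is not None

# Hypothetical user agents mapped to the result we expect.
samples = {
    "Jakarta Commons-HttpClient/3.1": True,      # jakartaNcommons, N = space
    "Java/1.6.0_26": True,                       # java
    "Mozilla/5.0 (compatible; JennyBot)": True,  # jennybot after a space
    "JetCar": True,                              # jetcar
    "jikespider": True,                          # jikespider
    "Joc Web Spider": True,                      # jocNwebNspider, N = space
    "Mozilla/5.0 (X11; Linux) Firefox/10.0": False,
    "superjava/1.0": False,                      # "j" is not at a word boundary
}

for ua, expected in samples.items():
    print(f"{ua!r}: blocked={is_blocked(ua)} expected={expected}")
```

The last sample shows what \b buys: "java" buried inside a longer word is left alone.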

The whole file can be seen here:
[zipsbazaar.co.uk...]
or, in my case, here:
[zipsbazaar.co.uk...]

--which says, accompanied by insulting graphic, "This Site No Longer Supports Your Current Browser". This strikes me as a load of ### since I'll bet they never did "support" my browser. Since when does a plain .txt file require browser support? It opened fine in Safari, and would have opened fine in Camino if I'd felt like jumping through hoops to spoof some other browser.*

But y'know what? That just reinforces the impression I got from the sample line of code, which to me suggests someone who isn't as smart as he thinks he is. Consider only...

.
(get it?)
.
(hint)
.
(and so on)


* Just a couple of days ago I saw a brand-new (2011) site that contained the line
To view this website in its proper format using Internet Explorer, please use IE 7 and higher.
If only I could figure out the syntax maybe I could... Oh, never mind.

g1smd

8:05 pm on Mar 8, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Using a non-capturing group with the (?:pattern) is a little bit quicker than without the ?: part.

When processor cycles count, this is a good thing to get into the habit of using. I use it from time to time, but rarely include it in code posted here since most participants already have problems getting their head round the basic functionality.

iamzippy

10:10 pm on Mar 8, 2012 (gmt 0)

10+ Year Member



@MickeyRoush

There's a URL in the text file that lets you contact me directly if you have questions about, or comments on, the firewall expressions. I've had a few by now. I'm puzzled why you brought it here.

Certain wags might read '\bj' as something NSFW, but you're right, it's matching a word boundary followed by the letter 'j'. Most undesirable UA substrings don't start (^) or end ($) the UA string, but I found skipping to word boundaries (as a type of anchor) moved things on apace. I did a LOT of testing.

The reason that the strings are chopped up so much is to reduce the amount of 'backtracking' done by the regex engine. The reason for the non-capturing '(?:' is to avoid maintaining unnecessary back-references, which slow the engine down. When you have as many alternations and repetitions as this filter does, you need to try to make it as light as you can.


@g1smd

Spot-on. And when you have a lot of parentheses doing alternation, you don't want to drag your heels capturing back-references you don't intend to use.

The text file we're talking about is not currently published on (nor linked-to within) my site. The only link I ever posted to it was a reference in a reply to a comment at perishablepress.com, and MickeyRoush was referred to it from there. The only other link I'm aware of at this time is the one (now) here at WebmasterWorld. I'm grateful for that much, I guess.


@lucy24

Yeah, turning a [G] into a [F] is such a fag.

I guess 'charitable' isn't your natural instinct, eh? Looks like you fell foul of the filter you're trashing.

I picked up your Camino visit from the logs. I can assure you that 'Camino' is not blocked. Your visit was redirected to the advisory page because Camino pulls a name-dropping ricket in its UA string -- "Camino/2.1.1 (like Firefox/3.6.27)". You don't need to jump through many hoops to be 'Camino' without being 'like Firefox', but if it's too much effort perhaps you can take it up with your browser vendor.

I block Firefox versions below 4.0 because they lack support for the HTML5 'required' and 'aria-required' attributes on form inputs among other things. Besides, according to the Mozilla page for Firefox 3.6 it is officially 'Out Of Date'.

FWIW, I qualitatively block legacy versions of all major browsers because with few exceptions, they're prima facie evidence of spamming or hacking attempts. Or WebSense.

In your case, the reason you were redirected to a support page rather than being flat blocked is that you were using HTTP/1.1.

Oh, and would you be kind enough to enlarge upon:

"someone who isn't as smart as he thinks he is. Consider only...

.
(get it?)
.
(hint)
.
(and so on)"

You lost me there, Lucy. (10/10 for your analysis of the regex, btw)

[edited by: iamzippy at 10:48 pm (utc) on Mar 8, 2012]

g1smd

10:33 pm on Mar 8, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The comment was maybe a little harsh, but corrects the oft-seen error of unescaped literal periods.

That is " \. " matches a period, but " . " matches ANY character.
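A small Python sketch of the difference, using "jakarta.commons" from the line under discussion. The variant spellings are made up; Python's `re` handles the dot the same way PCRE does here.

```python
import re

loose = re.compile(r"jakarta.commons", re.IGNORECASE)    # . = any one character
strict = re.compile(r"jakarta\.commons", re.IGNORECASE)  # \. = a literal period

# Hypothetical spellings a client might send.
variants = [
    "Jakarta.Commons",
    "Jakarta Commons",
    "Jakarta-Commons",
    "Jakarta_Commons",
    "JakartaCommons",   # no separator at all
]

loose_hits = [v for v in variants if loose.search(v)]
strict_hits = [v for v in variants if strict.search(v)]

print("unescaped . matches:", loose_hits)
print("escaped \\. matches:", strict_hits)
```

The unescaped form accepts any single separator character but still requires exactly one, so the spelling with no separator slips past both patterns.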

iamzippy

10:55 pm on Mar 8, 2012 (gmt 0)

10+ Year Member



It's by design. Who knows what a malfeasant might change that character to, in order to swerve the filter?

Anyone who sees that as a coding gaffe doesn't understand the subject well enough to comment on it.

MickeyRoush

4:32 am on Mar 9, 2012 (gmt 0)

10+ Year Member



@ iamzippy.

Sorry for not asking you directly. I always go first to the teachers who have helped me the most in the past, since I've already built up a base understanding with them. If they couldn't have helped me, then the next step for me would have been to contact you.

Thanks for everyone's replies. It's helped me a lot.

MickeyRoush

4:50 am on Mar 9, 2012 (gmt 0)

10+ Year Member



This brings up another question. The usage of \b will not work with Apache 1.x servers. Or am I wrong?

Believe it or not, there are quite a few of those still in use. I know, I know, it's no longer supported. :(

MickeyRoush

6:12 am on Mar 9, 2012 (gmt 0)

10+ Year Member



And I have another question regarding the use of a non-capturing group with (?:pattern).

In this example, authored by g1smd, would it be wise to implement it?

RewriteEngine On
RewriteCond %{QUERY_STRING} concat[^\(]*\( [NC,OR]
RewriteCond %{QUERY_STRING} union([^s]*s)+elect [NC,OR]
RewriteCond %{QUERY_STRING} union([^a]*a)+ll([^s]*s)+elect [NC]
RewriteRule .* - [F]


To this:

RewriteEngine On
RewriteCond %{QUERY_STRING} concat[^\(]*\( [NC,OR]
RewriteCond %{QUERY_STRING} union(?:[^s]*s)+elect [NC,OR]
RewriteCond %{QUERY_STRING} union(?:[^a]*a)+ll(?:[^s]*s)+elect [NC]
RewriteRule .* - [F]
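One way to convince yourself that such a rewrite changes nothing about what gets blocked is to run both sets of patterns over the same inputs. The sketch below does that in Python, with every group in the second set made non-capturing; the query strings are invented, and `re.IGNORECASE` stands in for [NC].

```python
import re

original = [
    r"concat[^\(]*\(",
    r"union([^s]*s)+elect",
    r"union([^a]*a)+ll([^s]*s)+elect",
]
rewritten = [
    r"concat[^\(]*\(",
    r"union(?:[^s]*s)+elect",
    r"union(?:[^a]*a)+ll(?:[^s]*s)+elect",
]

def blocked(query_string, patterns):
    """Mimic the [OR]-chained conditions: any one match blocks the request."""
    return any(re.search(p, query_string, re.IGNORECASE) for p in patterns)

# Hypothetical query strings mapped to the result we expect.
queries = {
    "id=1 union select password from users": True,
    "id=1+UNION+ALL+SELECT+1,2": True,
    "q=concat(user,0x3a,pass)": True,
    "page=2&sort=name": False,
    "q=reunion+tickets": False,
}

for qs, expected in queries.items():
    print(qs, blocked(qs, original), blocked(qs, rewritten), expected)
```

Both sets give the same verdict on every input; the only thing the rewrite removes is the bookkeeping for groups that the [F] rule never reads.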

lucy24

7:15 am on Mar 9, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The usage of \b will not work with Apache 1.x servers.

They changed RegEx dialects ("flavors") somewhere along the line. See if \W works instead. It means exactly the same thing when you're not capturing. Futile of course if they don't recognize basics such as \w and \d.

But in any case it's not a huge issue. A robot that's sneaky enough to replace a blocked \. with an unblocked . is also smart enough to bypass the block by sticking some random letter onto the front of its name. I've seen them :)

iamzippy

11:05 am on Mar 9, 2012 (gmt 0)

10+ Year Member



@MickeyRoush:

OK buddy, no offence taken. The patterns are part of a larger suite of .htaccess filters I use when I can't fiddle with the main server config. I'm about to post an article about the project; it makes sense to have this kind of discussion through comments on the blog itself, as a handful of folks have already pointed out.

Apache 1.x will probably choke on more than the \b in the patterns. There are look-aheads and look-behinds in there, too. Anyway, I don't give a hoot about legacy server software any more than I do about legacy browsers. They're all bringing the Web to its knees as it is.

As for the g1smd-authored rewrite examples you posted above, the patterns are so simple the gain will be insignificant. I wouldn't bother. Using the non-capturing syntax really pays off when iterating long lists of complex alternations, and in tight loops. HTH.


@Lucy24:

The advantage of \b is that it's a zero-width assertion, like ^ and $, so it requires no storage. \W on the other hand, will match any non-word character. That requires a comparison against the classed subset [A_Za-z0-9_] at each position, so it's not exactly a substitute. I don't know where you're going with that.
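The practical difference is easy to see in a quick Python sketch (the user-agent strings are made up; `re` agrees with PCRE on both \b and \W for ASCII input): \b consumes nothing and also succeeds at the very start of the string, whereas \W has to consume one real character.

```python
import re

with_b = re.compile(r"\bjava", re.IGNORECASE)   # zero-width assertion
with_W = re.compile(r"\Wjava", re.IGNORECASE)   # must consume a non-word char

at_start = "Java/1.6.0_26"            # "Java" is the first thing in the string
in_middle = "Mozilla/4.0 (Java/1.6)"  # "Java" follows an opening parenthesis
embedded = "superjava/1.0"            # "java" sits inside a longer word

print(with_b.search(at_start))   # matches "Java"
print(with_W.search(at_start))   # None: there is no character before the J
print(with_b.search(in_middle).group(0))
print(with_W.search(in_middle).group(0))  # the "(" is part of the match
print(with_b.search(embedded), with_W.search(embedded))  # neither matches
```

So swapping \b for \W would let through any agent whose name begins the user-agent string, which is a common position for it.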

The .htaccess firewall is in a constant state of flux, and it's updated almost daily. I agree it's trivial (think:Whackamole) for a scumbag to modify the UA, and I've recorded many malformations that are clearly intended to circumvent regex filtering (of the naive kind). My patterns are tested against thousands of UA strings taken from the wild. I run daily access logs through a script that checks to see what got caught, what got away, and what's new. When I pick up a new variant from the logs, it usually takes no more than a few minutes to figure out or change a pattern to nail it -- whether they're adding, changing or removing stuff, it doesn't matter. Regular expressions always win.
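The log-checking routine described above is not published in this thread, so the following is only a guess at its general shape, sketched in Python. The function name `triage`, the sample user agents, and the idea of keeping a set of previously seen strings are all assumptions for illustration.

```python
import re

# One line of the filter, used here as a stand-in for the whole rule set.
BLOCK = re.compile(
    r"\bj(?:a(?:karta.commons|va)|e(?:nnybot|tcar)|ikespider|oc.web.spider)",
    re.IGNORECASE,
)

def triage(user_agents, previously_seen):
    """Split user agents into caught, got away, and never seen before."""
    caught, got_away, new = [], [], []
    for ua in user_agents:
        if BLOCK.search(ua):
            caught.append(ua)
        else:
            got_away.append(ua)
        if ua not in previously_seen:
            new.append(ua)
    return caught, got_away, new

# Hypothetical user agents pulled from one day's access log.
todays_agents = [
    "Java/1.6.0_26",
    "Mozilla/5.0 (Windows NT 6.1) Firefox/10.0",
    "J0c Web Spider",   # a digit swapped in for the letter "o"
]
seen_before = {"Java/1.6.0_26"}

caught, got_away, new = triage(todays_agents, seen_before)
print("caught:", caught)
print("got away:", got_away)
print("new:", new)
```

Anything that lands in both "got away" and "new", such as the altered spider name here, is a candidate for a pattern tweak.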



Thanks to all for your feedback, it's been a buzz.

iamzippy

3:34 pm on Mar 9, 2012 (gmt 0)

10+ Year Member



That should read '[A-Za-z0-9_]'. Fat fingers ;)

MickeyRoush

6:02 pm on Mar 9, 2012 (gmt 0)

10+ Year Member



@ iamzippy

Thanks for your replies and insight. I look forward to any discussion on your blog.

g1smd

7:34 pm on Mar 9, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



[A-Za-z0-9_]

Note that
[a-z0-9_] 
with
[NC]
flag will parse 33% faster.

Use the ?: syntax if you like, there's nothing to lose, and potentially a small gain to be had.


Apache 1 and early Apache 2 use POSIX for Regular Expressions. Apache 2 point something onwards allows for PCRE to be used. I forget which version saw the change.

iamzippy

7:44 pm on Mar 9, 2012 (gmt 0)

10+ Year Member



Spot-on, g1. But the point I was making was that \W is shorthand for [^A-Za-z0-9_]. Case not relevant.

g1smd

7:58 pm on Mar 9, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That'll teach me to scan read while editing some PHP code in another window. :)