RegEx help with RewriteCond

         

keyplyr

2:11 am on Nov 19, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




I'm blocking bots with abbreviated Mozilla variants:

RewriteCond %{HTTP_USER_AGENT} ^Moz(illa(/[1-9]\.[0-9]+)?)?$ [NC,OR]


Without adding a separate RewriteCond, how would I include blocking this UA?

Mozilla/4.0 (compatible;)


Thanks

wilderness

2:37 am on Nov 19, 2011 (gmt 0)


keyplyr,
Pretty sure you can modify the following to work:

RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9.]+\ \(compatible[^;)]
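As posted, the pattern ends in [^;)], which requires a character other than ";" or ")" right after "compatible", so it would need modifying before it catches the UA in question. A minimal sketch of that check follows. It uses Python's re module as a stand-in for Apache's regex engine, which is an assumption, and the "modified" pattern is a hypothetical variant, not something from the thread.

```python
import re

# wilderness's pattern as posted (the backslash-space is Apache-style escaping,
# which Python's re also accepts as a literal space).
posted = re.compile(r"^Mozilla/[0-9.]+\ \(compatible[^;)]")

# Hypothetical modification, not from the thread: allow an optional ';'
# and require the closing parenthesis at the very end of the string.
modified = re.compile(r"^Mozilla/[0-9.]+\ \(compatible;?\)$")

target = "Mozilla/4.0 (compatible;)"
msie = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"

# RewriteCond does an unanchored search, so re.search is the closest analogue.
posted_hits_target = posted.search(target) is not None
modified_hits_target = modified.search(target) is not None
modified_hits_msie = modified.search(msie) is not None

print(posted_hits_target, modified_hits_target, modified_hits_msie)
```

Under those assumptions, the posted pattern misses the target UA, while the modified variant catches it and still leaves a full MSIE string alone.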

lucy24

3:23 am on Nov 19, 2011 (gmt 0)


Do any legitimate UAs call themselves, say, "Mozilla/5.2 (compatible;)"? If not, it's probably easier just to shove in the "compatible" part, even if it will include numbers that never really occur. Besides, they probably will occur sooner or later.

Don't forget to escape the literal parentheses and the space!

keyplyr

3:39 am on Nov 19, 2011 (gmt 0)


it's probably easier just to shove in the "compatible" part - lucy24

Yes, that's what I'm asking for an example of.

lucy24

7:24 am on Nov 19, 2011 (gmt 0)


Oh, come on. You can do it.

:: grumble, grumble, OK, but only because you've been here since 2001 ::

Tack it on to the end of your existing formula.

Moz(illa(/[1-9]\.[0-9]+)?)?(\ ?\(compatible;?\ ?\))?$

Looks horrible, doesn't it? :) The " (compatible)" part is in its own separate parentheses after everything else. I tossed in a ; and closing space for good measure because robots like to do that kind of thing, and made everything optional.

Now just watch. Someone will come out with a "Mozilla/0.3" and then you'll be sunk. May as well change both to "0-9". (I use \d, but that's just because I assume things will work, unless Apache digs in its heels and refuses to play along.)
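To see what the combined pattern catches, here is a minimal sketch that runs it against a few sample user agents. It uses Python's re module as a stand-in for Apache's regex engine, which is an assumption. The leading ^ is taken from the original RewriteCond, re.IGNORECASE plays the role of [NC], and the is_blocked helper and the sample list are hypothetical.

```python
import re

# keyplyr's original pattern with the "(compatible;)" group tacked on the end.
pattern = re.compile(
    r"^Moz(illa(/[1-9]\.[0-9]+)?)?(\ ?\(compatible;?\ ?\))?$",
    re.IGNORECASE,
)

def is_blocked(user_agent: str) -> bool:
    """Return True if the user agent would satisfy the condition."""
    # RewriteCond does an unanchored search; the ^ and $ do the anchoring.
    return pattern.search(user_agent) is not None

samples = [
    "Moz",
    "Mozilla",
    "Mozilla/4.0",
    "Mozilla/4.0 (compatible;)",
    "Mozilla/5.0 (compatible)",
    "Mozilla/0.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]

results = {ua: is_blocked(ua) for ua in samples}
for ua, blocked in results.items():
    print("BLOCK" if blocked else "allow", ua)
```

In this sketch the "Mozilla/0.3" case slips through, which is the argument for widening [1-9] to [0-9]. Full browser and crawler strings are left alone because the closing $ demands that nothing follow the optional "(compatible;)" group.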

keyplyr

8:03 am on Nov 19, 2011 (gmt 0)


Terrific, thanks Lucy. You are a kind woman :)

Ya know, I've been writing RegEx for over 10 years, but even though I know what I want to do, sometimes I sit here staring at it and just can't get the perspective.


Also thanks wilderness, but I was looking for more specificity.

keyplyr

12:15 pm on Nov 20, 2011 (gmt 0)


What are the legit uses for these UAs?

Mozilla/4.0 (compatible;)

Mozilla/5.0 (compatible;)

Other than the occasional nefarious threat, I usually only see image/script caching from ISPs and EDU. Are there any known good-guys?

lucy24

1:37 pm on Nov 20, 2011 (gmt 0)


Now this is odd. I thought I was just making it up about the trailing semicolon, but riffling through this month's logs, that's the only form I do find. That is, (compatible;) as opposed to (compatible) with cleverly deployed format tags to avert unwanted winks. And the stranger thing is that they appear to be human.

:: looking more closely ::

This month: One repeater who just stopped by to download something. One human doing an authorized hotlink (site tracking), with multiple visits to same page. Last month (Spotlight doesn't "do" punctuation, so this is manual and two months is enough): Another hotlinker, this time just once. And an evident human getting a picture-intensive ebook without referer for images, implying that either their browser or their IP has an add-on.

Matter of fact, three of the four must have caught my attention in one way or another at the time, because I've already noted their IP addresses.

For the three that I've got more than one record for, the first hit starts out "Mozilla/4.0 (compatible" but then goes on into assorted forms of Internet Exploder 7 on Windows Vista. I smell a browser add-on, especially in conjunction with that one guy with a blocked referer.

Two of the four are from apparently legitimate regional government offices. Different countries. But, ahem, proper countries. Not Ukraine, in other words. The other two look like corporate IPs. Hm, cruising the web at work eh ;)

Wonder if the folks down in User Agent Identification can shed light?

g1smd

4:25 pm on Nov 20, 2011 (gmt 0)


with cleverly deployed format tags to avert unwanted winks
...or you can tick the "disable smilies for this post" box.

keyplyr

6:47 pm on Nov 20, 2011 (gmt 0)


Since adding this block to stop a couple of drive-by image scrapers, I was surprised by just how many hits had this UA. In one day I got about 30 different IPs. As I said, mostly what I think is image & script caching by ISPs, but also EDU and GOV, plus a handful yet to be determined.

Not very much considering the site gets close to 10k page loads daily, but still more than I expected. However, I still suspect these are machines not humans. Rise up against the machines!

lucy24

6:49 pm on Nov 20, 2011 (gmt 0)


Ah, but then I wouldn't be able to use the intentional smileys ;)

Are you back from vacation already? How time flies.

Pfui

6:55 pm on Nov 21, 2011 (gmt 0)


Mozilla/4.0 (compatible;)


I see that going for graphics simultaneously with legit browser hits from .mil, .gov, and assorted public sector Hosts. I've long blocked it with no apparent consequences (other than its log bloat).

keyplyr

8:11 pm on Nov 21, 2011 (gmt 0)


Thanks Pfui