RegExp Needed for Ugly UA strings

I've seen it here somewhere

         

blend27

3:32 pm on Nov 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I see some scrapers going back to using random UAs like iscdAv1gAtAgnm1g or xncAbylnlcgdyAm, all kinds of nonsense.

I've seen a RegExp here at WW to catch that stuff, but can't seem to locate it.

Can someone point me in the right direction?

Thanks in Advance!

Mokita

2:45 am on Nov 6, 2007 (gmt 0)

10+ Year Member



I use this:

RewriteCond %{HTTP_USER_AGENT} !^Mozilla
RewriteCond %{HTTP_USER_AGENT} !^NSPlayer
RewriteCond %{HTTP_USER_AGENT} ^[a-z0-9\ ]{15,}$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} [b-df-hj-np-tvwxz]{5,} [NC]
RewriteRule .* - [F,L]

I think it was originally supplied here by jdMorgan, but I had to modify it slightly when I found it was blocking legitimate Windows Media Player users. If you don't have any media files on your site you can omit the NSPlayer line.

[edited by: Mokita at 2:52 am (utc) on Nov. 6, 2007]
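
For anyone who wants to try user-agent strings against this rule set outside Apache, here is a minimal sketch of the same logic in Python. Python and the helper name `is_blocked` are assumptions for illustration only; the thread itself only shows the mod_rewrite form. The first two conditions are case-sensitive exemptions, and the last two are joined by `[OR]`, so a request is refused only when it is not exempt and at least one of the two remaining patterns matches.

```python
import re

# Direct translations of the four RewriteCond patterns quoted above.
# [NC] becomes re.IGNORECASE. The "\ " escape is only needed in Apache
# config syntax, so a plain space is used inside the character class.
STARTS_MOZILLA = re.compile(r"^Mozilla")      # no [NC]: case-sensitive
STARTS_NSPLAYER = re.compile(r"^NSPlayer")    # no [NC]: case-sensitive
THIRD_PATTERN = re.compile(r"^[a-z0-9 ]{15,}$", re.IGNORECASE)
FOURTH_PATTERN = re.compile(r"[b-df-hj-np-tvwxz]{5,}", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Hypothetical helper: True when the rule set would answer 403."""
    if STARTS_MOZILLA.search(user_agent):
        return False
    if STARTS_NSPLAYER.search(user_agent):
        return False
    # The third and fourth conditions are OR'ed together.
    return bool(
        THIRD_PATTERN.search(user_agent) or FOURTH_PATTERN.search(user_agent)
    )


if __name__ == "__main__":
    for ua in ("iscdAv1gAtAgnm1g", "NSPlayer/9.0.0.3265 WMFSDK/9.0"):
        print(ua, "->", "blocked" if is_blocked(ua) else "allowed")
```

This only mirrors the matching logic; it does not reproduce anything else mod_rewrite does, such as per-directory prefix handling.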

jdMorgan

3:29 am on Nov 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Interesting, Mokita...

I don't see how the NSPlayer UAs would need a special exception, based on the two active patterns above as compared to the NSPlayer UA samples I have. If by chance you captured a sample NSPlayer UA that was caught by the two patterns before you added the exceptions to the rule, I'd like to see it and add it to my list.

As with many general rules like this, these simple patterns are somewhat dangerous. There's always a chance that a new and legitimate user-agent might be released that matches one or both patterns, so if you use rules like this, it's a good idea to review your 403 errors often and add exceptions as needed. This thread illustrates that point well.

Jim
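
One way to do that review is to tally the user-agents behind every 403 response. The sketch below is an assumption-laden example, not something posted in this thread: it is written in Python, it assumes the access log uses the combined log format (the layout of the sample entries posted later in this thread), and the names `LOG_LINE` and `forbidden_agents` are made up for illustration.

```python
import re
from collections import Counter

# Combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def forbidden_agents(lines):
    """Count user-agent strings on lines whose status code is 403."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("status") == "403":
            counts[match.group("agent")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python review403.py < access.log
    for agent, hits in forbidden_agents(sys.stdin).most_common():
        print(f"{hits:6d}  {agent}")
```

Lines that do not fit the assumed format are silently skipped, so a log with a custom LogFormat would need its own expression.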

Mokita

4:26 am on Nov 6, 2007 (gmt 0)

10+ Year Member



Hi Jim,

Here are some samples from the logs of four different UAs for Media Player that were caught by the rules:

123.3.41.nnn - - [28/Feb/2007:16:40:28 +1100] "GET /media/file1.wmv HTTP/1.1" 403 - "-" "NSPlayer/9.0.0.3265 WMFSDK/9.0"

129.94.6.nn - - [01/Mar/2007:10:18:30 +1100] "GET /media/file1.wmv HTTP/1.1" 403 - "-" "NSPlayer/10.0.0.3702 WMFSDK/10.0"

211.30.190.nnn - - [02/Mar/2007:08:52:57 +1100] "GET /media/file1.wmv HTTP/1.1" 403 - "-" "NSPlayer/10.0.0.4054 WMFSDK/10.0"

121.44.237.nn - - [02/Mar/2007:12:03:59 +1100] "GET /media/file1.wmv HTTP/1.1" 403 - "-" "NSPlayer/11.0.5721.5145 WMFSDK/11.0"

As soon as I became aware there was a problem and added the NSPlayer exception, the media files became accessible to legitimate users.

HTH.

keyplyr

7:16 am on Nov 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the info on the NSPlayer UA, Mokita.

I also use a similar rewrite rule to filter out random number/letter UA strings, but was unaware of any potential to accidentally block NSPlayer requests. The longer UAs may be recent?

Mokita

7:55 am on Nov 6, 2007 (gmt 0)

10+ Year Member



keyplyr wrote:
The longer UAs may be recent?

As you can see from the log entries I posted above, my wake-up call happened in very early March this year. The rules had been in that site's .htaccess ever since Jim first posted them here (not sure exactly when that was), but the site did not contain any media files until 28 Feb 2007. So I have no way of knowing if older Media Player UAs might have been affected.

I'm embarrassed to admit it took three days before I noticed the problem.

[edited by: Mokita at 8:08 am (utc) on Nov. 6, 2007]

blend27

1:03 pm on Nov 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



!^Mozilla
!^NSPlayer
^[a-z0-9\ ]{15,}$ [NC,OR]
[b-df-hj-np-tvwxz]{5,} [NC]

Thanks Mokita. There is one thing I forgot to mention: I am not that big on RewriteCond, as I am not on an Apache server.
The RegEx would be used within the ColdFusion function isValid("regex", value, pattern). What I am looking for is the pattern.

So could you be so kind as to explain what that was? Or maybe Jim will step in.

Thanks in Advance!
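
Outside mod_rewrite, the four conditions can be folded into one expression by turning the two negated conditions into negative lookaheads. The following is only a sketch in Python (an assumption; the thread targets Apache and ColdFusion), and the names `COMBINED` and `looks_random` are hypothetical. Whether lookaheads and the scoped `(?i:...)` flag are supported by the regex engine behind a given function, including isValid, is something to verify before porting.

```python
import re

# ^(?!Mozilla)(?!NSPlayer)  -> the two case-sensitive exemptions
# (?i:...)                  -> the [NC] flag, limited to the last two patterns
# first alternative         -> third condition, anchored at both ends
# second alternative        -> fourth condition, allowed anywhere in the string
COMBINED = re.compile(
    r"^(?!Mozilla)(?!NSPlayer)"
    r"(?:(?i:[a-z0-9 ]{15,})$|.*(?i:[b-df-hj-np-tvwxz]{5,}))"
)


def looks_random(user_agent: str) -> bool:
    """Hypothetical helper: True when the combined pattern matches."""
    return COMBINED.search(user_agent) is not None


if __name__ == "__main__":
    for ua in (
        "xncAbylnlcgdyAm",
        "Opera/9.24 (Windows NT 5.1; U; en)",
        "NSPlayer/10.0.0.3702 WMFSDK/10.0",
    ):
        print(ua, "->", looks_random(ua))
```

Keeping the four patterns separate and combining them in application code is an equally valid choice and is easier to extend with new exceptions.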

jdMorgan

2:32 pm on Nov 12, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, so it's the trailing WMFSDK token (as in WMFSDK/9.0) that is tripping the second of the two active patterns.
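
A quick way to confirm which part of a user-agent trips that pattern is to list every run it matches. This is a small Python sketch (Python and the helper name `matching_runs` are assumptions, not part of the original rule) applied to the NSPlayer samples posted above.

```python
import re

# The second active pattern from the rule set, with [NC] as re.IGNORECASE.
SECOND_ACTIVE = re.compile(r"[b-df-hj-np-tvwxz]{5,}", re.IGNORECASE)


def matching_runs(user_agent: str) -> list[str]:
    """Hypothetical helper: return every substring the pattern matches."""
    return SECOND_ACTIVE.findall(user_agent)


if __name__ == "__main__":
    samples = [
        "NSPlayer/9.0.0.3265 WMFSDK/9.0",
        "NSPlayer/10.0.0.3702 WMFSDK/10.0",
        "NSPlayer/10.0.0.4054 WMFSDK/10.0",
        "NSPlayer/11.0.5721.5145 WMFSDK/11.0",
    ]
    for ua in samples:
        print(ua, "->", matching_runs(ua))
```

The same helper can be pointed at any new entry in the 403 log to see why it was refused.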

Blend27, the patterns are straight regular expressions and can be interpreted independently of the RewriteCond context shown here. I'd rather not explicitly describe the patterns, since that would make them easy to find with search and possibly defeat the purpose of posting them here. We do not know who reads here, but we can be sure that some of the scrapers do.

Jim

blend27

2:48 pm on Nov 12, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks Jim,

I did figure out the stuff later on and put it on the back burner for now (testing it on several sites :)).

Blend27

Mokita

10:20 pm on Dec 2, 2007 (gmt 0)

10+ Year Member



Some bot owners appear to have figured out how to get around the RewriteRule mentioned above, kindly provided by Jim.

I've just seen this UA coming from a dedicated server's IP:

Mozilla4.0VKUUXFITHOMQJNYXQSDPAGDUEKDCZFNVPHQNMZHUNUKXHJVXBY

jdMorgan

10:46 pm on Dec 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmmm... All-caps, no spaces (hint, hint).

Jim
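
The sketch below shows why that UA gets through: it begins with "Mozilla", so the first condition exempts it before the other patterns are ever consulted. It also shows one possible reading of the hint, a case-sensitive check for a long unbroken run of capitals. That second check is purely an assumption for illustration: the threshold of 20, the name `LONG_UPPER_RUN`, and the use of Python are not from this thread, and any such check needs the same 403 review as the original rule.

```python
import re

STARTS_MOZILLA = re.compile(r"^Mozilla")
# Hypothetical extra check: 20 or more capital letters in a row.
# Deliberately case-sensitive, so no re.IGNORECASE here.
LONG_UPPER_RUN = re.compile(r"[A-Z]{20,}")


def exempt_from_original_rule(user_agent: str) -> bool:
    """True when the leading 'Mozilla' exemption lets the request through."""
    return STARTS_MOZILLA.search(user_agent) is not None


def has_long_upper_run(user_agent: str) -> bool:
    """Hypothetical helper: True when a long all-caps run is present."""
    return LONG_UPPER_RUN.search(user_agent) is not None


if __name__ == "__main__":
    fake = "Mozilla4.0VKUUXFITHOMQJNYXQSDPAGDUEKDCZFNVPHQNMZHUNUKXHJVXBY"
    real = (
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) "
        "Gecko/20071025 Firefox/2.0.0.9"
    )
    for ua in (fake, real):
        print(exempt_from_original_rule(ua), has_long_upper_run(ua), ua)
```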