

Whitelist browser user agents to reduce overhead?

making mod_rewrite eat fewer cycles

         

denyer

11:15 pm on Jan 25, 2007 (gmt 0)

10+ Year Member



Apologies for any newbie mistakes, this is theoretical and the user agents are examples, not necessarily what I'm trying to block.

RewriteEngine On

RewriteRule ^robots\.txt$ - [L]

RewriteCond %{HTTP_USER_AGENT} ^Mozilla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Opera [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Lynx [NC]
RewriteRule ^.* - [L]

RewriteCond %{HTTP_USER_AGENT} ^WebReaper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC]
# longer list of blocked user agents here
RewriteRule ^.* - [F,L]

Would this be reasonable and serve the intended purpose of the WebReaper/etc list not being processed, or am I missing something basic?

jdMorgan

11:38 pm on Jan 25, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The majority of user-agents --both good and bad-- start with "Mozilla/[3-5]\.[0-9]\ \(compatible;\ ", so you will need to be quite a bit more specific with that "Mozilla" pattern.
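
To illustrate what "more specific" might look like, here is one possible sketch of a tightened whitelist condition; the version ranges and the decision to require an opening parenthesis are examples only, not a tested recommendation:

```apache
# Pass through only UA strings that begin with a full Mozilla version
# token, e.g. "Mozilla/5.0 (" -- note that most spoofers also match
# this, so further conditions on the rest of the string (browser name,
# platform) are still needed for a real whitelist.
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[3-5]\.[0-9]\ \( [NC]
RewriteRule .* - [L]
```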

I've used a browser whitelist, and it had 30 or 40 entries in it just to support the "popular ones" while rejecting most of the spoofers.

A combined whitelist/blacklist approach as you've shown is indeed the best way to keep the file smaller, but the result will never be "small." Try to concentrate on the user-agents that actually abuse your site regularly; a fully comprehensive ban list in your .htaccess file would likely slow your server to an unserviceable speed.

Look into using the Perl and PHP bad-bot trap scripts found here on WebmasterWorld as well; they will reduce the time you spend reacting to abuse.

Jim

denyer

12:17 am on Jan 26, 2007 (gmt 0)

10+ Year Member



No worries, this is more a learning exercise ahead of possibly needing it than anything else. Currently it's just a gentle reminder to anyone using a site ripper with default settings, and I'm thinking about how the .htaccess logic could be optimised, as I'm new to most of the concepts.

Slightly related... would I be correct in guessing from the "User-Agent: " prefix that the second one in the list below isn't actually a browser, but something else failing to spoof correctly?

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1
Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]

Thanks for your help. :)

jdMorgan

3:05 am on Jan 26, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, actually I think it's a content filter or something. I have no problem with content filters that declare themselves, but that UA is so lame... I wonder how good their filter is if they can't even write a good spoofed UA string.
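
If you did want to turn that particular breed away, something like this might do it; just a sketch, assuming the broken UA string always begins with the literal header-name text as shown in your log:

```apache
# Reject clients whose User-Agent *value* begins with the literal
# text "User-Agent:" -- the telltale of a botched spoofing attempt
# that pasted the whole header line into the value.
RewriteCond %{HTTP_USER_AGENT} ^User-Agent: [NC]
RewriteRule .* - [F]
```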

Jim