
How to allow only certain IPs and user-agents?

With "Order allow,deny", unlisted IPs are blocked before I can check the user-agent

         

MichaelBluejay

11:34 am on Dec 9, 2006 (gmt 0)




incrediBill says the wave of the future is to whitelist good bot IPs and good user-agents rather than trying to blacklist a mushrooming number of bad bot IPs. Okay, I'm game to try it. But I'm unclear as to the syntax. If I do something like:

order allow,deny
allow from [ip #1]
allow from [ip #2]

...then all the hosts not listed are blocked before I can even check the user_agent for "Mozilla" with RewriteCond.

(Yes, I know that bots can spoof user-agent strings; I'll deal with that later -- first I need to even be able to test the user-agent string!)

Also, for those of you running scripts to blacklist bad bots, how many entries for bad bots do you have in your .htaccess file? I'd like to compare that to the number of good IPs I'd need for whitelisting, because the list of IPs used by the major search engines is not trivial.

jdMorgan

3:28 pm on Dec 9, 2006 (gmt 0)




SetEnvIf is easy to use for ORing variables, such as "set GETOUT if bad-bot-1 OR bad-bot-2," which is how it's usually used.
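
A minimal sketch of that OR case, using the Order/Allow/Deny directives from the question. "bad-bot-1" and "bad-bot-2" are placeholder patterns, not real bot names:

```apache
# Either match sets GETOUT, so the two lines together act as an OR
SetEnvIfNoCase User-Agent "bad-bot-1" GETOUT
SetEnvIfNoCase User-Agent "bad-bot-2" GETOUT

# Let everyone in, then deny any request that has GETOUT set
Order Allow,Deny
Allow from all
Deny from env=GETOUT
```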

It's not so easy to use when trying to AND variables, such as "set GOODBOT if googlebot AND 66\.249(\.[0-9]{1,3}){2}."

It *can* be done by use of De Morgan's theorem, which says that (a AND b) = !(!a OR !b) -- That is, the logical expression NOT((NOT a) OR (NOT b)) is equivalent to (a AND b). The dual form also holds: (a OR b) = !(!a AND !b).

However, when I use this method to make SetEnvIf logically AND conditions together, I usually end up with too many variable names, and the code looks too complicated and hard to maintain for my taste.
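
As a rough sketch of what that De Morgan approach might look like (not necessarily how it is written in practice): the variable names ip_ok, ua_ok and not_good are made up for illustration, and the sketch assumes that mod_setenvif matches an unset variable as an empty string, so that "^$" means "not set". That assumption is worth verifying on your own server before relying on it.

```apache
# a: request comes from the 66.249.x.x range
SetEnvIf Remote_Addr "^66\.249(\.[0-9]{1,3}){2}$" ip_ok
# b: user-agent mentions Googlebot
SetEnvIfNoCase User-Agent "Googlebot" ua_ok

# (!a OR !b): flag the request if either variable is missing
# (assumes an unset variable is tested as an empty string)
SetEnvIf ip_ok "^$" not_good
SetEnvIf ua_ok "^$" not_good

# !(!a OR !b), which equals (a AND b)
SetEnvIf not_good "^$" GOODBOT
```

That is five directives and three helper variables for a single AND, which gives a feel for how quickly this gets hard to maintain.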

So, I usually just use mod_rewrite: consecutive RewriteCond lines are ANDed by default and can be ORed with the [OR] flag, and a single RewriteCond can "combine" several variables and the patterns to be matched against them on one line. I end up with something like:


RewriteCond %{REMOTE_ADDR}<->%{HTTP_USER_AGENT} !^66\.249(\.[0-9]{1,3}){2}<->Mozilla/[5-6]\.[0-9]+\ \(compatible;\ Googlebot/[2-3]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$
RewriteRule .* - [F]
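
For comparison, here is a simplified sketch of the same idea written as two separate RewriteCond lines instead of one combined line. It is deliberately looser than the rule above (it only checks that the user-agent contains "Googlebot"), and it assumes mod_rewrite is available in .htaccess context:

```apache
RewriteEngine On
# Forbid the request if the IP is NOT in 66.249.x.x ...
RewriteCond %{REMOTE_ADDR} !^66\.249(\.[0-9]{1,3}){2}$ [OR]
# ... OR the user-agent does NOT mention Googlebot
RewriteCond %{HTTP_USER_AGENT} !Googlebot
RewriteRule .* - [F]
```

By De Morgan's theorem, "not IP OR not user-agent" is the same test as "not (IP AND user-agent)", so only requests matching both conditions get through. The single-line form expresses that without needing the [OR] flag.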

Note that "<->" is not any kind of special pattern operator -- I use it as a unique string simply to "mark" the end of one variable and the beginning of the next, so that unintentional errors don't creep in due to the "boundary" between the variables being ambiguous.

I picked that sequence because it will almost never appear in 'real' variables that I'm testing, and because it implies concatenation. You could just as well use "~", "@", ",", "::", ";;", ">>", "<=>", or "<~>" if you like -- Those (and several variations that I didn't type) should all be sufficiently unique and also "safe" in that they are not special regular-expression operators.

I can't answer your question about blacklist size: I use a combined blacklist/whitelist approach, and only go after the worst offenders -- the ones that truly cause me problems. So, my lists are likely shorter than most. If no one else chimes in on this particular question, you can take the old "almost-perfect htaccess ban list" threads as guides for comparison.

Jim