RewriteCond %{HTTP_USER_AGENT} ^(.*)email* [OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*)bot* [OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*)collect* [OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*)crawl* [OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*)leach* [OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*)reap* [OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*)rip* [OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*)spider* [OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*)extract* [OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*)strip*
RewriteRule ^.* -[F]
That would be to try to stop the obvious ones; the list would then go on with the specific robots & spiders that are harder to summarize.
But would it be too rude? I mean, would I run the risk of blocking legitimate bots?
Or is it ok?
You should clean up your RewriteCond directives a bit, too; they are probably not doing exactly what
you expect them to do, due to the trailing "*". Mod_rewrite uses "Regular Expressions" - a
Unix/Posix pattern-matching syntax. In regular expressions "." and "*" do not mean the same thing
they do in - say - file matching on a Windows machine.
In regular expressions, the trailing star in this example
RewriteCond %{HTTP_USER_AGENT} ^(.*)email* [OR]
means, "accept 'emai' followed by zero or more l's as meeting the requirements for a match."
Thus, you will block "emai", "email", "emailllllll", etc.
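To make the effect of the trailing star concrete, here is a small annotated sketch. The user-agent strings in the comments are made-up examples for illustration, not real robots.

```apache
# "email*" means "emai" followed by zero or more "l" characters.
# With the leading "^(.*)" this matches any user-agent containing "emai":
#   "emai"            matches (zero l's)
#   "email"           matches (one l)
#   "emailllllll"     matches (many l's)
#   "Xemaiq-fetcher"  also matches, since only "emai" is required
RewriteCond %{HTTP_USER_AGENT} ^(.*)email*
RewriteRule .* - [F]
```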
In the first part of the pattern, the "." means "any character" and the "*" means, "any number of
the preceding character", so ".*" translates to "anything".
Also, there is no need to use parentheses unless you want to define a back-reference to be used
later, or to define the boundaries of a set of alternatives.
You can accomplish the goal of banning anything containing "email" anywhere in the UA by
simply stating:
RewriteCond %{HTTP_USER_AGENT} email [OR]
Since you don't care where in the string "email" occurs, you don't need the start anchor ("^")
or the end anchor ("$"), and since you're not using a back-reference, you don't need the
parentheses, either.
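As for when parentheses do earn their keep, the following sketch shows them bounding a set of alternatives. The two patterns are illustrative only and are not taken from the original list as written.

```apache
# Parentheses bound the alternatives, so "^" applies to both of them:
# matches user-agents that START with "email" or START with "collect".
RewriteCond %{HTTP_USER_AGENT} ^(email|collect) [OR]
# Without parentheses, "^" applies only to "email":
# matches user-agents that start with "email" OR contain "collect" anywhere.
RewriteCond %{HTTP_USER_AGENT} ^email|collect
RewriteRule .* - [F]
```

If no anchor is involved, a plain "email|collect" needs no parentheses at all.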
For more anchor-related info, see msg #9 in this thread [webmasterworld.com].
Unless you are absolutely sure of the case of the letters in the user-agents you want to block,
you may wish to add "NC" into the flag at the end of each RewriteCond, i.e. "[NC,OR]". This will
make the pattern-matching case-insensitive.
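A minimal sketch of the case-insensitive form, using two of the patterns from the original list:

```apache
# [NC] = "no case": "email" now also matches "Email", "EMAIL", "eMail", etc.
RewriteCond %{HTTP_USER_AGENT} email [NC,OR]
# The last condition in an OR chain takes [NC] alone, without OR.
RewriteCond %{HTTP_USER_AGENT} extract [NC]
RewriteRule .* - [F]
```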
For more information on regular expressions, this is a useful reference [etext.lib.virginia.edu].
A final technical comment... Current Apache documentation states that the "F" flag already implies
"L", so mod_rewrite stops processing rules after the
RewriteRule ^.* - [F]
at the end either way. Adding an explicit "L" flag is harmless and makes the intent clear:
RewriteRule .* - [F,L]
(I also cleaned up the redundant "^" and fixed the missing space between "-" and "[F]")
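Putting the points above together, a cleaned-up version of the original list might look like the sketch below. It assumes mod_rewrite is available and that the directives sit in an .htaccess file or a <Directory> section where rewriting is allowed; the "RewriteEngine On" line is added on that assumption.

```apache
# Sketch only: the original ten patterns with the stray "^(.*)" and
# trailing "*" removed, and case-insensitive matching added.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} email [NC,OR]
# Caution: "bot", "crawl" and "spider" also match many legitimate
# search engine robots.
RewriteCond %{HTTP_USER_AGENT} bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} collect [NC,OR]
RewriteCond %{HTTP_USER_AGENT} crawl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} leach [NC,OR]
RewriteCond %{HTTP_USER_AGENT} reap [NC,OR]
RewriteCond %{HTTP_USER_AGENT} rip [NC,OR]
RewriteCond %{HTTP_USER_AGENT} spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} extract [NC,OR]
# Last condition in the chain: no OR flag.
RewriteCond %{HTTP_USER_AGENT} strip [NC]
# Any request whose user-agent matched gets a 403 Forbidden response.
RewriteRule .* - [F,L]
```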
Generally, be very careful casting a wide net... I know it's frustrating keeping up with all the
bandwidth leeches on the 'net, but you may accidentally reject a legitimate visitor or worse yet, a
desirable search engine spider, by using RewriteConds that are too general.
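One way to narrow the net is to exempt a robot you want before the broad patterns are tested. This is only a sketch: "Googlebot" is an example of a robot you might wish to let through, and the rest of the list is shortened for illustration.

```apache
RewriteEngine On
# A condition without [OR] is ANDed with what follows, so it goes first:
# if the user-agent contains "Googlebot", this fails and the rule is skipped.
RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
# Otherwise, block when any one of the OR-chained patterns matches.
RewriteCond %{HTTP_USER_AGENT} email [NC,OR]
RewriteCond %{HTTP_USER_AGENT} extract [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bot [NC]
RewriteRule .* - [F,L]
```

Bear in mind that a user-agent string is easy to fake, so an exemption like this can also let through anything that merely claims to be the exempted robot.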
Hope this helps,
Jim
[edited by: jdMorgan at 9:31 pm (utc) on July 29, 2002]
And thanks for that reference; mod_rewrite is so helpful but so difficult to get right.
I see your point: a broad net such as this might be too risky... I have already removed the spider, crawl and bot lines.
I'll keep an eye on my logs to see if any "good" robots fall.