Many of these give themselves away by:
a) Requesting only .html pages, never the associated images ..
b) Fast rate of requests, more like a spider than a human ..
c) Deliberately obscure alterations to the usual user agents
A good example of c) is "Mozilla/4.0 (compatible ; etc. "
==> Note the space between 'compatible' and the semicolon ';'.
I want to disallow 'compatible ;', with the strangely placed space -BUT- I have to be careful!
If .htaccess ignores the space as 'whitespace', I will throw away 2/3 of my organic traffic!
1) Does anybody have a known good bullet-proof way to do this?
2) Am I disallowing by USER_AGENT like this?
RewriteCond %{HTTP_USER_AGENT} Java/1 [NC,OR] ..
or is it {HTTP_SOMETHING_ELSE}?
Help much appreciated! -Larry
## Supposedly mostly private network-related
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3\.0.\(compatible\;\) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3\.01.\(compatible\;\) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*\(compatible\;\) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0.\(compatible\;\) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.01.\(compatible\;\) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0.\(compatible\;\) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.01.\(compatible\;\) [NC,OR]
(I've not yet seen the 5.x versions but I reckon they're imminent so I pre-include them:)
Also:
SetEnvIfNoCase User-Agent "Java" no_way
Alternatively (and be sure to make the broken vertical line a true pipe):
RewriteCond %{HTTP_USER_AGENT} ^Java [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*(Jakarta¦Java) [NC,OR]
Caveat:
In the snippets above, escaped periods \. are where periods/dots occur in the original strings and UNescaped periods are where blank spaces occur. Thus this --
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4.0.\(compatible\;.MSIE.\;.Mac_PPC\) [NC,OR]
-- matches this totally bogus, spaced-out UA:
Mozilla/4 0 (compatible; MSIE ; Mac_PPC)
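One thing worth checking before relying on the period-for-blank habit: an unescaped period matches ANY single character, not only a blank. Here is a minimal sketch using Python's re module as a stand-in for Apache's regex engine (an assumption; for simple patterns like this one the two should behave alike). The pattern is the one from the RewriteCond above, and the sample UAs other than the bogus one are made up for illustration.

```python
import re

# Pattern copied from the RewriteCond above; [NC] is modelled with re.IGNORECASE.
# Python's re is only a stand-in for Apache's regex engine (assumption).
PATTERN = re.compile(r"^Mozilla/4.0.\(compatible\;.MSIE.\;.Mac_PPC\)", re.IGNORECASE)

def is_blocked(user_agent: str) -> bool:
    """Return True if the user agent would satisfy the condition."""
    return PATTERN.search(user_agent) is not None

# Hypothetical sample user agents, for illustration only.
samples = [
    "Mozilla/4 0 (compatible; MSIE ; Mac_PPC)",     # the bogus UA from the post
    "Mozilla/4.0 (compatible; MSIE ; Mac_PPC)",     # a real dot also satisfies "."
    "Mozilla/4x0_(compatible;_MSIE_;_Mac_PPC)",     # so does any other character
    "Mozilla/4.0 (compatible; MSIE 5.0; Mac_PPC)",  # a version number breaks the match
]

for ua in samples:
    print(is_blocked(ua), ua)
```

So the period trick does catch the spaced-out UA, but it is looser than an escaped blank would be.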
Jim always cleans up my snippets because I learned to insert periods for white/blank spaces rather than escaping the blanks and still get goofed up doing things the proper way:) Also, darn near everything can be streamlined and he's really, really good at that!
I also stop bots that make SGML and HTML errors in page requests, stupid stuff like & in a page name or \#top, \# and other silly things.
If you want to set a snare, put some javascript-activated links on your site and eventually you'll see bots that try to be clever and read that javascript, requesting chunks of javascript or variable names instead of the link they were trying to decode.
Wilderness: you wrote:
" I use a single line for these:
SetEnvIf User-Agent "compatible ;" keep_out ..
Do you put this line directly into .htaccess, or someplace else?
Are you sure I don't have to do something special (escape etc.) with the space?
I do NOT want to disallow compatible; with the semicolon properly placed of course.
If it goes into .htaccess, where exactly do I put it?
My present .htaccess looks like this:
RewriteEngine On
# By Referer:
RewriteCond %{HTTP_REFERER} forumxx\.#*$!xx\.xxx [NC,OR]
RewriteCond %{HTTP_REFERER} forumyy\.fok\.nl [NC,OR]
# By user agent:
RewriteCond %{HTTP_USER_AGENT} larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Java/1 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Missigua [NC]
RewriteRule .* - [F]
- - - -
I just want a single foolproof line I can insert somewhere in the above. -Larry
"(compatible; MSIE 5.0)"
"(compatible; MSIE 6.0)"
"(compatible; MSIE 7.0)"
If the agent contains one of these exact strings, there is a 99.99% chance it's a bot, as MSIE always tacks the platform onto its user agent strings.
I like the simplicity of:
SetEnvIf User-Agent "compatible ;" keep_out
Where EXACTLY, and how do I put that line? -Larry
Larry,
Enclosing the UA phrase in quotes results in it being matched EXACTLY as typed.
As far as the line itself?
What I provided is somewhat incomplete.
I use both SetEnvIf (I can never recall the module name, even though I used it before I even began with Rewrite condition) and Rewrite conditions in my htaccess.
Even the examples that I provided three years ago:
[webmasterworld.com...]
do NOT provide the complete and necessary lines for operation.
You may use the condition provided today in your regular Rewrites.
EX:
Add line
RewriteCond %{HTTP_USER_AGENT} "compatible ;" [NC,OR]
or you may add the following lines to your htaccess:
(when using the SetEnvIf or deny from the visitors are denied without access to robots.txt, although I seem to recall somebody writing an exception that allows reading)
Options -Indexes
<Limit GET>
SetEnvIf User-Agent "compatible ;" keep_out
order allow,deny
deny from #*$!.xx.xxx.
deny from xx.xxx.xx.xxx
allow from all
deny from env=keep_out
</Limit>
Some additional notes!
The opening and closing lines of
Options -Indexes (and other requirements)
<Limit GET> (with other options possible beyond GET)
</Limit>
vary from host to host.
It may require some tinkering.
I suggest non-peak hours till you see that it works properly.
Also you may use any words you desire as opposed to "keep_out"
The KEY is that you MUST use the identical words in both the deny from statements and SetEnvIf statements as WELL as your closing deny from env=
Don
RewriteEngine On
# By Referer:
RewriteCond %{HTTP_REFERER} forumxx\.#*$!xx\.#*$! [NC,OR]
RewriteCond %{HTTP_REFERER} forumyy\.fok\.nl [NC,OR]
# By user agent:
RewriteCond %{HTTP_USER_AGENT} larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Java/1 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Missigua [NC]
RewriteRule .* - [F]
Larry, if you use the additional options that I provided in Msg#9 of this thread, then your previously installed Rewrites would go below what I provided in your htaccess,
so it would now read:
Options -Indexes
<Limit GET>
SetEnvIf User-Agent "compatible ;" keep_out
order allow,deny
deny from 000.xx.#*$!.
deny from xx.xxx.xx.xxx
allow from all
deny from env=keep_out
</Limit>
RewriteEngine On
# By Referer:
RewriteCond %{HTTP_REFERER} forumxx\.#*$!xx\.xxx [NC,OR]
RewriteCond %{HTTP_REFERER} forumyy\.fok\.nl [NC,OR]
# By user agent:
RewriteCond %{HTTP_USER_AGENT} larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Java/1 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Missigua [NC]
RewriteRule .* - [F]
edited by wilderness.
In addition, you may combine the UA's in your example (three to a single line) using the "or" pipe character:
RewriteCond %{HTTP_USER_AGENT} larbin¦Java/1¦HTTrack [NC,OR]
BTW, don't forget that the forum breaks the pipe character into a split line and these MUST be corrected.
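For anyone nervous about merging lines, here is a small sketch (Python's re as a stand-in for Apache's regex engine, which is an assumption) comparing three separate "contains" conditions joined by [OR] against the single combined condition written with true pipe characters. The sample user agents are hypothetical.

```python
import re

# Three separate "contains" conditions, as if joined by [OR] ...
separate = [re.compile(p, re.IGNORECASE) for p in ("larbin", "Java/1", "HTTrack")]

# ... and the single combined condition, written with true pipe characters.
combined = re.compile(r"larbin|Java/1|HTTrack", re.IGNORECASE)

def blocked_separate(user_agent: str) -> bool:
    """True if any one of the three separate patterns is found."""
    return any(p.search(user_agent) for p in separate)

def blocked_combined(user_agent: str) -> bool:
    """True if the combined alternation is found."""
    return combined.search(user_agent) is not None

# Hypothetical sample user agents, for illustration only.
samples = [
    "larbin_2.6.3",
    "Java/1.4.2_03",
    "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
]

for ua in samples:
    print(blocked_separate(ua), blocked_combined(ua), ua)
```

Both columns agree for every sample, which is the point: the pipe only saves lines, it does not change what gets blocked.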
RewriteCond %{HTTP_USER_AGENT} ^Java [NC,OR]
If you add it at the end of your list, be sure to remove the OR:
RewriteCond %{HTTP_USER_AGENT} ^Java [NC]
I don't want to create a whole new class of problems by SetEnvelope or whatever just for one class of pests.
Your suggestion:
RewriteCond %{HTTP_USER_AGENT} "compatible ;" [NC,OR]
.. will fit right in with the others already banned.
Some questions though [others please chime in as need be!]
* Are you sure about the full quotes around the "compatible ;" user agent?
* Will this ban ANY request that includes that?
* How about the space? Isn't an escape character needed as in compatible\ ;?
* A very similar exclusion has the caret ^ symbol first, as in
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [NC,OR]
Does the caret indicate 'INCLUDES', and wouldn't I need that as well?
For now, I put in the line, as you gave it with quotes, and tested it.
It didn't shut my site down (a BIG fear), and regular compatible; (no space) requests get thru.
Its too soon to say if it worked as intended though. The weenies have to strike first.
Many thanks for all help so far! -Larry
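A quick way to reason about the space question offline is the sketch below. It assumes Apache hands the quoted string, blank included, to its regex engine as the pattern "compatible ;", and uses Python's re as a stand-in for that engine; the two user agents are hypothetical samples.

```python
import re

# The quoted CondPattern, blank included; [NC] is modelled with re.IGNORECASE.
# Assumption: Apache passes the text between the quotes to its regex engine as-is.
SPACED = re.compile(r"compatible ;", re.IGNORECASE)

def is_blocked(user_agent: str) -> bool:
    """True if the UA contains 'compatible ;' with the stray blank."""
    return SPACED.search(user_agent) is not None

# Hypothetical sample user agents, for illustration only.
bogus = "Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)"
normal = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

print(is_blocked(bogus))
print(is_blocked(normal))
```

In a regex a blank is an ordinary character that must be present in the UA, so the properly formed "compatible;" is left alone. That agrees with the test result reported above.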
You wrote: .. re the Java thing, lop off that /1 because you want to block ALL Java-only UAs.
And seeing as how you're already using mod_rewrite, just add this:
RewriteCond %{HTTP_USER_AGENT} ^Java [NC,OR]
If you add it at the end of your list, be sure to remove the OR: "
- - -
Thanks again! Very good idea. Please explain the caret symbol ^ however!
By the use of the caret ^, am I saying in effect:
" BAN all UAs that INCLUDE the term Java? "
.. as opposed to all UAs that are exactly equal to Java?
Very important. NObody just says "Java", there is always some other junk, numbers, whatever.
I want to disallow all of them, and also disallow any UA that INCLUDES (compatible ;
[note the space before ;]
regardless of what other crap comes with it. -Larry
* Are you sure about the full quotes around the "compatible ;" user agent?
* Will this ban ANY request that includes that?
* How about the space? Isn't an escape character needed as in compatible\ ;?
* A very similar exclusion has the caret ^ symbol first, as in
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [NC,OR]
Does the caret indicate 'INCLUDES', and wouldn't I need that as well?
Larry, the line works exactly as provided. I've been using it for some years. There are multiple examples of this in the Perfect Htaccess threads.
The caret (^) means BEGINS WITH [without the parentheses.]
When using begins with, ONLY the unique leading characters are necessary. The full UA is a waste.
The dollar ($) means ENDS WITH [without the parentheses.]
When using ends with, ONLY the unique trailing characters are necessary. The full UA is a waste.
NO leading or trailing character means CONTAINS [any location within the UA]
When using contains, ONLY the unique keyword characters are necessary. The full UA is a waste.
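Surrounding a string in quotes is very similar to the html <pre></pre> in that the string, blank spaces included, will be compared as typed (regex special characters such as the period and parentheses still keep their special meaning, though).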
These four options make the entire procedures quite simple.
One may handle most UA's with these options using SetEnvIf or RewriteCond (whatever your preference, or any combination of the two.)
Jim and many others are able to provide complicated strings and expressions to convert unknown phrases and/or characters in a string.
I do NOT use any of these types of strings, nor do I understand them. And yet I have no difficulty in creating unique lines for the UA's that keep appearing.
As far as the escape (\) character?
In UA's (and SetEnvIf) I have likely used it in fewer than a handful of lines (with over 400 lines of UA's). (I suppose eventually I'll need to condense these 400+ lines of UA's and move them to Rewrite using the OR pipe character; however, I've grown accustomed to looking at them alphabetically.)
Using the escape character for RewriteCond IP ranges is an entirely different issue: it must be used before the period that separates every CLASS of an IP range.
(Actually, once I started thinking of regex akin to a language, like, oh, French, and went _s_l_o_w_l_y_, literally from word to word, I started to get it, well, un peu:)
Here are four quick tips Jim recapped in one of his always, always helpful replies. I saved it as its own note to refer to again and again until I finally memorized it --
^Agent = Must start with "Agent" (may be followed by any number of characters)
Agent$ = Must end with "Agent" (may be preceded by any number of characters)
^Agent$ = Must exactly match "Agent" (no additional characters allowed for a match)
Agent = Must contain "Agent" (may be preceded or followed by any number of characters)
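The four tips can be tried out with a short sketch. It uses Python's re as a stand-in for Apache's regex engine (an assumption; anchors work the same way in both for cases like these), "Java" as the agent name, and made-up sample user agents.

```python
import re

def matches(pattern: str, user_agent: str) -> bool:
    """Case-insensitive search, standing in for a condition with [NC]."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None

# Hypothetical sample user agents, for illustration only.
starts_with_java = "Java/1.4.1_02"
ends_with_java = "SomeFetcher using Java"
exactly_java = "Java"
contains_java = "Mozilla/4.0 (compatible; Java 1.5 client)"

for pattern in (r"^Java", r"Java$", r"^Java$", r"Java"):
    for ua in (starts_with_java, ends_with_java, exactly_java, contains_java):
        print(f"{pattern:8} {matches(pattern, ua)!s:6} {ua}")
```

Note that the unanchored pattern is the only one of the four that catches every sample, which is why a "contains" match is the one to use when the keyword may sit anywhere in the UA.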
Here's the thread:
How to make .htaccess smaller, leaner
[webmasterworld.com...]
So this --
RewriteCond %{HTTP_USER_AGENT} ^Java [NC,OR]
-- translates, as per Jim's tips, as:
The Agent must start with the word Java, AND may be followed by any number of characters, AND is case insensitive.
And I know you're not using SetEnv sections, but this has the same effect vis-a-vis stopping a name-starts-with-Java agent, because of the ^ --
SetEnvIfNoCase User-Agent "^Java" no_way
HTH!
Today I had a "full" crawl using four IP ranges from three different providers.
The crawl was a bit confusing. At one point it was just taking pages, then later added pages and images, then later images alone. No consistency.
The IP's are irrelevant; however, should somebody desire them, sticky me and I'll send.
The UA (not sure how the forum will deal with the "ends with blank space" that I used in my rewrite; I've inserted an extra blank in an attempt to override the forum):
Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )
No Robots and the crawl was very slow.
Don
# Missing Windows NT version number
RewriteCond %{HTTP_USER_AGENT} Windows\ NT
RewriteCond %{HTTP_USER_AGENT} !Windows\ NT\ (4\.0¦5\.[0-2])(\)¦;\ [^)])
RewriteRule .* - [F]
Jim
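To see how the two conditions above work together, here is a rough sketch in Python (re standing in for Apache's regex engine, which is an assumption). The patterns are written with true pipe characters and without forum markup, the "!" negation is expressed as "is None", and the sample user agents are hypothetical.

```python
import re

# First condition: the UA mentions "Windows NT" at all (no [NC], so case-sensitive).
HAS_NT = re.compile(r"Windows\ NT")

# Second condition (negated with "!" in the rule): "Windows NT" followed by a
# known version number, then either ")" or "; " plus one more non-")" character.
VALID_NT = re.compile(r"Windows\ NT\ (4\.0|5\.[0-2])(\)|;\ [^)])")

def is_blocked(user_agent: str) -> bool:
    """Both conditions must hold: NT is claimed AND no valid version follows."""
    return HAS_NT.search(user_agent) is not None and VALID_NT.search(user_agent) is None

# Hypothetical sample user agents, for illustration only.
samples = [
    "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)",         # no version number
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",     # version, then ")"
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",  # version, then "; S"
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",     # version not in the list
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X)",               # no "Windows NT" at all
]

for ua in samples:
    print(is_blocked(ua), ua)
```

One caution that falls out of the sketch: the version list only covers NT 4.0 and 5.0 through 5.2, so a UA reporting any other NT version (6.0, for example) is treated the same as one with no version at all. The list would need extending as new Windows versions appear.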
Is this a good, all-around Rule?
RewriteRule .* - [F]
I ask because I've got a mix of these:
RewriteRule ^.*$ - [F,L]
RewriteRule ^/.+ - [F,L]
RewriteRule .* - [F,L]
And even things like this:
RewriteRule ^awstats$ - [NC,F]
(Something tells me I've got a lot of redundancy going on.)
I know I should sit down with a regex cheat sheet and pound the details into my brain. But until that day dawns -- yes?
RewriteRule .* - [F]
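For comparing those RewriteRule patterns, a small sketch may help. It assumes the rules live in .htaccess, where the pattern is tested against the requested path WITHOUT its leading slash, and it uses Python's re as a stand-in for Apache's regex engine. The request paths are made up.

```python
import re

# Hypothetical request paths as a RewriteRule in .htaccess would see them:
# relative to the directory, with no leading slash (assumption stated above).
PATHS = ["", "index.html", "dir/page.html", "awstats"]

def matching_paths(pattern: str, flags: int = 0) -> list:
    """Return the sample paths that the given RewriteRule pattern would match."""
    return [p for p in PATHS if re.search(pattern, p, flags)]

print(matching_paths(r".*"))        # unanchored "anything, even nothing"
print(matching_paths(r"^.*$"))      # same thing with explicit anchors
print(matching_paths(r"^/.+"))      # requires a leading slash
print(matching_paths(r"^awstats$", re.IGNORECASE))  # one specific path, with [NC]
```

Under that assumption, .* and ^.*$ match every request, so they are interchangeable as a catch-all, while ^/.+ never matches in .htaccess because the leading slash is not there. Also, [F] already implies [L], so [F,L] and [F] should behave the same.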
[edited by: Pfui at 3:18 am (utc) on July 15, 2006]
About these variations from assorted posts, above:
SetEnvIf User-Agent "compatible ;" keep_out
SetEnvIf User-Agent "(compatible; MSIE 5.0)" keep_out
SetEnvIf User-Agent "(compatible; MSIE 6.0)" keep_out
SetEnvIf User-Agent "(compatible; MSIE 7.0)" keep_out
I thought we needed to escape SetEnvIf User-Agent details, a la:
SetEnvIf User-Agent "compatible\ \;" keep_out
SetEnvIf User-Agent "\(compatible\;\ MSIE\ 5\.0\)" keep_out
SetEnvIf User-Agent "\(compatible\;\ MSIE\ 6\.0\)" keep_out
SetEnvIf User-Agent "\(compatible\;\ MSIE\ 7\.0\)" keep_out
The first set is certainly neater but before I blow a bunch of visitors into 500 purgatory, I just wanted to double-check because I'm currently using code 'in' SetEnvIf and they're working A-OK:
SetEnvIf User-Agent "^EI" keep_out
SetEnvIf User-Agent "^EZ" keep_out
SetEnvIf User-Agent "^FDM" keep_out
SetEnvIf User-Agent "^Ken" keep_out
SetEnvIf User-Agent "^Microsoft\ Data\ Access\ Internet\ Publishing\ Provider\ DAV" keep_out
Or maybe it doesn't matter?
[edited by: Pfui at 3:41 am (utc) on July 15, 2006]