Whee - what a great discussion.
I am a bit lost with all of this so I have a simple question about .htaccess syntax (yes, again, sorry), it will be fast:
Is the following command line correct for banning everything containing the string in question?
-> RewriteCond %{HTTP_USER_AGENT} ^.*WebZIP.*$ [OR]
And with spaces:
-> RewriteCond %{HTTP_USER_AGENT} ^.*Program\ Shareware.*$ [OR]
HUGE thanks if you can reply! :)
It'll work, but you can shorten it. A start anchor "^" followed by ".*" and an end anchor "$" preceded by ".*" are redundant:
RewriteCond %{HTTP_USER_AGENT} WebZIP [OR]
Is entirely equivalent.
Ref: [etext.lib.virginia.edu...]
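For context, a RewriteCond does nothing on its own; it only qualifies the RewriteRule that follows it. Here is a minimal sketch of how the shortened condition might sit in an .htaccess file, assuming mod_rewrite is enabled and that a plain 403 is the response you want. The RewriteEngine and RewriteRule lines are illustrative additions, not taken from the posts above.

```apache
# Assumes mod_rewrite is loaded and .htaccess overrides are permitted.
RewriteEngine On

# Unanchored pattern: matches "WebZIP" anywhere in the User-Agent header.
RewriteCond %{HTTP_USER_AGENT} WebZIP [OR]
# Last condition in the chain, so it carries no [OR] flag.
RewriteCond %{HTTP_USER_AGENT} Program\ Shareware
# Requests that satisfy either condition get 403 Forbidden; "-" means no substitution.
RewriteRule .* - [F]
```

Note that the final condition before the RewriteRule drops the [OR]; leaving it on the last condition is a common way to end up blocking far more than intended.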
Jim
Is the following command line correct for banning everything containing the string in question?
-> RewriteCond %{HTTP_USER_AGENT} ^.*WebZIP.*$ [OR]
Max, you put in a lot more than is needed to ban WebZip agents. Here is all you really need to block any agent containing that string, case-insensitively, anywhere in its U-A string:
RewriteCond %{HTTP_USER_AGENT} webzip [NC,OR]
And with spaces:
-> RewriteCond %{HTTP_USER_AGENT} ^.*Program\ Shareware.*$ [OR]
Again, you have included more than is needed to block this UA. My version of this rule reads:
RewriteCond %{HTTP_USER_AGENT} ^Program.?Shareware [NC,OR]
but you can also write it as:
RewriteCond %{HTTP_USER_AGENT} ^Program\ Shareware [NC,OR]
The ^ indicates the absolute beginning of a regexp string, while the $ sign means the absolute end. By leaving these out of the expression you allow for a match anywhere within the User Agent string. However, in my experience, Program Shareware is always the beginning of the name, so I anchor the beginning with a ^ but leave off the $, because there may be version numbers appended to it. Notice that I replaced your "\ " (the escaped space) with a .? where the space occurs. The reason for this is to allow for creative obfuscation by the users of these programs, who might change the space to a dash or underscore, or even a forward slash, in the hopes of breaking our rules. The .? matches zero or one of any character between Program and Shareware, so it covers a space, any single substitute character, or no separator at all. [NC] means No Case, i.e. the match is case-insensitive.
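To make the anchoring and the .? behaviour concrete, here is a small annotated sketch. The sample User-Agent strings in the comments are made up for illustration, and the RewriteRule line is an assumption about how the conditions would be used.

```apache
# Unanchored and case-insensitive: would match hypothetical UAs such as
# "WebZIP/5.0" or "Mozilla/4.0 (compatible; webzip)".
RewriteCond %{HTTP_USER_AGENT} webzip [NC,OR]

# Anchored at the start only, with .? standing in for the separator.
# Would match:      "Program Shareware 1.0.2", "Program-Shareware", "ProgramShareware"
# Would not match:  "Program  Shareware" (two spaces, since .? allows at most one character)
#                   "Mozilla/4.0 (Program Shareware)" (name is not at the start)
RewriteCond %{HTTP_USER_AGENT} ^Program.?Shareware [NC]

# Deny anything that matched either condition.
RewriteRule .* - [F]
```

If a separator of more than one character is a concern, .* or .{0,3} in place of .? would be one way to loosen the match, at the cost of a slightly broader pattern.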
IMHO, Wiz
Has anyone taken a look at what a 30-line .htaccess does to your server load? Just a thought. On a heavily loaded site I wouldn't put 30 lines of regular expressions into my PHP code for every page, and this seems the same.
I've got a couple of sites which have up to 800-line .htaccess files, but because the rules are carefully written, and because the sites get thousands of hits per day instead of tens or hundreds of thousands (or more), they do just fine. The bottom line is that each server and hosted site is different, and you have to test to find out how big is too big for your CPU and your traffic level.
Taking a wider view, the point was made earlier that most sites won't need such a large, comprehensive set of rules, and each Webmaster should use only those rules which provide a real benefit to offset the performance loss they cause.
Jim
So using this template
RewriteCond %{HTTP_USER_AGENT} string [NC,OR]
will ensure that every U-A containing "string" anywhere in the U-A will be banned?
Correct-a-mundo, Max
Don't forget that if the User Agent contains non-alphabet characters or spaces you can put .? between the last letter of name one and the first letter of name two, with no space between the letters. Since .? stands for at most one character, both words have to be spelled out up to the gap. For example: website.?extract [NC,OR] (will catch "Website Extractor 1.09"), which otherwise would have to be written longhand as: ^Website\ Extractor\ 1\.09$ [OR]. The long method would fail if somebody used version 1.10 instead of 1.09.
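The two styles can be compared side by side in a short sketch. Keep in mind that .? spans at most one character, so the short pattern needs the whole first word before the gap. Listing both conditions together is redundant and is done here only for comparison; the RewriteRule line is an illustrative assumption.

```apache
# Short form: case-insensitive, unanchored, separator optional.
# Covers "Website Extractor 1.09", "Website Extractor 1.10", "website-extractor", and so on.
RewriteCond %{HTTP_USER_AGENT} website.?extract [NC,OR]

# Longhand form: matches only the exact string "Website Extractor 1.09",
# so version 1.10 would slip past it.
RewriteCond %{HTTP_USER_AGENT} ^Website\ Extractor\ 1\.09$

RewriteRule .* - [F]
```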
I personally group all common expressions in one long rule, separating each one with a vertical pipe symbol (|). Here is one such grouped condition from my .htaccess (wrapped here for display; in the actual file the whole RewriteCond has to sit on a single line):
RewriteCond %{HTTP_USER_AGENT} ^(BlackWidow|Crescent|Disco.?|ExtractorPro|HTML.?Works|Franklin.?Locator|
Green\ Research|Harvest|HLoader|http.?generic|Industry.?Program|IUPUI.?Research.?Bot|Mac.?Finder|NetZIP|
NICErsPRO|NPBot|PlantyNet_WebRobot|Production.?Bot|Program.?Shareware|Teleport.?Pro|TurnitinBot|TE|
VoidEYE|WebBandit|WebCopier|Websnatcher|Website\ Extractor|WEP.?Search|Wget|Zeus) [NC,OR]
The list of User Agents is anchored at the beginning with a ^, because these UAs are known to appear in logs as typed, but there is no ending $ anchor. This allows for other characters after the main name, such as version numbers. I have another group rule that is not anchored at the beginning, to catch strings that may not be at the beginning of a UA.
These represent my personal choice of which agents to block with a 403 message, and may not apply to other people.
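For anyone who wants to try the grouped approach, here is a cut-down sketch on a single line with plain pipe characters. The first group uses a handful of names from the list above; the contents of the second, unanchored group are hypothetical, since the post does not show that rule, and the RewriteEngine and RewriteRule lines are assumptions about the surrounding file.

```apache
RewriteEngine On

# Anchored group: each name must appear at the very start of the User-Agent.
# The whole condition stays on one line.
RewriteCond %{HTTP_USER_AGENT} ^(BlackWidow|HTML.?Works|Program.?Shareware|Teleport.?Pro|WebCopier|Wget) [NC,OR]

# Unanchored group (hypothetical contents): names that may appear anywhere in the UA.
# This is the last condition, so it has no [OR].
RewriteCond %{HTTP_USER_AGENT} (webzip|website.?extract) [NC]

# Answer matching requests with 403 Forbidden.
RewriteRule .* - [F]
```

Grouping keeps the number of RewriteCond lines down, though whether it is measurably faster than separate conditions is something to test on your own server.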
Wiz
[edited by: jdMorgan at 3:08 pm (utc) on April 22, 2004]
[edit reason] Edited long line to fix horizontal scrolling [/edit]