
Home / Forums Index / Code, Content, and Presentation / Apache Web Server
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

    
.htaccess check?
I'm a newbie to this, but I *think* I've done this correctly
Canton · msg:4291026 · 10:20 pm on Apr 1, 2011 (gmt 0)

I apologize in advance if this question is very rudimentary. Before I go on, I should note that what I'm about to add here is the result of a lot of research already. I can't claim to understand regular expressions very well, but what I'm facing lately is a massive bandwidth suck from UAs that, frankly, I never need or want to see.

So, if anyone would be so kind as to comment on what I've got below, including any rookie mistakes, please let me know. I already know this blocks some of these UAs, which suggests it's properly formed, at least for those agents I've confirmed are blocked; I just don't know about the rest. I'm particularly concerned about the "Java" user-agent: do I need to add a wildcard after it to catch every version, or does the ^ character cover that?

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^YandexBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [OR]
RewriteCond %{HTTP_USER_AGENT} ^libwww-perl [OR]
RewriteCond %{HTTP_USER_AGENT} ^Java [OR]
RewriteCond %{HTTP_USER_AGENT} ^Sogou\ web\ spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Exabot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Ezooms [OR]
RewriteCond %{HTTP_USER_AGENT} ^Gigabot [OR]
RewriteCond %{HTTP_USER_AGENT} ^discobot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Purebot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Sosospider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Speedy\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^AboutUsBot\ Johnny5 [OR]
RewriteCond %{HTTP_USER_AGENT} ^Python-urllib [OR]
RewriteCond %{HTTP_USER_AGENT} ^Yeti [OR]
RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^GoScraper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Kehalim [OR]
RewriteCond %{HTTP_USER_AGENT} ^DoCoMo [OR]
RewriteCond %{HTTP_USER_AGENT} ^SurveyBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^spbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^BDFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^example [OR]
RewriteCond %{HTTP_USER_AGENT} ^EasyDL [OR]
RewriteCond %{HTTP_USER_AGENT} ^CamontSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^GoScraper [OR]
RewriteCond %{HTTP_USER_AGENT} ^oBot [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Chilkat [OR]
RewriteCond %{HTTP_USER_AGENT} ^ZmEu
RewriteRule ^.* - [F,L]
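On the "Java" question raised above: a RewriteCond pattern only needs to match a portion of the UA string, anchored only where you anchor it, so `^Java` already matches any User-Agent that merely begins with "Java" (for example a versioned string like `Java/1.6.0_22`); no trailing wildcard is needed. A minimal sketch of just that condition:

```apache
RewriteEngine On
# "^Java" matches any UA beginning with "Java", including versioned
# strings such as "Java/1.6.0_22" -- the pattern is anchored at the
# start only and does not need to match the whole UA string
RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule ^ - [F]
```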

Again, please be gentle, but if you have any thoughts, questions, suggestions, etc., they are all most welcome.

~Canton

wilderness · msg:4291044 · 11:51 pm on Apr 1, 2011 (gmt 0)

Not sure how you could spend the time gathering all those UAs and not read the material that accompanied them, which would have assisted you in comprehending additional endeavors.

First suggestion: take the time to alphabetize (and keep alphabetized) all these UAs you have copied from other places (WebmasterWorld or otherwise). You'll find the list much easier to manage, and easier to check when a new name appears in your raw logs. Additionally, many of the UAs you've copied are outdated and have been replaced by UAs that appear close to standard browser UAs (there are even flaws in this, if you're able to understand the association between browsers and the use of specific segments of the UA).

Second suggestion: become aware of UA words or synonyms that are detrimental to your websites and that appear in the default UAs of amateurish bot software and/or harvester software.

Third suggestion: create lines into which you may insert multiple names (understanding that you're not required to use the complete name; rather, unique strings that are only a portion of the appearing UA and would potentially apply to multiple harvesters).

I've provided some below. Please keep in mind that you should limit your names-per-line to somewhere in the 6-9 range (more than that I've found to be excessive), and keep them alphabetized when supplementing a name (which will occasionally require adding additional lines).

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (capture|crawl|download) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (fetch|finder|harvest) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Java|larbin|libww|library|link) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (nutch|proxy|Retrieve) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (scraper|siphon|spider|tool) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (web) [NC]
RewriteRule .* - [F]

g1smd · msg:4291052 · 12:28 am on Apr 2, 2011 (gmt 0)

That's pretty much what I was about to post, nothing more to add.

Canton · msg:4291054 · 12:38 am on Apr 2, 2011 (gmt 0)

Copied? Sorry, I should have clarified: these all came out of today's logfiles. I figure the best place to find which bots to block is my own logs, so that's what I did. I don't believe anything in that list is outdated (Indy Library is the only entry I took from another source). I have a very large site, so my main concern is not so much harvesters but bots that just keep grabbing one page after another, bloating my logfiles and, occasionally, slowing my server down.

I actually alphabetized locally as soon as I posted (GoScraper appears twice above because of my initial failure to do this), figuring that would probably be the first suggestion. So I'm good there too.

I like the third suggestion there though, wilderness. Does the combination of the parentheses and pipe character indicate some sort of wild card approach? That's what I'm really not clear on. Also, I noticed you use the "NC" directive in each of these, presumably in order to avoid the necessity of proper case for given bots/UAs?

wilderness · msg:4291074 · 1:14 am on Apr 2, 2011 (gmt 0)

The pipe character signifies "or".

Enclosing the words in parentheses, separated by the pipe character, translates to "any of these words" and reduces the quantity of lines in your file.

The NC (no case) flag simply eliminates a required redundancy (web, Web, WEB or WwEeBb); however, just like everything else in expressions, it must be used carefully and with full comprehension. Additionally, there will be instances where you do not wish to use "no case", for a better focus.
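The redundancy that NC eliminates can be sketched as follows; the two conditions below match the same set of UA strings:

```apache
RewriteEngine On
# With NC: one case-insensitive condition
RewriteCond %{HTTP_USER_AGENT} web [NC]
# Without NC, the same coverage needs a character-class spelling:
# RewriteCond %{HTTP_USER_AGENT} [Ww][Ee][Bb]
RewriteRule ^ - [F]
```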

FWIW, you'll need to build IP ranges for server farms, colos and such to eliminate the non-beneficial harvesting by many of those.
Amazon is by far the biggest pest for webmasters when it comes to server farms. Pfui has a two-year thread going here [webmasterworld.com]
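A server-farm range can be refused the same way as a UA, by matching the remote address. The sketch below uses the documentation range 203.0.113.0/24 purely as a placeholder, not any real farm's addresses; substitute prefixes you have actually observed in your own raw logs:

```apache
RewriteEngine On
# Placeholder prefix: 203.0.113.0/24 (TEST-NET-3, reserved for
# documentation). Replace with ranges observed in your own logs.
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.
RewriteRule ^ - [F]
```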

Canton · msg:4291077 · 1:44 am on Apr 2, 2011 (gmt 0)

Thanks wilderness...I'm going to have to do more research before I put this into production. Perhaps I'll just test on a site that I don't care too much about. But, again, thank you for the information and the link to the thread on Amazon.

wilderness · msg:4291089 · 2:09 am on Apr 2, 2011 (gmt 0)

IMO & FWIW, you'd be better served if you implemented these things in fragments.

Adding multiple sections and many lines simultaneously is simply "asking for disaster".
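A first fragment might be as small as a single section, watched against the logs before the next is added; the block below is one line of the grouped ruleset above, deployed on its own:

```apache
RewriteEngine On
# Fragment 1: deploy alone and watch the access log for
# false positives before adding the next section
RewriteCond %{HTTP_USER_AGENT} (capture|crawl|download) [NC]
RewriteRule .* - [F]
```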

Canton · msg:4291166 · 10:36 am on Apr 2, 2011 (gmt 0)

Thanks wilderness. I believe I'll do just that. And, again, thank you for the informative and timely replies.

~Canton

tangor · msg:4291181 · 1:01 pm on Apr 2, 2011 (gmt 0)

Looking at your list... there's a SIGNIFICANT number that honor robots.txt, which means you don't have to deal with them in .htaccess. Might address robots.txt first, then deal with the violators in .htaccess.
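For the crawlers that do honor it, a robots.txt disallow is enough on its own; which of the listed bots actually comply has to be verified in your own logs, but a sketch looks like:

```
# robots.txt -- only effective against crawlers that choose to obey it
User-agent: YandexBot
Disallow: /

User-agent: MJ12bot
Disallow: /

# everyone else unrestricted
User-agent: *
Disallow:
```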

wilderness · msg:4292391 · 12:46 pm on Apr 5, 2011 (gmt 0)

Quoting myself above: "FWIW, you'll need to build IP ranges on server farms"

Had a visit from Ezooms yesterday. You may add their backbone to the above category.

© Webmaster World 1996-2014 all rights reserved