


.htaccess check?

I'm a newbie to this, but I *think* I've done this correctly

     
10:20 pm on Apr 1, 2011 (gmt 0) - Junior Member (Canton)


I apologize in advance if this question is very rudimentary. Before I go on, I should note that what I'm about to add here is already the result of a lot of research. I can't claim to understand regular expressions very well, but what I'm facing lately is a massive bandwidth suck from UAs that, frankly, I never need or want to see. So, if anyone would be so kind as to comment on what I've got below, including any rookie mistakes, please let me know.

I already know this works to block some UAs, though I don't know about all of them - which suggests it's at least properly formed for the agents I've confirmed are blocked. I'm particularly concerned about the "Java" user-agent: I don't know if I need to add a wildcard after it to catch every version, or if the ^ character will cover that.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^YandexBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [OR]
RewriteCond %{HTTP_USER_AGENT} ^libwww-perl [OR]
RewriteCond %{HTTP_USER_AGENT} ^Java [OR]
RewriteCond %{HTTP_USER_AGENT} ^Sogou\ web\ spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Exabot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Ezooms [OR]
RewriteCond %{HTTP_USER_AGENT} ^Gigabot [OR]
RewriteCond %{HTTP_USER_AGENT} ^discobot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Purebot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Sosospider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Speedy\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^AboutUsBot\ Johnny5 [OR]
RewriteCond %{HTTP_USER_AGENT} ^Python-urllib [OR]
RewriteCond %{HTTP_USER_AGENT} ^Yeti [OR]
RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^GoScraper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Kehalim [OR]
RewriteCond %{HTTP_USER_AGENT} ^DoCoMo [OR]
RewriteCond %{HTTP_USER_AGENT} ^SurveyBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^spbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^BDFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^example [OR]
RewriteCond %{HTTP_USER_AGENT} ^EasyDL [OR]
RewriteCond %{HTTP_USER_AGENT} ^CamontSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^GoScraper [OR]
RewriteCond %{HTTP_USER_AGENT} ^oBot [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Chilkat [OR]
RewriteCond %{HTTP_USER_AGENT} ^ZmEu
RewriteRule ^.* - [F,L]

Again, please be gentle, but if you have any thoughts, questions, suggestions, etc., they are all most welcome.

~Canton
11:51 pm on Apr 1, 2011 (gmt 0) - Senior Member wilderness


Not sure how you could spend the time gathering all those UAs without reading the material that accompanied them, which would have helped you understand the next steps.

First suggestion is that you take the time to alphabetize (and keep alphabetized) all these UAs you have copied from other places (WebmasterWorld or otherwise). You'll find the list much easier to manage, and much easier to check when a new name appears in your raw logs.
Additionally, many of the UAs you've copied are outdated and have been replaced by UAs that look close to standard browser UAs (even those have flaws, if you're able to understand the association between browsers and the specific segments of a UA).

Second suggestion is that you become aware of UA words or synonyms that are detrimental to your websites and are used by amateurish bot software and/or harvester software in their default UAs.

Third suggestion is that you create lines in which you may insert multiple names (understanding that you're not required to use the complete name; rather, unique fragments that are only a portion of the appearing UA and would potentially apply to multiple harvesters).

I've provided some below.
Please keep in mind that you should limit your names-per-line to somewhere in the 6-9 range (anything more I've found to be excessive). Then keep them alphabetized when supplementing a name (which will occasionally require adding additional lines).

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (capture|crawl|download) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (fetch|finder|harvest) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Java|larbin|libww|library|link) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (nutch|proxy|Retrieve) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (scraper|siphon|spider|tool) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (web) [NC]
RewriteRule .* - [F]
12:28 am on Apr 2, 2011 (gmt 0) - Senior Member g1smd


That's pretty much what I was about to post, nothing more to add.
12:38 am on Apr 2, 2011 (gmt 0) - Junior Member (Canton)


Copied? Sorry...I should have clarified. These all came out of today's logfiles. I figure the best place to find which bots to block is my own logs, so...that's what I did. I don't believe anything in that list is outdated (Indy Library is the only entry I took from another source). I have a very large site, so my main concern is not so much harvesters as bots that just keep grabbing one page after another, bloating my logfiles and, occasionally, slowing my server down.

I actually alphabetized my local copy (GoScraper appears twice above because of my initial failure to do this) as soon as I posted, figuring this would probably be the first suggestion. So I'm good there too.

I like the third suggestion there though, wilderness. Does the combination of the parentheses and pipe character indicate some sort of wildcard approach? That's what I'm really not clear on. Also, I noticed you use the "NC" flag in each of these, presumably to avoid having to match the exact case of given bots/UAs?
1:14 am on Apr 2, 2011 (gmt 0) - Senior Member wilderness


The pipe character signifies "or".

Enclosing the words in parentheses, separated by the pipe character, translates to "any of these words" and reduces the number of lines in your file.

The NC (no case) flag simply eliminates what would otherwise be a required redundancy (web, Web, WEB, or even WwEeBb); however, just like everything else in expressions, it must be used carefully and with full comprehension.
Additionally, there will be instances where you do not wish to use "no case", for better focus.
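
To make that concrete, here's a minimal sketch (the fragments are just examples taken from the kind of list above). One grouped, case-insensitive line stands in for several single-name conditions, and because there is no leading ^ anchor it matches those fragments anywhere in the UA string:

RewriteEngine On
# one grouped line replaces three separate, case-sensitive conditions;
# [NC] means scraper, Scraper and SCRAPER all match
RewriteCond %{HTTP_USER_AGENT} (harvest|scraper|spider) [NC]
RewriteRule .* - [F]

(By contrast, a pattern such as ^Java is anchored to the start of the string, so it already catches Java/1.6.0 and every other version without any trailing wildcard.)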

FWIW, you'll need to build IP ranges for server farms, colos and such to eliminate the non-beneficial harvesting by many of those.
Amazon is by far the biggest pest for webmasters when it comes to server farms. Pfui has a two-year thread going here [webmasterworld.com]
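
As a rough sketch of the IP-range side (the ranges below are documentation-only example addresses, not real server-farm assignments), a denial can sit in the same .htaccess alongside the UA rules:

# example ranges only - substitute the actual ranges you find in your logs
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from 198.51.100.0/24

Substitute the real CIDR ranges you identify in your raw logs; a mod_rewrite condition against %{REMOTE_ADDR} works too if you prefer to keep everything in one rule set.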
1:44 am on Apr 2, 2011 (gmt 0) - Junior Member (Canton)


Thanks wilderness...I'm going to have to do more research before I put this into production. Perhaps I'll just test on a site that I don't care too much about. But, again, thank you for the information and the link to the thread on Amazon.
2:09 am on Apr 2, 2011 (gmt 0) - Senior Member wilderness


IMO & FWIW, you'd serve yourself better if you implemented these things in fragments.

Adding multiple sections and many lines simultaneously is simply "asking for disaster".
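
A first fragment might be just a couple of lines - the names below are only placeholders for whatever you decide to tackle first:

RewriteEngine On
# start small: one or two conditions, confirm they block what you expect, then extend
RewriteCond %{HTTP_USER_AGENT} (libwww-perl|ZmEu) [NC]
RewriteRule .* - [F]

Confirm the block by requesting a page with a matching user-agent (curl's -A option makes this easy) and checking for a 403 before layering on more lines.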
10:36 am on Apr 2, 2011 (gmt 0) - Junior Member (Canton)


Thanks wilderness. I believe I'll do just that. And, again, thank you for the informative and timely replies.

~Canton
1:01 pm on Apr 2, 2011 (gmt 0) - Senior Member tangor


Looking at your list... there's a SIGNIFICANT number that honor robots.txt... which means you don't have to deal with them in .htaccess. Might address robots.txt first, then deal with the violators in .htaccess.
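
As a minimal sketch of that approach (the two bot tokens here are examples of crawlers generally reported to honor the exclusion standard; whether a given bot actually complies is something to verify in your logs):

# robots.txt - ask well-behaved crawlers to stay out entirely
User-agent: MJ12bot
Disallow: /

User-agent: YandexBot
Disallow: /

Anything that keeps crawling despite an entry like that has identified itself as a violator and can go into the .htaccess rules instead.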
12:46 pm on Apr 5, 2011 (gmt 0) - Senior Member wilderness


"FWIW, you'll need to build IP ranges for server farms"

Had a visit from Ezooms yesterday. You may add their backbone to the above category.