htaccess and banning bots
code is not working for all bots
bthom62



 
Msg#: 4579553 posted 5:37 pm on May 30, 2013 (gmt 0)

Afternoon all:

Trying to ban some search bots, because they are crashing our server.

I am currently using the code below.

(The "-- Working" / "-- Failed" notes are not in the actual .htaccess file.)

This code is at the top of the .htaccess file.

Any ideas why it is not working?

I was using Deny from 168.xxx etc. but I was told by the server that it took up too much processor time.

Thank you


RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Yandex [OR] -- Working
RewriteCond %{HTTP_USER_AGENT} ^msnbot [OR] -- Working
RewriteCond %{HTTP_USER_AGENT} ^Owlinbot [OR] -- Working
RewriteCond %{HTTP_USER_AGENT} ^sistrix [OR] -- Failed
RewriteCond %{HTTP_USER_AGENT} ^genieo [OR] -- Failed
RewriteCond %{HTTP_USER_AGENT} ^proximic [OR] -- Failed
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [OR] -- Failed
RewriteCond %{HTTP_USER_AGENT} ^AhrefsBot [OR] -- Failed
RewriteCond %{HTTP_USER_AGENT} ^searchmetrics [OR] -- Working
RewriteCond %{HTTP_USER_AGENT} ^SearchmetricsBot [OR] -- Working
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [OR] -- Working
RewriteCond %{HTTP_USER_AGENT} ^Baidu -- Working
RewriteRule ^.* - [F,L]

 

blend27

Msg#: 4579553 posted 11:35 am on May 31, 2013 (gmt 0)

I am not big on .htaccess, but the following format does the job on my sites. Notice the pipe (|) separator.

RewriteCond %{HTTP:User-Agent} (?:Yandex|msnbot|Owlinbo|sistrix|genieo|proximic|MJ12bot|AhrefsBot|searchmetrics|SearchmetricsBot|Baidu) [NC]
RewriteRule .? - [F]

jmccormac

Msg#: 4579553 posted 11:42 am on May 31, 2013 (gmt 0)

It may not be wise to block msnbot or Yandex, depending on your market. It is possible to slow down msnbot with a Crawl-delay directive in robots.txt (for example: Crawl-delay: 10).
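
For example, a minimal robots.txt sketch (10 seconds is just the example value above; Bing/msnbot and Yandex honour Crawl-delay, while Googlebot ignores it, as far as I know):

User-agent: msnbot
Crawl-delay: 10

User-agent: Yandex
Crawl-delay: 10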

Regards...jmcc

Key_Master

Msg#: 4579553 posted 11:57 am on May 31, 2013 (gmt 0)

I doubt some of those you have marked as working actually were working. Yandex, Baidu, and Bingbot, for example, all use a Mozilla user-agent.
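
For reference, the ^ anchor means the user-agent string has to begin with that text, but the real UAs all begin with "Mozilla/5.0". They typically look something like this (check your own raw logs for the exact strings):

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

So ^Yandex can never match; an unanchored Yandex (or YandexBot) will.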

I was using Deny from 168.xxx etc. but I was told by the server that it took up too much processor time.

That doesn't sound right. I think you should figure out why your previous disallows were crashing the server in the first place.

bthom62



 
Msg#: 4579553 posted 12:07 pm on May 31, 2013 (gmt 0)

Looking at my search engine traffic over the years:

80% came from Google
10% came from Bing
and the rest was Yahoo and MSN

If 90% of my traffic is from the US and Canada, then I do not need a Russian bot relentlessly crawling the site.

I noticed (via Pingdom) that my site went offline a number of times, usually in the early hours.

I did everything I could to fix it and have since learned that spiders were causing my grief.

wilderness

Msg#: 4579553 posted 1:56 pm on May 31, 2013 (gmt 0)

I was using Deny from 168.xxx etc. but I was told by the server that it took up too much processor time


Absurd!
A single denial of a Class A causes no such server load.
A server load could, however, be caused by loops and malfunctioning ErrorDocuments.

I've approximately a dozen Class B's within 168 denied; otherwise the Class A is open access. I may have some 168's in custom rules as well.
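
By way of illustration (the second octets below are placeholders, not my actual list), that looks like:

Order Allow,Deny
Allow from all
Deny from 168.144
Deny from 168.215

rather than a blanket Deny from 168.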

I doubt some of those you have marked as working actually were working.


I agree with Key_master.
For most of those UA patterns to match, you'll need to remove the "begins with" anchor (^).

A few of those bots are compliant and will honor robots.txt.

You don't seem to have addressed the major SEs' preview bots, which don't fall into the category of regular bots and masquerade as standard browsers.

MSN-Bing can potentially become real crawling pests if left unrestricted.

Do your site(s) also contain images and/or other types of media files that are not restricted from bot crawling?

wilderness

Msg#: 4579553 posted 2:02 pm on May 31, 2013 (gmt 0)

You might try these lines, which catch quite a few pests (there are likely a few other common harvester/bot terms that I've overlooked):

RewriteCond %{HTTP_USER_AGENT} (Access|appid) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Capture|Client|Copy|crawl|curl) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Data|devSoft|Domain|download) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Engine|fetch|filter|genieo) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Jakarta|Java|Library|link|libww) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|nutch|Preview|Proxy|Publish) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (scraper|spider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Wget|Win32|WinHttp) [NC]
RewriteRule .* - [F]

bthom62



 
Msg#: 4579553 posted 4:17 pm on May 31, 2013 (gmt 0)


I was using Deny from 168.xxx etc. but I was told by the server that it took up too much processor time


I had 20 or so URLs done this way. It was not just one.

wilderness

Msg#: 4579553 posted 4:36 pm on May 31, 2013 (gmt 0)

I had 20 or so URLs done this way. It was not just one.


That's still not enough to cause an issue.
I've thousands.

The issue is a result of some other server load.

Generally speaking, the only excess server loads resulting from htaccess files are 500's caused by loops, or domain name lookups.
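
For example (hypothetical snippets, not from anything posted here):

# classic rewrite loop: the rewritten URL matches the pattern again, so the rule
# keeps firing until Apache's internal recursion limit kicks in and returns a 500
RewriteRule ^(.*)$ /index.php/$1 [L]

# hostname-based deny: forces a (double) reverse DNS lookup on every request
Deny from some-crawler.example.com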

lucy24

Msg#: 4579553 posted 8:29 pm on May 31, 2013 (gmt 0)

I was using Deny from 168.xxx etc. but I was told by the server that it took up too much processor time

Agree. With the responses, not with "the server" (assume you mean your host). A straight numerical IP block should be the least server-intensive thing you can do.

Also as already noted, the big problem in your original file is that every single UA has an opening anchor. Possibly you've been misinformed about what an anchor does; if you post back, someone will sort you out :)

:: insert obligatory plug for doing simple UA blocks in mod_setenvif using BrowserMatch or BrowserMatchNoCase in conjunction with "Deny from env=such-and-such" ::
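
i.e. something along these lines (untested sketch, Apache 2.2 syntax; "bad_bot" is just an arbitrary variable name and the UA list is taken from this thread):

BrowserMatchNoCase (sistrix|genieo|proximic|MJ12bot|AhrefsBot) bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot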

dstiles

Msg#: 4579553 posted 8:49 pm on May 31, 2013 (gmt 0)

> I do not need a Russian bot

Yes you do. Yandex has for a while now been operating a US-based SE which occasionally sends visitors, especially those fed up with Google. The bot itself is reliable and obeys robots.txt directives (including crawl speed, as far as I know).

I suppose you could block the RU IP range and only enable the US range but I suspect they share data. I certainly would were it my bot/SE.

wilderness

Msg#: 4579553 posted 9:41 pm on May 31, 2013 (gmt 0)

dstiles,
hope you and Yandex are happy together ;)

BTW, Yandex is one of the robots.txt compliant bots I mentioned.

Unfortunately the bot generally uses IP's that I've denied and gets denied anyway.

keyplyr

Msg#: 4579553 posted 1:09 am on Jun 1, 2013 (gmt 0)


RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Yandex [OR] -- Working
RewriteCond %{HTTP_USER_AGENT} ^msnbot [OR] -- Working
RewriteCond %{HTTP_USER_AGENT} ^Owlinbot [OR] -- Working
RewriteCond %{HTTP_USER_AGENT} ^sistrix [OR] -- Failed
RewriteCond %{HTTP_USER_AGENT} ^genieo [OR] -- Failed
RewriteCond %{HTTP_USER_AGENT} ^proximic [OR] -- Failed
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [OR] -- Failed
RewriteCond %{HTTP_USER_AGENT} ^AhrefsBot [OR] -- Failed
RewriteCond %{HTTP_USER_AGENT} ^searchmetrics [OR] -- Working
RewriteCond %{HTTP_USER_AGENT} ^SearchmetricsBot [OR] -- Working
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [OR] -- Working
RewriteCond %{HTTP_USER_AGENT} ^Baidu -- Working
RewriteRule ^.* - [F,L]


You might try removing the start anchor (^)
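
e.g. something like this (same list with the anchors dropped so the patterns can match anywhere in the string, plus [NC] so case doesn't matter; a couple of now-redundant entries folded in; untested):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Yandex [NC,OR]
RewriteCond %{HTTP_USER_AGENT} msnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Owlinbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} sistrix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} genieo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} proximic [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} AhrefsBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} searchmetrics [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baidu [NC]
RewriteRule ^.* - [F,L]

(searchmetrics with [NC] also catches SearchmetricsBot, and Baidu catches Baiduspider.)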

dstiles

Msg#: 4579553 posted 8:23 pm on Jun 1, 2013 (gmt 0)

Wilderness - accepted that the IPs used by yandex are a mess - generally two or three per /24, or several small groups within a /24. A few days ago, in an attempt to coax them into more extensive coverage of one of my clients' sites, I enabled almost all yandex IP ranges completely. I also added a few more UAs - their list of bot UAs on their site is very good and a model for various other bots I will not mention.

For reference, the yandex ranges I have are as follows (most though not all of these are RU):

5.45.202.0 - 5.45.202.255
37.140.141.0 - 37.140.141.63
77.88.0.0 - 77.88.63.255
87.250.224.0 - 87.250.255.255
93.158.128.0 - 93.158.191.255
95.108.128.0 - 95.108.255.255
100.43.64.0 - 100.43.95.255
178.154.128.0 - 178.154.255.255
199.21.96.0 - 199.21.99.255
199.36.240.0 - 199.36.243.255
213.180.192.0 - 213.180.223.255

In the past I have run rDNS checks on the IP ranges to extract bot IPs but their system is too volatile. I figure that a valid bot UA on any of these IPs is probably valid.
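
For anyone who prefers CIDR notation, those same ranges work out to the following (worth double-checking before pasting them into anything):

5.45.202.0/24
37.140.141.0/26
77.88.0.0/18
87.250.224.0/19
93.158.128.0/18
95.108.128.0/17
100.43.64.0/19
178.154.128.0/17
199.21.96.0/22
199.36.240.0/22
213.180.192.0/19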

wilderness

Msg#: 4579553 posted 9:30 pm on Jun 1, 2013 (gmt 0)

For reference, the yandex ranges I have are as follows


Many thanks.

FWIW, except for the 199s, I've all those Class A's denied, which was the point of my explanation.

keyplyr

Msg#: 4579553 posted 10:29 pm on Jun 1, 2013 (gmt 0)

RE: Yandex

They make a statement somewhere on their main website asking you not to filter them by IP range because of frequent changes.

lucy24

Msg#: 4579553 posted 2:26 am on Jun 2, 2013 (gmt 0)

not to filter them by IP range because of frequent changes

I remember that line. But they really ought to know better. Claiming to be a major search engine is one of the simplest robot tricks; how many fake googlebots do you meet in the course of an average day?

I don't see a lot of fake yandexbots. Fake yandex searches, yes. Luckily they are very easy to distinguish from the real ones.

dstiles

Msg#: 4579553 posted 7:39 pm on Jun 2, 2013 (gmt 0)

I think they mean "within the yandex IP ranges".

I've seen extra Bing and G bot IPs added from time to time over the past year or so since I initially checked their rDNS. The yandex IPs I've been seeing don't seem to have changed much: they've added a few more bot UAs (at least, I hadn't allowed them before) but the IP changes have been few.

I have to say that despite yandex being Russia-based I have more faith in their integrity than in G right now.

dstiles

Msg#: 4579553 posted 3:58 pm on Jul 4, 2013 (gmt 0)

I've been looking further into Genieo. I've had a LOT of blocks this year on the tool and it seems I'm blocking potentially genuine visitors. I have now enabled it as a bot, although that is a little simplistic for what it does.

User-Agent: Mozilla/5.0 (compatible; Genieo/1.0 http:// www[.]genieo[.]com/webfilter[.]html)

Brackets [] added by me, plus space after http.

keyplyr

Msg#: 4579553 posted 7:19 pm on Jul 4, 2013 (gmt 0)

I believe the reason I started blocking the genieo UA a few years ago was to keep my sites out of their database. This proved to be unsuccessful since it appeared they were using one of the index dumps available at the time. This may have changed and they may be crawling on their own.

After doing (a little) research I discovered that users who installed their browser add-on could capture content and store it for a variety of uses, so I re-added the UA block. Of course almost all modern browsers can download web content to their local machines nowadays, but it was the "service" feature that annoyed me.

At the time I think I considered the fact that blocking the UA blocks all Genieo users with the add-on, much like blocking any scraping tool's UA also blocks everyone who uses it.

FWIW - SERPs will show a few "remove Genieo" services/tools, accompanied by user testimonials of trojan-like behavior.

[edited by: keyplyr at 7:59 pm (utc) on Jul 4, 2013]

dstiles

Msg#: 4579553 posted 7:49 pm on Jul 4, 2013 (gmt 0)

I found approx 90 blocks on it during the past few months. All looked like ordinary broadband users. As far as I read, the tool stores content only on the local (i.e. the user's) computer, which is little worse than an ordinary cache.

It occurred to me, whilst reading their blurb, that this may be a good replacement for a certain ill-reputed search engine beginning with G... :)

lucy24

Msg#: 4579553 posted 9:41 pm on Jul 4, 2013 (gmt 0)

As far as I read, the tool stores content only on the local (i.e. the user's) computer, which is little worse than an ordinary cache.

How is this different from the ordinary "Save" ("HTML Complete", "Web Archive" or whatever your browser may call it) that all browsers can do?

If the content I'm looking for involves searching an eighty-five-megabyte text file, I think I and the site will both be happier if I've got a copy on my HD and don't need to bother their server.

keyplyr

Msg#: 4579553 posted 10:36 pm on Jul 4, 2013 (gmt 0)


How is this different from the ordinary "Save" ("HTML Complete", "Web Archive" or whatever your browser may call it) that all browsers can do?

I was under the impression the user could choose to save to a cloud service.

lucy24

Msg#: 4579553 posted 11:12 pm on Jul 4, 2013 (gmt 0)

Oh.

YUK.

;)

dstiles

Msg#: 4579553 posted 4:40 pm on Jul 5, 2013 (gmt 0)

My point was: my clients were probably losing customers. I'll see how it goes. I may modify the action to block with a message if it gets out of hand.
