
htaccess and banning bots

code is not working for all bots

   
5:37 pm on May 30, 2013 (gmt 0)



Afternoon all:

Trying to ban some search bots, because they are crashing our server.

I am currently using the code below.

(the -- Working and -- Failed notes are not in the .htaccess file)

This code is at the top of the .htaccess file.

Any ideas why it is not working?

I was using Deny from 168.xxx etc. but I was told by the server that it took up too much processor time.

Thank you


RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Yandex [OR]  -- Working
RewriteCond %{HTTP_USER_AGENT} ^msnbot [OR]  -- Working
RewriteCond %{HTTP_USER_AGENT} ^Owlinbot [OR]  -- Working
RewriteCond %{HTTP_USER_AGENT} ^sistrix [OR]  -- Failed
RewriteCond %{HTTP_USER_AGENT} ^genieo [OR]  -- Failed
RewriteCond %{HTTP_USER_AGENT} ^proximic [OR]  -- Failed
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [OR]  -- Failed
RewriteCond %{HTTP_USER_AGENT} ^AhrefsBot [OR]  -- Failed
RewriteCond %{HTTP_USER_AGENT} ^searchmetrics [OR]  -- Working
RewriteCond %{HTTP_USER_AGENT} ^SearchmetricsBot [OR]  -- Working
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [OR]  -- Working
RewriteCond %{HTTP_USER_AGENT} ^Baidu  -- Working
RewriteRule ^.* - [F,L]
11:35 am on May 31, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I am not big on .htaccess, but the following format does the job on my sites. Notice the pipe character separator.

RewriteCond %{HTTP:User-Agent} (?:Yandex|msnbot|Owlinbot|sistrix|genieo|proximic|MJ12bot|AhrefsBot|searchmetrics|SearchmetricsBot|Baidu) [NC]
RewriteRule .? - [F]
11:42 am on May 31, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It may not be wise to block msnbot or Yandex, depending on your market. It is possible to slow down msnbot with a Crawl-delay directive (for example: Crawl-delay: 10) in robots.txt.
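A minimal robots.txt sketch, assuming msnbot is the crawler you want to throttle (Googlebot ignores Crawl-delay, as far as I know, but msnbot/bingbot and Yandex honour it):

User-agent: msnbot
Crawl-delay: 10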

Regards...jmcc
11:57 am on May 31, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I doubt some of those you have marked as working actually were working. Yandex, Baidu, and Bingbot, for example, all use a Mozilla user-agent string.
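
For illustration, these crawlers typically send user-agent strings along these lines (approximate, quoted from memory rather than live logs):

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

None of these begin with the bare bot name, so a pattern anchored with ^ (for example ^Yandex) can never match them, while an unanchored Yandex or Baidu would.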

I was using Deny from 168.xxx etc. but I was told by the server that it took up too much processor time.

That doesn't sound right. I think you should figure out why your previous disallows were crashing the server in the first place.
12:07 pm on May 31, 2013 (gmt 0)



Looking at my search engine traffic over the years:

80% came from Google
10% came from Bing
and the rest came from Yahoo and MSN.

If 90% of my traffic is from US and Canada then I do not need a Russian bot relentlessly crawling the site.

I noticed (via Pingdom) that when my site went offline, it happened a number of times in the early hours.

I did everything I could to fix the traffic, and have since learned that spiders were causing my grief.
1:56 pm on May 31, 2013 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I was using Deny from 168.xxx etc. but I was told by the server that it took up too much processor time.


Absurd!
A single denial of a Class A causes no such server load.
A server load could, however, be caused by loops and malfunctioning ErrorDocuments.

I've approximately a dozen Class B's in the 168 range denied; otherwise the Class A is open access. I may have some 168's in custom rules as well.

I doubt some of those you have marked as working actually were working.


I agree with Key_master.
In order for most of those UA patterns to match, you'll need to remove the "begins with" anchor.

A few of those bots are compliant and will honor robots.txt.
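
For the compliant ones, robots.txt is the cheapest place to do the blocking; a minimal sketch, assuming MJ12bot and AhrefsBot are among those you trust to comply:

User-agent: MJ12bot
Disallow: /

User-agent: AhrefsBot
Disallow: /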

You don't seem to have addressed the major SEs' preview bots, which don't fall into the category of regular bots and masquerade as standard browsers.

MSN/Bing can become real crawling pests if left unrestricted.

Do your site(s) also contain images and/or other types of media files that are not restricted from bot crawling?
2:02 pm on May 31, 2013 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



You might try these lines, which catch quite a few pests (there are likely a few other common harvester/bot terms that I've overlooked):

RewriteCond %{HTTP_USER_AGENT} (Access|appid) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Capture|Client|Copy|crawl|curl) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Data|devSoft|Domain|download) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Engine|fetch|filter|genieo) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Jakarta|Java|Library|link|libww) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|nutch|Preview|Proxy|Publish) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (scraper|spider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Wget|Win32|WinHttp) [NC]
RewriteRule .* - [F]
4:17 pm on May 31, 2013 (gmt 0)




I was using Deny from 168.xxx etc. but I was told by the server that it took up too much processor time.


I had 20 or so URLs done this way. It was not just one.
4:36 pm on May 31, 2013 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I had 20 or so URLs done this way. It was not just one.


That's still not enough to cause an issue.
I've thousands.

The issue is a result of some other server load.

Generally speaking, the only excess server loads resulting from .htaccess files are 500s caused by loops, or domain-name lookups.
8:29 pm on May 31, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



I was using Deny from 168.xxx etc. but I was told by the server that it took up too much processor time.

Agree. With the responses, not with "the server" (assume you mean your host). A straight numerical IP block should be the least server-intensive thing you can do.

Also as already noted, the big problem in your original file is that every single UA has an opening anchor. Possibly you've been misinformed about what an anchor does; if you post back, someone will sort you out :)

:: insert obligatory plug for doing simple UA blocks in mod_setenvif using BrowserMatch or BrowserMatchNoCase in conjunction with "Deny from env=such-and-such" ::
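
A bare-bones sketch of that approach (Apache 2.2 syntax; the bot names are placeholders for whatever you actually want to block):

BrowserMatchNoCase (sistrix|genieo|proximic|MJ12bot|AhrefsBot) bad_bot

Order Allow,Deny
Allow from all
Deny from env=bad_bot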
8:49 pm on May 31, 2013 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



> I do not need a Russian bot

Yes you do. Yandex has for a while now been operating a US-based SE which occasionally sends visitors, especially those fed up with Google. The bot itself is reliable and obeys robots.txt directives (including crawl speed, as far as I know).

I suppose you could block the RU IP range and only enable the US range but I suspect they share data. I certainly would were it my bot/SE.
9:41 pm on May 31, 2013 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



dstiles,
hope you and Yandex are happy together ;)

BTW, Yandex is one of the robots.txt compliant bots I mentioned.

Unfortunately the bot generally uses IP's that I've denied and gets denied anyway.
1:09 am on Jun 1, 2013 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month




RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Yandex [OR]  -- Working
RewriteCond %{HTTP_USER_AGENT} ^msnbot [OR]  -- Working
RewriteCond %{HTTP_USER_AGENT} ^Owlinbot [OR]  -- Working
RewriteCond %{HTTP_USER_AGENT} ^sistrix [OR]  -- Failed
RewriteCond %{HTTP_USER_AGENT} ^genieo [OR]  -- Failed
RewriteCond %{HTTP_USER_AGENT} ^proximic [OR]  -- Failed
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [OR]  -- Failed
RewriteCond %{HTTP_USER_AGENT} ^AhrefsBot [OR]  -- Failed
RewriteCond %{HTTP_USER_AGENT} ^searchmetrics [OR]  -- Working
RewriteCond %{HTTP_USER_AGENT} ^SearchmetricsBot [OR]  -- Working
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [OR]  -- Working
RewriteCond %{HTTP_USER_AGENT} ^Baidu  -- Working
RewriteRule ^.* - [F,L]


You might try removing the start anchor (^).
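
For example (a sketch only, with case-insensitive matching added):

RewriteCond %{HTTP_USER_AGENT} Yandex [NC,OR]

and so on for the rest of the list, leaving [OR] off the final condition.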
8:23 pm on Jun 1, 2013 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



Wilderness - accepted that the IPs used by yandex are a mess - generally two or three per /24, or several small groups within a /24. A few days ago, in an attempt to coax them into more extensive coverage of one of my clients' sites, I enabled almost all yandex IP ranges completely. I also added a few more UAs - their list of bot UAs on their site is very good and a model for various other bots I will not mention.

For reference, the yandex ranges I have are as follows (most though not all of these are RU):

5.45.202.0 - 5.45.202.255
37.140.141.0 - 37.140.141.63
77.88.0.0 - 77.88.63.255
87.250.224.0 - 87.250.255.255
93.158.128.0 - 93.158.191.255
95.108.128.0 - 95.108.255.255
100.43.64.0 - 100.43.95.255
178.154.128.0 - 178.154.255.255
199.21.96.0 - 199.21.99.255
199.36.240.0 - 199.36.243.255
213.180.192.0 - 213.180.223.255

In the past I have run rDNS checks on the IP ranges to extract bot IPs but their system is too volatile. I figure that a valid bot UA on any of these IPs is probably valid.
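
For anyone managing these in .htaccess rather than at the firewall, the same ranges come out roughly as the following CIDR blocks (my own conversion; check against Yandex's published ranges before relying on it), usable with Allow from or Deny from:

5.45.202.0/24
37.140.141.0/26
77.88.0.0/18
87.250.224.0/19
93.158.128.0/18
95.108.128.0/17
100.43.64.0/19
178.154.128.0/17
199.21.96.0/22
199.36.240.0/22
213.180.192.0/19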
9:30 pm on Jun 1, 2013 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



For reference, the yandex ranges I have are as follows


Many thanks.

FWIW, except for the 199, I've all those Class A's denied, which was the thrust of my explanation.
10:29 pm on Jun 1, 2013 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



RE: Yandex

They make a statement somewhere on their main web site not to filter them by IP range because of frequent changes.
2:26 am on Jun 2, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



not to filter them by IP range because of frequent changes

I remember that line. But they really ought to know better. Claiming to be a major search engine is one of the simplest robot tricks; how many fake googlebots do you meet in the course of an average day?

I don't see a lot of fake yandexbots. Fake yandex searches, yes. Luckily they are very easy to distinguish from the real ones.
7:39 pm on Jun 2, 2013 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I think they mean "within the yandex IP ranges".

I've seen extra Bing and G bot IPs added from time to time over the past year or so since I initially checked their rDNS. The yandex IPs I've been seeing don't seem to have changed much: they've added a few more bot UAs (at least, I hadn't allowed them before), but the IP changes have been few.

I have to say that despite yandex being Russia-based I have more faith in their integrity than in G right now.
3:58 pm on Jul 4, 2013 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I've been looking further into Genieo. I've had a LOT of blocks this year on the tool, and it seems I'm blocking potentially genuine visitors. I have now enabled it as a bot, although that is a little simplistic for what it does.

User-Agent: Mozilla/5.0 (compatible; Genieo/1.0 http:// www[.]genieo[.]com/webfilter[.]html)

Brackets [] added by me, plus space after http.
7:19 pm on Jul 4, 2013 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I believe the reason I started blocking the genieo UA a few years ago was to keep my sites out of their database. This proved to be unsuccessful since it appeared they were using one of the index dumps available at the time. This may have changed and they may be crawling on their own.

After doing (little) research I discovered that users who installed their browser add-on could capture content and store it for a variety of uses, so I re-added the UA block. Of course almost all modern browsers can download web content to their local machines nowadays, but it was the "service" feature that annoyed me.

At the time I think I considered the fact that doing this blocks all users with the Genieo add-on, much like blocking any scraping tool's UA also blocks everyone who uses it.

FWIW - SERPs will show a few "remove Genieo" services/tools accompanied by user testimonies of trojan-like behavior.

[edited by: keyplyr at 7:59 pm (utc) on Jul 4, 2013]

7:49 pm on Jul 4, 2013 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I found approx 90 blocks on it during the past few months. All looked like ordinary broadband users. As far as I read, the tool stores content only on the local (i.e. the user's) computer, which is little worse than an ordinary cache.

It occurred to me, whilst reading their blurb, that this may be a good replacement for a certain ill-reputed search engine beginning with G... :)
9:41 pm on Jul 4, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



As far as I read, the tool stores content only on the local (i.e. the user's) computer, which is little worse than an ordinary cache.

How is this different from the ordinary "Save" ("HTML Complete", "Web Archive" or whatever your browser may call it) that all browsers can do?

If the content I'm looking for involves searching an eighty-five-megabyte text file, I think I and the site will both be happier if I've got a copy on my HD and don't need to bother their server.
10:36 pm on Jul 4, 2013 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month




How is this different from the ordinary "Save" ("HTML Complete", "Web Archive" or whatever your browser may call it) that all browsers can do?

I was under the impression the user could choose to save to a cloud service.
11:12 pm on Jul 4, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Oh.

YUK.

;)
4:40 pm on Jul 5, 2013 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



My point was: my clients were probably losing customers. I'll see how it goes. I may modify the action to block with a message if it gets out of hand.