
Block ALL bots except google

Looking for the best way to block all bots/crawlers other than google.

         

classicman

8:08 am on Sep 1, 2015 (gmt 0)

10+ Year Member



Hi All,

I'm looking for the best way to accomplish the following items:
1. Block all automated bots/crawlers/spiders
2. Allow bots/crawlers/spiders from google
3. Allow human-operated browsers (so people can still visit the site and see content)

SetEnvIfNoCase User-Agent mozilla.* good_guys
SetEnvIfNoCase User-Agent opera.* good_guys
SetEnvIfNoCase User-Agent .*google.* good_guys

<Limit GET POST HEAD>
Order Deny,Allow
Deny from all
Allow from env=good_guys
</Limit>

I am considering using the above in .htaccess, but can't be sure this works against malicious bots that use fake user-agent strings to bypass this type of filtering. Anyone have ideas on other ways to accomplish this?

Side note: Can anyone explain the use of the <Limit> </Limit> directive?

lucy24

6:22 pm on Sep 1, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This may be a textbook case of a situation where NoCase (or the [NC] flag in mod_rewrite) is NOT appropriate. When whitelisting, you want to flag the correct form, including the correct casing: for example, "Googlebot". The form "GoogleBot" (I have really met this) is a malign spoofer. Use anchors wherever possible to make things happen faster:
^Mozilla
^Opera
et cetera. But this isn't enough, because increasing numbers of botrunners have figured out that using a standard robotic UA (like anything containing the elements "lib-www" or "perl" or, for that matter, "bot") will probably get you blocked forthwith. So, paradoxically, legitimate search engines will have distinctive UAs, but unwanted Ukrainians will claim to be some plausible human browser.

There is no need for any of the .* in your examples, because you're not capturing. Do use ^ anchors where appropriate.

There is also no need for a <Limit> envelope unless you explicitly want to set different rules for some particular type of request. In fact, the Apache docs have some blahblah about it not being a good idea in general. Suppose someone shows up with a request method you've never heard of? You don't want to give them a free ride just because their method isn't listed by name. (A special case is PUT-- but that one's probably already blocked by permissions set at the server level. It's the kind of thing that a shared host has to do to protect everyone.)

A <Limit> envelope can be used for supplementary blacklisting: for example, block all POST requests except to specified URLs. But you could also do that in mod_rewrite.
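For illustration, a minimal mod_rewrite sketch of that kind of supplementary blacklist might look like the following (the /contact.php path is only a placeholder, not something from this thread):

RewriteEngine On
# Sketch only: refuse POST requests everywhere except a hypothetical form handler
RewriteCond %{REQUEST_METHOD} ^POST$
RewriteCond %{REQUEST_URI} !^/contact\.php$
RewriteRule .* - [F]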

classicman

8:18 pm on Sep 1, 2015 (gmt 0)

10+ Year Member



Thanks so much for all the helpful tidbits.

SetEnvIf User-Agent ^Mozilla.* good_guys
SetEnvIf User-Agent ^Opera.* good_guys
SetEnvIf User-Agent ^Googlebot.* good_guys

Order Deny,Allow
Deny from all
Allow from env=good_guys

Would the above have better coverage for allowing the desired user-agents while blocking other bots? I've included some pattern matching (.*) at the end of the strings to match different versions of browser user-agents and the various bots from Google.

Some remaining questions:
1. Would you recommend using all the bot names from google explicitly instead?
2. Isn't it true that different browsers (I mainly care about the most popular: Chrome, Firefox, Opera, IE, etc.) could have some different trailing characters? I.e. Mozilla/5.0 etc.
3. Is there another way, other than using pattern matching in user-agent strings, to allow human operated browsers access?

not2easy

9:03 pm on Sep 1, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



SetEnvIf User-Agent ^Mozilla.* good_guys
Will allow those bots that are not even honorable enough to say who they are. You can't really block bad bots by UA because as lucy24 mentioned, they wear disguises. We all wish it were that simple, but to keep unwanted traffic out, UA is not as predictable as IP. It takes some access log wrangling to determine what behavior you want to allow, and where the unwanted behavior is coming from.

lucy24

9:23 pm on Sep 1, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've included some pattern matching (.*) at the end

You don't need to. The absence of a closing anchor already means "there might be other stuff after this part".
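As a quick illustration (a sketch, not a recommendation): the anchored pattern below already matches the full Googlebot UA string, with no trailing wildcard needed.

# Matches "Googlebot/2.1 (+http://www.google.com/bot.html)" as-is
SetEnvIf User-Agent ^Googlebot good_guys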

tangor

9:29 pm on Sep 1, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The UA matching is rather iffy. Too easy to defeat (your attempt).

I use a combination of whitelisting and blacklisting, but the true whitelisting genius here is IncrediBILL. Search WebmasterWorld for Whitelisting or White Listing and you will find several hundred instructive posts on this subject. And probably a few revelations along the way about how difficult it is to do properly.

keyplyr

11:10 am on Sep 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



IMO true whitelisting is effective on very few web sites. If you have a general audience, it will take a combination of several defensive devices to accomplish the desired results, and even then there are new and unexpected threats born every day. As Monty once said, "No one expects the Spanish Inquisition."

dstiles

6:54 pm on Sep 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Not all SEs are good at identifying their bot IPs and some have their IPs scattered wildly across an IP range. My solution: if the request comes from the SE's larger bot range (e.g. a /24) and it matches their bot UA, it should be OK; the bot UA check is usually precise insofar as it rejects image bots and similar rubbish.

classicman - why only allow google? There are a good handful of bots that return useful traffic, including bing and yandex.

wilderness

11:57 am on Sep 3, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



FWIW, search the forum for 'fake google'.

Legitimate Google bots only come from the ranges 66.249.64-79
All the rest of the Google ranges are from their vast catalog of tools.

keyplyr

7:21 pm on Sep 3, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And of that "vast catalog of tools" some are used by Google and some by anyone. Some may be important for your site and some may cause damage to your site.

tangor

8:30 pm on Sep 3, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@keyplyr +1!

classicman

9:33 pm on Sep 4, 2015 (gmt 0)

10+ Year Member



My solution is: if it comes from the SE's larger BOT range (eg /24)


Could someone explain this?

Legitimate Google bots only come from the ranges 66.249.64-79
All the rest of the Google ranges are from their vast catalog of tools.


From everyone's input, would the following handle what you are describing:

SetEnvIf User-Agent ^Mozilla.* browsers
SetEnvIf User-Agent ^Opera.* browsers
SetEnvIf User-Agent ^Googlebot.* good_guys

Order Deny,Allow
Deny from all
Allow from env=browsers
Allow from 66.249.64
Allow from 66.249.65
Allow from 66.249.66
Allow from 66.249.67
Allow from 66.249.68
Allow from 66.249.69
Allow from 66.249.70
Allow from 66.249.71
Allow from 66.249.72
Allow from 66.249.73
Allow from 66.249.74
Allow from env=good_guys

Could anyone offer guidance on syntax, order of precedence, etc.? Will the above allow browsers even though they are not from the IP blocks specified?

classicman - why only allow google? There are a good handful of bots that return useful traffic, including bing and yandex.


I may decide to allow bots from other useful SEs, but want to get the theory down first and slowly open the gates.

Thanks everyone

not2easy

9:57 pm on Sep 4, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If you are talking about blocking Bing and Yandex, just use robots.txt. The UA can be set in the browser; it is an easily spoofed parameter that any botrunner can get around. I suggest that you not allow or deny based on such broad terms. You would be blocking at least half of all legitimate traffic and letting automated bots/crawlers/spiders do as they please.

The rules you are trying to add will only accomplish #2 of your intent. Google will be allowed in.
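For what it's worth, a whitelist-style robots.txt along the lines of the stated goal might be sketched like this (it only affects crawlers that actually honor robots.txt):

# Allow Googlebot everything; ask all other compliant crawlers to stay out
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /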

tangor

11:02 pm on Sep 4, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



robots.txt will accomplish quite a bit... for those spiders that honor it.

Cranking down on ONLY WHAT I LET IN is a tough row to hoe. White listing works, but is very draconian and you can miss/lose valuable traffic.

The real question is why the specification that only Google gets in, and why the need to do it in .htaccess. robots.txt, while not mandatory, does a fair job for those bots that honor it and those that don't are a little easier to slap down.

lucy24

2:43 am on Sep 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



SetEnvIf User-Agent ^Googlebot.* good_guys

I suspect you've been misinformed about what ^ and .* mean.

The Allow/Deny directives work with CIDR ranges. Learn them.
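As a sketch of what that means here: the 66.249.64-79 range mentioned earlier collapses into a single CIDR line instead of a dozen separate Allow lines.

# 66.249.64.0/20 covers 66.249.64.0 through 66.249.79.255
Allow from 66.249.64.0/20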

keyplyr

11:49 am on Sep 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@ classicman

• As said above, use robots.txt

• To block other bots that don't obey robots.txt (these *should* be blocked IMO) you will need to block the bad neighborhoods they come from: server farms, hosting companies, VPNs, colocation, databases & cloud servers. Sometimes they have bot user agents (UAs) and sometimes they pretend to be human by using common browser UAs. IMO the best way to block them is to deny/allow the IP ranges using CIDR.

• The other bots that come from ISPs or are offered at developer/download sites and used by anyone can be blocked by UA since they may come from various ranges.

• There *will* be collateral damage. Any time you block ranges, you also block humans that surf the web from work computers or use a company mobile plan. Mobile apps (example: Facebook for iPhone/Android) also use these server farms. To allow these people, it may be necessary at some point to learn to poke holes in the blocked ranges (a bit more complicated.)

• Also, bad guys pretend to be good guys all the time. Googlebot, Bingbot, Yandex bot, Baidu Spider and other good bots are faked most often. To block the fakers you'll need to use IP/UA filters. These are rewrite rules that allow a UA from only specific IP ranges (see the sketch after this list).

• Be careful not to cut'n'paste lists of so-called bad bots from forum posts until you research and decide for yourself that these are in fact bad. What is bad for one site may be good for another site, and vice versa.

• Those are the basics. It takes time to learn the skills to manage this. I suggest spending a lot of time reading here at WW and other webmaster resources.
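A rough sketch of such an IP/UA filter, using the 66.249.64-79 Googlebot range mentioned earlier (treat it as an illustration to adapt, not a tested rule, and it assumes mod_rewrite is available):

RewriteEngine On
# Refuse anything claiming to be Googlebot that is not in 66.249.64.0 - 66.249.79.255
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|7[0-9])\.
RewriteRule .* - [F]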

BTW - don't use that code you posted above :)

wilderness

2:54 pm on Sep 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



BTW - don't use that code you posted above


ditto!
You're looking for a simple solution when no such thing exists.
In addition, multiple replies have been provided informing you of errors in your syntax, and yet you persist.

not2easy

5:07 pm on Sep 5, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



This, at the beginning, from lucy24:
There is no need for any of the .* in your examples, because you're not capturing. Do use ^ anchors where appropriate.
is telling you to drop the .* in your example. You can use the forum search or visit the Library of this forum for examples and some enduring information about what you are trying to do. People are taking their time to help, but you need to make the effort, because there is no easy, one-size-fits-all way to do what you have stated are your goals.

If you had tried looking around, you would know that a "CIDR" is a range of IP addresses, so instead of one line for each IP in your list, you can use one line with the CIDR. An example:
If the robot is visiting your site from 144.76.64.115 (a real example) and you block it with "deny from 144.76.64.115", it will likely come back in a few minutes from 144.76.64.165. But when you look up the IP, you will notice it is not from an ISP; it is from a hosting company, "Hetzner", so you probably want to block that robot and all the other robots from that server. The simple way to do that, without listing all the possible IPs the bot might come from, is to use the CIDR 144.76.0.0/16, which you can think of as an envelope that holds all the IPs from 144.76.0.0 to 144.76.255.255.
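In .htaccess terms that single envelope would be something like (a sketch using the range from the example above):

# One line covers every address from 144.76.0.0 through 144.76.255.255
Deny from 144.76.0.0/16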

Spend a little time reading through the discussions about UAs and IPs that are here and you can learn how to manage the traffic that hits your sites.