Block everything from 54. except certain bots

         

physics

3:51 pm on Jan 28, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Like many, I was seeing so many bad bots from Amazon's 54. range that I tried blocking it. However, this also blocks Pinterest's code, which tries to fetch an image when people are pinning from my site (which I do want to allow).
I use SetEnvIfNoCase in httpd.conf to globally block spiders like so:


# UA-Based blocking

SetEnvIfNoCase ^User-Agent$ .*(aesop_com_spiderman|ADmantX|alexibot|backweb|bandit|batchftp|bigfoot) HTTP_SAFE_BADBOT
#.. lots more of these

# IP range based blocking
SetEnvIfNoCase Remote_Addr ^54\. HTTP_SAFE_BADBOT
# ...



Then, in the .htaccess where I want to apply these rules (block bots)

Deny from env=HTTP_SAFE_BADBOT


So what I'd like to do is keep the blanket 54. block, but allow requests from that Class A range only if the UA matches certain strings, e.g. Pinterest.

Any ideas?

wilderness

5:17 pm on Jan 28, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's much easier to accomplish this with mod_rewrite, using multiple conditions and exceptions.

Unfortunately, unless you add some sort of IP restriction that confirms Pinterest's IPs (easy to do), you leave yourself vulnerable to a faked Pinterest UA (in all honesty, the chances that a fake Pinterest would come from inside 54/8 are slim):

#deny 54/8 except UA contains Pinterest
RewriteCond %{REMOTE_ADDR} ^54\.
RewriteCond %{HTTP_USER_AGENT} !Pinterest
RewriteRule .* - [F]
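
If you do want that IP restriction, one way would be something like this sketch (the 54.8[0-2] sub-range is only a placeholder -- substitute whatever ranges you've actually verified Pinterest fetching from in your own logs):

# deny 54/8 unless the UA says Pinterest AND the address is in a verified sub-range
RewriteCond %{REMOTE_ADDR} ^54\.
RewriteCond %{HTTP_USER_AGENT} !Pinterest [OR]
RewriteCond %{REMOTE_ADDR} !^54\.8[0-2]\.
RewriteRule .* - [F]

The [OR] groups the last two conditions, so a request only gets through when both the UA and the address check out.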

lucy24

7:14 pm on Jan 28, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



SetEnvIfNoCase ^User-Agent$ .*(aesop_com_spiderman|ADmantX|alexibot|backweb|bandit|batchftp|bigfoot) HTTP_SAFE_BADBOT
#.. lots more of these

Don't use NoCase unless you absolutely have to. And there's no need for the .* bits at all. List your robots in the casing they actually use-- for example, GoogleBot (sic) is sometimes used by bad robots. Then make a separate NoCase list for only those robots that use so many casings, it isn't enough even to say [Nn]asty[Ss]tinky[Bb]ot.

You also don't need anchors on the header name "User-Agent", unless you're much plagued by visitors using a supplementary header whose name includes the string "User-Agent" somewhere in the middle.

Finally, you may have overlooked the special mod_setenvif notation specifically for UA strings:
BrowserMatch
BrowserMatchNoCase
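
A rough sketch of what the list from the first post might look like along those lines (which names belong in which list depends on the casings you actually see in your logs):

# exact casings, matched case-sensitively
BrowserMatch (aesop_com_spiderman|ADmantX|backweb|batchftp|bigfoot) HTTP_SAFE_BADBOT
# only the robots that turn up in too many casings to list
BrowserMatchNoCase (alexibot|bandit) HTTP_SAFE_BADBOT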

# IP range based blocking
SetEnvIfNoCase Remote_Addr ^54\. HTTP_SAFE_BADBOT
...
Deny from env=HTTP_SAFE_BADBOT

I gotta say that is a very weird name for your environmental variable, since it sounds as if it means the opposite of what it says. In any case-- haha-- there's absolutely no reason for the NoCase element here, since you're not matching alphabetic text.

To achieve what you want-- "Deny from everyone meeting this condition, except the ones I specify"-- in mod_setenvif, use the ! which means "unset this variable"-- i.e. don't just set its value to 0, false or "" but remove it entirely. Like this:

SetEnvIf Remote_Addr ^54\. bad_bot
(I assume the trailing \. is to protect against IPv6 addresses, since it's redundant in IPv4.)

BrowserMatch (goodbot|othergoodbot|Pinterest) !bad_bot

The "un-set" line with ! obviously has to come after the "set" line.
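
Put together with your existing Deny directive, the whole thing would look something like this (2.2-style access control, as in your original post; Pinterest stands in for whatever UAs you want to let through):

# flag the whole 54/8 range ...
SetEnvIf Remote_Addr ^54\. bad_bot
# ... then remove the flag again for the UAs you want through
BrowserMatch Pinterest !bad_bot

Order Allow,Deny
Allow from all
Deny from env=bad_bot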



:: obligatory disagreement with wilderness ;) ::

Sure, mod_rewrite is easy once you've got the hang of it. But it's fairly server-intensive, and thanks to wonky inheritance it's only practical when all requests pass through a single htaccess file.

For myself I prefer a two-pronged approach. First there's an htaccess file in my userspace for any directive that's shared by all sites. That means access control via mod_auththingummy augmented by mod_setenvif, and selected <Files> envelopes and headers for things like robots.txt that occur on all sites. Then each individual site (within the userspace) gets its own htaccess using almost exclusively mod_rewrite.
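
As a very rough sketch of that split (paths and rules are made up, just to show which module lives where; the Header line assumes mod_headers is loaded):

# ~/.htaccess -- shared by every site in the userspace:
# access control (SetEnvIf / BrowserMatch / Deny as above) plus
# a <Files> envelope that applies everywhere
<Files "robots.txt">
Header set Cache-Control "max-age=86400"
</Files>

# ~/site-one/.htaccess -- per-site rules, almost exclusively mod_rewrite
RewriteEngine On
RewriteRule ^old\.html$ /new/ [R=301,L]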

wilderness

10:55 pm on Jan 28, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sure, mod_rewrite is easy once you've got the hang of it. But it's fairly server-intensive,


lucy,
With all due respect, hogwash.
You make the difference sound like HOURS rather than NANOSECONDS; however, you're certainly entitled to your preference and exaggeration ;)

Don

physics

11:10 pm on Jan 28, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



wilderness, I use mod_rewrite a lot for other things, but some post somewhere (probably here) convinced me to use SetEnvIfNoCase because it's supposedly faster, less error-prone, and easy to dump into httpd.conf and then use on any site on the server.

lucy24, thanks for the info and the solution with the "!". The weird name comes from whoever named it when I originally grabbed it, and I've never really cared what it was ... I mean, it does say BADBOT in it, so it's never been confusing to me.

keyplyr

12:36 am on Jan 29, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



use SetEnvIfNoCase because it's supposedly faster, less error prone, and was easy to dump into httpd.conf and then use on any site on the server.

I agree, but I don't use the "NoCase" since I only use this for IP ranges.

lucy24

4:25 am on Jan 29, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



easy to dump into httpd.conf and then use on any site on the server

That's probably the most important consideration. mod_setenvif is inherited in the normal Apache way, while mod_rewrite isn't. (Even if you say "RewriteOptions inherit" every time, it still doesn't behave quite like other mods.)
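
For reference, the incantation in question goes in the child htaccess (a sketch; even then, as the later posts show, the inherited rules don't kick in where you might expect):

# child htaccess: ask for the parent's RewriteRules as well
RewriteEngine On
RewriteOptions Inherit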

Locutions like NoCase or [NC] are easy to type, but they give the server more work. When you say
Googlebot [NC]
it doesn't just mean
Googlebot and googlebot, i.e. [Gg]ooglebot
Instead it's equivalent to saying
[Gg][Oo][Oo][Gg][Ll] ... etcetera. My fingers get tired just typing it.

wilderness, I've got the impression that you're only running one site at your main location. I've now got six. Admittedly this is at least five more than I absolutely need, but it does make global RewriteRules slightly impossible.

wilderness

6:22 am on Jan 29, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've got the impression that you're only running one site at your main location. I've now got six.


You know what they say about assume ;)

FWIW, were I to implement this procedure in the primary root domain with mod_setenvif, would mod_rewrite in the lower domains run before or after the mod_setenvif in the primary root domain?

lucy24

7:25 am on Jan 29, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Edit:
Blast. After doing all this experimenting, I realize I'd misread your question.

mod_setenvif runs before mod_rewrite-- at least on my system. ("Reverse alphabetical order" seems to be a workable informal rule for most though not all situations.) Each module runs its whole course, from config file down to the littlest htaccess, before handing off to the next module.

That means you can invoke environmental variables in mod_rewrite that you set earlier in mod_setenvif. You can also both set and un-set environmental variables. (Says the docs. I haven't personally tested.) This would have applications in access control, since mod_authwhatsit runs after pretty much everything else.
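
For example (a sketch, assuming the variable is set on the initial request; after an internal redirect it reappears with a REDIRECT_ prefix):

# set by mod_setenvif, which runs first ...
SetEnvIf Remote_Addr ^54\. from_aws
# ... then tested in mod_rewrite via %{ENV:}
RewriteCond %{ENV:from_aws} ^1$
RewriteCond %{HTTP_USER_AGENT} !Pinterest
RewriteRule .* - [F]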

Interesting: The docs for the [ENV] flag say explicitly
VAL may contain backreferences ($N or %N) which will be expanded.

The docs under CO don't say anything about this; I had to find out by experimentation.
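
Something like this, for instance (a hypothetical rule, using the E= short form of the flag):

# capture the basename and hand it to later stages (headers, logs, CGI)
RewriteRule ^images/(.+)\.(jpe?g|png|gif)$ - [E=IMGNAME:$1]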


It seems to go around in circles.
If you do not say
RewriteOptions inherit
then any mod_rewrite activity in a deeper directory would overwrite mod_rewrite activity from a higher-level (shared) directory.
If you do say "inherit", then
:: detour to docs for exact quotation ::
Rules inherited from the parent scope are applied after rules specified in the child scope.

(emphasis theirs)

But after further experimentation in a test site, I'm ### if I can arrive at any permutation of rules and requests that will result in RewriteRules from an outer htaccess being honored if the request ends up passing through an inner htaccess. Maybe another day I'll experiment in MAMP and see what happens if I put stuff in the config file.

I think the short version is: don't even bother to try ;)

And this is coming from someone who's got RewriteRules in a <Files> section because nobody told me you weren't supposed to.

keyplyr

10:16 am on Jan 29, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




mod_setenvif runs before mod_rewrite

ditto

blend27

4:29 pm on Jan 30, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Correct me if I am wrong, but it seems that the Pinterest bot comes from *.compute-1.amazonaws.com.

And it seems like 54.8[0-2]... At least on my sites, from my old records.

reverse_dns
ec2-54-82-48-153.compute-1.amazonaws.com
ec2-54-81-231-69.compute-1.amazonaws.com
ec2-54-81-223-42.compute-1.amazonaws.com
ec2-54-81-141-43.compute-1.amazonaws.com
ec2-54-81-126-250.compute-1.amazonaws.com
ec2-54-81-126-250.compute-1.amazonaws.com
ec2-54-81-12-223.compute-1.amazonaws.com
ec2-54-81-119-250.compute-1.amazonaws.com
ec2-54-80-59-170.compute-1.amazonaws.com
ec2-54-80-59-170.compute-1.amazonaws.com
ec2-54-80-50-228.compute-1.amazonaws.com
ec2-54-80-47-123.compute-1.amazonaws.com
ec2-54-80-228-160.compute-1.amazonaws.com
ec2-54-80-217-240.compute-1.amazonaws.com
ec2-54-80-208-163.compute-1.amazonaws.com
ec2-54-80-169-179.compute-1.amazonaws.com
-----------------------------------------

54.* is nuked on many of the sites that I run, as well as anything that has .amazonaws.com at the end of its RDNS string.
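
For an Apache setup like the OP's, the nearest equivalent would be something along these lines (a sketch; mod_authz_host accepts partial domain names, at the cost of a double reverse DNS lookup on every request it has to check):

# deny anything whose verified hostname ends in amazonaws.com
Order Allow,Deny
Allow from all
Deny from amazonaws.com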

lucy24

7:38 pm on Jan 30, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Obligatory caution: If you're not already doing RDNS lookups, think twice about whether you want to add this step. Don't know about IIS, but it plays absolute havoc with your Apache logs. (It also creates more work for the server, but that part's a judgement call.)

keyplyr

9:36 pm on Jan 30, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@blend27 - While it may also come from other ranges, I recall that when I needed the Pinterest bot to check my site (so I could add a homepage link) I poked a temporary hole in that AWS range... very temporary :)

physics

9:42 pm on Jan 31, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



keyplyr - are people able to pin things from your site successfully with the Pinterest bot blocked?

keyplyr

2:43 am on Feb 1, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@physics No, since I closed that temporary hole. Their upload utility uses an AWS range, as I remember. However, they could have changed it since the last time I checked.

I don't have the type of site where I allow stuff like that. In fact, I have taken great measures to block people from using my images or other intellectual property on other sites. However, I can't block web-savvy users who could take several steps to work around my blocking techniques, but those aren't the guys I worry about.

I can pin by uploading from my machine. People can re-pin my stuff (if that's the correct term) but that's it. This way I control exactly what is shared.

trintragula

12:05 pm on Feb 1, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



70% of the pinterest bots I see are from 54/8.
1% of my active members have posted from 54/8 in the last year.
RDNS is too slow on my site to do on every request. YMMV.

blend27

2:48 pm on Feb 1, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@Lucy
..RDNS... Don't know about IIS, but it plays absolute havoc with your Apache logs.

I am on IIS, but I do it programmatically using the java.net.InetAddress utilities - .getByName(address).getCanonicalHostName() in this case.

IIS logs stay clean, and the script runs on sessionStart - once per visit.

An added bonus is that most ISPs (at least in the US and Canada) have RDNS set up. Lots of hosting companies also have it (Amazon, for example). The lookup is practically instantaneous - less than 10 ms - if the IP has RDNS. On the other hand, I also use IP tables to determine the IP's country, via an SQL query. That table is very large, so the scan can take up to a second or more on busy days. So if I see .comcast.net I know it is a US IP, and if I see .shawcable.net or .bell.ca I know it is Canada, which saves me a DB lookup.

On the other hand, if there is no RDNS at all (getCanonicalHostName() just returns the IP unchanged), there is a chance that it is a new hosting range that I have not caught (yet).

Or there is a match against the ever-growing list of hosting ranges - something like these:

.amazonaws.com
.your-server.de
.poneytelecom.eu
.server4you.net
.hosteurope.de
.softlayer.com
.theplanet.com
.ovh.net
.allrati.com
.xlhost.com
.serverloft.com
.fastwebserver.de

The IP gets blocked, and if I don't already have the range blocked, there is a lookup in the regular IP table and the range gets added to the hosting-ranges table, temporarily, until I can verify it manually.

I let the Compooter do the work for me first.

keyplyr

8:51 pm on Feb 1, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




programmatically

:)