Don't shoot the hostages - sparing DuckDuckGo on AWS

Poking holes in AWS ranges

         

JamesSC

11:47 pm on Mar 5, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



This is something I should probably already know how to do, but it seems I don't: how to shut down bad guys from an AWS (or any) range without blocking DuckDuckGo in the process.

By the way, does anyone have the actual addresses that DuckDuckGo crawls from?

The miscreant's IP falls in the AWS range 52.0.0.0-52.31.255.255 / 52.0.0.0/11 but, since I'm not sure (can't remember/didn't record) which address I last saw DuckDuckGo appear from that range, I'm currently reduced to blocking the single IP only, a losing play.

So as not to shoot the hostages, I'm currently reduced to leaving these entire AWS ranges wide open on the chance that a duck might decide to wander in:

107.20.0.0 - 107.23.255.255 / 107.20.0.0/14
23.20.0.0 - 23.23.255.255 / 23.20.0.0/14
34.192.0.0 - 34.255.255.255 / 34.192.0.0/10
50.16.0.0 - 50.19.255.255 / 50.16.0.0/14
52.0.0.0-52.31.255.255 / 52.0.0.0/11
54.192.0.0 - 54.255.255.255 / 54.192.0.0/10

Surely there is a better way.

How do you, more clever than I, do it?

lucy24

12:38 am on Mar 6, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



does anyone have the actual addresses that DuckDuckGo crawls from?
There's a very recent thread--which I now can't find, but someone will know--that includes a link to their About Our Robot page, giving their exact aa.bb.cc.dd addresses.

In situations like this I'm most likely to do it with mod_setenvif (assuming you're on Apache). It then looks like this:

SetEnvIf Remote_Addr ^52 bad_range=$0
....
BrowserMatch {authorized-distributed-robot} !bad_range

I think you can also do it with nested Require envelopes--block all of A except for AB, but also block ABC except ABCD--but I haven't yet explored the possibilities of this approach.
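
The nested-Require idea might look roughly like the following. This is only a sketch, assuming Apache 2.4 with mod_authz_core and mod_authz_host; the range and the single address are the ones quoted elsewhere in this thread, and the layout is one possible arrangement rather than a tested production rule.

```apache
# Sketch only: assumes Apache 2.4 (mod_authz_core + mod_authz_host).
# Let in the one known DuckDuckGo address, or anyone outside the AWS range.
<RequireAny>
    # The exception: a single address inside the blocked range.
    Require ip 52.5.190.19
    <RequireAll>
        # Everyone else is allowed...
        Require all granted
        # ...unless they come from the blocked range.
        Require not ip 52.0.0.0/11
    </RequireAll>
</RequireAny>
```

The negated line sits inside a RequireAll alongside a positive "Require all granted", because a section made up only of negative directives cannot grant access by itself.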

Edit: Found it! Not here in SSID but next door in Alternative Search Engines [webmasterworld.com].

not2easy

1:26 am on Mar 6, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I have not looked at this too closely myself - I need to take a look at more of the details to even try to put it together. My thoughts involve adding a UA to the allow pile where badbots = deny. I am not certain that is any better way to go but I don't intend to unlock swaths of AWS. I don't know how definite their UA is yet, how often spoofed. Behavior matters.

JamesSC

2:34 am on Mar 6, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



Thanks Lucy, thanks Not.

There's not really a way to use Mod_Authz_Host, is there; that is, to say

Order Allow,Deny
Allow from 52.5.190.19 (the DDG IP Lucy linked)
Deny from 52.0.0.0/11 (the full AWS range it falls within)

Or is there?

not2easy

2:55 am on Mar 6, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Yes, there is such a thing. Using SetEnvIf you can block some and allow some, the trick is in being certain about the criteria you're using to hand out a pass. Things in one environment get in and things in a different environment do not.

When a visitor contacts me about an accidental 403 (rarely, but it happens) I need to determine the scope to open up. A very large US server block (Cogent) is an example that has both residential visitors and a lot of unwanted bot traffic and they are very private about which is which. It can take some time to determine how big a hole to snip in a blocked range using only IPs. Some are much easier.

lucy24

3:30 am on Mar 6, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There's not really a way to use Mod_Authz_Host, is there
When I talk about using mod_setenvif for access control, that's always in conjunction with mod_auth-whatever. The access-control part will then be
Deny from env=bad_range
or, as the case may be,
Require env bad_range
(assuming a RequireNone envelope).

Apache 2.2 doesn't really allow for much granularity. Depending on your Order directive, all of the Allow lines are read after all of the Deny lines, or vice versa.
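
For the single-hole case asked about above, the choice of Order may be enough by itself. A minimal sketch, assuming Apache 2.2 syntax (or mod_access_compat on 2.4) and the addresses already quoted in this thread:

```apache
# Sketch only: Apache 2.2-style access control.
# With "Deny,Allow" the Deny lines are evaluated first and the Allow lines
# last, so an address matching both ends up allowed. Addresses matching
# neither list are allowed by default.
Order Deny,Allow
Deny from 52.0.0.0/11
Allow from 52.5.190.19
```

With the order reversed (Allow,Deny), the Deny line would win and the exception would be blocked along with the rest of the range. Note that Order takes its argument without a space after the comma.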

If you’re troubled with DuckDuckBot fakers--they're really not very common--you might stick with a fully ip-based setenvif:
SetEnvIf Remote_Addr ^52\. bad_range
SetEnvIf Remote_Addr ^52\.5\.190\.19$ !bad_range
Deny from env=bad_range
(But don't cut & paste. Most of this is from memory, and I've got cats all over the keyboard.)
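
On Apache 2.4 the same IP-only approach could be written with Require instead of Deny. A rough sketch, assuming mod_setenvif and mod_authz_core are loaded; Remote_Addr is the attribute name mod_setenvif documents for the client address, and bad_range is simply the variable name used earlier in this thread:

```apache
# Sketch only: assumes Apache 2.4.
# Flag the whole 52.x block, then unset the flag for the one known address.
SetEnvIf Remote_Addr ^52\. bad_range
SetEnvIf Remote_Addr ^52\.5\.190\.19$ !bad_range

<RequireAll>
    Require all granted
    # Refuse any request that still carries the flag.
    Require not env bad_range
</RequireAll>
```

The SetEnvIf lines are processed in order, so the line that unsets the variable has to come after the line that sets it.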

JamesSC

4:29 am on Mar 6, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



Thanks again, Lucy & Not.

It's obvious now SetEnvIf is the tool I need, whether by UA or IP.

Dimitri

9:05 am on Mar 6, 2020 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



next door

Hello from the next door guy :)

dstiles

11:13 am on Mar 6, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I allow a couple of robots from Amazon and block all others, using setenv. Note for below: the Amazon ranges are more than either bot requires - belt and braces - but other ranges are killed elsewhere.
# Cliqzbot - amazon
BrowserMatch Cliqzbot cliqz bot=cliqz

# duckduckgo - amazon
BrowserMatch DuckDuckBot|DuckDuckGo-Favicons-Bot duck bot=duck

# amazon bots for cliqz, duck - add more as needed (54.128.0.0/9 includes merck and short CN block)
<if "-R '3.0.0.0/8' || -R '18.128.0.0/9' || -R '13.32.0.0/12' || -R '13.48.0.0/13' || -R '13.56.0.0/14' || -R '13.112.0.0/14' || -R '13.124.0.0/14' || -R '13.208.0.0/14' || -R '13.228.0.0/14' || -R '13.232.0.0/13' || -R '34.192.0.0/10' || -R '35.152.0.0/13' || -R '35.160.0.0/12' || -R '35.176.0.0/13' || -R '50.16.0.0/14' || -R '23.20.0.0/14' || -R '52.0.0.0/10' || -R '52.64.0.0/12' || -R '52.84.0.0/14' || -R '52.88.0.0/13' || -R '52.192.0.0/11' || -R '54.64.0.0/13' || -R '54.72.0.0/13' || -R '54.80.0.0/12' || -R '54.128.0.0/9' || -R '54.192.0.0/12' || -R '54.208.0.0/13' || -R '54.216.0.0/14' || -R '54.220.0.0/15' || -R '107.20.0.0/14' ">
SetEnvIf Remote_Addr .* amazon ips=amazon:$0
</if>

...and then at the end of the file a whole list of Require statements including...
Require expr %{REQUEST_URI} =~ m#/robots\.txt#
Require expr %{REQUEST_URI} =~ m#favicon\.ico|apple-touch-icon\.png|apple-touch-icon-precomposed\.png#i
<RequireAll>
Require env amazon
<RequireAny>
Require env cliqz
Require env duck
</RequireAny>
</RequireAll>

ClosedForLunch

12:14 pm on Mar 6, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



By the way, does anyone have the actual addresses that DuckDuckGo crawls from?


23.21.227.69
50.16.241.113
50.16.241.114
50.16.241.117
50.16.247.234
52.204.97.54
52.5.190.19
54.197.234.188
54.208.100.253
54.208.102.37
107.21.1.8

JamesSC

4:10 pm on Mar 6, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



I do believe I've got it now.

Thanks again to everyone for your generous help.

wilderness

4:36 pm on Mar 6, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteCond %{REMOTE_ADDR} ^123\.456\.
RewriteCond %{HTTP_USER_AGENT} !DuckDuckGo
RewriteRule .* - [F]

Note: you may use any range (or multiple ranges, on additional lines joined with [OR]) in the REMOTE_ADDR condition.
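
A sketch of the multiple-range form, assuming mod_rewrite is enabled and using two of the AWS ranges listed at the top of the thread (52.0.0.0/11 and 54.192.0.0/10) written out as regular expressions:

```apache
# Sketch only: the regexes approximate the CIDR ranges named in the comments.
RewriteEngine On
# 52.0.0.0/11 -> second octet 0-31
RewriteCond %{REMOTE_ADDR} ^52\.([0-9]|[12][0-9]|3[01])\. [OR]
# 54.192.0.0/10 -> second octet 192-255
RewriteCond %{REMOTE_ADDR} ^54\.(19[2-9]|2[0-9][0-9])\.
# Exempt requests whose user agent contains the DuckDuckGo string.
RewriteCond %{HTTP_USER_AGENT} !DuckDuckGo
RewriteRule ^ - [F]
```

Without [OR], consecutive RewriteCond lines are ANDed, and no address can match two different ranges at once. The conditions joined by [OR] are evaluated as a group, so the rule reads as "(range A or range B) and not DuckDuckGo".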

ClosedForLunch

8:16 pm on Mar 6, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



RewriteCond %{HTTP_USER_AGENT} !DuckDuckGo


The DuckDuckBot doesn't always have the text 'DuckDuckGo' in the user agent. For the user agents that do, add [NC] to the RewriteCond because sometimes 'DuckDuckGo' is lowercase.
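
A sketch of a user-agent condition that takes both points into account, assuming the bot identifies itself with either "DuckDuckBot" or "DuckDuckGo" (the two strings mentioned in this thread) and reusing the 52.x example range:

```apache
# Sketch only: the UA strings are the ones quoted in this thread,
# not a verified list of everything the crawler sends.
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^52\.
# [NC] makes the match case-insensitive, covering "duckduckgo" as well.
RewriteCond %{HTTP_USER_AGENT} !(DuckDuckBot|DuckDuckGo) [NC]
RewriteRule ^ - [F]
```

A user agent is trivial to spoof, so this only keeps out visitors that do not bother to pretend.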

lucy24

9:02 pm on Mar 6, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's probably less work to say
(DuckDuckGo|duckduckgo)
Less work for the server, that is. (Unlike some situations, this will be the same whether the rule is in config or htaccess, since the flatten-and-compare has to be done over again on every request.) And it's never going to be
dUCKdUCKgO

But if you've got the exact, down-to-the-last-ddd IP available, use it.

Dimitri

10:31 pm on Mar 6, 2020 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Beware: I think DDG has its own web browser app (for mobile devices), and it's possible this browser includes the word DuckDuckGo in its UA.

wilderness

8:58 am on Mar 8, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



FWIW, the basic lines are simply a guideline.
Either the UA or IP (or multiple lines of each) may be changed to anything you desire.