Bad Behavior by AmazonAWS and what to do about it?

JesterMagic

3:07 pm on Feb 5, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



AmazonAWS bots have been frequenting my sites a lot lately. It's at the point now where I need to do something about it, as the traffic is being shown as direct traffic by Google Analytics and skewing my results from good traffic.

Any reason not to block AmazonAWS to prevent them from chewing up server resources? (beyond the odd few legitimate users)

I did a quick search and a lot of the answers are a few years old, like this one:

[webmasterworld.com...]

From the topic above it looks like blocking by IP in htaccess is the way to go.

Anyone have an updated list of IP ranges for AmazonAWS (and any other badly behaved cloud services like your-server.de)?

Any websites you'd recommend for getting updates to these IP ranges, so I don't have to pester people in this forum?

Thanks

lucy24

5:22 pm on Feb 5, 2021 (gmt 0)


Any reason not to block AmazonAWS to prevent them from chewing up server resources?
Nope. Block away!

If you have acceptable robots from AWS ranges, you'll need to make rules that poke holes--for example, by setting and then un-setting an environmental variable, or having multiple RewriteCond. Otherwise you can block them in one fell swoop with a single Require line (assuming Apache).
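To make the "one fell swoop" option concrete, here is a minimal sketch (not lucy24's actual rules) of an Apache 2.4 htaccess block; the CIDR ranges shown are placeholders, not a real AWS list:

```apache
# Sketch only -- these CIDR ranges are placeholders; pull the real
# AWS ranges from Amazon's published list before using anything like this.
<RequireAll>
    Require all granted
    # refuse everything from these (hypothetical) AWS ranges
    Require not ip 3.0.0.0/8
    Require not ip 52.64.0.0/12
</RequireAll>
```

The hole-poking variant lucy24 describes (set an environment variable on the range, clear it for a good robot, then test the variable) is shown later in this thread.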

not2easy

5:39 pm on Feb 5, 2021 (gmt 0)


To find them all in these forums, be prepared for some repetition, because new ranges often spawn fresh lists, and those lists mix new and old ranges. See here for example: [webmasterworld.com...]

BTW, for whatever reason Google is not the most useful choice for our site search at this time; I recommend using DDG. If you're on desktop, the search box is in the upper right of any page in the forum.

JorgeV

7:12 pm on Feb 5, 2021 (gmt 0)


Anyone have an updated list of IP ranges for AmazonAWS

AWS has it:
[ip-ranges.amazonaws.com...]
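For reference, the published file is JSON with a `prefixes` array of objects carrying `ip_prefix`, `region`, and `service` fields. A quick sketch of filtering it (the function name is mine, and a tiny inline sample stands in for the real, much larger file):

```python
import json

# Tiny inline sample mimicking the structure of ip-ranges.json;
# the real file at ip-ranges.amazonaws.com is far larger.
sample = """
{
  "syncToken": "1612500000",
  "prefixes": [
    {"ip_prefix": "3.5.140.0/22", "region": "ap-northeast-2", "service": "AMAZON"},
    {"ip_prefix": "52.93.178.234/32", "region": "us-west-1", "service": "EC2"},
    {"ip_prefix": "52.94.76.0/22", "region": "us-west-2", "service": "AMAZON"}
  ]
}
"""

def ec2_prefixes(doc: str) -> list:
    """Return the IPv4 CIDR blocks tagged with the EC2 service."""
    data = json.loads(doc)
    return [p["ip_prefix"] for p in data["prefixes"] if p["service"] == "EC2"]

print(ec2_prefixes(sample))  # just the EC2-tagged blocks
```

Most crawler traffic comes from the EC2-tagged prefixes, which is why filtering on the `service` field is useful before dumping thousands of ranges into a blocklist.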

JesterMagic

8:46 pm on Feb 5, 2021 (gmt 0)


@lucy24 (or anyone else) Any chance you can share your current "deny from" list from your htaccess file for Amazon? I don't mind holes in the list for legitimate bots like the Wayback machine, etc...

lucy24

9:58 pm on Feb 5, 2021 (gmt 0)


I no longer use Deny... since my host finally moved up to 2.4. (Crystal ball says this means 2.6 is about to be released, if it isn't already.) In fact, I rarely use IP-based access controls at all; it primarily goes by header using SetEnvIf. Some law-abiding robots use AWS ranges, so I tend to have rules like

SetEnvIf Remote_Addr ^5\.253\.19\b bad_range=$0
...
BrowserMatch goodrobot !bad_range
as opposed to
Require ip 5.253.19
(inside a RequireNone envelope, of course) if I don't need to poke holes.

I put "5.253.19" because it happens to be the first in my numerical list in htaccess, but really I don’t know why I poked the hole; I can’t see any legitimate requests from that neighborhood. Maybe it was used by some honorable robot in 2017.

JesterMagic

10:41 pm on Feb 5, 2021 (gmt 0)


Ah, I checked my WHM and I am using Apache/2.4.46. It looks like the module mod_setenvif is installed.

I have only been using Apache for a year and a half now coming from an IIS and Windows Server environment.

I've set up rewrites and deny lists in htaccess but have never tried SetEnvIf.

Any chance you could send me your complete set of rules/directives I would need to add to my htaccess file? From there I should be able to figure out what you are doing and be able to add my own rules for other bots from other cloud services that are visiting my sites.

I understand if you do not want to as you probably have put a lot of work into the list.

lucy24

1:08 am on Feb 6, 2021 (gmt 0)


Any chance you could send me your complete set of rules/directives I would need to add to my htaccess file?
Sorry, no, it's all just too site-specific. In fact, user-space-specific, since I have SetEnvIf and Require directives in a shared htaccess, and then each site has some further RewriteRules for things that would only apply to one site, or specific filenames. (If your hosting setup uses the primary/addon structure instead of userspace, you would put the shared rules in the "primary" site's htaccess, which is seen by all sites.)

Apache 2.4 comes with a module called mod_access_compat, whose sole function is to prevent sites from exploding if they still use the old Allow/Deny directives. But it is not recommended to use those concurrently with Require, due to the dreaded Unintended Consequences.

If you've got how-to questions, the Apache subforum is probably the best place to ask.

dstiles

9:29 am on Feb 6, 2021 (gmt 0)


I block a lot of the amazon range in iptables (firewall) BUT beware:

duckduckgo uses a few single IPs from the range - check here [help.duckduckgo.com...]

As Lucy says, other wanted bots also use amazon - make holes as required.

Let's Encrypt uses Amazon (and MS and other) IPs with no published list, and blocking them can upset cert renewals. I block Amazon (and others) in iptables using (e.g.)...
-A INPUT -s 52.64.0.0/12 -p tcp -m multiport --dports 443 -j DROP

...which seems to work ok.
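If a wanted bot lives inside a range you drop, an ACCEPT rule placed before the DROP pokes the hole, since iptables matches rules in order. A sketch (the /24 here is a placeholder, not a known-good range):

```
# accept a known-good subnet first (placeholder address)...
-A INPUT -s 52.64.10.0/24 -p tcp --dport 443 -j ACCEPT
# ...then drop the wider range
-A INPUT -s 52.64.0.0/12 -p tcp --dport 443 -j DROP
```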

JesterMagic

3:10 pm on Feb 6, 2021 (gmt 0)


Thanks for the suggestions and the warnings. I just noticed that the bad bot is using a specific user agent, which makes it easier to spot now.

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/76.0.3803.0 Safari/537.36


This user agent is accessing my site via AmazonAWS from over 1000 different IP addresses.

I've enabled more detailed logging to the DB and created a SQL statement that returns the IPs, so I can just copy and paste them into my htaccess deny list.

I know it is not the prettiest solution but it should work unless they start randomizing their User Agent.

I know I could block by user agent using a RewriteCond, but I figured that by blocking by IP, if they do start changing their user agent, I'll at least already have a list of IPs I'm blocking.
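With 1000+ addresses, pasting raw IPs gets unwieldy fast. One way to tame the list (a sketch of the idea, not the poster's actual SQL or htaccess; the addresses are made up) is to collapse the logged addresses to their /24 blocks before emitting deny lines:

```python
# Sketch: collapse logged bot IPs (placeholder data) into /24 blocks,
# then emit one "Require not ip" line per block.
from ipaddress import ip_network

logged_ips = [
    "52.64.10.5", "52.64.10.77", "52.64.10.200",  # three hits, same /24
    "54.210.3.9",
]

def to_slash24(ips):
    """Map each IPv4 address to its covering /24 network, deduped and sorted."""
    nets = {ip_network(f"{ip}/24", strict=False) for ip in ips}
    return sorted(str(n) for n in nets)

for net in to_slash24(logged_ips):
    print(f"Require not ip {net}")
```

Collapsing to /24 trades a little collateral blocking for a much shorter, more maintainable list, which matters when the bot rotates through a whole neighborhood of addresses.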

lucy24

5:50 pm on Feb 6, 2021 (gmt 0)


I know I could block by user agent using a rewritecond
You could. Or you could say
BrowserMatch HeadlessChrome bad_agent
...
Require env bad_agent
Matter of fact, I just checked my own htaccess--the UA sounded familiar but I couldn't remember if it was common enough to block by name--and I actually have this line, except that it winds up with
bad_agent=HeadlessChrome
so it's easy to spot in header logs. (If you don't define a value, the environmental variable is set to 1, which is often enough.)

JesterMagic

2:43 pm on Feb 7, 2021 (gmt 0)


So "HeadlessChrome" is not used by any actual browsers, then, and it is safe enough to block on that?

I see from my logs today that the bot is just using a pile of new IPs, so I need to block the UA.

Sorry, I've been searching for examples using BrowserMatch and Require, but most blogs seem to suggest stuff like:


BrowserMatch HeadlessChrome bad_agent

Order Allow,Deny
Allow from ALL
Deny from env=bad_agent


I want fast code that is easy to maintain, and I know from seeing your previous posts over the years that you know a lot, lucy24.

I notice from the examples I found that the BrowserMatch code snippet above should have had a "not" in it, like so, right? (Or am I missing something that you are doing below the code?)

Require not env bad_agent


As I found an example like this on Stackoverflow: (but I added in our bad bot text)


BrowserMatch HeadlessChrome bad_agent
<RequireAll>
Require all granted
Require not env bad_agent
</RequireAll>


So for an example would something like this work if I want to block user agents, ips, and ip ranges?

# Block Bad Agents
BrowserMatch HeadlessChrome bad_agent

# Block Bad IP Ranges
SetEnvIf Remote_Addr ^5\.253\.19\b bad_range

# Block Bad IPs
SetEnvIf Remote_Addr 5.253.19 bad_ip

<RequireAll>
Require all granted
Require not env bad_agent
Require not env bad_range
Require not env bad_ip
</RequireAll>

lucy24

5:21 pm on Feb 7, 2021 (gmt 0)


So "HeadlessChrome" is not used by any actual browsers then and it is safe enough to block on that?
When in doubt about some user-agent, I do a search like this in logs (text editor with RegEx):
\.css .+?HeadlessChrome
Other than search engines, the vast majority of robots only request pages. So if a given UA has never requested a stylesheet, it's safe to assume it is not used by humans.
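The same stylesheet heuristic can be scripted. A sketch over made-up (path, user-agent) pairs, with the UA strings abbreviated for readability, flagging agents that fetch pages but never a stylesheet:

```python
# Sketch: given (path, user_agent) pairs pulled from an access log
# (made-up data, abbreviated UA strings), list the agents that
# never requested a .css file -- likely robots, per the heuristic above.
requests = [
    ("/index.html", "Firefox/85.0"),
    ("/style.css",  "Firefox/85.0"),                # human-like: fetched CSS
    ("/index.html", "HeadlessChrome/76.0.3803.0"),
    ("/about.html", "HeadlessChrome/76.0.3803.0"),  # pages only, no CSS
]

def agents_without_css(reqs):
    """User agents seen in the log that never fetched a stylesheet."""
    all_agents = {ua for _, ua in reqs}
    css_agents = {ua for path, ua in reqs if path.endswith(".css")}
    return sorted(all_agents - css_agents)

print(agents_without_css(requests))  # only the HeadlessChrome UA remains
```

As with the text-editor search, this is a heuristic: a brand-new human visitor with a primed cache could also skip the stylesheet, so it works best over a decent stretch of log.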

SetEnvIf Remote_Addr 5.253.19 bad_ip
Careful! SetEnvIf uses regular expressions, so you need anchors and escapes. But you can put IPs in standard form into “Require (not) ip” directives as well.
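For instance, left unanchored and unescaped, "5.253.19" would also match addresses like 15.253.19.1 or 5.253.190.1. A hedged sketch of both corrected forms (treating the thread's example range as a placeholder):

```apache
# Regex form: anchor the start and escape the dots
# (lucy24's trailing \b serves the same purpose as the final \. here)
SetEnvIf Remote_Addr "^5\.253\.19\." bad_ip
<RequireAll>
    Require all granted
    Require not env bad_ip
    # or skip the regex entirely, in standard CIDR form:
    Require not ip 5.253.19.0/24
</RequireAll>
```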

should have had a "not" in it like so right?
This begins to get into personal-coding-style territory. A line by itself--or a line inside a RequireAll or RequireAny envelope--would need “Require not” if you’re blocking the request. A line inside a RequireNone envelope would have “Require” alone. (Yes, it’s confusing until you are used to it.)

Suppose your 2.2 access controls said
Order Allow,Deny
Allow from all
Deny from blahblah
meaning “Let everyone in by default, unless they meet one of this long list of conditions”. The equivalent in 2.4 could be EITHER
<RequireAll>
Require all granted
Require not blahblah
</RequireAll>
OR
<RequireAll>
Require all granted
<RequireNone>
Require blahblah
</RequireNone>
</RequireAll>
Counting on fingers reveals that if you have 8 or more negative conditions, you start saving bytes by using the <RequireNone> version (29 or 31 bytes for the envelope, vs. 4 each for the “not”.)

Caution: You cannot have a <RequireNone> envelope by itself. It has to be inside a <RequireAll> or <RequireAny>. As usual, 8000 guesses how I learned this.

JesterMagic

8:24 pm on Feb 7, 2021 (gmt 0)


Thanks that has been very helpful!