Two new scrapers

(or maybe AI-related bots)

         

asterickx

3:47 pm on Nov 4, 2023 (gmt 0)

10+ Year Member



Lately (~10 days) I've been noticing two new scrapers:

1) fidget-spinner-bot coming from 44.231.202.44 , 50.112.160.3 and 54.184.159.16
2) my-tiny-bot coming from 100.21.24.205 , 44.230.252.91 , 52.25.208.208

They both get 403.

not2easy

4:18 pm on Nov 4, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



New to me as well, thank you.

I see these are all Amazon hosted:
44.192.0.0/10
50.112.0.0/16
52.0.0.0/10
52.64.0.0/12
54.184.0.0/13
100.20.0.0/14
100.24.0.0/13
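For anyone who wants to double-check which of the reported addresses falls in which range, here is a minimal Python sketch using only the standard library. The range list and the six IPs are copied from the posts above; the function name `matching_range` is just an illustrative choice, and the list is not a complete inventory of Amazon address space.

```python
import ipaddress

# CIDR ranges quoted in this post (not a complete list of Amazon ranges).
AMAZON_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "44.192.0.0/10",
        "50.112.0.0/16",
        "52.0.0.0/10",
        "52.64.0.0/12",
        "54.184.0.0/13",
        "100.20.0.0/14",
        "100.24.0.0/13",
    )
]

# Addresses reported in the opening post.
REPORTED_IPS = [
    "44.231.202.44", "50.112.160.3", "54.184.159.16",   # fidget-spinner-bot
    "100.21.24.205", "44.230.252.91", "52.25.208.208",  # my-tiny-bot
]


def matching_range(ip):
    """Return the first listed range containing ip, or None if none does."""
    addr = ipaddress.ip_address(ip)
    for net in AMAZON_RANGES:
        if addr in net:
            return str(net)
    return None


if __name__ == "__main__":
    for ip in REPORTED_IPS:
        print(ip, "->", matching_range(ip))
```

All six reported addresses land in one of the listed ranges; 52.64.0.0/12 and 100.24.0.0/13 do not cover any of them.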

Are you blocking via UA?

[edited by: not2easy at 5:29 pm (utc) on Nov 4, 2023]

lucy24

4:24 pm on Nov 4, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: detour to raw logs ::

Oh, look, there they are, out of sight, out of mind. Lots of fidgets, a handful of tiny, first seen on 18 October.

Almost all of mine get 429, which I think means they've been slamming the server with huge clots of requests all at once. A handful of 403s due to header deficits and/or bad_range.

sysadmMgr

7:55 pm on Nov 6, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



Seeing the same thing across hundreds of sites. The bot does not look for, and therefore does not honor, any crawl delay; it just blasts away with massive concurrency at dynamic, cookied pages. It also seems to like hammering session-based basket, shipping, and similar pages on ecommerce sites. I haven't captured requests to view its POST payload yet, but it seems to spend more time submitting data to login pages than I'd like to see.

Amazon has of course done what they always do, which is absolutely nothing. Basically some variation of "our customer has stated this is a legit crawler, so we're good with it." I have no end of credential stuffing and carding from AWS, possibly from this bot too or from someone else jumping on the bogus user agent, and they don't care. You could really run a DDoS-as-a-service off AWS pretty effectively, since their network has become too big to block for most legit websites that have integrations.

Oh, as of Nov 6 it's also using "thesis-research-bot" in addition to the other two.
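If you are matching on the user agent, a rough Python sketch of the check looks like the following. The three names are the ones reported in this thread; the full user-agent strings have not been posted, so a case-insensitive substring match is an assumption, and the function name `is_listed_bot` is only illustrative.

```python
import re

# Names reported in this thread; the complete UA strings are not known here,
# so this matches them as case-insensitive substrings (an assumption).
BOT_UA = re.compile(
    r"fidget-spinner-bot|my-tiny-bot|thesis-research-bot",
    re.IGNORECASE,
)


def is_listed_bot(user_agent):
    """True if the User-Agent header contains one of the reported names."""
    return bool(BOT_UA.search(user_agent or ""))
```

As noted above, this only helps until the operator changes the name again, so it is a stopgap rather than a fix.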

lucy24

8:45 pm on Nov 6, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



their network has become too big to block
That's how I arrived at the bad_range environment variable. It can be un-set for known, legitimate crawlers, affording a little more granularity than a flat “require ip” block.
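The actual server config is not shown here, so the following is only a Python sketch of the decision logic behind a flag like bad_range, not the real rules. The two ranges are borrowed from the list earlier in the thread, and `ALLOWED_UA_TOKENS` with its `examplebot` entry is a purely hypothetical allowlist.

```python
import ipaddress

# Two ranges borrowed from earlier in the thread, for illustration only.
BAD_RANGES = [
    ipaddress.ip_network("44.192.0.0/10"),
    ipaddress.ip_network("54.184.0.0/13"),
]

# Hypothetical allowlist of crawlers for which the flag is un-set.
ALLOWED_UA_TOKENS = ("examplebot",)


def bad_range_set(ip, user_agent):
    """Return True if the request should carry the bad_range flag."""
    addr = ipaddress.ip_address(ip)
    if not any(addr in net for net in BAD_RANGES):
        return False  # outside the flagged ranges: never flagged
    ua = (user_agent or "").lower()
    if any(token in ua for token in ALLOWED_UA_TOKENS):
        return False  # flagged range, but known crawler: flag un-set
    return True       # flagged range, unknown client: deny (403)
```

A user agent is trivially spoofed, so a real allowlist would want a second check (for example, the crawler's published IP ranges) before un-setting the flag.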

sysadmMgr

10:14 pm on Nov 6, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



Unfortunately for me, on the service provider side I can't know each user agent that could be of value to our customers, so I'm stuck in a reactive cycle instead of a proactive one. Each time some sleazy AWS operator changes the user agent, like this entity keeps doing, it's going to harm our customers until we add yet another bad agent. Or they just use a legit browser user agent, and then it has to be mitigated at a per-site level, since AWS will just shift the egress traffic from all dynamic allocation entities around their ranges to prevent people from blocking them as a whole.

SumGuy

1:52 am on Nov 7, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



Explain to me again what is the use-case in allowing AWS IPs to hit your website? Your site, my site, anyone's site?

Because until there is one that I understand, I'll keep blocking AWS in my router. That's why I'm not seeing these UA's.

not2easy

4:02 am on Nov 7, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



what is the use-case in allowing AWS IP's to hit your website?
Some sites use services, such as APIs, that are hosted on AWS.

SumGuy

1:26 pm on Nov 7, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



If your site/server initiates contact with an AWS server, then that session is not impacted by blocking AWS IPs in a router device (assuming you have such control on your side). When I block IPs in the router, I'm blocking their ability to initiate contact (to make an unsolicited contact), which is the case when a bot is making a GET (or PUT) to my web server.

If your site uses an API hosted on AWS, does that mean that some AWS IP is "hitting" your site when the API is called? And these hits show up in your logs? And hence they would be blocked by .htaccess rules, if that's how you do IP blocking?

sysadmMgr

1:39 pm on Nov 7, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



In my case, I operate a web hosting network, and while outbound calls to AWS-hosted sites are not an issue, web hooks and similar inbound API integrations from entities who themselves host at AWS would be impacted. For example, if you're an ecommerce site and your warehouse inventory system is hosted at AWS, it may make calls to your site to notify it of inventory changes, and if that entity uses dynamic addressing / scaling, they won't have predictable AWS IP addresses. So, AWS has become the preferred home of slightly cleaner than criminal entities who need addresses that are mostly not blocked and an abuse department who won't do much.

SumGuy

12:59 am on Nov 8, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



If the communication channel between anything hosted by AWS and your site is supposed to be private, is it possible to move that to a different port? Something other than port 80 and 443? That way your site can block the garbage coming from any/all AWS IP's on ports 80 and 443.