Attack of the Robots, Spiders, Crawlers, etc.

Part 3

         

Whitey

12:33 am on Jun 16, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



We’re being hammered by bots from all over. Our tech team are onto it, but blocking solutions are being overridden. It’s hard to stay online.

From posts around these forums I see we’re not alone, including Webmasterworld itself.

The last thread I saw on this subject here at Webmasterworld was in 2005: [webmasterworld.com...]

Can anyone share some insights as to how to handle this?

Whitey

12:27 am on Jul 29, 2025 (gmt 0)




What kind of NUMBERS are we talking about? Above 100k requests per minute?

I don't know, I'd have to ask our dev.

But these are large sites with 50m+ URLs each, in 42 languages. The majority of pages (maybe 95%) are no-indexed and have many images per URL. Our tech team are experienced with bot protection (our CTO instigated bot protection strategies and related dev for a large finance institution), but that said, they're being challenged to do this while keeping costs low.

Kendo

8:26 am on Jul 29, 2025 (gmt 0)




I found the main culprits to be using the tencent.com network in Singapore. Their abuse email was useless, and their own website seems to be unreachable for me atm.

43.128.64.0/18
43.163.64.0/18
43.134.128.0/18
43.156.0.0/18

43.134.0.0/18
43.153.192.0/18
124.156.204.0/22
124.156.192.0/22
129.226.220.0/22
129.226.144.0/20
101.32.160.0/20

129.226.88.0/23
43.163.0.0/18
43.156.192.0/18
43.133.32.0/19
119.28.108.0/23

101.32.104.0/21
43.159.32.0/19
101.32.126.0/23
129.226.192.0/22
119.28.116.0/23
43.134.224.0/19

So far I have blocked these netblocks at the firewall, and the list keeps growing. Some earlier ones are not on this list.

Kendo

9:12 am on Jul 29, 2025 (gmt 0)




150.109.16.0/22
124.156.200.0/22
129.226.92.0/23
129.226.196.0/22
150.109.20.0/22
150.109.12.0/22

Kendo

9:44 am on Jul 29, 2025 (gmt 0)




150.109.94.0/23
129.226.4.0/23
43.156.64.0/18
101.32.114.0/23
150.109.8.0/22
150.109.4.0/22
129.226.208.0/22
150.109.24.0/22

For all netblocks see [ipinfo.io...]
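
A list like this grows quickly, and several of the posted blocks sit side by side. As a rough sketch (not what anyone in this thread is necessarily running), Python's standard `ipaddress` module can merge adjacent blocks so the firewall carries fewer rules. The `collapse` helper name is made up for illustration, and only a handful of the netblocks above are used as input.

```python
import ipaddress

# A handful of the netblocks posted above (not the full list).
blocks = [
    "150.109.16.0/22",
    "150.109.20.0/22",
    "150.109.12.0/22",
    "150.109.8.0/22",
    "43.156.0.0/18",
    "43.156.64.0/18",
    "129.226.192.0/22",
    "129.226.196.0/22",
]


def collapse(cidrs):
    """Merge adjacent or overlapping CIDR blocks into the fewest equivalent blocks."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]


collapsed = collapse(blocks)
for cidr in collapsed:
    print(cidr)
```

With this input the eight entries shrink to four. Collapsing never widens coverage: it only merges blocks that together fill a larger block exactly, so no extra addresses get caught.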

Whitey

10:47 am on Jul 29, 2025 (gmt 0)




For the first time in years, I've had to sit down and look up IPs and add a slew of “Require ip” directives. From this I learned that certain typos that, in Apache 2.2, would have thrown the server into Lookups mode, will in 2.4 proceed directly to 500. Ouch.

@lucy24 - a useful tip, gratefully acknowledged by our CTO / thanks

What kind of NUMBERS are we talking about? Above 100k requests per minute?

Just trying to get a handle on the scope of things.

@tangor - not 100% sure, without an in-depth look, but we estimate in the range of 600-1000 requests a minute.

Our CTO traced the spikes to a flood of one-off requests, each IP hitting once and never returning. This doesn’t look like real users or normal crawlers, but more like a distributed botnet or worm-style viral scan. The traffic burns bandwidth without building sessions, which points to an “economic DoS” or reconnaissance attack. Traditional IP blocking is useless since the source rotates constantly.
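
One way to check for the one-off pattern in raw logs is to count how many client IPs appear exactly once. The following is only a minimal sketch: it assumes an access log whose first whitespace-separated field is the client IP (as in Apache's common/combined formats), the function name `single_hit_share` is made up, and the sample lines are invented using documentation-only addresses.

```python
from collections import Counter


def single_hit_share(log_lines):
    """Return (unique_ips, single_hit_ips, share) for log lines whose first field is the client IP."""
    hits = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:
            hits[fields[0]] += 1
    unique = len(hits)
    singles = sum(1 for count in hits.values() if count == 1)
    share = singles / unique if unique else 0.0
    return unique, singles, share


# Invented sample lines; the addresses are reserved documentation ranges.
sample = [
    '192.0.2.10 - - [29/Jul/2025:10:00:01 +0000] "GET /a HTTP/1.1" 200 512',
    '192.0.2.11 - - [29/Jul/2025:10:00:01 +0000] "GET /b HTTP/1.1" 200 512',
    '198.51.100.7 - - [29/Jul/2025:10:00:02 +0000] "GET /c HTTP/1.1" 200 512',
    '198.51.100.7 - - [29/Jul/2025:10:00:05 +0000] "GET /d HTTP/1.1" 200 512',
    '203.0.113.5 - - [29/Jul/2025:10:00:06 +0000] "GET /e HTTP/1.1" 200 512',
]
unique, singles, share = single_hit_share(sample)
print(f"{singles} of {unique} IPs made exactly one request ({share:.0%})")
```

A very high single-hit share during a spike, compared with a quiet period, would be consistent with the rotating-source picture described here. It is an indicator only, since plenty of real visitors also load a single page.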

If these were aimed at image/CDN resources, it could be an attack to drive up bandwidth bills rather than just to take us offline.

@kendo - thanks, I'll forward that IP info on to our crew to look into.

Kendo

9:14 pm on Jul 29, 2025 (gmt 0)




119.28.106.0/23
119.28.122.0/23
119.28.118.0/23
119.28.102.0/23
119.28.104.0/23
124.156.196.0/22

thecoalman

11:33 pm on Jul 29, 2025 (gmt 0)




If you were using Cloudflare you could just use the ASN <click>. Just saying.... :P

Whitey

12:03 am on Jul 30, 2025 (gmt 0)




@thecoalman – haha, fair point on ASN :P but with these worm-like one-offs cycling through residential/proxies, it’s really more about behavior filters and smart rate limits than straight blocks.

thecoalman

12:54 am on Jul 30, 2025 (gmt 0)




I'd agree, Whitey, and you need to be careful with ASNs. DuckDuckGo is hosted on AWS, as one example.

Requests from Tencent would get filtered by other rules I use. The last rule I have whitelists the US and Canada, as that is where I expect 99.999% of my legitimate traffic to come from. It's a niche site; realistically I could whitelist only the northeast US and it would still cover 99% of my real traffic. At a minimum, a request from Tencent is getting the "checking your browser..." page.

I understand the reasons people don't want to use a service like Cloudflare, but it's a matter of weighing the benefits against the downsides. The benefits are overwhelming in this comparison IMO. I moved to Cloudflare because of DDOS, not once but twice, and I was completely unprepared for it. It knocked my site off the internet for one week the first time and 4 days the second time. For that purpose alone I will continue to use it, because I don't see my host or myself being able to manage 2000 requests per second on my little VPS.

Kendo

3:27 am on Jul 30, 2025 (gmt 0)




Cloudflare to me means a deprivation of freedom, the freedom to use any web browser, and the freedom to access from anywhere regardless of country, language, etc. Why should anyone be dictating who our subscribers can be? It is also about privacy and why should any third party service be caching our intellectual property, especially when that content may have been protected by login or DRM token?

As for needing Cloudflare because it is a simple solution, how do we know that they aren't beating up shopkeepers who don't pay protection money?

Ok, so I had to block most of Singapore at my firewall, but that was good for me because I worked out how to solve that problem for good. All our sites will soon be protected in "reverse" of what we have been practising so far. Instead of blocking select entities we will be blocking them all and then allowing select access only. For example, if the known IP range of GoogleBots differs from what they advertise, they will be rate limited like everyone else.
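
The "reverse" idea can be sketched as a default policy plus a short allowlist. This is only an illustration of the shape of it, not the actual setup described here: the `ALLOWLIST` ranges are reserved documentation addresses standing in for whatever ranges have really been verified, and `policy_for` is a made-up helper.

```python
import ipaddress

# Hypothetical allowlist: documentation-only ranges stand in for real, verified ranges.
ALLOWLIST = [ipaddress.ip_network(c) for c in ("192.0.2.0/24", "2001:db8::/32")]


def policy_for(ip_text):
    """Return 'allow' for allowlisted addresses and 'rate_limit' for everything else."""
    ip = ipaddress.ip_address(ip_text)
    if any(ip in net for net in ALLOWLIST):
        return "allow"
    return "rate_limit"


print(policy_for("192.0.2.55"))    # inside the allowlist
print(policy_for("101.33.81.73"))  # not allowlisted, so it gets the default treatment
```

The point of the design is that an unknown address never needs a rule of its own. Anything not explicitly trusted falls through to the default, including a crawler that claims a well-known name from a range nobody has verified.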

Whitey

6:01 am on Jul 30, 2025 (gmt 0)




Good points both ways. Cloudflare’s hard to beat when you’re under real DDOS fire, but I get the concerns on freedom/privacy too. From what I’ve seen and am starting to understand better, bot traffic shifts in phases, so no single fix works; you mix ASN blocks, behavioural filters, and sometimes a WAF/CDN depending on cost and scale.

Kendo

6:32 am on Jul 30, 2025 (gmt 0)




This is amusing... found 101.33.81.73 with a UA like
mozilla/5.0 (compatible; thinkbot/0.5.8; +in_the_test_phase,_if_the_thinkbot_brings_you_trouble,_please_block_its_ip_address._thank_you.) 

Lookup shows that it belongs to 101.33.64.0/19 which is part of the same network listed above.
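
For anyone who wants to double-check a lookup like this offline, the containment test is a couple of lines with Python's standard `ipaddress` module. This sketch only confirms that the address falls inside the block; who owns the block still has to come from a whois or ipinfo lookup.

```python
import ipaddress

ip = ipaddress.ip_address("101.33.81.73")
block = ipaddress.ip_network("101.33.64.0/19")

print(ip in block)               # True
print(block[0], "-", block[-1])  # 101.33.64.0 - 101.33.95.255
```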

Whitey

6:48 am on Jul 30, 2025 (gmt 0)




Gotta love a bot that politely asks you to block it… while still eating your bandwidth. :P

Kendo

7:46 am on Jul 30, 2025 (gmt 0)




Gotta love a bot that politely asks you to block it

What else can they do when they find that they have been found out and blocked? That user-agent string is brand new and only got through my firewall because the IP had not been used before.

My thought for today on AI blocking... no Cloudfester will help at all. Even if blocking by IP address we have no guarantee that their AI bots aren't hidden among their crawler network, and we all know how honest they have been since day one.

Whitey

8:23 am on Jul 30, 2025 (gmt 0)




Gotta love a bot that politely asks you to block it


Exactly. Almost feels like they’re testing our patience more than our defences.

On the Cloudflare/“Cloudfester” angle – I’ve had the same suspicions. The bot landscape seems to move in suspicious cycles: traffic spikes, bandwidth surges, CDNs get stressed, then the solution is to “upgrade protection.” You’d be forgiven for thinking some of this covert seeding of activity conveniently fuels the sales pitch.

Not saying it’s provable, but the incentives are there. When the same vendors are both gatekeepers and beneficiaries, it creates fertile ground for conflicts of interest. And the bots we’re blocking today could just as easily be hiding in the “approved” networks tomorrow.

At the end of the day, AI-driven crawlers are a different breed. No amount of IP churn, ASN filtering or polite UAs will change the fact they can mimic human behaviour and blend into legit traffic. Which makes me think that relying on Cloudflare alone is a fool’s errand – they’ll always sell you the next tier of “fix.”

I wonder if the potential for engineered demand exists, like the kind some antivirus software companies were rumored to be involved in.

Just saying. (No evidence - just a hunch)

tangor

10:33 am on Jul 30, 2025 (gmt 0)




Almost feels like they’re testing our patience more than our defences.

"Thanks for letting us know this is compromised! We'll switch to a different one."

More like a poke in the eye!

thecoalman

4:40 pm on Jul 30, 2025 (gmt 0)




the freedom to use any web browser,


AFAIK the only browsers that you will have issues with are older insecure browsers. They officially support all major browsers and I have also tested with Brave and TOR browser. If you run across a site where these aren't working it's probably the owner blocking it. You can block TOR network entirely if you want but on the flip side they also offer TOR routing if you want to enable it.


Why should anyone be dictating who our subscribers can be.


I'm not sure what you mean by this, but you are in control of what is being blocked. The only default blocking mechanism is security related, similar to the OWASP rule set for mod_security. As with mod_security, rules can be disabled.

It is also about privacy and why should any third party service be caching our intellectual property,


By default CF only caches common files, based on extension, that would typically be cached by a browser: CSS, JS, images etc. They do not cache anything .html, .php or .asp unless you modify the cache settings. Cache settings can be modified however you want, including disabling caching entirely. That said, it's one of the better features they have. It reduces server load and increases page speed. CF has a few hundred data centers across the globe, and cached files are served from the local data center, considerably increasing page speed. The cache can be emptied at any time.

As far as privacy in general, the one big issue is that you need to use their SSL certs on the client side. The exception is the Business and Enterprise plans, but those start at $200/month.

...how do we know that they aren't beating up shopkeepers who don't pay protection money?


If that is to imply they might be causing the issue to begin with, it would be backfiring. The free plan is actually quite good and enough for most smaller or even medium-sized sites.

My thought for today on AI blocking... no Cloudfester will help at all.


CF has one of the largest private global networks, handling about 25 million sites. As such they have copious amounts of data on malicious bot traffic. They have various automatic tools you can enable for the most aggressive botnets and AI bots, and the effectiveness of these tools increases with the plan. From there you can fine-tune to your own needs. They even add custom headers like country code so you can manage things server side if you want.
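
As a rough illustration of the server-side option, the country code arrives as the `CF-IPCountry` request header when that feature is enabled, which a Python WSGI application sees as `HTTP_CF_IPCOUNTRY`. The sketch below is an assumption-laden example rather than anyone's production rule: the helper names and the `ALLOWED_COUNTRIES` set (mirroring the US/Canada whitelist mentioned earlier in the thread) are made up for illustration.

```python
# Hypothetical allow set, mirroring the US/Canada whitelist mentioned earlier in the thread.
ALLOWED_COUNTRIES = {"US", "CA"}


def country_from_environ(environ):
    """Read the two-letter country code passed in the CF-IPCountry request header."""
    return environ.get("HTTP_CF_IPCOUNTRY", "").upper()


def is_expected_country(environ):
    """Return True when the request carries a country code from the allowed set."""
    return country_from_environ(environ) in ALLOWED_COUNTRIES


print(is_expected_country({"HTTP_CF_IPCOUNTRY": "US"}))
print(is_expected_country({"HTTP_CF_IPCOUNTRY": "SG"}))
```

One caveat: the header can only be trusted if the origin accepts traffic solely through the proxy. A client that reaches the server directly can send any header value it likes.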

FYI, rate limiting has been mentioned; they also have this feature, but you really need the Pro plan or better for decent options.

lucy24

5:02 pm on Jul 30, 2025 (gmt 0)




mozilla/5.0 (compatible; thinkbot/0.5.8; +in_the_test_phase,_if_the_thinkbot_brings_you_trouble,_please_block_its_ip_address._thank_you.)
Oh, no worries, thinkbot, you have already blocked yourself. (Quick run to raw logs tells me I haven't seen it yet, but even barring header deficits, that UA is an automatic lockout.)

Kendo

10:18 pm on Jul 30, 2025 (gmt 0)




CF has one of the largest private global networks, handling about 25 million sites.


So has Google and look at where that got us, and those figures seem similarly inflated.

thecoalman

3:16 am on Jul 31, 2025 (gmt 0)




Going off topic here, but there is a blog post from Cloudflare's CEO a few years ago where he discusses this issue.

[blog.cloudflare.com...]

Earlier today, Cloudflare terminated the account of the Daily Stormer.

.....

Now, having made that decision, let me explain why it's so dangerous.

Whitey

3:36 am on Jul 31, 2025 (gmt 0)




No doubt Cloudflare delivers: faster page loads, solid DDoS handling, and a free plan that seems to cover most smaller sites. They support all major browsers (Brave/TOR included) and caching can be tweaked or switched off.

But let’s not pretend there aren’t elephants in the room:

Centralized power: handing over SSL and traffic logs means one company sits between you and your users. As CF’s own CEO admitted after the Daily Stormer takedown, that’s a dangerous role to hold.

Metrics hype: “25 million sites” and “AI bot protection” make for good marketing, but sound a lot like Google’s “only we can manage the chaos” pitch.

Bot inflation? When you see waves of new AI-flavored bots, you can’t help but wonder: is this genuine traffic noise, or part of the problem being conveniently overstated to sell upgrades?

Great tool, yes. But the bigger it gets, the harder it is not to see Cloudflare as gatekeeper and shopkeeper rolled into one.

And let’s be honest - we’ve all seen protection rackets before, haven’t we?

Kendo

6:43 am on Jul 31, 2025 (gmt 0)




If you were using Cloudflare you could just use the ASN

That could be like dropping a nuke to level a single building. We all have netblocks within larger netblocks and the tenants in any one netblock can be multiple entities from multiple countries. I am finding offenders buried in Google, MSN, Amazon and a host of other hosting services. While some of those netblocks may be in Singapore I am also finding a lot in USA.

As for rate limiting, one needs to try to create such a thing to realise that the idea is fanciful. I have been logging hits with a timestamp (in seconds), and while some of the more sophisticated crawlers like Google can be delicate, others do make many requests. Consequently a crawler like Google would not be rate limited. But I have also noticed other crawlers, unwelcome ones, being gentle to avoid attention.
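
To make the limitation concrete, here is a minimal sliding-window limiter built on per-second timestamps like the ones being logged. It is a sketch under assumptions, not the logging code described above: the class name, the limit of 3 hits per 10 seconds, and the sample address are all invented.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_hits requests per client within any window_seconds span."""

    def __init__(self, max_hits, window_seconds):
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client key -> timestamps of accepted hits

    def allow(self, client, now):
        """Record a hit at time `now` (seconds) and return True if it is within the limit."""
        hits = self._hits[client]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_hits:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_hits=3, window_seconds=10)
burst = [limiter.allow("198.51.100.7", t) for t in (0, 1, 2, 3, 10)]
print(burst)  # [True, True, True, False, True]

# A "gentle" client spacing requests 5 seconds apart is never limited.
gentle = [limiter.allow("203.0.113.5", t) for t in range(0, 60, 5)]
print(all(gentle))  # True
```

The second run is the weakness noted here: a crawler that paces itself under the threshold passes untouched, and one that uses a fresh IP for every request never accumulates a count at all.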

Search engines learned a long time ago to meter their crawling because it was crashing servers. I remember my own was being crashed at the same time every morning. It wasn't until others reported similar problems that we all discovered why.

thecoalman

4:33 pm on Jul 31, 2025 (gmt 0)




That could be like dropping a nuke to level a single building.


I'd agree, and as I mentioned in a subsequent post, you need to be careful with blocking by ASN. I mentioned DuckDuckGo on AWS as one example; as another example, I once blocked a network hosting a VPN one of my users was utilizing. To be clear, the block wasn't because of the VPN; the VPN just happened to be on the same network. That said, when you are seeing repeated abuses from constantly changing IPs within the same network, where legitimate traffic is unlikely, it becomes very easy to drop the nuke.

thecoalman

4:57 pm on Jul 31, 2025 (gmt 0)




handing over SSL


This is largely the one thing I don't like about their service, but on the other hand many of the performance-based tools aren't going to work without it. It's possible to get a dedicated SSL with any plan if you pay for it instead of the universal SSL, but it's still not end to end.

ichthyous

5:01 pm on Aug 25, 2025 (gmt 0)




I noticed a huge spike last night reported via Cloudflare. When I traced the ASN, it showed that it originated from Clouvider servers. A bit more research and I found that Clouvider is one of NordVPN's providers of servers. The traffic showed up as originating in Atlanta, USA, but it was using a VPN. I have to check whether any of it made it through to my server... oddly, the CF response code for the URL they were hitting was 301. So they were requesting an old page that had been redirected I guess.

lucy24

7:56 pm on Aug 25, 2025 (gmt 0)




the CF response code for the URL they were hitting was 301. So they were requesting an old page that had been redirected I guess
Could also be a canonicalization issue: either http for https, or wrong www, both of which are common with newly arrived robots. And now and then I get a flurry of requests for interior pages with missing directory slash; that’s also a 301.
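
The three canonicalization cases can be told apart mechanically once you have the requested URL. The sketch below is purely illustrative: the canonical scheme and host, the `KNOWN_DIRECTORIES` set, and the `redirect_reasons` helper are all hypothetical stand-ins for a real site's configuration.

```python
from urllib.parse import urlsplit

CANONICAL_SCHEME = "https"
CANONICAL_HOST = "www.example.com"                 # hypothetical canonical hostname
KNOWN_DIRECTORIES = {"/gallery", "/gallery/2025"}  # hypothetical directory paths, no trailing slash


def redirect_reasons(url):
    """List the canonicalization problems that would each cause a 301 for this URL."""
    parts = urlsplit(url)
    reasons = []
    if parts.scheme != CANONICAL_SCHEME:
        reasons.append("scheme")
    if parts.hostname != CANONICAL_HOST:
        reasons.append("host")
    if parts.path in KNOWN_DIRECTORIES:
        reasons.append("missing directory slash")
    return reasons


print(redirect_reasons("http://example.com/gallery"))
print(redirect_reasons("https://www.example.com/gallery/"))
```

An empty list for a URL that still returned 301 would point back toward a genuinely retired page with an explicit redirect.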

But why “I guess”? Does CF not tell you the exact URL that was requested?

thecoalman

10:20 pm on Aug 25, 2025 (gmt 0)




You might want to check if something is spawning "unique" URLs. I know with phpBB the SID is included in on-page links for the initial page load with a new or expired session, in case the user has cookies disabled. phpBB has a lot of guest features and guests also get a SID. Since the bots don't accept cookies, use multiple IPs and fail whatever other checks phpBB has, it keeps spawning new SIDs. It's kind of funny in part, because they request the login link and the only links on that page are behind the login, so round and round they go.

Note that phpBB has bot groups to identify bots via user agent. For those, SIDs are never present, they are redirected if one is requested, and links like login are removed. They have clean sailing other than post content.
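
A quick way to see whether session IDs are inflating the number of "unique" URLs in a log is to strip the parameter and count again. This is a minimal sketch assuming the session ID travels as a `sid` query parameter; the `strip_sid` helper and the sample URLs are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_sid(url, param="sid"):
    """Return the URL with the session-ID query parameter removed."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Invented sample requests: the same topic reached with two different session IDs and with none.
requested = [
    "/viewtopic.php?t=42&sid=aaa111",
    "/viewtopic.php?t=42&sid=bbb222",
    "/viewtopic.php?t=42",
]
normalized = {strip_sid(u) for u in requested}
print(len(requested), "requested URLs,", len(normalized), "after stripping sid")
```

If the normalized count is a small fraction of the raw count, most of the crawl volume is the same pages being fetched again under new session IDs.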


Does CF not tell you the exact URL that was requested?


Yes, but the path and query string are presented separately. It's going to have anything you would expect in an Apache log, plus:

Service: What category blocked it, e.g. custom rules, CF rules etc.
Action taken: Type of challenge issued, blocked....
Rule: What rule was triggered; there is a link here to take you to management of the rule.
Ray ID: This is on the page the user sees when they get blocked; it's a unique ID that can be used to look up false positives if the user sends it to you.
Country and ASN: No explanation needed. :)

tangor

3:26 am on Aug 26, 2025 (gmt 0)




As an aside, not discounting any of the above, I have been seeing two months of increased bot activity from Russia that accounts for 75% of my bot traffic (which is 85% of ALL traffic), and one can't discount that, in these days and times, national actors may just be being pests for patriotic reasons. Not saying that is so, just wouldn't be surprised if that turned out to be true.

YMMV

More interesting are the IPs:

xxx.xxx.xxx.16 (3)
xxx.xxx.xxx.1 (9)
xxx.xxx.xxx.23 (2)
xxx.xxx.xxx.13 (15)
xxx.xxx.xxx.110 (6)
xxx.xxx.xxx.20 (150)
xxx.xxx.xxx.216 (85)
xxx.xxx.xxx.48 (22)
xxx.xxx.xxx.192 (3)
xxx.xxx.xxx.56 (11)

In a matter of a minute or two, every few hours. All for the SAME existing interior page (one of my more popular). Looking back over the last 10 years of logs never had that many .ru visits. Something is going on and it ain't just "bot".

thecoalman

12:25 pm on Aug 26, 2025 (gmt 0)




I would presume it's for collecting data for AI. I certainly wouldn't discount the idea it's nation state, at least part of it. It also wouldn't surprise me if it were legitimate companies whose identified crawlers have otherwise been blocked. On the other hand, a lot of it doesn't make sense: if you take the SID issue above, certainly at some point someone would realize it. The most recent activity I'm seeing has the user agent Mozilla/5.0 (compatible; crawler); perhaps it's just multiple dumbasses.

Over on phpbb.com I first saw this activity being reported perhaps two years ago, and it's escalated since. It's a good indicator because a lot of users there are on shared hosting, so many of them are getting accounts suspended etc. It's random countries, Brazil one day and Singapore the next. On my own site, about half a million requests were blocked by a challenge over the past week, mostly from the AWS data center in Singapore.

Kendo

1:28 am on Aug 28, 2025 (gmt 0)




It's random countries, Brazil one day and Singapore the next.

This could be due to the new arms race... to feed AI.

The new players are hungry for information but do not have the experience of the search engines who have learnt that taking a little bit at a time doesn't kill the host.
This 93 message thread spans 4 pages.