Forum Moderators: open

Attack of the Robots, Spiders, Crawlers, etc.

Part 3

         

Whitey

12:33 am on Jun 16, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



We’re being hammered by bots from all over. Our tech team are onto it, but blocking solutions are being overridden. It’s hard to stay online.

From posts around these forums I see we’re not alone, including Webmasterworld itself.

The last thread I saw on this subject here at WebmasterWorld was back in 2005 [webmasterworld.com...]

Can anyone share some insights on how to handle this?

thecoalman

7:31 am on Jun 16, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Cloudflare. Even the free plan is actually quite good. The reason it's so good is that they have the data on the botnets and can automatically block them with "Bot Fight Mode". The effectiveness of that depends on the plan.

From there, the WAF gives you five custom rules on the free plan, but each rule can have multiple conditions. The first rule uses the Skip action for known bots, etc. The second rule is an outright block; you can block by ASN, for example. The third rule targets the worst countries and issues a solvable captcha. The fourth rule whitelists countries using "does not equal" conditions joined with AND, with the action set to JS Challenge. This way the whitelisted countries see no challenge at all, while the rest of the world gets the simple "Checking your browser..." page that requires no user action.
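
To make that concrete, those four rules could be sketched roughly as follows in Cloudflare's custom-rule expression syntax (field names as documented by Cloudflare; the ASNs and country codes are placeholders from documentation/reserved ranges, not recommendations, so substitute your own):

Rule 1, action Skip: (cf.client.bot)
Rule 2, action Block: (ip.geoip.asnum in {64496 64511})
Rule 3, action Managed Challenge: (ip.geoip.country in {"AA" "BB"})
Rule 4, action JS Challenge: (ip.geoip.country ne "US" and ip.geoip.country ne "GB" and ip.geoip.country ne "AU")

Rules are evaluated in order, so the Skip rule for verified bots needs to sit above the block and challenge rules.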

They have a rate-limiting tool as well, but it only offers 10-second counting periods on the free plan. It's much more useful with a paid plan.

Edge

12:46 pm on Jun 17, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As thecoalman suggested, we went to Cloudflare and bought the Pro plan. We mitigate and challenge by country, and that seems to have caught most of the bots.

It was so bad that our server was literally being shut down at load levels of 20+. I suspect either AI bots learning from our content or a competitor paying for a DDoS. An advantage of the Pro plan is that they cache and serve a lot of our common web pages and scripts, so the load on the website is typically way down and content gets served fast. A negative is that we have to be careful with some settings, as they can block necessary scripts, etc.

thecoalman

9:14 pm on Jun 17, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As for the cache, there's no limit on the total cache size, regardless of plan, that I'm aware of. They do have a limit on the size of an individual file, but it's super generous, in the hundreds of MBs. The retention policy I'm not so sure about. By default they only cache common static files, like the images, JS and CSS that would be cached by the browser anyway, based on the file extension. They don't cache .html, .php or /somepagewithoutextension, but you can set up custom caching rules to cache or not cache whatever you want.

There are a lot of other benefits. You can firewall off ports 80 and 443 for everything except Cloudflare's IPs; bye bye exploit bots bouncing from IP to IP requesting /phpmyadmin and the other 50 paths they look for. This is also a critical step for DDoS mitigation.
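
As a rough illustration of that firewalling with plain iptables (CSF, ufw or a cloud security group can express the same thing): only Cloudflare's published edge ranges are allowed to reach the web ports, and everything else is dropped. The two ranges below are just examples from Cloudflare's published list; pull the full, current list from their IP page and keep it updated.

# allow Cloudflare edge ranges on the web ports (repeat for each published range)
iptables -A INPUT -p tcp -m multiport --dports 80,443 -s 173.245.48.0/20 -j ACCEPT
iptables -A INPUT -p tcp -m multiport --dports 80,443 -s 103.21.244.0/22 -j ACCEPT
# drop everything else on 80/443 so scanners can no longer hit the origin directly
iptables -A INPUT -p tcp -m multiport --dports 80,443 -j DROP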

Kendo

1:41 am on Jun 18, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



blocking solutions are being overridden


What are you using to block bots that is being overridden?

Are the IPs real, VPN, rotated, assorted?

What are the user agents, if any?

Whitey

2:17 am on Jun 18, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@Kendo - The technical side is beyond my understanding, so I'd have to reach out to our tech team.

But in broad terms my understanding is that we implemented a solution through stages.

Initially, IP blocking was done programmatically, but this reached an unworkable scale.

Then we installed CrowdSec and Zenarmor to work in a complementary manner, but this didn't work and one had to be uninstalled (as far as I know).

Over the last weekend we were hammered and were down or very slow for around 48 hrs.

Are the IPs real, VPN, rotated, assorted?

They rotate through different geos, but I'm told a big chunk is coming in via Singapore, which leads me to suspect Huawei (no idea really). I also picked up on a post here on WebmasterWorld from someone who said they were being hammered from the same source. So I believe we blocked the Singapore IPs.

What are the user agents, if any?

I don't know.

In the last 24 hours our devs have put some scripts or refinements in place (I don't know the details) and we have had no outages. We had input from other members of our non-tech crew, some with strong DB and server skills, with varying suggestions for our CTO, but it looks like we may have got on top of things.

I'm really out of my depth on hard core tech stuff - but I was panicking given the environment we're in with AI bots and scrapers going hog-wild.

Kendo

4:40 am on Jun 18, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@Whitey I don't think many worry about bot traffic. Some, like the connoisseurs here, might, but most site owners are unaware that most of our traffic is bots and malicious scripts looking for database exploits and weaknesses to inject code for SEO. I manage about 30 sites across Windows and Linux servers, and the out-of-the-box stat solutions only show unique visitors and total hits. But on my main sites I have several different types of custom logging for traffic research... referrals, hits from search engines, hits from bots, hits that have been blocked, etc. Most of the traffic I see on the Windows server is probing for WordPress weaknesses, and I don't run WP on that server at all. So much traffic is a waste of resources, and seeing the extent of it can be perplexing.

tangor

10:01 am on Jun 18, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In many respects it is a choice of returning 404 or 403... one is no work, the other is nearly as bad as IP denials.

As for the WordPress attacks, a few /wp filters in .htaccess turn those into 403s, which are quick and clean to filter from the logs.
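
For a site that doesn't run WordPress at all, those filters can be as small as a couple of mod_alias lines in .htaccess (the paths below are only the common probe targets; extend the patterns to whatever your own logs show):

# return 403 for the usual WordPress probe paths
RedirectMatch 403 ^/wp-(admin|content|includes|login\.php)
RedirectMatch 403 ^/xmlrpc\.php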

Regardless of what you do the site will be hit, either at the server or the router level. Just one of those things that goes with the job...

thecoalman

10:43 am on Jun 18, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Regardless of what you do the site will be hit, either at the server or the router level. Just one of those things that goes with the job...


Not with Cloudflare. Between what is blocked by Cloudflare, the caching, and firewalling ports 80 and 443, traffic to the origin server is significantly reduced. They even have an API, so, for example, if you have CSF/LFD, Fail2ban or many other security implementations, you can push your blocks up to Cloudflare if you want to take it that far.

Whitey

1:21 pm on Jun 18, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I just got an email back from our tech team:
We were not dealing with normal bots. Normal bots would identify themselves and behave appropriately, or as directed. We were dealing with bots that do not identify themselves, and therefore, their behaviour cannot be controlled.

Thankfully, our proposed solution appears to be working.

I haven’t asked what they did yet.

lucy24

4:02 pm on Jun 18, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Normal bots would identify themselves and behave appropriately, or as directed.
Are you sure about your tech team's qualifications? "Normal" is a statistical term, so it's nonsense to treat it as a synonym for "well-behaved". Normal and appropriate behavior for a malign robot--which is well over half of all robots--is to gobble up everything it is physically able to get, and to demand every last item on its shopping list, no matter how many consecutive requests are soundly denied.

Whitey

8:45 pm on Jun 18, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Are you sure about your tech team’s qualifications?

Yes, very.

But in the context of dealing with the bot attacks, we’ll see.

Kendo

2:20 am on Jun 19, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Normal bots would identify themselves

Only if they bother to promote their crawler or service. Most of what I see uses anything from blank to:
curl/8.10.0
Go-http-client/1.1
Google
Mozilla/5.0
python-httpx/0.28.1
python-requests/2.32.3

+ all sorts of fake user-agents of known web browsers.

tangor

11:02 am on Jun 23, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Not with Cloudflare..

That's a third party between you and the web. They can also disappear you if they don't like how you smile. If one needs what Cloudflare offers, great. Most of us do not. Also great.

Either way, the bots/spiders/crawlers deserve pest control. Most times it can be done in house rather than calling in a third party...

Kendo

12:23 pm on Jun 23, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Selling protection... the antivirus industry created a lot of new viruses to make its software needed. Sometimes I wonder who might still be using stand-over tactics. I have a few clients that have been bombarded into submission.

thecoalman

2:23 pm on Jun 23, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



They can also disappear you if they don't like how you smile.


I consider it more along the lines of an extension of your host, and like your host they can certainly drop you like a rock. You'd be offline until changes to the nameservers propagate; that's the one and only thing that ties you to their service. That said, the reality is the opposite: they keep a lot of sites online that would otherwise never be able to exist because of DDoS, including news organizations, activists and content that could be considered shady at best (even illegal in countries outside of the US).

tangor

12:55 pm on Jun 25, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A single host is third party enough for me. Adding yet another?

Call it "Just Me."

Kendo

2:56 am on Jun 27, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A single host is third party enough for me.

Especially when they start dictating who can access your web site!

I mean, fancy blocking web browsers because they don't like the fingerprint response they get, effectively demanding that those browsers be rebuilt to give responses similar to what they get from the 3 or 4 popular browsers.

thecoalman

6:11 pm on Jun 27, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



AFAIK it's only where you have configured blocking rules that the browser becomes an issue. Officially they support all major browsers: Edge, Chrome, FF, Safari and Samsung. That said, I just installed Brave and it works fine on a CF-protected site. Where you are going to run into issues is with older, out-of-date browsers.

They do log all requests that have been blocked, along with the reason. I don't believe there is any situation where you can't unblock the traffic.

Whitey

1:35 am on Jun 30, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



One of our developers got back to me:
"I've implemented a JavaScript-based check on page load for requests that don’t appear to be from known good bots. It seems to be effective against the current attacks, but we’ll have more clarity by tomorrow morning. So based on my current changes the website will work only on js enabled devices. Due to a high number of requests from multiple IPs in a short period, our current Geo IP API is getting temporarily blocked because of excessive usage. As a future improvement, I suggest we consider adding an alternative Geo IP service as a fallback, so we can switch if one fails. This isn’t urgent but could be useful in a future upgrade.

For around 7 days things seem to be holding ok.

But I thought I'd run this response through ChatGPT:
TL;DR (for technically skilled community):

Our dev implemented a JavaScript-based challenge to block non-JS bots during a DDoS attack — effective short term but excludes JS-disabled clients and may not stop headless browsers (e.g. Puppeteer). The Geo IP API hit its rate limit under load; dev suggests adding a fallback service (good idea).

Suggestions for hardening:

Use a WAF/CDN (Cloudflare, AWS WAF) for Layer 7+ bot and rate protection.

Implement IP rate limiting at the server/proxy layer (e.g., NGINX + fail2ban).

Cache Geo IP results and add multi-provider failover logic.

Apply JS checks only to vulnerable endpoints, not site-wide.

Log anomaly detection and alerting for high-request patterns.

Consider fingerprinting or behavior-based bot detection for future resilience.

I'm not keen on expensive protection subscriptions running into the tens of thousands of dollars annually, so I'd prefer to tackle this in-house if possible.

Open to further thoughts or stack-specific recs. (BTW, I'm tech illiterate, so I depend on external/specialised inputs "for Dummies".)
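
On the rate-limiting suggestion in that list, the NGINX side is only a few lines. A minimal per-IP sketch looks like this (the zone name, rate and burst values are illustrative and would need tuning, with exemptions for verified crawlers, before going live):

# in the http {} block: track clients by IP, allow roughly 10 requests/second each
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

# in a server {} or location {} block: absorb short bursts, answer the rest with 429
limit_req zone=perip burst=20 nodelay;
limit_req_status 429;

Fail2ban can then watch the logs for IPs that keep tripping the limit and ban them at the firewall.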

Kendo

5:54 am on Jun 30, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Although bots might be faking a known user-agent, as most do, they are unlikely to support JavaScript, so that front will be effective. But search engines will also be blocked because they won't be JavaScript enabled. If you are not worried about search engines accessing those pages, no problem.

One of the features of rate limiting on Windows servers is the option to limit how many hits are allowed within a given period. While that can be useful, it won't be good for search engine access, because once the limit is reached there is no more access. A better idea would be that, after the limit is reached, a delay period could be set to queue requests and let them dribble in. I am just talking off the top of my head here, but I imagine that search engines would then declare the site uninteresting for public consumption... or do they base that on whether pages are fast enough to carry their ads or not?

Whitey

6:50 am on Jun 30, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Good point, Kendo. We’ve added JS-based blocking for non-JS agents, which is effective for now, but I’m checking whether that’s preventing search engines from crawling us. From what I gather, Googlebot may fail to pass the JS challenge unless explicitly bypassed, so we’ll need to whitelist verified bots via UA + reverse DNS. Appreciate your flag, search visibility matters, and we’ll test crawlability in GSC and adjust to avoid losing indexation.
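
For reference, the UA + reverse DNS verification mentioned here is the method Google itself documents: reverse-resolve the requesting IP, check the hostname ends in googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the same IP. A minimal sketch in Python (the function name is ours; in production you would only run it for requests whose user-agent claims to be Googlebot, and cache the result):

import socket

def is_verified_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS check for an IP claiming to be Googlebot."""
    try:
        host = socket.gethostbyaddr(ip)[0]              # reverse lookup
    except (socket.herror, socket.gaierror):
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips                            # must map back to the caller

Bingbot can be verified the same way against search.msn.com hostnames.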

tangor

7:55 am on Jun 30, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Heh. Whatever one does to block AI, one wants g to access so one does not lose any ranking/traffic. G has a cache. Wanna bet their AI bot is running through that on a routine basis? Damned if you do, damned if you don't.

Don't forget, WE BUILT THIS OMNIVOROUS MONSTER OUT OF PERSONAL GREED---and a few dollars dribbled down in the initial roll out---until a dependent culture of webmasters beholden to g came into being for everything, including the pennies now paid.

YOUR STUFF is theirs these days. There's not a lot you can do about it if you want to appear in their serps. If they don't like your face you won't appear anyway, but they will continue to take your money in exchange for vague promises. G holds all the cards. ALL.

Meanwhile, I deny the non-beneficial bots when I find them with ordinary methods, run a WHITELIST robots.txt for the desired ones, and deal with other bad actors with rules. It has become a hobby, for my hobby site, which is not a commercial enterprise and generates zero revenue (though I do accept donations without demanding them).
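
A whitelist robots.txt of that sort just names the crawlers you want and disallows everyone else, along these lines (the two crawlers listed are only examples; and of course only compliant bots honour it, which is why the rules for bad actors still matter):

User-agent: Googlebot
User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /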

ACTUALLY, while I view the web with a more pragmatic understanding of what is going on---and has been going on for at least two decades---what others are doing to fight against copyright theft, lost revenue, AI overreach, lost traffic, and the proliferation of bots in general is FASCINATING! Each of these threads reveals new insights, possibilities, and might---just MIGHT---change my outlook.

Better than any UFC bout I've ever seen!

Whitey

12:50 pm on Jun 30, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



From my tech guys:
Good bots are all allowed without any restrictions.

Relief ...... let's see if it holds true :)

Whitey

4:08 am on Jul 28, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



.... well the servers held until now.

In the past month, we’ve seen our CDN bandwidth usage spike from 5GB to over 25GB — a fivefold increase. This seems to coincide with unusual activity on several specific days (e.g. June 11, 18, 22 and July 7, 26), and we suspect a bot-based DDoS or aggressive scraping attack, possibly now hitting our image assets directly.

We’re currently investigating whether the traffic surge matches our server logs or is targeting the CDN directly. We’re also reviewing if known bots (Google/Bing) were responsible for any of it — but early signs suggest otherwise.

Immediate actions:

* Comparing server logs with CDN logs to isolate origin points.

* Blocking high-volume IPs and countries (non-Western, unless verified).

* Exploring a dual-CDN setup to isolate suspected bot traffic.

* Reassessing firewall protections (e.g., AWS WAF, CrowdSec).

* Investigating alternative CDN options with better bot protection and bandwidth value.

Has anyone else seen similar image-level CDN abuse? Any mitigation tactics that have worked for you would be gratefully received.

Kendo

12:14 pm on Jul 28, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



a fivefold increase

Same here. After some investigation and custom logging I found that they are not DDoS attacks and not intended to be malicious. Just bots running amok. Yes, AI crawlers created by amateurs.

lucy24

4:51 pm on Jul 28, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What's striking about this summer's robot activity is how much of it is clearly botnet-based: groups of robots from colos or, in some cases, compromised human machines acting in concert. Two that I especially notice are:

-- clusters of 10-20 requests for the same page (html only), one after the other in brisk succession, from a variety of unrelated IPs

-- requests for a page with all of its supporting files--very unusual for robots--but the supporting files are requested by IPs entirely different from the one that made the initial page request

For the first time in years, I've had to sit down and look up IPs and add a slew of “Require ip” directives. From this I learned that certain typos that, in Apache 2.2, would have thrown the server into Lookups mode, will in 2.4 proceed directly to 500. Ouch.
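
For anyone who hasn't made the 2.2-to-2.4 jump yet, those directives look something like this (the CIDRs are documentation ranges, not real offenders; a malformed address in a Require ip line is a configuration error, which is the 500 mentioned above):

# Apache 2.4, in a <Directory> block or .htaccess: deny a few abusive ranges, allow everyone else
<RequireAll>
    Require all granted
    Require not ip 192.0.2.0/24 198.51.100.0/24
</RequireAll>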

Kendo

8:44 pm on Jul 28, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Random or not? This is typical of an AI bot:
52.230.152.0/24, 20.171.207.0/24, 4.227.36.0/25, 20.125.66.80/28, 172.182.204.0/24, 172.182.214.0/24, 172.182.215.0/24

Whitey

11:13 pm on Jul 28, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Update: We’re now seeing server response times in the 5–15 second range, but this may be partly intentional. We believe recent config changes were aimed at throttling server response under high load, likely to reduce strain during bot surges.

That said, it’s not ideal. While the server hasn’t crashed since, the slow response is affecting user experience and signals that we may still be under pressure (or dealing with residual effects of the image-based scraping surge). The CDN bandwidth spike remains fivefold over normal.

Currently:

•We’re comparing server and CDN logs to trace persistent high-load patterns.

•IP and country blocks are in place for high-volume low-value traffic.

•Real-time monitoring is active, with dev coverage through the early hours.

•Reviewing whether our current config is helping or hurting under load.

Still looking for input:

Has anyone else purposefully throttled server responses to protect against bot overload? And if so, did it work, or just mask underlying issues?

tangor

12:10 am on Jul 29, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've had some active sites in the past, but apparently not as large as indicated above. At the same time I have seen an INCREASE in bot traffic, but nothing so extraordinary that a few .htaccess directives can't mitigate. Disclosure: It has been several years since I sold off my commercial side and my only recent experience is a personal niche site which might not draw the same kind of bot activity.

What kind of NUMBERS are we talking about? Above 100k requests per minute?

Just trying to get a handle on the scope of things.