Attack of the Robots, Spiders, Crawlers, etc.

Part 3

Whitey

12:33 am on Jun 16, 2025 (gmt 0)

We’re being hammered by bots from all over. Our tech team are onto it, but blocking solutions are being overridden. It’s hard to stay online.

From posts around these forums I see we’re not alone, including Webmasterworld itself.

Last I saw on this subject here at Webmasterworld was here in 2005 [webmasterworld.com...]

Can anyone share some insights as to how to handle this?

Whitey

7:28 am on Sep 5, 2025 (gmt 0)

This is hell >>>>> trying to keep these bots at bay is a nonstop task. They seem to get around everything we put in place. No-indexing of pages is absolutely ignored, which makes shrinking our sites' page counts a serious possibility.

thecoalman

7:48 am on Sep 5, 2025 (gmt 0)

Going back to my original post, Cloudflare. I posted this in Brett's topic about increased security for WW because he listed testing CF as one of the things he was going to do. If you don't like it, change the nameservers back.

This is a snippet from a KB article I wrote for phpBB but you'll get the gist.

Cloudflare's Automated Tools
Go to Security >> Settings. There are various tools here, the one you are most interested in is "Bot Fight Mode". This will automatically block some of the most aggressive bot traffic Cloudflare has identified as malicious. Optionally, you can also enable some of the AI blocking tools.

Cloudflare Custom Rules
Go to Security >> Security Rules >> Create new Rule >> New Custom Rule. CF has an easy-to-use GUI. With the free plan, you get 5 rules. Each rule can have multiple conditions but only one action. Rules are fired in order so make sure the top rules do not interfere with subsequent rules. The following actions can be applied:

Skip - This will skip further rules based on whatever you select under WAF components to skip
Block - The request is blocked
Managed Challenge - Cloudflare will choose what challenge to issue.
Interactive Challenge - CAPTCHA that requires user interaction
JSChallenge - The "Checking your browser...." page that requires no user interaction.

Rule 1 will be used for whatever you want to allow through and skip the rest of the rules. CF maintains a list of known bots that adhere to robots.txt so you can add that if you are using robots.txt. RSS readers cannot pass the Cloudflare check, that is something else you might want to allow through if you have feeds enabled.
Field: Known Bots Operator: Equals Value: <checked>
OR
Field: URI Full Operator: Wildcard Value: https://example.com/forum/feeds/*
Action: Skip All Remaining Custom Rules
Rule 2 will be used for what you want to outright block. You can block using a variety of criteria like ASN, user agent, country, continent and many others. For this example we are blocking the "country" T1, which is used for the Tor network, and the continent of Antarctica. These are just examples, phpBB harbors no ill will toward Tor or penguins :).
Field: Country Operator: Equals Value: Tor
OR
Field: Continent Operator: Equals Value: Antarctica
Action: Block
Rule 3 holds phpBB-specific rules for phpBB's registration and login pages, to help stop spammers from registering and to slow brute-force login attacks. phpBB has its own brute-force detection, but for the convenience of users it's not that strict.
Field: URI query string Operator: Contains Value: mode=register
OR
Field: URI query string Operator: Contains Value: mode=login
Action: Managed Challenge
Rule 4 covers problematic countries or other conditions where you want to elevate the challenge. For the action, issue an Interactive Challenge. The Interactive Challenge requires the user to perform some action on screen, usually a check box. In the following example it's issued to India and China.
Field: Country Operator: Equals Value: China
OR
Field: Country Operator: Equals Value: India
Action: Interactive Challenge
Rule 5 allows you to whitelist countries and deploy a blanket policy for the rest of the world. For the action, use the JSChallenge, which is the brief "Checking your browser..." page. Countries listed here will not be challenged, so add the countries where you expect the bulk of your traffic to come from. It's important to note you need to use the "Does not equal" operator with AND, because with OR every request would match at least one "Does not equal" condition. In the following example the US, Canada and the UK are whitelisted.
Field: Country Operator: Does not equal Value: United States
AND
Field: Country Operator: Does not equal Value: United Kingdom
AND
Field: Country Operator: Does not equal Value: Canada
Action: JSChallenge
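
To make the ordering concrete, here is a rough Python model of how the five rules above interact. It is not Cloudflare code and not its expression language: the `Request` fields, the `evaluate` function and the two-letter codes are stand-ins made up for illustration, and it assumes that the first rule with a matching condition decides the outcome.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass
class Request:
    # Hypothetical request model; field names are illustrative, not Cloudflare's.
    url: str
    query: str = ""
    country: str = "US"      # two-letter code, with "T1" standing in for Tor
    continent: str = "NA"
    known_bot: bool = False


# Each rule is (name, condition, action). Order matters: the first match decides.
RULES = [
    ("allow", lambda r: r.known_bot
        or fnmatch(r.url, "https://example.com/forum/feeds/*"), "skip"),
    ("block", lambda r: r.country == "T1" or r.continent == "AN", "block"),
    ("phpbb", lambda r: "mode=register" in r.query
        or "mode=login" in r.query, "managed_challenge"),
    ("elevated", lambda r: r.country in ("CN", "IN"), "interactive_challenge"),
    # "Does not equal" joined with AND is the same as "not in this set".
    ("blanket", lambda r: r.country not in ("US", "GB", "CA"), "js_challenge"),
]


def evaluate(request: Request) -> str:
    """Return the action of the first matching rule, or "allow" if none match."""
    for _name, condition, action in RULES:
        if condition(request):
            # "skip" stops evaluating the remaining rules and lets the request through.
            return "allow" if action == "skip" else action
    return "allow"
```

Note how a login request from a whitelisted country still gets the Managed Challenge from Rule 3, because it never reaches Rule 5.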

Whitey

8:39 am on Sep 5, 2025 (gmt 0)

@thecoalman - many thanks, I’ll run that by our tech crew and see how they respond.

thecoalman

2:33 pm on Sep 5, 2025 (gmt 0)

It's a pretty straightforward setup. Get an account and add the domain; they will query your current DNS and add it to their DNS records. Review the records, you don't want to proxy mail.example.com* etc. Change the nameservers to the ones they provide and wait for propagation. You can check whether the site is being proxied by CF by examining the response headers; cf-cache-status is a useful one. As already noted, if it's not agreeing with you just switch the nameservers back.

*Doesn't matter for these scraper bots, but if you want true DDoS protection you need to remove all sources of the origin IP. The mail server needs to be on a different IP. Firewall ports 80 and 443 except for CF IPs. I had two DDoS attacks many years ago. I was completely unprepared for the first one, and for the second one I was behind CF, however..... they must have determined the host's IP range and then run a bot across the entire range looking for example.com/uniquefile.jpg. The server spit it out because it was the default domain, and once they knew the IP they were able to go around CF.
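
The "firewall 80 and 443 except for the proxy's IPs" idea boils down to a membership test like the sketch below. The ranges are placeholders taken from the reserved documentation blocks, not Cloudflare's real ranges (Cloudflare publishes its current list, which you would load instead), and `allow_web_connection` is a made-up name. In practice this check belongs in the firewall itself, not in application code.

```python
import ipaddress

# Placeholder ranges from the documentation address blocks, NOT real proxy ranges.
PROXY_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("198.51.100.0/24", "203.0.113.0/24", "2001:db8::/32")
]


def allow_web_connection(source_ip: str, port: int) -> bool:
    """Allow ports 80/443 only from the proxy ranges; other ports are rejected here."""
    if port not in (80, 443):
        return False
    address = ipaddress.ip_address(source_ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa), so mixing is safe.
    return any(address in network for network in PROXY_RANGES)
```

With this in place, a scan of the host's IP range for example.com/uniquefile.jpg gets no answer on 80/443 because the scanner is not inside the proxy ranges.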

lucy24

4:12 pm on Sep 5, 2025 (gmt 0)

No-indexing of pages is absolutely ignored

Now I’m confused. I thought the subject was malign robots, not search engines.

For several weeks my main site was absolutely flooded with a variety of unwanted robots. Best description: Apache logs for this site usually top out at around 1MB/day. In August, several days hit 10-20MB--and that’s from page requests only, no supporting files. All direct to https, incidentally. Robots are getting more talented. As of this week, it looks as if
:: fingers crossed ::
they finally got bored and went away, as logs are now back to their usual size. It’s possible they stopped asking after many days of successful whack-a-mole--but when has a robot ever changed its behavior in response to repeated 403? And I do wish someone could explain to me in words of two syllables what a robot wants with plain html.

Gosh, I miss the days when all you had to check for was UA beginning in “Mozilla” and you could be reasonably confident it’s a human.

Whitey

10:12 pm on Sep 5, 2025 (gmt 0)

Now I’m confused. I thought the subject was malign robots, not search engines.

I thought that removing pages that are no-indexed would reduce the load on our servers, since bots ignore it anyway.

But I see from some ChatGPT research that removing pages won’t reduce intensity unless access is also cut off (e.g., block IPs, rate-limit, cache 404s).

As for server impact, deleting them would dramatically lighten the server load, because instead of rendering HTML, the server just serves a cheap cached 404/410.

That means: less CPU, less DB load, fewer timeouts - even if the bot traffic continues.

So the net effect:

Search engine bots: crawl intensity will fall.

Malicious bots: no change, unless they are blocked or throttled.

Server load: some relief, because serving 404/410 is much lighter than generating real pages.

tbh - I don't have a lot of ideas, since my expertise is limited, but this thought crossed my mind, which is why I mentioned it. Our site is enterprise level with over 44m urls. However, the vast majority are no-indexed and use robots.txt directives for search engines.

tangor

4:14 am on Sep 6, 2025 (gmt 0)

Is that actually 44m urls, or 44m ways to find 440,000? (A bit facetious, but seriously!)

no-index means don't put the contents in the serps, it does NOT mean don't open this page to read the instruction that says don't put the contents in the serps. The DRAG remains the same either way. no-index is so wrongly understood and used incorrectly as a "seo" trick. It doesn't actually work that well!

I get chuckles when people say "I deleted 150,000 and..." it turns out the link was switched off but every page/entry is still on the books. This squanders resources and impacts performance. This from someone who grew up with very little RAM and dang near no storage, floppy or otherwise! One piece of data is a bb in a boxcar. 44m bb's will BREAK the boxcar unless more boxcars are added---slowing things down.

How does this relate to bots? If you actually have that many pages of HTML they will be chewing for a very long time. Whew!

Whitey

4:30 am on Sep 6, 2025 (gmt 0)

Fair point, Tangor - the “bb in a boxcar” analogy made me grin.

To clarify: yes, we do have a monstrous URL footprint, mostly generated variants (filters, params, etc.) that add up to ~44m. But under the hood it’s closer to your 440k “real” pages. The rest are noise, and that’s where the server strain comes in. It’s not unusual in our vertical.

What I’ve noticed is:

•no-index/robots.txt stops SERP exposure but doesn’t ease crawl pressure.

•Deleting and returning 404/410 means the server load is lighter (no DB/HTML render, just cheap header). That helps CPU/mem use, even if the bot traffic still hammers away.

•Bot type matters: Search engines will eventually back off. Malicious/unsophisticated bots won’t, unless throttled, cached, or blocked.

So the big win isn’t magically reducing “intensity,” but reducing cost per hit on the server. Multiply that across millions of hits and it becomes noticeable.

Still experimenting at scale — and like you say, the DRAG remains if the endpoints exist. The challenge is figuring out which URLs deserve to be killed outright, and which need to stay alive but shielded.
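
As a rough illustration of "reducing cost per hit", the check below decides on a 410 before any database or template work happens. It is only a sketch: the parameter names in `RETIRED_PARAMS`, the prefix in `RETIRED_PREFIXES` and the `early_status` function are hypothetical, and the real list of URLs to kill would come out of the experimenting described above.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical policy: these query parameters only produce generated variants.
RETIRED_PARAMS = {"filter", "sort", "view"}
# Hypothetical path prefixes for sections that were deleted outright.
RETIRED_PREFIXES = ("/old-catalogue/",)


def early_status(url: str):
    """Return 410 for retired URLs, or None to fall through to the normal handler."""
    parts = urlsplit(url)
    if parts.path.startswith(RETIRED_PREFIXES):
        return 410
    params = parse_qs(parts.query, keep_blank_values=True)
    # Any retired parameter in the query string marks the URL as a dead variant.
    if params.keys() & RETIRED_PARAMS:
        return 410
    return None
```

The point is that the answer depends only on the URL string, so it can run at the very front of the request path (or be translated into a web server rule).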

lucy24

5:20 am on Sep 6, 2025 (gmt 0)

no-index/robots.txt stops SERP exposure but doesn’t ease crawl pressure.

Again, these are different things. The “noindex” flag, whether in a page’s HEAD or in an x-robots tag, will only be seen if the robot first requests the page; its sole function is as an instruction to search engines. And robots.txt is the equivalent of a “No Admittance” or “Employees Only” sign: law-abiding people will heed it (which saves you the annoyance of constantly having the doorknob rattled) but if you absolutely need to keep them out, you need to install a deadbolt. Whether that’s a 403 or a 410 is a matter of personal preference. 404 is a last resort, as the server still has to go looking for the page, unless you choose to return a manual 404 to selected requests. (This approach has its appeal, as you’re not giving the robot any information at all, while a 403 says “I’m onto you”.)

Just glanced at my log files. (Didn’t actually download, just checked the filesize.) Today’s, which has a few hours to go, is currently at 12.7MB. Sigh. Now the question is, what proportion of them were blocked.
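
One way to answer the "what proportion were blocked" question is to count status codes straight from the access log. This is a minimal sketch assuming Apache's common/combined log format, where the status code follows the quoted request line; `status_counts` and `blocked_share` are illustrative names, and treating 403 and 410 as "blocked" is an assumption to adjust.

```python
import re
from collections import Counter

# Matches the quoted request line followed by the 3-digit status code.
STATUS_RE = re.compile(r'"(?:[^"\\]|\\.)*" (\d{3}) ')


def status_counts(lines):
    """Count responses by status code; lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def blocked_share(counts, blocked=("403", "410")):
    """Fraction of requests answered with one of the 'blocked' status codes."""
    total = sum(counts.values())
    return sum(counts[code] for code in blocked) / total if total else 0.0
```

Since a file object yields lines, something like `blocked_share(status_counts(open("access.log")))` would do for a one-off check (the filename is a placeholder).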

Whitey

5:55 am on Sep 7, 2025 (gmt 0)

@thecoalman – thanks again for laying that out so clearly.

We’ve looked at CF and it’s definitely on the table. The trade-off we’re weighing is cost, dependency, and flexibility. CF gives quick wins but locks us into their framework and rule limits, whereas our dev is discussing internally the trialling of in-house layers (JS challenge hardening + a “fireball” edge filter) that give us tighter control without ongoing fees.

He’s also moving us onto a local geo-IP cache so that if we do country/ASN filtering, it runs instantly at scale without relying on third-party APIs.

Might still end up adding CF as the outer shield, but for now we’re looking into how far we can get building it ourselves.

Whitey

6:02 am on Sep 7, 2025 (gmt 0)

btw @thecoalman – (I put this through ChatGPT to strengthen my understanding):

yes, Cloudflare does offer a free outer layer: you get SSL, global CDN, basic WAF, and unmetered DDoS protection out of the box — plus Bot Fight Mode, which automatically challenges known bots without extra cost.

If you need more control (like blocking AI scrapers or building honeypots), there are new free tools like AI crawler blocking and AI Labyrinth that can drop bots into decoy pages to waste their cycles and help flag them.

That said, the more granular protections—like Super Bot Fight Mode or full Bot Management with analytics and scoring—do require Pro or Enterprise plans. So, right now, we’re evaluating how much we can get using CF’s free tools, and weighing that against building deeper, self-hosted layers (JS challenge hardening, fireball filters, geo-IP cache, etc.).

I'll see what our dev comes back with.

thecoalman

10:01 am on Sep 7, 2025 (gmt 0)

I guess I should define what I'm talking about when I say rules. There is a Rules section but these are mostly for things like redirecting, rewriting or other things you probably want to manage server side anyway. Useful for blanket rules like redirecting http requests to https, it will speed up response and origin server doesn't have to return a response.

The rules I have been talking about are under the security section on the "Security rules" page.

The free plan has a limit of 5 rules. In practice the constraint on those security rules is the action: each rule can only have one action, but there are only 5 actions. Each rule can have multiple conditions; if you have a rule with the action to block you can add multiple UAs, ASNs, IPs etc. I don't know if there is a limit on conditions. You get 20 rules with the Pro plan but quite honestly the only benefit I've found with this is you can organize them better: instead of a single blocking rule you can have them labeled ASN block rule, UA block rule etc.

Might still end up adding CF as the outer shield,

That's largely what it is to begin with. Look at it this way, the bare bones is DNS server. It's only when proxy service is enabled on those DNS entries CF is engaged.

There is nothing of consequence server side that changes when you engage the proxy. If anything it can improve what you are doing server side, because the request to the origin server includes the country code and that is going to be dead accurate. You're also going to want to configure mod_remoteip (or something similar) on the origin server so you can restore the original IP, which is also sent as a header.
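
For anyone doing this in application code rather than with mod_remoteip, the logic looks roughly like the sketch below. `CF-Connecting-IP` and `CF-IPCountry` are the header names Cloudflare uses for the visitor IP and country code; everything else (the `client_identity` function, the placeholder trusted range) is an assumption for illustration. The important part is only trusting the headers when the connection really comes from the proxy.

```python
import ipaddress

# Placeholder proxy range (documentation block); use the ranges your proxy publishes.
TRUSTED_PROXIES = [ipaddress.ip_network("198.51.100.0/24")]


def client_identity(peer_ip: str, headers: dict) -> tuple:
    """Return (client_ip, country); headers are honoured only for trusted peers."""
    peer = ipaddress.ip_address(peer_ip)
    if not any(peer in network for network in TRUSTED_PROXIES):
        # Direct connection: anyone can forge these headers, so ignore them.
        return peer_ip, None
    # Header names are case-insensitive, so normalise before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get("cf-connecting-ip", peer_ip), lowered.get("cf-ipcountry")
```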

Whitey

6:50 am on Sep 8, 2025 (gmt 0)

Appreciate the detail. Our snag is that the sites are highly dynamic, so caching at the edge isn’t really an option. If more of it was cacheable, Cloudflare would be the obvious fit (so I'm told). As it stands, our CTO thinks AWS WAF/Shield is the safer route, even if the costs sting a bit (something I'm nervous about).

That said, we’re not closing the door on Cloudflare. Our dev is working through the options with the CTO’s input, and the aim is to land on a setup that balances protection with cost. We want to back their judgment here and build confidence into the team approach that they settle on.

Curious if anyone here has managed to make Cloudflare work well for highly dynamic sites; any tricks or hybrid setups you’ve seen succeed?

thecoalman

8:33 am on Sep 8, 2025 (gmt 0)

Default settings only cache common static files based on extension like CSS, JS, images etc. phpBB forum serves attachments/avatars through php script so by default it's not cached and cache-control is set to private anyway. I've set up cache rule for avatars based on query string and it will ignore cache-control. I still need to make some modifications to script so public files are set to public. Then I can safely cache them without exposing files in PM's and private forums. That's as far as I'm taking it.

If I was going to go any further they have an API and one of the things you can do with it is manage the cache. You can invalidate the cached url when it changes on server.
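
For what it's worth, a purge-by-URL call would look something like the sketch below. As far as I know the endpoint is `POST /client/v4/zones/{zone_id}/purge_cache` with a JSON `files` list and a bearer token, but check the current API docs before relying on it. `build_purge_request` is a made-up helper, and the zone ID and token are placeholders. The sketch only builds the request; sending it would be `urllib.request.urlopen(request)`.

```python
import json
import urllib.request


def build_purge_request(zone_id: str, api_token: str, urls: list) -> urllib.request.Request:
    """Build (but do not send) a purge-by-URL request for the given zone."""
    body = json.dumps({"files": urls}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )
```

The application would call this from whatever hook runs when a page changes, passing the exact URLs that were cached.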

thecoalman

4:48 pm on Sep 8, 2025 (gmt 0)

If I was going to go any further they have an API and one of the things you can do with it is manage the cache. You can invalidate the cached url when it changes on server.

Just to add, if you have logged in users that by itself won't work, however you can check for the cookie of a logged in user. The cache rules use the same AND/OR setup as security rules with multiple conditions. I'm not certain if you can use the cache for guests/bots and pass the request to the server based on cookie. Seems this would get a bit complicated depending on your setup.
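
Leaving aside whether a given CDN can express it, the decision itself is simple. Here is a sketch of the "cache for guests, pass logged-in users to the server" check. The cookie name `logged_in` is purely hypothetical (use whatever your application sets for authenticated users), and `should_bypass_cache` is an illustrative name.

```python
from http.cookies import SimpleCookie

# Hypothetical name of the cookie that marks a logged-in user.
LOGIN_COOKIE = "logged_in"


def should_bypass_cache(cookie_header: str, method: str = "GET") -> bool:
    """True when the request must go to the origin instead of the shared cache."""
    # Only safe, read-only methods are candidates for a cached response.
    if method not in ("GET", "HEAD"):
        return True
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    return LOGIN_COOKIE in cookies
```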

Whitey

2:00 am on Sep 9, 2025 (gmt 0)

Our dev has tightened request verification (checking for real asset loads like CSS/JS/images and obfuscating the JS layer), which is currently holding the attackers back. Fingers crossed.

Still open to hearing if anyone’s had success with a hybrid CF setup for dynamic sites.

jmccormac

4:33 am on Sep 16, 2025 (gmt 0)

Deepsixing Tencent/Aceville/Bytedance/Huawei/Petalsearch ranges will help. The problem is that AI maggots will also use mobile phone ranges and residential ranges (the Brazilian ranges seem to be popular) to hit sites. Unless web hosting provider ranges are identified, whitelisting US ranges and CA ranges may not be a good approach. There are different types of scrapers active and much of the Aceville/Tencent traffic seems to be AI orientated.

Regards...jmcc

thecoalman

6:51 pm on Sep 16, 2025 (gmt 0)

Unless web hosting provider ranges are identified, whitelisting US ranges and CA ranges may not be a good approach.

If you are referring to the rules I posted for Cloudflare, it's the last rule, which is a blanket policy. If they made it that far, you whitelist the countries (or other criteria) you expect most of your legitimate traffic to come from, so there is no inconvenience to your regular users. If they aren't listed then legitimate traffic can proceed only after the "Checking your browser..." page.*

The rules before that can be used to manage and block traffic from whitelisted countries. If you block a network by ASN located in a whitelisted country it never gets to the rule allowing it. Instead of micromanaging global traffic you only need to micromanage traffic that is going to fall under the whitelisted rule at the end.

*The blanket policy has blocked 70K requests from outside of US and CA in last 24 hours on my site, only a few dozen successfully navigated the challenge. This is typical day to day. I could not imagine what it would be if initial request wasn't blocked.

Whitey

1:07 pm on Sep 17, 2025 (gmt 0)

Relentless. An update from our dev:

I checked the logs yesterday and noticed heavy bot activity.

The total number of requests was around 350,000 per hour, and the bots are still active today.

Our new defense is working at the moment, but it looks like they may be trying to reverse-engineer our code and specifically target us. Hopefully, our script will continue to block them effectively.

Kendo

10:14 pm on Sep 17, 2025 (gmt 0)

The total number of requests was around 350,000 per hour,

How long do these attacks last, seconds, minutes, hours?

Whitey

10:40 pm on Sep 17, 2025 (gmt 0)

How long do these attacks last, seconds, minutes, hours?

Recent server outage alerts have been looking better: maybe 10-15 mins per outage since the script upgrade, and only once in the last 2 weeks.

I don’t know the length of the attacks though. I’ll check with our dev and update here.

thecoalman

10:47 pm on Sep 17, 2025 (gmt 0)

I've had a DDoS attack, actually twice. For the first one it was on average somewhere around 10 million requests per hour sustained over about 6 or 7 days. Complete overkill for the VPS I was on but that is what it was. The only reason I have any data on it is I deployed CF on the second day. What was actually interesting is it was all South Asian IPs and you could see the requests increase as morning dawned there and keep increasing over the coming hours, presumably people waking up and turning on their devices. Might have been 5 million when it was night there and peaked at 15 million during the day there.

Whitey

11:36 am on Sep 18, 2025 (gmt 0)

@kendo - this was the response from our dev:

Bots are still active and haven’t stopped, but their request frequency varies. On the production server, we usually see around 12,000 requests per hour as normal traffic. Recently, after optimizing the database server, it has been able to handle more requests without overloading the CPU. As a result, the bots have also increased their request volume. Right now the number of requests in the last hour is around 27,000.
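
Numbers like these can be pulled from the access log by bucketing requests per hour and flagging the hours that run well above the normal baseline. A minimal sketch, assuming the usual `[dd/Mon/yyyy:hh:mm:ss zone]` timestamp of Apache-style logs; the function names and the "twice the baseline" threshold are arbitrary choices for illustration.

```python
import re
from collections import Counter

# Captures "dd/Mon/yyyy:hh" from a timestamp such as [17/Sep/2025:13:07:00 +0000].
HOUR_RE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}):")


def requests_per_hour(lines):
    """Count requests per clock hour, keyed by the hour string as logged."""
    counts = Counter()
    for line in lines:
        match = HOUR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def hours_over_baseline(counts, baseline, factor=2.0):
    """Hours whose request count exceeds factor x baseline, in first-seen order."""
    return [hour for hour, total in counts.items() if total > baseline * factor]
```

With a baseline of 12,000 and the default factor, an hour with 27,000 requests would be flagged.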

Kendo

11:25 pm on Sep 18, 2025 (gmt 0)

If it is the same page over and over again that might suggest a DoS attack... unless you have them blocked and they are merely trying to get more data than a redirect provides.

I have been logging "referrer" of late and the logs are looking like a who's who of IT icons. Today's winner (not counting Google) is craigslist.com. Yesterday it was ucla.edu

thecoalman

1:22 am on Sep 19, 2025 (gmt 0)

I have been seeing this:

?id=31156&t=1&sid=67df0b0773641a7f2469fed07414aef4

sid - would have been present on another page load
id - invalid parameter
t - this is for topic but it's set to 1, which would have been the "Welcome to phpBB" topic that was deleted 20 years ago.

It just repeats over and over with a different ID and SID.

Kendo

4:32 am on Sep 19, 2025 (gmt 0)

sid=67df0b0773641a7f2469fed07414aef4

I am looking forward to seeing SID removed from links. The sooner the next version is released the better for sure.

thecoalman

10:28 am on Sep 19, 2025 (gmt 0)

As I said before it was never really a problem in the past. Removing it doesn't stop the requests, it just allows unnamed bots to scrape more efficiently. Can certainly stop them when they run out of links to scrape. That said you can use it against them.

Check if request contains SID, make sure it's not a legitimate bot in off chance they are following link with SID, check if referrer is blank and check if phpBB cookie is not present. If so issue a challenge.

The only instance I can think of where this will affect regular users is if they have a bookmark with a SID and are not logged in. Another outlier is someone blocking both referrer and cookies. It's a minority and only a small inconvenience for them.
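
Spelled out as code, the check described above is just four conditions ANDed together. This is a sketch, not phpBB or Cloudflare code: `should_challenge` is a made-up name, the `phpbb3_` cookie prefix is an assumption (boards can rename their cookie), and `is_known_bot` is whatever legitimate-bot test you already have.

```python
from urllib.parse import parse_qs

# Assumed prefix of the forum's cookies; change it to match the board's cookie name.
FORUM_COOKIE_PREFIX = "phpbb3_"


def should_challenge(query: str, referrer: str, cookie_header: str, is_known_bot: bool) -> bool:
    """Challenge requests that carry a SID but look like they came from nowhere."""
    has_sid = "sid" in parse_qs(query)
    has_forum_cookie = FORUM_COOKIE_PREFIX in (cookie_header or "")
    return has_sid and not is_known_bot and not referrer and not has_forum_cookie
```

The two outliers mentioned (a bookmarked SID while logged out, or someone blocking both referrer and cookies) are exactly the cases where this returns True for a human.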

Kendo

12:02 am on Sep 20, 2025 (gmt 0)

Problem is that each request gets assigned a unique SID, so every time a bot lands, it looks for a new link which it has already crawled.

Can it be removed manually?

thecoalman

2:02 am on Sep 20, 2025 (gmt 0)

Just to be clear Ken, bots don't accept cookies, hence the reason the SID is present. That only occurs for bots not added to phpBB's bot list that are using a browser user agent, which makes them a guest. Identified bots never get SIDs unless it's something like a posted link, in which case they get a 301 redirect to the URL without the SID.

There has been some code posted on phpBB.com but I have no idea if there are any consequences. AFAIK it's only used for guest features.
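
The "301 to the URL without the SID" behaviour can be sketched like this. It is an illustration in Python, not phpBB's actual code; `strip_sid` and `redirect_for_known_bot` are hypothetical names, and the query string is rebuilt with `urlencode`, so unusual encodings may come out normalised.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_sid(url: str) -> str:
    """Return the URL with any sid query parameter removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != "sid"]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def redirect_for_known_bot(url: str, is_known_bot: bool):
    """Return (301, clean_url) when a known bot requests a URL carrying a SID, else None."""
    params = dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))
    if not is_known_bot or "sid" not in params:
        return None
    return 301, strip_sid(url)
```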

The question is how helpful that is going to be. Probably quite helpful on small forum but not so much on large forum with a lot of content to request.

Kendo

2:41 am on Sep 20, 2025 (gmt 0)

If SID relies on cookies perhaps it should use a server side session object. In any case I see no advantage in using it as a parameter in a hyperlink... session ID should already be available with any page request.
