I'm looking for some expertise, and would be happy to share what I've found.
Like many reading this, I have been blocking bot traffic for many years. My site has been online for over 20 years, and I estimate that half of my traffic comes from bots, including scrapers written to harvest data from my site (my site is data-heavy) - those are the ones I really want to keep out.
I started using Cloudflare - at $20/month, besides the cloud serving, which reduces server load, they have APIs that let you programmatically "challenge" suspicious IPs. You can also set up "rules" that challenge IPs automatically (the challenge can be passed by a human, but not by a bot). For example, you can challenge all traffic from China.
However, a couple of years ago I learned that at least one of the bots I was blocking was hurting my site revenue: because it was blocked, my site did not have a TRAQ rating, and advertisers will not spend money on sites that do not have such a rating.
This got me looking into the traffic associated with advertisers and marketers. I ran a little test - I created a brand new page, unknown to anyone but me. I placed advertising code on it (I use an ad tech company which uses header bidding across many supply houses), and then watched my server logs.
Lo and behold, right after I viewed the page in Chrome, several other IPs hit the same new page. Many of these IPs identified themselves with custom user agents - admantx, GumGum, Mediapartners-Google (i.e. Google Ads), Comscore. However, some of them came in without identifying themselves, spoofing various user agents and arriving from a host of different IP addresses.
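For anyone who wants to repeat the test, here is a rough sketch of how the log check could be scripted. It assumes the server writes the common "combined" access log format (the Apache/nginx default); the function name, the page path and the IP addresses are made up for illustration, and the regex would need adjusting for a custom log format.

```python
import re
from collections import defaultdict

# Assumed layout (combined log format):
# ip ident user [time] "METHOD path PROTO" status bytes "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def visitors_to_page(lines, secret_path, my_ips=()):
    """Return {ip: set of user agents} for hits on secret_path, skipping my_ips."""
    hits = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not in the assumed format, ignore it
        # Compare on the path only, so a query string does not hide a hit.
        path = match.group("path").split("?", 1)[0]
        if path == secret_path and match.group("ip") not in my_ips:
            hits[match.group("ip")].add(match.group("agent"))
    return dict(hits)

if __name__ == "__main__":
    # Hypothetical log file name, page path and own IP; replace with real values.
    with open("access.log", encoding="utf-8", errors="replace") as log:
        found = visitors_to_page(log, "/secret-test-page.html", my_ips={"203.0.113.7"})
    for ip, agents in sorted(found.items()):
        print(ip, sorted(agents))
```

Since nobody else knows the page exists, every IP this prints, whatever user agent it claims, got there by following the ad code.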
Recently, my ad revenue went down quite a bit, and we determined it was due to some of the advertisers discounting ad impressions at high rates. We theorized (though can't prove) that many of the header bidding companies have an automated post-delivery verification process. When someone comes to your site, they see the ad, and then maybe an hour later the ad company sends a bot to the same page, totally cloaked and not identifying itself (the only way I know is that the page is secret). If that bot gets blocked, they don't pay me for the ad. Once I stopped challenging certain patterns of requests, my number of rejected ads dropped.
Due to this, I have gotten more selective with challenging IPs. However, I have one big challenge left: Hetzner.
Hetzner is a hosting company based in Germany, similar to Amazon AWS. I had the entire ASN blocked because bots on it were requesting 25k pages per day. However, I traced at least one of those bots to an ad company, so I suspect other ad companies may be using that set of servers too.
One large source of the traffic, maybe about half, is from something called BLEXBot. The rest comes from scores and scores of different IPs, with the traffic relatively evenly distributed. For example, I can see via Cloudflare that I have challenged 9,200 requests today, yet the highest-requesting IP accounts for just 729 of them. When I look at the requests from that IP, they come in with over 15 different User Agents, so they are clearly spoofing the Agent.
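To look for this pattern across all IPs rather than one at a time, something like the sketch below might help. It takes (IP, user agent) pairs from wherever you can export them (server logs, a Cloudflare export, etc.) and lists the IPs that used many distinct agents. The function name and the cutoff of 5 are arbitrary choices of mine, not anything Cloudflare provides. Keep in mind that a shared IP (office NAT, mobile carrier) can legitimately show several agents, so this flags candidates for review, not proof of spoofing.

```python
from collections import defaultdict

def rotating_agent_ips(requests, min_agents=5):
    """requests: iterable of (ip, user_agent) pairs.

    Returns (ip, request_count, distinct_agent_count) tuples for every IP
    seen with at least min_agents different user agents, most agents first.
    """
    counts = defaultdict(int)
    agents = defaultdict(set)
    for ip, user_agent in requests:
        counts[ip] += 1
        agents[ip].add(user_agent)
    flagged = [
        (ip, counts[ip], len(agents[ip]))
        for ip in counts
        if len(agents[ip]) >= min_agents
    ]
    # Most distinct agents first, then most requests, then IP for a stable order.
    flagged.sort(key=lambda row: (-row[2], -row[1], row[0]))
    return flagged

if __name__ == "__main__":
    # Made-up sample data using documentation-range IPs.
    sample = [("192.0.2.10", f"FakeBrowser/{n}") for n in range(15)]
    sample += [("198.51.100.4", "Mozilla/5.0")] * 40
    for ip, total, distinct in rotating_agent_ips(sample):
        print(f"{ip}: {total} requests, {distinct} user agents")
```

The IP with 729 requests and 15+ agents would show up at the top of a list like this, along with any lower-volume IPs doing the same thing.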
I just want to know if anyone here has experience with this kind of thing. I'd be glad to answer any questions and help people out if they have an interest in this area of bot-blocking.