Identifying server farms is a manual process. It's time-consuming and open-ended.
However, every time a visitor does something that's manifestly human, they're identifying themselves as not being from a server farm.
If we can collect enough of that information automatically, then the server farms are the other ones.
This is the germ of an idea.
As an example of this idea in practice, here is a list of the /8s that have never posted on my forum (in 5 years). So from my perspective, any candidate for a /8-wide "deny from" should be in this list.
0,3,4,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,25,26,28,
29,30,33,34,35,36,39,40,41,42,43,44,45,48,51,52,53,55,56,
57,102,104,111,117,125,126,127,133,135,136,140,148,153,158,
160,161,167,170,177,179,180,181,183,191,196,197,200,215,
221,223,224,225,226,227,228,229,230,231,232,233,234,235,
236,237,238,239,240,241,242,243,244,245,246,247,248,249,
250,251,252,253,254,255
54 is not included because I had posts from that /8 two weeks before AWS bought it.
Because my forum is small (only around 1000 members), it's not a statistically strong sample, and individual numbers are still getting knocked off this list every few months. But someone with a much larger forum could produce this list the same way I did, with a single-line SQL query, and could generate some better data.
I'd be interested...
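For anyone who would rather work from an exported list of poster IPs than from SQL, here is a rough Python sketch of the same idea. It is not the query I actually ran; the function name and the sample addresses are made up for illustration, and it only handles IPv4.

```python
from ipaddress import IPv4Address

def unseen_slash8s(poster_ips):
    """Return the sorted first octets (/8s) that never appear in poster_ips."""
    seen = set()
    for ip in poster_ips:
        # Shifting the 32-bit address right by 24 bits leaves the first octet,
        # which identifies the /8 the address belongs to.
        seen.add(int(IPv4Address(ip)) >> 24)
    return [octet for octet in range(256) if octet not in seen]

# Made-up sample addresses, for illustration only.
sample = ["54.12.3.4", "81.2.69.142", "81.99.0.1", "203.0.113.7"]
unseen = unseen_slash8s(sample)
```

With real data, `poster_ips` would be the IP recorded against every genuine (non-spam) post.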
One way of looking at the bot problem is to think about asking "what IP addresses are you prepared to give the benefit of the doubt?"
I would regard the list above as being candidates for closer-than-usual scrutiny - e.g. throwing up a captcha page.
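As a minimal sketch of what that scrutiny check might look like: the set below is only a small subset of the list above, and `needs_captcha` is a hypothetical name, not something I have running.

```python
from ipaddress import IPv4Address

# Small subset of the never-posted /8 list, for illustration only.
UNSEEN_SLASH8S = {0, 3, 4, 6, 7}

def needs_captcha(ip, unseen=UNSEEN_SLASH8S):
    """True if the visitor's /8 has never produced a human post."""
    return (int(IPv4Address(ip)) >> 24) in unseen
```

Note that this only decides who gets the extra check; visitors from those /8s are not denied outright, which fits the "benefit of the doubt" way of looking at it.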
/8s are huge chunks. It would be nice to go a lot finer, but that would need more data. And of course if it's collected from forum posts, you need to make sure they're not spam! That gets harder to guarantee on a large forum. But there may be more reliable sources.
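Going finer is mostly a matter of grouping by a longer prefix. A sketch, assuming a list of poster IPs is available: the `min_posts` threshold is my own crude guess at a guard against the odd spam post slipping through, not something I have tested, and the names are hypothetical.

```python
from collections import Counter
from ipaddress import ip_network

def human_prefixes(poster_ips, prefix_len=16, min_posts=2):
    """Return the networks of size prefix_len with at least min_posts posts."""
    counts = Counter(
        # strict=False masks off the host bits, e.g. 81.2.69.142/16 -> 81.2.0.0/16
        ip_network(f"{ip}/{prefix_len}", strict=False)
        for ip in poster_ips
    )
    return {net for net, n in counts.items() if n >= min_posts}

# Made-up sample addresses, for illustration only.
sample = ["81.2.69.142", "81.2.1.1", "203.0.113.7"]
trusted = human_prefixes(sample)
```

The trade-off is the one already mentioned: at /16 there are 65536 possible blocks instead of 256, so a small forum will leave almost all of them unseen.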
I've not really done anything about this yet; I'm just thinking aloud. There may be some flavours of this approach that could complement what people are doing with all those lists of server farms...
A quick look at today's log reveals that about 3% of my visits are from these /8s, including some Baidu, Synapse, Yisou and a few other bots. I'm already blocking the vast majority of bot traffic, though, so on an undefended site it might be much higher.