Automating the server farm identification

Identifying server farms is a manual process. It's time-consuming and open-ended.

However, every time a visitor does something that's manifestly human, they're identifying themselves as not being from a server farm.
If we can collect enough of that information automatically, then the server farms are whatever is left over.

This is the germ of an idea.

As an example of this idea in practice, here is a list of the /8s from which nobody has posted on my forum in 5 years. So from my perspective, any candidate for a /8-wide "deny from" should be on this list.

0,3,4,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,25,26,28,
29,30,33,34,35,36,39,40,41,42,43,44,45,48,51,52,53,55,56,
57,102,104,111,117,125,126,127,133,135,136,140,148,153,158,
160,161,167,170,177,179,180,181,183,191,196,197,200,215,
221,223,224,225,226,227,228,229,230,231,232,233,234,235,
236,237,238,239,240,241,242,243,244,245,246,247,248,249,
250,251,252,253,254,255
(54 is not included because I had posts from that /8 two weeks before AWS bought them.)
Because my forum is small (only around 1000 members), it's not a statistically strong sample, and individual numbers are still getting knocked off this list every few months. But someone with a much larger forum could produce this list the same way I did, with a single-line SQL query, and could generate better data.
(I'd be interested...)
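
For anyone who would rather post-process an export than query the database, here is a minimal Python sketch of the same idea. It is not the SQL query mentioned above. It assumes a hypothetical list, `poster_ips`, of dotted-quad strings taken from posts already judged to be human; the function name is made up for illustration.

```python
def unseen_slash8s(poster_ips):
    """Return the sorted first octets (/8 blocks) that no poster came from."""
    seen = set()
    for ip in poster_ips:
        first = ip.strip().split(".")[0]
        # Skip anything that is not a plain IPv4 dotted quad (e.g. IPv6).
        if first.isdigit() and 0 <= int(first) <= 255:
            seen.add(int(first))
    return sorted(set(range(256)) - seen)


# Hypothetical sample data; a real run would read IPs exported from the forum.
poster_ips = ["81.2.69.142", "81.100.0.1", "203.0.113.7"]
unseen = unseen_slash8s(poster_ips)
print(len(unseen), "of 256 /8s have never posted")
```

The more human posts that go in, the shorter the resulting list gets, which is why a larger forum should give better data.
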
One way of looking at the bot problem is to ask: "Which IP addresses are you prepared to give the benefit of the doubt?"
I would regard the list above as being candidates for closer-than-usual scrutiny, e.g. throwing up a captcha page.
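
As a rough illustration of "closer-than-usual scrutiny", the check itself could be as small as the sketch below. The set `SUSPECT_SLASH8S` holds only a few entries from the list above to keep the example short, and `needs_scrutiny` is a hypothetical name; how the captcha is actually served is left open.

```python
# A few entries from the list above, for illustration only.
SUSPECT_SLASH8S = {3, 4, 6, 7, 8, 9, 10}


def needs_scrutiny(ip, suspect=SUSPECT_SLASH8S):
    """True if the visitor's /8 is one that has never produced a human post."""
    first = ip.split(".")[0]
    # Non-IPv4 input (e.g. IPv6) is not covered by this list, so return False.
    return first.isdigit() and int(first) in suspect


for visitor in ("8.8.8.8", "81.2.69.142"):
    action = "show captcha" if needs_scrutiny(visitor) else "serve page"
    print(visitor, "->", action)
```
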

/8s are huge chunks. It would be nice to go a lot finer, but that would need more data. And of course if it's collected from forum posts, you need to make sure they're not spam! That gets harder to guarantee on a large forum. But there may be more reliable sources.
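
To give a feel for why finer granularity needs more data: there are only 256 /8s, but 65,536 /16s and about 16.7 million /24s, so a small sample leaves almost every finer block unseen. The sketch below groups known-human IPv4 addresses by an arbitrary prefix length using Python's standard `ipaddress` module. The function name and sample addresses are assumptions for illustration.

```python
import ipaddress
from collections import Counter


def prefix_counts(ips, prefix_len=16):
    """Count human-verified IPv4 addresses per network of the given length."""
    counts = Counter()
    for ip in ips:
        try:
            addr = ipaddress.ip_address(ip.strip())
        except ValueError:
            continue  # not a valid IP address; ignore it
        if addr.version != 4:
            continue  # this sketch only handles IPv4
        # strict=False zeroes the host bits, giving the enclosing network.
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        counts[str(net)] += 1
    return counts


# Hypothetical sample data.
sample = ["81.2.69.142", "81.2.70.1", "203.0.113.7"]
for net, n in prefix_counts(sample, 16).items():
    print(net, n)
```
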

I've not really done anything about this yet; I'm just thinking aloud. There may be some flavours of this approach that could complement what people are doing with all those lists of server farms...

A quick look at today's log reveals that about 3% of my visits are from these /8s, including some Baidu, Synapse, Yisou and a few other bots. I'm already blocking the vast majority of bot traffic, though, so on an undefended site it might be much higher.
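
A figure like that can be estimated with a few lines of Python. This sketch assumes an access log whose first whitespace-separated field is the client IPv4 address (as in the common Apache log formats), and it counts requests (log lines) rather than visits, so it is only an approximation. The function name and sample lines are hypothetical.

```python
def share_from_suspect(log_lines, suspect):
    """Fraction of IPv4 log lines whose client address is in a suspect /8."""
    total = hits = 0
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue  # blank line
        first = fields[0].split(".")[0]
        if not first.isdigit():
            continue  # first field is not an IPv4 address
        total += 1
        if int(first) in suspect:
            hits += 1
    return hits / total if total else 0.0


# Hypothetical log lines and a shortened suspect set.
lines = [
    '8.8.8.8 - - [01/Jan/2014:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
    '81.2.69.142 - - [01/Jan/2014:00:00:02 +0000] "GET / HTTP/1.1" 200 512',
]
print(f"{share_from_suspect(lines, {8, 10}):.0%} of requests")
```
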

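-- For each known farm range, show the date of the most recent post from it.
SELECT concat( f.ip, '/', f.mask ),
       substring_index( from_unixtime( max( m.postertime ) ), ' ', 1 )
FROM (
       -- One row per /24 to shrink the join. This relies on MySQL's permissive
       -- GROUP BY, so which row is kept for each /24 is arbitrary.
       SELECT * FROM `smf_messages`
       GROUP BY substring_index( posterip, '.', 3 )
     ) AS m
INNER JOIN smf_farms AS f
   ON inet_aton( m.posterip ) & ( -1 << ( 32 - f.mask ) )
    = inet_aton( f.ip ) & ( -1 << ( 32 - f.mask ) )
GROUP BY f.ip
ORDER BY substring_index( from_unixtime( max( m.postertime ) ), ' ', 1 ) DESC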

54.80.0.0/12      2014-07-26 -- amazon
93.112.0.0/13     2014-07-26 -- voxility
93.115.80.0/20    2014-07-26 -- fullshop romania (voxility?)
54.192.0.0/12     2014-07-14 -- amazon
146.185.0.0/16    2014-06-11 -- HSI 100TB (was netsumo)
5.63.144.0/21     2014-06-09 -- HSI 100TB (was netsumo)
74.115.0.0/21     2014-04-01 -- anchorfree
91.108.180.0/22   2014-03-18 -- webexxpurts
65.192.0.0/11     2014-02-05 -- colostore (now some MCI/verizon?)
69.40.0.0/13      2013-09-04 -- windstream
2.232.0.0/13      2013-07-08 -- fastweb
69.174.0.0/17     2013-03-06 -- scansafe
209.251.192.0/19  2013-02-08 -- tampa time inc
93.114.40.0/21    2013-02-05 -- voxility
213.235.192.0/18  2013-01-26 -- austria tele2
209.68.0.0/18     2013-01-24 -- pairnet
69.48.0.0/12      2013-01-09 -- HSI/intergenia
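
The join condition in the query above appears to be a plain netmask comparison: both addresses are converted to integers, ANDed with the same mask, and compared. A minimal Python sketch of that test follows, with the hypothetical name `in_range`. MySQL does the shift on a 64-bit integer, while this version masks to 32 bits explicitly; the result of the comparison should be the same for IPv4 addresses.

```python
import ipaddress


def in_range(poster_ip, farm_ip, mask):
    """Compare two IPv4 addresses under the same netmask, as the join does."""
    # Build a 32-bit netmask with `mask` leading one bits.
    netmask = (0xFFFFFFFF << (32 - mask)) & 0xFFFFFFFF
    poster = int(ipaddress.IPv4Address(poster_ip))
    farm = int(ipaddress.IPv4Address(farm_ip))
    return (poster & netmask) == (farm & netmask)


# Made-up poster addresses checked against ranges from the output above.
print(in_range("54.83.1.2", "54.80.0.0", 12))
print(in_range("54.96.0.1", "54.80.0.0", 12))
```

The first check falls inside 54.80.0.0/12 (which spans 54.80 to 54.95) and the second falls just outside it.
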

