
Automating server farm identification

an alternative approach

         

trintragula

10:16 am on Dec 9, 2014 (gmt 0)




Identifying server farms is a manual process. It's time consuming and open-ended.

However, every time a visitor does something that's manifestly human, they're identifying themselves as not being from a server farm.
If we can collect enough of that information automatically, then the server farms are the ones left over.

This is the germ of an idea.

As an example of this idea in practice, here is a list of the /8s from which nobody has posted on my forum in 5 years. So from my perspective, any candidate for a deny from a whole /8 should be in this list.

0,3,4,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,25,26,28,
29,30,33,34,35,36,39,40,41,42,43,44,45,48,51,52,53,55,56,
57,102,104,111,117,125,126,127,133,135,136,140,148,153,158,
160,161,167,170,177,179,180,181,183,191,196,197,200,215,
221,223,224,225,226,227,228,229,230,231,232,233,234,235,
236,237,238,239,240,241,242,243,244,245,246,247,248,249,
250,251,252,253,254,255

54 is not included because I had posts from that /8 two weeks before AWS bought the range.

Because my forum is small (only around 1000 members) it's not a statistically strong sample, and individual numbers are still getting knocked off this list every few months. But someone with a much larger forum could produce the same list the way I did, with a single-line SQL query, and generate some much better data.
I'd be interested...
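
In rough terms, the query plus a little post-processing might look something like this. The table and column names are just placeholders, and the sketch assumes a SQLite database where poster IPs are stored as dotted-quad IPv4 strings - adjust to whatever your forum software actually uses:

```python
import sqlite3

# Hypothetical schema: posts(poster_ip TEXT) holding dotted-quad IPv4 addresses.
conn = sqlite3.connect("forum.db")

# One query pulls the distinct first octets (/8s) that have ever posted.
seen = {
    int(octet)
    for (octet,) in conn.execute(
        "SELECT DISTINCT substr(poster_ip, 1, instr(poster_ip, '.') - 1) FROM posts"
    )
    if octet and octet.isdigit()   # skip NULLs and any IPv6 rows
}

# The /8s that have never produced a post are the candidates for scrutiny.
never_posted = sorted(set(range(256)) - seen)
print(",".join(str(n) for n in never_posted))
```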

One way of looking at the bot problem is to ask: which IP addresses are you prepared to give the benefit of the doubt?
I would regard the list above as candidates for closer-than-usual scrutiny - e.g. throwing up a CAPTCHA page.
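
As a sketch of what that scrutiny gate could look like (the function is illustrative only, not something I actually run - a real setup would also whitelist known-good crawlers first):

```python
# The /8 list quoted earlier, treated as "no benefit of the doubt".
SUSPECT_SLASH_8S = {
    0, 3, 4, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
    25, 26, 28, 29, 30, 33, 34, 35, 36, 39, 40, 41, 42, 43, 44, 45, 48,
    51, 52, 53, 55, 56, 57, 102, 104, 111, 117, 125, 126, 127, 133, 135,
    136, 140, 148, 153, 158, 160, 161, 167, 170, 177, 179, 180, 181, 183,
    191, 196, 197, 200, 215, 221, 223, 224, 225, 226, 227, 228, 229, 230,
    231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244,
    245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255,
}

def needs_captcha(client_ip: str) -> bool:
    """True if the visitor comes from a /8 that has never produced a post."""
    first_octet = int(client_ip.split(".", 1)[0])
    return first_octet in SUSPECT_SLASH_8S

# 180/8 is on the list (Baidu crawls from 180.76.x.x); 66/8 is not.
print(needs_captcha("180.76.15.143"))   # True
print(needs_captcha("66.249.66.1"))     # False
```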

/8s are huge chunks. It would be nice to go a lot finer, but that would need more data. And of course, if it's collected from forum posts, you need to make sure they're not spam! That gets harder to guarantee on a large forum. But there may be more reliable sources.

I've not really done anything about this yet, I'm just thinking aloud. There may be some flavours of this approach that could complement what people are doing with all those lists of server farms...

A quick look at today's log reveals that about 3% of my visits are from these /8s, including some Baidu, Synapse, Yisou and a few other bots. I'm already blocking the vast majority of bot traffic, though, so on an undefended site it might be much higher.
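
The check itself is nothing fancy. Something along these lines would do it, assuming a combined-format access log where the client IP is the first field, and reusing the SUSPECT_SLASH_8S set from the sketch above (the file name is a placeholder):

```python
from collections import Counter

def suspect_share(log_path: str, suspect_slash_8s: set[int]) -> float:
    """Fraction of log lines whose client IP falls in one of the suspect /8s."""
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            ip = line.split(" ", 1)[0]
            octet = ip.split(".", 1)[0]
            if not octet.isdigit():          # skip IPv6 and malformed lines
                continue
            hits["suspect" if int(octet) in suspect_slash_8s else "other"] += 1
    total = sum(hits.values())
    return hits["suspect"] / total if total else 0.0

# e.g. print(f"{suspect_share('access.log', SUSPECT_SLASH_8S):.1%}")
```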

trintragula

11:36 am on Jan 31, 2015 (gmt 0)




On the subject of Pinterest, on an average day I seem to be getting more referrals from Pinterest than from Yahoo and Bing combined. And that's despite Yahoo and Bing being whitelisted and Pinterest not being. Mine's a leisure interest site that spans a very broad age demographic, which may have something to do with this.
A lot of browsers show up with a Pinterest keyword in their UA - presumably a mobile app, and I'm not blocking those.
Their bot gets blocked by my bot trap because it looks like a bot. So does their plain-clothes one, because it behaves like a bot.
I could whitelist their bot, but whitelisting the plain-clothes bot may be difficult. Which is ironic when you think about it...

lucy24

8:17 pm on Jan 31, 2015 (gmt 0)




How do you block behaves-like-a-bot ahead of time? Recognizing robots after the fact is generally trivial, because then you're looking at a package of behaviors. (Or a non-package, which is a red flag in and of itself.) But identifying them up front in a way the server can understand is not so straightforward.

trintragula

12:55 am on Feb 1, 2015 (gmt 0)




I use a collection of different kinds of traps. Some block the first request from a visitor, some need to see several requests, and some watch whether previous visitors that have something in common with the current one (e.g. UA or subnet) were blocked after several requests, then block this visitor and subsequent ones before they start.
This stops most distributed scrapes after a handful of requests without my having to intervene.
It's a work in progress. My aim is to have the bot blocker do better than I could manually.
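
Stripped right down, the layered idea looks roughly like this. The threshold, the /24 grouping and the data structures are illustrative assumptions, not my actual code:

```python
from collections import defaultdict

REQUEST_LIMIT = 20                      # assumed threshold, tune per site
requests_seen = defaultdict(int)        # ip -> request count
blocked_ips = set()
blocked_subnets = set()                 # /24 prefixes, e.g. "5.9.63"
blocked_uas = set()

def should_block(ip: str, user_agent: str) -> bool:
    subnet = ip.rsplit(".", 1)[0]

    # Pre-emptive layer: something sharing this subnet or UA was already blocked.
    if ip in blocked_ips or subnet in blocked_subnets or user_agent in blocked_uas:
        return True

    # Behavioural layer: too many requests from one address.
    requests_seen[ip] += 1
    if requests_seen[ip] > REQUEST_LIMIT:
        blocked_ips.add(ip)
        blocked_subnets.add(subnet)
        blocked_uas.add(user_agent)
        return True

    return False
```

A real version would sit behind a whitelist of known-good crawlers and use more signals than a simple request count, but the shape is the same: once one member of a distributed scrape trips a trap, its relatives get turned away on their first request.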