

Automating server farm identification

an alternative approach


trintragula

10:16 am on Dec 9, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Identifying server farms is a manual process. It's time consuming and open-ended.

However, every time a visitor does something that's manifestly human they're identifying themselves as not being from a server farm.
If we can collect enough of that information automatically, then the server farms are the other ones.

This is the germ of an idea.

As an example of this idea in practice, here is a list of the /8s that have never posted on my forum (in 5 years). So from my perspective, any candidates for a deny from /8 should be in this list.

0,3,4,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,25,26,28,
29,30,33,34,35,36,39,40,41,42,43,44,45,48,51,52,53,55,56,
57,102,104,111,117,125,126,127,133,135,136,140,148,153,158,
160,161,167,170,177,179,180,181,183,191,196,197,200,215,
221,223,224,225,226,227,228,229,230,231,232,233,234,235,
236,237,238,239,240,241,242,243,244,245,246,247,248,249,
250,251,252,253,254,255

54 is not included because I had posts from that /8 two weeks before AWS bought it.

Because my forum is small (only around 1000 members) it's not a statistically strong sample (and individual numbers are still being knocked off this list every few months), but someone with a much larger forum could produce the same list the way I did, with a single-line SQL query, and generate some better data.
I'd be interested...
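
For anyone who wants to replicate it, here's the sort of thing I mean - a sketch only, assuming a hypothetical SQLite database with a posts table whose ip column holds dotted-quad addresses (the names are illustrative, not a real forum schema):

import sqlite3

conn = sqlite3.connect("forum.db")

# The one-liner at the heart of it: which /8s have ever posted?
seen = {row[0] for row in conn.execute(
    "SELECT DISTINCT CAST(substr(ip, 1, instr(ip, '.') - 1) AS INTEGER) FROM posts"
)}

# The candidates for closer scrutiny are simply the complement.
never_posted = sorted(set(range(256)) - seen)
print(",".join(str(n) for n in never_posted))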

One way of looking at the bot problem is to ask: "which IP addresses are you prepared to give the benefit of the doubt?"
I would regard the list above as candidates for closer-than-usual scrutiny - e.g. throwing up a captcha page.

/8s are huge chunks. It would be nice to go a lot finer, but that would need more data. And of course if the data is collected from forum posts, you need to make sure they're not spam! That gets harder to guarantee on a large forum. But there may be more reliable sources.

I've not really done anything about this yet, I'm just thinking aloud. There may be some flavours of this approach that could complement what people are doing with all those lists of server farms...

A quick look at today's log reveals that about 3% of my visits are from these /8s, including some Baidu, Synapse, Yisou and a few other bots. I'm already blocking the vast majority of bot traffic, though, so on an undefended site it might be much higher.

lucy24

5:11 pm on Dec 9, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There's not much point in blocking 224.0.0.0/3 -- or indeed in referring to this sector in any way whatsoever -- since it appears to be perpetually unassigned (yes, even while 185 is being doled out in /22 slivers). It just makes a smidgen more work for the server.

54 is being sold piece by piece to AWS, but some of it still belongs to Merck. (I looked this up once and was astounded to find that Merck is doing very well, thank you, and even paying dividends. If you see the world through IP-colored glasses you'd expect to learn they are slowly going under, wouldn't you.)

Don't reinvent the wheel. Some /8 sectors were assigned from the get-go to assorted corporate or governmental entities that don't involve humans surfing the web on their lunch break. But see above about 54.

trintragula

6:04 pm on Dec 9, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Bear in mind that the list above is generated as a result of a simple SQL query. I only mentioned 54 because I noticed after I'd generated the list that it was absent, after I'd been looking back at a recent topic talking about AWS buying from Merck.

I'm trying to explore a different way of thinking about the bot-blocking problem, that doesn't involve manually enumerating unending lists.

dstiles

8:35 pm on Dec 9, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



224 to 239 is multicast, so it will not be used for normal web traffic of any kind. 240 and up is reserved for future use, so it won't appear in normal traffic either.
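
(Python's stdlib ipaddress module encodes the same facts, for anyone who wants to check an address programmatically:)

import ipaddress

print(ipaddress.ip_address("224.0.0.1").is_multicast)  # True: 224.0.0.0/4
print(ipaddress.ip_address("240.0.0.1").is_reserved)   # True: 240.0.0.0/4, "future use"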

trintragula: The list of servers cannot be determined by your method, even amongst a band of enthusiasts such as this. There are far too many server farms, most of which are short-range IP groups embedded in DSL ranges. There are also ranges owned by countries such as Japan that will never hit (say) a Western forum but may still contain valuable visitors. I have a lot of blocked server farms and add several new ones every week. Most are between /18 and /21 but occasionally /16 or bigger - last month a new Amazon one, for example.

And, of course, there are the botnets, which can operate from literally any IP.

Nice suggestion, but I think you should start logging rejections based on browser headers and THEN ferret out the server farms from the "Aren't I clever" DSL users.

trintragula

11:12 pm on Dec 9, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



I'm actually not looking for a yes/no answer here - I'm trying to stimulate some creative discussion about what's possible.

Any site faced with visitors has three options: trust them, block them, or challenge them. The last option opens up more choices. As an example, on my forum, I always challenge rather than block.
In my specific case I could reasonably assume at this point that a visitor from the list above is from a server farm (or is otherwise unlikely to be welcome).
I could implement a challenge for such visitors. Once in a while someone will actually meet the challenge and the list will grow one shorter.
If I did nothing else that would keep out 10% of the bots (I've looked).
I think I may have to repeat the exercise with the 2000-odd /16s I've observed and see whether that's better or worse.
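
To make the three options concrete, here's the decision I have in mind (a sketch only; the list is truncated, and the function names are just illustrative):

# The full list is the one posted above; truncated here for brevity.
NEVER_POSTED_8S = {0, 3, 4, 6, 7, 8, 9, 10, 11, 13, 14, 15}

def handle_visitor(ip):
    first_octet = int(ip.split(".")[0])
    if first_octet not in NEVER_POSTED_8S:
        return "trust"      # benefit of the doubt
    return "challenge"      # e.g. serve a captcha instead of blocking

def challenge_passed(ip):
    # A human answered: the list grows one shorter.
    NEVER_POSTED_8S.discard(int(ip.split(".")[0]))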

But a wider discussion would be good: are there broader sources of data? Would it be practical to do what the spam blockers do: collect data automatically from multiple sites and then share it?
What do people think?
There's a space of possibilities here that involve looking at the problem from the opposite direction.
Maybe if we look around we might find something useful.

keyplyr

11:53 pm on Dec 9, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've always felt that if you really don't want a hands-on approach (which I recommend, BTW), then just use a bot-trap script; several of them have been posted in these forums.

trintragula

9:49 am on Dec 10, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



I appreciate the advice, people, but I'm interested in exploring alternatives to what's currently being done.

Angonasec

12:55 pm on Dec 10, 2014 (gmt 0)



Trout-ladder and kettle.

Need I say more? :)

bhukkel

1:27 pm on Dec 10, 2014 (gmt 0)

10+ Year Member



You can look at subnets that host mail/name servers and websites. Those are almost certainly server farms. If you keep a certain threshold in mind (because non-server-farms also host those services), you can identify server farms.
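
One rough way to approximate this is reverse DNS, assuming PTR records are representative (they often aren't, so treat the score as a hint, not a verdict; the keywords and threshold here are guesses):

import socket

SERVICE_HINTS = ("mail", "smtp", "mx", "ns", "dns", "www", "host", "vps")

def looks_like_farm(net24, threshold=8):
    # net24 is the first three octets, e.g. "192.0.2". Counts how many
    # addresses in the /24 have service-looking reverse DNS names.
    hits = 0
    for last in range(256):
        try:
            name = socket.gethostbyaddr(f"{net24}.{last}")[0].lower()
        except OSError:
            continue  # no PTR record
        if any(h in name for h in SERVICE_HINTS):
            hits += 1
    return hits >= threshold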

trintragula

6:33 pm on Dec 10, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



The challenge is finding the information.

wilderness

7:33 pm on Dec 10, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



FWIW, the alternative to black-listing is white-listing.
Unfortunately, copy-and-paste examples for white-listing are almost zilch.

lucy24

7:43 pm on Dec 10, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Here's the problem:
then the server farms are the other ones.

Sure, you can identify behavior on a single request-- and even more so on a request set. But that will only apply to one specific IP. Without performing a lookup, your server has no way of knowing what range that IP belongs to. It could be a /22 or it could be a whopping /13. Even when you do make the lookup, it isn't always clear what the umbrella organization is. Your offending visitor might be identified as coming from Blue Widgets International, holding a /29 range, but what you really need to know is that they sublet from No Questions Asked Colocation, occupying an entire /17. The latter is what you really want to block.
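
(The lookup itself is cheap enough to sketch -- raw WHOIS over TCP port 43, no library needed. Note that whois.arin.net only covers North American space, so real code would follow referrals to RIPE, APNIC, etc.:)

import socket

def whois(ip, server="whois.arin.net"):
    # WHOIS is just "send the query, read until EOF" on port 43.
    with socket.create_connection((server, 43), timeout=10) as sock:
        sock.sendall((ip + "\r\n").encode())
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")

# The NetRange/CIDR lines in the reply give the allocated block --
# though, as above, a sublet /29 may mask the colo's whole /17.
print(whois("192.0.2.1"))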

dstiles

8:35 pm on Dec 10, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There are also "business" services that use a group of IPs (say, /29 or even /32) to legitimately browse web sites AND provide their own mail and/or web service, sometimes to remote same-company locations but sometimes to all-world. The danger here is in blocking a browser because another idiot on the system ran a bot or proxy against your web site. Blocking the provider (e.g. /18) for a single /29 offence is sometimes a bit over-reactive.

trintragula

9:50 am on Dec 11, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thinking about the idea a bit more: what this is about is essentially producing a population density map of the humans in the IPv4 address space, and using that to decide whether a visitor needs further identification, or whether we can just trust that they're probably human.

This is never an exact science, so any decision about whether to block an IP or not is based on probabilities. We will get it wrong some of the time - inevitably.
As I mentioned earlier, the third option of asking the visitor to provide further evidence of their humanness than we can glean from their request makes it possible to refine this.

@lucy - the question of granularity is certainly a good one, but rather than think in terms of who the server farm is, we could start from "what's the smallest CIDR block this visitor shares with a known human?" (i.e. the highest number of shared high-order bits). If that shared block is small enough, we may trust them; if it's larger than a (potentially different) threshold, we may choose to block them (e.g. if we've never had a human visitor from their /8). In between, we may either look more closely or ask them a question.
On my own site, I'm finding that the vast majority of visitors are from /8s that I have postings from (China excepted...)
Looking at /16s, the distribution is pretty random, because I don't have enough data. What I'm going to experiment with is looking at the count of shared high-order bits and seeing if I can evolve a confidence measure from that. The number of posts on my forum is about 2^16, so the confidence level is not going to be great.
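
To sketch what I mean in code (measuring in shared leading bits rather than block size; the sample IPs and thresholds are placeholders to be tuned against real data):

import ipaddress

# Illustrative stand-ins for IPs harvested from real posts.
KNOWN_HUMANS = {int(ipaddress.ip_address(ip))
                for ip in ("203.0.113.7", "198.51.100.42")}

def shared_bits(visitor):
    # Longest common prefix (in bits) with any known-human IP.
    v = int(ipaddress.ip_address(visitor))
    best = 0
    for h in KNOWN_HUMANS:
        diff = v ^ h  # 1-bits mark where they differ
        best = max(best, 32 - diff.bit_length())
    return best

def verdict(visitor, trust_at=24, block_below=8):
    bits = shared_bits(visitor)
    if bits >= trust_at:
        return "trust"      # shares a /24 or narrower with a human
    if bits < block_below:
        return "block"      # not even a /8 in common with any human
    return "challenge"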

@dstiles - it's certainly true that parts of the IP space have poor 'zoning control' (as the Americans put it), but that's always a problem with origin-based blocking, and is most acute with bot nets, which are specifically exploiting that weakness.
I think as with a lot of these things, you evolve a confidence level, and either use the 'third option' to settle the matter or accept that you're going to guess wrong sometimes, and either let a bot in, or shut a human out.

As a related but lateral thought: we've been concentrating here on server farms. Is there any mileage in listing DSL and cable providers instead? If that list were substantially smaller, it could be very useful. Badhat server farms may try to hide their IP ranges, but DSL providers with, say, 10,000,000 customers each may actually be more forthcoming about theirs. These providers effectively have 'residential zoning' for a lot of their customers. Obviously a lot of web visitors will come from corporate networks, often through firewalls, and those companies may have poor zoning practices with regard to separating their web sites from their staff. Or not. When they're larger they will likely be better at it.

dstiles

9:25 pm on Dec 11, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



When you consider that some farms and DSL ranges can cover as low as /24 or as high as /13 or more (generally DSL) you have a large number of both to deal with. I have around 20600 ranges (server and DSL) and 31700 separate IPs in my database, the latter being IPs which are otherwise registered within server or DSL ranges but which, individually, have trapped themselves for some offence.

Of the 20600 IP ranges, quite a few are /24 to /21 and several are /15 or greater, mostly (but not all) DSL.

Of the 31700 IPs, a large number have been used as part of a botnet and in most cases are actually "expired", having been trapped at least several months ago (I have a sliding scale of trap times for DSL IPs which expire after a period depending on their nastiness).

> When they're larger they will likely be better at it.

I do like a good laugh. :)

trintragula

11:38 pm on Dec 11, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



20600 IP ranges is a lot of IP ranges - way more, in fact, than appear in the server farm threads from the last two years, which you can find from the stickied list at the top of this forum.
Therein lies the problem: where do you get the data?
It's the absence of this data that motivates me to think about ways of deriving it mechanically, by exclusion.

lucy24

12:07 am on Dec 12, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



20600 IP ranges is a lot of IP ranges

If you stipulate that half of the possible IPv4 A blocks are available for general use (disclaimer: I just made up this number for rough-estimation purposes), 20600 gives you an average size that's about midway between /15 and /16 per range ... and most ranges are a ### of a lot smaller than that. Thankfully most IP ranges aren't server farms. And malign robots seem to concentrate in certain neighborhoods, even if hosts don't lay down explicit rules for robots operating out of their servers.

:: detour for some quick number-crunching on htaccess Deny list ::

My current median (not mean) is right about on the cusp between /18 and /19. (Happily for statisticians the world over, /19 also looks like the current mode, though only by a hair.)
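
(For anyone who wants to replicate the detour, here's roughly how -- a sketch assuming htaccess-style "Deny from x.x.x.x/nn" lines, with bare IPs counted as /32:)

import re
from statistics import median, mode

def prefix_lengths(htaccess_text):
    lengths = []
    for m in re.finditer(r"Deny from (\d+\.\d+\.\d+\.\d+)(?:/(\d+))?", htaccess_text):
        lengths.append(int(m.group(2)) if m.group(2) else 32)
    return lengths

lens = prefix_lengths(open(".htaccess").read())
# For scale: 2**31 / 20600 is about 104000 addresses per range,
# between a /17 (32768 is a /17? no: 65536 = /16, 131072 = /15),
# i.e. midway between /15 and /16, as estimated above.
print("median:", median(lens), "mode:", mode(lens))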

Well, that was useful. I found half a dozen blocked ranges narrower than /24 -- including a couple of exact IPs, i.e. /32 -- that I'm now going to check to see how recent the offenses are.

trintragula

12:31 am on Dec 12, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



I've been doing a spot of research...

:: detour for quick number crunching on collected server farm lists from the top of the forum here ::

Median and mode are both /20 (assuming I've got my sums right - I'm not a real statistician).
The list here covers about 34 million IP addresses in about 1900 IP ranges.

Hmm. Might be a bit less than that because there are some nested ranges.
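
(Collapsing the nested ranges is mechanical; a sketch with the stdlib, where RANGES stands in for the collected lists:)

import ipaddress

# Note the second range nests inside the first.
RANGES = ["198.51.100.0/24", "198.51.100.0/25", "203.0.113.0/24"]
nets = [ipaddress.ip_network(r) for r in RANGES]

raw_total = sum(n.num_addresses for n in nets)        # double-counts nesting: 640
collapsed = list(ipaddress.collapse_addresses(nets))  # merges and deduplicates
true_total = sum(n.num_addresses for n in collapsed)  # 512

print(raw_total, "->", true_total)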

keyplyr

8:25 am on Dec 12, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Therein lies the problem: where do you get the data?

As I said earlier, you can use a bot-trap script. Typically a script of this nature takes input from several sources: UA string attributes, IP address, bot-bait file requests, header fields or specific request sequences. The data collected can be written directly to an httpd config or htaccess file, or tabled in a DB and further processed to build a firewall.
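
Very roughly, the shape of such a script looks like this (the rules and names here are invented for illustration; the real scripts posted in these forums are more thorough):

BAIT_PATHS = {"/bot-bait/", "/hidden-link.html"}
BAD_UA_BITS = ("curl", "python-requests", "scrapy")

def score_request(ip, path, user_agent, headers):
    score = 0
    if path in BAIT_PATHS:
        score += 3  # only a crawler finds the bait
    if any(b in user_agent.lower() for b in BAD_UA_BITS):
        score += 2
    if "Accept-Language" not in headers:
        score += 1  # real browsers nearly always send it
    return score

def record_offender(ip):
    # Append to an htaccess-style deny file; a DB table works as well.
    with open("deny.conf", "a") as f:
        f.write(f"Deny from {ip}\n")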

I've tried 2 different scripts that were posted here a few years ago. They work just fine, however I found I was still manually scrolling through logs for various other reasons, so I consolidated my efforts. I actually enjoy doing it by hand. It keeps me current with my user base and I'm always learning new things.

trintragula

10:21 am on Dec 12, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



I've been reluctant to get into the whole collector thing with long lists of IP ranges or UAs, but I thought I'd better have a look and see what I'm missing.

Yesterday I took the 1900-odd IP ranges currently listed in the server farm threads at the top of this forum (2 years' worth) and set up a filter on my site, side by side with my automatic bot blocker. (@keyplyr - I think my auto bot blocker is what you're calling a bot-trap script. I'd be interested in pointers to the scripts you mention, as I haven't found them. My own bot blocker has been running unattended for 18 months, except when I look in to see how it's doing.)

38% of the hits the bot blocker blocked yesterday were in the server farm list.
2% of the hits it let through yesterday were in it (that number would be lower with some more refined analysis of the data, as some blockable requests are deliberately allowed through).

Independently, looking at the visitors to the site yesterday, about 40% of the blocked visitors share a /16 with another visitor. 11% of the non-blocked visitors do.
The numbers are much smaller with sharing of a /24 for example: 21% and 3.5% respectively.
The implication here is that not so many of the visitors are from server farms. This is a crude analysis: there are many reasons for visitors to be co-located other than server farms, and some server farms may only send a single visitor to a site as small as mine on a given day. They differ in how they distribute requests across IP addresses.
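
(For the record, the measurement itself is simple; a sketch, with placeholder IPs standing in for a day's log:)

from collections import Counter

def shared_fraction(visitors, octets):
    # Fraction of visitor IPs whose first `octets` octets are shared
    # with at least one other visitor (2 = /16, 3 = /24).
    prefixes = [".".join(ip.split(".")[:octets]) for ip in visitors]
    counts = Counter(prefixes)
    return sum(1 for p in prefixes if counts[p] > 1) / len(visitors)

visitors = ["203.0.113.5", "203.0.113.9", "198.51.100.1"]  # placeholders
print(shared_fraction(visitors, 2))  # /16 sharing
print(shared_fraction(visitors, 3))  # /24 sharing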

I don't want to rush to any conclusions here, but the numbers are interesting.

wilderness

4:46 pm on Dec 12, 2014 (gmt 0)

trintragula

5:37 pm on Dec 12, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks for these, but I've only been able to find mention of the robots.txt and hidden link traps among them.

Some quick stats from my own blocker:
5% ignored robots.txt*
8% followed a hidden link
3% followed an inhuman number of an uninteresting kind of link
4% hit the speed trap
13% hit one or more of the above

There are some other traps that are not so easy to get stats on. Watching for uptake of supporting files is particularly effective.
I've started monitoring headers on iBill's recommendation: it's impressively effective, but not actually catching anything I don't catch already.

It turns out that about 90% still get caught by having an obvious User-Agent. I have a handful of patterns to catch those, which I've listed here previously. I maintain no long lists.

* by which I mean that they went somewhere they were told not to, rather than that they accessed the site without checking robots.txt first.
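
To illustrate the simplest of the traps above, here's the shape of a speed trap (the window and limit are invented for illustration, not my production numbers):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
MAX_REQUESTS = 15  # no human opens 15 pages in 10 seconds

recent = defaultdict(deque)

def hits_speed_trap(ip):
    now = time.time()
    q = recent[ip]
    q.append(now)
    while q and q[0] < now - WINDOW_SECONDS:
        q.popleft()  # forget requests outside the window
    return len(q) > MAX_REQUESTS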

[edited by: trintragula at 5:57 pm (utc) on Dec 12, 2014]

wilderness

5:53 pm on Dec 12, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for these, but I've only been able to find mention of the robots.txt and hidden link traps among them.


If you don't look beneath the surface when people attempt to assist you, then there's not much good in rambling on.

For example, in the first link: the PHP Spider Trap results in a 404; however, if you search the archives with the same phrase, you get:
[webmasterworld.com...]

The Original Bot Script in the second paragraph is valid.

Most of us have been doing this for more than a decade.
We have our own stats and/or experiences.
Most of us also have long had in place the resolutions you're attempting to locate. Unfortunately, there is no copy-and-paste solution (i.e., one-size-fits-all); rather, each webmaster must determine what is beneficial or detrimental to their own site(s), and that involves both seeing beyond the surface and surveying your own raw logs.

lucy24

7:36 pm on Dec 12, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



each webmaster must determine what is beneficial or detrimental to their own site(s)

I think you meant to say:
each webmaster must determine what is beneficial or detrimental to their own site(s)
;)

wilderness

7:41 pm on Dec 12, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



DITTO!

And that has been the ONE fundamental agreement of this forum's participants since the forum's inception.

trintragula

9:58 am on Dec 13, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



I am very sorry to have been a nuisance, and I have no wish to cause offence.

wilderness

5:10 pm on Dec 13, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



trintragula,
You've not offended anybody here, and should not be so thin-skinned.

Even longtime participants in this forum fail to agree on all methods and/or criteria.

Most of the participants here are long-timers, and we've simply grown to accept each other's inadequacies and peculiarities.

lucy and I were just bantering back-and-forth.

When a noob appears, it simply takes time to find a cog.

Don

lucy24

9:34 pm on Dec 13, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ditto ditto ;) For each person who says flatly "I do such-and-such" there will be someone else saying one of
(a) "Are you nuts? I know a Ukrainian robot who would coast right past that!"
(b) "Are you nuts? I know someone in Taloyoak who would be blocked at the gate by that!"

keyplyr

9:37 pm on Dec 13, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




(c) "Are you nuts? I never eat bologna sandwiches!"

trintragula

11:00 pm on Dec 13, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks all. (I was nearly out the door...)
For my own part, I'm sure some of my own inadequacies and peculiarities are on full display.

I think we are united by a common purpose, which is to keep the unwanted visitors off our sites.

Note to self: ask more questions and make fewer assumptions.