Forum Moderators: open


Bad Bot Blackhole - Blocking Ranges

         

SerpsGuy

4:53 am on Dec 28, 2014 (gmt 0)

10+ Year Member



I implemented a bad bot blackhole, and left it running for two weeks. It has since blocked over 2k unique IPs. I asked elsewhere on these forums, and was told I should be blocking IP ranges rather than individual IPs.

Does anyone have a script or PHP function to condense a list of IPs into a list of IP ranges? Additionally - if my bad bot script wants to block an IP, is it safe to go ahead and block that entire IP range? It seems like I might end up blocking other people too?

If it is pretty safe, then how would I go about blocking an IP range rather than an individual IP? Is there a whois lookup function or API that will return the range for me?

wilderness

9:17 am on Dec 29, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I suppose most folks have avoided replying because there are simply too many questions whose answers are already in the archives.

I implemented a bad bot blackhole, and left it running for two weeks. It has since blocked over 2k unique IPs.


My recollection of Key Master's 2002 script (Ban malicious visitors with this Perl Script [webmasterworld.com]) is that these precise/unique IP blocks were ONLY active temporarily and were updated often by the same script.


I asked elsewhere on these forums, and was told I should be blocking IP ranges rather than individual IPs.


That assumption is correct!

Does anyone have a script or PHP function to condense a list of IPs into a list of IP ranges?

I'm not aware that any such thing exists; however, for it to exist it would require a LOOKUP and COMPARE in one of two ways:
1) an offline database of all IPs and their registrants (such a thing does exist, however it is very, very large, and doing a comparison lookup would take forever even with today's computers); or
2) the capability to look up each known IP against the various online registries and then extract the full CIDR (or, in the case of mod_rewrite, convert it to mod_rewrite IP syntax).
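Something along these lines (untested; PHP, since that is what your trap uses; it assumes the command-line whois client is installed on the box, and the trapped_ips.txt filename is only for illustration) would be a starting point for option 2. Registry output varies wildly, so treat any CIDR it finds as a candidate to verify, not gospel:

<?php
// Rough sketch: ask the system whois client about each trapped IP and try
// to pull a CIDR out of the reply. ARIN tends to print "CIDR: x.x.x.x/nn",
// while RIPE/APNIC/AfriNIC often print "route: x.x.x.x/nn".
function lookup_cidr($ip)
{
    $raw = shell_exec('whois ' . escapeshellarg($ip));
    if (!$raw) {
        return null;                       // whois client missing or lookup failed
    }
    if (preg_match('/^(?:CIDR|route):\s*([0-9.]+\/\d{1,2})/mi', $raw, $m)) {
        return $m[1];
    }
    return null;
}

// Condense the trap's list of IPs into whatever ranges whois reports for them.
$trapped = file('trapped_ips.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$ranges  = array();
foreach ($trapped as $ip) {
    $cidr = lookup_cidr(trim($ip));
    $ranges[$cidr !== null ? $cidr : $ip] = true;   // fall back to the bare IP
    sleep(2);                                        // be polite to the registries
}
print implode("\n", array_keys($ranges)) . "\n";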

Additionally - if my bad bot script wants to block an IP, is it safe to go ahead and block that entire IP range? It seems like I might end up blocking other people too?


Each webmaster must determine what is beneficial or detrimental to their own website(s) (another Webmaster World member previously advised you of the same).
However, and generally speaking, most bots run rampant from commercial IPs (i.e., server farms and/or hosts), with the main exceptions coming from compromised machines (botnets).

If it is pretty safe, then how would I go about blocking an IP range rather than an individual IP?


There are many, many examples of methods in the Webmaster World archives.
The Close to Perfect htaccess [webmasterworld.com] thread provides methods (syntax) that still function, despite most of the IPs and UAs in it being outdated.
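If you would rather keep the range matching inside your PHP trap instead of (or as well as) .htaccess, the check itself is only a few lines. A minimal sketch, IPv4 only, with a made-up blocked_ranges.txt file holding one CIDR per line:

<?php
// Minimal IPv4 CIDR match: is $ip inside $cidr (e.g. "5.9.0.0/16")?
function ip_in_cidr($ip, $cidr)
{
    list($subnet, $bits) = explode('/', $cidr);
    $mask = -1 << (32 - (int) $bits);        // network mask as a signed int
    return ((ip2long($ip) & $mask) === (ip2long($subnet) & $mask));
}

$visitor = $_SERVER['REMOTE_ADDR'];
$ranges  = file('blocked_ranges.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($ranges as $cidr) {
    if (ip_in_cidr($visitor, trim($cidr))) {
        header('HTTP/1.1 403 Forbidden');
        exit('Forbidden');
    }
}

That said, the .htaccess route covered in that thread is the lighter-weight option, since Apache rejects the request before PHP ever runs.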


Is there a whois lookup function or API that will return the range for me?


Each of the following offers their own WHOIS lookups:

ARIN
APNIC
AFRINIC
LACNIC
RIPE

Hope I've answered all your questions.

not2easy

4:39 pm on Dec 29, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If there were a free and functional and maintenance-free way to do this, we would be sharing it instead of sharing each new range we come across. Even if I knew of a *paid* way to do this I would be doing other things. If you want it done and you want it to be done right (to actually do what you want it to do) you will need to do it yourself. I think wilderness has given excellent answers and resources.

There are businesses like Spamhaus that collect data from members around the world; for a monthly cost and your own participation you can use their API to get up-to-date info - but it only covers forum spammers. If there is a similar service that covers scrapers, I don't know about it. There is Project Honey Pot, but again, it relies on a script to catch bots doing things they should not be doing. There are sites that compile lists of country IPs, but these change and can't be used to block bots without blocking everyone. Servers and hosts and colo servers are not separated from ISPs/telecoms. You can easily block a huge range of visitors if you block every range that falls in your trap. Verizon and AT&T have users who run bots. That is why lookups are important instead of relying on calculations.

I use a script like that; it sends me an email when a bot is trapped, and if the IP is not in my own database I do a whois lookup to see who it is. If it's new to me I share the range information here in this forum to help others. I would say that over half of my own database comes from information shared in this forum.

There are several resources for doing a whois lookup. On Mac OS X you can use the Network Utility app and check all but AFRINIC and LACNIC. Those two (and the others listed) have their own websites that give you free lookups. By doing a lookup, you will get the IPs at the beginning and end of a range, and most but not all of them will supply the CIDR. If you get the range but not the CIDR, there are sites where you enter the range and they give you the CIDR for it. Once you try this, you will quickly see your 2k slide down to a manageable number. Start keeping track of these in a sortable, searchable database. Read through the SES threads here and fill out your database. It does become easier once you start.
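If you would rather not paste ranges into a website, the range-to-CIDR conversion those sites do can also be scripted. A rough PHP sketch of the idea (IPv4 only; assumes a 64-bit PHP build so the addresses fit in plain integers):

<?php
// Convert an IPv4 range (the first/last addresses a whois lookup reports)
// into the smallest set of CIDR blocks that covers it exactly.
function range_to_cidrs($first, $last)
{
    $start = ip2long($first);
    $end   = ip2long($last);
    $cidrs = array();
    while ($start <= $end) {
        // Find the widest prefix aligned on $start that does not overshoot $end.
        $prefix = 32;
        while ($prefix > 0) {
            $size = 1 << (32 - ($prefix - 1));               // size of the next-wider block
            if (($start & ($size - 1)) != 0 || $start + $size - 1 > $end) {
                break;                                       // the wider block would not fit
            }
            $prefix--;
        }
        $cidrs[] = long2ip($start) . '/' . $prefix;
        $start  += 1 << (32 - $prefix);
    }
    return $cidrs;
}

// Example with documentation addresses: prints 192.0.2.0/23
print implode("\n", range_to_cidrs('192.0.2.0', '192.0.3.255')) . "\n";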

Don't just block every range ever reported; get familiar with reviewing your access logs to see who/what is actually visiting your site and what they do there. You will see patterns and UAs and see how your blocking efforts are working. I use the trap as an early warning system, but logs tell all.

Yes, it takes time. If you want to stop the damage, it is part of the job.

tangor

6:30 pm on Dec 29, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A rational heuristic approach to bot blocking has not yet been invented, at least not a reliable one. What's here today is hair tomorrow (sic).

On the other side, I'd be interested in hearing what stats occur, and what server overhead is involved, in maintaining single IP blocks ad infinitum. Do you still get human hits, or do you crash your server?

Block countries. Block ranges. Block hosts. Block agents. Block referers (sic). Block IPs. Life is less complicated. :)

Note: the list runs from coarsest to finest... how granular does your blocking need to be?

dstiles

9:06 pm on Dec 29, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Spamhaus: this is aimed at mail servers, which is a different thing entirely: mail should not be delivered from dynamic broadband, while web sites should only be accessed from broadband (plus a few known good robots). There is SORBS, which lists open proxies and compromised servers (amongst other things) but which is known to be a little erratic, and I would not trust it for web purposes.

There are several threads here about server farms, which most people in this forum agree should be blocked: it is relatively easy to plug those IP ranges into a simple blocker of some kind (eg .htaccess).

The idea of blocking by User-Agent is a partial solution but many baddies come in with forged credentials. Checking other header fields AS WELL is preferable but takes a bit of study to get right, and I doubt anyone here will give much help publicly as such advice will also comfort the enemy. But look back at threads in this forum: there is a lot of general information.

Having arranged a method of detecting bad accesses, it then falls to someone to check the IP it rode in on to see whether it is a so-far-unknown server farm or a compromised (ie virus-infected) broadband-based computer. The latter can be temporarily blocked (eg a few days), but unless you intend to block whole countries such as RU, UA, CN, IN or BR, such IPs should be singly blocked and then unblocked. Note: in some countries, such as the UK, IPs are assigned to a broadband user only until (eg) the router is turned off / recycled, after which a new IP is assigned. In a few cases you may find dynamic IP sub-ranges allocated to businesses which misuse them: block those as well.

On my rather small server I get about 20% more server-farm accesses than I do valid bot accesses from half a dozen top SEs. The number is very high, but having worked on adding server farm ranges to my blocking database (MySQL) the daily unknowns are small - about 10-20 new IPs per day, which I then analyse and add to the database as either blocked server-farm, allowable dynamic range or blocked dynamic range (known bad services such as SOME UA and RU sources). I add dynamic ranges to my database so that I do not have to look them up for new hits. My home-grown software auto-adds IPs to the database and blocks them for a limited period. All I do is ferret out unknown ranges for good/bad classification.
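For anyone wanting to try the same shape of thing on a more common LAMP stack, the storage side does not have to be elaborate. A minimal sketch (my own software is different, and the table and column names here are invented purely for illustration): ranges stored as integer bounds, plus a table of temporary single-IP blocks that expire.

<?php
/*
 * Invented schema, for illustration only:
 *
 *   CREATE TABLE ip_ranges (
 *     range_start INT UNSIGNED NOT NULL,
 *     range_end   INT UNSIGNED NOT NULL,
 *     type        ENUM('server_farm', 'dynamic_ok', 'dynamic_bad') NOT NULL,
 *     PRIMARY KEY (range_start, range_end)
 *   );
 *
 *   CREATE TABLE temp_blocks (
 *     ip         INT UNSIGNED NOT NULL PRIMARY KEY,
 *     blocked_at DATETIME NOT NULL
 *   );
 */
$pdo = new PDO('mysql:host=localhost;dbname=botblock', 'user', 'pass');
$ip  = sprintf('%u', ip2long($_SERVER['REMOTE_ADDR']));   // unsigned, matches the columns

// Is the visitor inside a range we have already classified?
$stmt = $pdo->prepare(
    'SELECT type FROM ip_ranges WHERE range_start <= ? AND range_end >= ? LIMIT 1'
);
$stmt->execute(array($ip, $ip));
$type = $stmt->fetchColumn();

// Or temporarily blocked within the last few days?
$stmt = $pdo->prepare(
    'SELECT 1 FROM temp_blocks WHERE ip = ? AND blocked_at > NOW() - INTERVAL 3 DAY'
);
$stmt->execute(array($ip));
$temporarilyBlocked = (bool) $stmt->fetchColumn();

if ($type === 'server_farm' || $type === 'dynamic_bad' || $temporarilyBlocked) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}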

For analysis I use the Linux network tools to determine an IP's parents, and Umit to test for open ports on an IP range (compromised computers almost always have open ports, so test adjacent IPs as well). I sometimes use blacklistalert.org to determine an IP range's general reputation, BUT this is a mail blacklist and should be read with caution.

NOTE: Although I use Linux for my desktop tools, my web server is Windows (I wish it wasn't!), so it is likely my own software will not work on most servers used by this forum's denizens.

jmccormac

1:06 am on Dec 30, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is there a business in providing such lists or such a service?

Regards...jmcc

wilderness

1:29 am on Dec 30, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is there a business in providing such lists or such a service?


If such an org exists, their fees would not be economical.

jmccormac

2:32 am on Dec 30, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If such an org exists, their fees would not be economical.
That depends on the definition of economical. :) The main flaw in most of the approaches that I've seen is that they rely on post-attack detection rather than pre-emptive action.

Regards...jmcc

dstiles

8:08 pm on Dec 30, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> post-attack detection rather than pre-emptive

That's not exactly true. My own system blocks known bad IP ranges AND known bad header combinations. The IPs behind the latter are, if previously unregistered in my database, checked for type - blockable IP range or "broadband" range. Once a range is known, any newly blocked IP is silently appended to the database and updated automatically for each new hit. I check every day for accumulations greater than 20-ish and check the accumulation period: many hits in a few days MAY get the specific IP permanently blocked, but it will first be investigated for possible bot-ness and blocked if appropriate. I check for newly detected IPs of unknown parentage and add their ranges to the relevant type.

In general, my system pre-empts hits; as a result of reporting these pre-emptive hits I perform a small degree of post-attack work which, in future, will then become mostly pre-emptive.

My system COULD be run as a service akin to mail RBLs, returning a 127.0.0.n code according to known/unknown, blocked/unblocked etc. The immediate thing against this is that I sometimes act cavalierly in the matter of short IP ranges from such places as RU, UA, BR etc, or ranges registered in DNS using hotmail or gmail addresses. I may well have blocked a valid DSL range which others may wish traffic from but which, in my own terms, is unlikely to be useful.

The second, more general reason for not offering such a service is the time delay. Anyone considering such a service would be conscious of the idiot G's concern about the response time of web pages, about which some people worry needlessly. Although fairly low in practice, there would be a delay, and it would increase as the blacklist became more widely used. In MY terms, on a local site, the delay is small (usually around 60 ms). In terms of an external site there would be DNS lookup delays, transfer time to AND from the blacklist, plus the actual lookup time. Say, 150 ms?

That reasoning is a trade-off between what is acceptable and what is perceived to be unacceptable, but even the latter may well be more acceptable than allowing bad accesses.

A third possible downside would be the method of getting data from the blacklist: my own method would be via a link in every web page (in my case, a link at the top of a common header file) - I run an IIS server anyway. It is probably possible to pull in the information in some other way (eg .htaccess?).
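For what it's worth, the consuming side of such a service would look much like a mail RBL check: reverse the octets, prepend them to the service's zone, and read the meaning out of a 127.0.0.n answer. A PHP sketch of that client side (the zone name and the return codes are invented, since the service itself does not exist):

<?php
// Query a hypothetical DNSBL-style bot blacklist the way mail servers
// query RBLs. The zone and the meaning of the return codes are made up.
function check_bot_rbl($ip, $zone = 'bots.blacklist.example')
{
    $query  = implode('.', array_reverse(explode('.', $ip))) . '.' . $zone;
    $answer = gethostbyname($query);
    if ($answer === $query) {
        return null;          // NXDOMAIN: not listed
    }
    return $answer;           // e.g. 127.0.0.2 = server farm, 127.0.0.3 = compromised host
}

$verdict = check_bot_rbl($_SERVER['REMOTE_ADDR']);
if ($verdict !== null) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}

The DNS round trip is most of the delay described above; caching the answer locally for a while would be the obvious way to avoid paying it on every hit.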

jmccormac

10:34 am on Dec 31, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It does seem to be post-attack rather than pre-emptive. By pre-emptive, I mean building a working model of all data centres and other ranges and weighting accordingly. Most of the solutions that I've seen proposed over the years tend to rely on a kind of honeypot approach. It can also be more effective to use an IP-level block rather than an Apache/IIS/etc configuration block.

Regards...jmcc

lucy24

6:11 pm on Dec 31, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I run an IIS server anyway. It is probably possible to pull in the information in some other way (eg .htaccess?)

htaccess is a location, not a method. (I generally have to explain this because someone is saying "htaccess" when they mean "mod_rewrite" or "mod_authwhatsit" or similar. Here I'm pointing it out because your first language is That Other Server ;))

dstiles

8:38 pm on Dec 31, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



jmccormac - my ADDITION OF RANGE TO DATABASE is post-attack although the attack itself is pre-empted by a variety of detection methods that DO auto-block bad IPs.

It is not economically feasible to create a complete working model of IP ranges: there are far too many and a major proportion of them are never seen in any kind of "attack" mode for a variety of reasons. In any feasible model one has to react during the attack if the range is not pre-blocked due to previous detection.

I have a form of honeypot on all my sites but the effectiveness compared with header detection is very low.

lucy - sorry, I used the term generically. I have little knowledge of what actually is possible within htaccess - obviously. :)

As a matter of curiosity, is it possible for something in htaccess to probe a remote server for such information as we've been discussing? Hopefully using POST in SSL mode. A simple BAN or OK would be a minimum response, or an IP as in RBLs or, best, would be a string giving more detail (but that would need parsing).

As mentioned here from time to time, I really wish I'd never got started on TOS!

jmccormac

11:38 pm on Dec 31, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@dstiles Well if you look at IPv4 as being 4,294,967,296 IP addresses, it might appear that way. However it is feasible, economically and technically, to build a working model of IPv4 and by extension, the data centre ranges within it. For large sites, pre-emptive action is a very effective way of dealing with the problem.
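To put a rough number on the feasibility: a data-centre model is essentially a sorted list of (start, end) ranges, and even tens of thousands of ranges fit in a few megabytes and can be searched in a handful of comparisons. A PHP sketch of the lookup side (the ranges.csv file and the example range are invented for illustration):

<?php
// Binary search over a sorted, non-overlapping list of (start, end) ranges.
// Even a model with tens of thousands of data-centre ranges stays tiny.
$ranges = array();                                   // array of array(start, end)
foreach (file('ranges.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    list($first, $last) = explode(',', $line);       // e.g. "5.9.0.0,5.9.255.255"
    $ranges[] = array(ip2long($first), ip2long($last));
}
usort($ranges, function ($a, $b) { return $a[0] - $b[0]; });

function in_ranges($ip, array $ranges)
{
    $n  = ip2long($ip);
    $lo = 0;
    $hi = count($ranges) - 1;
    while ($lo <= $hi) {
        $mid = ($lo + $hi) >> 1;
        if ($n < $ranges[$mid][0]) {
            $hi = $mid - 1;
        } elseif ($n > $ranges[$mid][1]) {
            $lo = $mid + 1;
        } else {
            return true;                             // inside a known range
        }
    }
    return false;
}

var_dump(in_ranges('5.9.12.34', $ranges));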

Regards...jmcc

keyplyr

1:21 am on Jan 1, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My 2 cents

This type of discussion makes its way here every couple months, with much the same outcome.

IMO any major efforts spent toward managing IPv4 blocks other than manual look-ups are a waste of time. Spend that energy working with IPv6.

IMO if you have a commercial site hosted in a shared environment, the manual look-up & block is the only effective method. I say this from experience. I have used 2 of the scripts that are archived here at WW and neither is a final solution, although they are effective to a point. I found I spent just as much time managing those scripts' results as I did doing it all by hand, so I ended up just going back to:

Head checks via scripting (rough sketch after this list)
UA filters via htaccess
Method filters via htaccess
File request filters via htaccess
IP filters via htaccess
Whitelisting via robots.txt & htaccess
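By "head checks" I mean sanity-checking the request headers in the script before serving anything. A rough PHP sketch of the idea; the specific rules below are illustrative only, not anyone's actual filter set:

<?php
// Illustrative header sanity checks only; real filter sets stay private
// for the reasons discussed earlier in this thread.
$suspicious = false;

// Genuine browsers virtually always send an Accept header.
if (empty($_SERVER['HTTP_ACCEPT'])) {
    $suspicious = true;
}

// A UA claiming to be MSIE while speaking HTTP/1.0 is often cited as a
// forgery tell, though proxies can legitimately downgrade the protocol.
if (isset($_SERVER['HTTP_USER_AGENT'])
    && strpos($_SERVER['HTTP_USER_AGENT'], 'MSIE') !== false
    && $_SERVER['SERVER_PROTOCOL'] === 'HTTP/1.0') {
    $suspicious = true;
}

// Ordinary visitors only ever use GET, POST or HEAD.
if (!in_array($_SERVER['REQUEST_METHOD'], array('GET', 'POST', 'HEAD'), true)) {
    $suspicious = true;
}

if ($suspicious) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}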

There's only so much one can do in a shared hosting environment. Anything more requires a dedicated server via firewall & routing.