Forum Moderators: open
CREATE TABLE IF NOT EXISTS `ips` (
`A` int(1) unsigned default NULL,
`B` int(1) unsigned default NULL,
`C` int(1) unsigned default NULL,
`D` int(1) unsigned default NULL,
`description` varchar(255) default NULL,
KEY `A` (`A`,`B`,`C`,`D`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
An IP address has 4 parts, or "octets", separated by dots. We're going to store these in separate columns, because this makes it indexing and retrieving them more efficient.
The first four columns are the 4 octets, eg. to store 67.202.62.173, we say A=67, B=202, C=62, D=173.
The last column is a description - not required for the blocker to function, but it's very convenient when maintaining the data.
the int(1) data type will store numbers from 0 to 255, which is ideal for storing each octet
thanks to whoisgregg for that tip
Populate the table with IPs you want to ban. If you are blocking a range of IPs, then you leave a NULL in one of the columns. For example, here is an IP block that belongs to Microsoft:
insert into `ips`(`A`,`B`,`C`,`D`,`description`) values (64,4,4,NULL,'Microsoft')
Here are a few more to get you started:
insert into `ips`(`A`,`B`,`C`,`D`,`description`) values (67,202,62,173,'Alexa');
insert into `ips`(`A`,`B`,`C`,`D`,`description`) values (64,4,4,NULL,'Microsoft');
insert into `ips`(`A`,`B`,`C`,`D`,`description`) values (66,249,65,NULL,'Google');
You can find IP lists like these on the Interweb, or if you're diligent you can compile your own by analyzing your own server logs. I highly recommend rolling your own list. When compiling my list, I started with lists I found online, added some from similar lists shared by colleagues, augmented with my own logs. I found it easiest to pop all the IPs from my logs into Excel; when importing the data I split columns on the "." character. Then it was simple to sort the rows, remove dupes, and see common patterns of IP blocks. From Excel, it was trivial to dump the values into my database.
Now, you have a database table full of IPs you want to detect. Here's what you do with it.
1) Get the IP address. In PHP, it's available from $_SERVER['REMOTE_ADDR'].
2) split the IP address into an array of octets.
3) query the database to find matches
4) return true if any are found.
To save you the trouble of building one, here it is:
function isbannedip(){
$iparray = explode(".",$_SERVER['REMOTE_ADDR']);
$query = "select * from ips where (A=".$iparray[0]." OR A is null) and (B=".$iparray[1]." OR B is null) and (C=".$iparray[2]." OR C is null) and (D=".$iparray[3]." OR D is null) LIMIT 1";
$result = mysql_query($query);
return mysql_num_rows($result)>0;
}
Now, you have an elegant, reusable IP matcher. Use it to shun the bots:
if(isbannedip()){
die();
}
if(isbannedip()){
header('Location: http://www.example.com/');
die();
}
or cloak your content:
if(isbannedip()){
print("mesothelioma mesothelioma mesothelioma mesothelioma mesothelioma");
}else{
print("lorem ipsum dolor sinc et nimibus etna");
}
or whatever you please.
Your database needn't be just bots. You can use this method to block annoying users from your forum, spammers from your blog, or whatever.
Since this is a generic "are you blacklisted?" function, you can use it for all sorts of things. You could include it on every page! In Apache's .htaccess file, add this line:
php_value auto_prepend_file "/global_prepend.inc"
then in that global_prepend.inc file, you insert the code that does the IP check. if the function isbannedip() returns true, then throw a 404 header(), then die().
Or you can use it on individual pages; require_once() the thing, or include() it as needed.
This example was built for IPv4 addresses, but is very easily extended to work with IPv6.
special thanks to whoisgregg, buckworks and jd_morgan for their suggestions to improve this article
Add in a couple of fields to determine when the first hit was plus an expiry period, update the expiry period exponentially (or whatever) every day it hits. Whatever.
I've been going to implement that for soooo long! Pity customers take up so much time. :)
With a little extra effort, you could use the database to generate a firewall block list. This could be automated and refreshed daily as needed. Saves a lot of CPU cycles ;)
e.g.
### 59 ###########################################
block in log first quick from 59.32.0.0/11 to any
block in log first quick from 59.64.0.0/12 to any
block in log first quick from 59.80.0.0/14 to any
block in log first quick from 59.100.0.0/14 to any
block in log first quick from 59.104.0.0/13 to any
### 60 ###########################################
block in log first quick from 60.194.0.0/15 to any
block in log first quick from 60.196.0.0/14 to any
block in log first quick from 60.200.0.0/13 to any
block in log first quick from 60.208.0.0/12 to any
block in log first quick from 60.247.0.0/16 to any
From the ipf how-to:
Rule groups allow you to write your ruleset in a tree fashion, instead of as a
linear list, so that if your packet has nothing to do with the set of tests [..]
those rules will never be consulted.
I'd guess you could make this as fine-grained as needed.
Something like
block in quick from 59.0.0.0/8 to any head 590
block in quick from 60.0.0.0/8 to any head 600
and then you could go hardcore with
block in quick from 59.32.0.0/16 to any head 590 group 5932
block in quick from 59.64.0.0/16 to any head 590 group 5964
and finally
block in log first quick from 59.32.123.0/24 to any group 5932
block in log first quick from 59.64.233.0/24 to any group 5964
(edit: fixed bad copypaste)
But, yes the decimal can be kept in a parallel column and it wouldn't take much to write a little function that converts the octets into decimal. The added complexity is trivial... it's a good idea.
@pfui,
sorry, PHP is my first language, C#.NET is my second. But I believe the technique is simple enough that it could be ported into Perl or Python or Java or ColdFusion or whatever you need, without too much suffering.
~ hww
The only 'thing' that might benefit from using decimal is the check-if-in-range function, and decimal might only be useful there to reduce the actual comparison from four steps (one for each octet) to a single step; You could make the whole thing work with each address specified as four decimal octets, as a single decimal number, or even as a single binary number if you like; the numbers have the same meaning regardless of which radix or format you choose to use to express them or to manipulate them. The most important factors are the efficiency of the check-if-in-range code and the "element storage space" required to implement each approach.
Overall, what matters to potential users is that the finished system support the ability to define flexible IP address ranges with arbitrary start and end addresses (as opposed to single octet-boundary-based entries) -- the ability to specify IP address ranges that do not fall on whole-octet boundaries, and thus the ability to support the "condensation" of multiple contiguous IP address ranges which start and end on different octet boundaries, e.g. "block 192.168.0.5 through 192.168.127.62 inclusive."
If an example is needed, maybe I want to block the entire "Amazon compute cloud" IP address range except for the Internet Archive Wayback Machine servers -- This would not be possible if the code only supports ranges defined on and by octet boundaries.
Jim
except for the Internet Archive Wayback Machine servers
While the intentions of the Wayback machine was good it has become an evil placed used by scrapers, lawyers, law enforcement, etc. and those evil servers are on the top of my block list - nothing should ever be allowed in the Wayback machine, bad idea, burned many people.
Anyway, a simple PHP function to covert to IP octets to integers and vice versa:
function IPOctetToIPLong($IP)
{
if ($IP == "") {
return 0;
} else {
$ips = explode('.', "$IP");
return ($ips[3] + ($ips[2] << 8) + ($ips[1] << 16) + ($ips[0] << 24) );
}
}function IPLongToIPOctet($IP)
{
if ($IP == "")
return "";$IP=floatval($IP); // avoids capped at 127.255.255.255
$a=($IP>>24)&255;
$b=($IP>>16)&255;
$c=($IP>>8)&255;
$d=$IP&255;return "$a.$b.$c.$d";
}$theIP="127.0.0.1";
$theLong=IPOctetToIPLong($theIP);
echo 'IP '.$theIP.' coverted to integer -> '.$theLong.' and back to IP ->'.IPLongToIPOctet($theLong);
Very fast an efficient routines using SHIFT which is CPU register arithmetic instead of messy and slow long integer multiplication and division routines.
When you run it the test code at the bottom should display:
IP 127.0.0.1 coverted to integer -> 2130706433 and back to IP ->127.0.0.1
Enjoy ;)
I first have a table with whiteListed Bot ranges stored in the application scope as a query; only a handfull of records.
then if not found there (query of query faster than DB Run)
<!--- Coldfusion & MySQL Query Flavor --->
<cfquery name="getBadRange">
select banned -- Data type Bit, also indexed
from tbl_bad_ranges
where
INET_ATON('#cgi.REMOTE_ADDR#') -- :)
between start_ip_int and end_ip_int
</cfquery>
<cfif getBadRange.recordcount>
<!--- Log Bad Request routin --->
<cfheader statuscode="403" statustext="Forbidden">
<cfabort>
</cfif>
Just make sure that start_ip_int and end_ip_int stored as UNSINED.
Short and Sweet.
Blend27
p.s. added
<cfscript>
function ipaddressToInt(ipaddress)
{ a = ListFirst(ipaddress,".");
a_rest = ListRest(ipaddress, ".");
b = ListFirst(a_rest, ".");
b_rest = ListRest(a_rest, ".");
c = ListFirst(b_rest, ".");
c_rest = ListRest(b_rest,".");
d = ListFirst(c_rest, ".");
return 16777216*a + 65536*b + 256*c + d;
}
</cfscript>
#ipaddressToInt(ipaddress)#
for those that are MSSQL USERS, not sure if Oracle has its own routine for IP Conversion
BTW, CFLib has some Network conversion functions in NetLib
//Converts the IP Address from a string to an Unsigned Integer.
private System.UInt32 ConvertIptoUInt(string IPAddress)
{
string[] ip = IPAddress.Split(new char[] { '.' });
System.UInt32[] part = new System.UInt32[4];
for (int i = 0;i <= 3;i++)
{
part[i] = System.UInt32.Parse(ip[i]);
}
return (part[0] << 24) + (part[1] << 16) + (part[2] << 8) + part[3];
}
//Converts the Unsigned Integer to a IP Address string.
static string ConvertUinttoIp(System.UInt32 ip)
{
UInt32[] i = new UInt32[4] { 0, 0, 0, 0 };
i[0] = ip >> 24;
ip = ip - (i[0] << 24);
i[1] = ip >> 16;
ip = ip - (i[1] << 16);
i[2] = ip >> 8;
ip = ip - (i[2] << 8);
i[3] = ip;
return string.Format("{0}.{1}.{2}.{3}", i[0], i[1], i[2], i[3]);
}
Get the IP address. In PHP, it's available from $_SERVER['REMOTE_ADDR'].
If the person is behind a proxy or proxies, their IP won`t be available in this var, so its not going to be accurate or reliable. Instead you should check for multiple IP addresses in $_SERVER['HTTP_X_FORWARDED_FOR'] as well as the default. If multiple ips are present, they are seperated by a comma. Build an array of IP addresses first like this:
function getIPAddresses() {
$ip = array();
if (!empty($_SERVER['HTTP_CLIENT_IP'])) {
$ip[] = $_SERVER['HTTP_CLIENT_IP'];
} elseif (!empty($_SERVER['HTTP_X_FORWARDED_FOR'])) {
if (strpos($_SERVER['HTTP_X_FORWARDED_FOR'],',')) {
$split = explode(',',$_SERVER['HTTP_X_FORWARDED_FOR']);
foreach ($split AS $value) {
$ip[] = $value;
}
} else {
$ip[] = $_SERVER['HTTP_X_FORWARDED_FOR'];
}
} else {
$ip[] = $_SERVER['REMOTE_ADDR'];
}
return $ip;
}
$ips = getIPAddresses();
Then work from there..
dc
400 million lookups on 4 billion ip address
space takes 31 seconds, or 12,903,225 per second
The testbench was a dual 550 XEON running a single thread.
Doing this in mysql + php could never reach 12M/sec, even an isam can't reach those levels. However I don't dispute that the approach is perfectly workable, and may be perfectly acceptable in the situation. I am only illustrating what is possible. In fact, unless the network is up to it, you will never reach 12M/sec, because it implies 12M ppps, which most network cards cannot reach without ASIC assistance.
If the goal is banning by region, you can take some of the load off by using a geodns service to selectively return the ip address, a different ip address, or no ip address at all. This has the effect of filtering the majority of the traffic before it ever reaches the server.
BTW, if you are accessing the packet header directly, and working in a compiled language, then binary representation is much preferred because it breaks down to a 4 shifts and or's which are extremely fast on most modern chips.
I do not know how this would compare to high-level SQL activity but presumably the SQL server could be set to cache the most recent hits?
There are a number of ways around flat-file but it needs a) careful planning or b) holding IPs in a (C-based?) program's memory space. This is not always possible (eg lack of C experience). ASP does not have workable static memory space worth talking about so on that platform it's not possible to hold IP ranges in memory without a COM module.
Other solutions, including a directory/file structure, have been suggested here and may fit the bill for some systems. I think Windows sluggish file access may mitigate against this in larger systems with high throughput.
a flat-file solution can only cope up to a point
Incorrect.
Don't think of the term flat file as being all the data in a single flat file.
Use many flat files and the OS as the index.
Break up the files into a directory hierarchy so that no more than a couple of thousand files exist in any directory and the access speeds are phenomenal.
I use this very method to track active IPs running across the site, banned IPs, ranges of IPs, etc.
It's faster than greased lightning and it will never crash, unlike a SQL file under a heavy server load with resources close to being maxed out while under attack, unless the OS itself crashes and nothing works at that point.
I think Windows sluggish file access may mitigate against this in larger systems with high throughput.
Windows file access is more than adequate as long as we're not talking the old Windows FAT system and Windows caches the most frequently accessed files so bump up your disk cache size if it's not performing to your liking.
As I said, I'm not sure your solution would work well under Windows - it has (or had) a poor file access time with a lot of files in each of several directories.
I accept the point about SQL but to its advantage there is the facility to load other information into it, such as expiry time (from days to infinity), reasons and an incremental hits / expiry counter (keep hitting me every day and I'll increase number of days). This latter could be useful where a broadband IP is static and owned by a bad guy, as against dynamic that changes owner within a few days. I accept this info could be included in each of several tabbed files but this would again increase its size. With SQL you only pull out what you want.
Also, updating info (as against adding info) would be more tricky since tabbed lines would need to be modified and the whole file re-written: SQL you just write the update.
plumsauce - a flat-file solution can only cope up to a point.
Did I ever say it was a flat file :D
I do not know how this would compare to high-level SQL activity but presumably the SQL server could be set to cache the most recent hits?
It cleans SQL's clock, and the fastest ISAM's available. A well designed, purpose built flat file systm will always be faster than either of the above. The referenced implementation is none of the above.
This is not always possible (eg lack of C experience).
Not a problem here.
Windows file access is more than adequate as long as we're not talking the old Windows FAT system and Windows caches the most frequently accessed files so bump up your disk cache size if it's not performing to your liking.
Agreed.
I would point out for the benefit of other readers that the file system directory is in essence a highly optimised hash or b+tree indexing system that has been tuned by the OS developers.
Furthermore, NTFS doesn't mysteriously corrupt itself on a crash unlike linux file systems.
Furthermore, NTFS doesn't mysteriously corrupt itself on a crash unlike linux file systems.
To be fair, both OS's are about equally stable in that regard, let's not quibble about OS's as it's off topic and many of the fallacies spread are based on the code of yesteryear.
Inquiring an indexed b-tree file specifically designed to locate whether an IP or IP range is blocked or not will win hands down against any SQL server ever built, probably before the SQL request is even parsed!
Since an IP address is 4 octets, a binary index of 256 entries 4 levels deep is all you need to find any IP address therefore 4 seeks within a file max and you're done.
I do more individual flat files because I'm tracking more data but what I do is a combo so it's lean and mean.
I suppose hitting the second level and finding nothing in the third level would imply 123.123.0.0 - 123.123.255.255?
Some of my ranges are more granular than that, though - eg: 123.123.12.32 - 123.123.12.63. Ok, it's only likely to be a couple of hundred entries max.
On top of that, as mentioned, my flat file contains other information such as "release after x days" and "never return a soft option" - ie never show a report form to (eg) AWS but allow reporting for blocked dynamics.
And then there is dynamic update and easy maintenance. :(
As I said, I'd need to put time into working out all of this and more.