Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
IP Banning Primer
httpwebwitch




msg:3991841
 7:19 pm on Sep 18, 2009 (gmt 0)

I won't ask why you want to block IPs. But supposing you do, here's how to do it.
First, you need a database to store your IPs in.
Create a table, with this schema:

CREATE TABLE IF NOT EXISTS `ips` (
`A` tinyint unsigned default NULL,
`B` tinyint unsigned default NULL,
`C` tinyint unsigned default NULL,
`D` tinyint unsigned default NULL,
`description` varchar(255) default NULL,
KEY `A` (`A`,`B`,`C`,`D`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

An IP address has 4 parts, or "octets", separated by dots. We're going to store these in separate columns, because this makes indexing and retrieving them more efficient.
The first four columns are the 4 octets, e.g. to store 67.202.62.173, we say A=67, B=202, C=62, D=173.
The last column is a description - not required for the blocker to function, but it's very convenient when maintaining the data.

The tinyint unsigned data type stores numbers from 0 to 255 in a single byte, which is ideal for storing each octet. (int(1) would not do that: the 1 is only a display width, and the column is still a 4-byte integer.)
thanks to whoisgregg for that tip

Populate the table with IPs you want to ban. If you are blocking a range of IPs, then you leave a NULL in one of the columns. For example, here is an IP block that belongs to Microsoft:

insert into `ips`(`A`,`B`,`C`,`D`,`description`) values (64,4,4,NULL,'Microsoft')


Leaving a NULL in the "D" column means that this row represents any IPs that match the pattern 64.4.4.*:
64.4.4.1
64.4.4.2
64.4.4.3
64.4.4.4
...
64.4.4.253
64.4.4.254
64.4.4.255
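
The NULL-as-wildcard rule can be sketched in plain PHP, which is handy for sanity-checking a list before it goes into the table. This is only an illustration: the function name rowMatches() and the array layout are assumptions, not part of the schema above.

```php
<?php
// Sketch: a row matches when every column is either NULL (wildcard) or equal
// to the corresponding octet. rowMatches() is a hypothetical helper name.
function rowMatches(array $row, string $ip): bool
{
    $octets = explode('.', $ip);
    if (count($octets) !== 4) {
        return false; // not a dotted-quad IPv4 address
    }
    foreach (['A', 'B', 'C', 'D'] as $i => $column) {
        if ($row[$column] !== null && (int) $row[$column] !== (int) $octets[$i]) {
            return false;
        }
    }
    return true;
}

$microsoft = ['A' => 64, 'B' => 4, 'C' => 4, 'D' => null];
var_dump(rowMatches($microsoft, '64.4.4.254')); // bool(true)
var_dump(rowMatches($microsoft, '64.4.5.1'));   // bool(false)
```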

Here are a few more to get you started:
insert into `ips`(`A`,`B`,`C`,`D`,`description`) values (67,202,62,173,'Alexa');
insert into `ips`(`A`,`B`,`C`,`D`,`description`) values (64,4,4,NULL,'Microsoft');
insert into `ips`(`A`,`B`,`C`,`D`,`description`) values (66,249,65,NULL,'Google');

You can find IP lists like these on the Interweb, or if you're diligent you can compile your own by analyzing your own server logs. I highly recommend rolling your own list. When compiling my list, I started with lists I found online, added some from similar lists shared by colleagues, augmented with my own logs. I found it easiest to pop all the IPs from my logs into Excel; when importing the data I split columns on the "." character. Then it was simple to sort the rows, remove dupes, and see common patterns of IP blocks. From Excel, it was trivial to dump the values into my database.

Now, you have a database table full of IPs you want to detect. Here's what you do with it.

1) Get the IP address. In PHP, it's available from $_SERVER['REMOTE_ADDR'].
2) Split the IP address into an array of octets.
3) Query the database to find matches.
4) Return true if any are found.

To save you the trouble of building one, here it is:

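function isbannedip(){
// Cast each octet to an integer so that only numbers ever reach the SQL string.
$iparray = array_map('intval', explode(".",$_SERVER['REMOTE_ADDR']));
if(count($iparray) != 4){ return false; } // not a dotted-quad IPv4 address
$query = "select * from ips where (A=".$iparray[0]." OR A is null) and (B=".$iparray[1]." OR B is null) and (C=".$iparray[2]." OR C is null) and (D=".$iparray[3]." OR D is null) LIMIT 1";
// Note: the mysql_* functions were removed in PHP 7; on current versions use mysqli or PDO instead.
$result = mysql_query($query);
return mysql_num_rows($result)>0;
}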


MySQL lets you add "LIMIT 1" to a query. This isn't strictly necessary, but since you only want to detect the presence of a row, this tells the database that it can give up after it's found one row; it doesn't need to keep scanning. I don't know if this actually speeds things up, but doing that makes me think that it's faster.

Now, you have an elegant, reusable IP matcher. Use it to shun the bots:

if(isbannedip()){
die();
}


or send them packing:

if(isbannedip()){
header('Location: http://www.example.com/');
die();
}

or cloak your content:

if(isbannedip()){
print("mesothelioma mesothelioma mesothelioma mesothelioma mesothelioma");
}else{
print("lorem ipsum dolor sinc et nimibus etna");
}

or whatever you please.

Your database needn't be just bots. You can use this method to block annoying users from your forum, spammers from your blog, or whatever.

Since this is a generic "are you blacklisted?" function, you can use it for all sorts of things. You could include it on every page! In Apache's .htaccess file, add this line:
php_value auto_prepend_file "/global_prepend.inc"

Then, in that global_prepend.inc file, insert the code that does the IP check. If isbannedip() returns true, send a 404 with header(), then die().

Or you can use it on individual pages; require_once() the thing, or include() it as needed.

This example was built for IPv4 addresses, but is very easily extended to work with IPv6.
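
One way the idea might carry over to IPv6, as a rough sketch only: an IPv6 address has eight 16-bit groups rather than four octets, so instead of octet columns this compares the packed binary form that PHP's inet_pton() returns. The function name ipInPrefix() and the prefix-length parameter are assumptions for illustration, not part of the table design above.

```php
<?php
// Sketch: prefix matching on the packed binary form, which works for both
// IPv4 (4 bytes) and IPv6 (16 bytes). ipInPrefix() is a hypothetical helper.
function ipInPrefix(string $ip, string $network, int $bits): bool
{
    $addr = @inet_pton($ip);
    $net  = @inet_pton($network);
    if ($addr === false || $net === false || strlen($addr) !== strlen($net)) {
        return false; // unparsable, or mixed IPv4/IPv6
    }
    if ($bits < 0 || $bits > strlen($addr) * 8) {
        return false; // prefix length out of range for this address family
    }
    $fullBytes = intdiv($bits, 8);
    $restBits  = $bits % 8;
    // Compare the whole bytes covered by the prefix first.
    if (strncmp($addr, $net, $fullBytes) !== 0) {
        return false;
    }
    if ($restBits === 0) {
        return true;
    }
    // Then compare the leading bits of the next byte.
    $mask = (0xFF << (8 - $restBits)) & 0xFF;
    return (ord($addr[$fullBytes]) & $mask) === (ord($net[$fullBytes]) & $mask);
}

var_dump(ipInPrefix('2001:db8::1', '2001:db8::', 32)); // bool(true)
var_dump(ipInPrefix('64.4.4.17', '64.4.4.0', 24));     // bool(true)
```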

special thanks to whoisgregg, buckworks and jd_morgan for their suggestions to improve this article

 

Pfui




msg:3991953
 1:07 am on Sep 19, 2009 (gmt 0)

Thanks to all for any new way to make our webmastering/site-hosting lives easier. I'm all for elegant solutions but alas, can't try out your technique(s). It requires MySQL and, apparently, PHP, neither of which we run on our machine. (Don't need 'em; don't use 'em; don't fret about their exploits.) Got Perl? :)

Ocean10000




msg:3991960
 1:23 am on Sep 19, 2009 (gmt 0)

Good job, httpwebwitch. You provided some nice examples in PHP and a great explanation of the reasoning behind the design.

dstiles




msg:3991975
 1:57 am on Sep 19, 2009 (gmt 0)

I would have turned each IP into a single number. That way it's easy to put a range into the database such as 127.0.0.0 - 127.0.95.255 (converted to decimals) using two fields, lowest and highest. A simple numeric query of ">=" and "<=" would pull out anything within the range.
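
A minimal sketch of that range check, assuming a 64-bit PHP build so that ip2long() always returns a non-negative integer. The function names ipToInt() and inRange() are made up for illustration; in SQL the same test would be two comparisons against the lowest and highest fields.

```php
<?php
// Sketch: treat each address as one integer and test membership in a
// (lowest, highest) pair with two comparisons. Helper names are hypothetical.
function ipToInt(string $ip): ?int
{
    $long = ip2long($ip);
    return $long === false ? null : $long;
}

function inRange(string $ip, string $lowest, string $highest): bool
{
    $n  = ipToInt($ip);
    $lo = ipToInt($lowest);
    $hi = ipToInt($highest);
    if ($n === null || $lo === null || $hi === null) {
        return false; // any unparsable address means no match
    }
    return $n >= $lo && $n <= $hi;
}

var_dump(inRange('127.0.42.7', '127.0.0.0', '127.0.95.255')); // bool(true)
var_dump(inRange('127.0.96.0', '127.0.0.0', '127.0.95.255')); // bool(false)
```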

Add in a couple of fields to determine when the first hit was plus an expiry period, update the expiry period exponentially (or whatever) every day it hits. Whatever.

I've been going to implement that for soooo long! Pity customers take up so much time. :)

incrediBILL




msg:3992080
 8:13 am on Sep 19, 2009 (gmt 0)

I would have turned each IP into a single number.

Exactly - as an extra field even, no reason you can't have both.

However, I use flat files per IP or IP range instead of SQL for this because a hardcore DDOS has resulted in crashing my database a couple of times.

caribguy




msg:3992107
 9:58 am on Sep 19, 2009 (gmt 0)

Great primer, thanks httpwebwitch!

With a little extra effort, you could use the database to generate a firewall block list. This could be automated and refreshed daily as needed. Saves a lot of CPU cycles ;)

e.g.

### 59 ###########################################
block in log first quick from 59.32.0.0/11 to any
block in log first quick from 59.64.0.0/12 to any
block in log first quick from 59.80.0.0/14 to any
block in log first quick from 59.100.0.0/14 to any
block in log first quick from 59.104.0.0/13 to any

### 60 ###########################################
block in log first quick from 60.194.0.0/15 to any
block in log first quick from 60.196.0.0/14 to any
block in log first quick from 60.200.0.0/13 to any
block in log first quick from 60.208.0.0/12 to any
block in log first quick from 60.247.0.0/16 to any
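
As a rough sketch of how such a list could be generated from the octet table in the first post: each row with trailing NULLs maps to one CIDR block (NULL in D gives /24, NULL in C and D gives /16, and so on). The helper name rowToRule() is hypothetical, and the sketch assumes NULLs only appear as a trailing run.

```php
<?php
// Sketch: turn octet rows (NULL = wildcard) into ipf-style block rules.
// rowToRule() is a hypothetical helper name.
function rowToRule(array $row): ?string
{
    $octets = [$row['A'], $row['B'], $row['C'], $row['D']];

    // Count the leading octets that are actually specified.
    $prefixOctets = 0;
    while ($prefixOctets < 4 && $octets[$prefixOctets] !== null) {
        $prefixOctets++;
    }
    if ($prefixOctets === 0) {
        return null; // an all-NULL row would block everything; skip it
    }
    // Everything after the prefix must be NULL, and becomes 0 in the rule.
    for ($i = $prefixOctets; $i < 4; $i++) {
        if ($octets[$i] !== null) {
            return null; // a value after a NULL is not a single CIDR block
        }
        $octets[$i] = 0;
    }
    return sprintf(
        'block in log first quick from %d.%d.%d.%d/%d to any',
        $octets[0], $octets[1], $octets[2], $octets[3], $prefixOctets * 8
    );
}

echo rowToRule(['A' => 64, 'B' => 4, 'C' => 4, 'D' => null]), "\n";
// block in log first quick from 64.4.4.0/24 to any
```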

incrediBILL




msg:3992151
 1:48 pm on Sep 19, 2009 (gmt 0)

generate a firewall block list. ... Saves a lot of CPU cycles

Problem is when this list gets long with many thousands of entries it bogs Apache down.

Back to SQL or flat files to solve that problem.

caribguy




msg:3992226
 6:30 pm on Sep 19, 2009 (gmt 0)

Haven't tried this yet, but I know that ipf can use groups to limit the number of rules that are processed. Apache and your other services will never see any of that traffic...

From the ipf how-to:
Rule groups allow you to write your ruleset in a tree fashion, instead of as a
linear list, so that if your packet has nothing to do with the set of tests [..]
those rules will never be consulted.

I'd guess you could make this as fine-grained as needed.

Something like

block in quick from 59.0.0.0/8 to any head 590
block in quick from 60.0.0.0/8 to any head 600

and then you could go hardcore with

block in quick from 59.32.0.0/16 to any head 590 group 5932
block in quick from 59.64.0.0/16 to any head 590 group 5964

and finally

block in log first quick from 59.32.123.0/24 to any group 5932
block in log first quick from 59.64.233.0/24 to any group 5964

(edit: fixed bad copypaste)

httpwebwitch




msg:3992227
 6:40 pm on Sep 19, 2009 (gmt 0)

@dstiles and @incrediBill,
The idea of converting the IP into decimal was suggested while this article was being proofread. It does have significant advantages, and I do believe it's superior to the method I've described here. But it has one major disadvantage: I can't look at a decimal number and compare it to an octet-divided IP in my head. Since much of the maintenance of this list is done by importing data in and out of Excel, having the IP number on my screen in a readable format makes maintaining the list easier. For me.

But, yes the decimal can be kept in a parallel column and it wouldn't take much to write a little function that converts the octets into decimal. The added complexity is trivial... it's a good idea.

@pfui,
sorry, PHP is my first language, C#.NET is my second. But I believe the technique is simple enough that it could be ported into Perl or Python or Java or ColdFusion or whatever you need, without too much suffering.

~ hww

jdMorgan




msg:3992236
 7:55 pm on Sep 19, 2009 (gmt 0)

You don't have to convert the IP octets to decimal in your head, and with careful attention to optimizing the number of steps/calculations required for database checking, you would not have to store the IP address as decimal, either.

The only 'thing' that might benefit from using decimal is the check-if-in-range function, and decimal might only be useful there to reduce the actual comparison from four steps (one for each octet) to a single step; You could make the whole thing work with each address specified as four decimal octets, as a single decimal number, or even as a single binary number if you like; the numbers have the same meaning regardless of which radix or format you choose to use to express them or to manipulate them. The most important factors are the efficiency of the check-if-in-range code and the "element storage space" required to implement each approach.

Overall, what matters to potential users is that the finished system support the ability to define flexible IP address ranges with arbitrary start and end addresses (as opposed to single octet-boundary-based entries) -- the ability to specify IP address ranges that do not fall on whole-octet boundaries, and thus the ability to support the "condensation" of multiple contiguous IP address ranges which start and end on different octet boundaries, e.g. "block 192.168.0.5 through 192.168.127.62 inclusive."

If an example is needed, maybe I want to block the entire "Amazon compute cloud" IP address range except for the Internet Archive Wayback Machine servers -- This would not be possible if the code only supports ranges defined on and by octet boundaries.
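
A minimal sketch of that kind of rule set, with exceptions checked before the block list. The blocked range is the 192.168.0.5 through 192.168.127.62 example above; the exception range, the function names, and the array layout are assumptions for illustration, and a 64-bit PHP build is assumed so ip2long() values compare as unsigned numbers.

```php
<?php
// Sketch: arbitrary start/end ranges with allow-list exceptions.
// ipBetween() and isBlocked() are hypothetical helper names.
function ipBetween(string $ip, string $low, string $high): bool
{
    $n = ip2long($ip);
    if ($n === false) {
        return false; // not a valid IPv4 address
    }
    return $n >= ip2long($low) && $n <= ip2long($high);
}

function isBlocked(string $ip, array $blocked, array $allowed): bool
{
    // Exceptions win, so they are tested first.
    foreach ($allowed as [$low, $high]) {
        if (ipBetween($ip, $low, $high)) {
            return false;
        }
    }
    foreach ($blocked as [$low, $high]) {
        if (ipBetween($ip, $low, $high)) {
            return true;
        }
    }
    return false;
}

$blocked = [['192.168.0.5', '192.168.127.62']];   // range from the example above
$allowed = [['192.168.10.0', '192.168.10.255']];  // made-up exception inside it

var_dump(isBlocked('192.168.0.5', $blocked, $allowed));    // bool(true)
var_dump(isBlocked('192.168.10.20', $blocked, $allowed));  // bool(false)
var_dump(isBlocked('192.168.127.63', $blocked, $allowed)); // bool(false)
```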

Jim

incrediBILL




msg:3992287
 11:55 pm on Sep 19, 2009 (gmt 0)

except for the Internet Archive Wayback Machine servers

While the intentions of the Wayback Machine were good, it has become an evil place used by scrapers, lawyers, law enforcement, etc., and those evil servers are at the top of my block list - nothing should ever be allowed in the Wayback Machine; bad idea, it has burned many people.

Anyway, a simple pair of PHP functions to convert IP octets to integers and vice versa:

function IPOctetToIPLong($IP)
{
if ($IP == "") {
return 0;
} else {
$ips = explode('.', "$IP");
return ($ips[3] + ($ips[2] << 8) + ($ips[1] << 16) + ($ips[0] << 24) );
}
}

function IPLongToIPOctet($IP)
{
if ($IP == "")
return "";

$IP=floatval($IP); // avoids capped at 127.255.255.255

$a=($IP>>24)&255;
$b=($IP>>16)&255;
$c=($IP>>8)&255;
$d=$IP&255;

return "$a.$b.$c.$d";
}

$theIP="127.0.0.1";
$theLong=IPOctetToIPLong($theIP);
echo 'IP '.$theIP.' converted to integer -> '.$theLong.' and back to IP ->'.IPLongToIPOctet($theLong);

Very fast and efficient routines using SHIFT, which is CPU register arithmetic, instead of messy and slow long integer multiplication and division routines.

When you run it, the test code at the bottom should display:
IP 127.0.0.1 converted to integer -> 2130706433 and back to IP ->127.0.0.1

Enjoy ;)

blend27




msg:3992468
 1:00 pm on Sep 20, 2009 (gmt 0)

Jim, here is what I do in Coldfusion/MySql:

I first have a table with whitelisted bot ranges stored in the application scope as a query; only a handful of records.

then if not found there (query of query faster than DB Run)

<!--- Coldfusion & MySQL Query Flavor --->
<cfquery name="getBadRange">
select banned -- Data type Bit, also indexed
from tbl_bad_ranges
where
INET_ATON('#cgi.REMOTE_ADDR#') -- :)
between start_ip_int and end_ip_int
</cfquery>
<cfif getBadRange.recordcount>
<!--- Log Bad Request routine --->
<cfheader statuscode="403" statustext="Forbidden">
<cfabort>
</cfif>

Just make sure that start_ip_int and end_ip_int are stored as UNSIGNED.

Short and Sweet.

Blend27

p.s. added
<cfscript>
function ipaddressToInt(ipaddress)
{ a = ListFirst(ipaddress,".");
a_rest = ListRest(ipaddress, ".");
b = ListFirst(a_rest, ".");
b_rest = ListRest(a_rest, ".");
c = ListFirst(b_rest, ".");
c_rest = ListRest(b_rest,".");
d = ListFirst(c_rest, ".");
return 16777216*a + 65536*b + 256*c + d;
}
</cfscript>
#ipaddressToInt(ipaddress)#

for those that are MSSQL USERS, not sure if Oracle has its own routine for IP Conversion

BTW, CFLib has some Network conversion functions in NetLib

Ocean10000




msg:3992525
 4:31 pm on Sep 20, 2009 (gmt 0)

For those who want to see these in C#
Assumes IPv4.

//Converts the IP Address from a string to an Unsigned Integer.
private System.UInt32 ConvertIptoUInt(string IPAddress)
{
string[] ip = IPAddress.Split(new char[] { '.' });
System.UInt32[] part = new System.UInt32[4];
for (int i = 0;i <= 3;i++)
{
part[i] = System.UInt32.Parse(ip[i]);
}
return (part[0] << 24) + (part[1] << 16) + (part[2] << 8) + part[3];
}

//Converts the Unsigned Integer to a IP Address string.
static string ConvertUinttoIp(System.UInt32 ip)
{
UInt32[] i = new UInt32[4] { 0, 0, 0, 0 };
i[0] = ip >> 24;
ip = ip - (i[0] << 24);
i[1] = ip >> 16;
ip = ip - (i[1] << 16);
i[2] = ip >> 8;
ip = ip - (i[2] << 8);
i[3] = ip;
return string.Format("{0}.{1}.{2}.{3}", i[0], i[1], i[2], i[3]);
}

dreamcatcher




msg:3992732
 5:54 am on Sep 21, 2009 (gmt 0)

Get the IP address. In PHP, it's available from $_SERVER['REMOTE_ADDR'].

If the person is behind a proxy or proxies, their IP won't be available in this var, so it's not going to be accurate or reliable. Instead you should check for multiple IP addresses in $_SERVER['HTTP_X_FORWARDED_FOR'] as well as the default. If multiple IPs are present, they are separated by commas. Build an array of IP addresses first like this:


function getIPAddresses() {
$ip = array();
if (!empty($_SERVER['HTTP_CLIENT_IP'])) {
$ip[] = $_SERVER['HTTP_CLIENT_IP'];
} elseif (!empty($_SERVER['HTTP_X_FORWARDED_FOR'])) {
if (strpos($_SERVER['HTTP_X_FORWARDED_FOR'],',') !== false) {
$split = explode(',',$_SERVER['HTTP_X_FORWARDED_FOR']);
foreach ($split AS $value) {
$ip[] = trim($value); // entries are usually separated by ", ", so strip the space
}
} else {
$ip[] = trim($_SERVER['HTTP_X_FORWARDED_FOR']);
}
} else {
$ip[] = $_SERVER['REMOTE_ADDR'];
}
return $ip;
}

$ips = getIPAddresses();

Then work from there..
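
One possible next step, as a sketch: these header values are supplied by the client or by intermediate proxies, so entries may be malformed (stray spaces, words like "unknown"). Filtering the array down to well-formed IPv4 addresses before any lookup keeps junk out of the query. The function name cleanIpList() is hypothetical and the addresses are documentation examples.

```php
<?php
// Sketch: keep only syntactically valid IPv4 addresses from a header-derived
// list, dropping duplicates. cleanIpList() is a hypothetical helper name.
function cleanIpList(array $candidates): array
{
    $clean = [];
    foreach ($candidates as $candidate) {
        $candidate = trim((string) $candidate);
        if (filter_var($candidate, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
            $clean[] = $candidate;
        }
    }
    // array_unique() keeps the first occurrence; array_values() reindexes.
    return array_values(array_unique($clean));
}

print_r(cleanIpList(['192.0.2.10', ' 198.51.100.20', 'unknown', '192.0.2.10']));
```

Each surviving address can then be passed to the ban check in turn.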

dc

carguy84




msg:3992745
 6:26 am on Sep 21, 2009 (gmt 0)

Has anyone done performance testing of splitting the IP in code versus building the IP on the SQL server?

select rowId
from ips
where @ip = a + '.' + b + '.' + c + '.' + d

jojy




msg:3992871
 1:50 pm on Sep 21, 2009 (gmt 0)

The IP check query should run once and then cache the result in the session or a cookie, so you don't have to check again and again. That could save the database overhead.

whoisgregg




msg:3992891
 2:25 pm on Sep 21, 2009 (gmt 0)

Anyway, a simple pair of PHP functions to convert IP octets to integers and vice versa:

There are also built-in PHP functions to handle this:

$int = sprintf("%u", ip2long('127.0.0.1')); // http://php.net/ip2long
$ip = long2ip($int); // http://php.net/long2ip

incrediBILL




msg:3992916
 3:04 pm on Sep 21, 2009 (gmt 0)

There are also built-in PHP functions

True, but those got bugged up in some of the 64-bit PHP versions for a while, and since I was on a 64-bit OS, I just used my own to avoid those problems for a while.

Also just showing how it's done because it's an easy function to convert to other languages.

dstiles




msg:3993070
 8:33 pm on Sep 21, 2009 (gmt 0)

Dreamcatcher - remember, folks, to whitelist the intranet IP ranges such as 10.n.n.n etc. They are very common in proxies and should not (normally!) be blocked. :)

plumsauce




msg:3993288
 5:51 am on Sep 22, 2009 (gmt 0)

If you want performance numbers from doing the same thing in 'C', here is an extract from the comments in a c file designed to do the same job:


400 million lookups on 4 billion ip address
space takes 31 seconds, or 12,903,225 per second

The testbench was a dual 550 XEON running a single thread.

Doing this in mysql + php could never reach 12M/sec; even an ISAM can't reach those levels. However, I don't dispute that the approach is perfectly workable, and may be perfectly acceptable in the situation. I am only illustrating what is possible. In fact, unless the network is up to it, you will never reach 12M/sec, because it implies 12M pps, which most network cards cannot reach without ASIC assistance.

If the goal is banning by region, you can take some of the load off by using a geodns service to selectively return the ip address, a different ip address, or no ip address at all. This has the effect of filtering the majority of the traffic before it ever reaches the server.

BTW, if you are accessing the packet header directly, and working in a compiled language, then binary representation is much preferred because it breaks down to 4 shifts and ORs, which are extremely fast on most modern chips.

thetrasher




msg:3993383
 11:21 am on Sep 22, 2009 (gmt 0)

$ips = getIPAddresses();

Then work from there..

dc


X-Forwarded-For: epic fail!
blend27




msg:3993556
 4:44 pm on Sep 22, 2009 (gmt 0)

--- X-Forwarded-For: epic fail! ---

You mean to tell me that I've been missing out on the action from all those notorious X-Forwarded-For users for all these years?

There are 2 types of those: whitelisted ones and the ones that everybody else uses.

!fail

caribguy




msg:3993576
 5:17 pm on Sep 22, 2009 (gmt 0)

HTTP_CLIENT_IP'67.44.83.nnn'
HTTP_X_FORWARDED_FOR'67.44.83.nnn, 66.82.162.zzz'

Don't forget your users with Internet via satellite...

dstiles




msg:3993765
 9:56 pm on Sep 22, 2009 (gmt 0)

plumsauce - a flat-file solution can only cope up to a point. The time taken to load the file on a busy server can mean overlaps (file busy error) and in extreme cases exceed available number of filehandles, taking into account that at the same time there are perhaps hundreds of images being served and web pages being opened and processed at the same time. I had this on an earlier server with only a moderate throughput and I had to upgrade the processor to get the required speed (not the only reason for upgrade).

I do not know how this would compare to high-level SQL activity but presumably the SQL server could be set to cache the most recent hits?

There are a number of ways around flat-file but it needs a) careful planning or b) holding IPs in a (C-based?) program's memory space. This is not always possible (eg lack of C experience). ASP does not have workable static memory space worth talking about so on that platform it's not possible to hold IP ranges in memory without a COM module.

Other solutions, including a directory/file structure, have been suggested here and may fit the bill for some systems. I think Windows' sluggish file access may militate against this in larger systems with high throughput.

incrediBILL




msg:3993776
 10:11 pm on Sep 22, 2009 (gmt 0)

a flat-file solution can only cope up to a point

Incorrect.

Don't think of the term flat file as being all the data in a single flat file.

Use many flat files and the OS as the index.

Break up the files into a directory hierarchy so that no more than a couple of thousand files exist in any directory and the access speeds are phenomenal.
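
A rough sketch of what such a layout could look like for single banned addresses, assuming one empty file per IP nested by octet (base/A/B/C/D), so no directory ever holds more than 256 entries. The layout and the function names are assumptions for illustration, not the exact scheme described here.

```php
<?php
// Sketch: one file per banned IP, nested by octet so directories stay small.
// banPath(), banIp() and isBannedOnDisk() are hypothetical helper names.
function banPath(string $base, string $ip): ?string
{
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return null; // refuse anything that is not a plain IPv4 address
    }
    return $base . '/' . str_replace('.', '/', $ip);
}

function banIp(string $base, string $ip): bool
{
    $path = banPath($base, $ip);
    if ($path === null) {
        return false;
    }
    $dir = dirname($path);
    if (!is_dir($dir) && !mkdir($dir, 0755, true)) {
        return false; // could not create the octet directories
    }
    return touch($path); // the file's existence is the ban record
}

function isBannedOnDisk(string $base, string $ip): bool
{
    $path = banPath($base, $ip);
    return $path !== null && is_file($path);
}

$base = sys_get_temp_dir() . '/ipban-demo';
banIp($base, '64.4.4.17');
var_dump(isBannedOnDisk($base, '64.4.4.17')); // bool(true)
```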

I use this very method to track active IPs running across the site, banned IPs, ranges of IPs, etc.

It's faster than greased lightning and it will never crash, unlike a SQL file under a heavy server load with resources close to being maxed out while under attack, unless the OS itself crashes and nothing works at that point.

I think Windows' sluggish file access may militate against this in larger systems with high throughput.

Windows file access is more than adequate as long as we're not talking the old Windows FAT system and Windows caches the most frequently accessed files so bump up your disk cache size if it's not performing to your liking.

dstiles




msg:3993794
 10:25 pm on Sep 22, 2009 (gmt 0)

That was the one I was obliquely referring to, Bill - I remember you posting it before. My "flat-file" was exactly that - a single flat file - hence (for the record) mention of its limitations and problems.

As I said, I'm not sure your solution would work well under Windows - it has (or had) a poor file access time with a lot of files in each of several directories.

I accept the point about SQL but to its advantage there is the facility to load other information into it, such as expiry time (from days to infinity), reasons and an incremental hits / expiry counter (keep hitting me every day and I'll increase number of days). This latter could be useful where a broadband IP is static and owned by a bad guy, as against dynamic that changes owner within a few days. I accept this info could be included in each of several tabbed files but this would again increase its size. With SQL you only pull out what you want.

Also, updating info (as against adding info) would be more tricky since tabbed lines would need to be modified and the whole file re-written: SQL you just write the update.

plumsauce




msg:3993928
 4:44 am on Sep 23, 2009 (gmt 0)

plumsauce - a flat-file solution can only cope up to a point.

Did I ever say it was a flat file :D

I do not know how this would compare to high-level SQL activity but presumably the SQL server could be set to cache the most recent hits?

It cleans SQL's clock, and that of the fastest ISAMs available. A well designed, purpose-built flat file system will always be faster than either of the above. The referenced implementation is none of the above.

This is not always possible (eg lack of C experience).

Not a problem here.

Windows file access is more than adequate as long as we're not talking the old Windows FAT system and Windows caches the most frequently accessed files so bump up your disk cache size if it's not performing to your liking.

Agreed.

I would point out for the benefit of other readers that the file system directory is in essence a highly optimised hash or b+tree indexing system that has been tuned by the OS developers.

Furthermore, NTFS doesn't mysteriously corrupt itself on a crash unlike linux file systems.

incrediBILL




msg:3993941
 5:41 am on Sep 23, 2009 (gmt 0)

Furthermore, NTFS doesn't mysteriously corrupt itself on a crash unlike linux file systems.

To be fair, both OS's are about equally stable in that regard, let's not quibble about OS's as it's off topic and many of the fallacies spread are based on the code of yesteryear.

Querying an indexed b-tree file specifically designed to locate whether an IP or IP range is blocked or not will win hands down against any SQL server ever built, probably before the SQL request is even parsed!

Since an IP address is 4 octets, a binary index of 256 entries per level, 4 levels deep, is all you need to find any IP address; that means 4 seeks within a file at most and you're done.
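
The four-level idea can be sketched with nested PHP arrays standing in for the index file. This is an in-memory illustration only, not the on-disk format: each level is keyed by one octet, and a level that holds true instead of another array bans everything beneath it, which is one way to cover ranges that fall on octet boundaries. The function names are hypothetical.

```php
<?php
// Sketch of a four-level, 256-way index held in nested arrays.
// addBan() and isBannedInIndex() are hypothetical helper names.
function addBan(array &$index, array $octets): void
{
    $node = &$index;
    $last = count($octets) - 1;
    foreach ($octets as $depth => $octet) {
        if (isset($node[$octet]) && $node[$octet] === true) {
            return; // already covered by a wider ban
        }
        if ($depth === $last) {
            $node[$octet] = true; // ban this address or whole prefix
            return;
        }
        if (!isset($node[$octet])) {
            $node[$octet] = [];
        }
        $node = &$node[$octet];
    }
}

function isBannedInIndex(array $index, string $ip): bool
{
    $node = $index;
    // At most four steps: one array lookup per octet.
    foreach (explode('.', $ip) as $octet) {
        $octet = (int) $octet;
        if (!isset($node[$octet])) {
            return false;
        }
        if ($node[$octet] === true) {
            return true;
        }
        $node = $node[$octet];
    }
    return false;
}

$index = [];
addBan($index, [64, 4, 4]);          // 64.4.4.*
addBan($index, [67, 202, 62, 173]);  // a single address

var_dump(isBannedInIndex($index, '64.4.4.99'));     // bool(true)
var_dump(isBannedInIndex($index, '67.202.62.174')); // bool(false)
```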

I do more individual flat files because I'm tracking more data but what I do is a combo so it's lean and mean.

dstiles




msg:3994487
 10:37 pm on Sep 23, 2009 (gmt 0)

I must find time to look into the directory solution but I'm still not clear how it handles ranges. I'll have to think about that.

I suppose hitting the second level and finding nothing in the third level would imply 123.123.0.0 - 123.123.255.255?

Some of my ranges are more granular than that, though - eg: 123.123.12.32 - 123.123.12.63. Ok, it's only likely to be a couple of hundred entries max.

On top of that, as mentioned, my flat file contains other information such as "release after x days" and "never return a soft option" - ie never show a report form to (eg) AWS but allow reporting for blocked dynamics.

And then there is dynamic update and easy maintenance. :(

As I said, I'd need to put time into working out all of this and more.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved