I was wondering if anyone does what I call IP bunch analysis. Every day, a script of mine goes through the previous day's Apache log data and counts requests per IP. The rules are:
* a bunch starts when an IP makes a 2nd request in 2 or fewer seconds from the time of a 1st request
* the bunch grows in size with each additional page request that occurs 2 or fewer seconds after the previous page request
After I've got all the data together, I flag an IP as an abuser (bot?) if either of these conditions is met:
* there was any single bunch of size 10 or bigger (that is, 10 or more page requests, each within 2 seconds of the previous one)
* there were 10 or more total bunches each with at least 4 page requests in them
The first case is like 10 page hits within 18 seconds (9 gaps of at most 2 seconds each), or a longer run at the same pace; the second case is lots of little bunches, each with at least 4 page hits within 6 seconds.
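For anyone who wants to compare notes, the rules above could be sketched roughly like this. It is a minimal sketch, not my production script: it assumes the log has already been parsed into [ip, epoch_seconds] pairs for page requests only, and the names find_bunches, is_abuser and the four constants are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical tunables, taken from the rules described above.
my $MAX_GAP      = 2;   # max seconds between requests inside one bunch
my $BIG_BUNCH    = 10;  # a single bunch this large flags the IP
my $SMALL_BUNCH  = 4;   # minimum size for a bunch to count toward the total
my $MANY_BUNCHES = 10;  # this many counted bunches flags the IP

# Takes a list of [ip, epoch_seconds] pairs (assumed to be pre-parsed
# page requests) and returns a hash ref: ip => [bunch sizes].
sub find_bunches {
    my (@hits) = @_;
    my %times;
    push @{ $times{ $_->[0] } }, $_->[1] for @hits;

    my %bunches;
    for my $ip (keys %times) {
        my @t    = sort { $a <=> $b } @{ $times{$ip} };
        my $size = 1;    # requests in the current run
        for my $i (1 .. $#t) {
            if ($t[$i] - $t[ $i - 1 ] <= $MAX_GAP) {
                $size++;                      # run continues
            }
            else {
                # a run of 2 or more requests is a bunch
                push @{ $bunches{$ip} }, $size if $size >= 2;
                $size = 1;
            }
        }
        push @{ $bunches{$ip} }, $size if $size >= 2;   # close the last run
    }
    return \%bunches;
}

# Takes an array ref of bunch sizes for one IP; returns 1 if either
# flagging condition is met, otherwise 0.
sub is_abuser {
    my ($sizes) = @_;
    return 1 if grep { $_ >= $BIG_BUNCH } @$sizes;
    my $counted = grep { $_ >= $SMALL_BUNCH } @$sizes;
    return $counted >= $MANY_BUNCHES ? 1 : 0;
}
```

With this shape, the daily run would call find_bunches once on the previous day's hits and then is_abuser on each IP's list of bunch sizes. Keeping the thresholds as named constants makes it easy to try stricter or looser criteria.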
I used to look up these bunch IPs whenever I flagged them, but after seeing Romania, China, Germany and Ukraine over and over, I now just ban any flagged IP automatically. I'm up to 330 IPs so far. Here is a little sample:
$banned_ip{"78.45.213.180"}=qq(2/18/2012 clicks - Czech Republic);
$banned_ip{"180.153.227.57"}=qq(2/17/2012 clicks - China);
$banned_ip{"180.106.152.163"}=qq(2/16/2012 clicks - China);
$banned_ip{"178.217.184.147"}=qq(2/16/2012 clicks - Poland);
$banned_ip{"31.214.201.251"}=qq(2/16/2012 clicks - Germany);
$banned_ip{"178.176.122.176"}=qq(2/16/2012 clicks - Russia);
$banned_ip{"121.205.215.174"}=qq(2/16/2012 clicks - China);
$banned_ip{"180.153.227.29"}=qq(2/16/2012 clicks - China);
$banned_ip{"188.165.238.19"}=qq(2/15/2012 clicks - France);
$banned_ip{"220.161.150.70"}=qq(2/15/2012 clicks - China);
$banned_ip{"46.21.144.51"}=qq(2/15/2012 clicks - NED);
$banned_ip{"109.230.245.221"}=qq(2/15/2012 clicks - Germany);
I do have some masks, too:
$banned_ip{"174.157.101"}=qq(SBP - 11/2/2009);
$banned_ip{"77.93.39"}=qq(SBP - 11/2/2009);
$banned_ip{"85.175.6"}=qq(SBP - 11/2/2009);
$banned_ip{"190.18.128"}=qq(SBP - 11/2/2009);
$banned_ip{"85.234.151"}=qq(SBP - 11/2/2009);
$banned_ip{"80.249.69"}=qq(SBP - 11/2/2009);
Now, at the top of any dynamic page that I don't want bots to crawl, I call a function that reads the client IP from %ENV, looks it up in my list, and returns a 403 if the IP is found.
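The check itself could look something like the sketch below. Assumptions on my part for the sake of the example: the IP comes from REMOTE_ADDR (the standard CGI variable), a mask means the first three octets of the address, and the names banned_reason and deny_if_banned are hypothetical. The hash here is just a two-entry excerpt of the list above.

```perl
use strict;
use warnings;

# Excerpt of the ban list: one full address and one three-octet mask.
my %banned_ip = (
    "78.45.213.180" => "2/18/2012 clicks - Czech Republic",
    "174.157.101"   => "SBP - 11/2/2009",
);

# Returns the stored note if the full address or its first three octets
# (assumed meaning of a mask) are listed; returns undef otherwise.
sub banned_reason {
    my ($ip) = @_;
    return unless defined $ip;
    return $banned_ip{$ip} if exists $banned_ip{$ip};
    if ($ip =~ /^(\d+\.\d+\.\d+)\.\d+$/) {
        return $banned_ip{$1} if exists $banned_ip{$1};
    }
    return;
}

# Sends a 403 and stops the script if the client address is banned.
# Assumes a plain CGI environment where REMOTE_ADDR holds the client IP.
sub deny_if_banned {
    return unless defined banned_reason( $ENV{REMOTE_ADDR} );
    print "Status: 403 Forbidden\r\n";
    print "Content-Type: text/plain\r\n\r\n";
    print "Forbidden\n";
    exit;
}
```

Calling deny_if_banned() as the first statement of a dynamic page would then give the behavior described above. Both lookups are plain hash hits, so the check stays cheap even as the list grows.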
Does anyone else do something like this? If so, do you use stricter or looser criteria?
I'm interested to know what I may be doing right or wrong. Since I automated my bunch analysis (instead of doing it manually once in a while), I've been throwing out about 5 new IPs every day.