
Search Engine Spider and User Agent Identification Forum

    
IP Bunch Analysis
sftriman




msg:4437831
 3:43 am on Apr 6, 2012 (gmt 0)

I was wondering if anyone does what I call IP bunch analysis. Every day, I have a script that goes through the previous day's Apache log data and counts requests by IP. The rules are:

* a bunch starts when an IP makes a 2nd request in 2 or fewer seconds from the time of a 1st request
* the bunch grows in size with each additional page request that occurs 2 or fewer seconds after the previous page request

After I've got all the data together, I flag an IP as an abuser (bot?) if either of these conditions are met:

* there was any single bunch of size 10 or bigger (that is, 10 or more page requests averaging under 2 seconds per request)
* there were 10 or more total bunches each with at least 4 page requests in them

The first case is roughly 10 page hits in under 20 seconds (or an even longer run at that pace); the second case is lots of little bunches, each with at least 4 page hits in 8 or fewer seconds.
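
The post doesn't include the script itself; here is a minimal Perl sketch of those rules, assuming combined-format log lines on STDIN. The threshold names and the report output are mine, not sftriman's.

use strict;
use warnings;
use Time::Piece;

my $GAP       = 2;    # max seconds between requests to stay in the same bunch
my $BIG_BUNCH = 10;   # a single bunch this size flags the IP
my $MANY      = 10;   # ...or this many bunches of at least $MIN_SIZE
my $MIN_SIZE  = 4;

my (%last, %size, %bunches);   # per IP: last hit time, current bunch size, finished bunch sizes

while (my $line = <>) {
    # combined log format: IP first, then "[dd/Mon/yyyy:HH:MM:SS +zzzz]"
    next unless $line =~ m{^(\S+) .*?\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})};
    my ($ip, $stamp) = ($1, $2);
    my $t = Time::Piece->strptime($stamp, '%d/%b/%Y:%H:%M:%S')->epoch;

    if (exists $last{$ip} && $t - $last{$ip} <= $GAP) {
        $size{$ip}++;                                        # still inside a bunch
    } else {
        push @{ $bunches{$ip} }, $size{$ip} if ($size{$ip} || 0) > 1;
        $size{$ip} = 1;                                      # start over
    }
    $last{$ip} = $t;
}

for my $ip (keys %size) {
    push @{ $bunches{$ip} }, $size{$ip} if $size{$ip} > 1;   # close the last bunch
    my @b     = @{ $bunches{$ip} || [] };
    my $big   = grep { $_ >= $BIG_BUNCH } @b;
    my $small = grep { $_ >= $MIN_SIZE  } @b;
    print "$ip flagged (bunch sizes: @b)\n" if $big || $small >= $MANY;
}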

I used to look up these bunch IPs whenever I'd flag them, but after seeing Romania, China, Germany and the Ukraine over and over, I now just DQ any IP automatically. I'm up to 330 IPs so far. Here is a little sample:

$banned_ip{"78.45.213.180"}=qq(2/18/2012 clicks - Czech Republic);
$banned_ip{"180.153.227.57"}=qq(2/17/2012 clicks - China);
$banned_ip{"180.106.152.163"}=qq(2/16/2012 clicks - China);
$banned_ip{"178.217.184.147"}=qq(2/16/2012 clicks - Poland);
$banned_ip{"31.214.201.251"}=qq(2/16/2012 clicks - Germany);
$banned_ip{"178.176.122.176"}=qq(2/16/2012 clicks - Russia);
$banned_ip{"121.205.215.174"}=qq(2/16/2012 clicks - China);
$banned_ip{"180.153.227.29"}=qq(2/16/2012 clicks - China);
$banned_ip{"188.165.238.19"}=qq(2/15/2012 clicks - France);
$banned_ip{"220.161.150.70"}=qq(2/15/2012 clicks - China);
$banned_ip{"46.21.144.51"}=qq(2/15/2012 clicks - NED);
$banned_ip{"109.230.245.221"}=qq(2/15/2012 clicks - Germany);

I do have some masks, too:

$banned_ip{"174.157.101"}=qq(SBP - 11/2/2009);
$banned_ip{"77.93.39"}=qq(SBP - 11/2/2009);
$banned_ip{"85.175.6"}=qq(SBP - 11/2/2009);
$banned_ip{"190.18.128"}=qq(SBP - 11/2/2009);
$banned_ip{"85.234.151"}=qq(SBP - 11/2/2009);
$banned_ip{"80.249.69"}=qq(SBP - 11/2/2009);

Now, at the top of any dynamic page where I don't want bots crawling, I call a function which reads the client IP from %ENV, looks it up in my list, and returns a 403 if the IP is found.
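
For readers who want to picture that check, a sketch in Perl/CGI terms follows: it covers both the exact-IP entries and the three-octet masks, and sends a Status: 403 header back to Apache. The function name and the response body are placeholders, not sftriman's actual code.

use strict;
use warnings;

our %banned_ip;   # filled elsewhere, e.g. $banned_ip{"78.45.213.180"} = qq(...);

sub deny_if_banned {
    my $ip = $ENV{REMOTE_ADDR} or return;

    # exact match first, then the three-octet "mask" entries
    my ($prefix) = $ip =~ /^(\d+\.\d+\.\d+)\.\d+$/;
    my $hit = $banned_ip{$ip} || ($prefix && $banned_ip{$prefix});
    return unless $hit;

    print "Status: 403 Forbidden\r\n",
          "Content-Type: text/plain\r\n\r\n",
          "Forbidden\n";
    exit;
}

deny_if_banned();
# ...normal page generation continues here...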

Does anyone else do something like this? If so, do you use stricter or looser criteria?

Interested to know what I may be doing right or wrong. Since I automated my bunch analysis (instead of doing it manually once in a while), I'm throwing out 5 new IPs every day.

 

wilderness




msg:4437853
 7:03 am on Apr 6, 2012 (gmt 0)

Interested to know what I may be doing right or wrong.


Denying to the precise Class D range is a bad practice.
The following is a good example:
$banned_ip{"180.153.227.57"}=qq(2/17/2012 clicks - China);
$banned_ip{"180.153.227.29"}=qq(2/16/2012 clicks - China);

If you leave the door open long enough, the harvester will simply begin grabbing pages faster and from IP ranges far beyond a Class D block. Eventually they'll come from so many IPs and so many different UAs that it'll become difficult to analyze.

330 precise IPs is nothing.
Denying to the precise Class D for a long period could result in tens of thousands or hundreds of thousands of entries.

In this example:
$banned_ip{"180.153.227.57"}=qq(2/17/2012 clicks - China);
$banned_ip{"180.153.227.29"}=qq(2/16/2012 clicks - China);

Why not just take out the entire network?
180.152-159.
or even, in this particular instance, the entire Class A of "180"

I'm only interested in North American traffic and have the surrounding IPs close to your "180":
RewriteCond %{REMOTE_ADDR} ^17[789]\. [OR]
RewriteCond %{REMOTE_ADDR} ^18[0-35-9]\. [OR]
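
(Those two conditions on their own don't deny anything until a RewriteRule closes the chain; the rest of the block is elided above. A minimal, hypothetical completion looks like this, with the final RewriteCond dropping the [OR] flag:)

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^17[789]\. [OR]
RewriteCond %{REMOTE_ADDR} ^18[0-35-9]\.
RewriteRule .* - [F]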

Course everybody knows I'm an extremist.

incrediBILL




msg:4437860
 7:42 am on Apr 6, 2012 (gmt 0)

Course everybody knows I'm an extremist.


Not when it comes to blocking China; the Great Firewall of China is far from extreme.

The Chinese have a bunch of startups all focused on crawling and/or scraping; the fancy name is "aggregating". It's an epidemic, and blocking the entire country is the only vaccine. I also add Russia, Japan, Vietnam, Nigeria and a few other countries to the block list on a couple of sites.

RE: IP Bunch Analysis:

I actually run a report similar to your IP Bunch analysis that I call a Proximity Alarm. It shows groups of hits from similar Class C and D ranges that use similar user agents, operate within a similar time frame, or are potential threats because of their behavior.
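
incrediBILL doesn't share the report itself, but a rough Perl sketch of that kind of grouping might bucket hits by /24 prefix and user agent, then flag buckets whose hits fall inside a short window. The bucket key, window and thresholds below are assumptions, not his actual logic.

use strict;
use warnings;

my $WINDOW = 600;   # assumed "similar time frame", in seconds
my %bucket;         # "/24 prefix|user agent" => list of hit times (epoch)

sub note_hit {
    my ($ip, $t, $ua) = @_;
    my ($c_block) = $ip =~ /^(\d+\.\d+\.\d+)\./;
    push @{ $bucket{"$c_block|$ua"} }, $t;
}

sub proximity_report {
    for my $key (sort keys %bucket) {
        my @t = sort { $a <=> $b } @{ $bucket{$key} };
        next unless @t >= 3 && $t[-1] - $t[0] <= $WINDOW;   # clustered in time
        printf "%-50s %3d hits in %4d s\n", $key, scalar @t, $t[-1] - $t[0];
    }
}

# made-up usage:
note_hit('180.153.227.57', 1333678000, 'Mozilla/4.0');
note_hit('180.153.227.29', 1333678005, 'Mozilla/4.0');
note_hit('180.153.227.30', 1333678012, 'Mozilla/4.0');
proximity_report();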

A couple of years back, on one site, I started refusing service if the visitor doesn't accept cookies. Within weeks, all the scrapers hitting my site adapted to use cookies. Now, thanks to the forced cookies, I can see them hopping from IP to IP, because some are either too stupid to flush those cookies when they change IPs, or they're using some IP-pooling software and simply don't know when the IP changes, so the cookie established from a prior IP exposes their whole network of IPs as it keeps switching.

Gotta play hardball with these IP hopping content pirates!

DeeCee




msg:4437864
 7:54 am on Apr 6, 2012 (gmt 0)

Yes, @Wilderness. That is a bit extreme. :)
But if you're only interested in traffic from one country or one small group, it kinda works. Then again, if you're interested only in North American readers, why not merely check the country code through Apache or .htaccess and ban by country? :)

And yes, you are right, it becomes quite a long list when blocking individual IPs. I catch hundreds of IPs a day, tracked across 50+ categories.


@sftriman,

That is too much for a site to be running through as a simple in-script compare list. It will eventually bring the site to its knees as the list grows. Try putting some timing code at various points, logging the execution time of your checks compared to actual page generation. I think you will find that comparing each IP in script code like that can eventually "take over" your system and slow the site down significantly.
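
A simple way to do that measurement in Perl is the core Time::HiRes module; the marker names below are made up.

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $t0 = [gettimeofday];
# ...banned-IP check goes here...
my $t_check = tv_interval($t0);

my $t1 = [gettimeofday];
# ...actual page generation goes here...
my $t_page = tv_interval($t1);

# log how the block-list lookup compares to building the page
warn sprintf "ip-check %.4fs, page %.4fs\n", $t_check, $t_page;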

I track offending IPs both as individual IPs and as policy blocks that take out ranges or CIDRs, across a set of real sites, honeypot sites (dummy sites that attract only offenders), and firewall log analysis. That covers both website offenders and other attackers, such as cPanel, SSH, Vicidial and database hackers. I drag many of them into tar pits, mostly because it is fun to watch the crawlers being slowed down as if hit by a bug zapper.

But I track the catches continuously into classified DNSBLs (currently a dozen or so lists, depending on blocking level) that are normally updated once an hour, and then block "invalid" offenders using those and various other methods.

Right now, for WordPress sites, I kick them out using a spam/security blocking plugin, blocking tracked scrapers and the like, plus catching spam.

I am currently working on an Apache module that will block using various DNSBL methods, but inside Apache, to stomp out bad actors before they waste execution of website code. That shaves off a lot of otherwise wasted server and network bandwidth (plus it protects the content).
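
For anyone unfamiliar with the mechanism: the usual DNSBL convention is to reverse the IP's octets and query an A record under the list's zone; any answer means the IP is listed. A minimal Perl sketch with Net::DNS is below; the zone name is a placeholder, not one of DeeCee's actual lists.

use strict;
use warnings;
use Net::DNS;

sub ip_is_listed {
    my ($ip, $zone) = @_;
    my $name  = join('.', reverse split /\./, $ip) . ".$zone";
    my $res   = Net::DNS::Resolver->new;
    my $reply = $res->query($name, 'A');
    return $reply ? 1 : 0;   # any A record back means "listed"
}

print "blocked\n" if ip_is_listed('203.0.113.99', 'dnsbl.example.com');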

wilderness




msg:4437939
 12:24 pm on Apr 6, 2012 (gmt 0)

But then, if you are interested only in North American readers, then why not merely check the country code through Apache or .htaccess and ban by country?


DeeCee,
I've had active websites for more than a decade, all with these same restrictions.
The global denial was initiated long before any Geo-IP data existed.
I already have the numbers in place and really don't have any need to consult another source; very rarely do I find a new or reassigned IP range outside North America to add.

Banning by individual countries (i.e., Geo IP) requires condensing the combined lines of the non-desired countries into fewer lines.
I don't know if you've ever attempted anything like this manually (I did in 2002, when I separated the Oceanic ranges from the other APNIC ranges), but I can assure you it's just short of a nightmare.

DeeCee




msg:4438035
 4:09 pm on Apr 6, 2012 (gmt 0)

@wilderness,
Sure. It's already long established and it works; no need to tamper with existing perfection. :)

Sort of like when, as a kid, I took things apart to see why they worked, and then afterwards they magically stopped working. :)
Easier to take things apart than to put them back together again.

incrediBILL




msg:4438051
 4:40 pm on Apr 6, 2012 (gmt 0)

I'm up to 330 IPs so far. Here is a little sample:


You may want to consider moving that list of IPs to a database or a flat file and get it out of code.

At a minimum, break the list up by the first octet of the IP address and use a switch() statement if you intend to keep it in code, to minimize the processing time; or drop the data into flat files organized by first octet (ips_001.txt, ips_002.txt, etc.). OR, assuming you retain the list in code, even better and more efficient than a switch() statement: chop the list of IPs up into include files (ips_001.php, ips_002.php, etc.) and just include the one short IP file you need on each request.

Short, fast, minimal processing.
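
Adapting that last variant to the Perl the banned list is already written in might look like the sketch below; the chunk-file naming is illustrative.

use strict;
use warnings;

our %banned_ip;

# load only the chunk whose first octet matches the visitor,
# e.g. ips_078.pl holding just the $banned_ip{"78.x.x.x"} entries
my $ip = $ENV{REMOTE_ADDR} || '';
if (my ($octet) = $ip =~ /^(\d{1,3})\./) {
    my $chunk = sprintf "ips_%03d.pl", $octet;
    do "./$chunk" if -e $chunk;     # each chunk just fills %banned_ip
}
# ...then the usual exact/prefix lookup against %banned_ip...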

Another technique, which works even faster, is to move the IP list out to Apache as a RewriteMap [httpd.apache.org], starting with a plain-text file and then changing it to a DBM hash file for sheer speed once you get it working. Where Apache wins big here is that it also caches the RewriteMap, so it only loads it on the first request, and after that it's faster than heck.
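
Roughly, the RewriteMap setup looks like the following; note that RewriteMap itself has to be defined in the server or virtual-host config (not .htaccess), and the paths here are placeholders.

# text map: one "IP value" pair per line, e.g.  78.45.213.180 deny
RewriteMap banned txt:/path/to/banned_ips.txt

RewriteEngine On
RewriteCond ${banned:%{REMOTE_ADDR}|ok} !=ok
RewriteRule .* - [F]

# once it works, rebuild the map as a DBM hash for speed:
#   httxt2dbm -i banned_ips.txt -o banned_ips.map
# and switch the map definition to:
#   RewriteMap banned dbm:/path/to/banned_ips.map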
