
Search Engine Spider and User Agent Identification Forum

    
Proactive vs. Reactive IP Range Blocking
General Policy or Behavior Based
incrediBILL

WebmasterWorld Administrator, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month
Msg#: 4517026 posted 6:35 am on Nov 8, 2012 (gmt 0)

Just curious: do most bot blockers in the spider forum block all data centers as a matter of policy, or do you wait until something happens before quarantining an IP range?

Furthermore, do some of you literally hunt down data center IP ranges just to make sure you have them all?

I'm kind of 50/50 myself: I try to make sure I have all known ranges for places well known for bad traffic, like AWS, The Planet, etc., but I don't look too hard for companies that run a pretty clean shop, like Peer 1, RackSpace, etc. So I'm proactive where I know there's trouble and more reactive elsewhere, addressing ranges only as they come along.

That doesn't mean I don't still have thousands of IP ranges blocked, but I'm sure many of you have many more!
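For concreteness, quarantining a range usually just comes down to a CIDR lookup before serving the page. A minimal sketch in Python, using placeholder documentation prefixes rather than anyone's real blocklist:

# block_check.py - minimal sketch of a CIDR-based blocklist lookup
# (the ranges below are placeholders, not a recommendation)
import ipaddress

BLOCKED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),   # example "data center" range
    ipaddress.ip_network("198.51.100.0/24"),  # another placeholder range
]

def is_blocked(ip_string):
    """Return True if the visitor IP falls inside any quarantined range."""
    ip = ipaddress.ip_address(ip_string)
    return any(ip in net for net in BLOCKED_RANGES)

if __name__ == "__main__":
    print(is_blocked("203.0.113.42"))  # True
    print(is_blocked("192.0.2.1"))     # False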

 

wilderness

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month
Msg#: 4517026 posted 8:46 am on Nov 8, 2012 (gmt 0)

Just curious: do most bot blockers in the spider forum block all data centers as a matter of policy, or do you wait until something happens before quarantining an IP range?


Bill,
I'm more reactive; however, there are instances where I've used ranges and/or names provided by other forum members.


I'm kind of 50/50 myself: I try to make sure I have all known ranges for places well known for bad traffic, like AWS, The Planet, etc., but I don't look too hard for companies that run a pretty clean shop, like Peer 1, RackSpace, etc. So I'm proactive where I know there's trouble and more reactive elsewhere, addressing ranges only as they come along.


Not to sidetrack from data centers, but from the beginning of my web activities in 1999 my denials (for all visitors) have been primarily reactive, at least until the activity increases to a point of frustration (temporary or otherwise) where purely reactive prevention becomes too overwhelming.

ARIN-Whois used to function quite effectively and easily; today, however, it's quite a PITA (sometimes impossible) to determine sub-net ranges or a related organization's IPs.

Despite the constant work, blacklisting allows me to remain quite flexible with widget visitors, who come in all shapes and sizes.
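One workaround when Whois only hands back a start/end pair instead of a CIDR: let Python's standard library do the subnet math. A small sketch, using a documentation range as a stand-in for a real allocation:

# range_to_cidr.py - turn a Whois "NetRange: start - end" into CIDR blocks
import ipaddress

def range_to_cidrs(start, end):
    """Summarize an inclusive IP range into the fewest CIDR networks."""
    first = ipaddress.ip_address(start)
    last = ipaddress.ip_address(end)
    return [str(net) for net in ipaddress.summarize_address_range(first, last)]

# e.g. a NetRange copied out of a Whois reply (placeholder values)
print(range_to_cidrs("198.51.100.0", "198.51.101.255"))
# ['198.51.100.0/23']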

blend27

WebmasterWorld Senior Member, 5+ Year Member
Msg#: 4517026 posted 2:52 pm on Nov 8, 2012 (gmt 0)

I use proactive methods.

If a scraper comes along from an unknown data center IP range, I usually hunt down and block all the ranges assigned to that company in ARIN/RIPE.

Most of the traffic from other RIRs I block auto-magically anyway on most of the sites, except when HUMAN traffic starts coming in from search engines/sites that I know link to my sites.

RIPE ranges are slightly harder to maintain due to companies constantly coming and going in and out of business.

But in any event, the traffic from all those ranges is constantly monitored for human behavior, and when a range "GOES HUMAN" it is first put on PROBATION (home-cooked captchas/spider traps).

I started building my Software Firewall in early '04; it's quite an Automated Beast at this point, with hooks into several known Abuse APIs and into sites sharing banned IP data via RESTful web services from within.

I refactor the code every 6 months based on the notes I take.
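A rough sketch of what the consumer side of "sharing banned IP data via RESTful web services" can look like; the endpoint URL and JSON layout here are invented for illustration, not any particular service:

# fetch_shared_bans.py - sketch of pulling a shared ban list from a REST feed
# (the endpoint and response format are hypothetical)
import json
import urllib.request

FEED_URL = "https://example.com/api/banned-ips.json"  # placeholder URL

def fetch_shared_bans(url=FEED_URL):
    """Download a JSON array of banned IPs/CIDRs published by a peer site."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return set(json.load(resp))

def merge_bans(local_bans, shared_bans):
    """Union the peer list into the local blocklist."""
    return local_bans | shared_bans

# local_bans = merge_bans(local_bans, fetch_shared_bans())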

MxAngel



 
Msg#: 4517026 posted 9:21 pm on Nov 8, 2012 (gmt 0)

I block traffic from well-known (bad) data centers and spiders / bots.

If I spot bad behavior from an IP belonging to a hosting company / data center, I usually block all IPs belonging to the same host, by IP range, CIDR range or hostname, to make sure I catch as many IPs as possible.

Normal visitors don't originate from a web host, although sometimes I'm surprised how many businesses use their mail server or NS servers to surf the web, and I need to take that into account too. That's why I've got a set of "rules" that apply to certain types of servers.

I've got a total country block on some countries because I'd had enough of dealing with their daily hacking attempts. Blocking by single IPs is useless, as they don't use static IPs.

I added some general detection stuff to simply track IP blocks or traffic coming from certain types of servers (see the note about mail and NS servers); it also detects new bots / attacks, and the script connects to sites sharing banned IP data ... I kinda add as I go.
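One common way to implement the mail/NS-server "rules" is a reverse-DNS lookup plus a hostname pattern match; a rough sketch with example patterns only:

# classify_host.py - sketch of tagging visitors whose rDNS looks like a
# mail or name server (patterns are illustrative only)
import re
import socket

SERVER_PATTERN = re.compile(r"^(mail|smtp|mx|ns)\d*\.", re.IGNORECASE)

def looks_like_server(ip_string):
    """Return the rDNS name if it matches a mail/NS naming pattern, else None."""
    try:
        hostname = socket.gethostbyaddr(ip_string)[0]
    except socket.herror:
        return None  # no reverse DNS at all
    return hostname if SERVER_PATTERN.match(hostname) else None

# e.g. apply a softer rule (log instead of block) when looks_like_server() hits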

dstiles

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member
Msg#: 4517026 posted 9:58 pm on Nov 8, 2012 (gmt 0)

I block a datacentre if it comes to my attention, usually because of bad activity from an IP or because I've followed it up from postings here. Sometimes I find a scanner hitting an FTP or mail server, and I then block those as well.

I do not usually go beyond the IP's /16 in tracking down datacentre offenders. I do extend this on broadband ranges, following them down to /12 or even /11. This means my security logs do not flag an IP as "new"; unless there is a lot of activity on such IPs, I ignore them once identified.

I do not usually extend research into non-contiguous ranges, the occasional exception being China/Korea.
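For scale, a /16 covers 65,536 addresses while a /12 covers 1,048,576; a quick check with Python's ipaddress module, using a documentation address as the example visitor IP:

# prefix_width.py - how much address space a /16 vs /12 block covers
import ipaddress

ip = ipaddress.ip_address("203.0.113.45")  # placeholder visitor IP

slash16 = ipaddress.ip_network(f"{ip}/16", strict=False)
slash12 = ipaddress.ip_network(f"{ip}/12", strict=False)

print(slash16, slash16.num_addresses)  # 203.0.0.0/16 65536
print(slash12, slash12.num_addresses)  # 203.0.0.0/12 1048576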

incrediBILL

WebmasterWorld Administrator, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month
Msg#: 4517026 posted 1:31 am on Nov 9, 2012 (gmt 0)

The problem with only blocking data centers after you see activity is that there may be some very clever stealth activity, particularly from high-profile scrapers and commercial data miners. You don't see anything obvious in your logs, and it's carefully crafted not to set off any alarms, but some of my automated tools have snared many of them for a variety of reasons. By then the cows are out of the barn: your pages are already in the hands of scrapers, and you're left reacting to the mess they make, time-consuming crud like DMCAs and all that nonsense.

The only proactive thing I do to make life simpler is to put tracking bugs in the pages, so one simple search query will spit out all the copied pages that were republished, even if the content is scrambled to avoid Copyscape; that way I can see where a page landed and connect the dots on how it got there.
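A tracking bug in this sense can be as simple as a unique, search-friendly token baked into each page's visible text. A minimal sketch of one way to generate such a token; the secret and format here are placeholders, not a specific implementation:

# tracking_token.py - sketch of a per-page fingerprint for finding scraped copies
# (the secret key and token format are invented for this example)
import hashlib
import hmac

SECRET_KEY = b"change-me"  # placeholder secret

def page_token(page_url):
    """Derive a short, stable, unguessable token to embed in the page text."""
    digest = hmac.new(SECRET_KEY, page_url.encode("utf-8"), hashlib.sha256)
    return "zq" + digest.hexdigest()[:12]  # rare prefix keeps it searchable

# Embed page_token("https://example.com/widgets") somewhere in the page copy,
# then periodically search for the quoted token to find republished pages.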

dstiles

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member
Msg#: 4517026 posted 9:36 pm on Nov 9, 2012 (gmt 0)

My secondary system of blocking UAs gets rid of a lot of junk, including alerting me to new IP ranges I should or should not be blocking. :)

Tracking down datacentres could easily become a full-time job, which of course will be far worse when IPv6 finally gets a grip. By which time, hopefully, I will have left it all behind. :)
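For completeness, a secondary UA filter is usually just a pattern list applied to the User-Agent header; a bare-bones sketch with generic example patterns, not anyone's actual list:

# ua_filter.py - sketch of a secondary User-Agent blocklist
# (patterns are generic examples only)
import re

BAD_UA_PATTERNS = [
    re.compile(r"curl|wget|python-requests", re.IGNORECASE),
    re.compile(r"^$"),  # empty User-Agent
]

def ua_blocked(user_agent):
    """Return True if the User-Agent matches any junk pattern."""
    ua = user_agent or ""
    return any(p.search(ua) for p in BAD_UA_PATTERNS)

print(ua_blocked("Wget/1.21"))        # True
print(ua_blocked("Mozilla/5.0 ..."))  # False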
