rowan194 - 12:47 pm on Aug 15, 2012 (gmt 0)
I'm working on something like this right now, allowing you to associate user agents with IP ranges. I've been collecting data since 2009. I have one site which gets hit pretty hard with scrapers (as well as legitimate SE bots) so it's almost like a honeypot for finding new and obscure user-agents.
I'm hoping to work the data in such a way that it will be possible to determine, based on past activity observed on various sites, whether a page load from a particular IP range is likely to be a bot. As the OP hints at, you wouldn't expect many interactive browser sessions to come from AWS or ThePlanet IPs, unless the servers were proxying for humans... That's the case I'm going to try to detect automatically.
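To make the idea concrete, here is a rough sketch of what such a lookup could look like. It is not the actual code or data behind the project: the class name `RangeTracker`, its methods, and the smoothing formula are all assumptions made up for illustration, and the addresses in the usage note are reserved documentation ranges rather than real hosting ranges. It maps each hit to the most specific known IP range, remembers which user agents were seen there, and estimates how bot-heavy the range has been so far.

```python
import ipaddress
from collections import defaultdict


class RangeTracker:
    """Hypothetical helper: ties user agents and bot/human tallies to IP ranges."""

    def __init__(self):
        self.labels = {}                           # ip_network -> free-form label
        self.counts = defaultdict(lambda: [0, 0])  # ip_network -> [bot hits, human hits]
        self.agents = defaultdict(set)             # ip_network -> user agents seen

    def add_range(self, cidr, label):
        """Register a range in CIDR notation, e.g. '192.0.2.0/24'."""
        self.labels[ipaddress.ip_network(cidr)] = label

    def find_range(self, ip):
        """Return the most specific registered range containing ip, or None."""
        addr = ipaddress.ip_address(ip)
        # Linear scan: fine for a sketch, too slow for a large range list.
        matches = [net for net in self.labels if addr in net]
        return max(matches, key=lambda net: net.prefixlen) if matches else None

    def record(self, ip, user_agent, was_bot):
        """Log one past hit; returns False if the IP is in no known range."""
        net = self.find_range(ip)
        if net is None:
            return False
        self.counts[net][0 if was_bot else 1] += 1
        self.agents[net].add(user_agent)
        return True

    def bot_likelihood(self, ip):
        """Smoothed share of bot hits for the IP's range, or None if unknown.

        Add-one smoothing is an assumption of this sketch; it keeps a range
        with very few hits from being scored as exactly 0 or 1.
        """
        net = self.find_range(ip)
        if net is None:
            return None
        bots, humans = self.counts[net]
        return (bots + 1) / (bots + humans + 2)

    def agents_for(self, ip):
        """User agents previously seen from the IP's range."""
        net = self.find_range(ip)
        return set(self.agents[net]) if net is not None else set()
```

Usage would be along the lines of `tracker.add_range("192.0.2.0/24", "some-hosting-provider")`, then `tracker.record(ip, user_agent, was_bot)` for each historical hit, then `tracker.bot_likelihood(ip)` for a new visitor. The hard part, deciding `was_bot` for the historical hits and spotting servers that proxy for humans, is deliberately left out of the sketch.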