Everything that comes to my sites is already filtered into 3 buckets (roughly sketched in code after the list):
1. Allowed - whitelisted crawlers given instant access
2. Blocked - automated tools denied based on data center origin and a variety of other criteria
3. Browsers - appear to be valid browsers, at least by today's definition ;)
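
In code, that dispatch is just a three-way classifier. A minimal sketch, with placeholder predicates standing in for the real whitelist and data center checks:

```python
from enum import Enum

class Bucket(Enum):
    ALLOWED = "allowed"   # whitelisted crawlers: instant access
    BLOCKED = "blocked"   # data center origin or other deny criteria
    BROWSER = "browser"   # looks like a valid browser, at least for now

# Placeholder predicates -- stand-ins for the actual whitelist and
# data-center lists, which are maintained separately.
def is_whitelisted_crawler(ua: str) -> bool:
    return "Googlebot" in ua  # illustrative only

def is_datacenter(ip: str) -> bool:
    return ip.startswith("203.0.113.")  # illustrative only (TEST-NET-3)

def classify(ip: str, ua: str) -> Bucket:
    if is_whitelisted_crawler(ua):
        return Bucket.ALLOWED
    if is_datacenter(ip):
        return Bucket.BLOCKED
    return Bucket.BROWSER
```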
Now I'm taking the logged traffic that was assumed to be browsers and doing more in-depth analysis on it.
To do this, I first built a residential ISP rDNS filter: a big list of rDNS strings from all of the major residential ISPs. Any logged browser is then discarded if its rDNS result matches an entry on that list.
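
Here's roughly what that filter looks like in Python. The suffixes are illustrative stand-ins, not my actual list, and note that PTR records are set by whoever controls the reverse zone, so a production version should forward-confirm the result:

```python
import socket

# Illustrative residential ISP rDNS suffixes -- stand-ins for the real list.
RESIDENTIAL_RDNS_SUFFIXES = (
    ".comcast.net",
    ".verizon.net",
    ".rr.com",
)

def rdns(ip: str) -> str:
    """Reverse DNS lookup; empty string when there's no PTR record."""
    try:
        return socket.gethostbyaddr(ip)[0].lower()
    except OSError:
        return ""

def is_residential(ip: str) -> bool:
    # PTR records are controlled by the reverse zone's owner, so a
    # production version should forward-confirm (FCrDNS) the hostname.
    return rdns(ip).endswith(RESIDENTIAL_RDNS_SUFFIXES)

# Discard residential hits; keep the rest for deeper scrutiny.
leftovers = [ip for ip in ["198.51.100.7"] if not is_residential(ip)]
```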
At this point the traffic has basically been filtered down, with data center IPs and well-known ISPs eliminated, to a much smaller and very manageable list of stuff left to evaluate.
The dregs of the web.
So far I'm finding ever smaller, lesser-known ISPs, and of course smaller and more obscure hosts. After adding much of that to the filters, I repeat the pass and see what's left, until hopefully nothing remains because everything has been sorted into the allow and deny lists.
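
One way to drive that iteration is to group whatever isn't matched yet by rDNS domain tail, so the most common unknown networks bubble to the top as the next candidates for the lists. A sketch (the two-label grouping is a simplification; real reverse zones vary):

```python
from collections import Counter

def domain_tail(host: str, labels: int = 2) -> str:
    """Last couple of labels, e.g. 'a.b.example.net' -> 'example.net'."""
    return ".".join(host.split(".")[-labels:]) if host else "(no rDNS)"

def residue_report(resolved, allow_suffixes, deny_suffixes):
    """resolved maps ip -> rDNS hostname; suffix args are tuples of strings.
    Returns unknown rDNS tails, most frequent first."""
    counts = Counter()
    for ip, host in resolved.items():
        if host.endswith(allow_suffixes) or host.endswith(deny_suffixes):
            continue  # already covered by a list
        counts[domain_tail(host)] += 1
    return counts.most_common()

# Example: two unknowns from one small host, one already-listed entry.
print(residue_report(
    {"203.0.113.5": "vps1.cheaphost.example",
     "203.0.113.9": "vps2.cheaphost.example",
     "198.51.100.3": "c-1-2-3-4.hsd1.pa.comcast.net"},
    allow_suffixes=(".comcast.net",),
    deny_suffixes=(".amazonaws.com",),
))
```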
With all that filtering, what you have left is the nastiest stealth stuff, the traffic that tries hardest to hide, now exposed and sticking out like a sore thumb.
With IPv6 this may not be possible due to the sheer volume of addresses to track, but with IPv4 it's looking pretty good so far and I'm quite pleased with the results.
I think all the data miners that don't want to be found will be switching to IPv6 just to be harder to pin down.
Assuming ISPs continue to provide useful rDNS for IPv6, maybe the solution to that problem isn't blocking data centers but allowing only business and residential ISPs: the ultimate whitelist. Assume everything else is a data center and block it, punching holes through that firewall as needed on a case-by-case basis.
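
A sketch of that default-deny logic, with placeholder whitelist suffixes and an example punched-hole range:

```python
import ipaddress

# Placeholder rDNS suffixes for known business/residential ISPs.
ISP_WHITELIST_SUFFIXES = (".comcast.net", ".verizon.net", ".btinternet.com")

# Holes punched through the wall, case by case (example range: TEST-NET-1).
PUNCHED_HOLES = [ipaddress.ip_network("192.0.2.0/28")]

def allowed(ip: str, rdns_host: str) -> bool:
    """Default deny: pass only known ISP rDNS or an explicit exception."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in PUNCHED_HOLES):
        return True  # explicitly punched hole
    return rdns_host.lower().endswith(ISP_WHITELIST_SUFFIXES)
```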
Any thoughts or comments?