incrediBILL - 7:42 pm on Mar 25, 2012 (gmt 0)
I've never understood the point of testing headers myself!
I don't test them, I block on them.
Most poorly written bots don't even put standard header fields in the request, or they send malformed headers, add something dumb, or lack simple things like an Accept-Language header saying what language they accept.
I don't spend a lot of time looking at headers; I have scripts that analyze them for me.
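To give a rough idea of what I mean, here's a bare-bones sketch in Python. It's not my actual code, and the rules are just examples of the kind of checks that catch the lazy bots:

# Bare-bones header sanity check (illustrative example rules only).
# Flags requests missing headers that virtually every real browser sends.

REQUIRED_HEADERS = ("Accept", "Accept-Language", "Accept-Encoding")

def looks_like_bad_bot(headers: dict) -> bool:
    """Return True if the request headers look bot-generated."""
    names = {name.lower() for name in headers}  # case-insensitive compare

    # Real browsers send Accept, Accept-Language and Accept-Encoding;
    # lazy bots routinely omit one or more of them.
    for required in REQUIRED_HEADERS:
        if required.lower() not in names:
            return True

    # An HTTP/1.1 request without a Host header is broken by definition.
    if "host" not in names:
        return True

    # An empty User-Agent is an instant giveaway.
    if not headers.get("User-Agent", "").strip():
        return True

    return False

# Example: a stripped-down request typical of a lazy scraper.
request = {"Host": "example.com", "User-Agent": "Mozilla/5.0"}
print(looks_like_bad_bot(request))  # True -- no Accept-* headers at all

From there it's one line in the front controller to hand back a 403 when it says True.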
Using an after-the-fact "bot catcher" doesn't work in these cases. By the time I and/or my filters/programs have caught an IP and locked it out, the bot has moved on to a new IP, switched to another legit UA, etc. I end up adding all this processing overhead with diminishing results... and still see 30%+ of my bandwidth going out to them.
Get used to it, that's going to be bot blocking in the future.
What we're doing now, in this very forum, is on the verge of becoming obsolete when it comes to actually catching and blocking bad bots. Being the lone-wolf bot hunter isn't going to work much longer. It's going to take a collective network of websites to collect, detect and block these random IPs doing the bidding of the bot herders.
I only know this because I'm working on the problem :)
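In broad strokes, the collective part looks something like this toy sketch. The threshold, the time window and the reporting interface are all made-up placeholders, not a finished design:

# Toy sketch of collective blocking: member sites report suspect IPs to a
# shared pool, and an IP is blocked once enough independent sites flag it.
# REPORT_WINDOW and SITE_THRESHOLD are invented example values.

import time
from collections import defaultdict

REPORT_WINDOW = 3600   # only count reports from the last hour
SITE_THRESHOLD = 3     # block after three different sites report the IP

reports = defaultdict(dict)  # ip -> {site_id: last_report_time}

def report(ip, site_id):
    """A member site reports suspicious activity from an IP."""
    reports[ip][site_id] = time.time()

def should_block(ip):
    """Block only when several independent sites saw the IP recently."""
    cutoff = time.time() - REPORT_WINDOW
    recent_sites = [s for s, t in reports[ip].items() if t >= cutoff]
    return len(recent_sites) >= SITE_THRESHOLD

# One site's stray hit proves nothing; three sites in an hour is a pattern.
report("203.0.113.7", "site-a")
report("203.0.113.7", "site-b")
report("203.0.113.7", "site-c")
print(should_block("203.0.113.7"))  # True

The point is that no single site has enough data to convict an IP, but three or four unrelated sites seeing it in the same hour do.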
I've recently encountered masses of unstoppable IPs hosted on residential computers doing things they shouldn't be doing. Monitoring their activity across multiple websites is the only way to identify which of them are involved in the attacks and which are actual humans making stray hits.
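As a toy illustration of the kind of cross-site signal that separates the two, here's a sketch where every threshold is invented for the example:

# Toy classifier: botnet node vs. human on a residential IP, judged by
# behavior observed across a network of monitored sites.
# All thresholds below are made-up illustrations, not tuned values.

from dataclasses import dataclass

@dataclass
class IPActivity:
    distinct_sites: int    # how many member sites saw this IP
    total_requests: int    # requests across all member sites
    hours_observed: float  # span of the observation window

def classify(a: IPActivity) -> str:
    """Label an IP 'human', 'bot', or 'unknown' from cross-site behavior."""
    rate = a.total_requests / max(a.hours_observed, 0.1)  # requests/hour

    # A human making stray hits touches one or two sites, at human speed.
    if a.distinct_sites <= 2 and rate < 30:
        return "human"

    # No single site sees anything odd, but dozens of sites in a short
    # window is a crawler working through a target list.
    if a.distinct_sites >= 10 or rate > 300:
        return "bot"

    return "unknown"

print(classify(IPActivity(25, 900, 2.0)))  # bot
print(classify(IPActivity(1, 12, 1.0)))    # human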
Scrapers aren't the only problem; there are corporate crawlers out there that also crave access to these networks of IPs so they can data-mine your sites undetected. I won't cite specific examples because there's no way to prove it's them 100% until they do something that verifies they've been in the honeypot, but I'm positive it's happening.
IPv6 will just make the situation worse, much worse.