motorhaven - 6:12 pm on Mar 25, 2012 (gmt 0)
Here's a fundamental problem at least in my case:
- Most of the "rogue" bots I get these days are using legitimate UAs.
- Most come in, grab a few pages using a single IP, and don't come back under that IP.
Using an after-the-fact "bot catcher" doesn't work in these cases. By the time my filters/programs or I have caught an IP and locked it out, the bot has moved on to a new IP, switched to another legit UA, etc. I end up adding all this processing overhead with diminishing results... and still see 30%+ of my bandwidth going out to them.
Someone in a previous thread talked about an .htaccess method they came up with which looks closely at the request headers and compares what they should contain for the claimed UA versus what they actually contain. The problem with this approach is that, alas, it wasn't described in detail, so it's left up to me to spend an enormous amount of time logging headers from all visitors, comparing headers from real visitors against those from fake visitors, etc.
Even then, it's far from perfect, because of things like corporate and military visitors who have proxies grabbing the page either before or after the user gets it (or fetching the page for them). These filtering proxies don't present "normal" UAs, nor is their header content identical to a real browser's, and if I block them I lose an enormous number of visitors at military/corporate desktops.
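
For illustration only, here is a rough Python sketch of the header-matching idea, written as a log-analysis helper rather than as .htaccess rules. It is not the method from the earlier thread, which was never spelled out. The header list, the "Mozilla/" prefix test, and the function name classify_request are all assumptions made for the example. Real rules would have to come from logging actual visitor headers, and legitimate crawlers whose UA also starts with "Mozilla/" would need separate whitelisting.

```python
# Hypothetical rule set: headers a mainstream browser is ASSUMED to send.
# These are placeholders, not measured values; tune them against real logs.
EXPECTED_BROWSER_HEADERS = ("accept", "accept-language", "accept-encoding")

# Headers that commonly indicate a proxy sits between the user and the site.
PROXY_HEADERS = ("via", "x-forwarded-for")


def classify_request(headers):
    """Classify one request from a dict of its HTTP headers.

    Returns one of:
      "proxy"     - proxy headers present; do not block on header mismatch
      "unchecked" - UA does not claim to be a browser; out of scope here
      "suspect"   - UA claims to be a browser but expected headers are missing
      "ok"        - UA claims to be a browser and expected headers are present
    """
    # Header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}

    # Corporate/military proxies rewrite headers, so flag them separately
    # instead of treating the mismatch as evidence of a bot.
    if any(name in h for name in PROXY_HEADERS):
        return "proxy"

    ua = h.get("user-agent", "")
    if not ua.startswith("Mozilla/"):
        return "unchecked"

    missing = [name for name in EXPECTED_BROWSER_HEADERS if name not in h]
    return "suspect" if missing else "ok"
```

Treating "proxy" as its own result rather than as "suspect" is a deliberate choice here: it keeps the military/corporate visitors out of the block list, at the cost of letting through any bot that adds a Via header.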