Yep. I use an online tool to find the IP blocks belonging to the ISP (there are bunches of tools out there) and then deny the block in .htaccess.
In the case of something like Amazon, I am temporarily taking out two entire blocks like this:
deny from XXX.XXX.
Which denies a BUNCH of IP addresses, so it's aggressive, possibly overly so, but I wanted to see what kind of impact it would have. It's only been a day; I'll leave it in place for at least a week before I decide whether to keep it for good.
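To get a feel for how much a partial-prefix deny line covers, here is a rough Python sketch. The helper name prefix_to_network and the 198.51. prefix are placeholders made up for illustration (not the real blocks, which are masked above); it assumes the prefix ends on an octet boundary, the way the deny line does.

```python
import ipaddress


def prefix_to_network(prefix: str) -> ipaddress.IPv4Network:
    """Turn a partial dotted prefix such as '198.51.' into the CIDR block it covers.

    Hypothetical helper for illustration only; assumes whole octets.
    """
    octets = [o for o in prefix.strip().split(".") if o]
    if not 1 <= len(octets) <= 4:
        raise ValueError("expected between one and four octets")
    # Pad the missing octets with zeros and use 8 bits of mask per given octet.
    padded = octets + ["0"] * (4 - len(octets))
    return ipaddress.ip_network(".".join(padded) + "/" + str(8 * len(octets)))


block = prefix_to_network("198.51.")  # placeholder prefix, not a real target
print(block, block.num_addresses)
print(ipaddress.ip_address("198.51.100.7") in block)
```

A two-octet prefix works out to a /16, so each such line shuts out 65,536 addresses, which is why it's worth watching the traffic for a while before making it permanent.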
@netmeg -- just wondering if you think that using duration, user agent info, etc. as you mentioned may be a reliable shortcut to finding these bots or do you recommend taking the longer, more methodical route as you described above?
Depends on the tools and time you have to devote to it. And duration can be tricky: some of the bots I see coming back over and over report ridiculous durations, like two hours, so you have to be careful.
The first important metric seems to be Direct traffic. So far, I've never seen an obvious bot that reports as coming from another website or from a search engine. That'll probably change some day, but so far, my bots don't seem to be spoofing that. So that's the first way to narrow it down. Start by looking at Direct traffic.
Then you want to look for anomalies. For example, all these bots report a browser of Firefox 18.0. The current version of Firefox is 27.0.1. They all report 1024x768 resolution - what real screen even uses that resolution nowadays? Not laptops, and not 19" and larger flat screens - but every single one of these log entries says 1024x768. To me, that smells like some kind of black hat tool for scraping/clicking, so that might bear looking into.
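If you can export visits from your analytics or logs, one way to spot that kind of repeated signature is to tally browser and resolution pairs within Direct traffic. This is only a sketch: the dictionary keys (source, browser, resolution) and the sample rows are assumptions for illustration, not fields from any particular tool.

```python
from collections import Counter


def signature_counts(visits):
    """Count (browser, resolution) pairs among direct-traffic visits.

    `visits` is assumed to be a list of dicts with hypothetical keys
    'source', 'browser' and 'resolution'.
    """
    direct = [v for v in visits if v["source"] == "direct"]
    return Counter((v["browser"], v["resolution"]) for v in direct)


# Made-up sample rows, just to show the shape of the data.
sample = [
    {"source": "direct", "browser": "Firefox 18.0", "resolution": "1024x768"},
    {"source": "direct", "browser": "Firefox 18.0", "resolution": "1024x768"},
    {"source": "organic", "browser": "Firefox 27.0.1", "resolution": "1920x1080"},
    {"source": "direct", "browser": "Chrome 33", "resolution": "1366x768"},
]

for signature, count in signature_counts(sample).most_common():
    print(signature, count)
```

A signature that piles up far out of proportion to everything else, especially an outdated browser version, is the one to dig into.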
I have other bots (that look like infected PCs) that all report as IE, but all different versions and resolutions. The anomaly there is that on a site with 3400 pages, these bots only hit four pages, and never hit more than one page per visit. So for these, I was able to set up a segment of direct traffic that visits page 1 OR page 2 OR page 3 OR page 4, has IE as a browser, and where the visit is less than 5 seconds.
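The same segment can be approximated outside the analytics tool if you have raw visit data. A minimal sketch, assuming each visit is a dict with made-up keys (source, browser, pages, duration_seconds) and that the four page paths below stand in for the real ones:

```python
# Placeholder paths standing in for the four pages the bots hit.
TARGET_PAGES = {"/page-1", "/page-2", "/page-3", "/page-4"}


def matches_bot_segment(visit):
    """Return True if a visit fits the segment described above.

    Hypothetical field names; adjust to whatever your export provides.
    """
    return (
        visit["source"] == "direct"
        and visit["browser"].startswith("IE")
        # Single-page check reflects the observed behavior (one page per visit).
        and len(visit["pages"]) == 1
        and visit["pages"][0] in TARGET_PAGES
        and visit["duration_seconds"] < 5
    )


suspect = {
    "source": "direct",
    "browser": "IE 8.0",
    "pages": ["/page-3"],
    "duration_seconds": 2,
}
print(matches_bot_segment(suspect))
```

Every condition has to hold at once, so a real IE visitor who lands on one of those pages and sticks around won't get caught by it.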
So for me, I look for the signature first, and then I look for the behaviors within that signature.