incrediBILL - 12:55 am on Mar 4, 2011 (gmt 0)
incrediBILL would most probably have some thoughts to add on this issue.
I shared them in front of a large audience at PubCon back in Nov. 2010. If I can find my memory stick, I might post the slide deck.
The message doesn't change: whitelisting.
Anything else is a waste of time: you end up chasing your tail, monitoring logs and maintaining big, stupid, ugly lists, and I really hate wasting my time. Work smart, not hard, especially when it comes to real time sucks like spider hunting.
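To make the whitelisting idea concrete, here is a minimal sketch of a default-deny user agent check. The post doesn't spell out the actual rules, so the browser pattern, the bot tokens and the name `is_whitelisted` are all assumptions for illustration only.

```python
import re

# Hypothetical allowlist: illustrative patterns, not the list used by the poster.
# Mainstream browsers announce a platform right after "Mozilla/x.y (".
BROWSER_PATTERN = re.compile(
    r"^Mozilla/\d\.\d \((?:Windows|Macintosh|X11|Linux|iPhone|iPad|Android)\b"
)

# Crawlers you have explicitly decided to let in (assumed examples).
ALLOWED_BOT_TOKENS = ("Googlebot", "bingbot", "Slurp")


def is_whitelisted(user_agent: str) -> bool:
    """Return True only for agents on the allowlist; everything else is denied."""
    if not user_agent:
        return False  # a blank user agent is never on the list
    if BROWSER_PATTERN.search(user_agent):
        return True
    return any(token in user_agent for token in ALLOWED_BOT_TOKENS)
```

The point of the design is that nothing gets added when a new bad bot shows up: it was never on the list, so it was never allowed. A user agent string is trivially faked, which is why a check like this only handles the bots that identify themselves.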
Then, to stop cloaked bots, you need scripts for speed traps and volume traps. Also monitor the built-in spider traps: human visitors don't typically open robots.txt, privacy policies or legal info pages, but cloaking spiders nail 'em every time on the first visit.
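The three traps can be sketched roughly like this. It is only an illustration: the thresholds, the trap paths and the `TrapMonitor` class are hypothetical, and a real setup would also reset the volume counter on some schedule.

```python
from collections import defaultdict, deque

# All limits and paths below are assumed values for illustration.
SPEED_LIMIT = 10       # max requests allowed inside SPEED_WINDOW
SPEED_WINDOW = 10.0    # seconds
VOLUME_LIMIT = 500     # max total requests per visitor
TRAP_PATHS = {"/robots.txt", "/privacy", "/legal"}


class TrapMonitor:
    def __init__(self):
        self.recent = defaultdict(deque)  # ip -> timestamps inside the window
        self.total = defaultdict(int)     # ip -> total requests seen

    def check(self, ip, path, now):
        """Record one request; return the name of the trap it trips, or None."""
        hits = self.recent[ip]
        hits.append(now)
        # Forget requests that have fallen out of the speed window.
        while hits and now - hits[0] > SPEED_WINDOW:
            hits.popleft()
        self.total[ip] += 1

        # Spider trap: the very first request goes to a page people rarely open.
        if self.total[ip] == 1 and path in TRAP_PATHS:
            return "spider"
        # Speed trap: too many requests in a short burst.
        if len(hits) > SPEED_LIMIT:
            return "speed"
        # Volume trap: too many requests overall.
        if self.total[ip] > VOLUME_LIMIT:
            return "volume"
        return None
```

Passing `now` in as an argument (for example `time.time()`) keeps the sketch easy to test without waiting on a real clock.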
What do you do when something hits a spider/speed/usage trap?
Make them solve a simple captcha. Ask 10 or more times, and auto-block if you still don't get a response.
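A rough sketch of that challenge-then-block escalation, assuming a limit of exactly 10 unanswered challenges. `ChallengeTracker` and its method names are hypothetical, and serving the captcha itself is left out.

```python
MAX_CHALLENGES = 10  # assumed cutoff; the post only says "10+"


class ChallengeTracker:
    def __init__(self, max_challenges=MAX_CHALLENGES):
        self.max_challenges = max_challenges
        self.unanswered = {}   # visitor -> challenges shown without a solve
        self.blocked = set()   # visitors that have been auto-blocked

    def on_trap_hit(self, visitor):
        """Decide what to serve a visitor that tripped a trap."""
        if visitor in self.blocked:
            return "block"
        count = self.unanswered.get(visitor, 0)
        if count >= self.max_challenges:
            # Every allowed challenge was ignored: stop asking and block.
            self.blocked.add(visitor)
            return "block"
        self.unanswered[visitor] = count + 1
        return "challenge"

    def on_captcha_solved(self, visitor):
        """A solved captcha clears the visitor's unanswered count."""
        self.unanswered.pop(visitor, None)
```

A human who trips a trap by accident solves the captcha once and carries on, while a bot ignores every challenge and ends up blocked without anyone reading a log.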
Moving right along...
The best way to prevent scrapers is to get your content indexed before scrapers get to it.
Completely ineffective. Often the content is scrambled into a keyword gibberish stew, and you'll never know which site grabbed your content for that purpose unless you put beacons in your content, which I do.
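The post doesn't say what those beacons look like. One possible reading, sketched below purely as an assumption, is a unique token per visitor and day that is embedded in each served page, so a token found in scraped text points back to the request that fetched it. `SECRET_KEY`, `make_beacon`, `embed_beacon` and `find_source` are all hypothetical names.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-private-key"  # hypothetical secret


def make_beacon(visitor_ip, date):
    """Derive a short token that is unique to one visitor on one day."""
    message = f"{visitor_ip}|{date}".encode()
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return "ref" + digest[:12]


def embed_beacon(text, beacon):
    """Append the token to the served text (placement is an assumption)."""
    return f"{text} [{beacon}]"


def find_source(scraped_text, served_log):
    """Return the (ip, date) pair whose beacon appears in scraped_text, if any.

    served_log is an iterable of (ip, date) pairs for pages you served.
    """
    for ip, date in served_log:
        if make_beacon(ip, date) in scraped_text:
            return (ip, date)
    return None
```

Because the token is a single word, it has a chance of surviving even when the surrounding sentences are shuffled into gibberish.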
Besides, not all scrapers republish content. Often they are data miners and other resource suckers that don't belong, and they make millions mining your sites.