Forum Moderators: open
Your best bet is to use your .htaccess file to block everything that doesn't have "Mozilla" in the user agent, then expressly allow Google, Yahoo, and MSN by IP address range, and whitelist other bots as needed.
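A minimal mod_rewrite sketch of that idea (the IP prefixes shown are illustrative examples of crawler ranges, not an authoritative list; verify current ranges before relying on them):

```apache
RewriteEngine On
# Deny any request whose User-Agent does not contain "Mozilla"...
RewriteCond %{HTTP_USER_AGENT} !Mozilla [NC]
# ...unless it comes from a whitelisted crawler IP range (examples only)
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteCond %{REMOTE_ADDR} !^207\.46\.
RewriteCond %{REMOTE_ADDR} !^72\.30\.
RewriteRule .* - [F]
```

Note that this is deny-by-default: every non-browser user agent is refused unless its source IP matches one of the whitelist conditions, which is why you then add ranges as you invite bots in.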
Then use AlexK's PHP script to snare scrapers pretending to be an ordinary HTTP client:
[webmasterworld.com...]
I'm blocking scrapers and other web aggregators that aren't providing any value to my web site.
Basically, Google, MSN, Yahoo, Teoma, and Gigablast (still on the fence with them) get in, and EVERYTHING else gets the boot.
To protect my content and intellectual property, I've made my site completely opt-out: no crawler gets in except by whitelist invitation.
I have the site locked down so hard the only way you'll steal 1,000 pages is with a minimum of 250 IPs, and they better not be on the same block.
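The per-IP throttle behind that claim can be sketched roughly as follows. This is a hypothetical illustration, not the poster's actual setup: the quota value and in-memory storage are assumptions, and a real deployment would persist counters and expire them over time.

```python
from collections import defaultdict

# Example quota: pages allowed per IP per window (illustrative value)
PAGE_QUOTA = 4

hits = defaultdict(int)   # pages served per IP
banned = set()            # IPs that blew their quota

def serve(ip: str) -> bool:
    """Return True if the request is served, False if the IP is blocked."""
    if ip in banned:
        return False
    hits[ip] += 1
    if hits[ip] > PAGE_QUOTA:
        banned.add(ip)
        return False
    return True

results = [serve("10.0.0.1") for _ in range(6)]
print(results)  # -> [True, True, True, True, False, False]
```

With a quota like this, pulling 1,000 pages forces a scraper to spread requests across many addresses, which is the effect described above.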
Yeah, I sort of figured it's a no-win situation. The problem is that we transitioned from Hitbox to IndexTools, and traffic reportedly shot up big time, which I don't necessarily buy. So I'm trying to block the bulk of the spider visits that might be artificially inflating page views.
I wish I could block by user agent, but IndexTools only lets you block IP addresses.
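If the analytics package won't filter by user agent, one workaround is to classify hits yourself from the raw server logs before trusting the page-view numbers. A hypothetical sketch (the bot token list is illustrative, not exhaustive):

```python
import re

# Tokens that commonly appear in crawler user agents (examples only)
BOT_TOKENS = ("googlebot", "slurp", "msnbot", "teoma", "gigabot",
              "crawler", "spider")

# In Apache's combined log format, the user agent is the last quoted field
UA_RE = re.compile(r'"([^"]*)"\s*$')

def is_bot_hit(log_line: str) -> bool:
    """Flag a log line as a likely crawler hit based on its user agent."""
    m = UA_RE.search(log_line)
    if not m:
        return True  # no user agent at all: treat as suspect
    ua = m.group(1).lower()
    return any(tok in ua for tok in BOT_TOKENS)

line_bot = ('1.2.3.4 - - [01/Jan/2006:00:00:00 +0000] "GET / HTTP/1.0" '
            '200 512 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"')
line_human = ('5.6.7.8 - - [01/Jan/2006:00:00:00 +0000] "GET / HTTP/1.1" '
              '200 512 "-" "Mozilla/5.0 (Windows NT 5.1)"')
print(is_bot_hit(line_bot), is_bot_hit(line_human))  # -> True False
```

Counting only the non-bot lines gives an independent page-view figure to compare against what the analytics tool reports.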