You can whitelist, i.e. allow only Google, Yahoo & MSN, which keeps out all the unwanted bots that identify themselves honestly. Everything else is harder: there are a ton of bots that spoof browser user agents instead of identifying themselves, and it gets really complicated at that point.
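For what it's worth, here is a rough sketch of the whitelist idea in Python. The token lists and the `is_allowed` name are my own assumptions for illustration, not a vetted list, so adjust them to whatever crawlers you actually want to let in.

```python
# Hypothetical allowlist of substrings for the crawlers you want (assumed values).
ALLOWED_BOT_TOKENS = ("googlebot", "slurp", "msnbot")

# Substrings suggesting a client openly identifies itself as a bot (assumed values).
BOT_HINTS = ("bot", "crawler", "spider", "slurp")


def is_allowed(user_agent: str) -> bool:
    """Return True if the request should be let through on user agent alone."""
    ua = user_agent.lower().strip()
    if not ua:
        return False  # no user agent at all: treat as unwanted
    declares_bot = any(hint in ua for hint in BOT_HINTS)
    if not declares_bot:
        # Looks like a browser. Spoofing bots also land here,
        # which is exactly the limitation described above.
        return True
    return any(token in ua for token in ALLOWED_BOT_TOKENS)
```

Note that this only filters bots that announce themselves. Anything sending a browser-style user agent passes straight through.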
You can block known web hosting data centers and lists of proxy IPs, which stops a lot of noise, but that won't stop bots operating from residential IPs.
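A minimal sketch of range blocking, using Python's standard `ipaddress` module. The CIDR ranges below are placeholders from the reserved documentation blocks, not real data center ranges, and `is_blocked` is just a name I picked. You would load your own list of hosting and proxy ranges.

```python
import ipaddress

# Placeholder ranges (reserved documentation blocks), NOT real data centers.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)
```

A linear scan like this is fine for a short list. With thousands of ranges you would probably want a sorted or tree-based lookup instead.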
FWIW, you can do a reasonable job but you can't stop everything.
As Bill said, blocking user agents and known hosting data centers can only get you so far. What we all need is a commercial package that intelligently analyzes traffic logs and headers and presents a captcha challenge to verify human activity. I'm still waiting for this package to come along!
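To make the idea concrete, here is a toy sketch of the kind of scoring such a package might do before showing a captcha. Everything in it is an assumption on my part: the function name, the signals, the weights, and the 60 requests per minute cutoff are made up for illustration, and a real product would need far more than this.

```python
def needs_challenge(headers: dict, requests_last_minute: int, threshold: int = 2) -> bool:
    """Decide whether to show a captcha, using a crude suspicion score."""
    names = {name.lower() for name in headers}
    score = 0
    if "accept-language" not in names:
        score += 1  # assumed signal: browsers normally send this header
    if "accept" not in names:
        score += 1  # assumed signal: browsers normally send this header
    if requests_last_minute > 60:
        score += 2  # assumed cutoff: faster than a person is likely to browse
    return score >= threshold
```

For example, `needs_challenge({"User-Agent": "x"}, 5)` flags the request because both headers are missing, while a request with `Accept` and `Accept-Language` at the same rate is let through.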