I run a site that has hundreds of thousands of pages that are generated dynamically. Each of these pages contains links to anywhere from a few to a few hundred of the other pages. Six months ago I let Google in, after discovering that their algorithms are the only ones that work well on my pages. About every five weeks, Google spends 10 days crawling with about 5 crawlers, 24/7, and then gets tired and comes back next month.
I lifted the robots.txt exclusion on my cgi-bin directory in order to let Google in. Then I added a bunch of other bots to my own exclusion file, so that when they try to come in, they get a "Server too busy" message. I know I could have tried to finesse the robots.txt by adding a "Disallow" record for each of these other bots, but I didn't think this would be reliable, since it only works if the bot chooses to obey it.
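For anyone who wants to try the same thing, here is a minimal sketch of an exclusion check in Python. The post doesn't say how the bots are matched, so this assumes the exclusion file holds one User-Agent substring per line. The function names, the file layout, and the Retry-After value are all made up for illustration.

```python
def parse_exclusions(lines):
    """Return lowercase User-Agent substrings, skipping blanks and # comments.

    Assumed layout: one substring per line (hypothetical file format).
    """
    tokens = []
    for line in lines:
        token = line.strip().lower()
        if token and not token.startswith("#"):
            tokens.append(token)
    return tokens


def is_excluded(user_agent, exclusions):
    """True if any excluded substring occurs in the User-Agent string."""
    ua = user_agent.lower()
    return any(token in ua for token in exclusions)


def busy_response():
    """Build a CGI response that tells the client the server is busy.

    A CGI script sets the HTTP status through a "Status:" header.
    The Retry-After value here is an arbitrary illustrative choice.
    """
    return (
        "Status: 503 Service Unavailable\r\n"
        "Content-Type: text/plain\r\n"
        "Retry-After: 3600\r\n"
        "\r\n"
        "Server too busy\n"
    )
```

In a real CGI script you would pass os.environ.get("HTTP_USER_AGENT", "") to is_excluded() and, on a match, write busy_response() to standard output and exit before doing any expensive page generation.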
Yesterday I discovered an alternative method to dynamically decide what's a bot and what isn't. I've been writing the HTTP_FROM string to my own cgi-bin log files for several years now, so I'm in a position to know who uses it and who doesn't. After studying the last few months of logs, I've concluded that most of the major bots fill in this field, and apart from bots, only the rare misconfigured browser ever does. HTTP_FROM is how the server passes the request's "From" header to a cgi program, and that header was originally intended to carry the email address of whoever is responsible for the request.
In other words, this environment variable is a good way for a cgi program to make a fast, first-level, up-front determination about whether a request is coming from a bot. Of course, it won't help you with those personal spiders, which don't bother to set it, and you need additional levels of monitoring to keep your bandwidth safe from them. But as a first-level filter it seems to be a pretty good trick.
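The first-level check is about as simple as it gets. Here is a sketch in Python, assuming the cgi program can read its environment the usual way; the function name looks_like_bot is made up for illustration. It treats any non-blank HTTP_FROM as a sign of a bot, which is the heuristic described above and nothing more.

```python
import os


def looks_like_bot(environ=None):
    """First-level guess: a request with a non-blank HTTP_FROM is probably a bot.

    This will miss bots that do not send a From header, and it will
    misfire on the rare browser that does send one.
    """
    env = os.environ if environ is None else environ
    return bool(env.get("HTTP_FROM", "").strip())
```

Passing a dict instead of the real environment makes it easy to try out: looks_like_bot({"HTTP_FROM": "crawler@example.com"}) is true, and looks_like_bot({}) is false.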
The best second-level monitoring I've come up with is to look at the end of your log file and group the recent hits by the start of each line, cutting the line off just past the minute digit. That prefix includes the domain and the time down to the minute, but not the seconds, so every hit from the same domain in the same minute produces the same prefix. Count the hits for each prefix, and if the rate exceeds a certain figure per minute, it is almost certainly a bot, because no one can read your pages that fast!
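Here is a rough sketch of that counting step in Python. It assumes each log line starts with the requesting domain, then a date, then a time written as HH:MM:SS, separated by whitespace. Your own log layout will differ, so minute_key() is the part to adapt. The function names, the 2000-line tail, and the limit of 30 hits per minute are illustrative guesses, not recommendations.

```python
from collections import Counter, deque


def tail(path, n=2000):
    """Return the last n lines of the log file at path."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return list(deque(f, maxlen=n))


def minute_key(line):
    """Build the 'domain date HH:MM' prefix for one log line.

    Assumed layout: "<domain> <date> <HH:MM:SS> ...". Returns None
    for lines that do not have at least those three fields.
    """
    parts = line.split(None, 3)
    if len(parts) < 3:
        return None
    domain, date, clock = parts[0], parts[1], parts[2]
    return f"{domain} {date} {clock[:5]}"  # drop the seconds


def hits_per_minute(lines):
    """Count hits per (domain, minute) prefix."""
    return Counter(k for k in map(minute_key, lines) if k is not None)


def suspected_bots(lines, limit=30):
    """Return the prefixes whose hit count exceeds limit in one minute."""
    return {k: n for k, n in hits_per_minute(lines).items() if n > limit}
```

You would call suspected_bots(tail("/path/to/your/cgi.log")) from the cgi program or from a cron job, and feed the domains it reports into whatever blocking you already do. Reading only the tail keeps the check cheap enough to run often.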