The robots.txt protocol currently seems to give a website owner a binary choice: accept any number of queries from a conformant bot, or none at all. Given the growing number of crawlers out there, and the fact that these collectively place an increasing demand upon the websites they crawl, the risk of resource starvation (as I suspect occurred with my site a couple of days ago due to msnbot's activities) is increasingly likely.
The all-or-nothing choice of allowing or denying a robot access seems restrictive. A site owner might, for example, want to be indexed by the MSN search facility, but limit its crawler to no more than 1000 HTTP requests a week, 10 Mbytes a day, or 500 files a week. Presumably the robots.txt syntax could be extended to allow entries like:
User-agent: msnbot
Disallow: more_than 1000 HTTP per week
Disallow: more_than 10 Mb per day
Disallow: more_than 500 files per week
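To illustrate, here is a minimal sketch of how a crawler might parse such quota directives. The `more_than … per …` syntax is my own proposal, not part of any standard, and the `Quota` structure and `parse_quotas` function are hypothetical names chosen for this example:

```python
import re
from dataclasses import dataclass

# Hypothetical quota directive from the proposed syntax:
#   Disallow: more_than <limit> <unit> per <period>
DIRECTIVE = re.compile(
    r"Disallow:\s*more_than\s+(\d+)\s+(\S+)\s+per\s+(day|week)",
    re.IGNORECASE,
)

@dataclass
class Quota:
    limit: int
    unit: str      # e.g. "HTTP", "Mb", "files"
    period: str    # "day" or "week"

def parse_quotas(robots_txt: str, agent: str) -> list[Quota]:
    """Collect quota directives from the record matching `agent`."""
    quotas, in_record = [], False
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.lower().startswith("user-agent:"):
            # A new record starts; note whether it applies to our agent.
            in_record = line.split(":", 1)[1].strip().lower() == agent.lower()
        elif in_record:
            m = DIRECTIVE.match(line)
            if m:
                quotas.append(Quota(int(m.group(1)), m.group(2), m.group(3).lower()))
    return quotas

example = """\
User-agent: msnbot
Disallow: more_than 1000 HTTP per week
Disallow: more_than 10 Mb per day
Disallow: more_than 500 files per week
"""

quotas = parse_quotas(example, "msnbot")
```

A conformant crawler would then track its own request, byte, and file counts per site over each period and stop fetching once a quota is reached.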
This would give webmasters finer-grained control over the resource load that particular crawlers place upon their websites.