Forum Moderators: goodroi

protocol extension resource control

proposed extension to robots.txt protocol allowing resource control



11:29 am on Nov 2, 2004 (gmt 0)

The robots.txt protocol currently seems to give a website owner the choice of accepting either any number of queries from a conformant bot or none at all. Given the growing number of crawlers out there, and the fact that they collectively place an increasing demand on the websites they crawl, the risk of resource starvation (as I suspect occurred on my site a couple of days ago due to msnbot's activity) is increasingly likely.

The choice to simply allow or deny a robot access seems restrictive. A likely use case is that a site owner wants to be indexed by, say, the MSN search facility, but wants to limit its crawler to no more than 1000 HTTP requests a week, 10 Mbytes a day, or 500 files a week. Presumably the robots.txt syntax could be extended to allow entries like:

User-agent: msnbot
Disallow: more_than 1000 HTTP per week
Disallow: more_than 10 Mb per day
Disallow: more_than 500 files per week

This would give webmasters finer-grained control over the resource load that particular crawlers place on their websites.
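To make the idea concrete, here is a minimal sketch of how a conformant crawler might parse and honour such quota directives. The `more_than` syntax, the unit names (HTTP, Mb, files), and the periods (day, week) are all assumptions taken from the proposal above, not part of any real robots.txt standard; the in-memory tracker is purely illustrative and ignores period rollover.

```python
import re
from collections import defaultdict

# Hypothetical grammar from the proposal:
#   Disallow: more_than <count> <HTTP|Mb|files> per <day|week>
LIMIT_RE = re.compile(
    r"more_than\s+(\d+)\s+(HTTP|Mb|files)\s+per\s+(day|week)", re.IGNORECASE
)

def parse_limits(robots_txt, agent):
    """Return {(unit, period): count} quota limits for the given user agent."""
    limits = {}
    current_agent = None
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.lower().startswith("user-agent:"):
            current_agent = line.split(":", 1)[1].strip().lower()
        elif line.lower().startswith("disallow:") and current_agent == agent.lower():
            m = LIMIT_RE.search(line)
            if m:
                count, unit, period = m.groups()
                limits[(unit.lower(), period.lower())] = int(count)
    return limits

class QuotaTracker:
    """Tracks a crawler's consumption against parsed limits (sketch only;
    a real implementation would reset counters when each period expires)."""

    def __init__(self, limits):
        self.limits = limits
        self.usage = defaultdict(int)  # (unit, period) -> amount consumed

    def allowed(self, unit, period, amount=1):
        key = (unit, period)
        if key not in self.limits:
            return True  # no quota declared for this resource
        return self.usage[key] + amount <= self.limits[key]

    def record(self, unit, period, amount=1):
        self.usage[(unit, period)] += amount

# The example robots.txt from the post above.
example = """\
User-agent: msnbot
Disallow: more_than 1000 HTTP per week
Disallow: more_than 10 Mb per day
Disallow: more_than 500 files per week
"""
```

A crawler would call `parse_limits(example, "msnbot")`, then check `tracker.allowed("http", "week")` before each request and `tracker.record(...)` after it, backing off once any quota is exhausted.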