
Sitemaps, Meta Data, and robots.txt Forum

Protocol extension: resource control
A proposed extension to the robots.txt protocol allowing resource control

10+ Year Member

Msg#: 481 posted 11:29 am on Nov 2, 2004 (gmt 0)

The robots.txt protocol currently seems to give a website owner the choice of accepting either any number of queries from a conformant bot or none at all. Given the growing number of crawlers out there, and the fact that they collectively place an increasing load on the websites they crawl, the risk of resource starvation (as I suspect occurred with my site a couple of days ago due to msnbot's activity) is increasingly likely.

The choice to either allow or deny a robot access seems restrictive. A typical use case is that a site owner may want to be indexed by, say, the MSN search facility, but wants to limit that crawler to no more than 1000 HTTP requests a week, or 10 MB a day, or 500 files a week. Presumably the robots.txt syntax could be extended to allow entries like:

User-agent: msnbot
Disallow: more_than 1000 HTTP per week
Disallow: more_than 10 Mb per day
Disallow: more_than 500 files per week

This would give webmasters finer-grained control over the resource load that particular crawlers place on their websites.
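
To make the idea concrete, here is a minimal sketch of how a conformant crawler might parse and honor such quota lines. The "more_than ... per ..." directive, the regular expression, and names like parse_quotas are purely illustrative assumptions based on the example above; nothing here is part of the actual robots.txt standard.

# Hypothetical sketch: parse proposed "more_than" quota lines from robots.txt
# and check the crawler's own tracked usage against them.
import re
from dataclasses import dataclass

@dataclass
class Quota:
    limit: float   # numeric limit, e.g. 1000
    unit: str      # "http", "mb", or "files"
    period: str    # "day" or "week"

# Matches the proposed syntax, e.g. "more_than 1000 HTTP per week"
QUOTA_RE = re.compile(
    r"more_than\s+(\d+(?:\.\d+)?)\s*(HTTP|Mb|files)\s+per\s+(day|week)",
    re.IGNORECASE,
)

def parse_quotas(robots_txt: str, agent: str) -> list[Quota]:
    """Collect hypothetical quota rules that apply to the given user-agent."""
    quotas, applies = [], False
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            applies = value.lower() in (agent.lower(), "*")
        elif field == "disallow" and applies:
            m = QUOTA_RE.match(value)
            if m:
                quotas.append(Quota(float(m.group(1)),
                                    m.group(2).lower(),
                                    m.group(3).lower()))
    return quotas

def within_quota(quotas: list[Quota], usage: dict[tuple[str, str], float]) -> bool:
    """usage maps (unit, period) -> amount the crawler has consumed so far."""
    return all(usage.get((q.unit, q.period), 0) <= q.limit for q in quotas)

if __name__ == "__main__":
    sample = """
    User-agent: msnbot
    Disallow: more_than 1000 HTTP per week
    Disallow: more_than 10 Mb per day
    Disallow: more_than 500 files per week
    """
    rules = parse_quotas(sample, "msnbot")
    print(rules)
    # 950 requests this week is within quota, but 12 MB today exceeds 10 MB/day
    print(within_quota(rules, {("http", "week"): 950, ("mb", "day"): 12}))

The enforcement side (tracking bytes, requests, and files per period) would of course live in the crawler itself; the robots.txt entries only declare the limits.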

