I think Dijkgraaf has brought up some interesting considerations that deserve their own thread: a new robots standard proposal deserves attention. To recap, a well-behaved bot:
Reads robots.txt
Obeys Disallow directives
Requests robots.txt again if more than 24 hours have passed since the last request
Doesn't request robots.txt more than once an hour
Doesn't request files more than once every 5 seconds
Doesn't repeat requests within 24 hours
Obeys the no-cache tag
Obeys the no-follow tag
Obeys the no-index tag
Obeys the none tag
Request-header User-Agent contains a URL to a page about the bot
The about-bot page explains how to disallow the bot in robots.txt
Request-header User-Agent contains the bot's name
The bot name in the User-Agent header matches the one it looks for in robots.txt / meta tags
Request-header User-Agent doesn't change often
Recognises commented-out links
Doesn't request URLs with #
Stops revisiting not-found (404) pages after a time
Doesn't frequently re-request (404) pages
Stops revisiting moved-permanently (301) pages
Stops revisiting gone (410) pages
And also some features that I'd like to see in all bots (some bots support some of these, but most don't); a minimal sketch of the header-related items follows the list:
Request-header contains If-Modified-Since (including when fetching robots.txt)
Obeys Disallow file-extension wildcard directives
Allows the site owner to control the crawl delay
Allows the site owner to control the frequency of revisits
Obeys the no-email-collection tag
Request-header From contains an e-mail address
Request-header Referer contains where it found the link to the requested item
Request-header User-Agent contains a contact e-mail address
The about-bot page explains the purpose of the bot
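Here's a minimal sketch (Python, standard library only) of what those request headers could look like on the wire; the bot name, URLs, addresses and timestamp are all invented for illustration:

import urllib.error
import urllib.request

# Hypothetical bot identity; every name and address here is made up
HEADERS = {
    "User-Agent": "ExampleBot/1.0 (+http://example.com/about-bot.html)",
    "From": "bot-contact@example.com",                      # contact e-mail
    "If-Modified-Since": "Sat, 01 Jan 2005 00:00:00 GMT",   # time of last fetch
}

req = urllib.request.Request("http://example.com/page.html", headers=HEADERS)
try:
    with urllib.request.urlopen(req) as resp:
        body = resp.read()          # page changed since the last visit
except urllib.error.HTTPError as e:
    if e.code == 304:
        body = None                 # not modified; reuse the cached copy
    else:
        raise

The 304 branch is where If-Modified-Since pays off: the bot skips the whole body when nothing has changed, saving bandwidth on both sides.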
Now I'll go back to beating a dead horse, so to speak. I think wilderness and Lord Majestic's argument about bots obeying sites' terms of service can be summed up like this:
wilderness: Bots should obey a site's terms of service.
Lord Majestic: While desirable, it isn't technically feasible.
I agree with both of you. That's why I'd like to propose a new standard that would encompass Dijkgraaf's list and add a bot-understandable framework for obeying terms of service that relate to search engine use of a website.
Perhaps robots.txt could be extended to tell bots whether or not files may be cached and for how long, how often to respider, how many requests per minute a bot may make, and so on.
That's about all I'd want from the standard, but I'm sure wilderness could think of a lot more.
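Purely as a sketch, the kind of extension I mean might look like this; every directive name below is invented and not part of any standard:

; hypothetical robots.txt extensions - directive names invented
User-agent: *
Cache-Allowed: yes      ; may files be cached at all?
Cache-Expires: 30       ; days a cached copy may be kept
Revisit-After: 7        ; days before respidering a page
Request-Rate: 12/60     ; at most 12 requests per 60 seconds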
Now to comment on some of Dijkgraaf's suggestions:
Doesn't request files more than once every 5 seconds
The Crawl-delay parameter seems to be more and more widely supported, and it gives the opportunity to specify a custom delay. Perhaps it would make sense to have multiple Crawl-Delays, because some static files are fast to retrieve but some dynamic pages (search) can be slow.
From a bot writer's point of view a default delay of 5 seconds is too high (for a site with lots of pages); I'd say 1 second should be the minimum, with the default somewhere between 1 and 2 :)
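As an aside, honouring Crawl-delay is already easy in practice; here is a minimal sketch using Python's standard robots.txt parser (the site URL, user-agent name and 1-second fallback are assumptions for illustration):

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

UA = "ExampleBot"
# crawl_delay() returns None when no Crawl-delay is given; fall back to 1s
delay = rp.crawl_delay(UA) or 1.0

for url in ("http://example.com/a.html", "http://example.com/b.html"):
    if rp.can_fetch(UA, url):
        # ... fetch the page here ...
        time.sleep(delay)   # wait between requests, as the site asked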
Request-header Referer contains where it found the link to the requested item
Pretty much impossible, because the same link could be present in thousands of pages. However, I am inclined to change my bot to include a Referer in cases where the bot gets redirected from another domain -- this seems valuable for troubleshooting and also gives the webmaster an idea of where the request came from.
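A sketch of how that redirect case could be handled; the hop-following logic and all names here are illustrative, not from any real bot:

import urllib.error
import urllib.request
from urllib.parse import urljoin, urlparse

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # returning None makes 3xx responses surface as HTTPError below
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect())

def fetch(url, max_hops=5):
    referer = None
    for _ in range(max_hops):
        headers = {"User-Agent": "ExampleBot/1.0"}
        if referer:
            headers["Referer"] = referer   # set only on cross-domain hops
        req = urllib.request.Request(url, headers=headers)
        try:
            return opener.open(req)
        except urllib.error.HTTPError as e:
            loc = e.headers.get("Location")
            if e.code not in (301, 302, 303, 307, 308) or not loc:
                raise
            new_url = urljoin(url, loc)
            # include a Referer only when the redirect crosses domains
            if urlparse(new_url).netloc != urlparse(url).netloc:
                referer = url
            else:
                referer = None
            url = new_url
    raise RuntimeError("too many redirects")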
The ideal situation would be merging Google's Sitemaps idea into robots.txt - it was a big mistake by Google to fail to resist the temptation of having people submit Sitemaps directly to Google rather than making it an open standard for all search engines.
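Just to illustrate, the merge would not need to be complicated; something as small as a sitemap pointer inside robots.txt would do (the URL is an example):

Sitemap: http://example.com/sitemap.xml
User-agent: *
Disallow: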
Obeys the no-cache tag
Obeys the no-follow tag
Obeys the no-index tag
Obeys the none tag
These are really features of the search engine rather than of the bot - also, the "nofollow" flag does not mean the link won't be followed; it's a very misleading name for a zero-PageRank feature.
Request-header From contains an e-mail address
Request-header Referer contains where it found the link to the requested item
Request-header User-Agent contains a contact e-mail address
Recognises commented-out links
All of this is redundant so long as the User-Agent contains a link to a page that in turn contains all the contact information. Stuffing all of this into headers is a waste of bandwidth on all sides.
Now I am off to register the newrobotstxt.org domain ;)
Regarding Crawl-delay, I think this is fine as is, since it tells spiders the time delay between fetching any resources.
I do see the need to tell spiders how often a resource changes and when to re-request it. This is actually covered by a tag in Google's Sitemap format called changefreq, and Google has the following comments about it:
How frequently the page is likely to change. This value provides general information to search engines and may not correlate exactly to how often they crawl the page. Valid values are: always, hourly, daily, weekly, monthly, yearly and never.
Possibly, rather than doing this on an individual-document basis as per the Google Sitemap, you would want to do this at the higher level of a directory or site in robots.txt (or robots.xml); a hypothetical sketch follows.
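Something like this, say; the Change-Freq directive is invented here purely for illustration:

; hypothetical - Change-Freq is not a real robots.txt directive
User-agent: *
Change-Freq: /news/ daily
Change-Freq: /archive/ yearly
Change-Freq: / weekly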
technically not feasible
robots.txt is binary at its lowest form.
on/off or yes/no.
While robots.txt may be effective for determining spidering agendas, the technical capabilities of how the accumulated data is analyzed and utilized have superseded robots.txt by light years.
Getting emotional over inanimate objects ;)
We shouldn't forget that these pieces of software are created, programmed and motivated by human beings that have some simple human traits, such as conscience and reasoning.
Perhaps you're correct that the technology doesn't exist?
Perhaps you're correct that I expect too much capability from SEs spidering my/our data?
Daily, I deal with people who are intimidated by the word "computer" and use lesser machines to access the internet (of course these folks are a very small share of the market). These are folks from a different time (nearly another century), and yet they are capable of balancing two worlds and two technologies, all without being aware of robots.txt, scrapers, harvesters, or even packets.
Why should the constraints (whether self-imposed or not) on these folks be any less demanding than those on the technologies they are expected to grasp?
A 5-year-old kid has better image and voice recognition than the best available supercomputer.
I have been programming for 15 years, and while I set myself very high goals, I am not even going to try to code what you expect from bots - you are about the only one with such expectations, and even if you had the best site in the world it would be cheaper and more sensible just not to crawl your site at all.
What you are proposing is wishful thinking that is really laughable to anybody who has tried to program anything serious. Perhaps you have a unique vision of the problem, but if that's the case then you will have to prove that you can actually implement yourself what you expect of others. Can you write code that would understand the T&C pages of various websites? Can you do it?
I really don't want to get into this any further. I can't implement what you want, and to my knowledge even the best-in-class programmers (i.e. Google) aren't even trying to do it, which pretty much means that unless you prove to the whole world that you can do it, nobody will do it anytime soon, because there are far more important problems to solve, like better ranking or anti-spam algorithms. There is robots.txt out there, and if you don't like a particular bot, or all bots, then just disallow access to your site and that will be the end of it.
This is my last post on the topic of the laughable expectation that bots should understand T&Cs that many non-lawyers would struggle to decipher.
I'd rather this discussion be about what features we would like to see that can actually be achieved.
In fact, most of the points I raised aren't even changes to robots.txt, but rather standards of behaviour for crawlers/indexers.
Two features that I would like to become part of the robots.txt standard are already supported by some bots, namely Crawl-delay: and wildcards in Disallows; but what else can we add?
How about a Revisit: tag that gives a minimum time before revisiting a page/URL? This would be a single line per User-Agent giving a site-wide indicator of how often changes occur on a site; if you are worried about bandwidth usage, you raise this figure to tell the spider to visit less often. A hypothetical example follows.
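For example (the Revisit directive is hypothetical, and I've assumed its value is in days):

User-agent: *
Crawl-delay: 2
Revisit: 7    ; minimum days before re-requesting any page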
How about a Revisit: tag that gives a minimum time before revisiting a page/URL?
Currently robots.txt is split by user-agent; it would be necessary to be able to define groups of URLs (with * wildcard support in URLs) and then have per-group settings like Crawl-Delay or Revisit or whatever. This would give the flexibility to assign different parameters to different groups of URLs.
[webmasterworld.com...]
This would be a single line per User-Agent
But that's exactly my point - you are in effect limiting yourself to one setting per user-agent. Right now Crawl-Delay applies to all URLs, even though some URLs involve heavier processing than others: search URLs take more resources on the server, so it makes sense to have a higher crawl-delay for those but a lower crawl-delay for static pages that are not a big burden on the server. Same with Revisit - some pages rarely change, so you may want to set a revisit value different from, say, the home page's.
This means that more than a single parameter per user-agent (as it is now) will be needed, in effect requiring a switch to URL groups rather than user-agent groups. For example:
; define url groups
Url-Group: DisallowAll
Url: /
Url-Group: StaticPages
Url: /*.html
Url: /*.htm
Url: /*.txt
Crawl-Delay: 1
Revisit: 30
Url-Group: DynamicPages
Url: /*.php
Url: /*.jsp
Crawl-Delay: 10
Revisit: 5
Url-Group: TrapForBadBots
Url: /abusethis/
Disallow: yes
Url-Group: HomePage
Url: /index.php
Revisit: 1
User-Agent: *
; specify url groups for the advanced agents
; presence of url groups automatically overrides
; ANY other settings below, except for urls that
; match none of the groups
Url-Group: DynamicPages
Url-Group: StaticPages
Url-Group: HomePage
Url-Group: TrapForBadBots
; for backwards compatibility
Crawl-Delay: 5
Disallow: /abusethis/ ; or whatever should be disallowed
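To show the proposal is parseable, here is a minimal sketch of a parser for this hypothetical Url-Group syntax in Python; the function names are made up, and glob matching is used as a rough stand-in for robots.txt-style prefix/wildcard matching:

from fnmatch import fnmatch

def parse_extended_robots(text):
    # groups: name -> {"urls": [...], "settings": {...}}
    # agents: user-agent -> {"groups": [...], "settings": {...}}
    groups, agents = {}, {}
    group = agent = None
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()       # drop ;-comments
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agent = agents.setdefault(value, {"groups": [], "settings": {}})
            group = None
        elif field == "url-group" and agent is None:
            group = groups.setdefault(value, {"urls": [], "settings": {}})
        elif field == "url-group":
            agent["groups"].append(value)          # reference, not definition
        elif field == "url" and group is not None:
            group["urls"].append(value)
        elif group is not None and agent is None:
            group["settings"][field] = value       # e.g. crawl-delay, revisit
        elif agent is not None:
            agent["settings"][field] = value       # backwards-compat lines
    return groups, agents

def settings_for(path, agent, groups):
    # first group whose pattern matches wins; else the agent's own settings
    for name in agent["groups"]:
        g = groups.get(name, {"urls": [], "settings": {}})
        if any(fnmatch(path, pat) for pat in g["urls"]):
            return g["settings"]
    return agent["settings"]

Feeding the example above through parse_extended_robots and asking settings_for("/search.php", agents["*"], groups) would return the DynamicPages settings (Crawl-Delay 10, Revisit 5), while a plain .html page would fall into StaticPages - exactly the per-group behaviour being proposed.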