I think Dijkgraaf has brought up some interesting considerations that deserve their own thread: a new robots standard proposal deserves attention. To recap, a well-behaved bot:
Reads robots.txt
Obeys Disallow directives
Requests robots.txt again if more than 24 hours have passed since the last request
Doesn't request robots.txt more than once an hour
Doesn't request files more than once every 5 seconds
Doesn't repeat requests within 24 hours
Obeys the no-cache tag
Obeys the no-follow tag
Obeys the no-index tag
Obeys the none tag
Request-header User-Agent contains a URL to a page about the bot
The about-bot page explains how to disallow the bot in robots.txt
Request-header User-Agent contains the bot's name
The bot name in the User-Agent header matches the one it looks for in robots.txt / meta tags
Request-header User-Agent doesn't change often
Recognises commented-out links
Doesn't request URLs with #
Stops revisiting not-found (404) pages after a time
Doesn't frequently re-request (404) pages
Stops revisiting moved-permanently (301) pages
Stops revisiting gone (410) pages
And also some features that I'd like to see in all bots (some bots support some of these, but most don't); a minimal sketch of the header-related items follows the list:
Request-header contains If-Modified-Since (including when fetching robots.txt)
Obeys Disallow file-extension wildcard directives
Allows the site owner to control the crawl delay
Allows the site owner to control the frequency of revisits
Obeys the no-email-collection tag
Request-header From contains an e-mail address
Request-header Referer contains where it found the link to the requested item
Request-header User-Agent contains a contact e-mail address
The about-bot page explains the purpose of the bot
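Here's a minimal sketch (Python, standard library only) of what those request headers could look like on the wire; the bot name, URLs, addresses and timestamp are all invented for illustration:

import urllib.error
import urllib.request

# Hypothetical bot identity; every name and address here is made up
HEADERS = {
    "User-Agent": "ExampleBot/1.0 (+http://example.com/about-bot.html)",
    "From": "bot-contact@example.com",                      # contact e-mail
    "If-Modified-Since": "Sat, 01 Jan 2005 00:00:00 GMT",   # time of last fetch
}

req = urllib.request.Request("http://example.com/page.html", headers=HEADERS)
try:
    with urllib.request.urlopen(req) as resp:
        body = resp.read()          # page changed since the last visit
except urllib.error.HTTPError as e:
    if e.code == 304:
        body = None                 # not modified; reuse the cached copy
    else:
        raise

The 304 branch is where If-Modified-Since pays off: the bot skips the whole body when nothing has changed, saving bandwidth on both sides.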
Now I'll go back to beating a dead horse, so to speak. I think wilderness and Lord Majestic's argument about bots obeying sites' terms of service can be summed up like this:
wilderness: Bots should obey a site's terms of service.
Lord Majestic: While desirable, it isn't technically feasible.
I agree with both of you. That's why I'd like to propose a new standard that would encompass Dijkgraaf's list and add a bot-understandable framework for obeying terms of service that relate to search engine use of a website.
Perhaps robots.txt could be extended to tell bots whether or not files may be cached and for how long, how often to respider, how many requests per minute a bot may make, and so on.
That's about all I'd want from the standard, but I'm sure wilderness could think of a lot more.
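Purely as a sketch, the kind of extension I mean might look like this; every directive name below is invented and not part of any standard:

; hypothetical robots.txt extensions - directive names invented
User-agent: *
Cache-Allowed: yes      ; may files be cached at all?
Cache-Expires: 30       ; days a cached copy may be kept
Revisit-After: 7        ; days before respidering a page
Request-Rate: 12/60     ; at most 12 requests per 60 seconds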
Now to comment on some of Dijkgraaf's suggestions:
Doesn't request files more than once every 5 seconds
The Crawl-delay parameter seems to be more and more widely supported, and it gives the opportunity to specify a custom delay. Perhaps it would make sense to have multiple Crawl-Delays, because some static files are fast to retrieve but some dynamic pages (search) can be slow.
From a bot writer's point of view a default delay of 5 seconds is too high (for a site with lots of pages); I'd say 1 second should be the minimum, with the default somewhere between 1 and 2 :)
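As an aside, honouring Crawl-delay is already easy in practice; here is a minimal sketch using Python's standard robots.txt parser (the site URL, user-agent name and 1-second fallback are assumptions for illustration):

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

UA = "ExampleBot"
# crawl_delay() returns None when no Crawl-delay is given; fall back to 1s
delay = rp.crawl_delay(UA) or 1.0

for url in ("http://example.com/a.html", "http://example.com/b.html"):
    if rp.can_fetch(UA, url):
        # ... fetch the page here ...
        time.sleep(delay)   # wait between requests, as the site asked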
Request-header Referer contains where it found the link to the requested item
Pretty much impossible, because the same link could be present in thousands of pages. However, I am inclined to change my bot to include a Referer in cases where the bot gets redirected from another domain -- this seems valuable for troubleshooting and also gives the webmaster an idea of where the request came from.
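A sketch of how that redirect case could be handled; the hop-following logic and all names here are illustrative, not from any real bot:

import urllib.error
import urllib.request
from urllib.parse import urljoin, urlparse

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # returning None makes 3xx responses surface as HTTPError below
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect())

def fetch(url, max_hops=5):
    referer = None
    for _ in range(max_hops):
        headers = {"User-Agent": "ExampleBot/1.0"}
        if referer:
            headers["Referer"] = referer   # set only on cross-domain hops
        req = urllib.request.Request(url, headers=headers)
        try:
            return opener.open(req)
        except urllib.error.HTTPError as e:
            loc = e.headers.get("Location")
            if e.code not in (301, 302, 303, 307, 308) or not loc:
                raise
            new_url = urljoin(url, loc)
            # include a Referer only when the redirect crosses domains
            if urlparse(new_url).netloc != urlparse(url).netloc:
                referer = url
            else:
                referer = None
            url = new_url
    raise RuntimeError("too many redirects")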
The ideal situation would be merging Google's Sitemaps idea into robots.txt - it was a big mistake by Google to fail to resist the temptation of having people submit Sitemaps directly to Google rather than making it an open standard for all search engines.
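Just to illustrate, the merge would not need to be complicated; something as small as a sitemap pointer inside robots.txt would do (the URL is an example):

Sitemap: http://example.com/sitemap.xml
User-agent: *
Disallow: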
Obeys the no-cache tag
Obeys the no-follow tag
Obeys the no-index tag
Obeys the none tag
These are really features of the search engine rather than of the bot - also, the "nofollow" flag does not mean the link won't be followed; it's a very misleading name for a zero-PageRank feature.
Request-header From contains an e-mail address
Request-header Referer contains where it found the link to the requested item
Request-header User-Agent contains a contact e-mail address
Recognises commented-out links
All of this is redundant so long as the User-Agent contains a link to a page that in turn contains all the contact information. Stuffing all of this into headers is a waste of bandwidth on all sides.
Now I am off to register the newrobotstxt.org domain ;)
Regarding Crawl-delay, I think this is fine as is, since it tells spiders the time delay between fetching any resources.
I do see the need to tell spiders how often a resource changes and when to re-request it. This is actually covered by a tag in Google's Sitemap format called changefreq, and Google has the following comments about it:
How frequently the page is likely to change. This value provides general information to search engines and may not correlate exactly to how often they crawl the page. Valid values are: always, hourly, daily, weekly, monthly, yearly and never.
Possibly, rather than doing this on an individual-document basis as per the Google Sitemap, you would want to do this at the higher level of a directory or site in robots.txt (or robots.xml); a hypothetical sketch follows.
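Something like this, say; the Change-Freq directive is invented here purely for illustration:

; hypothetical - Change-Freq is not a real robots.txt directive
User-agent: *
Change-Freq: /news/ daily
Change-Freq: /archive/ yearly
Change-Freq: / weekly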
technically not feasible
robots.txt is binary at its lowest form.
on/off or yes/no.
While robots.txt may be effective for determining spidering agendas, the technical capabilities of how the accumulated data is analyzed and utilized have superseded robots.txt by light years.
Getting emotional over inanimate objects ;)
We shouldn't forget that these pieces of software are created, programmed and motivated by human beings that have some simple human traits, such as conscience and reasoning.
Perhaps you're correct that the technology doesn't exist?
Perhaps you're correct that I expect too much capability from SEs spidering my/our data?
Daily, I deal with people who are intimidated by the word "computer" and use lesser machines to access the internet (of course these folks are a very small share of the market). These are folks from a different time (nearly another century), and yet they are capable of balancing two worlds and two technologies, all without being aware of robots.txt, scrapers, harvesters, or even packets.
Why should the constraints (whether self-imposed or not) on these folks be any less demanding than those on the technologies they are expected to grasp?
A 5-year-old kid has better image and voice recognition than the best available supercomputer.
I have been programming for 15 years, and while I set myself very high goals, I am not even going to try to code what you expect from bots - you are about the only one with such expectations, and even if you had the best site in the world it would be cheaper and more sensible just not to crawl your site at all.
What you are proposing is wishful thinking that is really laughable to anybody who has tried to program anything serious. Perhaps you have a unique vision of the problem, but if that's the case then you will have to prove that you can actually implement yourself what you expect of others. Can you write code that would understand the T&C pages of various websites? Can you do it?
I really don't want to get into this any further. I can't implement what you want, and to my knowledge even the best-in-class programmers (i.e. Google) aren't even trying to do it, which pretty much means that unless you prove to the whole world that you can do it, nobody will do it anytime soon, because there are far more important problems to solve, like better ranking or anti-spam algorithms. There is robots.txt out there, and if you don't like a particular bot, or all bots, then just disallow access to your site and that will be the end of it.
This is my last post on the topic of the laughable expectation that bots should understand T&Cs that many non-lawyers would struggle to decipher.
I'd rather this discussion be about what features we would like to see that can actually be achieved.
In fact, most of the points I raised aren't even changes to robots.txt, but rather standards of behaviour for crawlers/indexers.
Two features that I would like to become part of the robots.txt standard are already supported by some bots, namely Crawl-delay: and wildcards in Disallows; but what else can we add?
How about a Revisit: tag that gives a minimum time before revisiting a page/URL? This would be a single line per User-Agent giving a site-wide indicator of how often changes occur on a site; if you are worried about bandwidth usage, you raise this figure to tell the spider to visit less often. A hypothetical example follows.
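For example (the Revisit directive is hypothetical, and I've assumed its value is in days):

User-agent: *
Crawl-delay: 2
Revisit: 7    ; minimum days before re-requesting any page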
How about a Revisit: tag that gives a minimum time before revisiting a page/URL?
Currently robots.txt is split by user-agent; it would be necessary to be able to define groups of URLs (with * wildcard support in URLs) and then have per-group settings like Crawl-Delay or Revisit or whatever. This would give the flexibility to assign different parameters to different groups of URLs.
[webmasterworld.com...]
This would be a single line per User-Agent
But that's exactly my point - you are in effect limiting yourself to one setting per user-agent. Right now Crawl-Delay applies to all URLs, even though some URLs involve heavier processing than others: search URLs take more resources on the server, so it makes sense to have a higher crawl-delay for those but a lower crawl-delay for static pages that are not a big burden on the server. Same with Revisit - some pages rarely change, so you may want to set a revisit value different from, say, the home page's.
This means that more than a single parameter per user-agent (as it is now) will be needed, in effect requiring a switch to URL groups rather than user-agent groups. For example:
; define url groups
Url-Group: DisallowAll
Url: /
Url-Group: StaticPages
Url: /*.html
Url: /*.htm
Url: /*.txt
Crawl-Delay: 1
Revisit: 30
Url-Group: DynamicPages
Url: /*.php
Url: /*.jsp
Crawl-Delay: 10
Revisit: 5
Url-Group: TrapForBadBots
Url: /abusethis/
Disallow: yes
Url-Group: HomePage
Url: /index.php
Revisit: 1
User-Agent: *
; specify url groups for the advanced agents
; presence of url groups automatically overrides
; ANY other settings below, except for urls that
; match none of the groups
Url-Group: DynamicPages
Url-Group: StaticPages
Url-Group: HomePage
Url-Group: TrapForBadBots
; for backwards compatibility
Crawl-Delay: 5
Disallow: /abusethis/ ; or whatever should be disallowed
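To show the proposal is parseable, here is a minimal sketch of a parser for this hypothetical Url-Group syntax in Python; the function names are made up, and glob matching is used as a rough stand-in for robots.txt-style prefix/wildcard matching:

from fnmatch import fnmatch

def parse_extended_robots(text):
    # groups: name -> {"urls": [...], "settings": {...}}
    # agents: user-agent -> {"groups": [...], "settings": {...}}
    groups, agents = {}, {}
    group = agent = None
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()       # drop ;-comments
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agent = agents.setdefault(value, {"groups": [], "settings": {}})
            group = None
        elif field == "url-group" and agent is None:
            group = groups.setdefault(value, {"urls": [], "settings": {}})
        elif field == "url-group":
            agent["groups"].append(value)          # reference, not definition
        elif field == "url" and group is not None:
            group["urls"].append(value)
        elif group is not None and agent is None:
            group["settings"][field] = value       # e.g. crawl-delay, revisit
        elif agent is not None:
            agent["settings"][field] = value       # backwards-compat lines
    return groups, agents

def settings_for(path, agent, groups):
    # first group whose pattern matches wins; else the agent's own settings
    for name in agent["groups"]:
        g = groups.get(name, {"urls": [], "settings": {}})
        if any(fnmatch(path, pat) for pat in g["urls"]):
            return g["settings"]
    return agent["settings"]

Feeding the example above through parse_extended_robots and asking settings_for("/search.php", agents["*"], groups) would return the DynamicPages settings (Crawl-Delay 10, Revisit 5), while a plain .html page would fall into StaticPages - exactly the per-group behaviour being proposed.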