Forum Moderators: goodroi


Proposing updates to the robots.txt standard

It is time for a change


volatilegx

2:30 pm on Nov 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Regarding the discussion about dumb spiders and the argument about bots obeying a site's T.O.S. [webmasterworld.com]

I think Dijkgraaf has brought up some interesting considerations that deserve their own thread; a new robots standard proposal deserves attention. His list of behaviours a well-behaved bot should follow:

reads robots.txt

obeys Disallow directives

re-requests robots.txt if more than 24 hours have passed since the last request

doesn't request robots.txt more than once an hour

doesn't request files more than once every 5 seconds

doesn't repeat requests within 24 hours

obeys the no-cache tag

obeys the nofollow tag

obeys the noindex tag

obeys the none tag

request-header User-Agent contains a URL to a page about the bot

the about-bot page explains how to disallow the bot in robots.txt

request-header User-Agent contains the bot's name

the bot name in the User-Agent matches the one it looks for in robots.txt / meta tags

request-header User-Agent doesn't change often

recognises commented-out links

doesn't request URLs containing # fragments

stops revisiting not found (404) pages after a time

doesn't frequently re-request (404) pages

stops revisiting moved permanently (301) pages

stops revisiting gone (410) pages

And also some features I'd like to see in all bots (some bots support some of these, but most don't):

request-header contains If-Modified-Since (including when fetching robots.txt)

obeys Disallow directives with file-extension wildcards

allows the site owner to control the crawl delay

allows the site owner to control the frequency of revisits

obeys the no-email-collection tag

request-header From contains an e-mail address

request-header Referer contains the page where it found the link to the requested item

request-header User-Agent contains a contact e-mail address

the about-bot page explains the purpose of the bot
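For the robots.txt items on the list, a bot writer gets much of this behaviour for free from a standard parser. A minimal sketch in Python using the standard library's urllib.robotparser; the bot name "NiceBot" and the sample rules are hypothetical:

```python
import urllib.robotparser

# Hypothetical robots.txt rules for a hypothetical bot named "NiceBot".
ROBOTS_TXT = """\
User-agent: NiceBot
Disallow: /private/
Crawl-delay: 5
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "obeys Disallow directives"
assert not parser.can_fetch("NiceBot", "/private/secret.html")
assert parser.can_fetch("NiceBot", "/public/page.html")

# "doesn't request files more than once every N seconds" -- the site's
# requested delay is available to the scheduling code:
print(parser.crawl_delay("NiceBot"))  # -> 5
```

The re-request and 404/301/410 back-off items are scheduling policy in the bot itself, not anything robots.txt can express today.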

Now I'll go back to beating a dead horse, so to speak. I think wilderness and Lord Majestic's argument about bots obeying sites' terms of service can be summed up like this:

wilderness: Bots should obey a site's terms of service.

Lord Majestic: While desirable, it isn't technically feasible.

I agree with both of you. That's why I'd like to propose a new standard that would encompass Dijkgraaf's list and add a bot-understandable framework for obeying terms of service that relate to search engine use of a website.

Perhaps robots.txt could be extended to tell bots whether files may be cached and for how long, how often to re-spider, how many requests per minute a bot should make, etc.
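As a sketch of what such an extension might look like to a parser, here is a minimal Python reader for a hypothetical extended robots.txt. The directive names (Cache, Cache-Lifetime, Revisit-After, Requests-Per-Minute) are invented for illustration and are not part of any standard:

```python
# Hypothetical extended robots.txt -- every directive beyond Disallow
# is invented for this example.
EXTENDED_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Cache: yes
Cache-Lifetime: 7d
Revisit-After: 24h
Requests-Per-Minute: 12
"""

def parse_extended(text):
    """Collect directive/value pairs, ignoring blanks and # comments."""
    rules = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        rules.setdefault(key.strip().lower(), []).append(value.strip())
    return rules

rules = parse_extended(EXTENDED_ROBOTS_TXT)
print(rules["requests-per-minute"])  # -> ['12']
```

Since the format stays line-oriented key/value pairs, old parsers would simply skip the directives they don't recognise.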

That's about all I'd want from the standard, but I'm sure wilderness could think of a lot more.

Lord Majestic

4:13 pm on Nov 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I am all for making some progress on robots.txt advancements that are well overdue. I would be quite happy to take a very active part in this.

Now to comment on some of Dijkgraaf's suggestions:

Doesn't request files more than once every 5 seconds

The Crawl-delay parameter seems to be more and more widely supported, and it gives the opportunity to specify a custom delay. Perhaps it would make sense to have multiple Crawl-delays, because some static files are fast to retrieve but some dynamic pages (search) can be slow.

From a bot writer's point of view, a default delay of 5 seconds is too high (for a site with lots of pages); I'd say 1 second should be the minimum, with the default somewhere between 1 and 2 :)
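The "no more than one request every N seconds" behaviour is simple to implement bot-side. A sketch of a per-site throttle; the injectable clock exists only so the arithmetic can be shown without actually sleeping:

```python
import time

class Throttle:
    """Sketch of a per-site rate limiter enforcing a Crawl-delay."""
    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self.clock = clock
        self.sleep = sleep
        self.last_request = None

    def wait(self):
        """Block until at least `delay` seconds since the previous request."""
        now = self.clock()
        if self.last_request is not None:
            remaining = self.delay - (now - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_request = now

# With the real clock this sleeps; a fake clock shows the arithmetic.
fake_now = [0.0]
t = Throttle(2.0, clock=lambda: fake_now[0],
             sleep=lambda s: fake_now.__setitem__(0, fake_now[0] + s))
t.wait()            # first request: no delay
fake_now[0] += 0.5  # half a second later...
t.wait()            # ...sleeps the remaining 1.5 seconds
print(fake_now[0])  # -> 2.0
```

A real crawler would keep one Throttle per host, with the delay taken from that host's robots.txt.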

request-header referer contains where it found link to the requested item

Pretty much impossible, because the same link could be present in thousands of pages. However, I am inclined to change my bot to include a Referer when the bot is redirected from another domain -- this seems valuable for troubleshooting and gives the webmaster an idea of where the request came from.

The ideal situation would be to merge Google's Sitemaps idea into robots.txt - it was a big mistake by Google to fail to resist the temptation to have people submit Sitemaps directly to Google rather than making it an open standard for all search engines.

obeys no cache tag
obeys no follow tag
obeys no index tag
obeys none tag

These are really features of the search engine rather than of the bot - also, the "nofollow" flag does not mean the link won't be followed; it's a very misleading name for a zero-PageRank feature.

request-header From contains e-mail address
request-header referer contains where it found link to the requested item
request-header user-agent contains contact e-mail address
recognises commented out links

All of this is redundant so long as the user-agent contains a link to a page that in turn contains all the contact information. Stuffing all of this into headers is a waste of bandwidth on all sides.

Now I am off to register newrobotstxt.org domain ;)

Dijkgraaf

9:28 pm on Nov 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Shouldn't this thread be in the robots.txt forum then?

Regarding Crawl-delay, I think this is fine as it is, since it tells spiders the time delay between fetching any resources.
I do see the need to tell spiders how often a resource changes and when to re-request it. This is actually covered by a tag in Google's sitemap called changefreq, and Google has the following comments about it:
How frequently the page is likely to change. This value provides general information to search engines and may not correlate exactly to how often they crawl the page. Valid values are:

  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never

    Possibly, rather than doing this on an individual document basis as per the Google sitemap, you would want to do it at a higher level of directory or site in robots.txt (or robots.xml).
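For illustration, a small sketch of how a bot might turn changefreq hints from a sitemap into revisit intervals. The hour values in the mapping are an arbitrary choice for this example, not anything Google specifies:

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap fragment using the real sitemaps.org namespace.
SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc><changefreq>daily</changefreq></url>
  <url><loc>http://example.com/archive.html</loc><changefreq>yearly</changefreq></url>
</urlset>
"""

# Illustrative mapping from changefreq value to minimum revisit interval (hours).
REVISIT_HOURS = {"always": 0, "hourly": 1, "daily": 24, "weekly": 168,
                 "monthly": 720, "yearly": 8760, "never": float("inf")}

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)
schedule = {u.find("sm:loc", ns).text: REVISIT_HOURS[u.find("sm:changefreq", ns).text]
            for u in root.findall("sm:url", ns)}
print(schedule["http://example.com/"])  # -> 24
```

Doing the same at directory or site level in robots.txt, as suggested above, would just mean attaching such a value to a path prefix instead of to each URL.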
    wilderness

    11:11 pm on Nov 3, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    Frequently, I begin responses and "grab my tongue (er, keyboard)" to delay my sometimes over-reaction.
    The following is from another thread early this AM; however, I believe it offers pertinent criteria for bots to consider.

    technically not feasible

    robots.txt is binary at its lowest form:
    on/off or yes/no.

    While robots.txt may be effective for determining spidering agendas, the technical capabilities of how the accumulated data is analyzed and utilized have superseded robots.txt by light years.

    Getting emotional over animated objects ;)

    We shouldn't forget that these pieces of software are created, programmed and motivated by human beings who have some simple human traits, such as conscience and reasoning.

    Perhaps you're correct that the technology doesn't exist?
    Perhaps you're correct that I expect too much capability from SEs spidering my/our data?

    Daily, I deal with people who are intimidated by the word "computer" and use lesser machines to access the internet (of course these folks are a very small share of the market). These are folks from a different time (nearly another century), and yet they are capable of balancing two worlds and two technologies, all without being aware of robots.txt, scrapers, harvesters, or even packets.

    Why should the constraints (whether self-imposed or not) on these folks be any less demanding of the technologies they are expected to grasp?

    Lord Majestic

    11:20 pm on Nov 3, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Why should the constraints (whether self-imposed or not) on these folks be any less demanding of the technologies they are expected to grasp?

    A 5 year old kid has better image and voice recognition than the best available supercomputer.

    I have programmed for 15 years, and while I set myself very high goals, I am not even going to try to code what you expect from bots - you are about the only one with such expectations, and even if you had the best site in the world it would be cheaper and more sensible just not to crawl your site.

    What you are proposing is wishful thinking that is really laughable to anybody who has tried to program anything serious. Perhaps you have a unique vision of the problem, but if that's the case then you will have to prove that you can actually implement yourself what you expect of others. Can you write code that would understand T&C pages from various websites? Can you?

    I really don't want to get into this any further - I can't implement what you want, and to my knowledge even the best-in-class programmers (i.e. Google) are not even trying to do it. That pretty much means that unless you prove to the whole world that you can do it, nobody will do it anytime soon, because there are far more important problems to solve, like better ranking or anti-spam algorithms. There is robots.txt out there, and if you don't like a particular bot, or all bots, then just disallow access to your site and that will be the end of it.

    This is my last post on the topic of the laughable expectation that bots should understand T&Cs that many non-lawyers would struggle to decipher.

    wilderness

    11:45 pm on Nov 3, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    This is my last post on the topic of the laughable expectation that bots should understand T&Cs that many non-lawyers would struggle to decipher.

    Be nice if you choked on your laughter (at least from my point-of-view) then you wouldn't be such a PITA.

    Lord Majestic

    11:51 pm on Nov 3, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Be nice if you choked on your laughter

    I wish only the best to you, because the existence of your point of view is very refreshing and stimulating :)

    wilderness

    12:25 am on Nov 4, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    I wish only the best to you because existance of your point of view is very refreshing and stimulating happy! :)

    My only wish for you is that you acquire the skills to properly copy and paste :)

    Lord Majestic

    12:36 am on Nov 4, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    My only wish for you is that you acquire the skills to properly copy and paste happy!

    I prefer to focus on creation of something unique rather than to copy and paste :)

    Dijkgraaf

    12:44 am on Nov 4, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Now you two, stop sniping at each other.

    I'd rather this discussion be about what features we would like to see that can actually be achieved.

    In fact, most of the points I raised aren't even changes to robots.txt, but rather standards of behaviour for crawlers/indexers.

    Two features that I would like to see become part of the robots.txt standard are already supported by some bots: Crawl-delay: and wildcards in Disallows. But what else can we add?
    How about a Revisit: tag that gives a minimum time before revisiting a page/URL? This would be a single line per User-Agent, giving a site-wide indication of how often changes occur on a site; if you are worried about bandwidth usage, you increase this figure to tell the spider to visit less often.

    Lord Majestic

    12:55 am on Nov 4, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    How about a revisit: tag that gives a minimum time before revisiting a page/url?

    Currently robots.txt is split by user-agent; it would be necessary to be able to define groups of URLs (with * wildcard support in URLs) and then have per-group settings like Crawl-delay or Revisit or whatever. This would give the flexibility to assign different parameters to different groups of URLs.

    Dijkgraaf

    12:59 am on Nov 4, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Lord Majestic, that's exactly what I proposed:
    <quote> This would be a single line per User-Agent </quote>

    wilderness

    1:00 am on Nov 4, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    There's an old thread by Jim.
    Although it doesn't discuss new concepts, it explains some existing standards.

    [webmasterworld.com...]

    Lord Majestic

    1:06 am on Nov 4, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    <quote> This would be a single line per User-Agent </quote>

    But that's exactly my point - you are in effect limiting yourself to one setting per user-agent. As it is now, Crawl-delay applies to all URLs, even though some URLs require heavier processing than others - search URLs, for example, take more resources on the server - so it makes sense to have a higher crawl-delay for those and a lower one for static pages that are not a big burden on the server. The same goes for Revisit: some pages rarely change, so you may want to set a revisit value different from (say) the home page's.

    This means that more than a single parameter per user-agent (as it is now) will be needed, in effect requiring a switch to URL groups rather than user-agent groups.

    volatilegx

    9:10 pm on Nov 4, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    I support Lord Majestic's idea of dividing the robots.txt list by URL groups instead of User Agent groups. It makes far more sense.

    Lord Majestic

    9:20 pm on Nov 4, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Here is an example that should help to show what I mean:

    ; define url groups
    Url-Group: DisallowAll
    Url: /

    Url-Group: StaticPages
    Url: /*.html
    Url: /*.htm
    Url: /*.txt
    Crawl-Delay: 1
    Revisit: 30

    Url-Group: DynamicPages
    Url: /*.php
    Url: /*.jsp
    Crawl-Delay: 10
    Revisit: 5

    Url-Group: TrapForBadBots
    Url: /abusethis/
    Disallow: yes

    Url-Group: HomePage
    Url: /index.php
    Revisit: 1

    User-Agent: *

    ; specify url groups for the advanced agents
    ; presence of url group will automatically override
    ; ANY other settings below unless url is not matched
    ; to any of the groups
    Url-Group: DynamicPages
    Url-Group: StaticPages
    Url-Group: HomePage
    Url-Group: TrapForBadBots

    ; for backwards compatibility
    Crawl-Delay: 5
    Disallow: /abusethis/ ; or whatever should be disallowed
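For what it's worth, a format like the example above stays easy to parse mechanically. A sketch of a parser for the group-definition part; the semantics assumed here (a ';' starts a comment, a Url-Group line opens a new block) are my reading of the example, not a spec:

```python
# A trimmed-down instance of the Url-Group format proposed above.
EXAMPLE = """\
Url-Group: StaticPages
Url: /*.html
Url: /*.htm
Crawl-Delay: 1
Revisit: 30

Url-Group: DynamicPages
Url: /*.php
Crawl-Delay: 10
Revisit: 5
"""

def parse_url_groups(text):
    """Collect Url-Group blocks into {name: {"urls": [...], setting: value}}."""
    groups, current = {}, None
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()   # ';' starts a comment
        if not line or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower() == "url-group":
            current = groups.setdefault(value, {"urls": []})
        elif current is not None and key.lower() == "url":
            current["urls"].append(value)
        elif current is not None:
            current[key.lower()] = value
    return groups

groups = parse_url_groups(EXAMPLE)
print(groups["StaticPages"]["crawl-delay"])  # -> '1'
```

The second use of Url-Group (referencing groups from within a User-Agent section) would need an extra pass, since the same directive name plays two roles in the example.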