

Google Submits Formal Robots Exclusion Protocol to IETF

     
3:31 pm on Jul 1, 2019 (gmt 0)

engine
Administrator from GB

joined:May 9, 2000
posts:26370
votes: 1035


Google has drafted a Robots Exclusion Protocol Specification and submitted it to the IETF (Internet Engineering Task Force).
Surprisingly, the Robots Exclusion Protocol has never been formalised since its inception in 1994.
Note that the rules themselves have not changed; the draft merely updates the protocol to reflect today's web as we know it.

There are some things worth noting in this:
  • Any URI-based transfer protocol can use robots.txt. For example, it's no longer limited to HTTP and can be used for FTP or CoAP as well.
  • Developers must parse at least the first 500 kibibytes of a robots.txt. Defining a maximum file size ensures that connections are not open for too long, alleviating unnecessary strain on servers.
  • A new maximum caching time of 24 hours, or the cache directive value if available, gives website owners the flexibility to update their robots.txt whenever they want, while crawlers aren't overloading websites with robots.txt requests. For example, in the case of HTTP, Cache-Control headers could be used for determining caching time (see the sketch below).
  • The specification now provides that when a previously accessible robots.txt file becomes inaccessible due to server failures, known disallowed pages are not crawled for a reasonably long period of time.


  • Google says it's also updated "the augmented Backus–Naur form in the internet draft to better define the syntax of robots.txt, which is critical for developers to parse the lines."

    [webmasters.googleblog.com...]
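
For anyone curious how those two numeric rules might look in practice, here is a minimal C++ sketch of the 500 KiB parse limit and the 24-hour default cache lifetime. Only those two numbers come from the draft as summarised above; the function names, the oversized body, and the made-up max-age value are purely illustrative.

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <optional>
#include <string>

// The draft (as summarised above) requires parsers to handle at least the
// first 500 kibibytes of a robots.txt; a crawler may ignore the rest.
constexpr std::size_t kMaxRobotsBytes = 500 * 1024;

std::string TruncateRobotsBody(const std::string& body) {
  return body.substr(0, std::min(body.size(), kMaxRobotsBytes));
}

// Cache lifetime: honour an HTTP Cache-Control max-age when one is present,
// otherwise fall back to the 24-hour default mentioned in the draft.
std::chrono::seconds RobotsCacheTtl(std::optional<long> max_age_seconds) {
  if (max_age_seconds && *max_age_seconds > 0) {
    return std::chrono::seconds(*max_age_seconds);
  }
  return std::chrono::hours(24);
}

int main() {
  const std::string body(600 * 1024, 'a');  // an oversized fetch, made up
  std::cout << "bytes parsed: " << TruncateRobotsBody(body).size() << "\n";
  std::cout << "ttl, no header: " << RobotsCacheTtl(std::nullopt).count() << "s\n";
  std::cout << "ttl, max-age=3600: " << RobotsCacheTtl(3600).count() << "s\n";
  return 0;
}

The point is simply that both limits are small, fixed values a crawler can enforce before it even starts parsing rules.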
    1:30 pm on July 2, 2019 (gmt 0)

    New User from US 

    joined:July 2, 2019
    posts:2
    votes: 0


    This is exciting.
    7:56 pm on July 2, 2019 (gmt 0)

    Preferred Member


    joined:July 23, 2004
    posts:596
    votes: 103


    This might be a good thing, but only for the handful of parsing agents, the 2 or 3 that actually look at robots.txt and obey it in the first place.

    As for the 10s of 100s of other parsing agents that totally ignore robots.txt? --- meh
    10:19 pm on July 2, 2019 (gmt 0)

    Senior Member from CA 


    joined:Nov 25, 2003
    posts:1339
    votes: 438


    At the same time Google announced [opensource.googleblog.com] they've open sourced their robots.txt parser [github.com].


    ...we open sourced the C++ library that our production systems use for parsing and matching rules in robots.txt files. This library has been around for 20 years and it contains pieces of code that were written in the 90's. Since then, the library evolved; we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet draft when it made sense.

    We also included a testing tool in the open source package to help you test a few rules. Once built, the usage is very straightforward...
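
    A rough illustration of calling that library from C++ is below. The class and method names (googlebot::RobotsMatcher, OneAgentAllowedByRobots) are as I recall them from the repository's robots.h, so treat them as an assumption and check the header before relying on them; the robots.txt body, bot name, and URL are made up.

    // Sketch only: names recalled from the google/robotstxt repo's robots.h
    // and may not match the current source exactly.
    #include <iostream>
    #include <string>

    #include "robots.h"  // header shipped with the open-sourced parser

    int main() {
      const std::string robots_body =
          "User-agent: *\n"
          "Disallow: /private/\n";

      googlebot::RobotsMatcher matcher;
      const bool allowed = matcher.OneAgentAllowedByRobots(
          robots_body, "ExampleBot", "https://example.com/private/page.html");

      std::cout << (allowed ? "allowed" : "disallowed") << "\n";
      return 0;
    }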
    10:20 pm on July 2, 2019 (gmt 0)

    Senior Member from CA 


    joined:Nov 25, 2003
    posts:1339
    votes: 438



    As for the 10s of 100s of other parsing agents that totally ignore robots.txt?

    DOA