jdMorgan - 8:46 pm on Aug 13, 2006 (gmt 0)
Instead of thinking, "I'm sure they can handle this" and interpreting the Standard liberally, it's best to think in terms of "the robots will do (or recognize) this and no more." Keep it dirt-simple, in other words.
It is best to assume that a robot will accept the first record it finds that matches its user-agent token, and to take "matches" to mean that the robot will accept either a "*" or its specific user-agent token, whichever it finds first. I know of several robots that will read past a "*" record to see if they can find a more-specific record, but there are many, many that won't. In other words, the "User-agent: *" record should always be the catch-all record at the end of your robots.txt file, and all robot-specific records should precede it. This is because a given robot will accept directives from one robots.txt record and no more.
And while a few robots support the (required) feature of specifying multiple user-agents in a single record, this causes many others to blow up completely and either go away or crawl the whole site. So even some of what is clearly defined in the Standard is poorly supported. If that's not clear, I'm referring to the construct shown at the end of this post, where a single record starts with two User-agent lines.
My conclusion is that the best approach is to serve a different robots.txt file to each robot -- test the user-agent in the request and serve an appropriate and separate file to each one that matters to your site, then serve a generic one to the 'bots that send you little or no traffic. Only in this way can you be sure that a proprietary directive intended for one robot won't cause a less-sophisticated robot to reject the file as invalid and either spider the whole site or just go away without crawling at all. When doing so, it's best to code the logic so that new variants of recognized robots won't be turned away.
The big search companies are developing a penchant for releasing additional 'specialty' robots at an ever-increasing pace, and for changing their user-agent string arrangement for no good reason. Nevertheless, it wouldn't do to send an otherwise-welcome robot packing because you don't recognize the new version.
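For what it's worth, here is a minimal sketch of the "test the user-agent and serve a separate file" idea in Python. The file names, the token table, and the function name select_robots_file are all assumptions made up for illustration; how you hook this into your server is up to you. The point is only that a case-insensitive substring test on the token keeps working when a robot changes its version number or rearranges its user-agent string.

```python
# Hypothetical table: user-agent token -> robots.txt variant to serve.
# Tokens and file names are illustrative assumptions, not real requirements.
ROBOTS_FILES = {
    "googlebot": "robots-google.txt",
    "slurp": "robots-yahoo.txt",
}

# Served to every robot that is not listed above.
DEFAULT_ROBOTS_FILE = "robots-generic.txt"


def select_robots_file(user_agent: str) -> str:
    """Return the name of the robots.txt variant for this request."""
    ua = (user_agent or "").lower()
    for token, filename in ROBOTS_FILES.items():
        # Substring match rather than exact match, so new versions and
        # rearranged user-agent strings are still recognized.
        if token in ua:
            return filename
    return DEFAULT_ROBOTS_FILE
```

With this table, a request from "Googlebot/9.9" or from a string that merely contains "googlebot" somewhere gets the Google file, and anything unrecognized falls through to the generic file. A specialty robot whose name contains the same token lands on the same file unless you list its more specific token earlier in the table.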
The number one mistake in interpreting robots.txt is to think that the robot examining it is programmed to be "smart" about discerning what you want. While it's certainly possible for today's robots to be smarter than they were back when the Standard was proposed, the robots.txt Standard was invented in much simpler times and for much simpler web sites.
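To make the "not smart" point concrete, here is a rough Python sketch of how a deliberately simple robot might read robots.txt: it takes the first record whose User-agent is either "*" or its own token and stops there. This is an assumed model of a naive robot, not the code of any real crawler, and the name first_matching_record is made up for the example.

```python
import re
from typing import List, Optional


def first_matching_record(robots_txt: str, my_token: str) -> Optional[List[str]]:
    """Return the Disallow paths of the first record a naive robot accepts.

    Returns None if no record names "*" or the robot's own token.
    """
    token = my_token.lower()
    # Records are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", robots_txt.strip()):
        agents: List[str] = []
        disallows: List[str] = []
        for raw in block.splitlines():
            line = raw.split("#", 1)[0].strip()  # drop comments
            field, sep, value = line.partition(":")
            if not sep:
                continue  # not a "field: value" line
            name = field.strip().lower()
            if name == "user-agent":
                agents.append(value.strip().lower())
            elif name == "disallow":
                disallows.append(value.strip())
        # First record that matches wins; the robot reads no further.
        if "*" in agents or token in agents:
            return disallows
    return None


# The same two records in two different orders.
CATCH_ALL_FIRST = "User-agent: *\nDisallow: /\n\nUser-agent: googlebot\nDisallow: /admin\n"
CATCH_ALL_LAST = "User-agent: googlebot\nDisallow: /admin\n\nUser-agent: *\nDisallow: /\n"
```

Under this model, first_matching_record(CATCH_ALL_FIRST, "googlebot") gives ["/"], because the robot accepts the "*" record and never sees its own, while first_matching_record(CATCH_ALL_LAST, "googlebot") gives ["/admin"]. That is the whole argument for putting the catch-all record last.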
User-agent: googlebot
User-agent: Slurp
Disallow: /cgi-bin
Disallow: /admin
That construct, while clearly required* by the Standard, is not supported by many second-tier robots.
* From A Standard for Robot Exclusion: "The record starts with one or more User-agent lines" (emphasis added)
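If you would rather generate the file than hand-edit it, something like the following Python sketch avoids the multi-User-agent construct by writing one record per robot, and always puts the "*" record last. The function name build_robots_txt and the shape of its input are assumptions for illustration only.

```python
from typing import Dict, List


def build_robots_txt(rules_by_agent: Dict[str, List[str]]) -> str:
    """Build robots.txt text with one User-agent line per record.

    Named robots keep the order they were given in; the "*" catch-all
    record, if present, is always written last.
    """
    agents = [agent for agent in rules_by_agent if agent != "*"]
    if "*" in rules_by_agent:
        agents.append("*")

    records = []
    for agent in agents:
        lines = [f"User-agent: {agent}"]
        paths = rules_by_agent[agent]
        if paths:
            lines += [f"Disallow: {path}" for path in paths]
        else:
            # A record needs at least one Disallow line; an empty value
            # means nothing is disallowed for this robot.
            lines.append("Disallow:")
        records.append("\n".join(lines))
    # Records are separated by a single blank line.
    return "\n\n".join(records) + "\n"
```

Called with {"*": ["/"], "googlebot": ["/cgi-bin", "/admin"], "Slurp": ["/cgi-bin", "/admin"]}, it writes a googlebot record, then a Slurp record with the same two Disallow lines repeated, then the "*" record at the end. The repetition is the price of not relying on robots that handle two User-agent lines in one record.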