They're treating arguments to Disallow and Allow as URL-paths, not filenames. So, this addition and this example are very good news for people who don't want the same dynamic (page) URL crawled with dozens of session IDs in the query string and then possibly treated as duplicate content, simply because people have linked to those URLs, or because the Webmaster left the "stats" page accessible to robots. The ability to tell the robot not to crawl URLs with appended session IDs is a decent back-end fix for sites whose owners don't have the technical know-how to disable sessions for crawler user-agents.
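For instance, on an engine that supports the "*" wildcard extension in paths, a record along these lines keeps the session-ID variants out of the crawl while the plain URL stays fetchable (the parameter name "PHPSESSID" is just a stand-in for whatever the site actually appends):

    User-agent: *
    Disallow: /*PHPSESSID=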
We talk of "the Standard", and some folks get somewhat excited about defending it as sacrosanct. However, it was never "The Standard," but rather, "A Standard for Robot Exclusion." It was never voted on by any official body; It is only a de-facto standard. So, compliance with this "standard" is entirely voluntary.
However, it would be very, very nice if the search providers would lay down their arms for a short time and get together to expand and modernize the Standard, and then document the result in a formal way. The function of robots.txt files should not be an area for competition, but rather, for cooperation.
There are many things about the Standard and its extensions that are problematic:
Multiple User-agent record support, as in a record like the following (robot names purely illustrative):
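    User-agent: Googlebot
    User-agent: Slurp
    Disallow: /cgi-bin/
    Disallow: /temp/

The original document allows a record to start with one or more User-agent lines, but whether a given robot honours any line after the first is anyone's guess, and rarely documented.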
Precedence versus specificity of User-agent fields: Some 'bots accept the first record containing a User-agent prefix which matches their name, or contains "*" -- whichever comes first. This is the method specified in the original Standard.
In order to make up for common Webmaster errors, though, other robots accept the record containing the "best match" on the User-agent string, regardless of the record order. No matter which method is used, this choice should be documented and available on-line.
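To make the ambiguity concrete, consider a file like this one (the paths are placeholders):

    User-agent: *
    Disallow: /cgi-bin/

    User-agent: Googlebot
    Disallow: /archive/

A strict first-match parser stops at the "*" record and never sees the Googlebot-specific rules; a "best match" parser applies the Googlebot record no matter where it sits in the file. Same file, two different crawls.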
Precedence of Disallow and Allow for non-mutually-exclusive partial paths. This can be fixed (e.g. "Allow" always overrides "Disallow"), or can be based on directive order, but it certainly should be documented.
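A minimal illustration (paths hypothetical):

    User-agent: *
    Disallow: /private/
    Allow: /private/annual-report.html

Whether that one page gets crawled depends entirely on the engine: by directive order the Disallow wins; under "Allow always overrides" or "longest match wins" the page is fetched. Either rule is workable; it just needs to be written down.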
Clarification of "case-insensitive substring match" in the Standard. I'd like to see this changed to "case-insensitive prefix-match." So that, for example, "msnbot-Media" would stop trying to crawl where only "msnbot/" is allowed, and getting itself 403'ed on my sites as a result (Hey, see that trailing slash? -- "msnbot-" does not match "msnbot/").
Documentation of behaviour for unsupported directives: If a 'bot doesn't support all 'modern' directives (for example, Crawl-delay), I'd like to see an explicit declaration of behaviour, as in, "If we don't recognize a directive, we A) ignore it, B) ignore the record in which it appears, C) consider the robots.txt file to be invalid and leave, or D) consider robots.txt to be invalid and henceforth have our way with your site." Any of these is fine with me, as I have bigger guns to back up robots.txt, but I'd like to see it in writing.
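For instance, given a record that mixes a standard directive with an extension (the ten-second value is arbitrary):

    User-agent: *
    Crawl-delay: 10
    Disallow: /cgi-bin/

A robot that doesn't implement Crawl-delay has to pick one of options A through D above -- and whichever it picks, the fate of that Disallow line should be spelled out somewhere a Webmaster can find it.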
Finally, I'd like to see a return to something like 'the good old days' when new search companies appeared and simply copied the AltaVista Scooter robot documentation -- Standardization, rather than the current "Balkanization" of robots.txt handling and documentation.