jdMorgan - 4:43 am on Nov 8, 2006 (gmt 0)
They're treating arguments to Disallow and Allow as URL-paths, not filenames. So this addition and this example are very good news to people who don't want the same dynamic (page) URL crawled with dozens of different session IDs in the query string and then possibly treated as duplicate content, simply because people have linked to those URLs, or because the Webmaster left the "stats" page accessible to robots. The ability to tell the robot not to crawl URLs with appended session IDs is a decent back-end fix for those sites whose owners don't have the technical know-how to disable sessions for crawler user-agents.
[soapbox] However, it would be very, very nice if the search providers would lay down their arms for a short time and get together to expand and modernize the Standard, and then document the result in a formal way. The function of robots.txt files should not be an area for competition, but rather, for cooperation.
There are many things about the Standard and its extensions that are problematic:
Multiple User-agent record support, as in several User-agent lines stacked in one record versus several robot names on a single User-agent line.
Precedence versus specificity of User-agent fields: Some 'bots accept the first record containing a User-agent prefix which matches their name, or contains "*" -- whichever comes first. This is the method specified in the original Standard. In order to make up for common Webmaster errors, though, other robots accept the record containing the "best match" on the User-agent string, regardless of the record order. No matter which method is used, this choice should be documented and available on-line.
Precedence of Disallow and Allow for non-mutually-exclusive partial paths: This can be fixed (e.g. "Allow" always overrides "Disallow"), or can be based on directive order, but it certainly should be documented.
Clarification of "case-insensitive substring match" in the Standard: I'd like to see this changed to "case-insensitive prefix-match," so that, for example, "msnbot-Media" would stop trying to crawl where only "msnbot/" is allowed, and getting itself 403'ed on my sites as a result (Hey, see that trailing slash? -- "msnbot-" does not match "msnbot/").
Documentation of behaviour for unsupported directives: If a 'bot doesn't support all 'modern' directives (for example Crawl-delay), I'd like to see an explicit declaration of behaviour, as in, "If we don't recognize a directive, we A) ignore it, B) ignore the record in which it appears, C) consider the robots.txt file to be invalid and leave, or D) consider robots.txt to be invalid and henceforth have our way with your site." Any of these is fine with me, as I have bigger guns to back up robots.txt, but I'd like to see it in writing.
Finally, I'd like to see a return to something like 'the good old days' when new search companies appeared and simply copied the AltaVista Scooter robot documentation -- Standardization, rather than the current "Balkanization" of robots.txt handling and documentation.
Jim
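To make the "first match" versus "best match" difference concrete, here is a rough Python sketch. It is not any real robot's code: the function names, the "longest token wins" reading of "best match", and the sample tokens are all my own assumptions for illustration. The mode argument switches between the substring comparison in the original Standard and the prefix comparison suggested above.

```python
def token_matches(token, robot_name, mode="prefix"):
    """Case-insensitive comparison of a User-agent token with a robot's name."""
    token, robot_name = token.lower(), robot_name.lower()
    if token == "*":
        return True  # the wildcard record applies to every robot
    if mode == "substring":
        return token in robot_name
    return robot_name.startswith(token)


def first_match(tokens, robot_name, mode="prefix"):
    """Index of the first record that matches or is "*", in file order."""
    for index, token in enumerate(tokens):
        if token_matches(token, robot_name, mode):
            return index
    return None


def best_match(tokens, robot_name, mode="prefix"):
    """Index of the longest matching token, ignoring record order.

    Assumption: "best" means "longest", with "*" as the weakest match.
    """
    best_index, best_length = None, -1
    for index, token in enumerate(tokens):
        if not token_matches(token, robot_name, mode):
            continue
        length = 0 if token == "*" else len(token)
        if length > best_length:
            best_index, best_length = index, length
    return best_index


# Hypothetical file where the "*" record comes before the "msnbot" record.
tokens = ["*", "msnbot"]
print(first_match(tokens, "msnbot-media"))  # 0: the "*" record wins on order
print(best_match(tokens, "msnbot-media"))   # 1: the "msnbot" record wins on length
```

The same robots.txt file sends the robot to two different records depending on which method it uses, which is why the choice needs to be written down somewhere.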
SS, 1) What is anyone's guess how Y's spider would behave by default? If it's not Disallowed, and it's not Allowed ... would the spider crawl it? Wouldn't that make Allow pretty meaningless? After all, if it's not Disallowed ...
If a URL-path-prefix is not explicitly Disallowed or explicitly Allowed, then it is implicitly allowed, and by default, it will be crawled -- much the same as if the robots.txt file were non-existent or blank.
2) Can anyone explain how/why the example for /*?sessionid would work? Does anyone have filenames that include a query string on their server? What's the point, and why is this instruction a useful addition to robots.txt, which is meant to instruct spiders/bots in where they can and can't crawl?
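For what it's worth, here is a rough Python sketch of how a rule like /*?sessionid can work once the argument is compared against the whole requested URL-path, query string included, rather than against a filename on disk. The helper names are made up, and the matching rules (pattern anchored at the start of the path, "*" standing for any run of characters) are my assumptions about the extension, not documented behaviour of any particular spider. Anything that no rule matches falls through as implicitly allowed.

```python
import re


def pattern_to_regex(pattern):
    """Turn a Disallow pattern into a regex anchored at the start of the path.

    Assumption: "*" matches any run of characters; everything else is literal.
    """
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("^" + ".*".join(literal_parts))


def is_crawlable(url_path, disallow_patterns):
    """Return False if any Disallow pattern matches; otherwise implicitly allowed."""
    for pattern in disallow_patterns:
        # An empty Disallow value excludes nothing, so skip it.
        if pattern and pattern_to_regex(pattern).match(url_path):
            return False
    return True


rules = ["/*?sessionid"]
print(is_crawlable("/page.php?sessionid=abc123", rules))  # False
print(is_crawlable("/page.php", rules))                   # True
print(is_crawlable("/page.php?id=7", rules))              # True
```

No file on the server needs a query string in its name; the robot only looks at the URL it is about to request.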
We talk of "the Standard", and some folks get somewhat excited about defending it as sacrosanct. However, it was never "The Standard," but rather, "A Standard for Robot Exclusion." It was never voted on by any official body; it is only a de-facto standard. So, compliance with this "standard" is entirely voluntary.
The two forms of multiple User-agent support look like this:
User-agent: googlebot
User-agent: slurp
Disallow: /cgi-bin
versus
User-agent: slurp googlebot
Disallow: /cgi-bin
Both methods are described in proposed Standards, but which (if any) robots support both?
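A parser that wanted to accept both forms could look something like the following Python sketch. It is only an illustration under my own assumptions: parse_records is a made-up name, whitespace inside a User-agent value is treated as a separator between robot names, and a User-agent line that follows a rule line starts a new record. Whether any real robot behaves this way is exactly the open question.

```python
def parse_records(robots_txt):
    """Split robots.txt text into (agent_names, rules) records."""
    records = []
    agents, rules = [], []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if rules:
                # A User-agent line after rule lines begins a new record.
                records.append((agents, rules))
                agents, rules = [], []
            # Assumption: several names may share one line, space-separated.
            agents.extend(value.lower().split())
        elif agents:
            rules.append((field, value))
    if agents:
        records.append((agents, rules))
    return records


stacked = "User-agent: googlebot\nUser-agent: slurp\nDisallow: /cgi-bin\n"
one_line = "User-agent: slurp googlebot\nDisallow: /cgi-bin\n"
print(parse_records(stacked))
print(parse_records(one_line))
```

Both inputs come out as a single record naming googlebot and slurp with one Disallow rule, differing only in the order of the names.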
[/soapbox]