pageoneresults - 11:08 am on Apr 26, 2011 (gmt 0)
The net result would be webmasters not blocking pages and letting Googlebot crawl zillions of extra pages.
Google still picks those URLs up when they're listed in robots.txt. It doesn't fetch the content, but the URLs don't stay out of the index either. I've never been fond of robots.txt because Google interprets a Disallow literally: don't crawl, not don't index. I've seen sites show thousands, even hundreds of thousands, of URI-only entries due to this crap.
Me, I just noindex the items that don't belong in the indexing pool. We serve the noindex dynamically, based on the request. Been doing it that way for years with no ill effects, and it keeps documents OUT of the index. Unlike robots.txt entries, which get a URI-only listing and are there for all to see via the site: command.
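To make "serve the noindex dynamically, based on the request" concrete, here is a minimal sketch. It assumes a Python/WSGI stack, and the path prefixes and the use of the X-Robots-Tag response header are illustrative assumptions, not a description of any particular setup:

# Hypothetical prefixes for sections that should stay crawlable but unindexed.
NOINDEX_PREFIXES = ("/search", "/print/", "/cart")

class NoindexMiddleware:
    """Add "X-Robots-Tag: noindex" to responses for matching request paths."""

    def __init__(self, app, prefixes=NOINDEX_PREFIXES):
        self.app = app
        self.prefixes = prefixes

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        noindex = path.startswith(self.prefixes)

        def _start_response(status, headers, exc_info=None):
            # Append the header only when the request falls under a noindex prefix.
            if noindex:
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return self.app(environ, _start_response)

Wrap the site's WSGI app (e.g. application = NoindexMiddleware(application)) and everything under those prefixes goes out with a noindex header while staying fully crawlable, which is the whole point: the crawler can see the page, the index never shows it.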
I've seen robots.txt files give away information that I don't think the general public should have access to. There are too many prying eyes these days that are up to no good. I don't need no stinkin' robots.txt file to provide them with a map of everything I don't want indexed. And, I don't care that Google "crawls and indexes" noindex pages. It doesn't display them in the SERPs, ever, and that is the intended goal.
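And the "map" point is trivial to demonstrate: every Disallow line is one public request away from anyone who cares to look. A minimal sketch, assuming Python 3; the domain is a placeholder:

# Fetch a site's robots.txt and print every path its owner asked bots to avoid.
from urllib.request import urlopen

with urlopen("https://www.example.com/robots.txt") as resp:
    robots = resp.read().decode("utf-8", errors="replace")

for line in robots.splitlines():
    if line.strip().lower().startswith("disallow:"):
        print(line.split(":", 1)[1].strip())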
Also, WTF is the canonical element good for? What happens when another crawler grabs those documents and doesn't understand the canonical? There are still plenty of those out there; Googlebot is not the only bot you need to be concerned about. What ends up happening is that all sorts of URI variations get scraped, repurposed, and linked to, and now you have stray redirects and/or fragmented signals to contend with. Not me. Never used the canonical and probably never will - it's a hack.