pageoneresults - 11:40 pm on May 30, 2010 (gmt 0)
Here are my theories on using robots.txt to Disallow documents.
First off, anything in the robots.txt file is an invitation for folks to explore, not to mention bots. What they do during that exploration process is anyone's guess.
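To illustrate (the path here is just a made-up example), a single entry like this advertises a section to anyone who pulls up /robots.txt in a browser:

    User-agent: *
    Disallow: /private-reports/

There's nothing stopping a curious visitor from typing that path straight into the address bar and having a look around.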
I've done enough retractions of Disallow directives to see positive results after the fact. For example, take a website capable of generating 10,000 documents of unique content. When performing site: searches for its Disallowed documents, 60,000 URI-only listings show up after expanding the result set. What exactly is happening during the crawl routines of that website?
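For anyone who wants to check their own site, I mean the plain site: operator, something along the lines of (the domain and path being placeholders):

    site:example.com/private-reports/

then clicking through to the end of the results and repeating the search with the omitted results included. That's where the URI-only listings pile up.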
I think those URI-only entries are black holes for crawl equity. I don't want the bot wasting its resources referencing 60,000 URIs, I really don't. I don't even want the bots to know that those URIs exist. No, I want to grab that bot by the balls and send it on a pre-planned crawling adventure.
Not a single person has convinced me that robots.txt is useful. I find noindex, nofollow to be the perfect solution for keeping documents out of the indices and keeping folks from peeping. Every single site where we've removed the Disallow and gone to the doc level with noindex, nofollow has seen improvements in its crawl routines, every single one of them. The one common denominator was normalization. Crawls became much more normal after that and the sites appeared to perform that much better overall.
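For the record, going to the doc level means dropping something like this into the head of each document you want kept out (or serving the equivalent X-Robots-Tag HTTP header where you can't touch the markup):

    <meta name="robots" content="noindex, nofollow">

The bot still has to fetch the page once to see the tag, but it gets no open invitation in robots.txt and it leaves no URI-only listing behind.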
Anyone else doing it this way?