pageoneresults - 4:55 pm on May 30, 2010 (gmt 0)
I don't see Googlebot crawling robots.txt disallowed documents.
If they don't crawl them, why are there so many URI-only listings when performing site: searches for Disallowed files? They do crawl them, but they don't index them, based on my understanding.
I've yet to find a single page of mine that carries a noindex directive showing up in the index via site: searches. Those pages using noindex are pretty much invisible in the results. Google may know about them via its own internal mechanisms, but the general public doesn't see them.
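For anyone who wants to compare, this is the directive I'm talking about - a generic example, not copied from any of my pages:

<head>
<meta name="robots" content="noindex">
</head>

Keep in mind the bot has to be allowed to crawl the page in order to see that tag, which is exactly why you don't also Disallow those same URIs in robots.txt.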
robots.txt is a great way for Googlebot to discover URIs. It will crawl anything and everything if it finds a link. Most of those Disallowed pages have links pointing to them, and Googlebot is going to find those links one way or another. If folks want to give the bots a starting point by using robots.txt, that's fine. I've seen too much stuff showing up in site: searches that really should NOT be there.
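To illustrate the point (made-up paths, not mine), every entry like this is a signpost that anyone - bot or human - can read at example.com/robots.txt:

User-agent: *
Disallow: /admin/
Disallow: /private-reports/

You've just published a map of the very directories you wanted kept quiet.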
I think many folks overlook the potential risks of all those URI-only listings that come from robots.txt Disallow entries. Think about all the internal links you have pointing to Disallowed files. You're creating one big round robin of crawling, and that's why I use noindex instead. I don't want my document URIs showing up when someone performs site: searches - that's not right.