pageoneresults - 4:21 pm on May 30, 2010 (gmt 0)
Personally I think robots.txt is the arch-nemesis when it comes to crawling. I haven't recommended its use for quite a few years now. Yes, we still use one, but only to block /js/ directories and as a whitelisting method: we Disallow all but known bots.
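For illustration, a whitelisting robots.txt along those lines might look something like this (the bot names are just placeholders for whichever crawlers you actually want to let in):

    User-agent: Googlebot
    Disallow: /js/

    User-agent: Bingbot
    Disallow: /js/

    User-agent: *
    Disallow: /

Any bot not named gets a blanket Disallow; the named ones only get kept out of /js/.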
In my mind there's nothing worse than performing a site: search for Disallowed files on your site and finding 50,000 URI-only listings. I think Google broke the robots.txt protocol when they started showing Disallowed files in the SERPs.
Keep in mind that the robots.txt file is a road map straight to the documents you don't want indexed. I've found some very interesting stuff when previewing robots.txt files. Not to mention performing site: searches and seeing thousands upon thousands of URI-only listings; that's not good if you ask me.
We've been using noindex, nofollow at the document level without fail. We're also using X-Robots-Tag headers and doubling up on the directives, one in the server header and one at the document level. To date, it's worked like a charm. None of the SEs will list a page that is noindexed.
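Roughly what that doubling up looks like, just as a sketch (the Apache snippet assumes mod_headers is enabled, and the filename is made up). In the server config:

    <Files "members-only.html">
    Header set X-Robots-Tag "noindex, nofollow"
    </Files>

And at the document level:

    <meta name="robots" content="noindex, nofollow">

Belt and suspenders: if one directive gets stripped or missed, the other still tells the bot not to index.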
Think about this from a crawling perspective. Do you really want Googlebot picking up all of those documents and displaying URI-only listings for someone to do who-knows-what with? Could I generate a page of links from those Disallowed files and create some havoc with your crawling routines? I think so. ;)