TheMadScientist - 5:57 pm on Mar 17, 2011 (gmt 0)
I must admit I had previously thought that robots.txt stopped Google crawling the 'disallowed' pages.
It does ... It stops them from crawling the page, so if you use a noindex directive on the page they don't ever know it's there ... Robots.txt does not remove pages from the index.
Robots.txt exclusion and noindex are two totally different things and mutually exclusive ... When you disallow in robots.txt Google DOES NOT (contrary to popular belief) crawl the pages, which means they do not know what is on the page or whether it contains a noindex directive. Instead they use external information, such as links and link text, to try to determine the topic of the page, and they generally include the page(s) in the index, which is especially noticeable when conducting a site: search.
NoIndex is the only directive which tells them not to index the page, but it cannot be used for disallowed pages, because when a page is disallowed in robots.txt they follow the instructions and Do Not crawl the page to see the noindex directive.
You can only use one or the other effectively, and if you try to use both the robots.txt disallow will take precedence and the page will often be included in the index, usually as URL only.
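If it helps to see the logic spelled out, here is a rough Python sketch of how a well-behaved crawler would reach the outcomes described above. It is only an illustration, not how Google actually implements anything: the function name crawler_outcome, the RobotsMetaFinder class, the example.com URLs and the returned strings are all made up for this post. The only real library pieces are urllib.robotparser and html.parser from the standard library.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Hypothetical helper: notes whether a <meta name="robots"> tag says noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            if "noindex" in (attributes.get("content") or "").lower():
                self.noindex = True


def crawler_outcome(robots_txt, url, page_html, agent="Googlebot"):
    """Illustrative only: what a compliant crawler could conclude about url."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    if not rules.can_fetch(agent, url):
        # Disallowed: the page is never fetched, so page_html (and any
        # noindex inside it) is never looked at.
        return "not crawled; may still be indexed as URL only"

    finder = RobotsMetaFinder()
    finder.feed(page_html)
    if finder.noindex:
        return "crawled; noindex seen, kept out of the index"
    return "crawled; eligible for indexing"


DISALLOW_PRIVATE = "User-agent: *\nDisallow: /private/\n"
NOINDEX_PAGE = '<html><head><meta name="robots" content="noindex"></head></html>'

# Both disallow and noindex: the disallow wins and the noindex is never read.
print(crawler_outcome(DISALLOW_PRIVATE, "https://example.com/private/page.html", NOINDEX_PAGE))
# No disallow: the page is crawled and the noindex is honoured.
print(crawler_outcome("", "https://example.com/private/page.html", NOINDEX_PAGE))
```

The first call prints the "not crawled" result even though the page carries a noindex tag, which is the whole point: the disallow check happens before the page is ever read. So if the goal is to keep a page out of the index, the page has to stay crawlable so the noindex can be seen.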