ergophobe - 2:18 pm on Aug 30, 2013 (gmt 0)
If you don't know where to redirect to, how do you know what to block? You must have some rule that allows you to figure that out.
Which brings us back to Lucy24's point. This is fundamental:
robots.txt controls the crawl, not the index.
Disallow in robots.txt doesn't tell Google not to index a page, it tells it not to crawl it. If you have enough inbound links to that page, Google might choose to index it anyway, and it would be fully compliant with the robots.txt protocol in doing so. Disallow is more about not wasting server resources and not spending your crawl budget; it is only secondarily, and in a sense accidentally, about what's in the index.
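For example, a rule like this (made-up /search/ path, substitute your own) tells compliant bots not to fetch anything under that directory, but it says nothing about whether those URLs show up in search results:

    User-agent: *
    # block crawling of everything under /search/
    Disallow: /search/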
noindex controls the index, not the crawl
If you noindex a page, on the other hand, you are telling Google to keep it out of the index or remove it from the index, but Google is welcome to crawl it all they want.
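The way you tell it that is with a robots meta tag in the head of the page (or the equivalent X-Robots-Tag HTTP header), something like:

    <!-- keep this page out of the index, but still follow its links -->
    <meta name="robots" content="noindex, follow">

The catch is that Googlebot has to be able to crawl the page to see that tag, so don't also disallow the same page in robots.txt or the noindex will never be read.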
So if you are trying to control duplicate content, robots.txt is NOT generally your tool of choice.
In this particular case, though, it sounds like your primary problem is canonicalization. So rather than adding a noindex to the pages that have extra parameters, you probably want to add a rel=canonical to those pages pointing to the primary page.
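So on a parameterized URL like /widgets?sort=price (made-up example), the head of the page would carry something like:

    <!-- point the parameterized version at the clean URL -->
    <link rel="canonical" href="http://www.example.com/widgets" />

Google treats that as a strong hint to consolidate the parameterized versions onto the clean page.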
google has those pages in its index but is hidding it from me !
If they are hidden from you, how do you know they are in the index? In any case, the more important question is why do you need to know they are in the index? What you really need to figure out is what your URL structure is, what you need for redirects (if any) and canonical tags.
In other words, whether Google has spotted a dupe content issue already (thus reflected in the index) or not (thus not YET reflected in the index), you need to figure out the root causes and fix that.
Telling Google to ignore certain parameters in GWT is a good strategy. A noindex meta tag, rel=canonical and 301 redirects might be good strategies too. A robots.txt disallow probably isn't going to do what you want.
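If you do go the 301 route on Apache, here's a rough .htaccess sketch (assuming a single junk parameter, here called sessionid, and that you're happy dropping the whole query string whenever it's present; adjust to your own URLs):

    RewriteEngine On
    # if the query string contains a sessionid parameter...
    RewriteCond %{QUERY_STRING} (^|&)sessionid= [NC]
    # ...301 to the same path with the query string stripped
    RewriteRule ^(.*)$ /$1? [R=301,L]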