With robots.txt you let Googlebot know that you don't want it to index your pages (it might still do it!).
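For reference, a minimal robots.txt along those lines might look like this (a sketch; the rules are illustrative):

```text
# Ask Googlebot to stay out of the whole site
User-agent: Googlebot
Disallow: /

# All other crawlers may fetch everything
User-agent: *
Disallow:
```

Note this is purely advisory: the crawler has to fetch /robots.txt and choose to obey it.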
With .htaccess you can deny access based on the user-agent (say, anything that has "google" in the UA string) -- Googlebot will not even get to the robots.txt.
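A sketch of that .htaccess approach on Apache, assuming mod_rewrite is enabled (the pattern "google" is just the example from above):

```apache
# Return 403 Forbidden to any client whose User-Agent
# contains "google" (case-insensitive match via [NC])
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} google [NC]
RewriteRule .* - [F,L]
```

Unlike robots.txt, this is enforced server-side, so the bot never sees the page at all.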
How could they possibly stop cloaking?
True, but Bek isn't talking about cloaking.
The objective is simply to exclude Googlebot altogether - something which robots.txt is designed to do and which Googlebot will no doubt honour. There is no intent to deceive.
Apart from mundane infelicities (you sacrifice links to one of these pages), I don't think it will work to keep Google out. I remember reading somewhere that Google doesn't promise not to visit pages it's excluded from; it only promises not to put them in the SERPs. That's how they keep people honest. After all, you could have a single website with three versions of every page at different URLs.
I'm sure this has been tried, and Google's caught it.
I hope you didn't read that here, suidas!
The opposite is the case. If Google sees a link to a /robots.txt excluded URL, then Google will not fetch it. It can still list the URL in the results without having to fetch the page.
If a URL is not /robots.txt excluded, but has a META robots tag with 'noindex', then if Google fetches the URL it will not be listed.
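The META tag in question goes in the page's head; a sketch of a page that Google may fetch but should not list:

```html
<head>
  <!-- Google may crawl this page, but must not list it in results -->
  <meta name="robots" content="noindex">
</head>
```

As noted above, the two mechanisms work against each other here: if the URL is robots.txt-excluded, Google never fetches the page, so it never sees the noindex tag.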
> sure this has been tried, and Google's caught it
That someone has a Web site, and that there's some part of the Web that shouldn't be crawled? I don't see how Google would see a quality issue there, unless they didn't like the site they were crawling.
The 'dual site' approach using the Robots Exclusion Protocol would sacrifice links, though.