With robots.txt you ask Googlebot not to crawl certain URLs (it might still index them!).
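e.g. a minimal robots.txt along these lines, with /private/ standing in for whatever you want kept out:

    # Ask Google's crawler to stay out of this directory
    User-agent: Googlebot
    Disallow: /private/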
With .htaccess you can deny access based on the user agent (say, anything with "google" in the UA string); Googlebot will then never even get as far as robots.txt.
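A rough sketch of that in .htaccess, assuming Apache with mod_setenvif (the "google" pattern and the env variable name are just examples):

    # Flag any request whose User-Agent contains "google" (case-insensitive)
    SetEnvIfNoCase User-Agent "google" block_google
    # Return 403 Forbidden to flagged requests; let everyone else through
    Order Allow,Deny
    Allow from all
    Deny from env=block_google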
Apart from mundane infelicities (you sacrifice links to one of these pages), I don't think it will work to keep Google out. I remember reading somewhere that Google doesn't promise not to visit pages it's excluded from; it only promises not to put them in the SERPs. That's how they keep people honest. After all, you could have a single website with three versions of every page at different URLs.
I'm sure this has been tried, and Google's caught it.
I hope you didn't read that here, suidas!
The opposite is the case. If Google sees a link to a robots.txt-excluded URL, then Google will not fetch it. It can still list the URL in the results without having to fetch the page.
If a URL is not robots.txt-excluded, but has a META robots tag with 'noindex', then once Google fetches the URL it will not list it.
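i.e. the page carries something like this in its <head>:

    <!-- tells crawlers that fetch this page not to list it in results -->
    <meta name="robots" content="noindex">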
> sure this has been tried, and Google's caught it
That someone has a Web site, and that there's some part of the Web that shouldn't be crawled? I don't see how Google would see a quality issue there, unless they didn't like the site they were crawling.
The 'dual site' approach using the Robots Exclusion Protocol would sacrifice links, though.