Best way to prevent indexing of https pages
Some things I think might work, with pros and cons, but to be honest, I don't know how to do some of them.
1. Using a wildcard in the robots.txt file. Probably the easiest; however, wildcards are only respected by Google (and not Bing / Yahoo), and robots.txt isn't always reliable either.
2. Writing a PHP script which dynamically inserts noindex, nofollow meta tags (roughly what I sketch below). I don't want to have to edit an entire site to add stuff like this.
Similarly, using PHP to deliver two different robots.txt files (one for secure, and one for non-secure). Again, way too much work.
3. Using 301 redirects (e.g. sending https URLs to their http equivalents). Even messier.
4. Placing the https site in a different folder. Not really an option for me.
5. Sending an X-Robots-Tag header via .htaccess. This would be great if I knew an .htaccess rule that allowed me to do this; I am fairly new to .htaccess stuff.
Can someone provide a working example?
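For reference, option 2 would mean adding something roughly like this to the head of every template (untested, and it assumes PHP can see $_SERVER['HTTPS'] on the secure host), which is exactly the kind of site-wide edit I'd rather avoid:

<?php
// hypothetical include: only emit the noindex tag when the request came in over HTTPS
if (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off') {
    echo '<meta name="robots" content="noindex, nofollow">' . "\n";
}
?>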
this older thread will probably give you some good insights
Yes, that is great info, and I just implemented it.
However, it is known that Google will still index a given page if there are enough links to it, ignoring the robots.txt rules.
Perhaps this is a good thread to discuss the X-Robots-Tag and how to implement it for the same purpose, just so there's some documentation on the subject?
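Here is roughly what I'm imagining for the .htaccess approach, but I have no idea if the syntax is right. It assumes mod_rewrite and mod_headers are enabled, and that http and https are served from the same document root:

# flag requests that arrive over HTTPS, then send the header only for those
RewriteEngine On
RewriteCond %{HTTPS} on
RewriteRule .* - [E=SECURE:1]
Header set X-Robots-Tag "noindex, nofollow" env=SECURE

I've also read that on some setups the variable ends up as REDIRECT_SECURE after the rewrite, so the last line might need env=REDIRECT_SECURE instead. Can anyone confirm?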
i think there might be some confusion.
google does honor and abide by robots.txt and will not crawl or index a page if it is properly disallowed.
if google does discover links pointing to a url, they will include the url in their search index (as a url-only listing) but will not crawl it or index its content if that url is disallowed by robots.txt.
url discovery is very different from crawling and indexing. by finding links on other websites and looking at the anchor text, plus incorporating url listing information from Yahoo, Best of the Web and the DMOZ directory, you can learn a lot about a url without ever having to crawl or index it. this is one way to abide by robots.txt and still include as many urls in the search index as possible.
The robots.txt file only contains the so-called crawler directives, telling search engines, identified by their User-agent:, where they are not allowed to go by using Disallow:, where they can (and should) go by using Allow:, and by pointing them at a Sitemap:.
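For illustration only, a minimal robots.txt using those directives might look like this (the paths and sitemap URL are made up):

User-agent: *
Disallow: /secure/
Allow: /secure/login.html
Sitemap: http://www.example.com/sitemap.xml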
As Sebastian pointed out and explains thoroughly in another brilliant post, pages that search engines aren't allowed to spider can still show up in the search results when they have enough links pointing at them. This basically means that if you want to really hide something from the search engines, and thus from people using search, robots.txt just isn't good enough.
[edited by: goodroi at 2:41 pm (utc) on Dec. 6, 2009]
[edit reason] Please do not republish entire blog posts [/edit]
having urls show up in the search results is very different from ignoring robots.txt. google does follow robots.txt. technically google has followed robots.txt 99.9% of the time, because there has been the occasional programming glitch with googlebot. those involved very rare and unique situations, and once google was notified of them they worked to fix them so they would handle robots.txt properly the next time.