Forum Moderators: goodroi
1. Using a wildcard in the robots.txt file. Probably the easiest; however, this is only respected by Google (and not Bing / Yahoo). Robots.txt isn't always reliable either.
2. Writing a PHP script which dynamically inserts noindex, nofollow meta tags. I don't want to have to edit an entire site to add stuff like this.
Similarly, using PHP to deliver two different robots.txt files (one for secure, and one for non-secure). Again, way too much work.
3. Using 301 redirects. Even messier.
4. Placing the https site in a different folder. Not really an option for me.
5. X-Robots-Tag in htaccess. This would be great, if I knew an htaccess rule that allowed me to do this. I am fairly new to htaccess stuff.
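For reference, the "two different robots.txt files" idea from option 2 comes down to choosing the response body by request scheme. The sketch below shows that logic as a small Python WSGI app rather than PHP. The two robots.txt bodies, the function names, and the reliance on `wsgi.url_scheme` are assumptions for illustration; behind a proxy that terminates SSL, the scheme would have to be detected some other way.

```python
# Hypothetical robots.txt bodies: allow everything on http, block everything on https.
ROBOTS_HTTP = "User-agent: *\nDisallow:\n"
ROBOTS_HTTPS = "User-agent: *\nDisallow: /\n"


def robots_txt_for(scheme):
    """Return the robots.txt body to serve for the given request scheme."""
    return ROBOTS_HTTPS if scheme.lower() == "https" else ROBOTS_HTTP


def robots_app(environ, start_response):
    """Minimal WSGI app that serves a scheme-specific robots.txt."""
    if environ.get("PATH_INFO") != "/robots.txt":
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    # wsgi.url_scheme is "http" or "https" as seen by the WSGI server.
    scheme = environ.get("wsgi.url_scheme", "http")
    body = robots_txt_for(scheme).encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    return [body]
```

The site content itself is untouched; only the one robots.txt URL is answered differently depending on how it was requested.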
Can someone provide a working example?
Perhaps this is a good thread to discuss the X-Robots-Tag and how to implement it for the same purpose? Just so there's some documentation on the subject.
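To make the X-Robots-Tag mechanism concrete: it is an HTTP response header, so the goal is to send `X-Robots-Tag: noindex, nofollow` on https responses only. The sketch below is not an htaccess rule (on Apache the header would normally be set through mod_headers); it shows the same logic as a Python WSGI wrapper. The names `noindex_https`, `hello_app` and `response_headers` are made up for illustration, and it assumes the server reports the scheme in `wsgi.url_scheme`.

```python
def noindex_https(app, directives="noindex, nofollow"):
    """Wrap a WSGI app so that https responses carry an X-Robots-Tag header."""
    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            if environ.get("wsgi.url_scheme") == "https":
                # Drop any existing X-Robots-Tag, then add ours.
                headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
                headers.append(("X-Robots-Tag", directives))
            return start_response(status, headers, exc_info)
        return app(environ, patched_start)
    return wrapped


def hello_app(environ, start_response):
    """Hypothetical stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


def response_headers(app, scheme):
    """Call a WSGI app once and return its response headers as a dict."""
    seen = {}

    def start_response(status, headers, exc_info=None):
        seen.update(headers)

    list(app({"wsgi.url_scheme": scheme, "PATH_INFO": "/"}, start_response))
    return seen


site = noindex_https(hello_app)
print(response_headers(site, "https"))
print(response_headers(site, "http"))
```

One thing to keep in mind: a crawler only sees this header if it is allowed to request the page, so the https URLs must not also be disallowed in robots.txt.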
Google does honor and abide by robots.txt and will not crawl a page, or index its content, if it is properly disallowed.
If Google discovers links pointing to a URL, they may still include the URL itself in their search index, but they will not crawl it or index its content if that URL is disallowed by robots.txt.
URL discovery is very different from crawling and indexing. By finding links on other websites and looking at the anchor text, plus incorporating URL listing information from Yahoo, Best of the Web and the DMOZ directory, you can learn a lot about a URL without ever having to crawl or index it. This is one way to abide by robots.txt and still include as many URLs in the search index as possible.
Crawler directives
The robots.txt file only contains the so-called crawler directives, telling search engines, identified by their User-agent:, where they are not allowed to go by using Disallow: and where they can (and should) go by using Allow:, and by pointing them at a Sitemap:.
As Sebastian pointed out and explains thoroughly in another brilliant post, pages that search engines aren't allowed to spider can still show up in the search results when they have enough links pointing at them. This basically means that if you want to really hide something from the search engines, and thus from people using search, robots.txt just isn't good enough.
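The crawler directives described above can be tried out with Python's standard `urllib.robotparser` module. This is only a sketch: the robots.txt content, the example.com URLs and the /secure/ path are placeholders, and this parser uses simple prefix matching with the first matching rule winning, which is not necessarily how any particular search engine evaluates the file. `site_maps()` needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; example.com and /secure/ are placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /secure/
Allow: /
Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, agent="Googlebot"):
    """True if the parsed robots.txt lets this user agent fetch the URL."""
    return parser.can_fetch(agent, url)


print(may_crawl("http://www.example.com/index.html"))        # True
print(may_crawl("http://www.example.com/secure/login.php"))  # False
print(parser.site_maps())  # ['http://www.example.com/sitemap.xml']
```

Note what the check answers: whether a URL may be fetched. Nothing in the file says whether a URL may be listed, which is why a disallowed page can still appear in results.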
[edited by: goodroi at 2:41 pm (utc) on Dec. 6, 2009]
[edit reason] Please do not republish entire blog posts [/edit]