How do I prevent the SEs from crawling any links to my site w/HTTPS



I don't want the crawlers to index anything w/ HTTPS protocol

ZydoSEO

5:28 pm on Nov 29, 2007 (gmt 0)

What is the best way to prevent this? I don't want the crawler to follow and index any pages on my site pointed to by internal or external links using the https protocol. The https portion of our site is our transactional portion which we don't want indexed... The http portion is where the home page and all of the content resides that we do want indexed.

I just inherited this huge site... The previous SEO person has an entry in robots.txt to disallow /https*, which I am 99.9999999% sure will not work. I am guessing this only disallows www.mydomain.com/https* and mydomain.com/https* type URLs, i.e. paths that begin with /https. I don't think you can disallow protocols and domains there.
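To illustrate why that rule probably misses the target, here is a minimal Python sketch of how a crawler that supports the `*` and `$` wildcard extensions might compare a Disallow rule against a URL. The helper names `rule_to_regex` and `is_disallowed` are made up for this example, and real crawlers may differ in detail; the point is only that the rule is compared against the path, never the protocol.

```python
import re
from urllib.parse import urlsplit


def rule_to_regex(rule):
    """Turn a Disallow rule into a regex over the URL path (hypothetical helper).

    '*' matches any run of characters; a trailing '$' anchors the end.
    Without '$' the rule behaves as a prefix match.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))


def is_disallowed(url, rule):
    """Check a URL against one rule using only its path and query string."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # The scheme (http/https) and host are deliberately ignored here:
    # robots.txt rules are matched against the path only.
    return rule_to_regex(rule).match(path) is not None


if __name__ == "__main__":
    rule = "/https*"
    for url in (
        "https://www.mydomain.com/checkout",
        "http://www.mydomain.com/https-info.html",
    ):
        print(url, "blocked" if is_disallowed(url, rule) else "allowed")
```

Under these assumptions the secure checkout URL stays crawlable, while an ordinary page whose path happens to start with /https is the only thing the rule catches.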

One would think this would be a common request for which there would be a simple solution.

Suggestions?

PS: I'm running on a Microsoft platform Windows Server 2003, IIS 6.0...

encyclo

9:51 pm on Dec 1, 2007 (gmt 0)

You can't do this with robots.txt alone. Are you saying that the whole site is available under both http and https? Is the site dynamic (ASP, .NET etc.)?

If yes, then your basic options are:

1. Move the transactional sections to a subdomain, e.g. secure.example.com, then use a robots.txt under the subdomain which excludes the bots.

2. Cloak the robots.txt file to display a different file when the request is made via https (if you can rewrite the robots.txt file to a dynamic file this can work well).

3. In your application, add the appropriate code to include a meta robots noindex element for all pages served by https.

The last option is often the easiest to set up.
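Options 2 and 3 both come down to branching on the request scheme. The original poster is on IIS/ASP, so the following is only a language-neutral sketch written as a small Python WSGI app; the function names, the page markup and the two robots.txt bodies are assumptions for illustration, not anything from the actual site.

```python
# robots.txt served over plain HTTP: everything may be crawled.
ROBOTS_HTTP = "User-agent: *\nDisallow:\n"
# robots.txt served over HTTPS: nothing may be crawled.
ROBOTS_HTTPS = "User-agent: *\nDisallow: /\n"


def robots_txt_for(scheme):
    """Option 2: choose the robots.txt body from the request scheme."""
    return ROBOTS_HTTPS if scheme == "https" else ROBOTS_HTTP


def robots_meta_for(scheme):
    """Option 3: emit a meta robots noindex element only on HTTPS pages."""
    if scheme == "https":
        return '<meta name="robots" content="noindex">'
    return ""


def app(environ, start_response):
    """Tiny WSGI app showing both options (placeholder page content)."""
    # WSGI exposes the scheme of the current request under this key.
    scheme = environ.get("wsgi.url_scheme", "http")
    if environ.get("PATH_INFO") == "/robots.txt":
        body = robots_txt_for(scheme)
        content_type = "text/plain"
    else:
        body = (
            "<html><head>%s<title>Example page</title></head>"
            "<body>Placeholder content</body></html>" % robots_meta_for(scheme)
        )
        content_type = "text/html"
    data = body.encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", content_type), ("Content-Length", str(len(data)))],
    )
    return [data]
```

The same two checks translate directly to ASP or .NET: read whatever the platform exposes as the request scheme, then pick the robots.txt body or add the meta element accordingly.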

incrediBILL

8:37 pm on Dec 9, 2007 (gmt 0)

Typically you just block the first transaction page in robots.txt, which stops the crawl from reaching the rest of the pages. The problem is that you've already got them indexed, so now you need to block each page name, and I would also add a meta "NOINDEX" to those pages.

If the crawler is indexing other pages in your site using HTTPS as well, you may need to permanently redirect (301) just the crawler back to HTTP to stop this.
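A rough sketch of that crawler-only redirect, written as Python WSGI middleware rather than for IIS. The user-agent tokens in `CRAWLER_TOKENS` and the names `redirect_target` and `crawler_redirect` are assumptions chosen for the example; a real setup would need its own crawler list and its own way of detecting HTTPS.

```python
from urllib.parse import quote

# Assumed substrings identifying crawler user agents (illustrative only).
CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")


def is_crawler(user_agent):
    """Return True if the user-agent string contains a known crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def redirect_target(environ):
    """Return the HTTP URL to redirect to, or None if no redirect applies."""
    if environ.get("wsgi.url_scheme") != "https":
        return None
    if not is_crawler(environ.get("HTTP_USER_AGENT", "")):
        return None
    host = environ.get("HTTP_HOST") or environ.get("SERVER_NAME", "")
    # Drop an explicit port such as :443 (simplification: no IPv6 literals).
    host = host.split(":")[0]
    url = "http://" + host + quote(environ.get("PATH_INFO", "/"))
    query = environ.get("QUERY_STRING", "")
    if query:
        url += "?" + query
    return url


def crawler_redirect(app):
    """Wrap a WSGI app so crawlers on HTTPS get a 301 to the HTTP URL."""
    def wrapped(environ, start_response):
        target = redirect_target(environ)
        if target is None:
            # Ordinary visitors and all HTTP requests pass through untouched.
            return app(environ, start_response)
        start_response(
            "301 Moved Permanently",
            [("Location", target), ("Content-Length", "0")],
        )
        return [b""]
    return wrapped
```

Ordinary visitors keep using HTTPS for the transactional pages; only requests that look like crawlers are sent back to the HTTP version of the same path.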