
Sitemaps, Meta Data, and robots.txt Forum

    
How do I prevent the SEs from crawling any links to my site w/HTTPS
I don't want the crawlers to index anything w/ HTTPS protocol
ZydoSEO




 5:28 pm on Nov 29, 2007 (gmt 0)


What is the best way to prevent this? I don't want the crawler to follow and index any pages on my site that are pointed to by internal or external links using the https protocol. The https portion of our site is the transactional portion, which we don't want indexed... The http portion is where the home page and all of the content we do want indexed resides.

I just inherited this huge site... The previous SEO person has an entry in the robots.txt to disallow /https*, which I am 99.9999999% sure will not work. I am guessing this only disallows www.mydomain.com/https* and mydomain.com/https* type URLs. I don't think you can disallow protocols or domains there.

One would think this would be a common request for which there would be a simple solution.

Suggestions?

PS: I'm running on a Microsoft platform (Windows Server 2003, IIS 6.0)...

 

encyclo




 9:51 pm on Dec 1, 2007 (gmt 0)

You can't do this with robots.txt alone. Are you saying that the whole site is available under both http and https? Is the site dynamic (ASP, .NET etc.)?

If yes, then your basic options are:

1. Move the transactional sections to a subdomain, e.g. secure.example.com, then use a robots.txt under the subdomain which excludes the bots.

2. Cloak the robots.txt file so that a different version is served when the request is made via https (if you can rewrite the robots.txt URL to a dynamic file, this can work well).

3. In your application, add the appropriate code to include a meta robots noindex element for all pages served by https.

The last option is often the easiest to set up.
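
If the site is ASP.NET, option 3 can be handled with a small base page (or the same check in a master page). This is only a rough sketch and the class name SecureAwarePage is made up, but Request.IsSecureConnection is the property that tells you whether the page came in over https:

// Sketch for option 3: add a meta robots noindex tag to any page served
// over https. SecureAwarePage is just an illustrative name for a base
// class your pages could inherit from instead of System.Web.UI.Page.
using System;
using System.Web.UI;
using System.Web.UI.HtmlControls;

public class SecureAwarePage : Page
{
    protected override void OnPreRender(EventArgs e)
    {
        base.OnPreRender(e);

        // True when the current request arrived over https
        if (Request.IsSecureConnection && Header != null)
        {
            HtmlMeta robotsMeta = new HtmlMeta();
            robotsMeta.Name = "robots";
            robotsMeta.Content = "noindex, nofollow";
            Header.Controls.Add(robotsMeta); // needs <head runat="server">
        }
    }
}

Your pages then inherit from SecureAwarePage rather than Page, and nothing changes for plain http requests.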

incrediBILL




 8:37 pm on Dec 9, 2007 (gmt 0)

Typically you just block the first transaction page in robots.txt, which blocks the crawl to the rest of the pages. The problem is that you've already got them indexed, so now you need to block each page name, and I would also add a meta "NOINDEX" to those pages as well.
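
Something along these lines in robots.txt, where the paths are just placeholders - substitute the real entry pages of your transactional section:

User-agent: *
Disallow: /checkout/
Disallow: /cart.aspx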

If the crawler is indexing other pages on your site via HTTPS as well, you may need to permanently redirect just the crawler back to HTTP to stop this.
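
On IIS 6 / ASP.NET, one way to sketch that is a check in Global.asax. The user-agent substrings below are only examples, not a complete bot list:

// In Global.asax.cs (needs: using System; using System.Web;)
// Rough sketch: 301 known crawlers from https back to the http version
// of the same URL. Adjust the bot list to what you actually see in logs.
protected void Application_BeginRequest(object sender, EventArgs e)
{
    HttpRequest req = HttpContext.Current.Request;

    string ua = req.UserAgent ?? "";
    bool isBot = ua.IndexOf("Googlebot", StringComparison.OrdinalIgnoreCase) >= 0
              || ua.IndexOf("Slurp", StringComparison.OrdinalIgnoreCase) >= 0
              || ua.IndexOf("msnbot", StringComparison.OrdinalIgnoreCase) >= 0;

    if (req.IsSecureConnection && isBot)
    {
        HttpResponse resp = HttpContext.Current.Response;
        resp.StatusCode = 301; // permanent redirect
        resp.AddHeader("Location", "http://" + req.Url.Host + req.RawUrl);
        resp.End();
    }
}

Regular visitors still get the https pages; only the named crawlers get bounced back to http.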
