
Sitemaps, Meta Data, and robots.txt Forum

    
How do I prevent the SEs from crawling any links to my site w/HTTPS
I don't want the crawlers to index anything w/ HTTPS protocol
ZydoSEO (WebmasterWorld Senior Member) posted 5:28 pm on Nov 29, 2007 (gmt 0):

What is the best way to prevent this? I don't want crawlers to follow and index any pages on my site that are pointed to by internal or external links using the https protocol. The https portion of our site is the transactional portion, which we don't want indexed. The http portion is where the home page and all of the content that we do want indexed resides.

I just inherited this huge site. The previous SEO person put an entry in robots.txt to disallow /https*, which I am 99.9999999% sure will not work. I am guessing this only disallows www.mydomain.com/https* and mydomain.com/https* type URLs. I don't think you can disallow protocols or domains there.
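To illustrate, the inherited entry presumably looks something like this (my reconstruction, so treat it as a sketch):

    User-agent: *
    Disallow: /https*

As I understand it, robots.txt rules match only the URL path, never the protocol or the hostname, so at best this blocks path-based URLs like http://www.mydomain.com/https-whatever, and it does nothing at all about https://www.mydomain.com/... URLs.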

One would think this would be a common request for which there would be a simple solution.

Suggestions?

PS: I'm running on a Microsoft platform: Windows Server 2003, IIS 6.0.

 

encyclo (WebmasterWorld Senior Member, Top Contributor of All Time) posted 9:51 pm on Dec 1, 2007 (gmt 0):

You can't do this with robots.txt alone. Are you saying that the whole site is available under both http and https? Is the site dynamic (ASP, .NET etc.)?

If yes, then your basic options are:

1. Move the transactional sections to a subdomain, e.g. secure.example.com, then use a robots.txt under the subdomain which excludes the bots (see the first sketch below).

2. Cloak the robots.txt file so that a different file is served when the request is made via https (if you can rewrite robots.txt to a dynamic script, this can work well; sketched below).

3. In your application, add the appropriate code to emit a meta robots noindex element on every page served via https (also sketched below).

The last option is often the easiest to set up.
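To give an idea of what these look like in practice, here are some rough sketches. They're untested, and the hostnames, paths, and mappings are all hypothetical. For option 1, the robots.txt served at secure.example.com would just be a blanket exclusion:

    User-agent: *
    Disallow: /

For options 2 and 3 on IIS 6 you can check the HTTPS server variable from classic ASP. A cloaked robots file (option 2) might be a robots.asp script mapped to /robots.txt; note that IIS 6 has no built-in URL rewriting, so you'd need a third-party tool such as ISAPI_Rewrite to do that mapping:

    <%@ Language="VBScript" %>
    <%
    ' Serve a blanket exclusion over https, a permissive file over http.
    Response.ContentType = "text/plain"
    Response.Write "User-agent: *" & vbCrLf
    If Request.ServerVariables("HTTPS") = "on" Then
        Response.Write "Disallow: /"
    Else
        Response.Write "Disallow:"
    End If
    %>

Option 3 is just a conditional include in the <head> of your page template:

    <%
    ' Emit a noindex element only when the page is served over https.
    If Request.ServerVariables("HTTPS") = "on" Then
        Response.Write "<meta name=""robots"" content=""noindex,nofollow"">"
    End If
    %>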

incrediBILL (WebmasterWorld Administrator, Top Contributor of All Time) posted 8:37 pm on Dec 9, 2007 (gmt 0):

Typically you just block the first transaction page in robots.txt, which blocks the crawl to the rest of the pages. The problem is that you've already got them indexed, so now you need to block each page name, and I would also add a meta "noindex" to those pages as well.
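For example, if the checkout flow starts at a single entry page, something like this in robots.txt keeps the crawl from ever reaching the rest of the flow (path hypothetical):

    User-agent: *
    Disallow: /checkout/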

If the crawler is indexing other pages on your site via HTTPS as well, you may need to permanently redirect just the crawler back to HTTP to stop this (a sketch follows).
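A rough sketch of that in classic ASP, to match the IIS 6 setup above. The user-agent list and hostname are hypothetical, and query strings are ignored for brevity:

    <%
    ' 301 known crawlers from https back to the http version of the page.
    Dim ua
    ua = LCase(Request.ServerVariables("HTTP_USER_AGENT"))
    If Request.ServerVariables("HTTPS") = "on" And _
       (InStr(ua, "googlebot") > 0 Or InStr(ua, "slurp") > 0 Or InStr(ua, "msnbot") > 0) Then
        Response.Status = "301 Moved Permanently"
        Response.AddHeader "Location", "http://www.example.com" & Request.ServerVariables("URL")
        Response.End
    End If
    %>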
