What is the best way to prevent this? I don't want the crawler to follow and index any pages on my site pointed to by internal or external links using the https protocol. The https portion of our site is our transactional portion which we don't want indexed... The http portion is where the home page and all of the content resides that we do want indexed.
I just inherited this huge site... The previous SEO person has an entry in robots.txt to disallow /https*, which I am 99.9999999% sure will not work. I am guessing this only disallows URLs like www.mydomain.com/https* and mydomain.com/https*. I don't think you can disallow protocols and domains there.
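To sanity-check that guess, here is a minimal sketch using Python's standard-library robots.txt parser. The two URLs are made up for illustration, and the trailing "*" is dropped from the rule because this parser does plain path-prefix matching. Other crawlers may treat wildcards differently, so take this as an illustration of the path-only matching idea rather than proof of how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# Rule modelled on the one described above; the trailing "*" is dropped
# because this parser matches plain path prefixes.
ROBOTS_TXT = """\
User-agent: *
Disallow: /https
"""

def is_allowed(url: str, agent: str = "Googlebot") -> bool:
    """Return True if the rules above let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)

# Hypothetical URLs, for illustration only.
secure_page = "https://www.mydomain.com/checkout/cart.asp"
path_named_https = "http://www.mydomain.com/https-info.html"

print(is_allowed(secure_page))       # rule is compared against the path, not the protocol
print(is_allowed(path_named_https))  # a path that starts with /https is what the rule matches
```

In this sketch the rule is only ever compared against the path part of the URL, which is consistent with the suspicion that the entry does nothing for https:// pages.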
One would think this would be a common request for which there would be a simple solution.
PS: I'm running on a Microsoft platform Windows Server 2003, IIS 6.0...
Typically you just block the first transaction page in robots.txt, which blocks the crawl to the rest of the pages. The problem is that you've already got them indexed, so now you need to block each page name, and I would also add a meta "NOINDEX" tag to those pages.
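As a rough sketch of the per-page approach, the snippet below builds a robots.txt body with one Disallow line per page and checks it with Python's standard-library parser. The page names are hypothetical placeholders, not the real ones from the site, and the meta tag shown is the standard robots meta tag for the head of each page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical transactional page names; substitute the real ones.
TRANSACTION_PAGES = [
    "/checkout/cart.asp",
    "/checkout/payment.asp",
    "/account/login.asp",
]

# Standard robots meta tag to place in the <head> of each of those pages.
NOINDEX_META = '<meta name="robots" content="noindex">'

def build_robots_txt(paths):
    """Build a robots.txt body with one Disallow line per page path."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {p}" for p in paths]
    return "\n".join(lines) + "\n"

def blocked(robots_txt, url, agent="Googlebot"):
    """Return True if `robots_txt` forbids `agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)

robots_txt = build_robots_txt(TRANSACTION_PAGES)
print(robots_txt)
```

Two things to keep in mind: these rules match on path, so they apply to a page with that path whether it is requested over HTTP or HTTPS, and a crawler that honors the Disallow line won't fetch the page, so it may never see the meta tag.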
If the crawler is indexing other pages in your site using HTTPS as well, you may need to permanently redirect (301) just the crawler back to HTTP to stop this.
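The decision logic for that redirect could look something like the sketch below. It is written in Python purely to show the idea; how you hook it into IIS is not shown. The function name and the list of user-agent substrings are assumptions for illustration, so adjust them for the crawlers you actually care about.

```python
from typing import Optional, Tuple

# Illustrative substrings found in common crawler User-Agent headers.
CRAWLER_TOKENS = ("googlebot", "bingbot", "msnbot", "slurp")

def crawler_redirect(scheme: str, host: str, path: str,
                     user_agent: str) -> Optional[Tuple[int, str]]:
    """Return (301, http_url) if a crawler requested an HTTPS page, else None."""
    is_crawler = any(tok in user_agent.lower() for tok in CRAWLER_TOKENS)
    if scheme.lower() == "https" and is_crawler:
        # Permanent redirect to the same host and path over plain HTTP.
        return 301, f"http://{host}{path}"
    # Regular visitors, and crawlers already on HTTP, are left alone.
    return None

# Hypothetical request, for illustration only.
print(crawler_redirect(
    "https", "www.mydomain.com", "/about.asp",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
))
```

Regular browsers on HTTPS get None here, so your customers on the transactional pages are not affected.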