Forum Moderators: Robert Charlton & goodroi


How to treat HTTPS versions of your site?


ATWeb

12:27 am on Jul 12, 2008 (gmt 0)

10+ Year Member



I provide an SSL (HTTPS) version of my site to give my users extra privacy. The content is identical for the HTTP and HTTPS versions.

I have been unable to find ANY info about how search engines treat HTTPS versions of sites. The HTTP version is the "main" version of my site.

I have now put a robots "noindex" meta element on all HTTPS pages. I fear that this will be misinterpreted and somehow hurt my SEO...

What do you say? What should I do? Neither Google Sitemaps nor Google Help has any info on this at all.

I have asked tons of people but nobody has got a clue.

[edited by: tedster at 12:48 am (utc) on July 12, 2008]

tedster

1:19 am on Jul 12, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello ATWeb, and welcome to the forums. You're right that you don't want Google to index two URLs for the same content - not for any reason, and certainly not because of the https protocol. Here's what Google has to say:

Each port must have its own robots.txt file. In particular, if you serve
content via both http and https, you'll need a separate robots.txt file for each
of these protocols. For example, to allow Googlebot to index all http pages
but no https pages, you'd use the robots.txt files below.

For your http protocol (http://yourserver.com/robots.txt):

User-agent: *
Allow: /

For the https protocol (https://yourserver.com/robots.txt):

User-agent: *
Disallow: /

Webmaster Help Center
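A quick way to sanity-check those two files is Python's standard-library robots.txt parser. This is just a sketch - the hostnames are Google's placeholders from the quote above, and Google's own matcher may not behave identically to the stdlib one:

```python
from urllib.robotparser import RobotFileParser

# The http robots.txt Google suggests: allow everything.
http_rules = RobotFileParser()
http_rules.parse(["User-agent: *", "Allow: /"])

# The https robots.txt Google suggests: disallow everything.
https_rules = RobotFileParser()
https_rules.parse(["User-agent: *", "Disallow: /"])

print(http_rules.can_fetch("Googlebot", "http://yourserver.com/page.html"))    # True
print(https_rules.can_fetch("Googlebot", "https://yourserver.com/page.html"))  # False
```

The key point is simply that each protocol gets its own file, fetched from its own root, so the two rule sets never interact.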

Technical details are available around the forums and can depend on what server you are using, usually Apache [webmasterworld.com] or Windows IIS [webmasterworld.com]. For instance, there's good information in this thread [webmasterworld.com], or you can find more via Site Search [webmasterworld.com].
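On Apache, for example, one common way to serve a separate robots.txt over SSL is a mod_rewrite rule. This is a hedged sketch: `robots_ssl.txt` is a hypothetical filename, and it assumes both protocols are served from the same document root with mod_rewrite enabled:

```apache
# Hypothetical .htaccess fragment: when the request arrives over SSL,
# answer requests for /robots.txt with a stricter file instead.
RewriteEngine On
RewriteCond %{HTTPS} on
RewriteRule ^robots\.txt$ /robots_ssl.txt [L]
```

With that in place, `robots_ssl.txt` would hold the "Disallow: /" rules from the Google quote above, while the plain `robots.txt` keeps the permissive rules.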

Here's another approach that also works: Serve secure versions of your pages only from a dedicated subdomain, such as secure.example.com. Then use robots.txt to disallow spidering of that subdomain.

ATWeb

10:15 am on Jul 12, 2008 (gmt 0)

10+ Year Member



Ah. I already put the noindex meta elements on all the HTTPS pages, but I will add the HTTPS robots.txt as well. (Or is that redundant?)

"User-agent: *
Allow: /"

Are you sure that there is an "Allow" directive? You probably know best, but I was under the impression that you had to use an empty "Disallow: " to mean "allow everything"...

tedster

10:33 am on Jul 12, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Allow is an extension to the robots.txt standard that Google follows.

The Allow extension
Googlebot recognises an extension to the robots.txt standard called Allow. This extension may not be recognized by all other search engine bots, so check with other search engines in which you are interested to find out. The Allow line works exactly like the Disallow line. Simply list a directory or page that you want to allow.

[google.com...]