By the way, using the removal tool in this situation has caused a LOT of trouble for some sites, with both secure and regular URLs disappearing.
Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols. For example, to allow Googlebot to index all http pages but no https pages, you'd use the robots.txt files below.
For your http protocol (http://yourserver.com/robots.txt):
User-agent: *
Allow: /
For the https protocol (https://yourserver.com/robots.txt):
User-agent: *
Disallow: /
[google.com...]
Ideally (and this is true for some domains I looked at recently) I would hope that Google will make the https: duplicates a Supplemental Result -- and then they will just gently fade away and not show up in the search results.
I'd like to think that they can sort this out for the average site without needing lots of people to go to extraordinary lengths. I'd like to.
Please confirm that if I simply replace my existing robots.txt with exactly the following:
.............................
For your http protocol (http://www.mysite.co.uk/robots.txt):
User-agent: *
Allow: /
For the https protocol (https://www.mysite.co.uk/robots.txt):
User-agent: *
Disallow: /
....................................
then I will be banning robots from the https:// version of the site, and the existing pages will be removed from the index at the next crawl?
The http and https files are the SAME, in the same root folder. So how do I serve a different robots.txt to the http and https versions?
The certificate was installed on the whole site so that any page prefixed with https loaded as secure.
So how do you install a different robots.txt file on each "port"?
The result is zero rank and zero hits for that page as all links point to the http version.
Normally Google would not choose https over http for the same page, but they have recently, and it's caused us major problems.
The answer IS NOT to host secure pages on the same domain as the non-secure pages.
We have changed our structure now but are still trying to get rid of the https pages in the index so that the http can return!
Yes, I have found out that you can use ASP on IIS to do exactly the above:
Make the robots.txt file an ASP script. You can then have the script check whether the request came in over https or http and output the correct rules.
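For anyone who wants to try this, here's a minimal sketch of the idea in classic ASP (VBScript). On IIS, the HTTPS server variable reports "on" for secure requests; note that you'd also need IIS configured to run robots.txt through the ASP engine (a script mapping for that file or extension), so treat this as a starting point rather than a drop-in file:

<%@ Language="VBScript" %>
<%
' Serve protocol-specific robots.txt rules from one script.
' Assumes IIS is set up to process robots.txt via asp.dll.
Response.ContentType = "text/plain"

If Request.ServerVariables("HTTPS") = "on" Then
    ' Secure (https) request: keep all robots out.
    Response.Write "User-agent: *" & vbCrLf
    Response.Write "Disallow: /"
Else
    ' Regular (http) request: allow everything.
    Response.Write "User-agent: *" & vbCrLf
    Response.Write "Allow: /"
End If
%>

Googlebot should then see the Disallow rules only when it fetches the https URLs, while the http site stays fully crawlable.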