Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Removing https pages from Google with robots.txt


ramachandra

9:48 am on Aug 2, 2007 (gmt 0)

10+ Year Member



Google's URL removal tool is useful for removing indexed pages that you don't want Google to index, but it is limited to http:// pages - I haven't found any option in Webmaster Tools to remove indexed https:// pages.

From what I have read, the way to exclude https:// pages is a robots.txt file, but what if http:// and https:// are served from the same site, with one robots.txt file common to both?

I assumed that blocking the folder containing the https:// pages would keep them out of the index, but Google indexed them anyway.

Now, to deindex the https:// pages, I have added the lines below to the common robots.txt:

Disallow:https://www.example.com/secure/a.asp
......

and also added <meta name="robots" content="noindex,nofollow"> to all the pages which point [....]

Can anyone here tell me whether the method I have implemented to exclude https:// is right?

[edited by: tedster at 3:27 pm (utc) on Aug. 2, 2007]
[edit reason] moved from another location [/edit]

tedster

3:51 pm on Aug 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To my knowledge, robots.txt file syntax does not allow protocols such as http:, https:, ftp:, rtsp: or whatever. You need to serve a dedicated, different file for http://example.com/robots.txt and https://example.com/robots.txt - and your level of access to your hosting server may make that problematic.

The best practice is to install the secure certificate on a dedicated subdomain, such as secure.example.com. This also avoids having all your regular URLs resolve as https - historically, that has caused duplicate URL problems in Google.
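For example, the robots.txt served only at the https address could simply block everything, while the http version keeps the normal crawl rules. This is a sketch of the idea, not any specific site's file:

```
# robots.txt returned for https://example.com/robots.txt only -
# keeps all secure pages out of the index
User-agent: *
Disallow: /
```

The file served at http://example.com/robots.txt would be unchanged.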

bwnbwn

5:15 pm on Aug 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



tedster
I use robots.txt to block indexing of the https pages on several sites. It seems to work - none of those sites has had its https pages indexed in any search engine.

I wish I had known what you just said about using a subdomain, but these sites were set up years ago and it was just too much work to correct the problem.

I use a nofollow attribute on every link to the secure pages, and I also use a rewrite rule so that any search engine request over https gets a disallow-all robots.txt.

My robots.txt file is actually an .aspx file, but it is served to the engines as a .txt file.

It has worked so far.

<%If Request.ServerVariables("HTTPS") = "off" Then 'if not secure%>User-agent: *
Disallow: /admin/
Disallow: /bin/
Disallow: /class/
Disallow: /contentTemplates/
Disallow: /db/
Disallow: /panels/
Disallow: /poll/
Disallow: /articles/files
Disallow: /articles/
<%
else
%>User-agent: *
Disallow: /
<%
end if
%>

<meta name="robots" content="noindex,nofollow"> - I don't think this is correct. Just use a nofollow tag and take out the other stuff that's not needed; I don't think it will be readable by a bot as written.

eltercerhombre

1:31 am on Aug 5, 2007 (gmt 0)

10+ Year Member



I'm not sure exactly how my colleagues at work did it, but I asked them to serve a different robots file for https, and they did it by configuring Apache.

So now, if you request robots.txt over https, an alias (I remember seeing it in the Apache config, but I'm not sure) returns a different robots.txt than over http. When I edit them, I have robots.txt and robots-secure.txt.

No sub-domain for the secure server was needed; that was enough.
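The setup described above can be sketched with Apache's mod_alias in the VirtualHost for the secure side - the hostname and file path here are hypothetical, not the poster's actual config:

```apache
# VirtualHost for the secure (port 443) side of the site
<VirtualHost *:443>
    ServerName www.example.com
    DocumentRoot /var/www/html

    # Requests for /robots.txt over https are served a separate file,
    # while the http VirtualHost keeps serving the regular robots.txt
    Alias /robots.txt /var/www/robots-secure.txt
</VirtualHost>
```

robots-secure.txt would then contain a blanket "Disallow: /" so the https duplicates stay out of the index.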

pageoneresults

1:42 am on Aug 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



On Windows, if you have ISAPI_Rewrite...

RewriteCond %HTTPS ^on$
RewriteRule /robots.txt /robots.https.txt [I,O,L]

Key_Master

1:45 am on Aug 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For Apache:

RewriteEngine on
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]