https - Google.co.uk keeps indexing https - Google Search and SEO forum at WebmasterWorld

Forum Moderators: Robert Charlton & goodroi

https - Google.co.uk keeps indexing https

neilzb

5:56 pm on Nov 21, 2006 (gmt 0)

What on earth is going on? One of my client's sites has always done well. It ranked in the top 10 for a very competitive search until recently, when it was knocked out of Google altogether, for duplicate content I think. That's when they brought me in. So I sorted out the duplicate content and it began to rank well in google.com and google.co.uk again.

The problem is that if you search 'pages from UK only', which is where most of their traffic comes from, the site is nowhere to be found, because that datacenter keeps indexing an https version of the index page. It's only that datacenter, though. Why, and how can I fix this?

tedster

8:11 pm on Nov 21, 2006 (gmt 0)

Here's what Google has to say:

Each port must have its own robots.txt file. In particular, if you serve
content via both http and https, you'll need a separate robots.txt file for each
of these protocols. For example, to allow Googlebot to index all http pages
but no https pages, you'd use the robots.txt files below.

For your http protocol (http://yourserver.com/robots.txt):

User-agent: *
Allow: /

For the https protocol (https://yourserver.com/robots.txt):

User-agent: *
Disallow: /

Source: Webmaster Help Center (http://www.google.com/support/webmasters/bin/answer.py?answer=35302&query=https&topic=&type=)
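
Robots rules are looked up per scheme and host, so the two files quoted above are evaluated independently of each other. The sketch below illustrates that with Python's standard urllib.robotparser. The ROBOTS table and the is_allowed helper are hypothetical names for illustration, and the rules are parsed from strings rather than fetched, so this shows only the matching logic, not Googlebot's actual behaviour.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies, one per scheme, matching the quoted example.
ROBOTS = {
    "http": "User-agent: *\nAllow: /\n",
    "https": "User-agent: *\nDisallow: /\n",
}


def is_allowed(url, agent="Googlebot"):
    """Check a URL against the robots.txt that belongs to its own scheme."""
    parser = RobotFileParser()
    parser.parse(ROBOTS[urlsplit(url).scheme].splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://yourserver.com/", "https://yourserver.com/"):
        print(url, is_allowed(url))
```

With these two rule sets the http URL should come back as allowed and the https URL as blocked, even though the host and path are identical.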

neilzb

2:08 pm on Nov 27, 2006 (gmt 0)

So if I have a [mydomain...] and [mydomain...] the spiders will read them both?

I thought the spiders only looked for an 'http' robots.txt.

Am I understanding this correctly? Do I need both http & https files?

I know I probably sound like I'm asking a stupid question, but I really need to clarify.

WW_Watcher

3:01 pm on Nov 27, 2006 (gmt 0)

OK, but what do you do when the same robots.txt file services both your http & https?

I have done all the recommended redirects with .htaccess for non-www to www and index.* to /, as recommended by the experts here, but I still have issues with G indexing the https along with the http.

Back To Watching
WW_Watcher

WW_Watcher

5:08 pm on Nov 27, 2006 (gmt 0)

OK, I think I answered my own question on the https vs http for me.
I created a new text file called robots_ssl.txt

User-agent: *
Disallow: /

Then I added to my .htaccess under my current redirects.

# On port 443 (https), serve robots_ssl.txt in place of robots.txt
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

Does anyone see any issue with this? Will this stop the indexing of my site as https, and still work fine with http?

I tested http://www.example.com/robots.txt and it shows my normal robots.txt
I tested https://www.example.com/robots.txt and it shows the contents of the robots_ssl.txt
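
The effect of the condition and rule above can be modelled in a few lines of Python. This is only a simulation of the matching logic, not of Apache itself, and the function name resolve_robots is hypothetical. It also shows why the dot in the pattern is worth escaping: left unescaped, it matches any character.

```python
import re


def resolve_robots(port, path, pattern=r"^robots\.txt$"):
    """Return the file that would be served for `path` on `port`.

    Mirrors the condition/rule pair: rewrite only on port 443 and only
    when the request path matches the pattern. In .htaccess context the
    path is matched without its leading slash.
    """
    if port == 443 and re.search(pattern, path):
        return "robots_ssl.txt"
    return path


if __name__ == "__main__":
    # Plain http keeps the normal file.
    print(resolve_robots(80, "robots.txt"))
    # https is switched to the SSL-specific file.
    print(resolve_robots(443, "robots.txt"))
    # With the dot unescaped, an unintended name is rewritten as well.
    print(resolve_robots(443, "robotsXtxt", pattern=r"^robots.txt$"))
```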

Thanks In Advance!
Back To Watching
WW_Watcher

[edited by: tedster at 3:10 am (utc) on Nov. 28, 2006]
[edit reason] use example.com [/edit]

theBear

9:08 pm on Nov 27, 2006 (gmt 0)

Recommended practice is to set up your port 443 service on a different subdomain.

E.g.: secure.mydomain

Then you have two completely separate domain roots and two separate robots.txt files.

I'd be a bit worried about Google getting mixed up by the method you came up with, WW_Watcher.

But hey, I don't have a port 443 service to worry about.

WW_Watcher

2:47 am on Nov 28, 2006 (gmt 0)

Hey theBear, I did not come up with the idea (I ain't that smart ;-)), but I am not sure if I can post the link.

This was one of two solutions they had listed to solve my problem. The other was a PHP include that puts a noindex into the page when the request comes in over https, but I did not want to have to alter every page on the site to add the PHP include.
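
For comparison, the noindex approach might look roughly like this when sketched in Python rather than PHP. The function names and the way the scheme is passed in are assumptions for illustration; the PHP from the original article is not reproduced here.

```python
def robots_meta(scheme):
    """Return a noindex meta tag for https requests, or an empty string."""
    if scheme.lower() == "https":
        # Ask crawlers not to index the secure copy of the page.
        return '<meta name="robots" content="noindex">'
    return ""


def render_head(title, scheme):
    """Build a minimal <head> that includes the tag only over https."""
    return f"<head><title>{title}</title>{robots_meta(scheme)}</head>"


if __name__ == "__main__":
    print(render_head("Home", "http"))
    print(render_head("Home", "https"))
```

A crawler can only see the tag if it is allowed to fetch the page, so this approach is an alternative to a Disallow in the https robots.txt rather than something to combine with it.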

This appears to be working quite well from every way I have looked at it.

I found this solution by searching on how to stop google from indexing https, and found an article written by Dan Johnson, Technical & Marketing Consultant at SEO Workers.

Back To Watching
WW_Watcher

neilzb

10:15 am on Dec 4, 2006 (gmt 0)

So, I have done:

For your http protocol (http://yourserver.com/robots.txt):

User-agent: *
Allow: /

For the https protocol (https://yourserver.com/robots.txt):

User-agent: *
Disallow: /

But guess what? Every other version of Google is fine with it and the site is ranking as it should, apart from google.co.uk 'pages from the UK'. It has stopped indexing https, but it has also stopped indexing the index page altogether. It is still indexing the rest of the site, just not the index page.

What should I try next? This one has got me really stumped!

Narasinha

4:28 am on Dec 8, 2006 (gmt 0)

neilzb wrote:
But guess what? Every other version of Google is fine with it and the site is ranking as it should, apart from google.co.uk 'pages from the UK'. It has stopped indexing https, but it has also stopped indexing the index page altogether. It is still indexing the rest of the site, just not the index page.

Were they previously indexing both the secure and standard pages, or only the secure pages? If the standard pages were not indexed properly, it may take some time for the datacenter to "catch up" with the data. Optionally, you might try disallowing Googlebot specifically rather than using the global disallow on the secure pages.
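
As a rough illustration of the Googlebot-specific variant, the sketch below parses a hypothetical https robots.txt that names Googlebot instead of using the wildcard, again with Python's standard urllib.robotparser. The file contents, the helper name can_crawl, and the second user agent are assumptions, and the parser only approximates how a real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical https robots.txt that names Googlebot instead of using "*".
HTTPS_ROBOTS = "User-agent: Googlebot\nDisallow: /\n"


def can_crawl(agent, url, robots_txt=HTTPS_ROBOTS):
    """Report whether `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "https://www.example.com/"
    for agent in ("Googlebot", "Slurp"):
        print(agent, can_crawl(agent, url))
```

Only the named agent is blocked here; agents with no matching record fall through to the default of being allowed.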

I am assuming you have both the secure and standard pages on the same subdomain, like we did at webnauts (which prompted me to write the above-referenced article).

Dan Johnson

<Sorry, no URLs.
See Terms of Service [webmasterworld.com]>

[edited by: tedster at 5:20 am (utc) on Dec. 8, 2006]

jonescd

10:11 am on Dec 12, 2006 (gmt 0)

OK, but how did you access the files for https? GoDaddy won't give me access to edit my https folder files, and everything I edit in the http folder on my server is duplicated in the https files. So I can't really give https a disallow robots file while at the same time giving http an allow robots file. Can someone please explain how to make this change? I am currently not getting indexed in Google because of this new change tonight.