
http:// and https:// - Duplicate Content?

Site has both versions

     
1:11 pm on Dec 20, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 5, 2002
posts:884
votes: 2


This site is accessible under both http:// and https:// with the same content, and both versions are indexed separately in Google. Will this count as duplicate content in Google's eyes?

Thanks for your feedback.

1:27 pm on Dec 20, 2005 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38070
votes: 16


Yes. The second one found (usually the https version) will get PR0'd... no problem...
1:44 pm on Dec 20, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 21, 2005
posts:2259
votes: 0


But https pages can still show up in a site: search. That suggests to me that Google may have a problem with it. And you can't remove an https page with the removal tool.

Whether it impacts your ranking or not I can't say, but I wouldn't be surprised if Google sees it as duplicate content.

2:09 pm on Dec 20, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 5, 2002
posts:884
votes: 2


Brett, I have never seen https domains with either PR or a zero, only grey. I guess Google doesn't show any details for https sites? But both versions are indexed separately.

Oddsodd, I'm afraid so too.

2:17 pm on Dec 20, 2005 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38070
votes: 16


Regardless of the minutiae of grey vs PR0 - what does it matter if it *is* seen as dupe content?

You should always show the http version first, shouldn't you? And gbot should always run into the https second?

Or do you think it happened the other way around?

2:23 pm on Dec 20, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 21, 2005
posts:2259
votes: 0


what does it matter if it *is* seen as dupe content

Is there an allowed amount of dup content before some trigger is hit?

2:30 pm on Dec 20, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 5, 2002
posts:884
votes: 2


You should always show the http version first, shouldn't you? And gbot should always run into the https second?

Ideally yes, and no problem IF Google treats https the same as http and merges it into http, like it does most of the time with www and non-www. But will Google always do that? More importantly, how can we make sure it doesn't get confused between the two?

Thanks Brett for your answers :)

3:32 pm on Dec 20, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:June 12, 2003
posts:83
votes: 0


... IF Google treats https the same as http and merges it into http, like it does most of the time with www and non-www ...

Don't count on Google merging URLs like this. Put a separate robots.txt on the secure port.

"Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols." [google.com...]

4:50 pm on Dec 20, 2005 (gmt 0)

New User

10+ Year Member

joined:Apr 27, 2004
posts:15
votes: 0


This is a major problem, especially if you don't use absolute links on your site. I made this mistake a year ago, and Google indexed my entire site, all 100k pages, as both http and https. My traffic fell off by about 50% from all search engines. Clearly there is an issue with Google having the same page as both http and https in its index; what exactly it is, I can't tell you. However, once I fixed the problem and removed the https version from my main domain, my traffic went back to where it was before both versions were indexed.

It is also my experience that Google does not follow the robots tag for https. I have actually seen Google start to remove the non-secure pages on the site when you try to use robots.txt to keep the spiders from indexing the secure pages on the root domain.

Here is the fix for this issue.
Make sure you are using absolute links on the site,
e.g. http://www.widgets.com/foo/foo1.html, not /foo/foo1.html.

Always use a separate host name for the secure part of the site, e.g. https://secure.bluewidgets.com.
This also allows you to use another server for the secure pages if needed.
HTTPS puts a big load on the server and actually slows page rendering by about 50%, due to the encryption.

You will have error pages on the site for the removed https URLs, but it is my experience that Google removes these pages quickly. Google removed all 100k https pages from my site in about 45 days.
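
A rough sketch of how the first step could be checked, for anyone who wants to automate it: scan saved HTML pages for relative hrefs, since those inherit https:// whenever the page is reached over the secure port. This is only an illustration using Python's standard library; the script name and the files passed on the command line are assumptions, nothing here comes from asgdrive's setup.

# Flag relative links in HTML files so they can be rewritten as absolute
# http:// URLs (relative links inherit https:// when the page is served
# over the secure port). Illustrative sketch only.
from html.parser import HTMLParser
import sys

class RelativeLinkFinder(HTMLParser):
    def __init__(self, filename):
        super().__init__()
        self.filename = filename

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and not value.startswith(
                    ("http://", "https://", "//", "mailto:", "#")):
                print(f"{self.filename}: relative link {value}")

if __name__ == "__main__":
    # e.g.  python find_relative_links.py page1.html page2.html
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            RelativeLinkFinder(path).feed(f.read())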

Good Luck!

6:07 pm on Dec 20, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 5, 2002
posts:884
votes: 2


coffeebean, Thanks for that snippet from Google.

asgdrive, great info. Much appreciated.

Always use a separate host name for the secure part of the site, e.g. https://secure.bluewidgets.com

Just to clarify, this site is [domain.com...] Should this be changed to something like [secure.domain.com...] and preferably hosted on a different server?

Thanks

7:59 am on Jan 11, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 5, 2002
posts:884
votes: 2


Okay, I asked Matt this on his "SEO advice: url canonicalization" post [mattcutts.com]:

"About Canonicalization - What about https and http versions? I have a site that is indexed for https, in place of http. I am sure this too is a form of canonical URIs and how do you suggest we go about it?"

And he replied -

"McMohan, Google can crawl https just fine, but I might lean toward doing a 301 redirect to the http version (assuming that e.g. the browser doesn’t support cookies, which Googlebot doesn’t)."

How do we do that on IIS? Also, by doing a 301, when a visitor goes to the secure shopping cart part, won't he be redirected back to http?

Any suggestions, highly appreciated.
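
For what it's worth, the logic behind the 301 Matt describes is simple: if the request arrived over https and the path is not part of the secure area (the shopping cart), redirect to the http version; otherwise leave it alone, so visitors aren't bounced out of the cart. The sketch below shows that logic as Python WSGI middleware purely as an illustration; the SECURE_PREFIXES paths and the scheme-detection headers are assumptions about a hypothetical setup, and on IIS you would normally do the same check in an ISAPI rewrite rule or an ASP include rather than in Python.

# Illustrative only: 301 https -> http for pages that don't need to be
# secure, while leaving the cart/checkout paths alone. The path prefixes
# and scheme-detection headers are assumptions, not details from this thread.
SECURE_PREFIXES = ("/cart/", "/checkout/")   # hypothetical secure-only areas

def force_http(app):
    def middleware(environ, start_response):
        scheme = environ.get("HTTP_X_FORWARDED_PROTO",
                             environ.get("wsgi.url_scheme", "http"))
        path = environ.get("PATH_INFO", "/")
        if scheme == "https" and not path.startswith(SECURE_PREFIXES):
            location = "http://" + environ.get("HTTP_HOST", "") + path
            start_response("301 Moved Permanently",
                           [("Location", location), ("Content-Length", "0")])
            return [b""]
        return app(environ, start_response)
    return middleware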

9:14 am on Jan 11, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 8, 2003
posts:1141
votes: 0


I do not understand all this talk about a duplicate content penalty. I have a database-driven website and all content is available three or more times on different URLs. This is perfectly normal, and it happens because the URLs are created dynamically.

What happens is that, for a given search, only one of these pages is shown on the first result pages; the others are somewhere back in the index.

Nevertheless, I read about these duplicate content worries all the time. As far as I know you can have a dozen copies of a single website on your domain. To my knowledge there is no such thing as a duplicate content penalty.

The only problem related to duplicate content that I know about is when someone steals your content, puts it on their website, and has higher PageRank. Then your site might drop in the rankings and the other site will be shown.

9:50 am on Jan 11, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 19, 2004
posts:355
votes: 1


McMohan,

Dupe content, 100%. This was the reason for my site going AWOL in the Florida update.

The way we got round it was by writing a script.

If [domain.com...] was requested, then deliver this robots file:

User-agent: *
Disallow: /

If ht*p://domain.com/robots.txt was requested, then deliver your normal robots.txt file.

Remember, all good robots ask for the robots.txt file, so it doesn't matter whether it's https:// or http://.
As long as the script delivers the correct file you are protected; there's no need to move the site or make it a sub-domain.
You can if you want to, of course.
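
A minimal sketch of the kind of script Vimes is describing: serve a disallow-all robots.txt for requests that arrive on the secure port, and the normal file otherwise. The HTTPS environment-variable check and the robots_normal.txt filename are assumptions about one possible CGI setup, not details from his post; adapt them to whatever your server provides.

# Serve a blocking robots.txt over https and the normal one over http.
# The HTTPS environment variable and the robots_normal.txt filename are
# assumptions about the setup, shown only to illustrate the idea.
import os

DISALLOW_ALL = "User-agent: *\nDisallow: /\n"

def robots_txt():
    if os.environ.get("HTTPS", "off").lower() in ("on", "1"):
        return DISALLOW_ALL                      # secure port: keep robots out
    with open("robots_normal.txt", encoding="ascii") as f:
        return f.read()                          # regular robots.txt for http

if __name__ == "__main__":
    print("Content-Type: text/plain")
    print()
    print(robots_txt(), end="")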

Sorry, but I'm going to have to disagree with Matt here. Google's Big Daddy robot, the Mozilla/5.0 spider, has Java and cookies enabled; we've tested, and both are flagged on.

Hope this helps

Vimes

10:26 am on Jan 11, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 5, 2002
posts:884
votes: 2


Thanks Vimes, that was helpful.

But asgdrive doesn't seem to be in favor of disallowing robots through robots.txt on https. To quote him:

It is also my experience that Google does not follow the robots tag for https. I have actually seen Google start to remove the non-secure pages on the site when you try to use robots.txt to keep the spiders from indexing the secure pages on the root domain.

What's your take on that? Do you see the same risks he does?

10:34 am on Jan 11, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 1, 2002
posts:1580
votes: 0


>This site is accessible both under http:// and https:// with same content and both are indexed separately in Google.

WHY?

Your https content should not allow Google to crawl/index it.

Don't blame Google for your own mistakes!

Clean your own house and then complain about Googlebot, in the meantime you have only yourself to blame!

11:18 am on Jan 11, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 19, 2004
posts:355
votes: 1


McMohan,

Since adding the https robots.txt I've had no https pages requested by "good" robots. Sorry, "requested" is the wrong word; "indexed" is the correct one.
There are still links out there, put up by a competitor, pointing to the [domain.com...]

This is showing as a URL-only listing now, which in my mind is the standard way Google handles links that shouldn't be indexed: partial indexing.
This page occasionally gets requested, but as all "good" bots will ask for the robots.txt file at some point in their journey on your site, it soon bugs out.

It will take time for these pages to disappear, as you can't remove them with the removal tool (or at least you couldn't the last time I looked).

The quicker you act, the sooner they will either disappear or drop to URL-only listings if there are links pointing directly to the pages.
You will suffer from some page inflation for a while, but Big Daddy seems to be handling this a lot better than the old indices, from what I've seen of the test DC.

From my experience this was the easiest and quickest way of getting my site back into Google's good books and back into the index.

There's nothing wrong with asgdrive's suggestion of making it a sub-domain, but again, I'd still block it with robots.txt.

I think we're talking apples and pears here; both will correct the issue. It all depends on which is easiest for you to implement, I suppose.

Vimes

11:21 am on Jan 11, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 24, 2005
posts:965
votes: 0


As a quick solution, I'd put a disallow-all in the robots.txt for the https version (you should use a separate subdomain, e.g. secure.example.com vs www.example.com, to prevent the problem asgdrive experienced). HTTPS requests are quite resource-intensive on your server, and you don't want robots hitting it hard with lots of requests and slowing your site down.

However, the fact that the robots are finding this content is an indicator of a problem with your linking (this is what percentages was implying). Pages that don't need to be secure should not be served as secure. You need to check your links, see where the duplication is happening, and change them so they point to the unsecured versions of the pages.

Once that's done, you should ensure that any pages that don't need to be secure are not served as such (check how the page is being accessed and have your server issue a redirect to the unsecured version). This acts as a safety net: it makes sure that any future slip-ups in your linking won't result in robots wandering around your secured pages.
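
One way to do that link check, sketched here only as an illustration: scan your pages for anchors that point at https:// URLs outside the areas that genuinely need to be secure, since those links are what lead robots into the duplicate https copy of the site. The secure-path list and the sample markup below are hypothetical examples, not anything from mrMister's post.

# Flag https:// links that point at pages which don't need to be secure.
# SECURE_PATHS and the sample markup are hypothetical examples.
from html.parser import HTMLParser
from urllib.parse import urlparse

SECURE_PATHS = ("/cart/", "/checkout/")       # assumed secure-only areas

class HttpsLinkAudit(HTMLParser):
    def __init__(self):
        super().__init__()
        self.flagged = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith("https://"):
                path = urlparse(value).path or "/"
                if not path.startswith(SECURE_PATHS):
                    self.flagged.append(value)

audit = HttpsLinkAudit()
audit.feed('<a href="https://www.example.com/widgets/">widgets</a>')
print(audit.flagged)   # links that should probably be plain http://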


11:55 am on Jan 11, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 5, 2002
posts:884
votes: 2


percentages,
Your https content should not allow Google to crawl/index it.
Don't blame Google for your own mistakes!
Clean your own house and then complain about Googlebot, in the meantime you have only yourself to blame!

Call me ignorant, unaware, a novice, what you will. I admit those pages shouldn't be indexed, and I'm trying to find a solution. I read my posts again, and I haven't blamed Google anywhere.

Vimes and mrMister, thanks for the info. Very useful. Much appreciated.

 
