|http and https: duplicate content?|
| 4:53 am on Sep 22, 2008 (gmt 0)|
I've received several Google alerts for links to my site, and I've noticed that some are internal links that Google is reporting.
What has me concerned is that I'm seeing alerts for pages such as https://www.example.com/widgets.html, which is the exact same page as http://www.example.com/widgets.html.
Once a visitor is in the secure section of my site, any links he/she follows out of the secure section still use https in the URL.
Does anyone know if Google would regard this as duplicate content?
| 5:04 am on Sep 22, 2008 (gmt 0)|
Yes - it can cause problems.
Two urls are different if they do not match exactly, character for character. Any time two different urls resolve to the same content, the potential is there for a problem. At the very least, you are splitting link juice that "should be" concentrated in one spot across different urls.
For an optimum set-up, there are many potential duplicate url issues, or "canonical url" problems. For our most recent attempt to catalog them all, see:
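To make the "character for character" point concrete, here is a minimal Python sketch. The helper name canonical_url is hypothetical, and preferring http as the canonical scheme is an assumption taken from this thread, not a general rule.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str, scheme: str = "http") -> str:
    """Return the same URL with its scheme forced to the preferred one.

    Hypothetical helper: it only normalizes the protocol, nothing else.
    """
    parts = urlsplit(url)
    return urlunsplit((scheme, parts.netloc, parts.path, parts.query, parts.fragment))


a = "http://www.example.com/widgets.html"
b = "https://www.example.com/widgets.html"

print(a == b)                                # False: two distinct urls
print(canonical_url(a) == canonical_url(b))  # True: one page once the scheme is normalized
```

As plain strings the two addresses are different urls, even though they serve the same page.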
That thread is one of several that are available in the Hot Topics area [webmasterworld.com], which is always pinned to the top of this forum's index page.
For a specific discussion of the https form of trouble, see:
| 10:35 pm on Sep 25, 2008 (gmt 0)|
Thanks, Tedster. I checked all of the topics you provided links to, but so far a solution to my problem isn't in any of them. My site is on a shared IIS server, and the hosting company can't install ISAPI for reasons I'm too dumb to understand.
Out of curiosity, I checked some competitors' websites and found that some competitors' pages can be accessed using both http and https, while others switch to http.
What's bizarre to me is that the pages affected haven't moved in their rankings. One is #1, but now shows https://example.com/widgets.html where before it showed www.example.com/widgets.html
I don't get it.
[edited by: tedster at 11:45 pm (utc) on Sep. 25, 2008]
| 11:52 pm on Sep 25, 2008 (gmt 0)|
The optimum set up for secure pages is to have them served only from a dedicated host name (subdomain) such as secure.example.com. Then you can serve a dedicated robots.txt file rather easily for your https urls, and substituting one protocol for another should result in a 404 status.
If Google has started to index your urls both ways, then the situation could easily decay over time. So I'd suggest you do something to stop this confusion from spreading - even move to another hosting situation if that's what it takes.
It's your business and you just noticed a crack in the foundation. Ignoring it might be OK, but it's also a very real risk.
| 4:12 am on Sep 26, 2008 (gmt 0)|
Thanks for being so patient with me, Tedster.
My hosting company said that they could set up the secure.example.com subdomains as you describe.
However, I don't see how that addresses the problem of my non-secure pages showing up with both the http and https protocols. I haven't been able to find a workable robots.txt solution to disallow the https prefix.
Can you or anyone else think of a direction to point me to?
Thanks for your patience.
| 4:17 am on Sep 26, 2008 (gmt 0)|
If you only serve secure files from the subdomain, then the non-secure pages should result in a 404 response when the https protocol is requested. Those urls should just fall out of the index rather quickly.
But first you'll need to sort out the file locations and your internal linking. Make sure that links out of the secure subdomain are absolute and only use the http protocol - or your visitors will also get the 404.
| 1:19 pm on Sep 26, 2008 (gmt 0)|
|If you only serve secure files from the subdomain, then the non-secure pages should result in a 404 response when the https protocol is requested. Those urls should just fall out of the index rather quickly. |
Oh-oh, we handle http vs https a bit differently. We would not serve a 404 in this instance but a 301 instead.
I hate serving 404s these days, I really do. The damn bots will sit there and continually request files that have been moved or gone for quite some time. I mean, I still see 404s from stuff that was removed years ago. There are still linked references out there on the web that I have no control over. We've been micro-managing 404s now for a little while. While it is a tedious process, getting them "all in order" really makes for a much cleaner monitoring environment and we do monitor just about everything. :)
This http vs https issue plagues most who have SSL, especially on Windows. For some reason, us Windows folks just don't get it! We have, but we have to wonder about our counterparts sometimes. We're still learning too. My peers think I'm crazy for investing in my MS network. Ya, I'll agree. :)
Some have asked how we are performing the 301 in this instance. We have multiple robots.txt files being served based on protocol, request, etc. We also force non http and https where applicable. I've seen way too many instances of people linking to the wrong protocol. To avoid this, you MUST 301 those requests or they will continue to fill your logfiles with 404s. Not only that, some of those visitors may be lost not knowing what happened when they clicked the link.
We utilize ISAPI_Rewrite 2 and 3. The examples below are rules for the 2.0 ini file.
RewriteCond %HTTPS off
RewriteRule /(sub|sub|sub|sub|sub|sub)/(.*) https://www.example.com/$1/$2 [I,RP]
RewriteCond %HTTPS on
RewriteRule /(sub|sub|sub|sub|sub|sub)/(.*) /$1/$2 [I,L]
#This will force ANY https:// to http: if they are not in the last rule above
RewriteCond %HTTPS on
RewriteRule (.*) http://www.example.com$1 [I,RP]
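For anyone who finds the rule syntax hard to read, the decision those rules appear to make can be sketched in Python. This is only a model of the logic, not ISAPI_Rewrite itself: the directory names in SECURE_DIRS, the HOST constant and the route function are all hypothetical stand-ins for the "sub" placeholders above.

```python
from typing import Optional, Tuple

# Assumption: hypothetical names for the secure directories ("sub" in the rules).
SECURE_DIRS = ("checkout", "account")
HOST = "www.example.com"


def route(https_on: bool, path: str) -> Optional[Tuple[int, str]]:
    """Return (301, target_url) when the request uses the wrong protocol, else None."""
    segments = path.lstrip("/").split("/", 1)
    # A secure path has the form /<secure dir>/<anything>, matched case-insensitively.
    in_secure_dir = len(segments) == 2 and segments[0].lower() in SECURE_DIRS
    if in_secure_dir and not https_on:
        return 301, f"https://{HOST}{path}"  # force secure directories onto https
    if not in_secure_dir and https_on:
        return 301, f"http://{HOST}{path}"   # force everything else back to http
    return None                              # protocol already correct, serve as-is


print(route(True, "/widgets.html"))        # (301, 'http://www.example.com/widgets.html')
print(route(False, "/checkout/cart.asp"))  # (301, 'https://www.example.com/checkout/cart.asp')
print(route(True, "/checkout/cart.asp"))   # None
```

The point of the sketch is that every url ends up reachable on exactly one protocol, and the wrong one answers with a 301 rather than a 404.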
In reference to serving robots.txt for http and https, this is what we might utilize from the 2.0 rules.
RewriteCond %HTTPS ^on$
RewriteRule /robots.txt /robots.https.txt [I,O,L]
The robots.https.txt contains one Disallow: line for the entire site. We typically don't want our https pages getting indexed, those are off limits to the bots, the good ones anyway. It's the bad ones that continue to get us into these pickles by choking on rewrites and screwing everything up with their regurgitated mish-mash of who knows what!
|Can you or anyone else think of a direction to point me to? |
I really don't like recommending this but, find a new host. If they don't want to take the time and install the software to make their servers behave like they should, then find someone who will. That type of attitude is going to drag those types of hosts into the ground as technology continues to improve and people become more aware of indexing challenges. You can tell them that "I" said they FAIL! ;)
| 2:04 pm on Sep 26, 2008 (gmt 0)|
Thanks again for the information.
Pageoneresults, I love my hosting company, and it came well-recommended from many members of a hosting forum.
Right now I'm on a shared server plan. At one time they had ISAPI implemented on the shared Windows servers, but one site's rewrites could cause problems for other sites. So now they only do ISAPI on VPS servers. The cost would be $50-$60 more a month for the VPS, but I'd say that's worth it if it works.
As for 301 redirects, I thought that was only possible with Apache servers using .htaccess files. Obviously I'm wrong again.
I can understand the examples you're presenting, but I wouldn't dare try to implement them myself with my limited knowledge. If the folks at the hosting company can't take care of this, I'll have to find someone who can and pay them.
I'd sure hate to lose the top rankings I've been working on the last few months because of this snafu.
| 2:10 pm on Sep 26, 2008 (gmt 0)|
|As for 301 redirects, I thought that was only possible with Apache servers using .htaccess files. Obviously I'm wrong again. |
Nah, ISAPI_Rewrite has been out for years. We've been rewriting on Windows for as long as the product has been out. In fact, we've worked directly with Yaroslav at Helicon in working out some of the bugs in various releases. We tend to delve into the more advanced features of the rewrites which of course uncovers all those little bugs that the typical user wouldn't run into.
With ISAPI_Rewrite 3.0, you use a .htaccess file just like you would on Apache.
With Windows Server 2008 and IIS 7, we're finding that rewriting just took a major turn and that the above types of products may no longer be required moving forward. You will still need them if you are not running Windows Server 2008. The new methods are via XML config files and from what I've seen so far, this is going to rock! MS have not been asleep at the wheel!
|At one time they had ISAPI implemented on the shared Windows servers, but one site's rewrites could cause problems for other sites. |
We've not seen that behavior. The ini rules are per site. But, we are in a controlled environment and typically write the rules ourselves and install them for the client. I'm not 100% sure that an incorrect rewrite for one site is going to cause problems for others. I could be wrong but we've never seen that and I've had a few rewrite hiccups. :(
|The cost would be $50-$60 more a month for the VPS, but I'd say that's worth it if it works. |
Absolutely! You cannot cut costs when dealing with technical challenges such as this.
| 8:27 pm on Sep 26, 2008 (gmt 0)|
|As for 301 redirects, I thought that was only possible with Apache servers using .htaccess files. |
IIS Servers also do 301 redirects natively - even without ISAPI Rewrite. That must be the case, or else a third party plug-in couldn't make it happen. In fact, even without admin level access to the IIS Internet Services Manager, you could still do a page by page 301 redirect using VBScript.
However, without ISAPI Rewrite installed and the luxury of their interface, IIS makes it very difficult to do the exact kind of rules-based redirecting that we're talking about.
I fully agree that a 301 redirect from https to http would be better than a 404 for the old incorrect protocol urls. But you need the tools to make it practical.
| 10:36 pm on Sep 26, 2008 (gmt 0)|
Thanks again for your invaluable input.
I was about to switch to a VPS server when the sales person suggested I talk to yet another tech support person. The tech guy was certain that a 301 as well as a https redirect could be accomplished on the shared server I'm using. I'm hoping he's right, as I don't want to disrupt anything right now.
I was dubious when the tech support guy said that https and http pointing to the same page wouldn't be considered duplicate content. I trust both of you more than someone whose knowledge of SEO is an unknown quantity (even if that tech person says he's been doing SEO for ten years).
One troublesome thing I saw today was that a page on my site had moved from #1 to #3 for the term "Acme red widgets." The #1 and #2 spots were taken by one of the big online shopping sites. More worrisome, though, was that the page had moved from #31 for "Acme widgets" to oblivion. The page was holding both rankings well until Google used the https protocol.
| 8:09 am on Sep 27, 2008 (gmt 0)|
Another way to deal with this is through robots.txt
Most sites should deny all bots access to HTTPS
Set up a mapping for robots.txt so that it is handled by an ASP file - if the port is 443 then serve a robots.txt which Disallows all; if not then serve the standard one.
There are tutorials around which a quick search will find.
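The port check itself is tiny. Here is a rough sketch of the logic in Python rather than ASP; the function name and both robots.txt bodies are assumptions for illustration, and a real handler would read the port from the server variables of the incoming request.

```python
# Assumption: a permissive file for http and a block-everything file for https.
ROBOTS_HTTP = "User-agent: *\nDisallow:\n"
ROBOTS_HTTPS = "User-agent: *\nDisallow: /\n"


def robots_txt(server_port: int) -> str:
    """Pick the robots.txt body based on the port the request arrived on."""
    # Port 443 is the default for https, so those requests get the blocking file.
    return ROBOTS_HTTPS if server_port == 443 else ROBOTS_HTTP


print(robots_txt(443))
print(robots_txt(80))
```

This assumes https is served on the default port 443; a site running SSL on another port would need a different test.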
| 6:17 pm on Sep 27, 2008 (gmt 0)|
vincevincevince, there's a way to do that, yes. The problem is that three or four of my product pages which were previously ranking well for their search terms are now appearing using the https protocol. My fear is that if I block the bots from https addresses, those pages that are ranking will disappear from the SERPS.
This is a Google forum, though, not an IIS forum. My host came up with a solution that may or may not work. I'll ask about it in the IIS forum.
Thanks again for all the replies.
| 7:08 pm on Sep 27, 2008 (gmt 0)|
|The problem is that three or four of my product pages which were previously ranking well for their search terms are now appearing using the https protocol. My fear is that if I block the bots from https addresses, those pages that are ranking will disappear from the SERPS. |
When you redirect the https pages to http, any inbound link juice that may have boosted the https pages but not the http pages will be redirected to the http version, so this isn't something I'd worry about. After the redirect, the http equivalents should be ranking.
I should tell you that I feel a 301 redirect is the only way to dependably fix this situation. I'd recommend you canonicalize your domain at the same time.
You might also want to run a test to see if your http pages return when you disable Google's dupe filter. Run the search, and then, in your address window, append &filter=0, click on "Go," and see if your http pages show up. This won't be an exact test of the situation you should expect after the redirect, but it might be helpful nevertheless. Note that extra pages may be returned on sites ahead of you in the serps, so your pages will be shifted down when you do this.
| 10:44 pm on Sep 27, 2008 (gmt 0)|
Not 100% sure what you meant about disabling Google's dupe filter, Robert, but I ran the search, added &filter=0 to the end of the URL and ran it again. It was still the https pages that showed up, and in exactly the same place.
The secure portion of the site has been online for about 11 months, and many of my phrases have been page one for quite some time. I wonder why it's only in the last week or so that Google started using the https protocol? The paranoid side of me thinks a competitor may be trying something.
| 6:06 am on Sep 28, 2008 (gmt 0)|
|Not 100% sure what you meant about disabling Google's dupe filter, Robert, but I ran the search, added &filter=0 to the end of the URL and ran it again. It was still the https pages that showed up, and in exactly the same place. |
I'm not sure what you mean by "and ran it again." If you mean, you hit the Google "Search" button again, or hit Enter, that's not what I'm suggesting.
I'm suggesting that you first run the search, then add &filter=0 to the end of the URL, and then click on the right-pointing arrow that's just to the right of the address window on your browser. In Firefox, when you mouse over this arrow, a text flag is displayed that reads "Go to the address in the Location Bar." In Internet Explorer, the message reads "Go to http etc...."
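If editing the address by hand is error-prone, the same change can be sketched in Python. The search url below is only an illustrative example and with_filter_off is a hypothetical helper; the only detail taken from this thread is the filter=0 parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_filter_off(search_url: str) -> str:
    """Return the search url with filter=0 added (replacing any existing filter value)."""
    parts = urlsplit(search_url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "filter"]
    query.append(("filter", "0"))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment))


print(with_filter_off("http://www.google.com/search?q=acme+red+widgets"))
# http://www.google.com/search?q=acme+red+widgets&filter=0
```

Paste the resulting address into the browser and load it directly, rather than pressing the Search button again.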
| 3:54 pm on Sep 28, 2008 (gmt 0)|
Hi, Robert. When I do as you suggested, the same pages are in the same positions, with the https protocol. No change.
| 5:55 pm on Sep 28, 2008 (gmt 0)|
Then it's possible these https pages (I assume in the regular area of your site, not in the secure area) are ranking because of the https inbound links. The only way you can take advantage of these links is to do the 301s to http.
For me, it's much cleaner to do as tedster suggests, and set up a subdomain for secure pages, and then block all bots from accessing that section of your site.
When I get into issues like this, btw, I generally hire an SEO-aware programmer who specializes in this kind of stuff. Often the problem also involves the setup of your ecommerce package. I would never rely on tech support from a hosting company, and particularly not level one tech support, to handle a situation like this. Most IT people, and even system administrators, in my experience simply don't get duplicate content issues, and even when they do it's often difficult or impossible to persuade them how important these issues are.
[edited by: Robert_Charlton at 6:00 pm (utc) on Sep. 28, 2008]
| 6:20 pm on Sep 28, 2008 (gmt 0)|
|When I get into issues like this, btw, I generally hire an SEO-aware programmer who specializes in this kind of stuff. |
Which is the best advice given in this thread to date. The more you fumble around with it and the more your current host implements things that are not correct, the more layers of confusion you add to the indexing routines.
|Often the problem also involves the setup of your ecommerce package. |
Yes it does. And, most of them FAIL out of the box when dealing with Windows.
|I would never rely on tech support from a hosting company, and particularly not level one tech support, to handle a situation like this. Most IT people, and even system administrators, in my experience simply don't get duplicate content issues, and even when they do it's often difficult or impossible to persuade them how important these issues are. |
This is usually a Windows Administration challenge. You nailed it on the head too! Most Windows Server Administrators are absolutely CLUELESS! I used to be too and then got on the bandwagon, that was years ago. I can't believe that there are still Windows hosting companies that haven't gotten IT yet. That's okay, those who provide support in this area will be glad to take those clients leaving your FAILED hosting platform!
dickbaker, why not just hire someone to do this for you? I think it will be quite difficult to piece this all together for free at WebmasterWorld. And, if you are dealing with your current host, I don't think it is going to happen anytime soon. Bite the bullet and get someone in there to clean this up for you. And, if needed, change hosts too.
| 9:15 pm on Sep 28, 2008 (gmt 0)|
OK. I've moved to a VPS server. I see that I can select a file such as "widgets.html" that is now being indexed by Google with the https protocol, and in IIS do a permanent redirect to http://www.example.com/widgets.html. I can do this through the VPS control panel.
I'm only dealing with one tech support person right now, rather than one of the hundreds they have on staff, and he's one who's been doing SEO for quite some time. (Although he didn't believe me when I said that having http and https versions of the same page could result in duplicate content penalties).
Because I have pages that are ranking very well, I don't want to do a full redirect of all https pages immediately, as I suspect I'd lose the rankings for those pages. Instead I thought that once Google had indexed the affected pages with the http protocol, I could then do an ISAPI rewrite on the robots.txt file.
Please tell me if I'm wrong in what I'm doing.
Lastly, if I need to find someone to hire, where would I look?
Thanks for your input and interest on this.
By the way, Robert, I don't know where the incoming links would be for the https versions of these affected pages. I don't have any such links on my site, and I haven't submitted anything using https to any other sites. The only link on my shopping cart pages to a non-secure page is a link to the home page on my site, and that's an absolute URL.