Forum Moderators: Robert Charlton & goodroi


Will installing SSL certificate cause duplicate content penalty with http & https?

         

killua

1:44 am on Mar 5, 2017 (gmt 0)

10+ Year Member



I'm running a VPS and have used http:// for my main website for many years. To avoid www versus non-www duplicate content issues in Google search, I currently have the code below in my .htaccess:

-----------
RewriteEngine On
Options +FollowSymlinks
RewriteBase /
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteCond %{REQUEST_URI} !^/[0-9]+\..+\.cpaneldcv$
RewriteCond %{REQUEST_URI} !^/[A-F0-9]{32}\.txt(?:\ Comodo\ DCV)?$
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]
-----------

Aside from the above, I use absolute URLs starting with http:// for internal links to deter content scrapers.

Recently, due to an update in WHM/cPanel, and because Comodo is now issuing free SSL certificates and AutoSSL is enabled by default, I noticed that I can access the https:// version of my site. That is, if I type https:// manually, the browser reports the connection as secure. That is a good thing, since I'm planning to migrate to https:// in the future and I believe my site is ready for it. But I don't plan to do the migration now, since that would be a very time-consuming process.

Given the above, if I leave AutoSSL enabled and the free cPanel certificate installed on my domain, will this cause duplicate content issues with search engines, especially Google? I'm not sure whether the major search engines are smart enough to work out that although the https:// version of my site is accessible, my links all still point to http://. What do you suggest?

[edited by: phranque at 8:55 am (utc) on Mar 6, 2017]
[edit reason] exemplified "mydomain" [/edit]

phranque

3:38 am on Mar 7, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Dated May 1996.

HTTP/1.0 protocol definition work stopped in May 1996 because they started publishing the HTTP/1.1 protocol definition in January 1997 (see below) and further development was probably a waste of time.

Again, HTTP/1.0 requests include a Host header at my server. Guess things have changed since that document was written

indeed.
since then some non-compliant user agent developers have implemented a Host request header while still identifying themselves as HTTP/1.0 user agents.

haha even the wiki notes the virtual hosts problem with HTTP/1.0.
https://en.wikipedia.org/wiki/Shared_web_hosting_service [en.wikipedia.org]:
(Name-based virtual hosts) will not work with very old HTTP/1.0 browsers that do not send the hostname as part of requests. Since the "Host" header is mandatory in HTTP/1.1, which was issued in 1999 as RFC 2616, this is not a common issue.


I meant that if you (the browser or robot) are sending a request to a server that happens to have multiple domains living on it, how else do you tell the server what domain/hostname you're aiming for, if not with a Host: header?


you can't - the HTTP/1.0-compliant user agent can only do the DNS lookup and send the request to the IP address without the Host header.
hence HTTP/1.1...
https://tools.ietf.org/html/rfc2068#section-1
However, HTTP/1.0 does not sufficiently take into consideration the effects of hierarchical proxies, caching, the need for persistent connections, and virtual hosts.
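
To make the difference concrete, here is a minimal Python sketch of the two request shapes. The helper name build_request and the path /index.html are illustrative assumptions, not anything taken from the RFCs; the only point is that a strict HTTP/1.0 request carries no hostname, so a server hosting several sites on one IP address has nothing to choose by.

```python
from typing import Optional


def build_request(path: str, version: str, host: Optional[str] = None) -> str:
    """Return the raw text of a minimal GET request (hypothetical helper)."""
    lines = [f"GET {path} HTTP/{version}"]
    if host is not None:
        # The Host header is what lets name-based virtual hosts work.
        lines.append(f"Host: {host}")
    # A blank line terminates the header section.
    return "\r\n".join(lines) + "\r\n\r\n"


# Strict HTTP/1.0: only the path is sent, the hostname was used for DNS only.
http10 = build_request("/index.html", "1.0")

# HTTP/1.1: the Host header tells the server which site is wanted.
http11 = build_request("/index.html", "1.1", host="www.example.com")
```

An HTTP/1.0 client that adds the header anyway would simply call build_request("/index.html", "1.0", host="www.example.com"), which is the behaviour described above.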


keep in mind that in June 1995 there were <25K web sites and by June 1996 there were >250K.
Total number of Websites by Year:
http://www.internetlivestats.com/total-number-of-websites/ [internetlivestats.com]

clearly there exists some mechanism for HTTP/1.0 robots to include the header, even if they're not required to do so.

yes, you can construct a request that sends a Foo request header if you wish.
https://tools.ietf.org/html/rfc1945#section-5.2
Request-Header field names can be extended reliably only in combination with a change in the protocol version. However, new or experimental header fields may be given the semantics of request header fields if all parties in the communication recognize them to be request header fields. Unrecognized header fields are treated as Entity-Header fields.

an example of the reliability problem would be if your server saw a request from a HTTP/1.0 user agent and rejected it assuming it wouldn't be sending a Host request header.

lucy24

5:26 am on Mar 7, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



yes, you can construct a request that sends a Foo request header if you wish.

Funny you should say that. "Foo:" may be the only header field I've never seen, amid the blizzard of X-This and X-That and Obviously-Specific-To-This-Brand-Of-Smartphone-The-Other. To say nothing of the misspelled headers sent by especially inept robots...

IanCP

12:18 am on Mar 9, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@ killua
Comodo is now issuing free SSL certificates

I don't know why, but I was charged $US 8.95 by Comodo a few days ago. I got everything I needed so I'm not going to quibble.

keyplyr

4:36 am on Mar 9, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't know why, but I was charged $US 8.95 by Comodo a few days ago. I got everything I needed so I'm not going to quibble.
Did you go to the Comodo authority and look for the free cert... or were you limited to taking what was available from your host?

IanCP

5:55 am on Mar 9, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I did everything independently. Actually I went to ssls.com and finished up with Comodo - my credit card statement says ssls.com was the debit. Not something I would slash my wrists over. In any event, as a paying customer I would expect better support should it become necessary.

killua

6:17 am on Mar 19, 2017 (gmt 0)

10+ Year Member



-----------
RewriteEngine On
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [OR]
RewriteCond %{SERVER_PORT} =80
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301]
-----------


So far, the above redirect has been working in my tests, as suggested in this topic. I've used redirect-checker.org to test.

There's one issue I've noticed, though. I sometimes use PascalCase to make my domain name stand out, as it consists of two words. If I run a redirect test on, say, https://www.MyDomain.com, it results in another 301 redirect to https://www.mydomain.com, which I believe is unnecessary overhead. Domain names aren't case-sensitive in the first place, so I don't think this would cause duplicate content issues.

How do I modify the above code so that the final https domain is not case-sensitive? I see that wired.com was able to implement this flawlessly.

not2easy

6:49 am on Mar 19, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Personally I would add one extra line (with the [OR] flag) to specifically rewrite that specific case and avoid using [NC] as a catch-all. Less work for the server. Which is no big deal if it is a low traffic site with a short name, but the longer the name, the more permutations to be examined before completing the process to move on to the next step.

In other words - EXampleExaMple rewritten to exampleexample is one step while with [NC] it becomes a comparison of every possible combination of all letters in either case. Not that it would noticeably slow the server, but if you know the case, it is simpler to use it.

phranque

12:47 am on Mar 20, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



There's one issue I've noticed, though. I sometimes use PascalCase to make my domain name stand out, as it consists of two words. If I run a redirect test on, say, https://www.MyDomain.com, it results in another 301 redirect to https://www.mydomain.com, which I believe is unnecessary overhead. Domain names aren't case-sensitive in the first place, so I don't think this would cause duplicate content issues.

How do I modify the above code so that the final https domain is not case-sensitive? I see that wired.com was able to implement this flawlessly.


i forgot the NC flag in the RewriteCond.

you are correct that hostnames are case-insensitive and you don't want unnecessary chained redirects.

https://tools.ietf.org/html/rfc952 [tools.ietf.org]:
No distinction is made between upper and lower case.


http://tools.ietf.org/html/rfc1034#section-3.1:
By convention, domain names can be stored with arbitrary case, but domain name comparisons for all present domain functions are done in a case-insensitive manner


hostname case insensitivity also discussed extensively in DNS Case Insensitivity Clarification:
https://tools.ietf.org/html/rfc4343#ref-STD13

try this:
-----------
RewriteEngine On
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC,OR]
RewriteCond %{SERVER_PORT} =80
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301]
-----------
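
The effect of the [NC] flag on the first condition can be checked outside Apache. The Python sketch below is only an approximation of the two RewriteCond lines: the function name needs_redirect is hypothetical, and Python's re module stands in for mod_rewrite's regex engine. It shows which hostname and port combinations would trigger the 301, with and without case-insensitive matching.

```python
import re

HOST_PATTERN = r"^(www\.example\.com)?$"


def needs_redirect(host: str, port: int, nocase: bool = True) -> bool:
    """Approximate the ruleset: redirect if the host is wrong OR the port is 80.

    nocase=True plays the role of the [NC] flag on the first RewriteCond.
    """
    flags = re.IGNORECASE if nocase else 0
    host_is_canonical = re.match(HOST_PATTERN, host, flags) is not None
    return (not host_is_canonical) or port == 80


# With [NC]: a CamelCase spelling of the canonical host is left alone on 443.
camel_with_nc = needs_redirect("www.Example.com", 443)

# Without [NC]: the same request is redirected just to change the casing.
camel_without_nc = needs_redirect("www.Example.com", 443, nocase=False)
```

Requests for the bare domain or for port 80 are redirected in both variants, so the flag only changes the outcome for mixed-case spellings of the canonical hostname.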

killua

1:08 am on Mar 20, 2017 (gmt 0)

10+ Year Member


@phranque, I tried the above code with NC added, as in "RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC,OR]", but it doesn't work. The redirect test shows it's still doing a 301 redirect to the lowercase domain when accessing https://www.MyDomain.com.

lucy24

2:54 am on Mar 20, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Crystal ball says either your server or your browser--or both--is doing something clever and tweaking the casing on its own behalf. I've just tried it on my test site using CamelCase (ExAmple.com) in the domain-name-canonicalization redirect; the result was an "isn't redirecting properly" error from the browser. Since I don't log headers for 301 responses--in fact I wouldn't even know how to do it :( --I can only conjecture that the request was "example.com" by the time it reached the RewriteRule.

I then tried a simple request, and had a hell of a time even getting the browser to let me type in CamelCase. Access logs show no redirect, so it must have been the browser.

Conclusion: may as well stick with example.com without the [NC] flag. The only people who request EXAMPLE.COM will be robots. Um. As it were.

phranque

5:40 am on Mar 20, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



redirect test shows it's still doing a 301 redirect to the lowercase domain

make sure it's not a cached response problem.
maybe test on a different device/browser...

added: what are your server access log files showing for those requests?
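
One way to pull that answer out of the logs, sketched in Python. It assumes Apache's common or combined log format, and the sample lines are invented for illustration (the IP addresses and the user agent names are hypothetical). Note that the default format does not record the Host header, so this shows whether a 301 was issued for a request, not which casing the client sent.

```python
import re
from typing import Iterable, Iterator

# host ident user [time] "request line" status ...
LOG_LINE = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "([^"]*)" (\d{3}) ')


def requests_with_status(lines: Iterable[str], status: str) -> Iterator[str]:
    """Yield the request line of every log entry with the given status."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2) == status:
            yield match.group(1)


# Hypothetical sample entries, not taken from any real log.
SAMPLE = [
    '203.0.113.5 - - [20/Mar/2017:05:40:00 +0000] "GET / HTTP/1.1" 301 229 "-" "ExampleBot/1.0"',
    '203.0.113.9 - - [20/Mar/2017:05:41:12 +0000] "GET / HTTP/1.1" 200 5120 "-" "ExampleBrowser/1.0"',
]

redirected = list(requests_with_status(SAMPLE, "301"))
```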

[edited by: phranque at 5:44 am (utc) on Mar 20, 2017]

phranque

5:42 am on Mar 20, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Since I don't log headers for 301 responses--in fact I wouldn't even know how to do it

instead of using a [R=301] flag on the RewriteRule, you internally rewrite that request to a script which logs the headers and then reexamines the request with the same rules as your mod_rewrite ruleset(s) to respond with a proper Location: header and a 301 status code.
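
A rough Python sketch of that idea follows. Everything in it is an assumption made for illustration: the function name handle, the logger name, and the way headers, path and port are passed in would all depend on how the script is actually wired up to the server. It logs the request headers, re-applies the same hostname and port test as the ruleset, and returns the status and Location header to send.

```python
import logging
import re
from typing import Dict, Optional, Tuple

log = logging.getLogger("redirect_debug")  # hypothetical logger name

# Same test as the RewriteCond, including the [NC] behaviour.
CANONICAL_HOST = re.compile(r"^(www\.example\.com)?$", re.IGNORECASE)


def handle(
    headers: Dict[str, str], path: str, port: int
) -> Optional[Tuple[int, Dict[str, str]]]:
    """Log the headers, then decide whether to answer with a 301."""
    # Record exactly what the client sent, including the casing of Host.
    log.info("redirect candidate %s: %r", path, headers)

    host = headers.get("Host", "")
    if CANONICAL_HOST.match(host) is None or port == 80:
        # path is assumed to start with "/", e.g. "/page".
        return 301, {"Location": "https://www.example.com" + path}
    return None  # already canonical, no redirect needed


response = handle({"Host": "example.com"}, "/page", 443)
```

The log output is the part that answers the question about what the user agent really sent; the redirect decision itself should stay identical to the mod_rewrite rules so that visitors see no difference.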

Conclusion: may as well stick with example.com without the [NC] flag.

so you're going with the unnecessary chained redirects? =8)

The only people who request EXAMPLE.COM will be robots. Um. As it were.

many companies use CamelCasing for branding in links on web sites, email, tweets, text messages, etc.
when a human clicks on one of those links, unless the user agent is folding the hostname to lower case first, the request will include the Host header and the user-agent-requested hostname which will be CamelCased.

lucy24

6:03 am on Mar 20, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



internally rewrite that request to a script which logs the headers and then reexamines the request with the same rules as your mod_rewrite ruleset(s) to respond with a proper Location: header and a 301 status code

Yikes. That sounds like something I could do on the test site if it was really important to get data on some specific question--but not something I'd want to mess with in real life.

unless the user agent is folding the hostname to lower case first

That's the interpretation I arrived at. Someone might like to try it with different browsers. I'm currently on Firefox for Mac. The only non-lowercase Host: headers I've ever seen were from pretty clear-cut robots. Who, come to think of it, must one-and-all have been blocked before they ever reached the RewriteRule (which doesn't carry the [NC] flag), or else I'd have found the redirects. (I log headers on the 403 page, which by its nature is exempt from domain-name canonicalization.)

:: detour to paw over headers ::

Heh. Now and then there's something like "www.EXAMPLE.com", as if they're going for the Extreme Sketchiness prize and aren't confident that claiming to be Chrome/11 or Firefox/5 is sufficient to guarantee a win.

killua

6:20 am on Mar 20, 2017 (gmt 0)

10+ Year Member



Finally, I found out why the [NC] is not working: the WordPress installation is causing it. I temporarily moved WordPress to a subfolder, and the CamelCase request now returns 200 OK instead of 301. It seems the NC rule isn't fully compatible with WordPress.

not2easy

7:19 am on Mar 20, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



WordPress uses the URL you put in the settings. When you change to https you need to change that URL too. In the Admin interface go to Settings > General and set it to the new URL. Also be sure to add whatever changes to your htaccess file above (or before) that WP section.

phranque

8:51 am on Mar 20, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



That sounds like something I could do on the test site if it was really important to get data on some specific question--but not something I'd want to mess with in real life.

sometimes it is really important to get data on some specific question "in real life".

phranque

8:52 am on Mar 20, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Also be sure to add whatever changes to your htaccess file above (or before) that WP section.

stated in other words, external redirects typically precede internal rewrites.

lucy24

4:16 pm on Mar 20, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Incidentally, I tried Safari and Camino (on the off chance that case-leveling is a recent browser thing). Both lower-cased whatever I typed in (ExamPle.com) before sending it along to the server, so no 301 shows up in logs.

phranque

5:06 pm on Mar 20, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I tried Safari and Camino (on the off chance that case-leveling is a recent browser thing). Both lower-cased whatever I typed in (ExamPle.com) before sending it along to the server, so no 301 shows up in logs.

so you tried testing this on a few mac os web browsers but have you also tested on a few text messaging apps, social media apps, email apps, anything on windows/android/ios/linux, etc?
maybe you can rely on all your visitors using a popular, relatively modern and robust web browser...

(you might even want to test on some web apps - for example what happens if you try to validate your document on w3c or check for wcag compliance and provide a CamelCased hostname?)

lucy24

9:02 pm on Mar 20, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think I already have one crucial piece of information: the user doesn't necessarily have control over what casing the site uses. So you can advertise yourself as ExAmple.com, but as far as your browser is concerned it's plain old example.com.

try to validate your document on w3c or check for wcag compliance and provide a CamelCased hostname?

Well, I can do that right now.

:: shuffling papers ::

w3c link checker also levels the casing. (Request link-checking for ExAmple.com, results screen immediately changes to example.com--here obviously my browser can't be blamed--and no redirects show up in logs, either for the specified page or for the preceding robots.txt.)

If the argument is that the [NC] flag should be retained, then, hm, maybe. But I'd say only if it can be established that some user-agents persist in sending CamelCase requests even after they have been instructed to use lowercase.

Actually, I'm not sure what we're arguing about ;)

phranque

2:54 am on Mar 21, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I'm not sure what we're arguing about

proactively avoiding unnecessary chained redirects

lucy24

4:49 am on Mar 21, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



avoiding unnecessary chained redirects

Well, that should never be a problem, so long as the target of any specific redirect uses the same protocol and hostname as the fallback generic redirect. I always lean toward spelling things out instead of using locutions like %{HTTP_HOST} in a target. Keep the redirects for different hostnames separate.

Anyway it's been educational, because I never had any idea that the browser--any browser--officiously steps in and changes the casing on hostnames I've typed in.

phranque

8:34 am on Mar 21, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



that should never be a problem, so long as the target of any specific redirect uses the same protocol and hostname as the fallback generic redirect

it doesn't matter if the subset of user agents available for your testing fix this problem before the request is sent.
if only your fallback generic redirect triggers due to a CamelCase hostname request, that meets my definition of an unnecessary redirect.

all of the IETF documents i referred to above are consistent about case-insensitivity including foundation documents from >30 years ago.

lucy24

4:40 pm on Mar 21, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



if only your fallback generic redirect triggers due to a CamelCase hostname request, that meets my definition of an unnecessary redirect.

Yah, OK, but we're still not getting chained redirects, which was the original worry.