Forum Moderators: Robert Charlton & goodroi
A year or two ago one site I work with had problems with https duplicates getting indexed. The origin of the problem was that legitimate https pages in the shopping cart were using the same templates as the rest of the site, which mostly used relative URLs for navigation.
The relative URLs meant that https pages were effectively linking to other pages as https too, so they'd get spidered as https.
When a page whose URL has unintentionally become "https-ified" is spidered, any relative links on it resolve to https too. That's how the cancer spreads and duplicate problems grow....
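The mechanics described above can be sketched with a quick check of how relative links resolve against a page's base URL (a minimal illustration; the example.com paths are hypothetical):

```python
from urllib.parse import urljoin

# A cart page that was crawled over https...
cart_page = "https://www.example.com/cart/checkout.aspx"

# ...links to the rest of the site with relative URLs,
# so those links resolve to https as well:
print(urljoin(cart_page, "/products/widgets.aspx"))
# https://www.example.com/products/widgets.aspx

# An absolute URL in the template would have pinned the scheme:
print(urljoin(cart_page, "http://www.example.com/products/widgets.aspx"))
# http://www.example.com/products/widgets.aspx
```

That first resolution is exactly how one https page quietly spawns https duplicates of the whole navigation.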
the other is configuring your server to respond with a 200 OK for non-canonical requests instead of 301 redirecting to the canonical url.

Could you please tell me why you said that? Or does it apply only to https:// non-canonical requests where the server returns a 200 response?
I'm guessing that you've got a shopping cart on your site, which is going to involve pages with the https protocol somewhere.
As phranque suggests, 301 redirecting all requests to the proper canonical form of your urls is the right way to handle the situation.
Won't that redirect the secure shopping cart URLs from https:// to http:// as well?
...have a list of URLs that should be served as https, and if URL requested is not in this list, to do a 301 redirect to http version, and vice versa (however, be careful to link internally to a correct version of the URL).
No... as aakk9999 suggests, you've got to decide which pages should be https (the specific secure shopping cart pages) and which should be http (most of the rest of your site), and make a list of which should be which.
There is a third way to address it, and this is to have a list of URLs that should be served as https, and if URL requested is not in this list, to do a 301 redirect to http version, and vice versa (however, be careful to link internally to a correct version of the URL)
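The list approach could be sketched along these lines (a hypothetical helper, not anyone's actual implementation; the paths in HTTPS_PATHS are made up):

```python
from urllib.parse import urlsplit, urlunsplit

# Paths that should only ever be served over https (hypothetical list)
HTTPS_PATHS = {"/cart/checkout.aspx", "/cart/payment.aspx"}

def canonical_redirect(url):
    """Return the 301 target if the requested URL uses the wrong
    scheme, or None if it is already canonical."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    wanted = "https" if path in HTTPS_PATHS else "http"
    if scheme == wanted:
        return None
    return urlunsplit((wanted, netloc, path, query, fragment))

print(canonical_redirect("https://www.example.com/page1.aspx"))
# http://www.example.com/page1.aspx  (https request for a non-secure page)
print(canonical_redirect("http://www.example.com/cart/checkout.aspx"))
# https://www.example.com/cart/checkout.aspx
```

And, as the post says, the internal links themselves should already point at the correct version so visitors and spiders rarely hit the redirect at all.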
But how do I redirect [example.com...] to http://www.example.com/page1.aspx
By the way, is it normal to configure the server to redirect a request for [example.com...] to http://www.example.com? Is that set up by default?
In my case the HTTPS pages (all within a folder) are blocked via robots.txt and meta robots as well.

How would you normally block all the https: requests through robots.txt? Is there a specific syntax for it?
it wouldn't be normal if www.example.com was intended to be secure content

Haha yeah, it never makes sense, does it? I thought there might be some shopping-cart-type pages, or pages requiring a secure login, on the website I am talking about. But there aren't any. So I can block all the https:// requests, right? Can you please explain to me how to start over?
you would ideally design your url structure so that you can easily distinguish secure and non-secure content, and then use mod_rewrite techniques (for apache), or whatever techniques your environment requires, to make sure all non-canonical requests are redirected to the canonical url.
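With a url structure where, say, all secure pages live under /cart/, the mod_rewrite rules might look something like this (a rough .htaccess sketch, not a tested config for any particular site; the /cart/ prefix is an assumption):

```apache
RewriteEngine On

# https request for anything outside /cart/ -> 301 to http
RewriteCond %{HTTPS} on
RewriteCond %{REQUEST_URI} !^/cart/
RewriteRule ^ http://www.example.com%{REQUEST_URI} [R=301,L]

# http request for anything inside /cart/ -> 301 to https
RewriteCond %{HTTPS} !on
RewriteCond %{REQUEST_URI} ^/cart/
RewriteRule ^ https://www.example.com%{REQUEST_URI} [R=301,L]
```

This is the config-level equivalent of keeping a list of which URLs should be which, expressed as a path prefix instead of an explicit list.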
there are several recent discussions about robots.txt-excluded urls which appear in the index.
How would you normally block all the https: requests through Robots.txt? Is there a specific syntax for it?
User-agent: *
Disallow: /
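Note that a Disallow: / like that blocks the whole host, so it only works if it's served for https requests alone; the http version of the site needs its normal robots.txt. One common way to do that on Apache is to keep a second file and rewrite https requests for robots.txt to it (a sketch; the robots_ssl.txt filename is an assumption):

```apache
RewriteEngine On
RewriteCond %{HTTPS} on
RewriteRule ^robots\.txt$ /robots_ssl.txt [L]
```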
Can you please explain to me how to start over?
So it's no longer just the snippet where Google used to show the URL-only version of blocked content? Is it showing the complete page now?
A description for this result is not available because of this site's robots.txt - learn more [support.google.com].
I also got to know from the same forum that if we include that page in the sitemap, it will get crawled and indexed no matter whether we block it in robots.txt or not. Is that also true?