
Forum Moderators: Ocean10000 & phranque

htaccess and going SSL

Solving two issues at once

     
5:31 pm on Apr 26, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2006
posts:1529
votes: 98


Hi webmasters. As always, I did my research before asking, but I just can't solve this. I've found lots of resources on the web (forums, etc.), but honestly there is no clear answer, so I'd appreciate any help.

Going SSL involves 3 basic things.

1. Having your SSL certificate (self explanatory)

2. Making your site fully https using redirects. You need to help crawlers, search engines, etc. move from your old, already-ranked pages on http to https. This means every link on the web to example.com/history-of-webmasterworld should now resolve to https://example.com/history-of-webmasterworld. This must be done using 301 redirects, and it saves you from serving duplicate content (something search engines don't like). If you are here, you probably already know the issue of http:// and [www...] serving duplicate content. Well, http and https are not only different protocols but also different ways to serve the same page, so leaving both reachable lands you in the same situation. Everything must point to your https version and only there, period.

3. Avoiding the www, removing it. Point #2 is about https and mentions the www issue, so why a third point about it? Because an htaccess redirect that removes the www from an https request is something different. You might already have redirects to remove www, but I bet many don't remove it from https yet.


My reading and testing resulted in the following:

RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
RewriteRule ^(.*)$ https://%1/$1 [R=301,L]

RewriteCond %{HTTPS} !on [NC]
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]


I bet this could be improved. It is working so far, but it doesn't solve the [www...] thing. I found the following snippet on different forums, each stating that "it works" on the fly (this one is just about removing the www on https). It is said to work regardless of http or https, but in my case it's not working.

RewriteCond %{HTTP_HOST} ^www\.
RewriteCond %{HTTPS}s ^on(s)|off
RewriteCond http%1://%{HTTP_HOST} ^(https?://)(www\.)?(.+)$
RewriteRule ^ %1%3%{REQUEST_URI} [R=301,L]


So I found another bit of code, repeated on several forums and also marked as the "answer". The explanation given was that the previous and similar attempts failed because they never checked for a secure (https) connection, that is, the port, so this one is supposed to do that. There are comments about the port, because you can configure different ports and then it just wouldn't "work". Anyway, after testing this, it doesn't work on my website.

RewriteEngine On

# Check that you're on port 443 and the hostname starts with www
RewriteCond %{SERVER_PORT} ^443
RewriteCond %{HTTP_HOST} ^www\.

# Redirect to domain without the www
RewriteRule (.*) https://example.com$1 [L,R,QSA]


The tricky thing with htaccess is that you can build rules and make them work, but sometimes those rules get applied as redirect 1, redirect 2 and redirect 3, something we should all avoid; instead it should all happen on the fly, in one step. I have already set up SSL on the domain I'm testing, but the rule to remove www from https is not working. Any help will be appreciated.

On a personal note, I always try to avoid hard-coding the URL. This lets me use the same code on all my websites without any changes. It's not being lazy, it's being practical, and it helps avoid confusion or mistakes during maintenance.
6:05 pm on Apr 26, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3833
votes: 212


That first example above is scary. Remember that "{HTTP_HOST}" does not mean your domain, it means whatever domain was requested. Often innocent, but too vague for my comfort. The problem with that second test is that it is not a 301 redirect. Apache's default is a 302 (Temporary) unless you have that [R=301] flag in the rule.

Beyond that, this has been the most commonly asked question in the Apache forum for quite a while now, and just a quick look at recent threads can get you both a very fast answer and the explanation behind it.

A search for some of the rules' terms brings up a LOT of results. To start, I'd suggest reading through this one: [webmasterworld.com...] another with details is this older thread: [webmasterworld.com...]
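
On the 302 point, here is a hedged sketch of how the port-443 snippet from the first post could be written so that it sends a permanent redirect. It assumes the rules live in a per-directory .htaccess file, where the leading slash is stripped from the path before the RewriteRule pattern sees it, so the target needs its own slash (the original target, https://example.com$1, would lose it). It also assumes TLS is terminated by Apache itself rather than by a proxy in front of it, so that %{HTTPS} reflects the real connection. example.com is the placeholder host from the original snippet.

```apache
RewriteEngine On

# Only act on requests that arrived over TLS with a www. hostname
RewriteCond %{HTTPS} =on
RewriteCond %{HTTP_HOST} ^www\. [NC]

# R=301 makes the redirect permanent; a bare R would send a 302.
# The query string is passed through by default, so QSA is not needed here.
RewriteRule ^(.*)$ https://example.com/$1 [R=301,L]
```

This only covers the https-with-www case; plain http requests still need their own handling, which is where chained redirects creep in.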
6:07 pm on Apr 26, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14898
votes: 648


Edit: When I saw the subject header, I thought “Dammit, this question has been asked eight thousand times already, can’t they read?” But no, all those details and variants mean it really is a valid and carefully-considered question. Onward!

To avoid chained redirects, make a single rule with two OR-delimited conditions.

Express your domain-name condition as a negative: “anything OTHER THAN this single preferred form”. The version you see most often hereabouts is--assuming without-www--
RewriteCond %{REQUEST_URI} !^/robots\.txt
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule (.*) https://example.com/$1 [R=301,L]
I added the RewriteCond involving robots.txt on my own HTTPS site because I noticed that some law-abiding robots seemed to get confused with a robots.txt redirect, and you should bend over backward to avoid giving them an excuse not to get this file. The alternative is to make a preliminary rule that says simply
RewriteRule ^robots\.txt - [L]
and put it before all other RewriteRules. Which approach you take will depend on the site.

If you are on shared hosting, and/or all your rules are in VirtualHost envelopes, the part making a hostname optional is actually not needed, because requests with no hostname simply will not reach the site in the first place. But it's only three bytes and a nanosecond of processing time between that and the simpler form
RewriteCond %{HTTP_HOST} !^example\.com$

As always, the domain-name-canonicalization redirect--which includes HTTPS--is the very last external redirect, typically right after your /index.html redirect.
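
For anyone who, like the original poster, would rather not write the domain into the file, the same single-redirect idea can be sketched without a literal hostname. Treat this as an assumption-laden variant, not the form usually recommended here: it builds the target from whatever Host header was sent, which is exactly the vagueness warned about earlier in the thread, so it only makes sense where the server answers to nothing but your own names. It also assumes a per-directory .htaccess file and TLS terminated by Apache itself.

```apache
RewriteEngine On

# Redirect when the request is not HTTPS, OR when the host starts with www.
RewriteCond %{HTTPS} !=on [OR]
RewriteCond %{HTTP_HOST} ^www\. [NC]

# Capture the requested host, minus any leading www. and any :port, into %1
RewriteCond %{HTTP_HOST} ^(?:www\.)?([^:]+) [NC]

# One 301 straight to https without www; REQUEST_URI keeps its leading slash
RewriteRule ^ https://%1%{REQUEST_URI} [R=301,L]
```

The first two conditions are OR-ed and the third is AND-ed onto the result, so a request that is already https and without www matches nothing and is left alone.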
7:51 pm on Apr 26, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2006
posts:1529
votes: 98


@Not2Easy, I wasn't able to find those threads before, thanks for taking the time and posting the links. I will keep reading and testing, will take a while and then post my findings here so others can take it from there in case they need it.

@Lucy24, yes, thanks! I appreciate you making the rules and their order clear so it works better. I'll see how much the code can be improved and corrected.

Thanks
2:30 am on May 17, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2006
posts:1529
votes: 98


Update:

I tried several approaches with the htaccess file. As reported before, they didn't fully work. It's easy to get frustrated with this because the same thing can be achieved in different ways using htaccess. You can find tutorials and documents on how to do something, get confirmation from other people that it "works perfectly", and then get horror stories from other users who don't explain why it failed or how to fix it. This creates confusion.

As for htaccess rules, I found two main methods reported repeatedly on different coding forums, all reported to work. My case? They didn't work. I could (1) remove the www and also (2) go from http to https, but I was never able to achieve the two on the fly. Only one rule was applied each time, not both. Imagine your favorite coding language performing a replace for letter A and letter C: every time I tried, one of the two was replaced but never both (and yes, I understand it can be done).

WordPress as a reference? I found forum threads sharing the htaccess code of WP as something that worked for them, and from there people discussed on different websites how to achieve the two rules. Some confirmed it worked; in other cases it didn't. I tried installing the latest WP and playing with it. Honestly, I could get [https // www whatever] and also [https // whatever], but there was no way to force https AND remove the www. Yes, I tried; maybe something was off on my server. I wasn't getting any error messages.

So this is how I achieved it. I know it is not the perfect way, but it works for me and I don't see any time wasted on loading the pages or on redirects.

Basically my CMS (which I wrote for my own sites) catches all requests via one main script; many CMSs do this. So I just added a preference there (yes, I'm using Perl): if $_gohttps equals "1", then the first lines of the main script send a 301 redirect and take you to the https link without www, all at once. The CMS also converts all the internal links to that structure (calls to images, javascript, and so on).

I would have tried even more with the htaccess file but, noticing it wasn't working, I wondered: if something doesn't work by the book all the time, then a migration to another server might break specific stuff (if I go too specific or use patches, tricks, etc.). Does that sound like what I want? So I went with code, and it's fast, with no delays. In fact my tests using different online tools report the website is faster now (even though I turned off the system cache, the one the CMS provides).


So, at the end of the thread there is no fix or solution I can post regarding that htaccess, but that's just my story.

4:58 am on May 17, 2018 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11388
votes: 156


did you try something like this?
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$ [NC]
RewriteRule (.*) https://example.com/$1 [R=301,L]


if so where did you put it in your mod_rewrite rulesets and what response did you get?

you can still try this and solve it using mod_rewrite without touching your script.
apache would simply redirect those requests instead of internally rewriting them to the script.
(assuming you put the hostname canonicalization ruleset before the internal rewrite in .htaccess...)

also check your server access log file for these redirected requests.
on some servers, a 301 generated after an internal rewrite to a script looks like a 200 response in the access log.
this may be a disadvantage for your requirements.
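
to illustrate the ordering point, a minimal sketch assuming a front-controller setup where every request that is not a real file or directory is handed to one script. /index.cgi is a hypothetical name standing in for the CMS's main script; the redirect block is the same one as above, with example.com as the placeholder host.

```apache
RewriteEngine On

# 1. External redirect first: anything that is not https://example.com/...
#    is answered by Apache itself with a 301 and never reaches the script.
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$ [NC]
RewriteRule (.*) https://example.com/$1 [R=301,L]

# 2. Internal rewrite last: only canonical requests get this far.
#    /index.cgi is a hypothetical front-controller script.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^ /index.cgi [L]
```

with the blocks in this order the redirect is issued by apache, not by the script, so the access log records it as a 301.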
4:30 pm on May 17, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2006
posts:1529
votes: 98


Thanks phranque

did you try something like this?
I found a few similar examples and tried them, and I don't remember any of those working. I tried tweaking the lines, but my knowledge of htaccess is limited, and when I don't use it for a long time it becomes more difficult. I didn't perform enough tests on such code examples because I wanted to stay away from writing the domain name directly in the htaccess file. I have a custom-built CMS running several sites, created in such a way that I can move the files around and update the core easily, because all the specifics are in config files. Writing specific domains in htaccess or scripts is an option, but not one that fits what I'm aiming for.

RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$ [NC]
RewriteRule (.*) https://example.com/$1 [R=301,L]


Thanks for providing such example code, I tried it and it works on all the tests (with and without WWW testing it on http and https), great. At this moment I already made modifications to my CMS and achieved the desired result, I know speed and performance could be something to discuss, so far so good, works and works fast. Even if there was some speed at stake it would only happen during the transition from HTTP (www) to HTTPS as the search engines update their database.

I would go with the htaccess option, but having already come this far with the CMS ready, I now only need to change one preference from 0 to 1 (HTTPS). This lets the CMS change different sets of behaviors at site level with just one change. I like it and will keep it that way. I can now go from http to https and back with just one change.

Thanks for the code example. To anyone looking for the same solution, I confirm this solves the issue perfectly on my side, as opposed to the previously posted examples that I couldn't get to work (other people on forums confirmed they worked; only in a few cases they didn't).

also check your server access log file for these redirected requests. on some servers, a 301 generated after an internal rewrite to a script looks like a 200 response in the access log.
Checked using online header tools and also checked the server logs, it looks good, the script in my case is returning a solid 301.

Thanks for the help!
5:10 pm on May 17, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14898
votes: 648


on some servers, a 301 generated after an internal rewrite to a script looks like a 200 response in the access log.
I would say on most servers, since the server log only records that the request has been successfully handed off to the script; if the script itself physically exists, that's a 200. Similarly, a CMS-based site may never show a 404 in access logs, since it is the script's job to figure out whether the requested content exists.

I well remember the trouble I had wrapping my brain around the idea that the response the server sends out is not necessarily the response the visitor receives.
6:20 pm on May 17, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2006
posts:1529
votes: 98


Lucy24: I would say on most servers, since the server log only records that the request has been successfully handed off to the script; if the script itself physically exists, that's a 200.

True. Then it is important to add that we should properly:

1. Send the right http headers
2. Check the codes and responses on the server log
3. Check the codes and responses on your web developer tool of choice
4. Check for the same on online services that read your headers and responses

This means that if we serve a 404 page, it has to go out with the right HTTP headers, not just a title and page content saying "404"; the same goes for 301. HTTP headers are something we don't see as end users on the page, but we can see them using the right tools (Chrome dev tools, Firebug, etc.). Also, all the needed redirects must be done, ideally at once, on the fly, in one single step. We should avoid one redirect to remove the WWW and another redirect to go from http to https; both should be done at once.
4:55 pm on May 24, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2006
posts:1529
votes: 98


Update:

I opened 2 threads about going SSL (HTTPS); the other one is about SSL certificates. I'm updating just this one, as it deals with redirection, and that means traffic, keeping your links, impact on search results, etc. I attempted htaccess but ended up using a new routine in my scripts (own CMS). I remember threads from people concerned about the impact of the switch (I was one of them), so here is what happened.

1. Losing traffic.
Zero. Simply put, the redirection in my script did the job (and the same can be achieved using the htaccess code posted above by @phranque, thanks for the help!). The thing is, script or no script, as long as all your http requests are turned into https requests (redirections with a 301 HTTP status), you lose nothing; everything is turned directly into a proper redirect.

2. What the 301 redirect does.
First you have to set it up the right way (HTTP header in your script, or htaccess code). The redirection tells the search engine where to go, and the 301 code tells the search engine "stop looking for A, from now on look only here...". Eventually the search engine will start ignoring the first link and will keep only the last one you set via 301. The 301 code stands for "301 Moved Permanently".

3. Search engine updating the links.
I was able to see and confirm changes on the most visited links in as little as 24 hours, going from http:// WWW to appearing as https:// SITE . Then some more links followed, but some were not updated. I kept waiting and nothing. Then, despite my redirects being properly set, I saw some links listed as https:// WWW and I don't know why (yes, I insist, I tested carefully). So, on to the next point.

4. Update your site on Webmaster Tools (Google)
There you can add your site (if you don't have it there already) as http or https. With or without WWW? You can set your preference. After making that update the changes happened fast: within 6 hours I already saw search engine updates for my most visited site. Bing? Not yet, still showing http.
5:30 pm on May 24, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14898
votes: 648


Eventually the search engine will
For a given definition of “eventually”. Bing, in particular, just refuses to give up. For some reason, this particularly applies to images; to this day I’m getting redirected requests, both for images that moved to a different site in 2013, and images that went HTTPS a year ago. Generally the same few images, so it's not as if the search engine didn't know or hasn't had time to catch up.

:: detour for some number-crunching ::

Search-engine HTTP requests have dropped significantly, but they still check the root many times a day. (I think the idea is that this tells them whether the whole site is still redirecting from HTTP to HTTPS; other pages only get requested a couple times a month for random spot-checking.) In the past month:
39% of redirects Bing
26% Google
22% Yandex
and the rest is miscellaneous. Yup, they're slow on the uptake.

Further checking reveals that some of those redirected image requests with bing referer never follow up the redirect. I kinda suspect these are pictures that come up on the image SERP, involving some kind of caching: what the human sees isn't what the search engine requests.
3:14 pm on May 25, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2006
posts:1529
votes: 98


just refuses to give up

True. I'm seeing on the logs crawlers still checking for images from old galleries removed 4 years ago, and a few pages related to the galleries. 4 years seems like a lot of time to me, and wasted resources from search engine/crawlers, unless it's something else disguised.