Forum Moderators: phranque
# Externally redirect from non-www, non-canonical hostname to the
# canonical www hostname, preserving current HTTP/HTTPS protocol
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteCond %{SERVER_PORT}s ^(443(s)|[0-9]+s)$
RewriteRule (.*) http%2://www.example.com/$1 [R=301,L]
Note: the forum software mangles pipe characters when posting. If the second RewriteCond pattern shows a broken pipe "¦", replace it with a solid pipe "|" before use.
Jim
You may have two different virtual servers with different DocumentRoots for each -- one for http, and the other for https. In this case, the code will have to be duplicated in both virtual servers' <DocumentRoot>/.htaccess files.
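For that split setup, a minimal sketch of the duplicated rules follows. Since each virtual server answers on only one port, the scheme can be hard-coded in each file and the port-sniffing condition dropped (the hostname example.com is a placeholder):

```apache
# In the HTTP virtual server's <DocumentRoot>/.htaccess
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

```apache
# In the HTTPS virtual server's <DocumentRoot>/.htaccess
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule (.*) https://www.example.com/$1 [R=301,L]
```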
Jim
I'm using this code on one of our sites, but Google WMT 'Content Analysis' started reporting duplicate content issues - duplicate meta descriptions and duplicate title tags for the index page:
/
/?sid=87-fcnh475-sjjhf-2644-fmk1978-vjfn
/?sid=11a0426a-8ec8-4c04-82cb-0987b2ac8e31
/?sid=9a81ac51-1a4b-447c-99ef-0aec83eef94e
and so on.
I suspect the RewriteRule must be:
RewriteRule (.*) http%2://www.example.com$1 [R=301,L]
But... if I change it, http(s)://example.com/somepage.html no longer redirects to http(s)://www.example.com/somepage.html
Might need to study some PHP config settings in detail. There is HTTP_ACCEPT = */* in there, and I'm not sure what effect that may have. BTW, it's Apache/1.3.33.
The duplicates appear to be coming from session IDs
Besides, does that mean that competitors can compromise your ranking by linking to your site using different query strings?
How do you want to handle those?
Do you want to strip them off here, and issue a new one only for logged in users?
*** Besides, does that mean that competitors can compromise your ranking by linking to your site using different query strings? ***
Yes, if your site returns "200 OK" and content for parameters that should really return "404 Not Found".
It looks like you have far wider issues with the whole site beyond the simple stuff you asked in your very first question.
1) Do not serve session IDs to search engine spiders. This is a well-known rule.
2) Correct any inherent canonicalization issues on your site: Your pages (and/or the scripts that create them) should link to one and only one URL for any given unique 'page' of content. This is to include protocols, domains, URL-paths, and query strings: No variation whatsoever in the values, case, or order of these fields should be allowed.
3) Implement server-side code to force protocol, domain, URL-path, and query string canonicalization by generating a 301-Moved Permanently redirect to the canonical URL if any non-canonical URL is requested from your site. Under some circumstances, you may prefer to return a 404-Not Found.
If you allow the same 'page' to be accessible at more than one URL, you lose control: your competitors may release a 'storm' of non-canonical links to your pages. This will not give you a 'penalty' per se, but it has the potential to at least temporarily dilute the PageRank of the canonical URL, and it will be the search engine that 'picks' what it thinks is the correct URL, and not you. You will be relying on a back-end process at each of the search engines to identify duplicate-content issues on your site and to take steps to rectify them. Will they get it right? Will they always get it right? -- You tell me, I don't know.
All of these subjects have been fairly well-covered here over the years; I commend the 'site search' links at the top of each page to you.
Jim
Your sitemap gives Google a clue as to the URLs that you deem to be important
Come on... I don't trust you, but if I find out that you're telling the truth you should pay for it!? Very interesting concept, isn't it?
It doesn't limit spidering to just those.
I'll recommend a particularly-well-titled thread to you in closing: Duplicate content - Get it right or perish [webmasterworld.com].
Jim
And why should I be penalized for duplicate content?
According to Google, duplicate content is only a real problem if it comes from another domain: [googlewebmastercentral.blogspot.com]
If you don't want to worry about sorting through duplication on your site, you can let us worry about it instead.
Duplicate content doesn't cause your site to be penalized.
"The fact remains that unless we take steps on the server side to absolutely prevent multiple URLs from accessing content, we hand over the 'welfare' of our PageRank and link-popularity to an 'extra step' of back-end de-duplication processing by the search engines.
"That de-duplication process might have a bug -- now or later. Or perhaps it can't always be run for all URLs before a new index is deployed, and the URLs from your site might not get processed.
"If you care about your ranking or if it's important to your revenue, I advise putting the preventative measures in place on your server so that any particular piece of content can be accessed by one and only one URL, and all other variants --whether caused by human or machine error, and regardless of technical validity-- be 301-redirected to the single canonical URL. In this way, you control your own destiny, rather than relying on the search engines to "figure it out."
Couldn't have put it better myself.
My original interest in that thread, however, was caused by the fact that the duplicate issues in my WMT reports began after I applied code very similar to what jdMorgan recommended above. I guess this might be pure coincidence, as we all know Google also changed their algo a month ago.
How can I exclude non-existent query strings (404 them, or 301-redirect them) while keeping the redirect from the non-www, non-canonical hostname to the canonical www hostname, preserving the current HTTP/HTTPS protocol?
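One way to sketch this, building on the port-sniffing trick already in the thread. Everything here is an assumption for illustration: "sid" is assumed to be the session parameter name, and note that stripping it unconditionally will break sessions for any visitor whose browser rejects cookies, so you may want to limit the rule per jdMorgan's question about issuing sessions only to logged-in users. The trailing "?" in the substitution discards the query string, and %2 refers to the last matched RewriteCond, so the port condition must come last:

```apache
RewriteEngine On

# 1) Redirect any URL carrying a sid= parameter to the same URL with
#    its query string discarded; %2 is "s" on port 443, empty otherwise
RewriteCond %{QUERY_STRING} (^|&)sid=
RewriteCond %{SERVER_PORT}s ^(443(s)|[0-9]+s)$
RewriteRule (.*) http%2://www.example.com/$1? [R=301,L]

# 2) Then the original non-www, non-canonical hostname redirect
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteCond %{SERVER_PORT}s ^(443(s)|[0-9]+s)$
RewriteRule (.*) http%2://www.example.com/$1 [R=301,L]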
Last year, I worked on a site where all URLs used these formats:
/
/123
/123/1234567
The category URL was always three digits.
The page within was always seven digits.
If you requested a URL with:
- the wrong number of digits, then you got a 404 error (irrespective of whether there were any other problems with the URL);
- the right number of digits but with a trailing slash, then the server returned a 301 redirect to remove the trailing slash, and force www at the same time whether or not www was in the original request, and remove any additional parameters too;
- the right number of digits, but with some parameters, then the server would redirect to strip the parameters, remove any trailing slash if one was present, and force www whether or not that was in the original request;
- the right number of digits, but without the www, then the server redirected to add the www back on the URL, and it also removed any additional parameters at the same time.
Additionally, if you requested the "dynamic URL" format (with the right number of digits in the parameters) to try to directly access the script, then you were redirected to the correct "static" URL format for that page. At the same time, and within the same redirect, www was added whether or not it was present in the original request. If any additional parameters were present, they were also stripped in the same redirect. This redirect was listed first in the .htaccess file.
So, only "correct format" URL requests were ever passed to the PHP script running the site (via one of two simple rewrites). Everything else received either a 301 to the right URL, or a 404.
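The layering described above can be sketched in .htaccess terms. Everything here is illustrative only: the script name index.php and the parameter names cat and page are assumptions, not the code from the linked posts. Checking THE_REQUEST (the client's original request line, which internal rewrites never change) is what keeps the dynamic-to-static redirect from looping:

```apache
RewriteEngine On

# 1) If the client itself requested the dynamic URL, redirect to the
#    "static" format, forcing www and discarding the query string.
#    THE_REQUEST is never altered by internal rewrites, so no loop.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.php\?cat=([0-9]{3})&page=([0-9]{7})\ HTTP/
RewriteRule ^index\.php$ http://www.example.com/%1/%2? [R=301,L]

# 2) Redirect well-formed URLs that carry a non-www host, a query
#    string, or a trailing slash to the single canonical form
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [OR]
RewriteCond %{QUERY_STRING} . [OR]
RewriteCond %{REQUEST_URI} /$
RewriteRule ^([0-9]{3})(/[0-9]{7})?/?$ http://www.example.com/$1$2? [R=301,L]

# 3) Internally rewrite only correct-format requests to the script;
#    the script then 404s any IDs not found in the database
RewriteRule ^([0-9]{3})$ /index.php?cat=$1 [L]
RewriteRule ^([0-9]{3})/([0-9]{7})$ /index.php?cat=$1&page=$2 [L]

# Anything else matches no rule and falls through to the server's 404
```

Any URL with the wrong number of digits matches none of the patterns, so it is never passed to PHP at all.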
So, for valid format requests, once the PHP script had examined the requested URL, it looked in the database to see if the corresponding records existed. If they did not, then the script itself would send a 404 HTTP Header and a "Not Found" error message to the user.
In this way there is no occasion for anything to ever be indexed under the "wrong" URL.
Some of the code for "pages" can be found at: [webmasterworld.com...] or at: [webmasterworld.com...]
Code for "categories" wasn't included in those posts.
[edited by: g1smd at 1:17 am (utc) on Jan. 2, 2009]
If there are only 23561 products in the database and someone requests product ID 84001 then it is the PHP script that sends the 404 error.
If someone asked for product ID 78245698294381128 then it was .htaccess that sent the 404 Error - as the number of digits was incorrect. No point in even passing the request to the PHP to look anything up.
The "tightness" of the URL formats was designed-in from Day One. This layering of redirects (to canonical format, for some types of error in request), rewrites (only when URL request is in the right format), and scripting (to deliver the page, or an error message), ensures that all non-valid requests are rejected and are sent either a 404 error or a 301 redirect.
Nothing is left to chance.
There is an answer for everything.