Forum Moderators: phranque
# Externally redirect from non-www, non-canonical hostname to the
# canonical www hostname, preserving current HTTP/HTTPS protocol
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteCond %{SERVER_PORT}s ^(443(s)|[0-9]+s)$
RewriteRule (.*) http%2://www.example.com/$1 [R=301,L]
Note: the forum software mangles pipe characters when posting. If the second RewriteCond pattern shows a broken pipe "¦", replace it with a solid pipe "|" before use.
Jim
You may have two different virtual servers with different DocumentRoots for each -- one for http, and the other for https. In this case, the code will have to be duplicated in both virtual servers' <DocumentRoot>/.htaccess files.
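For that split setup, a minimal sketch of the duplicated rules follows. Since each virtual server answers on only one port, the scheme can be hard-coded in each file and the port-sniffing condition dropped (the hostname example.com is a placeholder):

```apache
# In the HTTP virtual server's <DocumentRoot>/.htaccess
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

```apache
# In the HTTPS virtual server's <DocumentRoot>/.htaccess
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule (.*) https://www.example.com/$1 [R=301,L]
```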
Jim
I'm using this code on one of our sites, but Google WMT 'Content Analysis' started reporting duplicate content issues - duplicate meta descriptions and duplicate title tags for the index page:
/
/?sid=87-fcnh475-sjjhf-2644-fmk1978-vjfn
/?sid=11a0426a-8ec8-4c04-82cb-0987b2ac8e31
/?sid=9a81ac51-1a4b-447c-99ef-0aec83eef94e
and so on.
I suspect the RewriteRule must be:
RewriteRule (.*) http%2://www.example.com$1 [R=301,L]
But... if I change it, http(s)://example.com/somepage.html no longer redirects to http(s)://www.example.com/somepage.html
Might need to study some PHP config settings in detail. There is HTTP_ACCEPT = */* in there, and I'm not sure what effect that may have. BTW, it's Apache/1.3.33.
The duplicates appear to be coming from session IDs
Besides, does that mean that competitors can compromise your ranking by linking to your site using different query strings?
How do you want to handle those?
Do you want to strip them off here, and issue a new one only for logged in users?
*** Besides, does that mean that competitors can compromise your ranking by linking to your site using different query strings? ***
Yes, if your site returns "200 OK" and content for parameters that should really return "404 Not Found".
It looks like you have far wider issues with the whole site beyond the simple stuff you asked in your very first question.
1) Do not serve session IDs to search engine spiders. This is a well-known rule.
2) Correct any inherent canonicalization issues on your site: Your pages (and/or the scripts that create them) should link to one and only one URL for any given unique 'page' of content. This is to include protocols, domains, URL-paths, and query strings: No variation whatsoever in the values, case, or order of these fields should be allowed.
3) Implement server-side code to force protocol, domain, URL-path, and query string canonicalization by generating a 301-Moved Permanently redirect to the canonical URL if any non-canonical URL is requested from your site. Under some circumstances, you may prefer to return a 404-Not Found.
If you allow the same 'page' to be accessible at more than one URL, you lose control: your competitors may release a 'storm' of non-canonical links to your pages. This will not give you a 'penalty' per se, but it has the potential to at least temporarily dilute the PageRank of the canonical URL, and it will be the search engine that 'picks' what it thinks is the correct URL, and not you. You will be relying on a back-end process at each of the search engines to identify duplicate-content issues on your site and to take steps to rectify them. Will they get it right? Will they always get it right? -- You tell me, I don't know.
All of these subjects have been fairly well-covered here over the years; I commend the 'site search' links at the top of each page to you.
Jim
Your sitemap gives Google a clue as to the URLs that you deem to be important
Come on... I don't trust you, but if I find out that you're telling the truth you should pay for it!? Very interesting concept, isn't it?
It doesn't limit spidering to just those.
I'll recommend a particularly-well-titled thread to you in closing: Duplicate content - Get it right or perish [webmasterworld.com].
Jim
And why should I be penalized for duplicate content?
According to Google, duplicate content is only a real problem if it comes from another domain: [googlewebmastercentral.blogspot.com]
If you don't want to worry about sorting through duplication on your site, you can let us worry about it instead.
Duplicate content doesn't cause your site to be penalized.
"The fact remains that unless we take steps on the server side to absolutely prevent multiple URLs from accessing content, we hand over the 'welfare' of our PageRank and link-popularity to an 'extra step' of back-end de-duplication processing by the search engines.
"That de-duplication process might have a bug -- now or later. Or perhaps it can't always be run for all URLs before a new index is deployed, and the URLs from your site might not get processed.
"If you care about your ranking or if it's important to your revenue, I advise putting the preventative measures in place on your server so that any particular piece of content can be accessed by one and only one URL, and all other variants --whether caused by human or machine error, and regardless of technical validity-- be 301-redirected to the single canonical URL. In this way, you control your own destiny, rather than relying on the search engines to "figure it out."
Couldn't have put it better myself.
My original interest in that thread, however, was caused by the fact that the duplicate issues in my WMT reports began after I applied code very similar to what jdMorgan recommended above. I guess this might be pure coincidence, as we all know Google also changed their algo a month ago.
How can I exclude non-existent query strings (404 them, or 301-redirect them) while keeping the redirect from the non-www, non-canonical hostname to the canonical www hostname, preserving the current HTTP/HTTPS protocol?
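One way to sketch this, building on the port-sniffing trick already in the thread. Everything here is an assumption for illustration: "sid" is assumed to be the session parameter name, and note that stripping it unconditionally will break sessions for any visitor whose browser rejects cookies, so you may want to limit the rule per jdMorgan's question about issuing sessions only to logged-in users. The trailing "?" in the substitution discards the query string, and %2 refers to the last matched RewriteCond, so the port condition must come last:

```apache
RewriteEngine On

# 1) Redirect any URL carrying a sid= parameter to the same URL with
#    its query string discarded; %2 is "s" on port 443, empty otherwise
RewriteCond %{QUERY_STRING} (^|&)sid=
RewriteCond %{SERVER_PORT}s ^(443(s)|[0-9]+s)$
RewriteRule (.*) http%2://www.example.com/$1? [R=301,L]

# 2) Then the original non-www, non-canonical hostname redirect
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteCond %{SERVER_PORT}s ^(443(s)|[0-9]+s)$
RewriteRule (.*) http%2://www.example.com/$1 [R=301,L]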
Last year, I worked on a site where all URLs used these formats:
/
/123
/123/1234567
The category URL was always three digits.
The page within was always seven digits.
If you requested a URL with:
- the wrong number of digits, then you got a 404 error (irrespective of whether there were any other problems with the URL);
- the right number of digits but with a trailing slash, then the server returned a 301 redirect to remove the trailing slash, and force www at the same time whether or not www was in the original request, and remove any additional parameters too;
- the right number of digits, but with some parameters, then the server would redirect to strip the parameters, remove any trailing slash if one was present, and force www whether or not that was in the original request;
- the right number of digits, but without the www, then the server redirected to add the www back on the URL, and it also removed any additional parameters at the same time.
Additionally, if you requested the "dynamic URL" format (with the right number of digits in the parameters) to try to directly access the script, then you were redirected to the correct "static" URL format for that page. At the same time, and within the same redirect, www was added whether or not it was present in the original request. If any additional parameters were present, they were also stripped in the same redirect. This redirect was listed first in the .htaccess file.
So, only "correct format" URL requests were ever passed to the PHP script running the site (via one of two simple rewrites). Everything else received either a 301 to the right URL, or a 404.
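The layering described above can be sketched in .htaccess terms. Everything here is illustrative only: the script name index.php and the parameter names cat and page are assumptions, not the code from the linked posts. Checking THE_REQUEST (the client's original request line, which internal rewrites never change) is what keeps the dynamic-to-static redirect from looping:

```apache
RewriteEngine On

# 1) If the client itself requested the dynamic URL, redirect to the
#    "static" format, forcing www and discarding the query string.
#    THE_REQUEST is never altered by internal rewrites, so no loop.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.php\?cat=([0-9]{3})&page=([0-9]{7})\ HTTP/
RewriteRule ^index\.php$ http://www.example.com/%1/%2? [R=301,L]

# 2) Redirect well-formed URLs that carry a non-www host, a query
#    string, or a trailing slash to the single canonical form
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [OR]
RewriteCond %{QUERY_STRING} . [OR]
RewriteCond %{REQUEST_URI} /$
RewriteRule ^([0-9]{3})(/[0-9]{7})?/?$ http://www.example.com/$1$2? [R=301,L]

# 3) Internally rewrite only correct-format requests to the script;
#    the script then 404s any IDs not found in the database
RewriteRule ^([0-9]{3})$ /index.php?cat=$1 [L]
RewriteRule ^([0-9]{3})/([0-9]{7})$ /index.php?cat=$1&page=$2 [L]

# Anything else matches no rule and falls through to the server's 404
```

Any URL with the wrong number of digits matches none of the patterns, so it is never passed to PHP at all.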
So, for valid format requests, once the PHP script had examined the requested URL, it looked in the database to see if the corresponding records existed. If they did not, then the script itself would send a 404 HTTP Header and a "Not Found" error message to the user.
In this way there is no occasion for anything to ever be indexed under the "wrong" URL.
Some of the code for "pages" can be found at: [webmasterworld.com...] or at: [webmasterworld.com...]
Code for "categories" wasn't included in those posts.
[edited by: g1smd at 1:17 am (utc) on Jan. 2, 2009]
If there are only 23561 products in the database and someone requests product ID 84001 then it is the PHP script that sends the 404 error.
If someone asked for product ID 78245698294381128 then it was .htaccess that sent the 404 Error - as the number of digits was incorrect. No point in even passing the request to the PHP to look anything up.
The "tightness" of the URL formats was designed-in from Day One. This layering of redirects (to canonical format, for some types of error in request), rewrites (only when URL request is in the right format), and scripting (to deliver the page, or an error message), ensures that all non-valid requests are rejected and are sent either a 404 error or a 301 redirect.
Nothing is left to chance.
There is an answer for everything.