Forum Moderators: mack
How much of a problem is this, how does it happen, and what can I do about it? Please keep in mind that until a couple of days ago, I had never heard of "canonical problems," and am not even really sure what the term means (though I did Google it, and ended up back on another WebmasterWorld forum, where the discussions were over my head). :-)
Many thanks for any and all enlightenment.
It's a big problem; essentially, SEs see www.domain.com and domain.com as two different sites. This divides your ranking between two domains, and induces Google to list half the pages as 'supplemental' results. And if other issues apply (such as identical meta tags), pages could be dropped completely.
It does not apply to all sites - if a site has always used one form (www.domain.com or domain.com), and no internal or external links exist to the alternate, then peace reigns.
But links to both forms precipitate the problem: Google follows the link, respiders the site, and adds it to the index. It then discovers that it duplicates the other form, and the problems go on from there.
The solution is alarmingly simple: Choose one - the convention is www.domain.com but the choice is yours - and '301 redirect' from the other to the chosen domain.
[edited by: Quadrille at 3:06 pm (utc) on Sep. 22, 2006]
I have been renaming files and reorganizing my directory structure a lot, so I guess I've discovered part of the problem.
I have never done anything with the .htaccess file directly, but my website's Cpanel provides a form for doing redirects -- this I assume will accomplish the same thing?
I see a 301 is a "permanent redirect." So far I've only done "temporary redirects" even though the moves were permanent -- maybe that was a mistake too? (Duh!)
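For a renamed page, the difference between the two redirect types can be sketched with mod_alias roughly as follows. The file names and www.example.com are placeholders invented for the example, not taken from any real site, and only one of the two directives would be used.

```apache
# Hypothetical example: /old-page.htm was renamed to /new-page.htm.

# Temporary redirect (302): tells clients the old URL is still the one to keep.
# This is also what "Redirect" does when no status is given.
# Redirect temp /old-page.htm http://www.example.com/new-page.htm

# Permanent redirect (301): tells clients to use the new URL from now on.
Redirect permanent /old-page.htm http://www.example.com/new-page.htm
```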
"It does not apply to all sites - if a site has always used one form (www.domain.com or domain.com), and no internal or external links exist to the alternate, then peace reigns."
I have been trying to wrap my brain around this issue but am still not getting it. Nearly all my internal links are root-relative URLs. The only exceptions are some absolute URL links to main sections of my site in a footer at the bottom of each page -- PLUS --
--until recently, all the links in my site directory were absolute. I changed them all to root-relative just to make things simpler for myself. By any chance, could this alone account for my website's nosedive?
If so, would creating and submitting a proper Google sitemap help correct this or make it worse?
As for external links, I don't see how I have much control over whether other sites link to me using www or no-www?
Welcome to the (mis)adventures of a newbie. :-)
"The solution is alarmingly simple: Choose one - the convention is www.domain.com but the choice is yours - and '301 redirect' from the other to the chosen domain."
Do you mean I should do a separate redirect for every page of my site that is listed w/o www, say, to point to the www version?
Your host's cPanel may do the job (mine does), but you need to be sure, as not all will do a 301 - if in doubt, ask them.
You may or may not have other problems, but you know you have this one, and others will be easier to find once this is out of the way. Having said that, I'd certainly replace any other redirects you have with 301s. Redirects are often a source of problems, and you've nothing to lose by being safe.
You'd do well to check all your site navigation - look up Xenu; it will become your best friend, once you've learned to stop it giving too much information!
A Google site map is worth doing for two reasons: 1. it does what it says on the packet and 2. It demands you sort out your site if it encounters problems, so it can be a valuable fault-finder.
[edited by: Quadrille at 4:43 pm (utc) on Sep. 22, 2006]
I will ask my host about the redirect. In the meantime, I found this "fix" you're supposed to add to your .htaccess file. Will this work?
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(([^/]+/)*)index\.html\ HTTP/
RewriteRule index\.html$ http://www.example.com/%1 [R=301,L]
And I will look up Xenu. Wow, good thing this stuff is interesting....
[edited by: Brett_Tabke at 2:32 pm (utc) on Sep. 24, 2006]
[edit reason] added space before ! per poster request [/edit]
I did that for two reasons: since we have so many long domains, it makes the URLs shorter and better looking, and it just seems to make sense since so many people no longer type in the www anyway, including myself.
Before doing that we check the current PageRank. If the GPR is higher for the www version we do not forward to the non-www, but that is relatively rare, as most of the sites have the same PR for both versions. In fact, the non-www occasionally has better PR even though we never forwarded to the non-www in the past.
Here is the htaccess code we are using (which works well and seems simple vs some other methods):
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{HTTP_HOST} ^www\.name\.com [NC]
RewriteRule ^(.*)$ http://name.com/$1 [L,R=301]
ErrorDocument 404 http://name.com/index.html
ErrorDocument 401 http://name.com/index.html
ErrorDocument 402 http://name.com/index.html
ErrorDocument 400 http://name.com/index.html
I asked previously whether the "Redirect" form in my Cpanel would create a 301 redirect. I'm still waiting to hear back from my web host, but I just created a "permanent" redirect through the cpanel form, then looked at my .htaccess file to see what kind of code it created. This is the code:
RedirectMatch permanent ^/mypage.htm$ [mysite.com...]
Since this looks nothing like any of the other code, I assume it's not a 301 Redirect?
(I tried to edit the example link so it wouldn't appear as a link, but couldn't get it right--hope it's not a problem.)
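For what it is worth, in mod_alias the keyword "permanent" is the documented synonym for status 301, so the two lines below should behave identically. This is only a sketch: the target URL is a made-up placeholder, because the real one was truncated in the post above, and the dot in the pattern is escaped here so that it matches only a literal dot.

```apache
# Both directives return "301 Moved Permanently"; use one, not both.
# The target URL is a hypothetical placeholder.
RedirectMatch permanent ^/mypage\.htm$ http://www.example.com/newpage.htm
# RedirectMatch 301 ^/mypage\.htm$ http://www.example.com/newpage.htm
```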
[edited by: miki99 at 6:36 pm (utc) on Sep. 22, 2006]
Your ErrorDocument directives are malformed, and will lead to a 302-Moved Temporarily response for any of those errors. See the notes in the ErrorDocument documentation [httpd.apache.org] for details on this.
To return the correct response code for each error, use:
ErrorDocument 404 /index.html
ErrorDocument 401 /index.html
ErrorDocument 402 /index.html
ErrorDocument 400 /index.html
miki,
To avoid a double redirect on client index.html requests, I'd suggest:
RewriteEngine on
RewriteBase /
#
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(([^/]+/)*)index\.html\ HTTP/
RewriteRule index\.html$ http://www.example.com/%1 [R=301,L]
#
RewriteCond %{HTTP_HOST} !^www\.example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
It also removes the end-anchor from the domain name test, so it won't break if a port number is appended, and removes the unneeded anchoring on the ".*" RewriteRule pattern.
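If the end-anchor is wanted anyway, one possible variant keeps it but allows an optional port number explicitly. This is a sketch that is not part of the code above and has not been tested on a live server:

```apache
# Redirect unless the Host header is exactly www.example.com,
# optionally followed by a port number such as ":80".
RewriteCond %{HTTP_HOST} !^www\.example\.com(:[0-9]+)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```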
Jim
[edited by: jdMorgan at 6:48 pm (utc) on Sep. 22, 2006]
It's OK Jim -- that's what I figured you meant.
"miki, take this one step at a time. If you have not yet created custom error documents, then using those directives will cause a problem on your site. Read the documentation cited in my first post to find out more."
Thanks very much again, Jim. Frankly, I don't even know what programming language the .htaccess code is--ASP? I will read the documentation you cited (missed it the first time, sorry).
I just heard back from my host that using the form in my Cpanel WILL create a 301 redirect. Maybe I should just do it that way?
Using the method from my earlier post, I tested a made-up .htm name with the free header checker at webconfs.com. It forwards to the index page with this: HTTP/1.1 302 Found =>
Then, after the change you suggested, it still forwards to the index page and returns this: HTTP/1.1 404 Not Found =>
So it does work either way but the 404 error code now is accurate. Thanks again JD, most appreciated.
BTW, why does it make a difference if the error code 404 is indicated correctly or not since they both go to the index page? Sorry if that is a dumb question.
[mysite.com...] to [mysite.com...]
Just:
[mysite.com...] to [mysite.com...]
That wouldn't work for redirecting my whole domain, or would it?
miki
Glad you got it working!
Trader,
> BTW, why does it make a difference if the error code 404 is indicated correctly or not since they both go to the index page? Sorry if that is a dumb question.
HTTP/1.1 [w3.org] server status codes [w3.org] have very specific meanings to clients -- i.e. browsers and search engine robots. By returning a 302 status, you are telling search engines to keep the error URL (the one that caused a 404, for example) in their index, and ascribe the content of the redirect target page (your home page, in this instance) to the error URL.
Therefore, duplicate copies of your home page will appear to exist with these error URLs, in addition to your actual home page URL. This opens you up to duplicate-content problems (do a search [google.com] for dozens of threads here on WebmasterWorld) and to exploits intended to create duplicate content problems for you...
Returning incorrect server status codes means your site is broken at the most fundamental level -- The HTTP protocol used to request and transmit Web pages.
Since you're asking for an explanation, let me offer this advice: Do not feel free to toy with the HTTP protocol, post an untested robots.txt file, or leave canonicalization problems unremedied. If you confuse the search engine robots in the slightest way, your site is likely to suffer. At the very best, you leave yourself at their mercy to "figure out what I meant, not what I said."
Jim
I will also be hesitant to post any code in the future for fear of giving bad code. Will be studying the 7,100 search results. Much appreciated and glad I asked the dumb question!
Also, there's no such thing as a "dumb question" -- There are only lots of broken, poorly-ranked Web sites because documentation wasn't read, questions weren't asked, or broken code wasn't posted.
Jim
RewriteEngine On
RewriteCond %{HTTP_HOST} ^mysite\.com
RewriteRule (.*) [mysite.com...] [R=301]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html?
RewriteRule ^index\.html?$ [mysite.com...] [R=301,L]
following this I have several simple redirects listed
some users are using [NC] after the first entry, some not,
some users are using a $ after the first .com, some not
Despite the tons of examples and docs, mod_rewrite is voodoo...
Additionally you can set special flags for CondPattern by appending [flags] as the third argument to the RewriteCond directive. Flags is a comma-separated list of the following flags:
* 'nocase|NC' (no case)
This makes the test case-insensitive, i.e., there is no difference between 'A-Z' and 'a-z' both in the expanded TestString and the CondPattern. This flag is effective only for comparisons between TestString and CondPattern. It has no effect on filesystem and subrequest checks.
* 'ornext|OR' (or next condition)
And the $ there is the regex "end of line" anchor.
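The effect of the two options can be sketched like this. The host values in the comments are illustrative, www.mysite.com is assumed as the redirect target, and only one of the two conditions would be used in a real file.

```apache
# Without "$" and without [NC]: matches any Host value that starts with
# lowercase "mysite.com", including "mysite.com:80".
# RewriteCond %{HTTP_HOST} ^mysite\.com

# With "$" and [NC]: matches exactly "mysite.com" in any letter case
# ("MySite.com" matches, "mysite.com:80" does not).
RewriteCond %{HTTP_HOST} ^mysite\.com$ [NC]
RewriteRule (.*) http://www.mysite.com/$1 [R=301,L]
```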
The Apache manual has a lot of helpful information, but it is just helpful.
I think you have to hammer it, then try it out.
Then hammer, and try some more.
When it works like a charm, you need to test it a lot -- before Google finds out what you forgot.
As a rewrite-newbie, I would like to see a collection or references to tested and proven rewrite rules for the different situations that have recently become required.
I think I'll read this too [webmasterworld.com]
For index pages it creates a Redirection Chain. You do not want a chain.
For domain.com/index.html you first get redirected to www.domain.com/index.html and then finally to www.domain.com/.
It is far better to do things in one step.
For */index.html ---> www.domain.com/
For domain.com/* ---> www.domain.com/*
To correct the problem, check for index redirection first, before detecting for non-www redirection.
This will work because the index redirection code does not test the source domain, but always specifies www as the destination.
I see that someone has posted a corrected version too (bottom of previous page). That works fine.
.
Just wading into the discussion on Error pages:
ErrorDocument 404 http://name.com/index.html
As noted above, this always produces a "302" response code, because you included "http://" in the URL. That is the wrong response code. It needs to produce a "404" code in the HTTP header for the bot to know that it really is a 404 page.
.
ErrorDocument 404 /index.html
This is still problematical.
Although it now returns the correct "404" response code, it is very bad form to serve an exact copy of your root index page in response to an error.
You are much better off making your own custom error pages containing two essential elements:
- the fact that an error has occurred.
- some basic site navigation to get the user on their way.
I always put those custom pages in their own folder.
The .htaccess directive then becomes:
ErrorDocument 404 /error.pages/error.404.html
Those custom error pages also contain <meta name="robots" content="noindex"> to stop them being indexed at their "real" URL.
That, too, is important.
I assume that you are talking about the $1 notation.
If you are referring to the RewriteRule line then I hope I can explain some things here.
In that line you are taking one URL and turning it in to some other.
.
The first bit specifies what the original URL was, the second bit specifies what it becomes.
You can collect up a chunk of URL in a (.*) and then re-use it in a $1 later.
If you had a second (.*) you could reuse it in a $2 later.
You can put ^ and $ around a (.*), or around a whole expression, to denote an exact match, like: ^(.*)$ or even ^somefolder/(.*)$ or in this case perhaps ^(.*)index\.html$ (in .htaccess the pattern is tested without the leading slash).
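As an illustration of two captured groups, here is a hypothetical rule that moves pages from one folder to another and swaps two path segments on the way. The folder names and www.example.com are invented for the example, and it assumes RewriteEngine is already on in the root .htaccess file.

```apache
# /old/2006/news.html  ->  http://www.example.com/new/news/2006.html
# $1 holds the first (...) group ("2006"), $2 holds the second ("news").
RewriteRule ^old/([^/]+)/([^/]+)\.html$ http://www.example.com/new/$2/$1.html [R=301,L]
```

Because the target starts with /new/ and the pattern only matches /old/, the redirected request is not matched a second time.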
.
Some examples for your index redirection:
RewriteRule ^index\.html?$ http://www.mysite.com/ [R=301,L]
This takes only the root index.html page and rewrites to the root www.domain.com/ without the index.html appended.
It does not cater for index pages in folders on the site.
.
RewriteRule ^(.*)index\.html?$ http://www.mysite.com/ [R=301,L]
This takes any index.html page (even one in a sub-folder - that's the (.*) bit) and rewrites to the root www.domain.com/ without the index.html appended. It does NOT preserve the folder name - BAD!
.
RewriteRule ^(.*)index\.html?$ http://www.mysite.com/$1 [R=301,L]
This one takes any index.html page (even one in a sub-folder - that's the (.*) bit) and rewrites the URL to the same folder (that's the $1 bit) www.domain.com/folder/ but without the index.html appended.
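One caution: on a server where index.html is the DirectoryIndex, a bare rule like this can loop, because a request for the folder is mapped back to index.html internally and then redirected again. The versions posted earlier in this thread avoid that by testing THE_REQUEST, which holds the original request line sent by the client. A sketch combining the two ideas, with www.mysite.com standing in for the real host:

```apache
# Only act when the client itself asked for .../index.htm or .../index.html,
# not when the server mapped a folder request to the index file internally.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(([^/]+/)*)index\.html?\ HTTP/
# $1 is the folder path (empty for the root), so the folder name is preserved.
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.mysite.com/$1 [R=301,L]
```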
.
A note about the ? in the RewriteRule ^index\.html?$ part.
RewriteRule ^index\.htm$ - rewrites only for index.htm
RewriteRule ^index\.html$ - rewrites only for index.html
RewriteRule ^index\.html?$ - rewrites both for index.htm and for index.html
.
This stuff is very very powerful. It is logical, but it is not always obvious.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^mysite\.com$ [NC]
RewriteRule (.*) [mysite.com...] [R=301]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html?
RewriteRule ^index\.html?$ [mysite.com...] [R=301,L]
Thank you