Best Way to a Good Canonical?

Using mod_rewrite or canonical tag or change links

         

farflappin

2:49 pm on Jan 14, 2010 (gmt 0)

10+ Year Member



I've posted this request here because, at present, I'm using htaccess methods to try to achieve a single form of URL.

I have searched through the forum and found different ideas on achieving what I'm trying to do but the thought has occurred to me - am I understanding what I'm trying to do (probably not)?

I'm trying to get all calls to redirect to http://www.example.com/

A few months ago, I did a 301 htaccess redirect -

RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301]

That seems to work well enough when it comes to simply the www part of the URL. But I had thought it would also work with the http://www.example.com/index.php duality as well.

It doesn't appear to be working that way - I seem to have a far higher link popularity with index.php than to root.

Are these being generated by internal site links pointing to the home page, because the internal links do href to index.php? I've no doubt some links from outside might target index.php directly but the number surprises me if that's the case. So do I simply change my internal links to '/' or is a redirect still required?

Could I go for the canonical tag and hope ... or would it be a useful thing to put in anyway even if I change other things?

[edited by: jdMorgan at 8:32 pm (utc) on Jan. 14, 2010]
[edit reason] example.com [/edit]

jdMorgan

8:32 pm on Jan 14, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> the internal links do href to index.php

Fix your links first. Once search engines start requesting "/" instead of "/index.php", you can redirect all further direct client requests from "/index.php" to "/". This redirect *will not* help with your problem unless your links are fixed first.

There is no "duality" here. You are linking to a filepath instead of linking to the correct URL. The duality exists only in confusing a URL with a filepath -- They are two very different things, associated by the action of the server, but not at all equivalent.

You also need an [L] flag on your existing rule.

Jim

g1smd

10:12 pm on Jan 14, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



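One error in your original code is that the [NC] flag works against what you want to do: with it, mixed-case hostnames such as WWW.Example.com count as matching the pattern, so they are never redirected to the lowercase form. Remove that NC flag.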

The existing rule redirects non-www to www. You'll need a separate rule to redirect index requests and that rule also needs to fix the domain name for those requests. The new rule should be placed before the existing rule.
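As a rough sketch of the two fixes suggested so far, the existing rule might end up looking like the following. Only the added [L] and the removed [NC] differ from the original post; the RewriteEngine line is an assumption about what the rest of the file already contains.

```apache
# Sketch only: the original host-canonicalization rule with both fixes applied.
RewriteEngine On

# Redirect any request whose Host header is not exactly "www.example.com".
# [NC] is gone, so mixed-case variants of the hostname are redirected too.
RewriteCond %{HTTP_HOST} !^www\.example\.com$
# [L] stops rule processing for this request once the redirect is invoked.
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```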

farflappin

10:58 pm on Jan 14, 2010 (gmt 0)

10+ Year Member



Many thanks Jim - very much appreciated.

I follow what you mean about duality. I picked up the problem on Google Webmaster Tools when they indicated duplicate title tags - would the linking of the filepath to index.php cause this or have I missed something else?

I didn't use the [L] flag on the redirect rule as the htaccess runs through to other rules, such as blocking some referers and bots so I have the [L] at the end of all of the rules. As I'm not using ifmodules, I thought that's the place it should go as it indicates the last rule. I don't mind if I'm wrong as I'm learning (slowly ... but I'm learning).

Ray

farflappin

11:07 pm on Jan 14, 2010 (gmt 0)

10+ Year Member



Cheers g1smd - the [NC] will go. I can understand why as it allows caps.

I'll do as Jim says first and redo the internal links - then do the redirect before the existing one in htaccess. I've seen a couple of versions of that redirect, both on here and elsewhere and I've noticed the odd warning about loops - if you don't mind me asking, which version do you advise?

jdMorgan

11:47 pm on Jan 14, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You should end all of your rules with an [L] flag unless you have a *very* specific reason not to, and you know how to 'stack' rules and stay out of trouble: There's a rather nasty Apache bug you can trigger by allowing multiple rules to run... However, it is *very* rarely necessary to run more than one rule, so put [L] on all of them.

The [L] flag means "stop processing if this rule is invoked" -- and only if the rule is invoked. You've left it out for a rather bad reason.
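A minimal sketch of that behaviour follows. The blocking rule is hypothetical ("BadBot" is a made-up user-agent string) and is only there to show that [L] on one rule does not stop later rules from being evaluated when that rule is not invoked.

```apache
RewriteEngine On

# Hypothetical access-control rule. Its [L] takes effect only for requests
# whose User-Agent contains "BadBot"; those get a 403 and processing stops.
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]

# Every other request falls through to here, so this rule is still evaluated
# even though the rule above carries [L].
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```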

Put your rules in this order:

  1. Internal variable declarations with [E] flag, generally using null path substitution string.
  2. Access controls -- 403-Forbidden response with [F] flag.
  3. Removed URLs -- 410-Gone response with [G] flag.
  4. All external redirects with [R=301,L] flags, ordered from most-specific patterns and conditions to least-specific.
  5. All internal rewrites, again in order from most-specific patterns and conditions to least-specific.

To clarify the "most-specific" concept, a redirect for a specific URL should go first. Redirects for groups of URLs should go next. Your domain canonicalization redirect should (almost always) be the very last redirect. The last redirect is then followed by your first (most-specific) internal rewrite.

Rule-type ordering mnemonic: abcd [EFG] ... [R] ... xyz

In this way, stacked/chained/multiple redirects are avoided, and the filepaths to which URLs have been internally rewritten won't be exposed to clients as URLs.

And all rules end with an [L] flag. :)
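The ordering above could be laid out along these lines. This is a skeleton under stated assumptions, not a drop-in file: the referrer hostname and the file names (retired-page.html, old-name.html, new-name.html, product.php) are placeholders invented for illustration.

```apache
RewriteEngine On

# 1. Variable declarations with [E] would go here (none in this sketch).

# 2. Access controls: 403 Forbidden. Hypothetical referrer block.
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?unwanted-referrer\.example\.org/ [NC]
RewriteRule .* - [F,L]

# 3. Removed URLs: 410 Gone. Hypothetical retired page.
RewriteRule ^retired-page\.html$ - [G,L]

# 4a. External redirect for one specific URL (most specific, so it goes first).
RewriteRule ^old-name\.html$ http://www.example.com/new-name.html [R=301,L]

# 4b. Domain canonicalization: the least specific redirect, so it goes last.
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# 5. Internal rewrites, after all external redirects. Hypothetical script name.
RewriteRule ^products/([0-9]+)$ /product.php?id=$1 [L]
```

Because the redirects in step 4 name the canonical hostname in full, a request that matches a specific redirect is sent to its final URL in one hop and never needs the domain rule afterwards.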

Jim

farflappin

12:54 am on Jan 15, 2010 (gmt 0)

10+ Year Member



Hi Jim,

Am I glad that I came on here - I put my hand up to thinking that I knew and as a result, I'm grateful for being lucky enough not to come unstuck - so far ... not a good situation methinks.

I thought the [L] was taken as the last rule whether it was invoked or not - many thanks for putting me wise.

My stacking order needs a look too in that the 301 needs moving - in my defence on that, I found several references putting it as the first rule but I'll certainly follow your guidelines. It makes sense really in that anything getting a 403 won't need to know anyway.

Along those lines - up till now, I've been led to believe that 'ErrorDocument', any <FilesMatch> and 'Deny from ...' statements go prior to 'RewriteEngine on' - then the rules. I might as well get this right once and for all.

I really appreciate you taking the time to come off the original topic and help me on this as well.

I shall, of course, use the 'I put it down to old age' excuse for all that it's worth :-)

Ray

jdMorgan

4:50 pm on Jan 15, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I used the word "rules" specifically to indicate RewriteRule order, and the ordering given above is based on logic (plus a bit of experience) -- and not on something that I read somewhere else and present as dogma...

Be aware that Apache modules (e.g. mod_rewrite, mod_access, etc.) each execute in turn, with each handling only the directives that it understands in your .htaccess file. Therefore, the directives only "execute in order" if they belong to the same module. You cannot control the module execution order from within .htaccess, as that is determined by the reverse-ordering of the LoadModule list on Apache 1.x and by an internal priority scheme on Apache 2.x.

Therefore, you cannot view your code as a "linear sequential program" except within/among directives all targeted to the same Apache module.

So, in other words, it makes no difference at all whether you put your "Deny from" directives before or after your RewriteRule directives; Either all "Denys" will execute first, or all RewriteRules will execute first, and nothing you can do in your .htaccess file can change that. This is the main reason that several contributors here recommend *not* mixing mod_alias Redirect and RedirectMatch directives with mod_rewrite RewriteRule directives: The execution order might change after a server upgrade, a 'tweak' by your host, or an elective change in your hosting provider...
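One way to follow that advice is to express a mod_alias redirect as a RewriteRule, so that it sits in the same ordered list as the other rules. The sketch below uses made-up paths (old-page.html and new-page.html are placeholders, not taken from this thread).

```apache
# mod_alias version, shown commented out. It is handled by mod_alias, so its
# position relative to the RewriteRules in this file does not decide which
# of them runs first.
# Redirect 301 /old-page.html http://www.example.com/new-page.html

# mod_rewrite version of the same redirect. In .htaccess the pattern is
# matched against the path without its leading slash.
RewriteEngine On
RewriteRule ^old-page\.html$ http://www.example.com/new-page.html [R=301,L]
```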

The module execution order is not totally arbitrary, and 98% of the time mod_access will execute before mod_alias, followed by mod_rewrite (just to name three). This is because that execution order "makes the most sense to the most server administrators." But the fact is that we do see exceptions here.

Another aspect to consider is that if a redirect is invoked, that terminates the current HTTP transaction, and informs the client that it should start a new one, using the new URL provided in the server's redirect response. It's important to realize that this means that all of your server-config code will be re-executed from the top, with no "memory" whatsoever of the previous transaction. This can also make .htaccess directives appear to execute out-of-order, if you don't realize that you're handling a second HTTP request distinct from the first one.

The domain canonicalization redirect rule should be the last external redirect 99.99% of the time. Why? Well, taking into account the above discussion, imagine that you put the recommended "index.php" canonicalization rule *after* the domain canonicalization rule, and a client requests "example.com/index.php" from your server. With the domain canonicalization rule first, that client will first get redirected to "www.example.com/index.php" and so will issue a second HTTP request for that new URL from your server. This time, since the domain is correct only the second rule will fire, and redirect that same client to "www.example.com/". And then the client will come back using a third HTTP request for the now-fully-canonicalized index page URL (which triggers neither rule), and it will finally get the content that it wanted in the first place.

Now reverse the rule order, and let the request for "example.com/index.php" get redirected straight to "www.example.com/" -- One redirect instead of two, two HTTP transactions instead of three, and correspondingly less request-handling work and time for the client and your server.
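The two orderings can be compared side by side in a short sketch. The index rule shown is one commonly posted form; its THE_REQUEST condition is an assumption included here to keep the example free of redirect loops, and the trace assumes the server is configured to serve index.php for "/".

```apache
RewriteEngine On

# Recommended order: the most-specific redirect (index.php) comes first and
# already names the canonical hostname in its target.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php
RewriteRule ^index\.php$ http://www.example.com/ [R=301,L]

# Domain canonicalization comes last.
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# Request trace with this order:
#   1. GET example.com/index.php      -> 301 to http://www.example.com/
#   2. GET www.example.com/           -> content served
#
# Request trace with the two rules swapped:
#   1. GET example.com/index.php      -> 301 to http://www.example.com/index.php
#   2. GET www.example.com/index.php  -> 301 to http://www.example.com/
#   3. GET www.example.com/           -> content served
```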

Note that even if you don't put the [L] flag on the first rule in this scenario, the server still has to process two rules instead of only one...

So anyway, that's the explanation of the "most-specific redirects first" part of the recommendations above.

By all means, if you have a question about how a particular directive or flag works, go straight to the source instead of getting "opinions" on some forum somewhere. Apache docs are all online at apache.org, and they are far more correct than most second-hand knowledge (including ours here). :)

I added *just a few* citations to our Apache Forum Charter when I came on board here, and I commend them to you.

Jim

P.S. "...use the ... excuse for all that it's worth."
Note grey chin in profile pic -- You're not the only one!

farflappin

10:10 pm on Jan 15, 2010 (gmt 0)

10+ Year Member



Hi Jim,

Many thanks for taking the time to run through the process with me. It's a bit of an eye-opener on how much I didn't know compared with what I simply assumed. Those assumptions were made over the years by lumping together snippets that I gleaned when I wanted to do something specific ... I take your point about experience and this has been a good 'un for me.

I've changed all the internal links to "/" and moved the htaccess around and changed it according to what you and g1 have advised. I'll be putting the index.php redirect on as well.

Once again, many thanks for your help and to g1 too.

Ray

jdMorgan

12:44 am on Jan 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



When you get ready to redirect the index page to "/", be sure to do a search here for the required method -- It's a bit tricky if you want to avoid an infinite redirect/rewrite loop... :)

Jim

farflappin

2:50 am on Jan 16, 2010 (gmt 0)

10+ Year Member



Hi Jim,

I've used one that you've shown several times -

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php
RewriteRule ^index\.php$ http://www.example.com/ [R=301,L]
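For anyone wondering why the THE_REQUEST condition is there: when a client asks for "/", the server's directory-index handling maps that internally to index.php, and the rules in .htaccess are applied to that internal index.php request as well. Without the condition, the internal request would itself be redirected to "/", and round it goes. A slightly stricter variant that circulates widely is sketched below; the extended pattern is an assumption, not something quoted in this thread.

```apache
RewriteEngine On

# THE_REQUEST is the original request line, e.g. "GET /index.php HTTP/1.1".
# It is not altered by internal rewrites or directory-index handling, so this
# condition is true only when the client itself asked for /index.php.
# The added tail allows an optional query string and then requires " HTTP/",
# so the condition matches only a request for exactly /index.php.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php(\?[^\ ]*)?\ HTTP/
RewriteRule ^index\.php$ http://www.example.com/ [R=301,L]
```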

I've tested it coming into the site by calling 'http://www.example.com(/)', 'http://example.com(/)', 'http://www.example.com/index.php' and 'http://example.com/index.php' and they have all come up showing the required 'http://www.example.com/' in the browser bar.

Testing the internal links has the desired effect as well.

I've used different browsers and also refreshed and cleared history. There was no looping, all pages appeared, no glitches or hang-ups and everything appeared to be working normally.

Is it the case that if it works ... then it works (subject to some change in the future) or is there a situation that I've not thought through and accounted for above?

Ray

jdMorgan

4:22 am on Jan 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Testing the internal links has the desired effect as well.
> Is it the case that if it works ... then it works (subject to some change in the future) or is there a situation that I've not thought through and accounted for above?

Sounds dangerous, so let's be very clear... If you have not changed all of those 'internal links' to point to "/" instead of /index.php, then you are forcing every client that 'clicks' on one of those links to do *two* HTTP requests -- One resulting in a redirect response, and the second actually returning the desired content. Users will be slowed down, search engines will cast a jaundiced eye, and your logs and stats will be severely skewed.

You cannot count on this redirect as a "magic fix" for your own site's linking errors; It will truly "help" only with linking errors on sites that you do not control.

You must link only to "/" from your own pages to avoid trouble. If you cannot do that, then remove the redirect and just live with the ugly URLs.

Jim

g1smd

7:59 am on Jan 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This would be a good time to run Xenu LinkSleuth over your site just to be sure all the internal links point to the correct URL.

farflappin

6:36 pm on Jan 16, 2010 (gmt 0)

10+ Year Member



Thanks Jim and g1,

I've got my links contained in 'include' files so that all pages call the same link set relative to their level - it was a matter of changing those. The odd exception has also been changed, as has the link to the home page from the forum. I've also changed the internal redirects where they apply to 'home' when logging in and out.

I've simply changed the links to ./ or ../ as that keeps it all in the relative form being used. The linking 'error' was using index.php and up until doing all of this, Google has shown no problems at all with any page or link. Plus, using the site, all links are working OK - I double check links every time something goes in or gets changed. But I'll still run LinkSleuth to treble check.

Nothing appears to have slowed and I've spent quite a time running it through.

I do now appreciate the difference between the internal link aspect and that the htaccess redirect should only affect those coming in from outside links if a change is needed.

If all that has been done helps a bit or a lot in achieving a 'one site view' - instead of bits scattered under different banners - then I'll be pleased. I don't expect miracles but it doesn't hurt to improve things if they can be bettered.

Once again, many thanks for your continuing assistance.

Ray

g1smd

11:52 pm on Jan 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I never use the ./ form.

I always use the /full/path/from/the/root.html

farflappin

9:30 pm on Jan 17, 2010 (gmt 0)

10+ Year Member



Do you mean just the ./ or the relative form generally?

I've tended to use it ... well, simply because it's more concise and has worked trouble free so far. I have sometimes had a wander through the web looking at thoughts on which form to use and found both appear to have their good and bad points.

Again, that is me perhaps basing my approach on the dogma that Jim refers to, and it would be refreshing to know your reasons based on sound experience.