Apache Web Server Forum

    
How can I prevent serving multiple 301 redirects?
Multiple mod_rewrite rules causing potential server overhead and SEO pain
HenryUK




msg:3893797
 3:20 pm on Apr 16, 2009 (gmt 0)

I'm involved in planning a site migration of a site with a very large number of URLs.

The existing draft rewrite rules are a complex mix of individual redirects for key pages, directory changes, and pattern-matching for large numbers of dynamic pages.

There are >150 rules in place.

Happily, all old URLs are redirecting to the correct new URLs. Unhappily, they are doing so by way of serving multiple 301 redirects. For example, one to change a directory from old to new, another to replace a query string parameter, another to append a trailing slash to non-file URLs.

(Of course, this is all on dev at present, not live!)

I'm concerned that getting to the correct URL via multiple 301s will cause unnecessary demand on the server, and I know for a fact that Google will not like the taste of it very much.

The complexity and range of existing URLs means that the most logical way of dealing with them is by applying a series of simple rules, rather than writing a complicated rule for each URL format.

I'd like to be able to apply all of my rules and only return a single 301 redirect when done. Will mod_rewrite allow me to do this?

Here's hoping that the response isn't simply "write better rules..."

Thanks in advance.

 

jdMorgan




msg:3893862
 4:48 pm on Apr 16, 2009 (gmt 0)

Multiple chained redirects should be avoided if you want to pass the "link juice."

There are two ways to do this.

The first approach is to write each rule so that it accepts all permutations of input URL and redirects to the canonical URL all in one go. This is useful for a limited set of input URL permutations with a limited number of canonical URLs.
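As a rough illustration of the first approach (a sketch only, not code from this thread), a single rule can accept an extensionless URL-path with or without its trailing slash, on any hostname, and redirect straight to the canonical form in one step. It assumes .htaccess context and the www.example.com canonical host used in the examples here. The site root and URLs that carry a filetype are not matched by the first rule and fall through to the plain hostname rule.

```apache
RewriteEngine On
# Extensionless URL-path, with or without trailing slash: redirect once if
# either the hostname is non-canonical or the trailing slash is missing
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [OR]
RewriteCond %{REQUEST_URI} !/$
RewriteRule ^(([^/]+/)*[^/.]+)/?$ http://www.example.com/$1/ [R=301,L]
#
# Everything else (site root, URLs with a filetype): fix the hostname only
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

With this sketch, a request for example.com/folder should get a single 301 to http://www.example.com/folder/, and a request that is already canonical matches neither set of conditions, so it is not redirected again.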

The second approach, likely more suited to your situation, is to specify a redirect in each rule, but to hold off invoking it until all fix-ups on the output URL are done. Then a final rule checks to see if any fix-ups were done, and if so, invokes the actual external redirect. To do this, you can set an environment variable in each fix-up rule to indicate that the external redirect should be invoked. For example:

# Redirect to add trailing slash if no trailing slash and no period in final URL-path-part of requested URL
RewriteRule ^(([^/]+/)*[^/.]*[^/])$ http://www.example.com/$1/ [R=301,E=doRed:Yes]
#
# Redirect to canonical hostname if non-canonical hostname requested
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,E=doRed:Yes]
#
# Invoke an external redirect if any of the above fix-ups were applied
RewriteCond %{ENV:doRed} ^Yes$
RewriteRule ^(.*)$ - [R=301,L]

An important guideline that you should follow to avoid unexpected results and problems: Put all external redirects first, in order from most-specific patterns and conditions (fewest URLs affected) to least-specific pattern (most URLs affected), followed by all internal rewrites, again in order from most- to least-specific. Where patterns and conditions are different and mutually-exclusive, then order won't matter.

This is illustrated by the fact that in the code above, only URLs which don't end in a slash and don't contain a period in the final URL-path-part (indicating that a filetype is not present) are redirected to add a slash. (Side note: Doing this based on the absence of a filetype is much more efficient than checking the disk for "file exists" -- possibly thousands of times more efficient.)

This rule is then followed by the domain canonicalization rule, which will redirect *all* URLs if any non-canonical hostname is requested.

If any internal rewrites are present, they must follow all external redirects. Otherwise, a redirect will expose the internally-rewritten filepath as a URL -- almost always an unwanted result.
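The ordering guideline can be sketched as follows. The paths old-dir, new-dir, company/about/ and catalog.php are hypothetical and chosen only to show the sequence. Each redirect here fires immediately with [L], so this illustrates rule order only, not the deferred-redirect technique.

```apache
RewriteEngine On
# 1. External redirect, most specific: a single page
RewriteRule ^old-dir/about\.html$ http://www.example.com/company/about/ [R=301,L]
#
# 2. External redirect, less specific: everything else in that directory
RewriteRule ^old-dir/(.*)$ http://www.example.com/new-dir/$1 [R=301,L]
#
# 3. External redirect, least specific: any URL on a non-canonical hostname
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
#
# 4. Internal rewrites come last, so no redirect can expose catalog.php
RewriteRule ^products/([0-9]+)/$ /catalog.php?id=$1 [L]
```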

The code posted here is for use in .htaccess, as the majority of our readers are on shared name-based servers with no access to their server config files. For use at the server config level, add a leading slash to the regular-expression patterns in the RewriteRules (only) -- e.g. "^(.*)$" becomes "^/(.*)$"
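For example, the hostname canonicalization rule might look like this when placed in the server config or a VirtualHost container instead of .htaccess. This is a sketch of the leading-slash adjustment only; the RewriteCond is unchanged.

```apache
# Server config / <VirtualHost> context: the URL-path matched by
# RewriteRule keeps its leading slash, so the pattern must account for it
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^/(.*)$ http://www.example.com/$1 [R=301,L]
```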

A bug exists in all versions of Apache mod_rewrite which can cause errors when multiple sequential rewrites are done. The result is that part of the URL gets re-injected into the substitution path, and you see "duplication" of parts of the URL-path. It's generally only a problem with internal rewrites, though. If this does occur, be sure to post back here; there is a solution, but it's ugly, inefficient, and unnecessarily complicated unless needed.

Jim

[edit] Corrected as noted below. [/edit]

[edited by: jdMorgan at 8:35 pm (utc) on April 16, 2009]

HenryUK




msg:3893953
 7:00 pm on Apr 16, 2009 (gmt 0)

Jim, this is absolutely brilliant, thanks very much for taking the time to set things out so clearly. I thought that there must be something like this available in mod_rewrite, but I hadn't been able to find it elsewhere despite a lot of searching.

I have had a good go at implementing this. I noticed what I thought were a couple of typos in your code, and I've done my best to correct them. In the spirit of the forum I thought I would share my experiences with everyone. As you'll see, I think I've very nearly got it!

I wanted to use your examples, so I tried adding this code to my .htaccess file.


RewriteRule ^(([^/]+/)*[^/.]*[^/])$ http://www.example.com/$1 [R=301,E=doRed:Yes]

I replaced www.example.com with the domain of my test site. And at first, I removed the environment variable as I wanted to be sure that this code was correct. When I tried it, anything without a trailing slash or a period at the end of the URL hit a redirect loop as it would effectively redirect to itself.

However, I managed to fix this by adding a trailing slash to the redirect, so that the rule read as follows:


RewriteRule ^(([^/]+/)*[^/.]*[^/])$ http://www.example.com/$1/ [R=301]

I thought I'd mention this in case anyone else tried the solution.

Then I added the next code snippet that you provided:


RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteCond ^(.*)$ http://www.example.com/$1 [R=301,E=doRed:Yes]

Again, this caused a problem - a 500 error across the site! This time the error was a little easier to spot - the second line should have begun with RewriteRule rather than RewriteCond. Again I also removed the environment variable for testing purposes, so that my code was:


RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301]

At this point I did wonder briefly if you were setting me a little test! If so I hope I passed ;-)

I had both of these running now, which got me to the right URL in the end but of course it was via two separate 301 redirects.

At this stage, I noticed that the order of the redirects (which I track using the HttpFox plugin) was different from the order of the code, so that when I entered:


example.com/folder

it redirected first to:


www.example.com/folder

and then to:


www.example.com/folder/

I wondered if this was because I had the following line of code preceding the rules:


RewriteBase /

However, when I took this out I got a strange URL which was the equivalent of http://www.example.com/http://www.example.com/folder/

Having reinstated that line, I then introduced the environment variables and the "fix-up" code. So, at this point, this is my whole .htaccess file:


RewriteEngine On
Options +FollowSymLinks
RewriteBase /
RewriteRule ^(([^/]+/)*[^/.]*[^/])$ http://www.example.com/$1/ [R=301,E=doRed:Yes]
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,E=doRed:Yes]
RewriteCond %{ENV:doRed} ^Yes$
RewriteRule ^(.*)$ - [R=301,L]

I entered the equivalent of the URL


http://example.com/folder

I hoped it would return a single 301 redirect to


http://www.example.com/folder/

It did return a single 301 redirect! However, it was to (the equivalent of)


http://www.example.com/http://www.example.com/folder/

which of course then went on to return a 403 forbidden (with a note that there had been a 404 also).

I feel as though I'm very close now to getting to implement the solution - can you give me a tip to get over this last hurdle? I'm sure it's something to do with looping through the rules and the way that I have ineptly amended your original code.

Thanks again, really appreciate the support.

Caterham




msg:3894025
 8:27 pm on Apr 16, 2009 (gmt 0)

You're injecting the host multiple times

RewriteEngine On
Options +FollowSymLinks
RewriteBase /
RewriteRule ^(.++(?<!/))$ $1/ [E=doRed:Yes]
RewriteCond %{HTTP_HOST} !=www.example.com
RewriteRule ^ - [E=doRed:Yes]
RewriteCond %{ENV:doRed} =Yes
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

jdMorgan




msg:3894028
 8:33 pm on Apr 16, 2009 (gmt 0)

Yes, it's that d@#n Apache bug again. So we do it the hard way:

# Set up initial environment variables
RewriteRule ^(.*)$ - [E=myHost:%{HTTP_HOST},E=myURLpath:$1]
#
# Add trailing slash if no trailing slash and no period in final URL-path-part of requested URL
RewriteCond %{ENV:myURLpath} ^(([^/]+/)*[^/.]*[^/])$
RewriteRule ^.*$ - [E=myURLpath:%1/,E=doRed:Yes]
#
# Set canonical hostname if non-canonical hostname requested
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^.*$ - [E=myHost:www.example.com,E=doRed:Yes]
#
# Invoke an external redirect if any of the above fix-ups were applied
RewriteCond %{ENV:doRed} ^Yes$
RewriteRule ^.*$ http://%{ENV:myHost}/%{ENV:myURLpath} [R=301,L]

Here we use environment variables to avoid any rewriting of the URL until we are ready to do the actual redirect. Note that the "-" in each rule means "leave the current requested URL alone." In this way, we avoid the path re-injection problem.

It's ugly, but it works.
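Further fix-ups can probably be added in the same style, each one testing and updating the myURLpath variable instead of the URL itself. As a sketch, a directory rename (the names old-dir and new-dir are hypothetical) could be placed after the trailing-slash fix-up and before the final redirect rule:

```apache
# Hypothetical extra fix-up: move everything under old-dir/ to new-dir/
# Operates only on the myURLpath variable; the requested URL is left alone
RewriteCond %{ENV:myURLpath} ^old-dir/(.*)$
RewriteRule ^.*$ - [E=myURLpath:new-dir/%1,E=doRed:Yes]
```

Placing it after the trailing-slash fix-up means a request for "old-dir" has already become "old-dir/" in the variable by the time this pattern is tested.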

Jim

jdMorgan




msg:3894062
 8:57 pm on Apr 16, 2009 (gmt 0)

For use on Apache 1.3x servers, caterham's code will need to be changed slightly, because the regular-expression libraries on Apache 1.3x and 2.x are different:

Options +FollowSymLinks
RewriteEngine On
RewriteBase /
#
RewriteRule ^(([^/]+/)*[^/.]*[^/])$ $1/ [E=doRed:Yes]
#
RewriteCond %{HTTP_HOST} !=www.example.com
RewriteRule ^ - [E=doRed:Yes]
#
RewriteCond %{ENV:doRed} =Yes
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

(Blank comment lines added only for readability)

This approach is much simpler, and preferable to doing everything in environment variables as I demonstrated above. However, you may find that if additional internal rewrites are done (as in the first rule shown here), the Apache bug will get triggered, and you'll find parts of the URL-path duplicated in the output. In that case, you'll need to use the clunky environment variable method I posted above. (I have confirmed through testing that this bug exists on Apache 1.3x and Apache 2.0 through 2.2).

Jim

Caterham




msg:3894076
 9:14 pm on Apr 16, 2009 (gmt 0)

"Yes, it's that d@#n Apache bug again."

That's not the path-info issue. (That issue could also be seen as a feature, since otherwise certain very special constructs that rely on the generation of request_filename wouldn't work; it will be configurable in 2.2.12, which is yet to be released.)

The path was rewritten to http://www.example.com/folder/ by the first rule. The pattern of the second rule then matches against http://www.example.com/folder/ + path_info, takes the full match and prefixes it with http://www.example.com/ again. The result is http://www.example.com/http://www.example.com/folder/ when no path_info was left by the dir_walk.
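Following that explanation, the two rules from the earlier .htaccess file can be annotated with the value each one leaves behind for a request to http://example.com/folder. The comments are a reading of the description above, not captured log output.

```apache
# Per-directory path seen by the first rule: "folder"
RewriteRule ^(([^/]+/)*[^/.]*[^/])$ http://www.example.com/$1/ [R=301,E=doRed:Yes]
# No [L] flag, so processing continues. The current URL is now:
#   http://www.example.com/folder/
#
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,E=doRed:Yes]
# (.*) captured the whole previous result, host included, so the URL becomes:
#   http://www.example.com/http://www.example.com/folder/
```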

jdMorgan




msg:3894193
 12:06 am on Apr 17, 2009 (gmt 0)

Yes, it just looked like the nasty mod_rewrite bug [webmasterworld.com] at first glance. As far as I know (it's been over a year since I tested), this bug exists in all Apache versions, and manifests when multiple sequential internal rewrites are done (i.e. no [L] flag).

The bug report gives a simple example that shows the problem.

Jim

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved