Best order of directives in long mod rewrite code

What is the best way to order directives in long mod_rewrite code?

         

JuddG

7:28 am on Oct 23, 2008 (gmt 0)

I stumbled upon these WebmasterWorld forums the other day and I must say they've been very useful - Jim, your mod_rewrite examples in particular have been extremely helpful.

I've come in to help with a slightly involved project where we're trying to both implement a CMS & clean up our URLs for search engines at the same time.

There are about 240 existing unique URLs (all lower case, no pattern). The CMS generates a new unique URL for each existing page (again, no pattern). My mod_rewrite code needs to:

  • Rewrite each existing URL to its new CMS location to preserve these URLs (as they have been indexed by search engines)
  • Redirect any requests for the CMS URLs (regardless of case) back to the correct (current) URL (to prevent duplicate content being exposed to search engines)
  • Catch any arbitrary incorrect case used to access the existing URLs & redirect to the correct (lower case) URL (e.g. if user enters /contactUS 301 to /contactus)
  • Redirect any non-www domain requests to the equivalent www version (again, to prevent duplicate content)

Based on my research & some of Jim's code here on the forums, I've generated the config below (which does what I'm after). However, it's rather long and I think it can be optimised (or at least the order of directives enhanced). I don't have a strong background in Apache though and so I'm not sure what to look for in this regard. Can anyone assist?

Note some names e.g. of the domain have been changed to protect the innocent :)

###
RewriteEngine On

# Redirect non-www requests to www
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule ^/(.*)$ http://www.example.com/$1 [R=301,L]

# Redirect default homepage file to domain root /
RewriteRule ^/index\.html$ / [NC,R=301,L]

# Text file key/value table for 301 of cms locations
RewriteMap redirectmap txt:/usr/local/apache/conf/redirects.txt
# Internal map for lowercase conversion
RewriteMap lowc int:tolower

## 301 physical locations to current URLs
# Put request into backreference
RewriteCond %{REQUEST_URI} ^/(.*)$
# Test if match found
RewriteCond ${redirectmap:${lowc:%1}} !^$
# If match found redirect request
RewriteRule ^/(.*)$ /${redirectmap:${lowc:$1}} [R=301,L]

# Text file key/value table for current-to-cms url translation
RewriteMap urlmap txt:/usr/local/apache/conf/urls.txt

## Redirect arbitrary incorrect case where valid url exists
# Check for upper case
RewriteCond $1 [A-Z]
# Put request into backreference
RewriteCond %{REQUEST_URI} ^/(.*)$
# Test if match found
RewriteCond ${urlmap:${lowc:%1}} !^$
# If match found rewrite request
RewriteRule ^/(.*)$ /${lowc:$1} [R=301,L]

## Rewrite current URLs to cms locations
# Put request into backreference
RewriteCond %{REQUEST_URI} ^/(.*)$
# Test if match found
RewriteCond ${urlmap:%1} !^$
# If match found rewrite request
RewriteRule ^/(.*)$ /${urlmap:$1}
###


g1smd

9:32 am on Oct 23, 2008 (gmt 0)

The non-www to www redirect rule must be the very last of the redirects -- otherwise a non-www URL is redirected to the equivalent www URL and then another rule redirects it again. That redirection chain is a very bad idea. So, the individual rules each force www at the same time as they fix whatever else they are supposed to fix, and the "general" rule for non-www to www goes at the very end to work on any incoming URL request that hasn't been touched by any of the preceding rules.

The index rule you have has a leading / ... so that rule can never work in .htaccess. The URL seen by .htaccess RewriteRule is stripped of preceding path information, up to and including the preceding slash. I assume this code is going into the httpd.conf file instead.
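
To illustrate the difference in what RewriteRule sees, here is a minimal sketch of the same homepage redirect written for each context. It assumes www.example.com as the canonical host, as elsewhere in this thread; only one of the two forms would be used, depending on where the code lives.

```apache
# httpd.conf (server or <VirtualHost> context):
# the pattern is matched against the full URL-path, e.g. "/index.html"
RewriteRule ^/index\.html$ http://www.example.com/ [NC,R=301,L]

# .htaccess in the document root (per-directory context):
# the directory prefix, including the leading slash, is stripped first,
# so the pattern is matched against "index.html"
RewriteRule ^index\.html$ http://www.example.com/ [NC,R=301,L]
```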

The index rule is usually placed to be the one immediately before the non-www to www rule.

Place all of the rewrite code after all of the redirects, otherwise you will expose the internal filepaths back out into the URL seen by the browser.

Your redirects (the things with [R=301] in them) should include the target domain name in them too, so that the domain is fixed at the same time as whatever else the rule is supposed to do. This avoids the redirection chain that I mentioned above.

Give that a go, repost the new code and I'll take another look.

JuddG

12:20 pm on Oct 23, 2008 (gmt 0)

That's good advice, thank you - exactly what I was after. I hadn't realised that while my redirects were working, it would be easy to cause the redirection chain you explained. I was mistakenly ordering my redirects in terms of most common to least common (based on some vague idea about performance). As you've pointed out, that's backwards thinking.

I understand what you mean about the target domain fixing the non-www problem & thus saving a redirect. I've updated the redirects now to add this in & reordered them as you describe. This makes sense in terms of fixing two potential problems with one redirect.

The amended code is below. Thanks again for taking the time to help, much appreciated.

###
RewriteEngine On

# Text file key/value table for 301 of cms locations
RewriteMap redirectmap txt:/usr/local/apache/conf/redirects.txt
# Internal map for lowercase conversion
RewriteMap lowc int:tolower

## 301 physical locations to current URLs
# Put request into backreference
RewriteCond %{REQUEST_URI} ^/(.*)$
# Test if match found
RewriteCond ${redirectmap:${lowc:%1}} !^$
# If match found redirect request
RewriteRule ^/(.*)$ http://www.example.com/${redirectmap:${lowc:$1}} [R=301,L]

# Text file key/value table for current-to-cms url translation
RewriteMap urlmap txt:/usr/local/apache/conf/urls.txt

## Redirect arbitrary incorrect case where valid url exists
# Check for upper case
RewriteCond $1 [A-Z]
# Put request into backreference
RewriteCond %{REQUEST_URI} ^/(.*)$
# Test if match found
RewriteCond ${urlmap:${lowc:%1}} !^$
# If match found redirect request
RewriteRule ^/(.*)$ http://www.example.com/${lowc:$1} [R=301,L]

# Redirect default homepage file to domain root /
RewriteRule ^/index\.html$ http://www.example.com/ [NC,R=301,L]

# Redirect non-www requests to www
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule ^/(.*)$ http://www.example.com/$1 [R=301,L]

## Rewrite current URLs to cms locations
# Put request into backreference
RewriteCond %{REQUEST_URI} ^/(.*)$
# Test if match found
RewriteCond ${urlmap:%1} !^$
# If match found rewrite request
RewriteRule ^/(.*)$ /${urlmap:$1}
###


g1smd

12:26 pm on Oct 23, 2008 (gmt 0)

The final rewrite needs an [L] flag on it. Other than that I can't see anything.

However, jd is good at sniffing out ways to make code more efficient.

jdMorgan

1:19 pm on Oct 23, 2008 (gmt 0)

To be clear, is this code going into .htaccess, or into a server-config file such as httpd.conf or conf.d? There are subtle pattern differences required, as g1smd points out above.

Just a couple of quick minor points while I'm here: First, you can replace "!^$" with ".", as the two are logically equivalent: A non-blank string by definition will have at least one of any character in it, and "." is two characters shorter and doesn't require the logical-negation operation.
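
As a small sketch of that equivalence, using the redirectmap and lowc maps defined in the config above, either of the following conditions could guard a rule; you would use one or the other, not both.

```apache
# Negated match against the empty string:
# succeeds when the map lookup returns anything at all
RewriteCond ${redirectmap:${lowc:$1}} !^$

# Equivalent positive match: succeeds when the result
# contains at least one character
RewriteCond ${redirectmap:${lowc:$1}} .
```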

Second, be aware that the pattern of a RewriteRule is evaluated (and back-references stored) *before* any of its associated RewriteConds are evaluated. Therefore, back-references to parenthesized RewriteRule sub-patterns are available for RewriteCond evaluation, and back-references to parenthesized RewriteCond sub-patterns are available in the substitution field of RewriteRule. Therefore, you may eliminate the first RewriteCond in your currently-last rule above, and translate $1 with your urlmap call, rather than %1. The same optimization seems to be applicable to two other rules as well, where you're checking REQUEST_URI with a RewriteCond only to create a back-reference.
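
The evaluation order can be sketched on the last rule of the config above. This assumes the urlmap RewriteMap is defined as posted and that the code stays in httpd.conf, so the pattern keeps its leading slash.

```apache
# Order of evaluation for this one rule:
#   1. The RewriteRule pattern is matched first; $1 is set from its parentheses.
#   2. Only if the pattern matched is the RewriteCond tested; it may already use $1.
#   3. If the condition passes, the substitution is built.
# No separate RewriteCond on REQUEST_URI is needed just to capture the path.
RewriteCond ${urlmap:$1} .
RewriteRule ^/(.*)$ /${urlmap:$1} [L]
```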

For the same reason (RewriteRule/RewriteCond pattern-evaluation order), the patterns in RewriteRules should be as specific as possible in order to avoid unnecessary RewriteCond parsing and execution. And while on this subject, RewriteConds that invoke DNS, filesystem checks, or OS shell functions should always go last whenever possible, so that they are only executed if all other conditions are met. And these "external call" functions should be avoided entirely if another "mod_rewrite-internal" method can be used. An example would be to exclude an explicit list of URL-paths from a RewriteRule rather than excluding URL-paths that resolve to existing files, as the first method relies on the "internally-available" requested URL-path-matching using RewriteRule/RewriteCond patterns, while the latter relies on an external call to the filesystem.
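
A rough sketch of the two approaches follows, assuming server config context. The directory names /images/, /css/ and /cms/ and the /cms/handler/ target are hypothetical and only stand in for whatever the real site uses. The two rules are alternatives, not meant to be used together.

```apache
# Filesystem-based exclusion: each candidate request costs a disk lookup.
# In server config context the path is built from DOCUMENT_ROOT, because
# the URL has not yet been mapped to a file when the rule runs.
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-f
RewriteRule ^/(.+)$ /cms/handler/$1 [L]

# Pattern-based exclusion: the same intent, decided from the requested
# URL-path alone, with no call outside mod_rewrite.
RewriteCond $1 !^(images|css|cms)/
RewriteRule ^/(.+)$ /cms/handler/$1 [L]
```

The second form has to be kept in step with the directories that actually hold static files, which is the price of avoiding the filesystem check.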

Jim

JuddG

12:25 am on Oct 24, 2008 (gmt 0)

Thanks g1smd - [L] flag added - your suggestions have been really helpful.

Sorry, I forgot to clarify earlier - this code is in httpd.conf.

Jim, good point about the logical equivalency of "!^$" and "." - I've fixed that.

Your description of how RewriteRule & RewriteCond work together is an eye-opener - it would have been great to have read that a couple of days ago when I first started to research mod_rewrite. I guess I have a subconscious assumption that all code executes from top-to-bottom and didn't really understand why the "RewriteCond $1 [A-Z]" line worked for me.

I removed all the RewriteCond lines where I was just creating a back-reference & updated the RewriteMap calls accordingly.

I also had a think about what you said about only executing required calls. This actually gave me an idea - the code above that redirects cms URLs to the correct URLs is now written (after applying your suggestions above) as:

RewriteCond ${redirectmap:${lowc:$1}} .
RewriteRule ^/(.*)$ http://www.example.com/${redirectmap:${lowc:$1}} [R=301,L]

Would it be more efficient to use:

RewriteCond ${redirectmap:${lowc:$1}} (..*)
RewriteRule ^/(.*)$ http://www.example.com/%1 [R=301,L]

It seems to me that this saves time by eliminating the extra RewriteMap calls.

I've rewritten a few calls in this fashion - would I be correct in thinking this should speed things up? My fully amended code is below. Thanks again for your help Jim.

###
RewriteEngine On

# Text file key/value table for 301 of cms locations
RewriteMap redirectmap txt:/usr/local/apache/conf/redirects.txt
# Internal map for lowercase conversion
RewriteMap lowc int:tolower

## 301 physical locations to current URLs
# Test if match found
RewriteCond ${redirectmap:${lowc:$1}} (..*)
# If match found redirect request
RewriteRule ^/(.*)$ http://www.example.com/%1 [R=301,L]

## Redirect arbitrary incorrect case where valid url exists
# Check for upper case
RewriteCond $1 [A-Z]
# Test if match found
RewriteCond ${urlmap:${lowc:$1}} .
# If match found redirect to the lower-case URL (the map value is the cms location, so it is not used as the target)
RewriteRule ^/(.*)$ http://www.example.com/${lowc:$1} [R=301,L]

# Redirect default homepage file to domain root /
RewriteRule ^/index\.html$ http://www.example.com/ [NC,R=301,L]

# Redirect non-www requests to www
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule ^/(.*)$ http://www.example.com/$1 [R=301,L]

# Text file key/value table for current-to-cms url translation
RewriteMap urlmap txt:/usr/local/apache/conf/urls.txt

## Rewrite current URLs to cms locations
# Test if match found
RewriteCond ${urlmap:$1} (..*)
# If match found rewrite request
RewriteRule ^/(.*)$ /%1 [L]
###


jdMorgan

2:00 am on Oct 24, 2008 (gmt 0)

Would it be more efficient to use...

Yes, that's more efficient because it avoids a second system call, but actually, it would be even more efficient to use


RewriteCond ${redirectmap:${lowc:$1}} (.+)
RewriteRule ^/(.*)$ http://www.example.com/%1 [R=301,L]

because ".+" means the same thing as "..*", i.e. "one or more characters." :)

Do check to make sure that your rewritemap expects to be called without the leading slash on the URL-path, and that it returns the mapped new-URL-path value without a leading slash. If not, then you'll need to do a bit more tweaking.
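
To make the expected map format concrete, here is a sketch of what redirects.txt might contain. The two entries are invented for illustration and are not taken from the real site.

```apache
# The rule above looks up the lower-cased request path without its leading
# slash, so the keys in the map file must be stored the same way.
RewriteMap redirectmap txt:/usr/local/apache/conf/redirects.txt

# Hypothetical contents of redirects.txt (key, whitespace, value):
#
#   content/view/12.html    contactus
#   content/view/15.html    products/widgets
#
# With these entries, a request for /Content/View/12.html is lower-cased,
# found in the map, and redirected to http://www.example.com/contactus
```

Because the key is passed through lowc before the lookup, any key stored with upper-case letters in the file would never be found.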

---

If you have index.html declared as your DirectoryIndex file (either here or in the default server config), then this may not work:


# Redirect default homepage file to domain root /
RewriteRule ^/index\.html$ http://www.example.com/ [NC,R=301,L]

because it redirects index.html to "/" and mod_dir will immediately rewrite that back to index.html, which will then get redirected to "/", rewritten to index.html again... lather, rinse, repeat until the server redirection limit is exceeded...

A typical construct to avoid this is:


# Redirect default homepage file to domain root /
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html [NC]
RewriteRule ^/(([^/]+/)*)index\.html$ http://www.example.com/$1 [NC,R=301,L]

This limits redirection to the case where the URL-path is "index.html" because the client directly asked for "index.html" and not because of any internal rewrites, such as that done by mod_dir. This prevents the redirect-rewrite looping.

As shown, it redirects /<any_directory_level>/index.html to /<any_directory_level>/

---

Don't forget to check for an FQDN and/or port number appended to the canonical domain:


# Redirect non-www requests to www
RewriteCond %{HTTP_HOST} ^example\.com [NC,OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com(\.|\.?:[0-9]+)$ [NC]
RewriteRule ^/(.*)$ http://www.example.com/$1 [R=301,L]

The 'detail' about RewriteRule/RewriteCond processing order and back-reference availability is described in the first part of the mod_rewrite documentation. I suggest a thorough top-to-bottom re-read of that documentation -- before you get into a really long mod_rewrite coding project (I'm still grinning that you said this code is long). :)

Jim

JuddG

3:53 am on Oct 27, 2008 (gmt 0)

Thanks for the advice Jim.

".+" = "..*"
- yikes, even I should have spotted that one.

Yep, I wrote the rewritemap without the leading slash to future proof the config for any .htaccess use should overrides ever be needed.

I took some time over the weekend to look into mod_dir & DirectoryIndex. Not having an Apache background, it took me a while before I got a grasp on the context. The Apache documentation (httpd.apache.org) for mod_dir is pretty vague.

If I now understand correctly, the key issue here is that mod_dir uses DirectoryIndex to rewrite from e.g. / to /index.html and that this updates REQUEST_URI and restarts processing of the directives - as opposed to rewriting & returning as I do at the end of the code (the 'ah ha' moment came from reading your comments in [webmasterworld.com...]). Correct me if I'm wrong here.

With this perspective, your explanation and code suggestion makes a lot of sense. I'm in the process of going through the ~240 URLs I'm trying to preserve to make sure your any-directory-level example won't break anything. If I just need to do this for the homepage, I assume the best thing to use would be:

# Redirect default homepage file to domain root / 
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html [NC]
RewriteRule ^/index\.html$ http://www.example.com/ [NC,R=301,L]

I never would have thought of checking for port number. That's a really good catch, thank you.

Just for the sake of understanding, in your suggested code you use the NC flag for both RewriteConds:

# Redirect non-www requests to www 
RewriteCond %{HTTP_HOST} ^example\.com [NC,OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com(\.|\.?:[0-9]+)$ [NC]
RewriteRule ^/(.*)$ http://www.example.com/$1 [R=301,L]

I thought HTTP_HOST would always contain a lower case string because the server's understanding of the domain is always lower case. Am I off on this one?

You're right - the RewriteRule/RewriteCond processing order is in the mod_rewrite docs, but I didn't understand that explanation of processing order until I'd read yours. In my opinion, you just explained it with more clarity, especially for people assuming sequential processing.

Fair call - I guess this code isn't *that* long :)

It's long enough for me to write it inefficiently the first time though - thank you all for your help.

g1smd

6:40 am on Oct 27, 2008 (gmt 0)

For your root-only "index" redirect, that code will work in httpd.conf but will not work in .htaccess.

In .htaccess the RewriteRule cannot "see" the leading "/".
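
If the rule ever needs to work in both places, one common approach is to make the leading slash optional. This is only a sketch, assuming the .htaccess file would sit in the document root:

```apache
# "^/?" makes the leading slash optional, so the same pattern matches
# "/index.html" in httpd.conf and "index.html" in a document-root .htaccess.
# THE_REQUEST always holds the original request line, so the condition
# is the same in both contexts.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html [NC]
RewriteRule ^/?index\.html$ http://www.example.com/ [NC,R=301,L]
```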

As for the "casing" of HTTP_HOST I think that this covers one basic Internet tenet: "Be liberal in what you accept, and strict in what you send" so it will still work if a user with a badly coded browser drops by.

JuddG

7:09 am on Oct 27, 2008 (gmt 0)

Good point about .htaccess - the code was written for httpd.conf, but I'll put in a comment to flag this so no-one copies it to .htaccess (not that I can think of a reason why they would) and breaks something. Good advice.

That makes sense - write robust code to cover all comers. I'll definitely be using Jim's code to check for appended port numbers.

When I've confirmed the directory-ending URL situation, I'll post my final code in case anyone can benefit from it in future.

Thank you once again g1smd and Jim for your help.

JuddG

10:19 pm on Nov 18, 2008 (gmt 0)

I've copied the final config below as promised. Sorry about the delay - project got delayed somewhat.

Thanks again to all for your help.

-------
RewriteEngine On

# Text file key/value table for 301 of old locations
RewriteMap redirectmap txt:/usr/local/apache/conf/redirects.txt
# Internal map for lowercase conversion
RewriteMap lowc int:tolower

## 301 physical locations to current URLs
# Test if match found
RewriteCond ${redirectmap:${lowc:$1}} (.+)
# If match found redirect request
RewriteRule ^/(.*)$ http://www.example.com/%1 [R=301,L]

## Redirect arbitrary incorrect case where valid url exists
# Check for upper case
RewriteCond $1 [A-Z]
# Test if match found
RewriteCond ${urlmap:${lowc:$1}} .
# If match found redirect to the lower-case URL (the map value is the old location, so it is not used as the target)
RewriteRule ^/(.*)$ http://www.example.com/${lowc:$1} [R=301,L]

# Redirect default homepage file requests to domain root /
### For .htaccess use, remove the leading / ###
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html [NC]
RewriteRule ^/index\.html$ http://www.example.com/ [NC,R=301,L]

# Redirect non-www requests to www
RewriteCond %{HTTP_HOST} ^example\.com [NC,OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com(\.|\.?:[0-9]+)$ [NC]
RewriteRule ^/(.*)$ http://www.example.com/$1 [R=301,L]

# Text file key/value table for current-to-old url translation
RewriteMap urlmap txt:/usr/local/apache/conf/urls.txt

## Rewrite current URLs to old locations
# Test if match found
RewriteCond ${urlmap:$1} (.+)
# If match found rewrite request
RewriteRule ^/(.*)$ /%1 [L]
-------
