I've come in to help with a slightly involved project where we're trying to both implement a CMS & clean up our URLs for search engines at the same time.
There are about 240 existing unique URLs (all lower case, no pattern). The CMS generates a new unique URL for each existing page (again, no pattern). My mod_rewrite code needs to:
Based on my research & some of Jim's code here on the forums, I've generated the config below (which does what I'm after). However, it's rather long and I think it can be optimised (or at least the order of directives enhanced). I don't have a strong background in Apache though and so I'm not sure what to look for in this regard. Can anyone assist?
Note some names e.g. of the domain have been changed to protect the innocent :)
###
RewriteEngine On
# Redirect non-www requests to www
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
# Redirect default homepage file to domain root /
RewriteRule ^/index\.html$ / [NC,R=301,L]
# Text file key/value table for 301 of cms locations
RewriteMap redirectmap txt:/usr/local/apache/conf/redirects.txt
# Internal map for lowercase conversion
RewriteMap lowc int:tolower
## 301 physical locations to current URLs
# Put request into backreference
RewriteCond %{REQUEST_URI} ^/(.*)$
# Test if match found
RewriteCond ${redirectmap:${lowc:%1}} !^$
# If match found redirect request
RewriteRule ^/(.*)$ /${redirectmap:${lowc:$1}} [R=301,L]
# Text file key/value table for current-to-cms url translation
RewriteMap urlmap txt:/usr/local/apache/conf/urls.txt
## Redirect arbitrary incorrect case where valid url exists
# Check for upper case
RewriteCond $1 [A-Z]
# Put request into backreference
RewriteCond %{REQUEST_URI} ^/(.*)$
# Test if match found
RewriteCond ${urlmap:${lowc:%1}} !^$
# If match found rewrite request
RewriteRule ^/(.*)$ /${lowc:$1} [R=301,L]
## Rewrite current URLs to cms locations
# Put request into backreference
RewriteCond %{REQUEST_URI} ^/(.*)$
# Test if match found
RewriteCond ${urlmap:%1} !^$
# If match found rewrite request
RewriteRule ^/(.*)$ /${urlmap:$1}
###
[edited by: JuddG at 7:31 am (utc) on Oct. 23, 2008]
[edited by: jdMorgan at 1:20 am (utc) on Oct. 24, 2008]
[edit reason] example.com [/edit]
The index rule you have has a leading / ... so that rule can never work in .htaccess. The URL seen by .htaccess RewriteRule is stripped of preceding path information, up to and including the preceding slash. I assume this code is going into the httpd.conf file instead.
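To make the difference between the two contexts concrete, here is a minimal Python sketch of what each one hands to the RewriteRule pattern. It is only a toy model: the function name is made up, and it assumes the .htaccess file sits in the document root, so the stripped prefix is just the leading slash.

```python
import re

# The index rule's pattern as posted, with [NC] modelled by IGNORECASE.
INDEX_PATTERN = re.compile(r"^/index\.html$", re.IGNORECASE)


def rule_input(url_path: str, per_directory: bool) -> str:
    """Return the string a RewriteRule pattern is tested against (toy model).

    Assumption: in per-directory (.htaccess) context the file lives in the
    document root, so the stripped prefix is just the leading "/".
    """
    if per_directory and url_path.startswith("/"):
        return url_path[1:]
    return url_path


server_sees = rule_input("/index.html", per_directory=False)
htaccess_sees = rule_input("/index.html", per_directory=True)

# The posted pattern only matches the string seen in server config context.
matches_in_server_config = INDEX_PATTERN.search(server_sees) is not None
matches_in_htaccess = INDEX_PATTERN.search(htaccess_sees) is not None
```

In .htaccess the same rule would need the pattern written without the leading slash, i.e. ^index\.html$.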
The index rule is usually placed immediately before the non-www to www rule.
Place all of the rewrite code after all of the redirects, otherwise you will expose the internal filepaths back out into the URL seen by the browser.
Your redirects (the rules with [R=301] in them) should also include the target domain name, so that the domain is fixed at the same time as whatever else the rule is supposed to do. This avoids a redirection chain, where one redirect fixes the URL-path and a second one then fixes the domain.
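The effect on the number of redirects can be sketched with a small Python simulation. This is a simplified model, not Apache itself: it only covers the non-www rule and the index rule, and the function names are invented for the illustration.

```python
import re

CANONICAL = "www.example.com"


def chained_rules(host, path):
    """Original order: host fix first, then an index fix with a host-relative target."""
    if re.search(r"^example\.com", host, re.IGNORECASE):
        return (CANONICAL, path)   # fixes the host only
    if re.search(r"^/index\.html$", path, re.IGNORECASE):
        return (host, "/")         # fixes the path only, host stays as requested
    return None


def combined_rules(host, path):
    """Suggested order: index fix first, and every redirect names the canonical host."""
    if re.search(r"^/index\.html$", path, re.IGNORECASE):
        return (CANONICAL, "/")    # fixes path and host in one redirect
    if re.search(r"^example\.com", host, re.IGNORECASE):
        return (CANONICAL, path)
    return None


def count_redirects(rules, host, path, limit=10):
    """Follow redirects until no rule fires; return (hops, final host, final path)."""
    hops = 0
    while hops < limit:
        target = rules(host, path)
        if target is None:
            break
        host, path = target
        hops += 1
    return hops, host, path


chained = count_redirects(chained_rules, "example.com", "/index.html")
combined = count_redirects(combined_rules, "example.com", "/index.html")
```

In this model a request for example.com/index.html takes two redirects with the original ordering and one with the combined version, and both end up at the same place.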
Give that a go, repost the new code and I'll take another look.
I understand what you mean about the target domain fixing the non-www problem & thus saving a redirect. I've updated the redirects now to add this in & reordered them as you describe. This makes sense in terms of fixing two potential problems with one redirect.
The amended code is below. Thanks again for taking the time to help, much appreciated.
###
RewriteEngine On
# Text file key/value table for 301 of cms locations
RewriteMap redirectmap txt:/usr/local/apache/conf/redirects.txt
# Internal map for lowercase conversion
RewriteMap lowc int:tolower
## 301 physical locations to current URLs
# Put request into backreference
RewriteCond %{REQUEST_URI} ^/(.*)$
# Test if match found
RewriteCond ${redirectmap:${lowc:%1}} !^$
# If match found redirect request
RewriteRule ^/(.*)$ http://www.example.com/${redirectmap:${lowc:$1}} [R=301,L]
# Text file key/value table for current-to-cms url translation
RewriteMap urlmap txt:/usr/local/apache/conf/urls.txt
## Redirect arbitrary incorrect case where valid url exists
# Check for upper case
RewriteCond $1 [A-Z]
# Put request into backreference
RewriteCond %{REQUEST_URI} ^/(.*)$
# Test if match found
RewriteCond ${urlmap:${lowc:%1}} !^$
# If match found redirect request
RewriteRule ^/(.*)$ http://www.example.com/${lowc:$1} [R=301,L]
# Redirect default homepage file to domain root /
RewriteRule ^/index\.html$ http://www.example.com/ [NC,R=301,L]
# Redirect non-www requests to www
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
## Rewrite current URLs to cms locations
# Put request into backreference
RewriteCond %{REQUEST_URI} ^/(.*)$
# Test if match found
RewriteCond ${urlmap:%1} !^$
# If match found rewrite request
RewriteRule ^/(.*)$ /${urlmap:$1}
###
[edited by: jdMorgan at 1:22 am (utc) on Oct. 24, 2008]
[edit reason] example.com [/edit]
Just a couple of quick minor points while I'm here: First, you can replace "!^$" with ".", as the two are logically equivalent: A non-blank string by definition will have at least one of any character in it, and "." is two characters shorter and doesn't require the logical-negation operation.
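The equivalence is easy to check outside Apache. The sketch below uses Python's re module as a stand-in for mod_rewrite's regex engine, which is an assumption that holds for patterns this simple as long as the tested string contains no newlines (map results and URL-paths do not). The sample strings are invented.

```python
import re


def non_empty_by_negation(value: str) -> bool:
    """Mimics the condition pattern !^$ : true when ^$ does NOT match."""
    return re.search(r"^$", value) is None


def non_empty_by_dot(value: str) -> bool:
    """Mimics the condition pattern . : true when any one character matches."""
    return re.search(r".", value) is not None


# An empty string stands for "key not found in the map".
samples = ["", "a", "about/contact", " ", "0"]
results = [(s, non_empty_by_negation(s), non_empty_by_dot(s)) for s in samples]
both_agree = all(negated == dotted for _, negated, dotted in results)
```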
Second, be aware that the pattern of a RewriteRule is evaluated (and back-references stored) *before* any of its associated RewriteConds are evaluated. Therefore, back-references to parenthesized RewriteRule sub-patterns are available for RewriteCond evaluation, and back-references to parenthesized RewriteCond sub-patterns are available in the substitution field of RewriteRule. Therefore, you may eliminate the first RewriteCond in your currently-last rule above, and translate $1 with your urlmap call, rather than %1. The same optimization seems to be applicable to two other rules as well, where you're checking REQUEST_URI with a RewriteCond only to create a back-reference.
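The evaluation order can be modelled in a few lines of Python. This is a toy version of the case-fixing rule from the config above, with the REQUEST_URI condition already dropped; the map entry and the function name are hypothetical, and a plain dict stands in for the urlmap file.

```python
import re

URLMAP = {"about-us": "cms/page-17"}  # hypothetical urlmap entry


def case_fix_redirect(url_path):
    """Toy model of:
         RewriteCond $1 [A-Z]
         RewriteCond ${urlmap:${lowc:$1}} .
         RewriteRule ^/(.*)$ http://www.example.com/${lowc:$1} [R=301,L]
    Returns (redirect target or None, list of steps in the order they ran).
    """
    steps = ["rule pattern"]                 # 1. the rule pattern runs first
    rule = re.match(r"^/(.*)$", url_path)
    if rule is None:
        return None, steps
    dollar_1 = rule.group(1)                 # $1 is now available to the conditions

    steps.append("cond 1")                   # 2. the conditions run next, in order
    if not re.search(r"[A-Z]", dollar_1):
        return None, steps

    steps.append("cond 2")
    if not re.search(r".", URLMAP.get(dollar_1.lower(), "")):
        return None, steps

    steps.append("substitution")             # 3. the substitution is applied last
    return "http://www.example.com/" + dollar_1.lower(), steps


mixed_case = case_fix_redirect("/About-Us")
already_lower = case_fix_redirect("/about-us")
```

The second call stops after the first condition, which is also why cheap conditions are worth putting ahead of expensive ones.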
For the same reason (RewriteRule/RewriteCond pattern-evaluation order), the patterns in RewriteRules should be as specific as possible in order to avoid unnecessary RewriteCond parsing and execution. And while on this subject, RewriteConds that invoke DNS, filesystem checks, or OS shell functions should always go last whenever possible, so that they are only executed if all other conditions are met. And these "external call" functions should be avoided entirely if another "mod_rewrite-internal" method can be used. An example would be to exclude an explicit list of URL-paths from a RewriteRule rather than excluding URL-paths that resolve to existing files, as the first method relies on the "internally-available" requested URL-path-matching using RewriteRule/RewriteCond patterns, while the latter relies on an external call to the filesystem.
Jim
Sorry, I forgot to clarify earlier - this code is in httpd.conf.
Jim, good point about the logical equivalency of "!^$" and "." - I've fixed that.
Your description of how RewriteRule & RewriteCond work together is an eye-opener - it would have been great to have read that a couple of days ago when I first started to research mod_rewrite. I guess I had a subconscious assumption that all code executes from top-to-bottom and didn't really understand why the "RewriteCond $1 [A-Z]" line worked for me.
I removed all the RewriteCond lines where I was just creating a back-reference & updated the RewriteMap calls accordingly.
I also had a think about what you said about only executing required calls. This actually gave me an idea - the code above that redirects cms URLs to the correct URLs is currently written (after the amends from your suggestions above) as:
RewriteCond ${redirectmap:${lowc:$1}} .
RewriteRule ^/(.*)$ http://www.example.com/${redirectmap:${lowc:$1}} [R=301,L]
Would it be more efficient to use:
RewriteCond ${redirectmap:${lowc:$1}} (..*)
RewriteRule ^/(.*)$ http://www.example.com/%1 [R=301,L]
It seems to me that this saves time by eliminating the extra RewriteMap calls.
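A rough way to compare the two variants is to count map lookups in a toy model. Everything here is hypothetical (the class, the map entry, the function names), and the sketch does not model how Apache reads or caches the map file, so it only shows the number of lookups each variant asks for, not real timings.

```python
import re


class CountingMap:
    """Stand-in for a txt: RewriteMap that counts how often it is consulted."""

    def __init__(self, entries):
        self.entries = entries
        self.lookups = 0

    def __call__(self, key):
        self.lookups += 1
        return self.entries.get(key, "")  # empty string means "no match"


def redirect_two_lookups(redirectmap, url_path):
    """Condition tests the map result, substitution looks it up again."""
    captured = re.match(r"^/(.*)$", url_path).group(1)
    if re.search(r".", redirectmap(captured.lower())):
        return "http://www.example.com/" + redirectmap(captured.lower())
    return None


def redirect_one_lookup(redirectmap, url_path):
    """Condition captures the map result, substitution reuses it as %1."""
    captured = re.match(r"^/(.*)$", url_path).group(1)
    cond = re.search(r"(.+)", redirectmap(captured.lower()))
    if cond:
        return "http://www.example.com/" + cond.group(1)
    return None


entries = {"old/page.html": "about"}  # hypothetical redirects.txt entry

map_a = CountingMap(entries)
target_a = redirect_two_lookups(map_a, "/Old/Page.html")

map_b = CountingMap(entries)
target_b = redirect_one_lookup(map_b, "/Old/Page.html")
```

Both variants produce the same target; the capturing version simply consults the map once instead of twice when there is a match.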
I've rewritten a few calls in this fashion - would I be correct in thinking this should speed things up? My fully amended code is below. Thanks again for your help Jim.
###
RewriteEngine On
# Text file key/value table for 301 of cms locations
RewriteMap redirectmap txt:/usr/local/apache/conf/redirects.txt
# Internal map for lowercase conversion
RewriteMap lowc int:tolower
## 301 physical locations to current URLs
# Test if match found
RewriteCond ${redirectmap:${lowc:$1}} (..*)
# If match found redirect request
RewriteRule ^/(.*)$ http://www.example.com/%1 [R=301,L]
## Redirect arbitrary incorrect case where valid url exists
# Check for upper case
RewriteCond $1 [A-Z]
# Test if match found
RewriteCond ${urlmap:${lowc:$1}} (..*)
# If match found redirect request
RewriteRule ^/(.*)$ http://www.example.com/${lowc:$1} [R=301,L]
# Redirect default homepage file to domain root /
RewriteRule ^/index\.html$ http://www.example.com/ [NC,R=301,L]
# Redirect non-www requests to www
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
# Text file key/value table for current-to-cms url translation
RewriteMap urlmap txt:/usr/local/apache/conf/urls.txt
## Rewrite current URLs to cms locations
# Test if match found
RewriteCond ${urlmap:$1} (..*)
# If match found rewrite request
RewriteRule ^/(.*)$ /%1 [L]
###
[edited by: jdMorgan at 1:24 am (utc) on Oct. 24, 2008]
[edit reason] Please use example.com only. [/edit]
Yes, that's more efficient because it avoids a second system call, but actually, it would be even more efficient to use
RewriteCond ${redirectmap:${lowc:$1}} (.+)
RewriteRule ^/(.*)$ http://www.example.com/%1 [R=301,L]
Do check to make sure that your rewritemap expects to be called without the leading slash on the URL-path, and that it returns the mapped new-URL-path value without a leading slash. If not, then you'll need to do a bit more tweaking.
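One way to run that check is a short script over the map file. The sketch below is Python and works on an inline sample rather than the real redirects.txt; the entries are invented, and the only format assumption is the usual txt: map layout of one whitespace-separated key/value pair per line, with # starting a comment.

```python
SAMPLE_MAP = """\
# old location            current URL (hypothetical entries)
cms/page17.html           about-us
/cms/page18.html          /contact
"""


def leading_slash_problems(map_text):
    """Return (line number, problem) pairs for keys or values starting with "/"."""
    problems = []
    for number, line in enumerate(map_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        fields = stripped.split()
        if len(fields) < 2:
            problems.append((number, "missing value"))
            continue
        key, value = fields[0], fields[1]
        if key.startswith("/"):
            problems.append((number, "key has leading slash"))
        if value.startswith("/"):
            problems.append((number, "value has leading slash"))
    return problems


problems = leading_slash_problems(SAMPLE_MAP)
```

With the rule written as ^/(.*)$ and the target as http://www.example.com/%1, a key with a leading slash would never be found, and a value with one would produce a double slash in the redirect.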
---
If you have index.html declared as your DirectoryIndex file (either here or in the default server config), then this may not work:
# Redirect default homepage file to domain root /
RewriteRule ^/index\.html$ http://www.example.com/ [NC,R=301,L]
A typical construct to avoid this is:
# Redirect default homepage file to domain root /
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html [NC]
RewriteRule ^/(([^/]+/)*)index\.html$ http://www.example.com/$1 [NC,R=301,L]
As shown, it redirects /<any_directory_level>/index.html to /<any_directory_level>/
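The reason for testing THE_REQUEST is that it holds the request line exactly as the client sent it, so it is not affected when the server internally maps / to /index.html. A minimal Python sketch of the construct follows; the function name and sample requests are invented, and Python's re module stands in for mod_rewrite's regex engine.

```python
import re

# Literal space here; the Apache version escapes it ("\ ") because an
# unescaped space would end the RewriteCond pattern argument.
THE_REQUEST_PATTERN = re.compile(r"^[A-Z]{3,9} /([^/]+/)*index\.html", re.IGNORECASE)
RULE_PATTERN = re.compile(r"^/(([^/]+/)*)index\.html$", re.IGNORECASE)


def index_redirect(request_line, url_path):
    """request_line models THE_REQUEST (what the client actually sent);
    url_path models the current URL-path, which may already have been
    changed internally, e.g. "/" mapped to "/index.html"."""
    rule = RULE_PATTERN.search(url_path)          # rule pattern first
    if rule is None:
        return None
    if THE_REQUEST_PATTERN.search(request_line) is None:
        return None                               # client did not ask for index.html
    return "http://www.example.com/" + rule.group(1)


asked_for_index = index_redirect("GET /index.html HTTP/1.1", "/index.html")
nested_index = index_redirect("GET /docs/faq/index.html HTTP/1.1", "/docs/faq/index.html")
internal_only = index_redirect("GET / HTTP/1.1", "/index.html")
```

The last call models a client that asked for / and was mapped to /index.html internally: the rule pattern matches, but the condition does not, so no redirect is issued and there is no loop.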
---
Don't forget to check for an FQDN and/or port number appended to the canonical domain:
# Redirect non-www requests to www
RewriteCond %{HTTP_HOST} ^example\.com [NC,OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com(\.|\.?:[0-9]+)$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
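To see which Host values these two conditions catch, the patterns can be tried in Python. This sketch reads the alternation in the second condition as a plain pipe character, models [NC] with IGNORECASE and [OR] with a Python "or"; the list of hosts is made up for the test.

```python
import re

BARE_DOMAIN = re.compile(r"^example\.com", re.IGNORECASE)
WWW_WITH_SUFFIX = re.compile(r"^www\.example\.com(\.|\.?:[0-9]+)$", re.IGNORECASE)


def needs_redirect(http_host: str) -> bool:
    """True when either condition matches, i.e. the host is not canonical."""
    return bool(BARE_DOMAIN.search(http_host) or WWW_WITH_SUFFIX.search(http_host))


hosts = [
    "www.example.com",        # canonical, left alone
    "WWW.example.com",        # canonical apart from case, left alone
    "example.com",            # bare domain
    "Example.COM",            # bare domain in mixed case
    "www.example.com.",       # trailing dot (FQDN form)
    "www.example.com:8080",   # appended port
    "www.example.com.:80",    # trailing dot and port
]
verdicts = {host: needs_redirect(host) for host in hosts}
```

Only the plain www hostname passes through untouched; every other variant in the list would be redirected to http://www.example.com/.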
Jim
".+" = "..*"- yikes, even I should have spotted that one.
Yep, I wrote the rewritemap without the leading slash to future proof the config for any .htaccess use should overrides ever be needed.
I took some time over the weekend to look into mod_dir & DirectoryIndex. Not having an Apache background, it took me a while before I got a grasp on the context. The Apache documentation (httpd.apache.org) for mod_dir is pretty vague.
If I now understand correctly, the key issue here is that mod_dir uses DirectoryIndex to internally rewrite a request for e.g. / to /index.html, and that this updates REQUEST_URI and restarts processing of the directives - as opposed to rewriting & returning as I do at the end of the code (the 'ah ha' moment came from reading your comments in [webmasterworld.com...]). Correct me if I'm wrong here.
With this perspective, your explanation and code suggestion makes a lot of sense. I'm in the process of going through the ~240 URLs I'm trying to preserve to make sure your any-directory-level example won't break anything. If I just need to do this for the homepage, I assume the best thing to use would be:
# Redirect default homepage file to domain root /
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html [NC]
RewriteRule ^/index\.html$ http://www.example.com/ [NC,R=301,L]
I never would have thought of checking for port number. That's a really good catch, thank you.
Just for the sake of understanding, in your suggested code you use the NC flag for both RewriteConds:
# Redirect non-www requests to www
RewriteCond %{HTTP_HOST} ^example\.com [NC,OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com(\.|\.?:[0-9]+)$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
I thought HTTP_HOST would always contain a lower case string because the server's understanding of the domain is always lower case. Am I off on this one?
You're right - the RewriteRule/RewriteCond processing order is in the mod_rewrite docs, but I didn't understand that explanation of processing order until I'd read yours. In my opinion, you just explained it with more clarity, especially for people assuming sequential processing.
Fair call - I guess this code isn't *that* long :)
It's long enough for me to write it inefficiently the first time though - thank you all for your help.
In .htaccess the RewriteRule cannot "see" the leading "/".
As for the "casing" of HTTP_HOST I think that this covers one basic Internet tenet: "Be liberal in what you accept, and strict in what you send" so it will still work if a user with a badly coded browser drops by.
That makes sense - write robust code to cover all comers. I'll definitely be using Jim's code to check for appended port numbers.
When I've confirmed the directory-ending URL situation, I'll post my final code in case anyone can benefit from it in future.
Thank you once again g1smd and Jim for your help.
Thanks again to all for your help.
-------
RewriteEngine On
# Text file key/value table for 301 of old locations
RewriteMap redirectmap txt:/usr/local/apache/conf/redirects.txt
# Internal map for lowercase conversion
RewriteMap lowc int:tolower
## 301 physical locations to current URLs
# Test if match found
RewriteCond ${redirectmap:${lowc:$1}} (.+)
# If match found redirect request
RewriteRule ^/(.*)$ http://www.example.com/%1 [R=301,L]
## Redirect arbitrary incorrect case where valid url exists
# Check for upper case
RewriteCond $1 [A-Z]
# Test if match found
RewriteCond ${urlmap:${lowc:$1}} (.+)
# If match found redirect request
RewriteRule ^/(.*)$ http://www.example.com/${lowc:$1} [R=301,L]
# Redirect default homepage file requests to domain root /
### For .htaccess use, remove the leading / ###
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html [NC]
RewriteRule ^/index\.html$ http://www.example.com/ [NC,R=301,L]
# Redirect non-www requests to www
RewriteCond %{HTTP_HOST} ^example\.com [NC,OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com(\.|\.?:[0-9]+)$ [NC]
RewriteRule ^/(.*)$ http://www.example.com/$1 [R=301,L]
# Text file key/value table for current-to-old url translation
RewriteMap urlmap txt:/usr/local/apache/conf/urls.txt
## Rewrite current URLs to old locations
# Test if match found
RewriteCond ${urlmap:$1} (.+)
# If match found rewrite request
RewriteRule ^/(.*)$ /%1 [L]
-------
[edited by: JuddG at 10:21 pm (utc) on Nov. 18, 2008]