RedirectPermanent vs mod_rewrite for page/site redirection

redirecting site while changing some individual filenames

         

Robert Charlton

8:46 am on Jun 4, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



We are simultaneously rebuilding a site and changing the domain name, all the filenames, the file extensions (from htm to html), and the host.

Only about 6 pages of the site are optimized, and I'll be redirecting those pages individually to new urls. For the rest of the site, I'll want to redirect queries to the default index page of the new site. There's no good reason to redirect the non-optimized pages individually.

I'll be using .htaccess. I think I understand the syntax for RedirectPermanent. I'm not at all sure about mod_rewrite.

For what I want to do, is there an advantage to using mod_rewrite over RedirectPermanent to redirect from olddomain.com?

(I will be using mod_rewrite on newdomain.com to force "www".)

For RedirectPermanent in .htaccess in the root of olddomain.com, the code I'd use would be:


RedirectPermanent / http://www.newdomain.com/
RedirectPermanent /oldpage1.htm http://www.newdomain.com/newpage1.html
RedirectPermanent /oldpage2.htm http://www.newdomain.com/newpage2.html
RedirectPermanent /oldpage3.htm http://www.newdomain.com/newpage3.html
- etc

Note that there is no regular relationship between old page names and new page names that would use the pattern matching capabilities of mod_rewrite.

For mod_rewrite, I think the equivalent of the first line above would be:


RewriteEngine on
RewriteCond %{HTTP_HOST} ^(www\.)?olddomain\.com
RewriteRule (.*) http://www.newdomain.com/ [R=301,L]

And for the second line etc would be:

RewriteRule ^oldpage1\.htm$ http://www.newdomain.com/newpage1.html [R=301,L]

I'm a mod_rewrite newbie, not well versed in the syntax. So I don't know how to combine the above RewriteRule lines, what flags go where, what happens with the additional RewriteRule lines in relation to the first RewriteCond, etc.

Whether or not I end up using mod_rewrite here, I'd like to have a much better idea how to do it.

jdMorgan

5:33 pm on Jun 4, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It looks like you're on the right track here.

A given RewriteCond applies to the immediately-following RewriteRule only, unless you use the [C] (chain) flag on the Rule. Chaining is not what you want to do here, though.
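As a minimal sketch of that scoping (the hostnames and page names are the placeholders used in this thread, and the [NC] flag is an addition here for case-insensitive host matching):

```apache
RewriteEngine on
# This condition guards only the rule directly below it
RewriteCond %{HTTP_HOST} ^(www\.)?olddomain\.com [NC]
RewriteRule ^oldpage1\.htm$ http://www.newdomain.com/newpage1.html [R=301,L]
# No RewriteCond precedes this rule, so it is evaluated for any hostname
RewriteRule ^oldpage2\.htm$ http://www.newdomain.com/newpage2.html [R=301,L]
```

The first rule fires only when the requested host matches olddomain.com; the second fires regardless of the host.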

As to combining the rules, just list them out as you did with the RedirectPermanents. The [R=301] flag specifies a permanent redirect, and the [L] flag tells mod_rewrite to quit and do the redirect immediately without processing further rules if the pattern matches (and in your case there is no need to process further rules if the pattern matches).

As far as overall structure, you'll probably want to redirect individual pages first, and then do the catch-all domain redirects. In this way, each old page with a direct replacement (even though the pagename may be different) will be redirected to its replacement page. Those pages which have no direct replacement will be redirected to the home page of the new domain. Images, css, and scripts will be redirected to their corresponding replacements on the new domain (You can't redirect an image or script, etc. to an html page, anyway).

I assume from looking at the code you posted that your new domain is hosted in the same 'account' as the old domain, and so the following is prefaced with a rule that prevents any redirects if the requested host is the new domain name. This will prevent an infinite redirection loop. As a result, all of this code must be placed *after* any rules (that you might add later) that apply to the new domain; otherwise, the new rules would be skipped.


# Quit rewriting if we're already in new domain
RewriteCond %{HTTP_HOST} ^(www\.)?newdomain\.com
RewriteRule .* - [L]
# Redirect specific URLs to new optimized pages
RewriteRule ^oldpage1\.htm$ http://www.newdomain.com/newpage1.html [R=301,L]
RewriteRule ^oldpage2\.htm$ http://www.newdomain.com/newpage2.html [R=301,L]
...
RewriteRule ^oldpage6\.htm$ http://www.newdomain.com/newpage6.html [R=301,L]
# Redirect all other old html pages to new home page
RewriteRule \.html$ http://www.newdomain.com/ [R=301,L]
# Redirect non-html resources to same filename on new domain
RewriteRule (.*) http://www.newdomain.com/$1 [R=301,L]

The Redirect family of directives is useful only if
  • The domains are separately hosted, or
  • The new and old pages have unique names which do not exist on the other domain.
Otherwise, you can end up in an infinite loop of redirects.
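For the separately hosted case, a mod_alias version could be sketched as follows. mod_alias matches Redirect by URL-path prefix and uses the first directive that matches, so the page-specific lines need to come before any catch-all. RedirectMatch is used for the catch-all here (an assumption, not something from the posted code) because a plain "RedirectPermanent /" would append the requested path to the target rather than send everything to the home page.

```apache
# Specific pages first; the first matching directive wins
RedirectPermanent /oldpage1.htm http://www.newdomain.com/newpage1.html
RedirectPermanent /oldpage2.htm http://www.newdomain.com/newpage2.html
# Catch-all last: any other path goes to the new home page
RedirectMatch permanent ^/ http://www.newdomain.com/
```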

Jim

    Robert Charlton

    2:28 am on Jun 5, 2004 (gmt 0)




    Jim - Thanks. Good to know I'm on the right track. I was feeling great until I got to this sentence...

    I assume from looking at the code you posted that your new domain is hosted in the same 'account' as the old domain, and so the following is prefaced with a rule that prevents any redirects if the requested host is the new domain name.

    I'm not sure what it was about the code I posted that made you think this. In this case, olddomain.com and newdomain.com will be on two separate accounts with unique IPs. What did I get wrong? ;)

    Taking a crack at it again, since you say...

    A given RewriteCond applies to the immediately-following RewriteRule only...
    and...
    ...you'll probably want to redirect individual pages first, and then do the catch-all domain redirects

    ...would the code work in the following form?

    # Redirect specific URLs to new optimized pages 
    RewriteRule ^oldpage1\.htm$ http://www.newdomain.com/newpage1.html [R=301,L]
    RewriteRule ^oldpage2\.htm$ http://www.newdomain.com/newpage2.html [R=301,L]
    ...
    RewriteRule ^oldpage6\.htm$ http://www.newdomain.com/newpage6.html [R=301,L]
    # Redirect all other old html pages to new home page
    RewriteCond %{HTTP_HOST} ^(www\.)?olddomain\.com
    RewriteRule \.htm$ http://www.newdomain.com/ [R=301,L]

    ...or, do I need the RewriteCond line as the first line anyway?

    Note that I've made the last rewrite rule \.htm rather than \.html, since the old site's pages have htm, not html extensions.

    Also, I have dropped the rule for "redirecting non-html resources to same filename on new domain." Since we are really building a new site on a new domain from the ground up, I think that redirecting non-html resources has the potential of creating conflicts for same-name files, should any pop up. Is there a way, though, to send users who might be following a link to, say, a PDF to the home page too, rather than to the same filename?

    It looks like mod_rewrite does make the most sense. You've converted me. Several follow-up comments and questions...

    1) Options +FollowSymLinks:

    I should mention, for others who might be looking at this, that I have not included an Options +FollowSymLinks line at the beginning of all this because SymLinks is already enabled on the server I'm using. Options +FollowSymLinks has been tricky enough for a server-challenged person like me that, when I get the chance, I'll start a separate thread for feedback, just to have as a resource on the forum.
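For servers where symlink following is not already enabled, the top of the .htaccess file might look like the sketch below. This assumes the host's AllowOverride setting permits Options in .htaccess; if it does not, the Options line will typically cause a server error.

```apache
# mod_rewrite in .htaccess needs FollowSymLinks (or SymLinksIfOwnerMatch)
Options +FollowSymLinks
RewriteEngine on
```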

    2) Copying code from the above?

    When I copy the posted code and paste it into an ASCII file, I pick up a space between the end of the [R=301,L] flag and the carriage return. Just to be absolutely certain, does this space matter? I know it does on some code. Should I drop it out?

    3) Testing the redirects on the IP account?

    The site that is olddomain.com is currently being moved from a very lame hosting company to a new IP account. I'll be doing the page redirects to newdomain.com from the new IP for olddomain.com after the DNS redirect, once olddomain.com is live on the new account. Hope this makes sense.

    It occurs to me that before we point the olddomain.com DNS to this IP, I will have the IP account (which is blocked by robots.txt) available for testing the page redirects if I can substitute the IP address for olddomain.com into the urls I use for testing. Eg...

    http*//55.55..55.55 for http*//www.olddomain.com

    and

    http*//55.55..55.55/oldpage2.htm for http*//www.olddomain.com/oldpage2.htm

    Leaving ":" out of the above. For some reason, the board doesn't like them.

    Any reason why this wouldn't work to test the mod_rewrite?

    4) What about email?

    There is a separate mailserver, handled by someone else. Will any of the above code create problems for him? The email question might be another thread entirely. I'll be talking to the email person next week, but, prior to that, are there any threads on the forum that cover what happens to email when you redirect? I couldn't find any.

    Thanks.

    gergoe

    11:47 am on Jun 5, 2004 (gmt 0)

    10+ Year Member



    I'm not sure what it was about the code I posted that made you think this. In this case, olddomain.com and newdomain.com will be on two separate accounts with unique IPs. What did I get wrong? ;)

    The fact that you used a RewriteCond to catch only requests for the old domain, which you only need if the old and the new domains are on the same account (same Apache server).

    ...would the code work in the following form?

    # Redirect specific URLs to new optimized pages
    RewriteRule ^oldpage1\.htm$ http://www.newdomain.com/newpage1.html [R=301,L]
    RewriteRule ^oldpage2\.htm$ http://www.newdomain.com/newpage2.html [R=301,L]
    ...
    RewriteRule ^oldpage6\.htm$ http://www.newdomain.com/newpage6.html [R=301,L]
    # Redirect all other old html pages to new home page
    RewriteCond %{HTTP_HOST} ^(www\.)?olddomain\.com
    RewriteRule \.htm$ http://www.newdomain.com/ [R=301,L]

    ...or, do I need the RewriteCond line as the first line anyway?

    Yes, it would work. The RewriteCond was on the first line only because of the assumption that both domains are on the same account, so you don't really need it in your case. If the domains aren't on the same server, you can leave out the RewriteCond completely.
    If you want to be on the safe side with the redirection of the html files, use the \.html? pattern, which matches both htm and html.

    Also, I have dropped the rule for "redirecting non-html resources to same filename on new domain." Since we are really building a new site on a new domain from the ground up, I think that redirecting non-html resources has the potential of creating conflicts for same-name files, should any pop up. Is there a way, though, to send users who might be following a link to, say, a PDF to the home page too, rather than to the same filename?

    You can redirect any resource to a new filename, but in most cases there is little point; it is better to send a 404 back for those files. PDFs and other resources that aren't inline elements (that is, not embedded in an html page) can be redirected to the index page. For example, if you have a scanned paper on the server that is displayed in its own browser window, a request for it can be redirected to the index page, and the browser will display the html page. In general it is better to send a 404, and if you still want to display some message in the browser, use a custom 404 page, which you can set up with the ErrorDocument 404 directive.
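A rough sketch of both options, assuming a hypothetical /notfound.html page on the old site and a host that allows ErrorDocument in .htaccess:

```apache
RewriteEngine on
# Option 1: send requests for old PDFs to the new home page
RewriteRule \.pdf$ http://www.newdomain.com/ [R=301,L]

# Option 2: let missing files return 404, but show a friendly page
# (/notfound.html is a made-up name for illustration)
ErrorDocument 404 /notfound.html
```

ErrorDocument only comes into play for requests that no redirect rule has already caught, so it has no effect while a catch-all redirect is in place.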

    Extra whitespace has no effect on mod_rewrite. If a required space is missing, the rule will not work, but ten spaces where only one is needed behave exactly like one.

    If you use the IP address instead of the domain name, it might not work. If Apache on the new server uses name-based virtual hosting (which is very likely), you can't substitute the IP address for the domain name, because Apache decides which VirtualHost to use based on the domain name in the request; if you use the IP address, Apache will use the default VirtualHost. The best you can do is give it a try: open your browser, type http;//ipaddress and see what happens.

    Email is separate from all this. It depends on the MX record for the domain, the mail server, and of course the IP addresses, none of which has anything to do with mod_rewrite or Apache. I guess the only change for the old domain will be the IP address that the web hostnames (www.olddomain.com, olddomain.com) point to, which should not affect the mail services (unless the company hosting the domain is seriously lame).

    Robert Charlton

    12:53 am on Jun 6, 2004 (gmt 0)




    Thanks... My grasp of regex and Apache syntax is iffy, so please bear with me on a few more questions.

    Re leaving out the RewriteCond...

    It makes sense that since this will be working in an .htaccess that's on the root of olddomain.com, it would be assumed that file references without domain names are for local files, and that specifying olddomain.com would be superfluous.

    The sole line for redirecting a request for www.olddomain.com to www.newdomain.com, then, is:

    RewriteRule \.htm$ http://www.newdomain.com/ [R=301,L]

    Question... how does the syntax work so that a request for the domain name, without a file specified, works in this line? Is the default file (index.htm) appended to the domain somewhere else on the server before it gets to .htaccess? I assume it is understood somewhere along the line. The \.htm$ in the line is what's throwing me.

    the last RewriteRule line...

    use the \.html? pattern, which matches both htm and html

    I'm not sure I understand the ? quantifier syntax here. ?, as I've been able to find, is a quantifier for an optional match (0 or 1 times).

    Would this then be "htm?" to match both "htm" and "html"? (I can see where this wild-card syntax would be a problem, so I'm guessing probably not, but I'm asking to be sure).

    Would the final rule be?...

    # Redirect all other old html pages and the domain to new home page
    a) RewriteRule \.html?$ http://www.newdomain.com/ [R=301,L]
    - or:
    b) RewriteRule \.htm?$ http://www.newdomain.com/ [R=301,L]

    testing with the IP#...

    Regarding the use of the IP rather than the domain name, I can access the dummy index page in the account right now via the IP number in my browser. From what you are saying, do I understand correctly that mod_rewrite should behave the same, and that the above code in .htaccess should work, pre DNS redirection, either for the IP# or domain name account?

    jdMorgan

    5:21 am on Jun 6, 2004 (gmt 0)




    Question... how does the syntax work so that a request for the domain name, without a file specified, works in this line? Is the default file (index.htm) appended to the domain somewhere else on the server before it gets to .htaccess? I assume it is understood somewhere along the line. The \.htm$ in the line is what's throwing me.

    The assumption here is that we are redirecting only file types that end in ".htm" or ".html" with that RewriteRule. "index.html" is *not* appended to the URL as seen by RewriteRule or RewriteCond %{REQUEST_URI}. So that rule, as written, would have no effect on a request for "http://olddomain.com/".

    Since that rule was originally intended as a "catch-all" at the end, you'll probably want to add yet another one to handle non-html-type files by redirecting to the same filename on the new server:


    # Redirect specific URLs to new optimized pages
    RewriteRule ^oldpage1\.htm$ http://www.newdomain.com/newpage1.html [R=301,L]
    RewriteRule ^oldpage2\.htm$ http://www.newdomain.com/newpage2.html [R=301,L]
    ...
    RewriteRule ^oldpage6\.htm$ http://www.newdomain.com/newpage6.html [R=301,L]
    # Redirect all other old htm pages to new home page
    RewriteRule \.htm$ http://www.newdomain.com/ [R=301,L]
    # Catch-all: Redirect remaining images, scripts, css, etc. to same-named files in new domain
    RewriteRule (.*) http://www.newdomain.com/$1 [R=301,L]

    Again, the reason for this is that page elements which are not themselves pages cannot be redirected to pages. You cannot redirect from a jpg image to an html page, for example; the browser cannot handle this. Basically, objects included in a page using a src attribute cannot be redirected to a page-type object.

    The good news is that in most cases, this rule won't be invoked. If you go into the old site now and make sure that cache expiry times for all filetypes are set to 24 hours or less, and that must-revalidate appears in your cache-control headers, then you'll only need to keep that rule around for a day after making the domain changeover. After that, browsers will request new pages, see new <src> links, and get the new on-page objects from the new domain.
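One possible way to set those headers on the old site, assuming mod_expires and mod_headers are both loaded and allowed in .htaccess (neither module is mentioned in the thread, so treat this as a sketch rather than a tested configuration):

```apache
<IfModule mod_expires.c>
    # Expire every file type one day after it is requested
    ExpiresActive On
    ExpiresDefault "access plus 1 day"
</IfModule>
<IfModule mod_headers.c>
    # Ask caches to revalidate once a cached copy has expired
    Header append Cache-Control "must-revalidate"
</IfModule>
```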

    I'm not sure I understand the ? quantifier syntax here. ?, as I've been able to find, is a quantifier for an optional match (0 or 1 times).

    Would this then be "htm?" to match both "htm" and "html"? (I can see where this wild-card syntax would be a problem, so I'm guessing probably not, but I'm asking to be sure).


    The "?" makes the immediately-preceding character (or parenthesized group of characters) optional, so "\.html?" would be correct. The question mark, asterisk, or plus-sign is not a "stand-in" for the character as in MS-DOS type wildcarding, it is a regular-expressions operator which affects the preceding character. Your "\.htm?" pattern would therefore match either ".htm" or ".ht".
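Side by side, the two candidate rules from the question behave like this (the second is commented out because it is the one that does not do what was intended):

```apache
# The "l" is optional: matches URLs ending in .htm or .html
RewriteRule \.html?$ http://www.newdomain.com/ [R=301,L]

# The "m" is optional: matches URLs ending in .ht or .htm, but not .html
# RewriteRule \.htm?$ http://www.newdomain.com/ [R=301,L]
```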

    testing with the IP#...

    Regarding the use of the IP rather than the domain name, I can access the dummy index page in the account right now via the IP number in my browser. From what you are saying, do I understand correctly that mod_rewrite should behave the same, and that the above code in .htaccess should work, pre DNS redirection, either for the IP# or domain name account?

    Most members here (and most webmasters in general) have shared hosting plans where the IP address is not unique to a single domain. That was the thrust of gergoe's comment. The code will work whenever the server with the code on it is accessed; how it is accessed--by domain name or IP address--makes no difference unless you use RewriteCond to test %{HTTP_HOST}. In that case, you'd have to test for both the IP address and hostname, and accept either. Since we've decided that the new domain is separately-hosted, that is not an issue here.
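If a host test were needed, accepting either form could be sketched like this. The address 55.55.55.55 stands in for the placeholder IP used earlier in the thread, and the [NC] flag is an addition for case-insensitive matching:

```apache
RewriteEngine on
# Accept the old hostname (with or without www) ...
RewriteCond %{HTTP_HOST} ^(www\.)?olddomain\.com [NC,OR]
# ... or the bare IP address used for pre-DNS testing
RewriteCond %{HTTP_HOST} ^55\.55\.55\.55
RewriteRule ^oldpage1\.htm$ http://www.newdomain.com/newpage1.html [R=301,L]
```

The [OR] flag joins the two conditions; without it, consecutive RewriteCond lines are ANDed and both would have to match.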

    Jim

    Robert Charlton

    7:00 am on Jun 21, 2004 (gmt 0)




    Jim and gergoe,

    Thanks. Sorry to have disappeared for a while, but I've been immersed in another part of the same project and staying up much too late as it is coping with some of the learning curve.

    The good news is that I've run a bunch of tests with some dummy pages, and everything works like a charm. The bad news is that I have several questions about some of the variations I tried. ;) I'm asking some of these mainly out of curiosity and caution, as what I have seems to work.

    Is there a possible conflict if I use redundant RewriteRules? Eg, I note that including the old index page among the urls to be rewritten is redundant, and in fact both of the following seem to redirect the index page. Any problem in having both? Anything to be gained?

    RewriteRule ^index\.htm$ http://www.newdomain.com/ [R=301,L]

    # Redirect all other old htm pages to new home page
    RewriteRule \.htm$ http://www.newdomain.com/ [R=301,L]

    I'm also wondering about the catch-all. In the particular site I'm redirecting, the new site with its page templates, etc, is being rebuilt from the ground up... and while some pdf files will carry over from one site to another, none of the images, scripts, css, etc, that the catch-all might redirect will be the same.

    What I'm concerned about is that the catch-all could accidentally pick up a generic name like spacer.gif and redirect it, but it would be a different file. Is this a valid concern? I don't quite understand how the caching mechanism would call up a file that's a component of the page, so this might not really be a concern.

    I'm also trying to understand what redirects the domain itself... not necessarily index.htm... if that's a question that makes sense.

    I ran a test in redirecting the domain name, and found that if I kept the Catch-all as below, without $1, I could drop the redirect of index.htm and of "all other old htm pages," and the domain would still redirect. Any point in keeping the catch-whatever in this form?...

    RewriteRule (.*) http://www.newdomain.com/ [R=301,L]

    What actually is going on when I drop $1? Would dropping $1 eliminate the hypothetical problem of, eg, spacer.gif?

    jdMorgan

    4:51 am on Jun 22, 2004 (gmt 0)




    The \.htm$ rule shown above redirects only files with an .htm extension, so .pdf and .gif requests are unaffected by it.

    You can set this stuff up any way you like. Define what you need to do and implement it. Once you're used to it, writing mod_rewrite code is easy; it's defining the problem or goal that's difficult, and remains difficult for big sites.

    I recommend creating a technology-independent "map" of all old URLs and their disposition, either to a corresponding location on the new server, to a replacement page, to a site map, index page, etc., or to the dustbin. The more the architecture of the new site resembles the old, the easier this is, as you can often redirect entire subdirectories en masse. But it's good to have a plan. After that plan is reviewed, then it can be implemented using whatever technology is needed and available, be it mod_rewrite, ISAPI rewrite, or a script-based approach.

    In a RewriteRule, the $1 is a back-reference that "copies" the contents of the first parenthesized matched sub-pattern in the request to the substitution URL.

    So, if you have a rule
    RewriteRule (.*)\.html$ /$1.shtml [L]
    this takes a request for any URL ending in ".html" and serves a file of the same name, but ending in ".shtml".

    And if you have a rule
    RewriteRule .*\.html$ /foo.shtml [L]
    this takes requests for any URL ending in ".html" and serves the file "foo.shtml", disregarding the originally-requested filename.
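Applied to the catch-all redirect asked about above, the difference could be sketched like this (spacer.gif is the example filename from the question; only one of the two rules would be used):

```apache
# With $1: a request for spacer.gif is redirected to
# http://www.newdomain.com/spacer.gif
RewriteRule (.*) http://www.newdomain.com/$1 [R=301,L]

# Without $1: every request, whatever the filename, is redirected to
# http://www.newdomain.com/
# RewriteRule (.*) http://www.newdomain.com/ [R=301,L]
```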

    Domains are not rewritten or redirected; URLs (whether canonical or partial) are.

    If you use the [L] flags on your rules, there can be no conflict; if a rule ending with an [L] flag matches and is invoked, then no further rules are processed for the current HTTP request.

    Jim