
Apache Web Server Forum

A guide to fixing duplicate content & URL issues on Apache
How to canonicalize all of your URLs with a single redirect

 8:53 pm on Jan 4, 2007 (gmt 0)

Recently, we've had a lot of discussion about domain and URL canonicalization, mainly centered around avoiding duplicate-content problems in Google. There has also been some discussion of fixing type-in URLs, typos in inbound links, and badly-coded inbound links.

To be clear, a "canonical" domain is the single domain you want your site to be known by, and a canonical URL is the single URL you want your page to be known by. Any others are non-canonical.

The word canonical is a religion-related term, and means "according to canon law, scripture or doctrine." But in general use, it just means "usual, standard, conventional, customary, or according to the rules." So as a Webmaster, you choose what single domain you want to use for your site, and what single URL should be used to request each of your pages.

Member g1smd has posted in several of these threads the very good advice that it's best to avoid "stacked redirects" --multiple redirects invoked by a single client request-- while doing things like index page and domain canonicalization. This was reiterated by WebmasterWorld admin tedster in this recent thread [webmasterworld.com].

I have coded various routines to do these kinds of fix-ups on an ad-hoc basis, but have never actually written a single-redirect-does-it-all solution. Actually, that's not quite true -- I had *tried* before, but a nasty mod_rewrite bug in Apache 1.3.x [archive.apache.org] had repeatedly stymied my efforts.

However, on returning to the subject after almost a year spent experimenting and dashing off code in the WebmasterWorld Apache forum, I realized that one trick I had figured out along the way is a work-around for the bug.

So I set out anew to create a domain/URL canonicalization and type-in fixup routine that would do the following:

  • Canonicalize the domain (e.g. redirect non-www and IP address to www)
  • Canonicalize my index pages (redirect "/index.html" to "/")
  • Remove multiple slashes in the URL
  • Remove spurious query strings (my sites' pages are mostly 'static' with a few exceptions)
  • Fix-up common typos in type-in URLs
  • Fix-up invalid inbound links caused by bad HTML mark-up
  • Fix-up URLs resulting from bad copy-and-pastes
  • Fix-up outdated or otherwise incorrect query strings
  • Suppress the fix-up redirect if the resulting URL does not resolve to an existing file
  • Suppress the fix-up if the link is on my own site (In this case, I want to see the 404 error)
  • Suppress the fix-up if the remote user is me or a site tester (Again, we want to see the 404 error)
  • Avoid recursion in mod_rewrite running in a per-directory .htaccess context
  • Avoid the nasty mod_rewrite bug in Apache 1.3.x
  • Do all of the above using a single 301-Moved Permanently redirect

    The result is a routine that can "correct" a request from a badly-coded link like:
    <a href="http://example.com/index,hmtl>for more info, click here</a>

    where the closing quote has been omitted on the link, "html" was mis-typed, and a comma was typed where the filetype-separator period should be.

    The result of a click on that link is a request for "http://example.com/index,hmtl%3Efor%20more%20info,%20click%20here%3C/a%3E".

    The code will redirect that to the canonical domain and index page URL "www.example.com/" using a single redirect, correcting the comma and "hmtl", and stripping off the spurious path info along the way. Or it can fix up multiple slashes or periods, or remove trailing punctuation from links improperly embedded in text, or automatically-linked in forum posts, e.g. "For help with this code, see [httpd.apache.org...]

    This is done with a 301-Moved Permanently redirect, so that search engines are notified not to list or use the incorrect URL, but to replace it with the corrected/canonicalized one.

    This code is intended for the most common Apache hosting set-ups: shared virtual hosting on Apache 1.3.x, with configuration options limited to .htaccess files only.

    Update: After extensive testing on a server kindly provided by coopster, I've discovered that Apache 2.0.52 has the same bug as Apache 1.3.x. Although the original bug report was closed with a statement that this bug was fixed in Apache 2.0.30, it was apparently not fixed completely. Therefore, the solution presented here applies to Apache 2.0 as well.

    This routine is the right solution for my sites, which follow *my* strict URL conventions, but likely not for yours. Modification will almost certainly be required. Like almost all mod_rewrite code, this is not a simple cut-and-paste or find-and-replace proposition.

    It does fix-ups on only the most common URL errors I have seen in my logs, but of course, there are many others; The code is not meant to exhaustively cover all possible errors, just the most common ones on my sites.

    As such, I put it forth as an example, to be examined and perhaps modified by Webmasters who are conversant and comfortable with mod_rewrite and regular expressions. Again, this code is not an entry-level exercise, and the most likely result of trying to modify it without thoroughly understanding and testing it is a disaster -- the best of which would be an immediate server crash, and the worst of which might be to thoroughly trash your search engine rankings.

    Webmasters without extensive mod_rewrite experience might be better off ignoring this post, or spending a lot of time studying the documents cited in our forum charter [webmasterworld.com] and the tutorials in the Apache forum section of the WebmasterWorld library [webmasterworld.com], until the code's operation becomes clear.

    I cannot offer installation and modification "support" for this code; It's rather complex, and attempting to do so would very likely exceed the limited time I have to post here. However, I would like to discuss improvements to efficiency --given the stated design goals-- or other common inbound URL errors you've seen for which a fix-up might be added.

    The warranty expired yesterday, so use at your own risk. :)

    Be sure to replace all broken pipe "¦" characters in the code below with solid pipes before use; Posting on this forum modifies the pipe characters.

    [edited by: jdMorgan at 8:40 pm (utc) on Jan. 6, 2007]



     8:53 pm on Jan 4, 2007 (gmt 0)

    So, all that said, here's what I came up with:

    # .htaccess
    # Specify IP address(es) used by Webmaster, admins, & testers. These may access
    # the server by its unique IP address without being redirected to the domain.
    # Also, URLs are *not* corrected for access by this group, in order to prevent
    # this code from "hiding" problems during development.
    # (Note that these addresses are those of your workstations, not your server.)
    SetEnvIf Remote_Addr ^192\.168\.1\. TestIP=true
    SetEnvIf Remote_Addr ^10\.10\.45\.3$ TestIP=true
    SetEnvIf Remote_Addr ^127\.0\.0\.[1-7]$ TestIP=true
    # Setup: Enable mod_rewrite, disable MultiViews
    Options +FollowSymLinks -MultiViews
    RewriteEngine on
    # Redirect non-problematic URLs
    # Note: The fix-up code below is complex, and is intended for use to fix only
    # generally-specified problematic URL requests. For administrative redirection
    # of specific non-problematic URLs, 'normal' redirects should be placed here.
    RewriteRule ^old_page\.html$ http://www.example.com/new_page.html [R=301,L]
    RewriteRule ^old_page2\.htm$ http://www.example.com/new_page2.htm [R=301,L]
    # This code corrects various problems with URLs, presumably due to typos in
    # links from other sites. It is complicated by measures taken to avoid a
    # mod_rewrite bug in Apache 1.3. ( See http://archive.apache.org/gnats/7879 )
    # This code uses a single external redirect to correct all detected problems.
    # Skip next two rules if lowercasing in progress
    # (Remove this rule if case-conversion plug-in below is removed)
    RewriteCond %{ENV:qLow} ^yes$ [NC]
    RewriteRule . - [S=2]
    # Prevent recursion and over-writing of myURI and myQS
    RewriteCond %{ENV:qRed} ^yes$ [NC]
    RewriteRule .? - [L]
    # Get the client-requested full URI and full query string
    RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ (/[^?]*)(\?[^\ ]*)?\ HTTP/
    RewriteRule .? - [E=myURI:%1,E=myQS:%2]
    # Uppercase to lowercase conversion plug-in
    # (This section, along with the first noted rule
    # above, may be removed if not needed or wanted)
    # Skip next 28 rules if no uppercase letters in URL
    RewriteCond %{ENV:myURI} ![A-Z]
    RewriteRule .? - [S=28]
    # Else swap them out, one at a time
    RewriteCond %{ENV:myURI} ^([^A]*)A(.*)$
    RewriteRule . - [E=myURI:%1a%2]
    RewriteCond %{ENV:myURI} ^([^B]*)B(.*)$
    RewriteRule . - [E=myURI:%1b%2]
    RewriteCond %{ENV:myURI} ^([^C]*)C(.*)$
    RewriteRule . - [E=myURI:%1c%2]
    RewriteCond %{ENV:myURI} ^([^D]*)D(.*)$
    RewriteRule . - [E=myURI:%1d%2]
    RewriteCond %{ENV:myURI} ^([^E]*)E(.*)$
    RewriteRule . - [E=myURI:%1e%2]
    RewriteCond %{ENV:myURI} ^([^F]*)F(.*)$
    RewriteRule . - [E=myURI:%1f%2]
    RewriteCond %{ENV:myURI} ^([^G]*)G(.*)$
    RewriteRule . - [E=myURI:%1g%2]
    RewriteCond %{ENV:myURI} ^([^H]*)H(.*)$
    RewriteRule . - [E=myURI:%1h%2]
    RewriteCond %{ENV:myURI} ^([^I]*)I(.*)$
    RewriteRule . - [E=myURI:%1i%2]
    RewriteCond %{ENV:myURI} ^([^J]*)J(.*)$
    RewriteRule . - [E=myURI:%1j%2]
    RewriteCond %{ENV:myURI} ^([^K]*)K(.*)$
    RewriteRule . - [E=myURI:%1k%2]
    RewriteCond %{ENV:myURI} ^([^L]*)L(.*)$
    RewriteRule . - [E=myURI:%1l%2]
    RewriteCond %{ENV:myURI} ^([^M]*)M(.*)$
    RewriteRule . - [E=myURI:%1m%2]
    RewriteCond %{ENV:myURI} ^([^N]*)N(.*)$
    RewriteRule . - [E=myURI:%1n%2]
    RewriteCond %{ENV:myURI} ^([^O]*)O(.*)$
    RewriteRule . - [E=myURI:%1o%2]
    RewriteCond %{ENV:myURI} ^([^P]*)P(.*)$
    RewriteRule . - [E=myURI:%1p%2]
    RewriteCond %{ENV:myURI} ^([^Q]*)Q(.*)$
    RewriteRule . - [E=myURI:%1q%2]
    RewriteCond %{ENV:myURI} ^([^R]*)R(.*)$
    RewriteRule . - [E=myURI:%1r%2]
    RewriteCond %{ENV:myURI} ^([^S]*)S(.*)$
    RewriteRule . - [E=myURI:%1s%2]
    RewriteCond %{ENV:myURI} ^([^T]*)T(.*)$
    RewriteRule . - [E=myURI:%1t%2]
    RewriteCond %{ENV:myURI} ^([^U]*)U(.*)$
    RewriteRule . - [E=myURI:%1u%2]
    RewriteCond %{ENV:myURI} ^([^V]*)V(.*)$
    RewriteRule . - [E=myURI:%1v%2]
    RewriteCond %{ENV:myURI} ^([^W]*)W(.*)$
    RewriteRule . - [E=myURI:%1w%2]
    RewriteCond %{ENV:myURI} ^([^X]*)X(.*)$
    RewriteRule . - [E=myURI:%1x%2]
    RewriteCond %{ENV:myURI} ^([^Y]*)Y(.*)$
    RewriteRule . - [E=myURI:%1y%2]
    RewriteCond %{ENV:myURI} ^([^Z]*)Z(.*)$
    RewriteRule . - [E=myURI:%1z%2]
    # Set lowercasing-in-progress flag
    RewriteRule . - [E=qLow:yes]
    # If any uppercase characters remain, re-start
    # mod_rewrite processing from the beginning
    RewriteCond %{ENV:myURI} [A-Z]
    RewriteRule . - [N]
    # If any characters were lowercased, set redirect required
    # flag and reset lowercasing-in-progress flag
    # (S=28 from above lands here)
    RewriteCond %{ENV:qLow} ^yes$ [NC]
    RewriteRule . - [E=qRed:yes,E=qLow:done]
    # End Uppercase to lowercase conversion plug-in
    # Fix non-canonical domain requests (except for valid
    # subdomains & stats accessed by unique server IP/address)
    RewriteCond %{HTTP_HOST} !^(www¦dev¦test)\.example\.com(:80)?$
    RewriteCond %{HTTP_HOST}<>%{ENV:TestIP} !^192\.168\.0\.101(:80)?<>true$ [NC]
    RewriteCond %{HTTP_HOST}<>%{REQUEST_URI} !^192\.168\.0\.101(:80)?<>/stats/
    RewriteRule .? - [E=qRed:yes]
    # Replace "hmtl" with "html"
    RewriteCond %{ENV:myURI} ^([^.,]+)[.,]+hmtl [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1.html]
    # Replace comma(s) or multiple filetype delimiter periods in page filepaths
    # with a single period (e.g. "/page,html" or "/page..html")
    RewriteCond %{ENV:myURI} ^([^,.]+)([,.]{2,}¦,)((s?html?¦php[1-9]?¦pdf¦xls).*)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1.%3]
    # Remove invalid trailing characters
    RewriteCond %{ENV:myURI} ^([/0-9a-z._\-]*)[^/0-9a-z._\-] [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1]
    # Fix additional directory paths appended to filenames (/logo.jpg/<directory_path>)
    RewriteCond %{ENV:myURI} ^([^.]+\.[^/]+)/
    RewriteRule . - [E=qRed:yes,E=myURI:%1]
    # Remove trailing punctuation
    RewriteCond %{ENV:myURI} ^(.*)[._\-]+$
    RewriteRule . - [E=qRed:yes,E=myURI:%1]
    # Remove multiple contiguous slashes in URL (up to three instances)
    RewriteCond %{ENV:myURI} ^(.*)//+(.*)$
    RewriteRule . - [E=qRed:yes,E=myURI:%1/%2,C]
    RewriteCond %{ENV:myURI} ^(.*)//+(.*)$
    RewriteRule . - [E=myURI:%1/%2,C]
    RewriteCond %{ENV:myURI} ^(.*)//+(.*)$
    RewriteRule . - [E=myURI:%1/%2]
    # Redirect direct client requests for "<anything>/index.html" to "<anything>/"
    RewriteCond %{ENV:myURI} ^(/([^/]+/)*)index\.html [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1]
    # Redirect specific replaced/relocated pages to specific new pages
    # (Note: This is 'doing it the hard way,' and only URLs that have
    # been requested with typos/type-ins or other problems should be
    # included here. A straight 301 redirect rule located above all of
    # the code shown here can be used to redirect non-problematic URLs)
    RewriteCond %{ENV:myURI}<>/locales.html ^/location\.html<>(.+)$ [NC,OR]
    RewriteCond %{ENV:myURI}<>/about/widgets-intl.html ^/about/local-widgets\.html<>(.+)$ [NC,OR]
    RewriteCond %{ENV:myURI}<>/selector/widget-selector.html ^/selector/widgets[^.]+\.xls<>(.+)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1]
    # Redirect all pages in old directories to same-named pages in new directories
    RewriteCond /new_dir1<>%{ENV:myURI} ^([^<]+)<>/old_dir1(.+)$ [NC,OR]
    RewriteCond /new_dir2<>%{ENV:myURI} ^([^<]+)<>/old_dir2(.+)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1%2]
    # Redirect old filetype to new filetype
    RewriteCond %{ENV:myURI}<>.jpg ^(/[^.]+)\.jpeg<>(.+)$ [NC,OR]
    RewriteCond %{ENV:myURI}<>.php5 ^(/[^.]+)\.php4<>(.+)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1%2]
    # Correct bad query string on products page link
    # (myQS includes its leading "?", so the pattern must allow for it)
    RewriteCond %{ENV:myURI} ^/products\.php$
    RewriteCond %{ENV:myQS} ^(\?([^&]+&)*)product=w1234(&.+)?$
    RewriteRule . - [E=qRed:yes,E=myQS:%1product=w01234%3,S=2]
    # Remove blank query strings from all URLs
    RewriteCond %{ENV:myQS} ^\?$
    RewriteRule .? - [E=qRed:yes,S=1]
    # Remove spurious query strings from non-dynamic pages
    RewriteCond %{ENV:myQS} ^\?
    RewriteCond %{ENV:myURI} !^/(locales¦test)\.html$
    RewriteCond %{ENV:myURI} !^/(cats¦products)\.php$
    RewriteCond %{ENV:myURI} !^/cgi-bin/
    RewriteRule .? - [E=qRed:yes,E=myQS:?]
    # Do the external 301 redirect only if the referrer is
    # not our own site, the resource exists at the corrected URL,
    # and the requesting IP is not that of our site tester.
    # (Note: Some of these conditions have been commented-out for code testing.
    # Once the code has been tested thoroughly, be sure to un-comment these lines.)
    RewriteCond %{ENV:qRed} ^yes$ [NC]
    #RewriteCond %{ENV:TestIP} !^true$ [NC]
    RewriteCond %{HTTP_REFERER} !^http://((www¦dev¦test)\.)?example\.(org¦com)
    RewriteCond %{HTTP_REFERER} !^http://192\.168\.0\.101(:80)?/?
    #RewriteCond %{DOCUMENT_ROOT}%{ENV:myURI} -f [OR]
    #RewriteCond %{DOCUMENT_ROOT}%{ENV:myURI} -d
    RewriteRule .? http://www.example.com%{ENV:myURI}%{ENV:myQS} [R=301,L]
    # ##### End URL fixup redirect routine #####


    The "<>" characters used in several RewriteConds above have no special meaning to mod_rewrite and are not regular-expressions operators. They are merely a unique character string that I use to enable unambiguous matching of combined server variable values on a single line by clearly delineating one value from the other.
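As a small stand-alone illustration of the delimiter technique (a sketch, not part of the routine above): the test string of a RewriteCond may mix literal text and variables, so a replacement value can ride along after the delimiter and be picked up by a back-reference. The paths /old.html and /new.html are placeholders, and myURI is assumed to have been set by an earlier rule as in the listing.

```apache
# The test string expands to something like "/old.html<>/new.html".
# The pattern matches the old path before "<>" and captures
# the literal text after it as %1.
RewriteCond %{ENV:myURI}<>/new.html ^/old\.html<>(.+)$ [NC]
# Store the captured replacement path back into the variable
RewriteRule . - [E=myURI:%1]
```

Because the replacement travels inside the test string, several such conditions can be joined with [OR] in front of a single RewriteRule, each supplying its own target through the same %1 back-reference.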

    The rules are in a specific order; Some of the later rules depend upon the actions of previous rules.

    Some of the rules have exclusions implemented using RewriteConds. You may not need them at all, or you will very likely need to modify them to suit your site.

    The order of the RewriteConds is intentional. In some cases, the given order is required so that back-references will function correctly, and in other cases they are ordered based on performance considerations. For example, it is good to avoid directory-exists and file-exists checks if possible, since they take a lot of time and CPU resources. So these are deferred until all other conditions are met.
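To make the ordering point concrete, here is a sketch of how the final redirect rule from the listing might look once its commented-out conditions are enabled. mod_rewrite evaluates the conditions in the order written and stops at the first one that fails, so the disk checks at the bottom run only for requests that have already passed every string test. The host names and IP address are the placeholders used in the listing, and solid pipes are used here.

```apache
# 1. Environment-variable string tests: cheap, and they reject most requests
RewriteCond %{ENV:qRed} ^yes$ [NC]
RewriteCond %{ENV:TestIP} !^true$ [NC]
# 2. Request-header string tests
RewriteCond %{HTTP_REFERER} !^http://((www|dev|test)\.)?example\.(org|com)
RewriteCond %{HTTP_REFERER} !^http://192\.168\.0\.101(:80)?/?
# 3. Filesystem checks last: reached only if everything above matched
RewriteCond %{DOCUMENT_ROOT}%{ENV:myURI} -f [OR]
RewriteCond %{DOCUMENT_ROOT}%{ENV:myURI} -d
RewriteRule .? http://www.example.com%{ENV:myURI}%{ENV:myQS} [R=301,L]
```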

    A very simple but effective way to test this code is to create a page of non-canonical, mis-typed, and malformed links to your site, and then click those links using a Mozilla or Firefox browser with the "Live HTTP Headers" extension enabled. The server response can then be examined in detail to be sure it's working as expected.

    Remember that the code was intentionally designed to *not* correct requests referred from your own site or to correct links when clicked-on by you or testers within your organization, as listed in the exclusion section at the top. It will make your life easier if you leave the RewriteConds in the 301-redirect rule commented-out until you have adapted this code to your site and have thoroughly tested it. Then un-comment those RewriteConds and re-test from a machine that is not part of your development and test network.

    A brief explanation of the techniques used here and the point of this exercise: Apache has a nasty mod_rewrite bug that prevents multiple internal rewrites from working properly, except for a few cases where subdirectories are not present in the URL-path. If an attempt is made to rewrite /dir/a.html to /dir/b.html, and then to rewrite /dir/b.html to /dir/c.html in a second RewriteRule, the resulting URL will be /dir/c.html/c.html. The more sequential rewrites are done, the more times the filepath will be added to the end of the URL. And of course, if you add yet another RewriteRule to try to remove it, you still end up with two repeats of the filepath!

    As was stated at the outset, it is best to avoid 'stacked' redirects, both to avoid confusing search engine robots, and to facilitate the efficient passing of PageRank/link-popularity through to the redirect target URL.

    Both of the above problems are addressed in this code through the use of environment variables: "qRed" to flag a queued external redirect, "myURI" to hold the URL as it is tested and modified, and "myQS" to hold the query string as it is tested and modified. By using these second two variables to completely by-pass Apache's normal URI-handling variables, the Apache mod_rewrite bug is avoided.
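Stripped of the individual fix-ups, the queued-redirect pattern described here reduces to something like the following sketch. It reuses the variable names from the full listing (qRed, myURI, myQS) and keeps a single fix-up, the "hmtl" correction, as a stand-in for all the others. www.example.com is a placeholder, and this is meant to show the control flow only, not to replace the tested code.

```apache
RewriteEngine on
# 1. Prevent recursion and over-writing of myURI and myQS
RewriteCond %{ENV:qRed} ^yes$
RewriteRule .? - [L]
# 2. Copy the client-requested path and query string into private variables
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ (/[^?]*)(\?[^\ ]*)?\ HTTP/
RewriteRule .? - [E=myURI:%1,E=myQS:%2]
# 3. Each fix-up tests and changes only myURI (or myQS) and sets the flag;
#    the substitution is "-", so the real URL is never rewritten
RewriteCond %{ENV:myURI} ^([^.,]+)[.,]+hmtl [NC]
RewriteRule . - [E=qRed:yes,E=myURI:%1.html]
# 4. One external redirect at the very end, built from the private variables
RewriteCond %{ENV:qRed} ^yes$
RewriteRule .? http://www.example.com%{ENV:myURI}%{ENV:myQS} [R=301,L]
```

Since no rule before the last one performs an actual substitution, there is never a sequence of internal rewrites for the bug to act on.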

    Unfortunately, this makes the code at least twice as long as it would be without the bug, but I haven't found a better way to work around it.

    I have provided a "plug-in" for doing uppercase-to-lowercase conversion in .htaccess. I call it a 'plugin' because I structured it so that it can be easily added or removed with minimum impact on the other code. I *do not* suggest including or using this case-conversion code unless it is absolutely necessary to correct a pre-existing or emerging problem; It is potentially a very-slow, high-CPU-load routine because it will invoke a restart of all mod_rewrite processing if more than one instance of any given capital letter appears in the requested URL. As such, you should take all steps possible to avoid depending on it for any purpose other than to correct inbound links from other sites which are non-responsive to requests for link correction and are completely out of your control. It should certainly not be used to "allow" you to use mixed-case URLs on your own pages; The result is almost certain to be an overloaded server if your site is even moderately popular.

    I've tried to be specific in the description of the individual routines. Some of them may not be useable on your site. For example, the "Fix additional directory paths appended to filenames" routine cannot be used as-is on sites which have periods in directory paths. It would have to be re-coded or removed for use on such a site.

    This code came off a live server, and has therefore been fairly-thoroughly tested. However, I have had to modify it for posting here, both to protect my own privacy and to comply with the WebmasterWorld Terms of Service. It is therefore possible that I have introduced an error or two. If so, please accept my apologies.

    If you use this code, I'd appreciate it if you'd attribute it to me. Thanks.

    Important: As noted above, replace all broken pipe "¦" characters in the code above with solid pipes before use; Posting on this forum modifies the pipe characters.

    I hope this is useful. Happy coding!


    [added] Incorporated the improvement from my post below, modified the comments to include Apache 2.0, and fixed one error in a "skip count." [/added]

    [edited by: jdMorgan at 9:11 pm (utc) on Jan. 6, 2007]


     9:34 pm on Jan 4, 2007 (gmt 0)

    Pure diamond posting Jim ..let me be the first to thank you for taking the time to help all members..

    sincerely ..M


     9:51 pm on Jan 4, 2007 (gmt 0)

    Wow! Flagged.

    Now for about a month for me to digest it all.

    Thanks, Jim.


     10:01 pm on Jan 4, 2007 (gmt 0)

    excellent post .. one question with regard to the lowercasing.. what is the reason behind this extensive 'plug-in'? would it not be possible to simply treat all URLs as [NC]? as unix systems are case-sensitive, i do see a reason, but i was just wondering if that is why. would case-sensitive urls be considered non-canonical from a search engine perspective if they indeed resolved correctly (assuming [nc] processing)?


     10:46 pm on Jan 4, 2007 (gmt 0)

    The problem is that you don't know what URLs will be linked-to by other sites or typed-in by people to reach your site. If someone types in "Example.com/Index.html" to get to your site, what will happen?

    On most sites using all-lowercase filenames, you'll get a 404-Not Found error. If that's the result of a link from another site, then any PageRank or link-popularity ascribed to the link by search engines will be lost. If it's the result of a bad link or of a person typing-in a URL incorrectly, then it's an unpleasant surprise for the user, and becomes a usability issue.

    With the lowercasing code in place, all such requests will be lowercased and redirected to the URL-path which actually resolves to a file -- /index.html. This avoids the 404-Not Found error for improved usability, and passes the PageRank/Link-pop from even incorrectly-typed links to the correct page.

    The reason that I made it a "plug-in" is that it is big, slow, and may have to "call itself" several times, making it even slower. I suggest including it only if needed to solve an existing problem with other sites incorrectly linking to your site with uppercase letters in the URL, or if many people type-in URLs to get to your site with uppercase letters.

    If you have access to server config files, this is so easy/fast to fix; You just define a RewriteMap to call mod_rewrite's internal "tolower" function, and do:

    RewriteRule (.*[A-Z].*) http://www.example.com/${myToLowerMap:$1} [R=301,L]

    But that option isn't available in .htaccess, unless your host has pre-defined the tolower RewriteMap for you, and told you what its map name is.
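For those who do have access to the server configuration, a minimal sketch of the complete set-up might look like this. It assumes mod_rewrite's built-in int:tolower map function, the map name myToLowerMap from the rule above, and www.example.com as a placeholder host. In server or virtual-host context the rule pattern is matched against a URL-path that begins with a slash, so this sketch anchors on ^/ to avoid emitting a double slash in the redirect target.

```apache
# httpd.conf or a <VirtualHost> section (RewriteMap cannot be declared in .htaccess)
RewriteEngine on
# Declare a map named "myToLowerMap" backed by the internal tolower function
RewriteMap myToLowerMap int:tolower
# Redirect any URL-path containing an uppercase letter to its lowercase form
RewriteRule ^/(.*[A-Z].*)$ http://www.example.com/${myToLowerMap:$1} [R=301,L]
```

With this in place a request for /Index.html is answered with a single 301 redirect to http://www.example.com/index.html, with no restart loop needed.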



     12:37 am on Jan 5, 2007 (gmt 0)


    That is just amazing! What an amazing piece of work that is - I can't tell you how much I appreciate the time you took to do this.

    You have no doubt saved many people from the pain and complexity of Google duplicate content issues.

    Thank you so much for your efforts!


     10:01 am on Jan 5, 2007 (gmt 0)

    Excellent topic flagged


     11:58 am on Jan 5, 2007 (gmt 0)



     1:20 pm on Jan 5, 2007 (gmt 0)

    thanks JD i will try to use it........


     2:29 pm on Jan 5, 2007 (gmt 0)

    Hao, flag it.


     5:20 pm on Jan 5, 2007 (gmt 0)

    Talk about a guy that can pull it all together! Thanks Jim!


     5:46 pm on Jan 5, 2007 (gmt 0)


    Thanks Jim.


     6:51 pm on Jan 5, 2007 (gmt 0)

    Again - awesome apache programming JD!

    <note to commenters: any google specific messages or comments should go in the Google forum...>


     7:05 pm on Jan 5, 2007 (gmt 0)

    Thanks all, and "hsieh hsieh ni" to codemeit

    I found an optimization to shorten this up. Replace these two rules:

    # Replace comma(s) in page filepaths with period (e.g. "/page,html")
    RewriteCond %{ENV:myURI} ^([^,]+),+\.*((s?html?¦php[1-9]?¦pdf¦xls).*)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1.%2]
    # Remove multiple filetype delimiter periods (e.g. "/page..html")
    RewriteCond %{ENV:myURI} ^([^.]+)\.{2,}((s?html?¦php[1-9]?¦pdf¦xls).*)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1.%2]

    with this one that does both:

    # Replace comma(s) or multiple filetype delimiter periods in page filepaths
    # with a single period (e.g. "/page,html" or "/page..html")
    RewriteCond %{ENV:myURI} ^([^,.]+)([,.]{2,}¦,)((s?html?¦php[1-9]?¦pdf¦xls).*)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1.%3]

    Anything that can be done to make this pile of code smaller and easier to modify/maintain...

    Replace all broken pipe "¦" characters in the code above with solid pipes before use; Posting on this forum modifies the pipe characters.



     7:17 pm on Jan 6, 2007 (gmt 0)


    I have only ever used about half of that sort of code before.


     9:07 pm on Jan 6, 2007 (gmt 0)

    Well, your post in one of those Google forum threads was one of my incentives to finish this, so Thanks!

    Note to all: I have updated the initial post with several changes.

    It turns out that the same Apache mod_rewrite bug that requires this code to be so complex (compared to single rewrites and redirects) on Apache 1.3.x also exists on Apache 2.0.x. I have changed the comments to reflect this.

    I also incorporated a few code changes: One already mentioned above, another to support correction of multiple instances of multiple contiguous slashes in one URL, and another to fix an error in an [S=] flag skip count that was introduced when I 'sanitized' this code from my server for posting here.



     4:38 pm on Jan 7, 2007 (gmt 0)

    Wow, I think I'm speechless!

    Thanks! Flagged... now to read it through a bunch of times so I understand it. :)


     9:30 pm on Jan 7, 2007 (gmt 0)

    Flagged, exceptional work Jim, thanks for so much help in this regard.


     1:47 pm on Jan 8, 2007 (gmt 0)

    So are you saying that upper and lower case versions of urls could be considered dupe?

    I can't believe that google would be that picky since they convert all the domain names they display to lowercase anyway. I would think that their algo would compensate for that.

    [edited by: Bewenched at 1:53 pm (utc) on Jan. 8, 2007]


     3:32 pm on Jan 8, 2007 (gmt 0)

    RFC 2396 [faqs.org] is the controlling document for these issues.

    In practice, the scheme ("http") and authority (server name or "domain") strings are always sent as lowercase-only by properly-implemented clients, but treated in a case-insensitive manner by servers and all intervening network components.

    However, the same is not true for the path component -- the "filename" portion of the URL.

    Therefore, a request for "http://HTTPD.APACHE.ORG/DoCs/1.3/" will arrive at the server as "http://httpd.apache.org/DoCs/1.3/" -- Note that only the scheme and the authority have been lowercased, and the path remains as-linked (try it with your own domain, and observe the browser address bar). Unless the server uses MultiViews or has rules similar to those above, this request will fail if the "filename" case is incorrect. (Note that if the server OS is Windows it will not fail, because Windows treats filenames in a case-insensitive manner.)

    So it is Googlebot (as a client) that is lowercasing domain names, but it won't do the same for path-names unless it is malfunctioning. However, I've been dealing with search engines for too long to dare predict what they (or other Webmasters) will do tomorrow -- intentionally or otherwise, and take no chances with the URL-structure of my sites. For resiliency, I've used all of the rules posted in the code above, and all have been called into action at one time or another.



     5:42 pm on Jan 8, 2007 (gmt 0)

    >> Unless the server uses MultiViews or has rules similar to those above, this request will fail if the "filename" case is incorrect. <<

    Just having MultiViews enabled will not correct this, at least not on Apache 2. A URI request that has a case mismatch will return a proper 404, even if MultiViews is enabled, and even for the portion of the resource located after the extension. For example, we have a resource on our (case-sensitive, non-Windows) server with the actual filename
    And each of the following requests will return the corresponding status codes ...

    http://example.com/images/picture 200 OK
    http://example.com/images/picture.jpg 200 OK
    http://example.com/images/picture.JPG 404 Not Found
    http://example.com/images/picture. 404 Not Found
    http://example.com/images/Picture 404 Not Found

    Personally, I like the approach that Apache has taken and I take the same. I return a 404 on lettercase mismatches. But then again, all resources are lowercase, standard practice. I imagine there can be good arguments made for lettercase corrections and missed traffic, but that is a whole new topic!


     6:06 pm on Jan 8, 2007 (gmt 0)

    Thanks for the clarification on MultiViews.

    The intent of doing case-conversion and fix-up redirects on incorrect links and typed-in URLs is threefold:

  • Avoid dupe-content issues
  • Recapture PageRank/link-popularity
  • Retain visitors who might be lost by an unexpected 404


    g1smd

     10:53 pm on Jan 8, 2007 (gmt 0)

    >> So are you saying that upper and lower case versions of urls could be considered dupe? <<

    Upper and lower case versions of folder and/or file names are a cause of duplicate issues.


     4:42 am on Jan 9, 2007 (gmt 0)

    Wow... good work jdmorgan!


     6:34 am on Jan 9, 2007 (gmt 0)

    Wow, thanks jdMorgan. It's going to take me a while to digest all the code, but it's probably saved me several days of mod_rewrite frustration!

    You are indeed the mod_rewrite legend!

    Robert Charlton

     7:45 am on Jan 16, 2007 (gmt 0)

    Jim - I'm indebted to you for your ongoing advice over the years, and this one is very special. What a piece of work! Thanks... Bob


     4:33 pm on Jan 19, 2007 (gmt 0)

    I am not so good at interpreting .htaccess files.

    How does the code make sure that it redirects only to a page that really exists?

    I am also working on handling this problem myself.
    My approach is to divide the work between the .htaccess file and a script called as the 404 error handler.

    The script tries out different methods to guess which URL was really wanted. When a method succeeds, it redirects to that page.


     10:40 pm on Jan 20, 2007 (gmt 0)

    Yeah, you da man!



     11:02 pm on Jan 20, 2007 (gmt 0)


    The last rule in the code posted above uses RewriteConds to check for "directory exists" and "file exists".

    As a further example, I just added this code to a site today that someone had linked to without the ".html" extension. This was a link from a popular forum and so was causing lots of 404 errors on the site, as well as lots of confusion on the forum itself. So I added this to the code posted above, just before the query string fixup rules:

    # If no trailing slash or .html, and no "." in URL
    RewriteCond %{ENV:myURI} !(/$¦\.html$¦\.)
    # and if URL does not exist as a directory
    RewriteCond %{DOCUMENT_ROOT}%{ENV:myURI}/ !-d
    # and if page does exist with ".html" appended
    RewriteCond %{DOCUMENT_ROOT}%{ENV:myURI}.html -f
    # then append ".html" to URL
    RewriteRule . - [E=qRed:yes,E=myURI:%{ENV:myURI}.html]

    Note that this code is designed to work with the code posted above, and is not for general use. A general solution would be:

    # If no trailing slash or .html, and no "." in URL
    RewriteCond $1 !(/$¦\.html$¦\.)
    # and if URL with appended slash does not exist as a directory
    RewriteCond %{DOCUMENT_ROOT}/$1/ !-d
    # and if page does exist with ".html" appended
    RewriteCond %{DOCUMENT_ROOT}/$1.html -f
    # then append ".html" to URL
    RewriteRule (.*) http://www.example.com/$1.html [R=301,L]

    However, this last code example can't be used if you want to correct multiple problems with the URL using a single redirect, which was the whole point of this thread.

    Replace all broken pipe "¦" characters in the code above with solid pipe characters before use; Posting on this forum modifies the pipe characters.


    All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
    © Webmaster World 1996-2014 all rights reserved