
Apache Web Server Forum

    
A guide to fixing duplicate content & URL issues on Apache
How to canonicalize all of your URLs with a single redirect
jdMorgan

WebmasterWorld Senior Member jdmorgan us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3208525 posted 8:53 pm on Jan 4, 2007 (gmt 0)

Recently, we've had a lot of discussion about domain and URL canonicalization, mainly centered around avoiding duplicate-content problems in Google. There has also been some discussion of fixing type-in URLs, typos in inbound links, and badly-coded inbound links.

To be clear, a "canonical" domain is the single domain you want your site to be known by, and a canonical URL is the single URL you want your page to be known by. Any others are non-canonical.

The word canonical is a religion-related term, and means "according to canon law, scripture or doctrine." But in general use, it just means "usual, standard, conventional, customary, or according to the rules." So as a Webmaster, you choose what single domain you want to use for your site, and what single URL should be used to request each of your pages.

Member g1smd has posted in several of these threads the very good advice that it's best to avoid "stacked redirects" --multiple redirects invoked by a single client request-- while doing things like index page and domain canonicalization. WebmasterWorld admin tedster reiterated this in this recent thread [webmasterworld.com].

I have coded various routines to do these kinds of fix-ups on an ad-hoc basis, but have never actually written a single-redirect-does-it-all solution. Actually, that's not quite true -- I had *tried* before, but a nasty mod_rewrite bug in Apache 1.3.x [archive.apache.org] had repeatedly stymied my efforts.

However, on returning to the subject after almost a year spent experimenting and dashing off code in the WebmasterWorld Apache forum, I realized that one trick I had figured out along the way is a work-around for the bug.

So I set out anew to create a domain/URL canonicalization and type-in fixup routine that would do the following:

  • Canonicalize the domain (e.g. redirect non-www and IP address to www)
  • Canonicalize my index pages (redirect "/index.html" to "/")
  • Remove multiple slashes in the URL
  • Remove spurious query strings (my sites' pages are mostly 'static' with a few exceptions)
  • Fix-up common typos in type-in URLs
  • Fix-up invalid inbound links caused by bad HTML mark-up
  • Fix-up URLs resulting from bad copy-and-pastes
  • Fix-up outdated or otherwise incorrect query strings
  • Suppress the fix-up redirect if the resulting URL does not resolve to an existing file
  • Suppress the fix-up if the link is on my own site (In this case, I want to see the 404 error)
  • Suppress the fix-up if the remote user is me or a site tester (Again, we want to see the 404 error)
  • Avoid recursion in mod_rewrite running in a per-directory .htaccess context
  • Avoid the nasty mod_rewrite bug in Apache 1.3.x
  • Do all of the above using a single 301-Moved Permanently redirect

    The result is a routine that can "correct" a request from a badly-coded link like:
    <a href="http://example.com/index,hmtl>for more info, click here</a>

    where the closing quote has been omitted on the link, "html" was mis-typed, and a comma was typed where the filetype-separator period should be.

    The result of a click on that link is a request for "http://example.com/index,hmtl%3Efor%20more%20info,%20click%20here%3C/a%3E"

    The code will redirect that to the canonical domain and index page URL "www.example.com/" using a single redirect, correcting the comma and "hmtl", and stripping off the spurious path info along the way. Or it can fix up multiple slashes or periods, or remove trailing punctuation from links improperly embedded in text, or automatically-linked in forum posts, e.g. "For help with this code, see [httpd.apache.org...]
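As a rough illustration, the request above can be traced through the URL-path fix-ups with a small Python model. This is only a sketch of the regex logic: the patterns are transcribed from the .htaccess routine posted further down this thread (with solid pipes restored), while `fix_uri`, `STEPS`, and `SLASHES` are hypothetical names invented for the example. Domain canonicalization, query-string handling, and the referrer/file-exists checks are left out, and case-insensitive matching is applied to every step for simplicity.

```python
import re

# Collapse one run of repeated slashes; the posted code chains this rule
# three times, so the step is listed three times below.
SLASHES = (r'^(.*)//+(.*)$', lambda m: m.group(1) + '/' + m.group(2))

# (pattern, builder) pairs; each builder mirrors the rule's E=myURI:... value.
STEPS = [
    # Replace "hmtl" with "html" (anything after it is dropped)
    (r'^([^.,]+)[.,]+hmtl', lambda m: m.group(1) + '.html'),
    # Replace comma(s) or repeated periods before a known filetype
    (r'^([^,.]+)([,.]{2,}|,)((s?html?|php[1-9]?|pdf|xls).*)$',
     lambda m: m.group(1) + '.' + m.group(3)),
    # Cut the path at the first invalid character
    (r'^([/0-9a-z._\-]*)[^/0-9a-z._\-]', lambda m: m.group(1)),
    # Drop directory paths appended to a filename
    (r'^([^.]+\.[^/]+)/', lambda m: m.group(1)),
    # Remove trailing punctuation
    (r'^(.*)[._\-]+$', lambda m: m.group(1)),
    SLASHES, SLASHES, SLASHES,
    # "<anything>/index.html" becomes "<anything>/"
    (r'^(/([^/]+/)*)index\.html', lambda m: m.group(1)),
]

def fix_uri(uri):
    """Return (fixed_uri, redirect_needed) for a raw request path."""
    redirect = False
    for pattern, build in STEPS:
        m = re.match(pattern, uri, re.IGNORECASE)
        if m:
            uri = build(m)      # like E=myURI:<new value>
            redirect = True     # like E=qRed:yes
    return uri, redirect

print(fix_uri("/index,hmtl%3Efor%20more%20info,%20click%20here%3C/a%3E"))
```

Because every step works on the same variable and only sets a flag, the model (like the routine it imitates) needs just one redirect at the end, however many fix-ups fired along the way.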

    This is done with a 301-Moved Permanently redirect, so that search engines are notified not to list or use the incorrect URL, but to replace it with the corrected/canonicalized one.

    This code is intended for the most common Apache hosting set-ups: shared virtual hosting on Apache 1.3.x, with configuration options limited to .htaccess files only.

    Update: After extensive testing on a server kindly provided by coopster, I've discovered that Apache 2.0.52 has the same bug as Apache 1.3.x. Although the original bug report was closed with a statement that this bug was fixed in Apache 2.0.30, it was apparently not fixed completely. Therefore, the solution presented here applies to Apache 2.0 as well.

    This routine is the right solution for my sites, which follow *my* strict URL conventions, but likely not for yours. Modification will almost certainly be required. Like almost all mod_rewrite code, this is not a simple cut-and-paste or find-and-replace proposition.

    It does fix-ups on only the most common URL errors I have seen in my logs, but of course there are many others; the code is not meant to exhaustively cover all possible errors, just the most common ones on my sites.

    As such, I put it forth as an example, to be examined and perhaps modified by Webmasters who are conversant and comfortable with mod_rewrite and regular expressions. Again, this code is not an entry-level exercise, and the most likely result of trying to modify it without thoroughly understanding and testing it is a disaster -- the best of which would be an immediate server crash, and the worst of which might be to thoroughly trash your search engine rankings.

    Webmasters without extensive mod_rewrite experience might be better off ignoring this post, or spending a lot of time studying the documents cited in our forum charter [webmasterworld.com] and the tutorials in the Apache forum section of the WebmasterWorld library [webmasterworld.com], until the code's operation becomes clear.

    I cannot offer installation and modification "support" for this code; it's rather complex, and attempting to do so would very likely exceed the limited time I have to post here. However, I would like to discuss improvements to efficiency --given the stated design goals-- or other common inbound URL errors you've seen for which a fix-up might be added.

    The warranty expired yesterday, so use at your own risk. :)

    Be sure to replace all broken pipe "¦" characters in the code below with solid pipes before use; posting on this forum modifies the pipe characters.

    [edited by: jdMorgan at 8:40 pm (utc) on Jan. 6, 2007]


    jdMorgan




     
    Msg#: 3208525 posted 8:53 pm on Jan 4, 2007 (gmt 0)

    So, all that said, here's what I came up with:

    # .htaccess
    #
    # Specify IP address(es) used by Webmaster, admins, & testers. These may access
    # the server by its unique IP address without being redirected to the domain.
    # Also, URLs are *not* corrected for access by this group, in order to prevent
    # this code from "hiding" problems during development.
    # (Note that these addresses are those of your workstations, not your server.)
    SetEnvIf Remote_Addr ^192\.168\.1\. TestIP=true
    SetEnvIf Remote_Addr ^10\.10\.45\.3$ TestIP=true
    SetEnvIf Remote_Addr ^127\.0\.0\.[1-7]$ TestIP=true
    #
    #
    # Setup: Enable mod_rewrite, disable MultiViews
    Options +FollowSymLinks -MultiViews
    RewriteEngine on
    #
    # Redirect non-problematic URLs
    # Note: The fix-up code below is complex, and is intended for use to fix only
    # generally-specified problematic URL requests. For administrative redirection
    # of specific non-problematic URLs, 'normal' redirects should be placed here.
    #
    RewriteRule ^old_page\.html$ http://www.example.com/new_page.html [R=301,L]
    RewriteRule ^old_page2\.htm$ http://www.example.com/new_page2.htm [R=301,L]
    #
    #
    # URL FIXUP REDIRECT ROUTINE
    #
    # This code corrects various problems with URLs, presumably due to typos in
    # links from other sites. It is complicated by measures taken to avoid a
    # mod_rewrite bug in Apache 1.3. ( See http://archive.apache.org/gnats/7879 )
    # This code uses a single external redirect to correct all detected problems.
    #
    # Skip next two rules if lowercasing in progress
    # (Remove this rule if case-conversion plug-in below is removed)
    RewriteCond %{ENV:qLow} ^yes$ [NC]
    RewriteRule . - [S=2]
    #
    # Prevent recursion and over-writing of myURI and myQS
    RewriteCond %{ENV:qRed} ^yes$ [NC]
    RewriteRule .? - [L]
    #
    # Get the client-requested full URI and full query string
    RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ (/[^?]*)(\?[^\ ]*)?\ HTTP/
    RewriteRule .? - [E=myURI:%1,E=myQS:%2]
    #
    #
    ###############################################
    # Uppercase to lowercase conversion plug-in
    # (This section, along with the first noted rule
    # above, may be removed if not needed or wanted)
    #
    # Skip next 28 rules if no uppercase letters in URL
    RewriteCond %{ENV:myURI} ![A-Z]
    RewriteRule .? - [S=28]
    #
    # Else swap them out, one at a time
    RewriteCond %{ENV:myURI} ^([^A]*)A(.*)$
    RewriteRule . - [E=myURI:%1a%2]
    RewriteCond %{ENV:myURI} ^([^B]*)B(.*)$
    RewriteRule . - [E=myURI:%1b%2]
    RewriteCond %{ENV:myURI} ^([^C]*)C(.*)$
    RewriteRule . - [E=myURI:%1c%2]
    RewriteCond %{ENV:myURI} ^([^D]*)D(.*)$
    RewriteRule . - [E=myURI:%1d%2]
    RewriteCond %{ENV:myURI} ^([^E]*)E(.*)$
    RewriteRule . - [E=myURI:%1e%2]
    RewriteCond %{ENV:myURI} ^([^F]*)F(.*)$
    RewriteRule . - [E=myURI:%1f%2]
    RewriteCond %{ENV:myURI} ^([^G]*)G(.*)$
    RewriteRule . - [E=myURI:%1g%2]
    RewriteCond %{ENV:myURI} ^([^H]*)H(.*)$
    RewriteRule . - [E=myURI:%1h%2]
    RewriteCond %{ENV:myURI} ^([^I]*)I(.*)$
    RewriteRule . - [E=myURI:%1i%2]
    RewriteCond %{ENV:myURI} ^([^J]*)J(.*)$
    RewriteRule . - [E=myURI:%1j%2]
    RewriteCond %{ENV:myURI} ^([^K]*)K(.*)$
    RewriteRule . - [E=myURI:%1k%2]
    RewriteCond %{ENV:myURI} ^([^L]*)L(.*)$
    RewriteRule . - [E=myURI:%1l%2]
    RewriteCond %{ENV:myURI} ^([^M]*)M(.*)$
    RewriteRule . - [E=myURI:%1m%2]
    RewriteCond %{ENV:myURI} ^([^N]*)N(.*)$
    RewriteRule . - [E=myURI:%1n%2]
    RewriteCond %{ENV:myURI} ^([^O]*)O(.*)$
    RewriteRule . - [E=myURI:%1o%2]
    RewriteCond %{ENV:myURI} ^([^P]*)P(.*)$
    RewriteRule . - [E=myURI:%1p%2]
    RewriteCond %{ENV:myURI} ^([^Q]*)Q(.*)$
    RewriteRule . - [E=myURI:%1q%2]
    RewriteCond %{ENV:myURI} ^([^R]*)R(.*)$
    RewriteRule . - [E=myURI:%1r%2]
    RewriteCond %{ENV:myURI} ^([^S]*)S(.*)$
    RewriteRule . - [E=myURI:%1s%2]
    RewriteCond %{ENV:myURI} ^([^T]*)T(.*)$
    RewriteRule . - [E=myURI:%1t%2]
    RewriteCond %{ENV:myURI} ^([^U]*)U(.*)$
    RewriteRule . - [E=myURI:%1u%2]
    RewriteCond %{ENV:myURI} ^([^V]*)V(.*)$
    RewriteRule . - [E=myURI:%1v%2]
    RewriteCond %{ENV:myURI} ^([^W]*)W(.*)$
    RewriteRule . - [E=myURI:%1w%2]
    RewriteCond %{ENV:myURI} ^([^X]*)X(.*)$
    RewriteRule . - [E=myURI:%1x%2]
    RewriteCond %{ENV:myURI} ^([^Y]*)Y(.*)$
    RewriteRule . - [E=myURI:%1y%2]
    RewriteCond %{ENV:myURI} ^([^Z]*)Z(.*)$
    RewriteRule . - [E=myURI:%1z%2]
    #
    # Set lowercasing-in-progress flag
    RewriteRule . - [E=qLow:yes]
    #
    # If any uppercase characters remain, re-start
    # mod_rewrite processing from the beginning
    RewriteCond %{ENV:myURI} [A-Z]
    RewriteRule . - [N]
    #
    # If any characters were lowercased, set redirect required
    # flag and reset lowercasing-in-progress flag
    # (S=28 from above lands here)
    RewriteCond %{ENV:qLow} ^yes$ [NC]
    RewriteRule . - [E=qRed:yes,E=qLow:done]
    #
    # End Uppercase to lowercase conversion plug-in
    ###############################################
    #
    # Fix non-canonical domain requests (except for valid
    # subdomains & stats accessed by unique server IP/address)
    RewriteCond %{HTTP_HOST} !^(www¦dev¦test)\.example\.com(:80)?$
    RewriteCond %{HTTP_HOST}<>%{ENV:TestIP} !^192\.168\.0\.101(:80)?<>true$ [NC]
    RewriteCond %{HTTP_HOST}<>%{REQUEST_URI} !^192\.168\.0\.101(:80)?<>/stats/
    RewriteRule .? - [E=qRed:yes]
    #
    # Replace "hmtl" with "html"
    RewriteCond %{ENV:myURI} ^([^.,]+)[.,]+hmtl [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1.html]
    #
    # Replace comma(s) or multiple filetype delimiter periods in page filepaths
    # with a single period (e.g. "/page,html" or "/page..html")
    RewriteCond %{ENV:myURI} ^([^,.]+)([,.]{2,}¦,)((s?html?¦php[1-9]?¦pdf¦xls).*)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1.%3]
    #
    # Remove invalid trailing characters
    RewriteCond %{ENV:myURI} ^([/0-9a-z._\-]*)[^/0-9a-z._\-] [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1]
    #
    # Fix additional directory paths appended to filenames (/logo.jpg/<directory_path>)
    RewriteCond %{ENV:myURI} ^([^.]+\.[^/]+)/
    RewriteRule . - [E=qRed:yes,E=myURI:%1]
    #
    # Remove trailing punctuation
    RewriteCond %{ENV:myURI} ^(.*)[._\-]+$
    RewriteRule . - [E=qRed:yes,E=myURI:%1]
    #
    # Remove multiple contiguous slashes in URL (up to three instances)
    RewriteCond %{ENV:myURI} ^(.*)//+(.*)$
    RewriteRule . - [E=qRed:yes,E=myURI:%1/%2,C]
    RewriteCond %{ENV:myURI} ^(.*)//+(.*)$
    RewriteRule . - [E=myURI:%1/%2,C]
    RewriteCond %{ENV:myURI} ^(.*)//+(.*)$
    RewriteRule . - [E=myURI:%1/%2]
    #
    # Redirect direct client requests for "<anything>/index.html" to "<anything>/"
    RewriteCond %{ENV:myURI} ^(/([^/]+/)*)index\.html [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1]
    #
    # Redirect specific replaced/relocated pages to specific new pages
    # (Note: This is 'doing it the hard way,' and only URLs that have
    # been requested with typos/type-ins or other problems should be
    # included here. A straight 301 redirect rule located above all of
    # the code shown here can be used to redirect non-problematic URLs)
    RewriteCond %{ENV:myURI}<>/locales.html ^/location\.html<>(.+)$ [NC,OR]
    RewriteCond %{ENV:myURI}<>/about/widgets-intl.html ^/about/local-widgets\.html<>(.+)$ [NC,OR]
    RewriteCond %{ENV:myURI}<>/selector/widget-selector.html ^/selector/widgets[^.]+\.xls<>(.+)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1]
    #
    # Redirect all pages in old directories to same-named pages in new directories
    RewriteCond /new_dir1<>%{ENV:myURI} ^([^<]+)<>/old_dir1(.+)$ [NC,OR]
    RewriteCond /new_dir2<>%{ENV:myURI} ^([^<]+)<>/old_dir2(.+)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1%2]
    #
    # Redirect old filetype to new filetype
    RewriteCond %{ENV:myURI}<>.jpg ^(/[^.]+)\.jpeg<>(.+)$ [NC,OR]
    RewriteCond %{ENV:myURI}<>.php5 ^(/[^.]+)\.php4<>(.+)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1%2]
    #
    # Correct bad query string on products page link
    # (myQS still carries its leading "?", so the pattern must allow for it)
    RewriteCond %{ENV:myURI} ^/products\.php$
    RewriteCond %{ENV:myQS} ^\?(([^&]+&)*)product=w1234(&.+)?$
    RewriteRule . - [E=qRed:yes,E=myQS:?%1product=w01234%3,S=2]
    #
    # Remove blank query strings from all URLs
    RewriteCond %{ENV:myQS} ^\?$
    RewriteRule .? - [E=qRed:yes,S=1]
    #
    # Remove spurious query strings from non-dynamic pages
    RewriteCond %{ENV:myQS} ^\?
    RewriteCond %{ENV:myURI} !^/(locales¦test)\.html$
    RewriteCond %{ENV:myURI} !^/(cats¦products)\.php$
    RewriteCond %{ENV:myURI} !^/cgi-bin/
    RewriteRule .? - [E=qRed:yes,E=myQS:?]
    #
    # Do the external 301 redirect only if the referrer is
    # not our own site, the resource exists at the corrected URL,
    # and the requesting IP is not that of our site tester.
    # (Note: Some of these conditions have been commented-out for code testing.
    # Once the code has been tested thoroughly, be sure to un-comment these lines.)
    RewriteCond %{ENV:qRed} ^yes$ [NC]
    #RewriteCond %{ENV:TestIP} !^true$ [NC]
    RewriteCond %{HTTP_REFERER} !^http://((www¦dev¦test)\.)?example\.(org¦com)
    RewriteCond %{HTTP_REFERER} !^http://192\.168\.0\.101(:80)?/?
    #RewriteCond %{DOCUMENT_ROOT}%{ENV:myURI} -f [OR]
    #RewriteCond %{DOCUMENT_ROOT}%{ENV:myURI} -d
    RewriteRule .? http://www.example.com%{ENV:myURI}%{ENV:myQS} [R=301,L]
    #
    # ##### End URL fixup redirect routine #####

    Notes:

    The "<>" characters used in several RewriteConds above have no special meaning to mod_rewrite and are not regular-expressions operators. They are merely a unique character string that I use to enable unambiguous matching of combined server variable values on a single line by clearly delineating one value from the other.
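The delimiter trick can be tried outside Apache with a few lines of Python. This is a sketch under assumptions: `map_relocated` is a hypothetical helper, and it models only two of the posted RewriteConds (one specific-page mapping and one directory move), using the same patterns.

```python
import re

def map_relocated(my_uri):
    """Model two of the posted '<>' RewriteConds; return the new URI or None."""
    # Specific page: the test string is myURI + "<>" + replacement, so the
    # replacement comes back out of the match as back-reference %1.
    m = re.match(r'^/location\.html<>(.+)$',
                 my_uri + '<>/locales.html', re.IGNORECASE)
    if m:
        return m.group(1)
    # Directory move: replacement first, then "<>", then myURI. %1 is the
    # new directory and %2 is the rest of the original path.
    m = re.match(r'^([^<]+)<>/old_dir1(.+)$',
                 '/new_dir1<>' + my_uri, re.IGNORECASE)
    if m:
        return m.group(1) + m.group(2)
    return None

print(map_relocated("/location.html"))        # /locales.html
print(map_relocated("/old_dir1/page.html"))   # /new_dir1/page.html
```

Putting the replacement text into the test string is what lets several [OR]'ed conditions share one RewriteRule: whichever condition matches, the new value arrives in the same back-references.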

    The rules are in a specific order; some of the later rules depend upon the actions of previous rules.

    Some of the rules have exclusions implemented using RewriteConds. You may not need them at all, or you will very likely need to modify them to suit your site.

    The order of the RewriteConds is intentional. In some cases, the given order is required so that back-references will function correctly, and in other cases they are ordered based on performance considerations. For example, it is good to avoid directory-exists and file-exists checks if possible, since they take a lot of time and CPU resources. So these are deferred until all other conditions are met.

    A very simple but effective way to test this code is to create a page of non-canonical, mis-typed, and malformed links to your site, and then click those links using a Mozilla or Firefox browser with the "Live HTTP Headers" extension enabled. The server response can then be examined in detail to be sure it's working as expected.
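For those who prefer a script to a browser extension, the same check can be sketched in Python. This is only an illustration, not part of the tested routine: `probe` is a hypothetical helper built on the standard `http.client` module, which does not follow redirects, so the raw status code and Location header of the first response can be inspected directly.

```python
import http.client

def probe(host, path, port=80):
    """Send one GET without following redirects.

    Returns (status_code, Location header or None).
    """
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection can close cleanly
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

# Example use against your own site (the host name is a placeholder):
#   for path in ["/index,hmtl", "/page..html", "//dir//index.html"]:
#       print(path, probe("www.example.com", path))
```

A correctly working fix-up should answer each malformed path with a single 301 whose Location header already holds the final canonical URL; a second redirect on requesting that URL would indicate a stacked redirect.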

    Remember that the code was intentionally designed to *not* correct requests referred from your own site or to correct links when clicked-on by you or testers within your organization, as listed in the exclusion section at the top. It will make your life easier if you leave the RewriteConds in the 301-redirect rule commented-out until you have adapted this code to your site and have thoroughly tested it. Then un-comment those RewriteConds and re-test from a machine that is not part of your development and test network.

    A brief explanation of the techniques used here and the point of this exercise: Apache has a nasty mod_rewrite bug that prevents multiple internal rewrites from working properly, except for a few cases where subdirectories are not present in the URL-path. If an attempt is made to rewrite /dir/a.html to /dir/b.html, and then to rewrite /dir/b.html to /dir/c.html in a second RewriteRule, the resulting URL will be /dir/c.html/c.html. The more sequential rewrites are done, the more times the filepath will be added to the end of the URL. And of course, if you add yet another RewriteRule to try to remove it, you still end up with two repeats of the filepath!

    As was stated at the outset, it is best to avoid 'stacked' redirects, both to avoid confusing search engine robots, and to facilitate the efficient passing of PageRank/link-popularity through to the redirect target URL.

    Both of the above problems are addressed in this code through the use of environment variables: "qRed" to flag a queued external redirect, "myURI" to hold the URL as it is tested and modified, and "myQS" to hold the query string as it is tested and modified. By using the latter two variables to completely bypass Apache's normal URI-handling variables, the Apache mod_rewrite bug is avoided.
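How "myURI" and "myQS" get their starting values can be modelled in Python with the same pattern the routine applies to THE_REQUEST. This is a sketch only; `split_request` and `REQUEST_LINE` are hypothetical names, and an unmatched optional group is returned as an empty string, which is how an unset back-reference behaves in the rule's E= flag.

```python
import re

# Same pattern as the RewriteCond on %{THE_REQUEST}, minus Apache's
# backslash-escaped spaces.
REQUEST_LINE = re.compile(r'^[A-Z]{3,9} (/[^?]*)(\?[^ ]*)? HTTP/')

def split_request(the_request):
    """Return (myURI, myQS) from a raw request line, or None if no match."""
    m = REQUEST_LINE.match(the_request)
    if not m:
        return None
    # myQS keeps its leading "?" when a query string is present.
    return m.group(1), m.group(2) or ''

print(split_request("GET /products.php?product=w1234 HTTP/1.1"))
print(split_request("GET /index.html HTTP/1.1"))
```

Note that the captured query string keeps its leading "?", which is why the later rules test myQS against patterns beginning with `^\?`.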

    Unfortunately, this makes the code at least twice as long as it would be without the bug, but I haven't found a better way to work around it.

    I have provided a "plug-in" for doing uppercase-to-lowercase conversion in .htaccess. I call it a "plug-in" because I structured it so that it can be easily added or removed with minimum impact on the other code. I *do not* suggest including or using this case-conversion code unless it is absolutely necessary to correct a pre-existing or emerging problem; it is potentially a very slow, high-CPU-load routine because it will invoke a restart of all mod_rewrite processing if more than one instance of any given capital letter appears in the requested URL. As such, you should take all steps possible to avoid depending on it for any purpose other than to correct inbound links from other sites which are non-responsive to requests for link correction and are completely out of your control. It should certainly not be used to "allow" you to use mixed-case URLs on your own pages; the result is almost certain to be an overloaded server if your site is even moderately popular.
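To get a feel for that cost, here is a rough Python model of the plug-in's loop. It is a sketch under assumptions: `lowercase_like_plugin` is a hypothetical name, and the model counts only passes through the 26 letter rules, ignoring everything else Apache does on each restart. Every pass after the first corresponds to one [N] restart.

```python
import re
import string

def lowercase_like_plugin(uri):
    """Lowercase a path the way the plug-in does; return (uri, passes)."""
    passes = 0
    while re.search(r'[A-Z]', uri):
        passes += 1
        # One pass: each letter rule replaces only the FIRST occurrence
        # of its letter, exactly like ^([^A]*)A(.*)$ -> %1a%2.
        for letter in string.ascii_uppercase:
            m = re.match(r'^([^%s]*)%s(.*)$' % (letter, letter), uri)
            if m:
                uri = m.group(1) + letter.lower() + m.group(2)
    return uri, passes

print(lowercase_like_plugin("/Index.html"))   # one pass, no restart
print(lowercase_like_plugin("/AbAcA.html"))   # three passes, two restarts
```

In this model the number of passes equals the highest count of any single capital letter in the path, so a URL typed entirely in capitals with many repeated letters is the worst case.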

    I've tried to be specific in the description of the individual routines. Some of them may not be useable on your site. For example, the "Fix additional directory paths appended to filenames" routine cannot be used as-is on sites which have periods in directory paths. It would have to be re-coded or removed for use on such a site.

    This code came off a live server, and has therefore been fairly-thoroughly tested. However, I have had to modify it for posting here, both to protect my own privacy and to comply with the WebmasterWorld Terms of Service. It is therefore possible that I have introduced an error or two. If so, please accept my apologies.

    If you use this code, I'd appreciate it if you'd attribute it to me. Thanks.

    Important: As noted above, replace all broken pipe "¦" characters in the code above with solid pipes before use; posting on this forum modifies the pipe characters.

    I hope this is useful. Happy coding!

    Jim

    [added] Incorporated the improvement from my post below, modified the comments to include Apache 2.0, and fixed one error in a "skip count." [/added]

    [edited by: jdMorgan at 9:11 pm (utc) on Jan. 6, 2007]

    Leosghost

    WebmasterWorld Senior Member leosghost us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 3208525 posted 9:34 pm on Jan 4, 2007 (gmt 0)

    Pure diamond posting Jim ..let me be the first to thank you for taking the time to help all members..

    sincerely ..M

    jimbeetle

    WebmasterWorld Senior Member jimbeetle us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 3208525 posted 9:51 pm on Jan 4, 2007 (gmt 0)

    Wow! Flagged.

    Now for about a month for me to digest it all.

    Thanks, Jim.

    jexx

    10+ Year Member



     
    Msg#: 3208525 posted 10:01 pm on Jan 4, 2007 (gmt 0)

    excellent post .. one question with regard to the lowercasing.. what is the reason behind this extensive 'plug-in'? would it not be possible to simply treat all URLs as [NC]. as unix systems are case-sensitive, i do see a reason but i was just wondering if this is why or? would case-sensitive urls be considered non-canonical from a search engine perspective if they indeed resolved correctly (assuming [nc] processing)?

    jdMorgan




     
    Msg#: 3208525 posted 10:46 pm on Jan 4, 2007 (gmt 0)

    The problem is that you don't know what URLs will be linked-to by other sites or typed-in by people to reach your site. If someone types in "Example.com/Index.html" to get to your site, what will happen?

    On most sites using all-lowercase filenames, you'll get a 404-Not Found error. If that's the result of a link from another site, then any PageRank or link-popularity ascribed to the link by search engines will be lost. If it's the result of a bad link or of a person typing in a URL incorrectly, then it's an unpleasant surprise for the user, and becomes a usability issue.

    With the lowercasing code in place, all such requests will be lowercased and redirected to the URL-path which actually resolves to a file -- /index.html. This avoids the 404-Not Found error for improved usability, and passes the PageRank/Link-pop from even incorrectly-typed links to the correct page.

    The reason that I made it a "plug-in" is that it is big, slow, and may have to "call itself" several times, making it even slower. I suggest including it only if needed to solve an existing problem with other sites incorrectly linking to your site with uppercase letters in the URL, or if many people type-in URLs to get to your site with uppercase letters.

    If you have access to server config files, this is easy and fast to fix: you just define a RewriteMap that calls Apache's internal "tolower" function (int:tolower), and do:

    RewriteRule (.*[A-Z].*) http://www.example.com/${myToLowerMap:$1} [R=301,L]

    But that option isn't available in .htaccess, unless your host has pre-defined the tolower RewriteMap for you, and told you what its map name is.

    Jim

    AndyA

    5+ Year Member



     
    Msg#: 3208525 posted 12:37 am on Jan 5, 2007 (gmt 0)

    Jim,

    That is just amazing! What an amazing piece of work that is - I can't tell you how much I appreciate the time you took to do this.

    You have no doubt saved many people from the pain and complexity of Google duplicate content issues.

    Thank you so much for your efforts!

    sandyk20



     
    Msg#: 3208525 posted 10:01 am on Jan 5, 2007 (gmt 0)

    Excellent topic flagged

    bts111

    10+ Year Member



     
    Msg#: 3208525 posted 11:58 am on Jan 5, 2007 (gmt 0)

    Beautiful!

    ddwebguru

    5+ Year Member



     
    Msg#: 3208525 posted 1:20 pm on Jan 5, 2007 (gmt 0)

    thanks JD i will try to use it........

    codemeit

    5+ Year Member



     
    Msg#: 3208525 posted 2:29 pm on Jan 5, 2007 (gmt 0)

    Hao, flag it.

    BillyS

    WebmasterWorld Senior Member billys us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 3208525 posted 5:20 pm on Jan 5, 2007 (gmt 0)

    Talk about a guy that can pull it all together! Thanks Jim!

    SteveJohnston

    10+ Year Member



     
    Msg#: 3208525 posted 5:46 pm on Jan 5, 2007 (gmt 0)

    Stunning.

    Thanks Jim.

    Brett_Tabke

    WebmasterWorld Administrator brett_tabke us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 3208525 posted 6:51 pm on Jan 5, 2007 (gmt 0)

    Again - awesome apache programming JD!

    <note to commenters: any google specific messages or comments should go in the Google forum...>

    jdMorgan




     
    Msg#: 3208525 posted 7:05 pm on Jan 5, 2007 (gmt 0)

    Thanks all, and "hsieh hsieh ni" to codemeit

    I found an optimization to shorten this up. Replace these two rules:

    # Replace comma(s) in page filepaths with period (e.g. "/page,html")
    RewriteCond %{ENV:myURI} ^([^,]+),+\.*((s?html?¦php[1-9]?¦pdf¦xls).*)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1.%2]
    #
    # Remove multiple filetype delimiter periods (e.g. "/page..html")
    RewriteCond %{ENV:myURI} ^([^.]+)\.{2,}((s?html?¦php[1-9]?¦pdf¦xls).*)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1.%2]

    with this one that does both:

    # Replace comma(s) or multiple filetype delimiter periods in page filepaths
    # with a single period (e.g. "/page,html" or "/page..html")
    RewriteCond %{ENV:myURI} ^([^,.]+)([,.]{2,}¦,)((s?html?¦php[1-9]?¦pdf¦xls).*)$ [NC]
    RewriteRule . - [E=qRed:yes,E=myURI:%1.%3]

    Anything that can be done to make this pile of code smaller and easier to modify/maintain...

    Replace all broken pipe "¦" characters in the code above with solid pipes before use; posting on this forum modifies the pipe characters.

    Jim

    g1smd

    WebmasterWorld Senior Member g1smd us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 3208525 posted 7:17 pm on Jan 6, 2007 (gmt 0)

    Awesome.

    I have only ever used about half of that sort of code before.

    jdMorgan




     
    Msg#: 3208525 posted 9:07 pm on Jan 6, 2007 (gmt 0)

    Well, your post in one of those Google forum threads was one of my incentives to finish this, so Thanks!

    Note to all: I have updated the initial post with several changes.

    It turns out that the same Apache mod_rewrite bug that requires this code to be so complex (compared to single rewrites and redirects) on Apache 1.3.x also exists on Apache 2.0.x. I have changed the comments to reflect this.

    I also incorporated a few code changes: One already mentioned above, another to support correction of multiple instances of multiple contiguous slashes in one URL, and another to fix an error in an [S=] flag skip count that was introduced when I 'sanitized' this code from my server for posting here.

    Jim

    LunaC

    5+ Year Member



     
    Msg#: 3208525 posted 4:38 pm on Jan 7, 2007 (gmt 0)

    Wow, I think I'm speechless!

    Thanks! Flagged... now to read it through a bunch of times so I understand it. :)

    CainIV

    WebmasterWorld Senior Member 10+ Year Member



     
    Msg#: 3208525 posted 9:30 pm on Jan 7, 2007 (gmt 0)

    Flagged, exceptional work Jim, thanks for so much help in this regard.

    Bewenched

    WebmasterWorld Senior Member 5+ Year Member



     
    Msg#: 3208525 posted 1:47 pm on Jan 8, 2007 (gmt 0)

    jdMorgan,
    So are you saying that upper and lower case versions of urls could be considered dupe?

    I can't believe that google would be that picky since they convert all the domain names they display to lowercase anyway. I would think that their algo would compensate for that.

    [edited by: Bewenched at 1:53 pm (utc) on Jan. 8, 2007]

    jdMorgan




     
    Msg#: 3208525 posted 3:32 pm on Jan 8, 2007 (gmt 0)

    RFC 2396 [faqs.org] is the controlling document for these issues.

    In practice, the scheme ("http") and authority (server name or "domain") strings are always sent as lowercase-only by properly-implemented clients, but treated in a case-insensitive manner by servers and all intervening network components.

    However, the same is not true for the path component -- the "filename" portion of the URL.

    Therefore, a request for "http://HTTPD.APACHE.ORG/DoCs/1.3/" will arrive at the server as "http://httpd.apache.org/DoCs/1.3/" -- Note that only the scheme and the authority have been lowercased, and the path remains as-linked (try it with your own domain, and observe the browser address bar). Unless the server uses MultiViews or has rules similar to those above, this request will fail if the "filename" case is incorrect. (Note that if the server OS is Windows it will not fail, because Windows treats filenames in a case-insensitive manner.)
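The client-side behaviour described here can be mimicked with Python's standard URL parser. This is a sketch, not a statement about any particular browser or robot: `normalize_like_client` is a hypothetical helper that lowercases only the scheme and authority and passes the path through untouched. For simplicity it lowercases the whole authority, which would also affect any user-info part.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_like_client(url):
    """Lowercase scheme and authority only; leave the path as linked."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,        # case preserved: the server must deal with it
        parts.query,
        parts.fragment,
    ))

print(normalize_like_client("http://HTTPD.APACHE.ORG/DoCs/1.3/"))
```

The output keeps "/DoCs/1.3/" exactly as written, which is the part of the URL that the lowercasing rules in the code above have to handle on the server.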

    So it is Googlebot (as a client) that is lowercasing domain names, but it won't do the same for path-names unless it is malfunctioning. However, I've been dealing with search engines for too long to dare predict what they (or other Webmasters) will do tomorrow -- intentionally or otherwise, and take no chances with the URL-structure of my sites. For resiliency, I've used all of the rules posted in the code above, and all have been called into action at one time or another.

    Jim

    coopster

    WebmasterWorld Administrator coopster us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 3208525 posted 5:42 pm on Jan 8, 2007 (gmt 0)

    >> Unless the server uses MultiViews or has rules similar to those above, this request will fail if the "filename" case is incorrect. <<

    Just having MultiViews enabled will not correct this, at least not on Apache 2. A request whose URI has a case mismatch returns a proper 404 even with MultiViews enabled, and even when the mismatch is only in the file extension. For example, we have a resource on our (case-sensitive, non-Windows) server with the actual filename
    <DOCUMENT_ROOT>/images/picture.jpg
    and each of the following requests returns the corresponding status code:

    http://example.com/images/picture        200 OK
    http://example.com/images/picture.jpg    200 OK
    http://example.com/images/picture.JPG    404 Not Found
    http://example.com/images/picture.       404 Not Found
    http://example.com/images/Picture        404 Not Found

    Personally, I like the approach that Apache has taken and I take the same: I return a 404 on lettercase mismatches. But then again, all of my resources are lowercase, as is standard practice. I imagine there are good arguments to be made for lettercase corrections and recovering missed traffic, but that is a whole new topic!

    jdMorgan

    Msg#: 3208525 posted 6:06 pm on Jan 8, 2007 (gmt 0)

    Thanks for the clarification on MultiViews.

    The intent of doing case-conversion and fix-up redirects on incorrect links and typed-in URLs is threefold:

  • Avoid dupe-content issues
  • Recapture PageRank/link-popularity
  • Retain visitors who might be lost by an unexpected 404

    Jim

    g1smd

    Msg#: 3208525 posted 10:53 pm on Jan 8, 2007 (gmt 0)

    >> So are you saying that upper and lower case versions of urls could be considered dupe? <<

    Yes. Upper and lower case versions of folder and/or file names that return the same content are a cause of duplicate-content issues.

    jbgilbert

    Msg#: 3208525 posted 4:42 am on Jan 9, 2007 (gmt 0)

    Wow... good work jdmorgan!

    zulu_dude

    Msg#: 3208525 posted 6:34 am on Jan 9, 2007 (gmt 0)

    Wow, thanks jdMorgan. It's going to take me a while to digest all the code, but it's probably saved me several days of mod_rewrite frustration!

    You are indeed the mod_rewrite legend!

    Robert Charlton

    Msg#: 3208525 posted 7:45 am on Jan 16, 2007 (gmt 0)

    Jim - I'm indebted to you for your ongoing advice over the years, and this one is very special. What a piece of work! Thanks... Bob

    jetteroheller

    Msg#: 3208525 posted 4:33 pm on Jan 19, 2007 (gmt 0)

    I am not so good at interpreting .htaccess files.

    How does it make sure that it redirects only to a page that really exists?

    I have also been doing more work on handling this problem.
    My approach is to divide the work between the .htaccess file and a script called as the 404 error handler.

    The script tries out different methods to guess which URL was really wanted. When a method succeeds, it redirects to that page.
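
    The .htaccess half of that division of work might look like the sketch below. The script path /cgi-bin/guess-url.cgi is a made-up name for illustration, and the script itself (whatever language it is written in) is not shown.

    ```apache
    # Hand every "not found" request to a local handler script (hypothetical path).
    # Use a local URL-path here, not a full http:// URL: with a full URL Apache
    # sends the client a redirect and the original 404 status is lost.
    ErrorDocument 404 /cgi-bin/guess-url.cgi
    ```

    Apache passes the originally requested path to the handler in the REDIRECT_URL environment variable, so the script can test its guesses against that value, send a 301 with a Location header when one of them matches a real page, and otherwise return a normal 404.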

    jd01

    Msg#: 3208525 posted 10:40 pm on Jan 20, 2007 (gmt 0)

    Yeah, you da man!

    Thanks,
    Justin

    jdMorgan

    Msg#: 3208525 posted 11:02 pm on Jan 20, 2007 (gmt 0)

    jetteroheller,

    The last rule in the code posted above uses RewriteConds to check for "directory exists" and "file exists".

    As a further example, I just added this code to a site today that someone had linked to without the ".html" extension. This was a link from a popular forum and so was causing lots of 404 errors on the site, as well as lots of confusion on the forum itself. So I added this to the code posted above, just before the query string fixup rules:

    # If no trailing slash or .html, and no "." in URL
    RewriteCond %{ENV:myURI} !(/$|\.html$|\.)
    # and if URL does not exist as a directory
    RewriteCond %{DOCUMENT_ROOT}%{ENV:myURI}/ !-d
    # and if page does exist with ".html" appended
    RewriteCond %{DOCUMENT_ROOT}%{ENV:myURI}.html -f
    # then append ".html" to URL
    RewriteRule . - [E=qRed:yes,E=myURI:%{ENV:myURI}.html]

    Note that this code is designed to work with the code posted above, and is not for general use. A general solution would be:

    # If no trailing slash or .html, and no "." in URL
    RewriteCond $1 !(/$|\.html$|\.)
    # and if URL with appended slash does not exist as a directory
    RewriteCond %{DOCUMENT_ROOT}/$1/ !-d
    # and if page does exist with ".html" appended
    RewriteCond %{DOCUMENT_ROOT}/$1.html -f
    # then append ".html" to URL
    RewriteRule (.*) http://www.example.com/$1.html [R=301,L]

    However, this last code example can't be used if you want to correct multiple problems with the URL using a single redirect, which was the whole point of this thread.

    Jim
