Apache Web Server Forum

    
Resolving Canonical URLs with .htaccess
Preventing duplicate content penalties
codeman

Msg#: 3692704 posted 10:33 pm on Jul 7, 2008 (gmt 0)

Hi,

I have a large e-commerce site built on the PHP CodeIgniter framework. I have the following in my current .htaccess file:


RewriteEngine On
RewriteBase /

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php/$1 [L]

#several lines of standard 301 redirects

RewriteEngine On
RewriteCond %{HTTP_HOST} !^www.example.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R,L]

There are some more issues I would like to resolve with .htaccess. Unfortunately, some of it becomes tricky because CodeIgniter has its own set of routing rules, and mod_rewrite has some quirks of its own. To start with:

1.) If you exclude "www." from the request, it redirects to the domain followed by index.php, which I do not want. For example:

- If you request example.com/category/gifts

You get redirected to: http://www.example.com/index.php/category/gifts

How can I get it to redirect to:

http://www.example.com/category/gifts ?

2.) What about enforcing lower case?

3.) Is it important to force a trailing slash at the end of the URL?

4.) Is there a standard set of htaccess rules anywhere that helps to prevent duplicate canonical url problems, e.g. when the same page can be shown with differences in the url?

Any help would be greatly appreciated!

 

jdMorgan

Msg#: 3692704 posted 12:14 am on Jul 8, 2008 (gmt 0)

1) Put your external redirects first, in order from most-specific to least-specific. Then include your internal rewrites, again in order from most-specific to least-specific.

2) Enforcing lowercase is the server default behaviour. More accurately, the requested URL-path must exactly match the casing of the filename, otherwise you'll get a 404 error. You can 'correct' casing errors, but it is slow and *very* inefficient in .htaccess, since each incorrect-case character must be corrected one-at-a-time.

3) A trailing slash on the end of a URL means that the requested resource is a directory, and not a file or a "page."

4) A standard set of rules is impossible, because of differences in the "correct" URL-set used by each site.

Some common URL errors often corrected using 301 redirects are:

  • Direct linking to "index pages" such as your /index.php example problem above. (Note that you must use "special" code to fix this in order to avoid recursion.)
  • Multiple consecutive slashes, e.g. "example.com//page.php"
  • Trailing characters due to unclosed quotes in HTML links, e.g. "example.com/page.html>link text here</a> more text..."
  • Trailing periods due to incorrectly-typed URLs or auto-linked URLs in forums, e.g. "example.com/page.asp."
  • Spurious or blank query strings appended to URLs which do not resolve to dynamic pages.
  • Various typos in URLs, e.g. commas typed instead of periods, missing "l" on "html", etc.

5) Literal periods in regular-expression patterns must be escaped. A period standing alone in a regex pattern is a token meaning "any single character." Therefore, a pattern of ^www.example.com$ will match "wwwxexample.com", "www3example.com", and many others -- probably not what you want.

6) You should remove the [NC] from your domain canonicalization rule, so that it will not "approve" incorrectly-cased domains, and will instead redirect them.

7) If your domain has a unique (non-shared) IP address, you should add a RewriteCond to your hostname canonicalization rule so that it will accept a blank HTTP_HOST value. Blank values come from true HTTP/1.0 clients, and though rare, should be accommodated. If you do not accommodate them, the result will be an infinite redirection loop, since true HTTP/1.0 clients do not send a Host header at all. Add this to prevent the problem:

RewriteCond %{HTTP_HOST} .

There is a single period in that pattern -- don't omit it.

Sites hosted on shared name-based virtual servers need not implement this work-around; name-based virtual hosts are unreachable via HTTP/1.0.
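
Points 5 through 7 can be combined into a single hostname-canonicalization rule. The following is a minimal sketch rather than a drop-in file. It assumes www.example.com is the canonical hostname and that the rule sits above any internal rewrites, as point 1 recommends:

```apache
RewriteEngine On

# Skip the redirect when HTTP_HOST is blank (true HTTP/1.0 clients);
# redirecting them would loop forever (point 7).
RewriteCond %{HTTP_HOST} .
# Redirect any hostname that is not exactly the canonical one.
# Periods are escaped (point 5) and [NC] is omitted (point 6).
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

Consecutive RewriteCond lines are ANDed by default, so the redirect fires only when the hostname is both non-blank and non-canonical. R=301 is spelled out because a bare [R] issues a 302.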

    Jim

    jdMorgan

    Msg#: 3692704 posted 12:31 am on Jul 8, 2008 (gmt 0)

    I'd like to add that "duplicate-content penalties" are largely self-inflicted. That is, it is not that search engines penalize sites with multiple URLs pointing to the same content. While this may happen in the most egregious cases --where it has been done intentionally-- the actual "penalty" is that you have your own URLs competing against each other, diluting the PageRank/link-popularity that would otherwise be focused on a single URL by spreading it thinly across multiple URLs.

    Another problem is that the search engines will "pick" one URL -- whichever one they perceive as having the most merit based on their algorithm, and this may or may not be the URL that you'd prefer.

    I've left the word "penalty" in the title of this thread, since that is what the problem is commonly called. But as stated, duplicate content "penalties" are usually self-inflicted ranking-dilution problems; sloppy terminology, hyperbole, and uninformed fear have promoted a technical problem to dark and scary "penalty" status.

    Jim

    codeman

    Msg#: 3692704 posted 4:20 pm on Jul 8, 2008 (gmt 0)

    Thanks for your informative post, JD!

    As far as the case-sensitivity goes, here is one of those situations where the Codeigniter framework comes into play. The following url is actually dynamic, and does not point to a file name:

    http://www.example.com/category/gift-baskets

    The last segment in the url is actually a unique identifier which is used in a SQL query, so it will provide the same page if it is upper, lower, or mixed case. So, Google shows some sites linking back to:

    http://www.example.com/category/Gift-Baskets

    The worrisome thing is, Google separately shows the lower case URL, and a different number of sites linking back to the capitalized URL...

    g1smd

    Msg#: 3692704 posted 4:38 pm on Jul 8, 2008 (gmt 0)

    If the URL is different, even if it is different by only one character, or by the case of one character, then it IS, by definition, a different URL. That's the Duplicate Content problem defined.

    I use .htaccess to pre-process URL requests and reject those that do not conform to the correct naming conventions for the site.

    Mostly that is in checking query-string parameter names and their allowed range of values, but also includes checking the domain name .co.uk vs. .com, correct "www" sub-domain, and so on.

    It would be fairly easy to set up your system so that it simply rejected any URL with an upper-case path or filename within (be aware that domain names are NOT case-sensitive) using a few lines of code in the .htaccess file.

    It would be a LOT more difficult to correct those requests to have the correct case using .htaccess, but it would be fairly easy to have this check at the beginning of your script. The script would check the requested URL, and issue a redirect for anything wrongly cased. Only correctly-cased requests would be dealt with by the database-facing code. There must already be a section of code that evaluates requests for validity and strips out requests that could be security breaches or hacks, so the code for this would be added to that section of the script.
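
The "reject anything upper-case" option could look something like the sketch below. It is an illustration only. It assumes the rule is placed above the internal rewrite to index.php, and that real files and directories with mixed-case names should stay reachable, which is why the two existence checks are included:

```apache
RewriteEngine On

# Leave requests for real files and directories alone
# (an assumption about the site, not something stated in the thread).
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# Any upper-case letter in the URL-path: refuse with 403 Forbidden.
# "-" means "no substitution"; [F] ends processing and sends the 403.
RewriteRule [A-Z] - [F]
```

Because the pattern is unanchored, a single upper-case character anywhere in the path triggers it. This only rejects the request; it does not redirect visitors to the lower-case URL.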

    rocknbil

    Msg#: 3692704 posted 5:32 pm on Jul 8, 2008 (gmt 0)

    RewriteCond %{HTTP_HOST} .
    There is a single period in that pattern -- don't omit it.

    To clarify, are you saying it should be this?

    RewriteCond %{HTTP_HOST} . !^www.example.com$

    I can only get the canonical redirect to work without the period.

    RewriteCond %{HTTP_HOST} !^www.example.com$

    It works, and I am not experiencing any problems. What could be the cause?

    g1smd

    Msg#: 3692704 posted 5:42 pm on Jul 8, 2008 (gmt 0)

    No. The entire line is this:

    RewriteCond %{HTTP_HOST} .

    It is an extra line. The "dot" is a part of the line, and must be there.

    rocknbil

    Msg#: 3692704 posted 9:17 pm on Jul 8, 2008 (gmt 0)

    Sorry for derailing the thread, but it seems on topic. I was constructing from memory (bad idea at my age.)

    Here is the entire rewrite section of my .htaccess. It works fine as posted, but removing the comment causes a server error. I have this on multiple servers, with different configs and ISPs.

    #RewriteCond %{HTTP_HOST} .
    RewriteCond %{HTTP_HOST} !^www\.example\.com
    RewriteRule (.*) http://www.example.com/$1 [R=301,L]

    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule ^(.*)$ /cgi-bin/product-search.cgi [L]

    codeman

    Msg#: 3692704 posted 4:22 pm on Jul 9, 2008 (gmt 0)

    I have one specific issue with Codeigniter that I'm sure I can resolve in htaccess but I'm unsure of the syntax...I would like to redirect:

    http://www.example.com/index.php/category/gifts

    to

    http://www.example.com/category/gifts

    So that the string "index.php" is removed from the url, but everything before and after it is the same...

    g1smd

    Msg#: 3692704 posted 4:40 pm on Jul 9, 2008 (gmt 0)

    What is the URL like on the page, in the outgoing links of other pages on the site?

    The URL in the link defines the URL of the page it links to.

    .

    Is the content on the server stored at:
    /index.php/category/gifts ?

    Do you want your URLs to be indexed like this:
    www.example.com/category/gifts ?

    If so, then you will need to have:

    - links on your pages that point to URLs like:
    www.example.com/category/gifts

    - a 301 redirect:
    from: www.example.com/index.php/category/gifts
    to: www.example.com/category/gifts
    in order to stop the Duplicate Content URLs being indexed.

    - an internal rewrite so that URL requests for:
    www.example.com/category/gifts
    pull the content from this internal filepath:
    /index.php/category/gifts or whatever internal filepath with parameters it is really located at.

    You might also need another 301 redirect from a URL something like:
    from: www.example.com/index.php?category=gifts
    to: www.example.com/category/gifts
    in order to stop direct accesses to the script itself.

    .

    [edit]Fixed typos in examples[/edit]

    [edited by: g1smd at 5:02 pm (utc) on July 9, 2008]

    codeman

    Msg#: 3692704 posted 4:44 pm on Jul 9, 2008 (gmt 0)

    Ok, but I don't want Google to index the URL with index.php, and they have already indexed some...if rewriting will allow the bots to see what the correct URL is, I'm all for it...

    What's the code to do it, I'll give it a shot and see...

    codeman

    Msg#: 3692704 posted 4:53 pm on Jul 9, 2008 (gmt 0)

    Hi g1smd....to respond to your edited post:

    Because of the Codeigniter framework, there is no such directory as index.php/category etc...Codeigniter has an SEO-friendly url system which is translated into query strings. So this url:

    /category/gift-baskets

    is interpreted by the web application as:

    index.php?category=gift-baskets

    There is already one rewrite rule which helps to have shorter url's which omit index.php:

    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule ^(.*)$ index.php/$1 [L]

    Which is standard practice in all Codeigniter applications. My problem is, the URL still works when index.php is included, which is a risk because there are two different url's which can produce the same content...so, if somehow a URL is accidentally indexed with "index.php" in there, I would like it to force the URL without it...

    Oddly enough, we have not published any URL's that contain index.php, but somehow Google's bot has found a way to index a few of them that way!

    So, I am looking for a few lines to put into htaccess that will accomplish this, if anyone knows it would be a huge help...

    [edited by: codeman at 4:55 pm (utc) on July 9, 2008]

    jdMorgan

    Msg#: 3692704 posted 4:55 pm on Jul 9, 2008 (gmt 0)

    Sorry, you can't do it that way. Mod_rewrite can't fix the problem by itself, since the 'wrong' URL is already defined by your page. It can only help with the 'cleanup' and in "reconnecting" the new URL as defined on your pages to the correct server filepath.

    Fix the links on your pages, then follow g1smd's procedure.

    Jim

    codeman

    Msg#: 3692704 posted 4:59 pm on Jul 9, 2008 (gmt 0)

    Hi jd,

    I think something was miscommunicated - all of the links on our site correctly do NOT include index.php, and appear how we wish them to be indexed.

    Somehow, at some point, probably during the site's development, the mysterious Googlebot picked up a few that contain index.php. There is already the Rewrite in place above so that the web application will use index.php even if it is not in the url - what I need to do is redirect any url's containing index.php to the same URL without it. I'm already up to that point; I just don't know how to write this in mod_rewrite...

    codeman

    Msg#: 3692704 posted 5:02 pm on Jul 9, 2008 (gmt 0)

    Maybe this helps...I'm looking for something like this:

    redirect 301 /index.php/^(.*)$ /$1

    g1smd

    Msg#: 3692704 posted 5:06 pm on Jul 9, 2008 (gmt 0)

    Sorry, I had a phone call while writing the post, and published it half-written. Check the finished post again, and I think I nailed all the things you need.

    Yes, it looks like a simple redirect will fix the last of your issues, but make sure that it is using a RewriteRule and not Redirect otherwise you cannot guarantee the order in which it runs.

    codeman

    Msg#: 3692704 posted 5:14 pm on Jul 9, 2008 (gmt 0)

    Thanks g1smd!

    I think you might be overestimating me! Based on my beginner-level knowledge of regular expressions and htaccess syntax, I came up with this, but it does not work (although at least it is not harming anything, which is more than what I usually accomplish):

    RewriteEngine On
    RewriteCond %{HTTP_HOST} ^www.example.com$ [NC]
    RewriteRule /index.php/^(.*)$ /$1 [R,L]

    codeman

    Msg#: 3692704 posted 5:30 pm on Jul 9, 2008 (gmt 0)

    I've also tried this:

    RewriteCond %{HTTP_HOST} ^(www.)?example\.com/index\.php/$ [NC]
    RewriteRule ^(.+)/$ [%{HTTP_HOST}...] [R=301,L]

    g1smd

    Msg#: 3692704 posted 5:36 pm on Jul 9, 2008 (gmt 0)

    *** RewriteRule /index.php/^(.*)$ /$1 [R,L] ***

    Nearly!

    ^ goes at the front, the rule cannot see the leading slash, and [R] gives a 302 redirect. The redirect should also force the domain.

    This is closer:

    RewriteRule ^index.php/(.*) http://www.domain.com/$1? [R=301,L]

    I also like to add a ? to the end of the target, so that unnecessary parameters are removed from the final URL.

    .

    *** RewriteCond %{HTTP_HOST} ^www.example.com$ [NC] ***

    Why have you got a RewriteCond that only runs the rule if a www URL was requested?

    Surely you want it to run for both www and non-www and force both over to www at the same time.

    That line can be omitted.

    codeman

    Msg#: 3692704 posted 6:56 pm on Jul 9, 2008 (gmt 0)

    Thanks g1smd!

    I tried this line:

    RewriteRule ^index.php/(.*) [domain.com...] [R=301,L]

    Unfortunately, it's conflicting with something else...it causes an error on the site that the server is redirecting in a way that will never complete...

    It could also be something specific to CodeIgniter's url re-routing...

    jdMorgan

    Msg#: 3692704 posted 4:26 am on Jul 10, 2008 (gmt 0)

    Checking THE_REQUEST (the original client request) can often stop redirect-vs-rewrite conflicts:

    RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php/[^\ ]*\ HTTP/
    RewriteRule ^index\.php/(.*)$ http://www.domain.com/$1? [R=301,L]
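
Putting the pieces of this thread together, a complete rewrite section might be ordered as in the sketch below. This is an assembly of the rules discussed above, not a tested drop-in file. It assumes www.example.com is the canonical hostname (the rule above uses www.domain.com as a placeholder) and a standard CodeIgniter front controller at /index.php:

```apache
RewriteEngine On
RewriteBase /

# 1. External redirect: strip "index.php/" only when the client itself asked for it.
#    THE_REQUEST holds the original request line, e.g.
#    "GET /index.php/category/gifts HTTP/1.1", and is not altered by the
#    internal rewrite in step 3, so step 3 cannot re-trigger this redirect.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php/[^\ ]*\ HTTP/
RewriteRule ^index\.php/(.*)$ http://www.example.com/$1? [R=301,L]

# 2. External redirect: force the canonical hostname.
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# 3. Internal rewrite: pass anything that is not a real file or
#    directory to CodeIgniter.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php/$1 [L]
```

Without the THE_REQUEST condition, step 1 would also match the path produced by step 3 when the rules are re-run, which is the endless redirect reported earlier. Placing both redirects above the internal rewrite should also address the original first question: a non-www request is redirected before "index.php/" is ever added to the path.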

    Jim

    g1smd

    Msg#: 3692704 posted 9:32 am on Jul 10, 2008 (gmt 0)

    Ah, the beauty of Time Zones. JD nailed it before I was even awake. :-)

    That should fix the problem.

    codeman

    Msg#: 3692704 posted 5:02 pm on Jul 11, 2008 (gmt 0)

    jd, you rock! That works beautifully!

    Thanks to all for your help...I have a few small remaining issues, but I think that the rest probably need to be taken care of with PHP code rather than htaccess at this point, such as the last bit of the dynamic url's enforcing lower case...

    g1smd

    Msg#: 3692704 posted 10:50 pm on Jul 11, 2008 (gmt 0)

    Yes, .htaccess is very inefficient at changing case. It has to be done one character at a time.

    PHP will nail it far easier.
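
For completeness: Apache can lower-case a whole path in one step, but only through a RewriteMap, which has to be declared in the server or virtual-host configuration and cannot be declared in .htaccess. The sketch below assumes you have access to that configuration, which many shared hosts do not allow. The map name `lc` is chosen here purely for illustration:

```apache
# In the server or <VirtualHost> configuration for the site
# (RewriteMap is not permitted in .htaccess):
RewriteMap lc int:tolower

# In .htaccess, above the internal rewrite to index.php:
# leave real files and directories alone, then redirect any path
# containing an upper-case letter to its lower-case equivalent.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*[A-Z].*)$ http://www.example.com/${lc:$1} [R=301,L]
```

Where the server configuration is out of reach, handling the redirect in the PHP script remains the practical route.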

    All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
    WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
    © Webmaster World 1996-2014 all rights reserved