Rewrites and Duplicate Content

Looking to fix duplicate content issues with my "friendly" URL rewrites.

         

fmchris

1:35 am on Jan 18, 2009 (gmt 0)

10+ Year Member



I have several rules set up like this:

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^about$ about.php [L]

in my .htaccess file. This just rewrites the http://example.com/about URL to the actual file, http://example.com/about.php

Now, this works great so far.

The only issue I've had was specifying the rules with a trailing slash (RewriteRule ^about/?$ about.php [L]) where I had problems with the site's relative directories for the CSS, for example, ./styles/screen.css. To fix this, I simply removed the /? from the rule so that typing /about/ returns a 404.

Now, my problem is this: both http://example.com/about and http://example.com/about.php are indexable. I've heard this can create a bad situation for stats tracking and SEO performance. I could block search engines from accessing the .php files with a robots.txt file, but I would much rather create a redirect that sends any request for http://example.com/about.php to http://example.com/about.

If, however, I add such a 301 redirect, the server goes into a loop, since it's rewriting about to about.php and redirecting about.php back to about.

Does anyone have any ideas on the best way to implement these friendly URLs, or any other suggestions? Thanks! The full .htaccess file is below:


Options +FollowSymLinks
RewriteEngine On

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^generate$ generate.php [L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^about$ about.php [L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^privacy$ privacy.php [L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^help$ support.php [L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^terms$ terms.php [L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^login$ login.php [L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^logout$ logout.php [L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^register$ register.php [L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^account$ account.php [L]

g1smd

1:52 am on Jan 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Place redirects before rewrites so that internal paths from rewrites are not exposed.

On the redirects, use a RewriteCond to check %{THE_REQUEST} to see that the URL is coming from a direct client request and not as a result of a prior rewrite.

Make sure the redirect contains the canonical hostname as a part of the target URL.

There are many prior examples here in the forum, one posted in the last 48 hours: [google.com...]

fmchris

2:02 am on Jan 18, 2009 (gmt 0)

10+ Year Member



Thanks for your reply!

I may be doing this terribly wrong, but before the rewrites I added:

RewriteCond %{THE_REQUEST} ^/about\.php\ HTTP/
RewriteRule ^about.php$ http://example.com/about [R=301,L]

However, nothing actually happens. The site works as expected, but it doesn't redirect.

jdMorgan

4:25 am on Jan 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here is the whole rule-set for trailing-slash fix-up, dynamic-to-static redirect, and static-to-dynamic rewrite. Note that you do not need the file-exists and directory-exists checks, since you are rewriting specific URLs.

# Externally redirect to remove trailing slash
RewriteRule ^about/$ http://example.com/about [R=301,L]
#
# Externally redirect direct client requests for dynamic URL to static URL
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /about\.php\ HTTP/
RewriteRule ^about\.php$ http://example.com/about [R=301,L]
#
# Internally rewrite static extensionless URL to script
RewriteRule ^about$ about.php [L]

Now you *could* collapse all the internal rewrites into one, but you then have to check for file-exists:

# If requested extensionless URL-path exists as a file when ".php" is appended
RewriteCond %{DOCUMENT_ROOT}/$1.php -f
# then internally rewrite extensionless URL-path to add ".php"
RewriteRule ^([a-z]+)$ $1.php [L]

Similarly, the rules that redirect to remove trailing slashes can be collapsed into one rule:

# If requested extensionless URL-path with trailing slash exists as
# a file when the slash is removed and ".php" is appended
RewriteCond %{DOCUMENT_ROOT}/$1.php -f
# then externally redirect to remove the trailing slash
RewriteRule ^([a-z]+)/$ http://example.com/$1 [R=301,L]

And finally, the direct client dynamic URL request to static URL redirects can also be collapsed into one rule:

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[^.]+\.php\ HTTP/
RewriteCond %{DOCUMENT_ROOT}/$1.php -f
RewriteRule ^([^.]+)\.php$ http://example.com/$1 [R=301,L]

On a busy shared server, you may find that the individual rules run faster because they do not have to call the filesystem to do 'file exists' checks, which are slow and CPU-intensive. On the other hand, since you can use only three rules to support as many extensionless URLs as you like, you might just want to go ahead and let the server do the work -- It's your choice.

BTW, a typical value of %{THE_REQUEST} would look like this:

GET /about.php HTTP/1.1

It is exactly the client request logged in your raw server access logs.
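As a rough illustration of why the earlier attempt in this thread never fired: the request line starts with the HTTP method, so a pattern anchored with ^ directly on the slash cannot match. The annotated fragment below is only a sketch, using the same example.com hostname as the rules above.

```apache
# A typical request line, as seen in %{THE_REQUEST}:
#   GET /about.php HTTP/1.1
#
# Never matches: "^" anchors at the start of the line, where the
# method ("GET") sits, not at the slash of the URL-path.
# RewriteCond %{THE_REQUEST} ^/about\.php\ HTTP/
#
# Matches: allow for the method and the space that precede the URL-path.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /about\.php\ HTTP/
RewriteRule ^about\.php$ http://example.com/about [R=301,L]
```

Because %{THE_REQUEST} is not changed by an internal rewrite, the condition stays false when about.php is reached through the /about rewrite, which is what prevents the redirect loop.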

The code above is somewhat complex and my eyes are a bit fuzzy. It won't surprise me one bit if I added a typo or two, so don't go away and debug this for three days before coming back and asking... :)

Jim

[edited by: jdMorgan at 4:29 am (utc) on Jan. 18, 2009]

g1smd

5:07 pm on Jan 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



*** This just rewrites the http://example.com/about URL to the actual file, http://example.com/about.php ***

Actually, it rewrites a URL to an internal bath, and it does so whether a www or a non-www URL was requested.

Ahead of the rewrites (placed as the very last redirect) you need a site-wide non-www to www 301 redirect. This rule will fix up anything that the previous rules did not, and will preserve the requested path in the redirect.
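A minimal sketch of such a rule, assuming www.example.com is the canonical hostname (the thread uses both forms, so substitute whichever one you have settled on):

```apache
# Assumption: www.example.com is the canonical hostname.
# Place this after the other redirects and before any internal rewrites.
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

The captured path in $1 is carried over unchanged, so a request that the earlier redirects did not touch still lands on the same page under the canonical hostname.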

g1smd

6:38 pm on Jan 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Pfft.

bath --> path

fmchris

2:07 am on Jan 19, 2009 (gmt 0)

10+ Year Member



Wow, thanks for all the help. I'll look at implementing/debugging it this evening and see what I can do.

It's running as a standalone site on a VPS and running highly CPU intensive scripts already, so Apache speed isn't much of a concern within reason.

And I know I can use variables to collapse it to one rule, but not all the PHP files are named like I want the URLs to be named.
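One way this might be handled, sketched under the assumption that /help mapping to support.php (as in the original .htaccess) is the only mismatch: keep explicit rules for the exception and let the collapsed rules cover everything else. The redirect from /support to /help is an addition for illustration, not something from the original file.

```apache
# --- Redirects first ---
# Exception: direct client requests for /support.php go to /help
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /support\.php\ HTTP/
RewriteRule ^support\.php$ http://example.com/help [R=301,L]
# Illustrative addition: stop /support from becoming a second URL for the page
RewriteRule ^support$ http://example.com/help [R=301,L]
# Generic: direct client requests for any other existing .php file
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[^.]+\.php\ HTTP/
RewriteCond %{DOCUMENT_ROOT}/$1.php -f
RewriteRule ^([^.]+)\.php$ http://example.com/$1 [R=301,L]
#
# --- Internal rewrites last ---
# Exception: /help is served by support.php
RewriteRule ^help$ support.php [L]
# Generic: extensionless URL-path to the matching .php file, if it exists
RewriteCond %{DOCUMENT_ROOT}/$1.php -f
RewriteRule ^([a-z]+)$ $1.php [L]
```

The specific rules sit ahead of the generic ones in each group, so the exception is handled before the catch-all pattern gets a chance to match.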

fmchris

3:42 am on Jan 19, 2009 (gmt 0)

10+ Year Member



Also, sorry for the double post, but @g1smd, I already take a www redirect into account. I've been implementing that on all my sites for a while :)