Batch redirects of all .html extensions with .htaccess

         

euphemus

9:06 pm on May 12, 2010 (gmt 0)

10+ Year Member



Hello. I was hoping a 301 redirect guru might be able to help me out with my htaccess.

I am using:

RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME}\.html -f
RewriteRule ^([^/]+)/$ $1.html


to send requests for a page with no extension (e.g. example.com/about) to the corresponding example.com/about.html page.

I then followed it up with:

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !(\.[a-zA-Z0-9]{1,5}|/)$
RewriteRule (.*)$ /$1/ [R=301,L]


To force a trailing slash after the filename because it looks pretty and seems to do no harm.

This works to a point - users can request a page with or without the .html extension.

However, I now have a duplicate content issue I think, given Google may try to index both the .html pages and the 'virtual' version with no extension. I appreciate that there is only one page but will Google agree?

So I think I need a 301 redirect from any page ending in .html to the same page without the extension. I've tried individual 301 redirects on a page-by-page basis, but this results in a loop. I was hoping for a one-line RedirectMatch or 301 redirect rule that would take care of this. Surprisingly, changing extensions is covered numerous times across teh interwebs, but cleaning up the old .html extensions isn't.

For completeness, I should also point out that I am also using:


AddType application/x-httpd-php htm html php
AddHandler application/x-httpd-php .htm .html

RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]


The first two lines let me serve one or two .php pages on my site as .html, and the rewrite rule handles www versus non-www.

Any help is much appreciated, and apologies if this has been covered here; I just can't find it!

Thanks.

[edited by: jdMorgan at 12:40 am (utc) on May 13, 2010]
[edit reason] Please use example.com only [/edit]

jdMorgan

1:07 am on May 13, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Your rule order is incorrect. It will 'expose' your .html filepaths as URLs if an extensionless page is requested using the non-canonical domain -- try it and look at your address bar...
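To see why the order matters, here is a rough Python simulation, not Apache itself. The function names are made up for illustration, the file-exists checks are left out, and it assumes that rules run top to bottom and that an internal rewrite changes the path seen by every later rule.

```python
import re

def internal_rewrite(host, path):
    # "about/" -> "about.html", served without telling the browser
    m = re.fullmatch(r"([^/]+)/", path)
    return ("rewrite", m.group(1) + ".html") if m else None

def canonical_host(host, path):
    # Any hostname other than the canonical one gets an external redirect
    if host != "www.example.com":
        return ("redirect", "http://www.example.com/" + path)
    return None

def run(rules, host, path):
    """Apply rules in order; return the redirect URL, or None if there is none."""
    for rule in rules:
        result = rule(host, path)
        if result is None:
            continue
        kind, value = result
        if kind == "redirect":
            return value
        path = value  # an internal rewrite changes what later rules see
    return None

# Rewrite first, hostname redirect second: the .html path leaks out
print(run([internal_rewrite, canonical_host], "example.com", "about/"))
# http://www.example.com/about.html

# Hostname redirect first: the browser keeps the URL it asked for
print(run([canonical_host, internal_rewrite], "example.com", "about/"))
# http://www.example.com/about/
```

The general habit this suggests is to put external redirects ahead of internal rewrites.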

"Looks pretty" is not really a good reason to waste a byte in a URL. I think example.com/page is prettier than example.com/page/ myself, especially since /page is NOT a directory. Opinions vary, but the shorter your URLs, the better.

File- and Directory-exists checks are horribly expensive in terms of server resources, and may actually invoke two additional disk reads per HTTP request unless you take steps to avoid these checks when they are not absolutely necessary. In your add-slash rule, you could have put the REQUEST_URI pattern RewriteCond first, ahead of the -f and -d checks, to avoid at least some of these unnecessary disk checks.

In addition, on a site like yours where every real file has a file extension and only directories do not, the -f test is not needed.

The -d test isn't needed because REQUEST_FILENAME will add a slash before testing if it's not there anyway.

The requested URL-path must also be at least one character long if you're going to add a slash to it, so the test for an existing slash can be moved out of the RewriteCond and back into the RewriteRule pattern.


Then use [NC] on the RewriteCond, so you do not have to test both [a-z] and [A-Z] -- That's more than a third faster...

You cannot rewrite /a to /a.html and then expect to redirect /a.html back to /a -- Of course that creates an infinite rewrite/redirect loop.

The way out of the sack is to only redirect a.html if it is directly requested by the client, and not as a result of a previously-invoked rewrite.
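As a rough illustration of "directly requested by the client", the Python sketch below tests the raw request line, which is what %{THE_REQUEST} holds and which does not change when a URL is rewritten internally. The function name is invented for the example, and the pattern assumes the HTTP method is matched with [A-Z]{3,9}; Python's regex engine is close to, but not identical to, the one Apache uses.

```python
import re

# Same shape as a THE_REQUEST condition; spaces need no escaping in Python.
DIRECT_HTML_REQUEST = re.compile(
    r"^[A-Z]{3,9} /([^/]+/)*([^.]+\.)+html([?#][^ ]*)? HTTP/"
)

def should_redirect(the_request: str) -> bool:
    """True only when the client itself asked for a .html URL."""
    return DIRECT_HTML_REQUEST.search(the_request) is not None

# Client asks for /about; the server may rewrite it internally to /about.html,
# but the request line is still the original one, so no redirect fires.
print(should_redirect("GET /about HTTP/1.1"))                 # False

# Client asks for the .html URL directly: redirect to the extensionless URL.
print(should_redirect("GET /about.html HTTP/1.1"))            # True
print(should_redirect("GET /dir/page.html?x=1 HTTP/1.1"))     # True
```

Because the test looks at what the browser sent rather than at the current, possibly rewritten, path, the redirect cannot be triggered by the server's own rewrite, and the loop is broken.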

Sorting all of this out, you'll get something like:

AddType application/x-httpd-php htm html php
AddHandler application/x-httpd-php .htm .html
#
RewriteEngine On
#
# Externally redirect to add missing trailing slashes if no filetype on requested URL-path
RewriteCond $1 !\.[a-z0-9]{1,5}$ [NC]
RewriteRule ^(.*[^/])$ http://www.example.com/$1/ [R=301,L]
#
# Externally redirect direct client requests for URLs
# with .html extensions to new extensionless URLs
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*([^.]+\.)+html([?#][^\ ]*)?\ HTTP/
RewriteRule ^(.+)\.html$ http://www.example.com/$1/ [R=301,L]
#
# Externally redirect requests for non-blank non-canonical hostname to canonical hostname
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
#
# Internally rewrite extensionless URL-paths to .html files if they
# end in a slash, do not resolve to physically-existing directories,
# and do resolve to existing files when ".html" is appended.
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME}\.html -f
RewriteRule ^([^/]+)/$ /$1.html [L]

Jim

[edit] Corrected as noted below. [/edit]

[edited by: jdMorgan at 2:47 am (utc) on May 13, 2010]

euphemus

2:16 am on May 13, 2010 (gmt 0)

10+ Year Member



Thanks, Jim, for a great reply.

I agree with your logic on the appending of additional / - there needs to be a distinction between files and directories so I will scratch that idea.

Also, your explanation on unnecessary requests is helpful. Having gone to great lengths to build a standards compliant site by hand, rather than using a bloated CMS, it would be a pity to undo that effort with an .htaccess rule that wastes resources - so thanks for explaining that.

Anyway, I've tried the exact code above, changing example.com to mydomain.com, but no dice.

What I get in the address bar if I request a page.html file is page.html/ (note the extra /).

If I request an extensionless file, e.g. /about, I get mydomain.com/myerrorpage.html/ (note the extra /).

Requesting an extensionless file with a trailing slash, /about/, yields correctpage.html/ but without the stylesheet loading (I guess I can fix that elsewhere).

Other than the code you have provided here, there's nothing else in the htaccess other than the ErrorDocument rule.

So I'm a little lost now. Only reply if you have time - I guess I'll figure it out eventually on my own. I think you have to really understand rewrites and use them often to get this stuff right :)

jdMorgan

2:54 am on May 13, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, I forgot the negation operator on the first RewriteCond in the code I posted. I corrected it above to prevent others from copying and pasting bad code.

The missing stylesheet is because of your trailing slash. Your link to the stylesheet is a page-relative link, I'll wager, so the browser uses "/page/" as the base directory when constructing the full URL that it must use to request the css file, instead of taking "/page.html" or "/page" and stripping that back to the current directory level before adding the css path to that.

So... there's another reason not to add a trailing slash.

You can fix it by linking to the css file as <link rel="stylesheet" type="text/css" href="/style.css"> with a leading slash on the stylesheet 'path'. This now has to be the full path to the stylesheet from the DocumentRoot of your server, which in my example resolves to example.com/style.css as shown. Alternatively, you can use a full URL in the <link> href.
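If you want to check how a browser would resolve a stylesheet link against a given page URL, Python's standard urljoin follows the same relative-reference rules. This is only a sketch; the helper name and the URLs are placeholders, not anything taken from your site.

```python
from urllib.parse import urljoin

def stylesheet_url(page_url: str, href: str) -> str:
    """Resolve a <link href="..."> value against the URL of the page."""
    return urljoin(page_url, href)

# Page-relative link: the result depends on the trailing slash
print(stylesheet_url("http://www.example.com/page/", "style.css"))
# http://www.example.com/page/style.css
print(stylesheet_url("http://www.example.com/page", "style.css"))
# http://www.example.com/style.css
print(stylesheet_url("http://www.example.com/page.html", "style.css"))
# http://www.example.com/style.css

# Root-relative link: the same result whatever the page URL looks like
print(stylesheet_url("http://www.example.com/page/", "/style.css"))
# http://www.example.com/style.css
```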

Jim

jdMorgan

3:07 am on May 13, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oh, and should you decide to go both extensionless and slash-less, this would likely do the trick:

AddType application/x-httpd-php htm html php
AddHandler application/x-httpd-php .htm .html
#
RewriteEngine On
#
# Externally redirect to remove trailing slashes if not a
# directory request, to enforce clean extensionless URLs
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(([^/]+/)*[^/]+)/$ http://www.example.com/$1 [R=301,L]
#
# Externally redirect direct client requests for URLs
# with .html extensions to new extensionless URLs
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*([^.]+\.)+html([?#][^\ ]*)?\ HTTP/
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]
#
# Externally redirect requests for non-blank
# non-canonical hostnames to canonical hostname
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
#
# Internally rewrite extensionless URL-paths to .html files
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.html [L]

Jim

euphemus

5:09 pm on May 13, 2010 (gmt 0)

10+ Year Member



Worked like a charm, Jim. Can't thank you enough.

Cheers.