What I want is (in no particular order):
1. Set charset to UTF-8 for ".html" and ".xml"
2. Redirect ".htm" files to ".html" and set a 301 for ".htm"
3. Redirect "www.example.org" to "example.org" and set a 301 for "www"
4. Redirect "foo/index.html" to "foo/"
AddCharset UTF-8 .html .htm
AddType "text/html; charset=UTF-8" .html .htm
AddType "application/rss+xml; charset=UTF-8" .xml
RewriteEngine on
RewriteCond %{HTTP_HOST} !^example\.org$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule ^(.*)$ http://example.org/$1 [R=301,L]
RewriteBase /
RewriteRule ^(.*)\.htm$ $1 [C,E=WasHTM:yes]
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(.*)$ $1.html [S=1]
RewriteCond %{ENV:WasHTM} ^yes$
RewriteRule ^(.*)$ $1.htm
[edited by: jdMorgan at 4:13 pm (utc) on April 15, 2007]
[edit reason] examplified [/edit]
Next, you should force the non-www for all other www files.
For any URL that is redirected the redirection should take place in one step - not as a redirection chain.
Several threads in the last week here contain the bulk of the code that you would need to use.
You have no code to support your #4 requirement, and I assume that you want to redirect *only* /index.html to "/" and not "/<any_dir>/index.html" to "/<any_dir>/". Note that the use of {THE_REQUEST} to check the original client request is required to prevent an infinite loop; without it, the RewriteRule will interact with the DirectoryIndex directive, with each countermanding the other, looping until either the server or the client reaches its maximum redirection limit.
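To make the loop concrete, here is a minimal annotated sketch contrasting an unguarded rule with the guarded one. It assumes that DirectoryIndex lists index.html and that example.org is the canonical host, as elsewhere in this thread; the sample request line in the comment is only an illustration.

```apache
RewriteEngine on
# Unguarded (loop-prone, shown commented out): a request for "/" is mapped
# internally to /index.html by DirectoryIndex, this rule then redirects
# back to "/", and the cycle repeats.
# RewriteRule ^index\.html$ http://example.org/ [R=301,L]
#
# Guarded: THE_REQUEST holds the request line exactly as the client sent it,
# for example "GET /index.html HTTP/1.1", and internal rewrites or
# subrequests do not change it.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html\ HTTP/
RewriteRule ^index\.html$ http://example.org/ [R=301,L]
```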
Along with a few more corrections and tweaks, here's what I'd use:
AddCharset UTF-8 .html .htm
AddType "text/html; charset=UTF-8" .html .htm
AddType "application/rss+xml; charset=UTF-8" .xml
#
RewriteEngine on
RewriteBase /
#
# Redirect "/index.html" to "/" in canonical domain
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html\ HTTP/
RewriteRule ^index\.html$ http://example.org/ [R=301,L]
#
# Redirect all non-canonical domain requests
# except for robots.txt to canonical domain
RewriteCond %{HTTP_HOST} !^example\.org
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule (.*) http://example.org/$1 [R=301,L]
#
# If requested .htm URL exists as an .html file
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
# rewrite .htm URLs to .html files
RewriteRule ^(.*)\.htm$ /$1.html [L]
# Redirect "/<any_directory>/index.html" to "/<any_directory>/" in canonical domain
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html$ http://example.org/$1 [R=301,L]
Does a request for index.htm or for index.html get you to the correct new URL in just one step? Try it starting at www and then again at non-www for both examples.
Use an online HTTP Header Checker to confirm the correct action.
I'm not doubting jd's code; but you should always test code like this to ensure that every possibility has been covered and that each one does in fact return the correct response.
First of all, thanks *so* much for your time. As I said, my knowledge of this topic is still very basic, and I'm trying to follow your directions and suggestions.
For now, I'd just like to be sure that these lines allow ".html" files to be accessed through ".htm" URLs:
# If requested .htm URL exists as an .html file
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
...whereas this...
# rewrite .htm URLs to .html files
RewriteRule ^(.*)\.htm$ /$1.html [L]
...rewrites ".htm" to ".html", right? If so, wouldn't it be good to add a 301 directive to ".htm"? Just to learn a little more ;)
Lastly, I would like to know the correct position for my 404 declaration (before or after which line):
ErrorDocument 404 http://example.org/path/file.html
Thanks again!
...rewrites ".htm" to ".html", right? If so, wouldn't it be good to add a 301 directive to ".htm"?
Very correct, right down to the terminology.
Yes, a 301 redirect would be a good idea.
You could probably be efficient with something like:
RewriteRule ([^.]+)\.htm$ http://example.org/$1.html [R=301,L]
The regular expression says, "Any one or more characters, which are not a .(dot), followed by a .(dot)htm" should "qualify" for the redirect.
Lastly, I would like to know the correct position for my 404 declaration (before or after which line):
ErrorDocument 404 http://example.org/path/file.html
I usually put error docs first, but either jdMorgan or g1smd might have a better location.
Either way, you will want to change the document path to:
ErrorDocument 404 /path/file.html
Using a full URL (one starting with http://) makes Apache send the client a 302 redirect to the error page, so the client never receives the desired 404 status.
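Side by side, the two forms look like this. This is only a sketch that reuses the /path/file.html location from the question; the second form is shown commented out for contrast.

```apache
# Local path: Apache serves the page internally and the response keeps
# its 404 status.
ErrorDocument 404 /path/file.html
#
# Full URL (commented out): Apache would instead redirect the client to
# this address, so the client sees a redirect status rather than 404.
# ErrorDocument 404 http://example.org/path/file.html
```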
Justin
Figured I'd throw my two cents in too.
The order does not matter: each Apache module scans your .htaccess file and executes only the directives it understands. For example, mod_alias runs and executes all of your mod_alias directives, ignoring all the others. Then mod_rewrite runs and executes all of your mod_rewrite directives, skipping all the others, and so on.
So the directives for any one module are executed in the order that you write them in your .htaccess file, but the server controls the order in which the modules run, and therefore the order in which each module's group of directives is executed.
The module execution order is determined by the reverse order of the LoadModule list on Apache 1.x, and by an internal priority scheme on Apache 2.x.
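As an illustration of that grouping, here is a small sketch of an .htaccess file that mixes directives from three modules. All file names in it are hypothetical. Going by the description above, the two RewriteRule lines are evaluated top to bottom relative to each other, while the placement of the Redirect and ErrorDocument lines among them has no effect on the result.

```apache
# mod_rewrite directive
RewriteEngine on
# mod_alias directive, written between the mod_rewrite lines
Redirect 301 /old-page.html http://example.org/new-page.html
# mod_rewrite directives: evaluated in this order relative to each other,
# regardless of the Redirect line sitting above them
RewriteRule ^first\.html$ http://example.org/second.html [R=301,L]
RewriteRule ^third\.html$ http://example.org/fourth.html [R=301,L]
# core directive: its position among the lines above makes no difference
ErrorDocument 404 /path/file.html
```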
Jim
Thanks! I've changed my 404 to
ErrorDocument 404 /path/file.html
You could probably be efficient with something like:
RewriteRule ([^.]+)\.htm$ http://example.org/$1.html [R=301,L]
As far as I can tell, what you wrote is for the main index.htm(l) page only, but what I was thinking about was for the whole site... so my question is whether the following is enough for that, 301 redirect included:
# rewrite *all* .htm URLs to .html files and do a 301
RewriteRule ^(.*)\.htm$ /$1.html [R=301,L]
@Jim
After looking again at the code you provided, I began thinking that if I'm to use a 301 for ".htm" files, it wouldn't make too much sense to still use this directive:
# If requested .htm URL exists as an .html file
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
so that
RewriteRule ^(.*)\.htm$ /$1.html [R=301,L]
would be enough. Is this right?
Thanks again guys!
The rule I suggested does the same thing as this:
RewriteRule ^(.*)\.htm$ http://example.org/$1.html [R=301,L]
With a different regular expression.
The .* expression will match *everything* to the end of the line, then have to "back up" from the m to the .(dot), causing extra processing. (jdMorgan can explain the process more accurately than I can. I believe it matches "through" once, then "backs up" one character, then attempts to match again, and then "backs up" another, and so on.)
This regular expression [^.]+ is "forward looking" and matches *anything* except a .(dot), so when it gets to the .(dot) in the URL it "breaks" rather than matching all the way to the end of the line, then the literal .(dot) is matched along with htm in the first pass.
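One caveat worth a sketch, assuming some URL-paths on the site could contain a dot before the extension (the path in the comments is hypothetical): because ([^.]+)\.htm$ is not anchored at the start, it can match only the tail of such a path, and $1 then lacks the leading part. Anchoring the pattern and allowing dots in the captured part avoids that, at the cost of the backtracking described above.

```apache
RewriteEngine on
# Unanchored, dot-free capture (commented out): for a request such as
# "v1.2/page.htm" the match starts after the first dot, so $1 is "2/page"
# and the redirect target would be http://example.org/2/page.html.
# RewriteRule ([^.]+)\.htm$ http://example.org/$1.html [R=301,L]
#
# Anchored capture that allows dots: for the same request $1 is
# "v1.2/page", so the target is http://example.org/v1.2/page.html.
RewriteRule ^(.+)\.htm$ http://example.org/$1.html [R=301,L]
```

If no path on the site contains an extra dot, the two rules behave the same and the [^.]+ form remains the cheaper one.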
Also a good "rule", much like, "always use the L flag, unless you *know* you do not need it", is "use a canonical (http://example.org/) URL when redirecting (seen by the browser, R=NUM flag) and a relative (/local-path/to-file.ext) URL when rewriting (transparent to the browser)."
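A minimal sketch of that rule of thumb, with hypothetical file names, might look like this:

```apache
RewriteEngine on
# External redirect (the browser sees the new address): full canonical URL
# as the target, together with the R flag.
RewriteRule ^old-name\.html$ http://example.org/new-name.html [R=301,L]
#
# Internal rewrite (invisible to the browser): local path as the target,
# no R flag.
RewriteRule ^pretty-name$ /real-file.html [L]
```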
Hope this helps and makes sense.
Justin
Added: Yes to your last question. If you are redirecting one version to the other, you do not need to block access to either version. (Actually, you are blocking access in a more "friendly" manner by redirecting.)
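Pulling the suggestions from this thread together, one possible ordering is sketched below. It has not been tested on a live server, so check each case with a header checker as advised earlier. The "html?" in the first rule, which lets index.htm go straight to the directory URL, is my own assumption and not part of the code posted above. Every redirect targets the canonical host directly and the most specific rules come first, so that each request should be redirected at most once.

```apache
ErrorDocument 404 /path/file.html

RewriteEngine on

# 1. "/index.html" or "/<dir>/index.html" (also index.htm) requested by
#    the client -> "/" or "/<dir>/" on the canonical host
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://example.org/$1 [R=301,L]

# 2. Any other .htm URL -> the matching .html URL on the canonical host
RewriteRule ^(.+)\.htm$ http://example.org/$1.html [R=301,L]

# 3. Everything else on a non-canonical host, except robots.txt
RewriteCond %{HTTP_HOST} !^example\.org
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule (.*) http://example.org/$1 [R=301,L]
```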