Remove file extensions and 301 to the new page by directory

Forum Moderators: phranque

madmatt69

12:49 am on Jul 15, 2008 (gmt 0)

Hey all,

I'm trying to set up a directory rule in my httpd.conf to test removing file extensions from the URLs of the documents in that directory.

So right now I have:
www.mysite.com/test/test.shtml

I'd like to change it to www.mysite.com/test/test

And redirect any requests to test.shtml to the new test.

How would I go about doing that? I think I can get the 301s working, but I'm not sure how to remove the extensions and still have the pages render properly.

Thanks for any help!

jdMorgan

1:16 am on Jul 15, 2008 (gmt 0)

You first remove the extensions from the links on your pages.

Then you use one or more mod_rewrite RewriteRules to 'find' the proper files associated with those extensionless URLs, when those URLs are requested from your server by a client (e.g. a click on an extensionless link).

Finally, if a URL is requested *with* an extension, you 301-redirect to the extensionless URL. It is important to do this only if the original client request was sent with an extension -- You must use a RewriteCond to check %{THE_REQUEST} to do this, so as to prevent an 'infinite' rewrite/redirect loop due to interaction with the rule in the second step above.

While the first two steps are required, this last step is not. It is usually done to speed up re-indexing of the site with its new extensionless URL-set, to preserve traffic from old, un-updated inbound links and the PageRank/Link-popularity from those links, and to preserve the function of your visitors' old bookmarks. On a new site with extensionless URLs, the only reasons to do this last step would be to make sure that your internal site workings remain 'hidden' and to prevent certain exploits.
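The second and third steps above can be sketched as follows for the /test/ directory from the original question. This is a minimal, untested sketch and not a drop-in solution: it assumes a per-directory context (an .htaccess file in /test/, or an equivalent Directory section in httpd.conf), it reuses the www.mysite.com hostname and .shtml extension from the question, and it does not redirect requests that carry a query string.

```apache
RewriteEngine On

# Step 3: externally redirect, but only when the client itself asked for
# a URL ending in .shtml. THE_REQUEST holds the original request line,
# so the internal rewrite below can never trigger this rule.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /test/([^.]+)\.shtml\ HTTP/
RewriteRule \.shtml$ http://www.mysite.com/test/%1 [R=301,L]

# Step 2: internally rewrite an extensionless URL to the matching
# .shtml file, but only if that file actually exists on disk.
RewriteCond %{REQUEST_FILENAME}.shtml -f
RewriteRule ^([^.]+)$ /test/$1.shtml [L]
```

With these rules, a request for /test/test.shtml should receive a 301 to /test/test, and a request for /test/test should be served from test.shtml without the client seeing the extension. The first step, removing the extensions from the links in the pages themselves, still has to be done by hand.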

Remember, files need extensions (for example, to tell the server how to handle them properly as well as which MIME-type header to send to tell the client how to handle them), but URLs don't -- That's why we speak of "file extensions" even when discussing URLs. URLs and filenames are not at all the same thing, and need not resemble each other, because mod_rewrite and/or scripts can be used to modify the default URL-to-filename mapping of the server.

The most basic function of an HTTP server is to translate from the URL system used on the Web to the proprietary (and often arbitrary) file-naming system used by the server, its operating system, and its webmaster(s). The purpose of a URL is to provide a resource location method that is independent of servers' operating systems and filesystems.

There are many threads here on extensionless URL-handling; Try a site search (link at top left of this page) for "extensionless URL RewriteCond THE_REQUEST" for fairly-well targeted results.

Jim

g1smd

9:41 pm on Jul 15, 2008 (gmt 0)

Can we sticky the above explanation somewhere?

It clarifies so many things that perpetually trip people up.

madmatt69

10:15 pm on Jul 15, 2008 (gmt 0)

Heya,

I must say that's an excellent response that helps explain everything to me. It gives me a better idea of what needs to be done.

Thanks again!

zedjay

6:35 pm on Jul 26, 2008 (gmt 0)

That's a great response. I've been trying for hours to make this technique work (specifically, removing 'file extensions' from URLs, redirecting to URLs with a trailing slash then finally rewriting to files with normal html extensions). Only after reading this post was I finally able to succeed. However, I still don't fully understand what my original problem was.

I managed to fix the problem by changing my code to match html extensions in %{THE_REQUEST} rather than in %{REQUEST_URI}. Before I had done this, I was creating an infinite rewrite/redirect as Jim describes.

My question is, why did this make a difference when I changed from %{REQUEST_URI} to %{THE_REQUEST}? I've posted the external redirect code here. Happy to post the remaining code I was using if this is necessary.

Old code to catch .html requests that created infinite loop:
RewriteCond %{REQUEST_URI} .*\.html$
RewriteRule ^(.*)\.html$ http://www.mysite.co.uk/private/development/$1 [R=301,L]

New code that works:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^.]+\.html\ HTTP
RewriteRule ^(.*)\.html$ http://www.mysite.co.uk/private/development/$1 [R=301,L]

Thanks for your help!
zedjay

g1smd

8:10 pm on Jul 26, 2008 (gmt 0)

It makes a difference because THE_REQUEST is what the browser originally requested, and isn't altered by any of the processing that may have already gone on in the previous .htaccess rules.
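To illustrate, here is a hedged trace of how the two variables would behave across passes. The rule pair is hypothetical (zedjay's full rule set was not posted) and the /page URL is invented for the example; the comments describe the expected values, not captured log output.

```apache
# Rule A (external redirect), written the looping way:
RewriteCond %{REQUEST_URI} \.html$
RewriteRule ^(.*)\.html$ http://www.mysite.co.uk/private/development/$1 [R=301,L]

# Rule B (internal rewrite), hypothetical:
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^([^.]+)$ $1.html [L]

# Client sends: GET /private/development/page HTTP/1.1
#
# Pass 1: REQUEST_URI = /private/development/page
#         THE_REQUEST = "GET /private/development/page HTTP/1.1"
#         Rule A does not match; Rule B rewrites to page.html.
#
# Pass 2: REQUEST_URI = /private/development/page.html   (changed)
#         THE_REQUEST = "GET /private/development/page HTTP/1.1" (unchanged)
#         Rule A now matches REQUEST_URI and redirects back to /page,
#         the client requests /page again, and the cycle repeats.
#
# Testing THE_REQUEST in Rule A instead breaks the cycle, because in
# pass 2 it still shows that the client never asked for ".html".
```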

zedjay

9:47 pm on Jul 26, 2008 (gmt 0)

Ah, I understand. Thanks @g1smd.

I presume this means that REQUEST_URI is altered step by step by the rewrite rules we give it (i.e., the whole idea behind being able to incrementally adapt a URL with rewrites), whereas THE_REQUEST will remain static. I'm really new to using .htaccess and Apache so I'm just getting my head around it.

One last thing that still has me a little confused: It seems intuitive to me that after an external redirect all the rules will be run again, as the .htaccess file is essentially saying 'not at this location, try somewhere else', so a new request is sent. However, it surprised me that after an internal rewrite marked [L] that the URL was being passed back to the top rule and iterated through the rules again. After a while, I realised what was going on since I was receiving a loop, but I'd interpreted the Apache mod_rewrite docs as saying that [L] forced the rules to discontinue and return the contents to the browser. Any chance you could give a layman's explanation of why this happens?

Thanks!

jdMorgan

2:30 pm on Jul 27, 2008 (gmt 0)

You surmise correctly the operation of REQUEST_URI -- It is updated by any RewriteRule that applies.
THE_REQUEST is the actual HTTP request line sent by the browser (e.g. "GET /page.html HTTP/1.1"); As such, it is unaffected by internal rewrites, and will only change as the result of a new request from the client (e.g. after an external redirect).

The RewriteRule [L] flag terminates rule processing for the current iteration only. If any rewrite has been invoked in the current iteration, rule-processing begins again from the top.

Consider that some RewriteRules are used to enforce access control (see RewriteRule [F] flag). It is also possible that an errant rewrite might result in a match with a rule that triggers a 410-Gone response (see [G] flag). For these and other reasons, the rewritten URL must be checked again, in case it matches an access-control or non-existent-file rule.

Therefore, mod_rewrite in an .htaccess context requires explicit loop prevention -- either in the case where a substitution URL matches the rule's own pattern, or in the case where two complementary rules contain substitutions which match each other's patterns (which was the case with your rewrite-then-redirect looping).
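Besides the THE_REQUEST test used earlier in this thread, there are other ways to make the loop prevention explicit. The sketch below shows two of them. It is a different technique from the one discussed above, not a replacement for it, and it rests on assumptions: an .htaccess file in the /private/development/ directory from zedjay's post, a directory that lies under the DocumentRoot, and Apache 2.x behaviour where the REDIRECT_STATUS environment variable is empty on the first pass and set once an internal redirect has happened. The [END] flag requires Apache 2.4 or later.

```apache
RewriteEngine On

# Guard 1: redirect only on the first pass. REDIRECT_STATUS is empty
# until the request has been internally redirected, so a URL that was
# rewritten to .html by the rule below will not be redirected again.
RewriteCond %{ENV:REDIRECT_STATUS} ^$
RewriteRule ^(.+)\.html$ http://www.mysite.co.uk/private/development/$1 [R=301,L]

# Guard 2 (Apache 2.4+): [END] stops rewrite processing for this
# request entirely, so the rules are not run again from the top
# after the rewrite, which [L] alone does not prevent.
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^([^.]+)$ $1.html [END]
```

Either guard on its own should be enough to break the rewrite-then-redirect cycle; both are shown together only for comparison.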

Jim

zedjay

3:33 pm on Jul 27, 2008 (gmt 0)

Ah ok, gotcha. Thanks Jim!