301 file extension => non-extension

         

Readie

4:58 am on Feb 24, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok, due to some of the advice I got regarding the damage a poor 301 can do to my site, I decided to ask before I implement this.

I want to force any URL that includes the file extension to the non-file extension version of the same URL request - even if it's the incorrect file extension.

Now, I have two questions, first: is this a good idea on my part? And second: would the following rewrite rule do it?

RewriteRule ^(.*)\.([^/]+)$ /$1 [R=301,L]

Readie

6:33 am on Feb 24, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



After a little thought, I've decided that, assuming it even works, a better way of writing it would probably be:

RewriteRule ^([^\.])\.([^/]+)$ /$1 [R=301,L]

jdMorgan

3:22 pm on Feb 24, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I prefer "Door #3" -

RewriteRule ^(([^/]*/)*[^/.]+)+\.[^/.]+$ http://www.example.com/$1 [R=301,L]

"Match anything but a slash (or blank) followed by a slash as many times as possible (directory-path), then anything but a slash or a dot followed by a dot (filename), remember everything up to the dot, then match anything but a slash or a dot in the final path-part (filetype)."

This pattern won't match filenames with multiple dots in them, but should do for most cases. If you do have filenames with multiple dots, then "^(([^/]*/)*([^/.]*\.)*[^/.]+)+\.[^/.]+$" would work, but will be considerably slower to match.

Note that the parentheses are nested, so that $1 contains everything that it should.

Always put your canonical hostname into the substitution URL-path for external redirects, in order to avoid problems with a conflicting (non-canonical) ServerName plus UseCanonicalName configured in the admin-only server config.
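To make the hostname point concrete, here is a small sketch contrasting the two forms of redirect target. It uses www.example.com as the placeholder canonical hostname and a single-extension version of the pattern; both are assumptions for illustration. With a path-only target, Apache builds the scheme, host, and port itself, and the hostname it picks can depend on ServerName and UseCanonicalName.

```apache
# Path-only target (shown commented out): Apache prepends scheme, host
# and port itself, so the hostname may come from ServerName.
# RewriteRule ^(([^/]*/)*[^/.]+)\.[^/.]+$ /$1 [R=301,L]

# Absolute target: the redirect always names the canonical host explicitly.
RewriteRule ^(([^/]*/)*[^/.]+)\.[^/.]+$ http://www.example.com/$1 [R=301,L]
```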

Jim

Readie

5:59 pm on Feb 24, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thank you for your reply.

Implementing the above caused me a few problems, as listed below:

The entire site went completely mashed up with regards to aesthetics - fixed with
RewriteRule ^sitemain/style\.css$ - [L]


All requests for any page not the home page caused "Firefox has detected that the server is redirecting the request for this address in a way that will never complete."

I suspect the issue here is the need for a RewriteCond of some description - only applying this rule when there is an extension in the URL entered.

Perhaps (as a wild guess):

RewriteCond %{REQUEST_URI} ^(([^/]+)[^\.]+)\.([^\.])$ -f

jdMorgan

6:52 pm on Feb 24, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Of course the formatting went pear-shaped... The code will remove the ".css" from the requested filepath, you likely then rewrite the extensionless URL request to a script, now thinking that it's a 'page' request, and either the script won't find a page by that name, or the browser won't know what to do with the non-css-file that the script does send back.

So, what kind of requests do you *really* want to remove the filetypes from? You did say "any URL" in your initial post, but I doubt that you meant that... We know that CSS files are out, but how about "sitemap.xml" and "robots.txt"? What about images? I'd make a list of all filetype extensions that you *do* want to remove, another of all those that you don't want to remove, and implement the shorter or more stable list as an inclusion or exclusion list in one or more RewriteConds.
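As an illustration of the inclusion-list approach, a sketch along these lines would redirect only the listed extensions and leave everything else (CSS, images, robots.txt, sitemap.xml) untouched. The extension list and the www.example.com hostname are placeholders. The sketch does not yet guard against internally rewritten paths, so on a site that rewrites extensionless URLs back to files it can still loop.

```apache
# Inclusion list held in a RewriteCond: only these "page" extensions
# are redirected. The list itself is a placeholder.
RewriteCond %{REQUEST_URI} \.(php|html)$
RewriteRule ^(([^/]*/)*[^/.]+)\.[^/.]+$ http://www.example.com/$1 [R=301,L]
```

An exclusion list would use the same shape with a negated condition (a leading "!" on the pattern) naming the filetypes to leave alone.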

Only if either method results in a list of more than 50 or so filetypes would I consider using a disk check to make this decision -- you could well end up having to pay for a server upgrade using that approach on a busy site...

Also, if you are internally rewriting the resulting extensionless-URL requests to a filepath that does have an extension (e.g. "index.php"), then that rewritten path must be excluded from the redirect. And if the redirect removes many file extensions and you rewrite to several or many scripts (e.g. in several subdirectories), then it will likely be simpler to exclude anything but direct client requests from this redirect (which may be another aspect not considered in your initially-stated requirements.)
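One way to limit a redirect to direct client requests is to test %{THE_REQUEST}, which holds the original request line sent by the browser (for example "GET /page.php HTTP/1.1") and is not altered by internal rewrites. A minimal, untested sketch, assuming .php is the only extension being removed and www.example.com is the canonical hostname:

```apache
# THE_REQUEST contains "/page.php" only if the client actually asked for it.
# After an internal rewrite of /page to /page.php it still shows "/page",
# so the redirect below is not re-triggered.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[^?\ ]*\.php(\?[^\ ]*)?\ HTTP/
RewriteRule ^(([^/]*/)*[^/.]+)\.php$ http://www.example.com/$1 [R=301,L]
```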

This is one reason we do go on here about thoroughly defining requirements before proceeding to coding. It's also one of the only reasons that I ever manage to go to bed without feeling somewhat guilty about not answering *all* outstanding questions -- In some cases, it's fairly clear that the original stated requirements could do with quite a bit more consideration...

Jim

Readie

8:20 pm on Feb 24, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Quite a bit more thought indeed in this one.

I guess the wisest course of action here... I'm trying to hide the file types of my pages, so really I need to compensate for the file types I use, and some I don't. So, re-writing requests for

.htm
.html
.php
.php3
.php4
.xhtm
.xhtml
.asp
.shtml
.shtm

Would probably be enough - although this is beginning to get complicated in my mind, as it then needs to re-write the extensionless request to the correct file without re-triggering this 301... But it also needs to not trigger this request even if there is no matching file to avoid returning an error 500, so a simple list of exclusions would be insufficient.

I'm beginning to doubt the validity and feasibility of this idea, as it seems this would screw with the existing re-writes, and the only way I can think of to prevent that would defy the whole point of implementing this in the first place.

g1smd

8:43 pm on Feb 24, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's a perfectly valid thing to do and hundreds of thousands of sites use it daily.

You need to be clear that files on the server need the extension so the OS knows what type of files they are, but URLs do not have to have an extension at all.

Mod_Rewrite "translates" a URL request into the action of getting a file from the server. The URL has to contain enough information for the Rewrite to work out which file to get.

That is, how will Mod_Rewrite know that a request for example.com/thisfile needs to get thisfile.htm from the server, while a request for example.com/thatfile needs to get thatfile.php from the server?

jdMorgan

1:10 am on Feb 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Checking "file-exists" to avoid a 404 and as a multi-step mechanism to "figure out" what file extension needs to be added is only going to be feasible with, say, three or four filetypes, and then only if one of those filetypes makes up the majority of requested pages -- We'd put the check for that filetype first, so that it would almost always match, avoiding the need to do the additional file-exists checks.

So, at this point it's time to ask, can you reduce the number of filetypes? For example, would it be possible to rename all your .htm files to .html, or vice-versa? How about renaming all .shtm files to .shtml or vice-versa? xhtm/xhtml? And what about the php files -- do you really need all those versions, or are there really only one or two php interpreters on this server?

If you could get the number of filetypes down to 3, 4, or 5, this might be doable.

Alternately, maybe some of those filetypes are so rarely used that they either won't be a problem (performance-wise) to check, or perhaps you could accept a few of those filetypes keeping their extensions in URLs...

I'm waffling a bit on my "limit of three" mentioned above; You could do a few more, but you certainly don't want to go check the disk up to ten times (ten filetypes in your list) for a single page request, or five times on average per page request...

The code's not actually that difficult, but we're still working out the actual requirements here...

Handle five filetypes, in outline:
If extensionless URL-path requested {
    If requested URL-path plus .php exists as a file {
        rewrite to .php file
    }
    Else if requested URL-path plus .html exists as a file {
        rewrite to .html file
    }
    Else if requested URL-path plus .shtm exists as a file {
        rewrite to .shtm file
    }
    Else if requested URL-path plus .xhtm exists as a file {
        rewrite to .xhtm file
    }
    Else if requested URL-path plus .asp exists as a file {
        rewrite to .asp file
    }
    Else return 404-Not Found
}

Jim

Readie

3:06 am on Feb 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I actually only use about 4 of the file types in that list - I was sort of saying that, to hide the file type in use, a possible method would be to re-direct several requests for the same file with different extensions, with only one of them being the actual file type - that way I would avoid issues with requests for files that it doesn't really matter if the user sees.

But, because my mind works in very roundabout ways it took me a while to come to the conclusion that there is probably a much simpler way.

Since the non-web-pages I use are restricted to gif, png, jpg, css, js, txt, ico and xml, I should be able to say something along the lines of:

if (extension exists in request) {
    if not (.gif, .png, .jpg, .css, .js, .txt, .ico, .xml) {
        301 extension => non extension
    }
}


[the sequence of if/else/elseif's mentioned in your above post]
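A rough mod_rewrite rendering of the first block of that outline, offered as an untested sketch. The negated RewriteCond carries the exclusion list. The %{THE_REQUEST} test is an addition not present in the outline: it limits the redirect to extensions in the URL the client actually requested, so later internal rewrites to real files do not re-trigger it. The hostname www.example.com is a placeholder.

```apache
# Only act when the client's original request line contains an extension
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[^?\ ]*\.[^/.?\ ]+(\?[^\ ]*)?\ HTTP/
# Exclusion list: leave these non-page filetypes alone
RewriteCond %{REQUEST_URI} !\.(gif|png|jpg|css|js|txt|ico|xml)$
# Everything else with an extension is redirected to the extensionless URL
RewriteRule ^(([^/]*/)*[^/.]+)\.[^/.]+$ http://www.example.com/$1 [R=301,L]
```

Consecutive RewriteCond lines are ANDed by default, so both conditions must hold before the redirect fires.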

Course, problem with that is 5 months down the line I try and use a bmp or something and spend 2 hours freaking out wondering why it won't work.

jdMorgan

3:40 am on Feb 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would go with redirecting the set of 'page' filetypes, myself, and limit the list to those that you actually use. In this way, if you add a new one or forget to redirect an existing one, the result is that there's no redirect, and it's easy to see that that page filetype remains in the address bar. Other than the aesthetics, there's no harm done. Plus it's an incentive to limit the number of 'page' filetypes to those already in use.

May we have your list of page filetypes actually in use, please?

Jim

Readie

5:51 am on Feb 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, the argument for my method is that a misspelling of the extension still gives the file, but, as I said, I just know that evening will come where I scream at the server to load the new file type I've had to use for some reason or another :)

.php
.html
.shtml

... I'm not the sole webmaster of this site, so I'm not 100% certain that's a definitive list.

jdMorgan

4:14 pm on Feb 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, so with the problem and requirements fairly-well described, we come up with this:

# Externally redirect only direct client requests for "page" URL-paths
# with appended file extensions to corresponding extensionless URLs
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/?\ ]*/)*[^/.?\ ]+\.(php|s?html)(\?[^\ ]*)?\ HTTP/
RewriteRule ^(([^/]*/)*[^/.]+)\.(php|s?html)$ http://www.example.com/$1 [R=301,L]
#
#
# The following three rules invoke OS calls to check the filesystem which may invoke
# physical disk accesses. These checks are very resource-intensive. Therefore, do not
# add additional filetypes without considering this performance impact.
#
# Internally rewrite extensionless request to .php file (if it exists)
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule ^(([^/]*/)*[^/.]+)$ /$1.php [L]
#
# Internally rewrite extensionless request to .html file (if it exists)
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(([^/]*/)*[^/.]+)$ /$1.html [L]
#
# Internally rewrite extensionless request to .shtml file (if it exists)
RewriteCond %{REQUEST_FILENAME}.shtml -f
RewriteRule ^(([^/]*/)*[^/.]+)$ /$1.shtml [L]
#

Your staff needs to be made aware that checking for "file exists" as done in the last three rules above involves a call to the operating system to check the filesystem. This may result in a physical disk read if the current filesystem cache is marked "dirty" (i.e. stale), and can markedly slow down the server.

Therefore, all reasonable efforts should be made to avoid adding new filetypes for which extensionless URLs are desired, and to get rid of unnecessary existing filetypes requiring extensionless URLs.

I suggest that you leave my comments in the code for this reason.

If a URL is requested with a file extension not handled by the code above, the result will be that that request will be served directly, and no redirect to the extensionless URL will occur. Therefore, it will be easy to detect the situation where someone comes up with a new file extension, because they will complain that the URL still shows an extension in their browser address bar, or that trying to request their new filetype using a URL without an extension does not work.

Not tested. Hopefully, no typos in the code... Delete your browser cache before testing any new code.

Jim

[edit] Corrected as noted below. [/edit]

[edited by: jdMorgan at 6:49 pm (utc) on Feb 25, 2010]

Readie

5:23 pm on Feb 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thank you for the reply, help and advice :)

Just a thought: would it be wise for me to add:

RewriteCond $1 !\.[^/.]+$
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule ^(([^/]*/)*[^/.]+)$ /$1.php [L]


For each of the RewriteRules?

[edited by: jdMorgan at 6:51 pm (utc) on Feb 25, 2010]
[edit reason] Corrected copied rule as noted below. [/edit]

jdMorgan

5:49 pm on Feb 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not really necessary. The existing pattern takes care of the exact same thing as your first RewriteCond, and since the extensionless URL doesn't (and cannot) have a trailing slash, it cannot refer to a directory.

Do not add additional filesystem checks unless they are absolutely required. On a busy site, you may well end up being forced into a major server upgrade simply due to these additional file-checks! And if your server is already top-of-the-line, then that means you'll be forced to adopt a load-shared multiple-server set-up, and that is major complication, major cost, and major on-going maintenance hassle... (Consider keeping the databases on two (or more) servers in sync, for example, and what will happen if you don't).

The only reason you'd want to add the !-d check is if you allowed an extensionless "page" URL to have the same name as an existing directory, and you wanted requests for that directory to take precedence over the "page." Fair enough, but if you did add the !-d check, then the "page" at that extensionless URL would become completely-inaccessible because of this directory-precedence rule. And in that case, you'd have to change the extensionless page name, which would then render the original !-d check an unnecessary waste of server resources. So therefore it's really unnecessary in the first place...

One thing about mod_rewrite code is that it can have far-reaching effects and implications. So a dozen lines of code can have effects all out of proportion to the number of lines in that code. Even if you understand mod_rewrite directives and regular-expressions completely, it may still be possible to miss a full understanding of what the code does on a real Web site. That's why I always wear a wry grin when I see a new post entitled "Quick question" here. The only "Quick question" I can think of is something like "How do I spell mod_rewrite?" ;)

Jim

Readie

5:57 pm on Feb 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The only "Quick question" I can think of is something like "How do I spell mod_rewrite?"

Lol.

Ok, thank you very much for your time Jim :)

Readie

6:37 pm on Feb 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just a notice, there's a typo within the internal rewrites.

It works with (but could probably be phrased better than):

RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule ^(.*)$ /$1.php [L]


-----

RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(.*)$ /$1.html [L]


etc...

jdMorgan

6:50 pm on Feb 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That was no typo, that was a consistent error in parentheses nesting!

I corrected the code in the post above to prevent further propagation of the error.

Jim

Readie

6:53 pm on Feb 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ahh, I see. I'm glad to see people far more experienced than me occasionally copy and paste mistakes too :P :)

Again, thank you :)

jdMorgan

8:17 pm on Feb 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No, not a copy and paste error. I mis-coded it in my own post, then you copied it in a reply.

I like to correct errors as "high up" in the thread as possible, because otherwise some later readers will copy the bad code without reading further and finding the correction. And when that happens, they may come right back here and require help with that bad code. And if they make no reference to where they found it, then it can take an inordinate amount of time to de-bug it all over again.

Since the number of contributors in this forum is very small, it's best not to waste their time. So if I make a mistake, I'm not at all shy about acknowledging it and correcting it... :)

Jim

Readie

8:31 pm on Feb 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That makes a great deal of sense :)

I also apologise for any offence - I have a slightly cutting sense of humour, I can assure you it was only intended as a joke though :)

jdMorgan

4:20 pm on Feb 26, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



None taken. The more varied and novel work one does, the more mistakes one makes. I'm proud to say that I make a lot.

500-Server Error? Great! -- Only 499 left to go!

Jim

Readie

5:39 am on Feb 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Whilst I find posting some of my more newbie mistakes embarrassing (wish I could delete my first few posts in the PHP forums), I agree they're nothing to actually be ashamed of :)

Now, this is just curiosity on my part:

After uploading the changes to .htaccess - my custom error documents stopped working. I left it, however, as I had much more pressing matters to attend to at the time (A new default skin for the site was required ASAP, and I'm a slow PhotoShopper).

Two days later, with no changes to the .htaccess or to any server configuration etc, they just started working again. Is that common with a big change to the htaccess file?

jdMorgan

4:38 pm on Mar 1, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A browser caching issue, most likely.

Your error documents should be marked as non-cacheable (using mod_headers if they are static pages, or by modifying the PHP to send cache-control headers if they are dynamically-generated).
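For a static error page, a minimal mod_headers sketch might look like the following. The filename 404.html is a placeholder, mod_headers must be enabled, and the "always" condition is used on the assumption that the error document goes out with an error status rather than a 200. Treat it as a starting point and verify the actual response headers in your browser.

```apache
# Hypothetical static error document, served via: ErrorDocument 404 /404.html
<Files "404.html">
    # "always" also applies the header to non-2xx responses
    Header always set Cache-Control "no-store"
</Files>
```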

Servers don't "remember" anything from one HTTP request to the next. Each HTTP request exists on its own until a response is sent, and is both the "first request" and the "last request" ever, as far as a server is concerned. That's why you need client-side cookies and authentication to keep a user "logged-in" for example.

So, a server itself doesn't remember anything even from one HTTP request to the next, much less over "several days."

Always delete your browser cache after making any changes to any server-side code. Forcing a page reload is not always sufficient.

Check your cache-control headers using the "Live HTTP Headers" or Firebug add-ons (or similar) for Firefox and Mozilla-based browsers. Error documents should return "Cache-Control: no-store" and an Expires: header with the current time (saying, in effect, "This page expires right now").

I find the simple Firefox/Mozilla "PrefBar" (Preferences toolbar) to be quite handy for enabling, disabling, and flushing the browser cache, enabling/disabling JavaScript, Java, Flash, GIF animation, page coloring and styles, popups, HTTP referrer headers, and HTTP pipelining -- all with a row of checkboxes in a browser toolbar, and for switching user-agent strings with a drop-down menu from that toolbar.

These are only a few of the very-most-useful settings in this toolbar, and there are even more sophisticated toolbars such as the "Developer Toolbar" available that do all this and more (but I like to keep things simple).

Jim

Readie

1:02 am on Mar 3, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Fair enough. I shall set the error docs to non-cache straight away.

Thanks for the tip about the prefbar as well. Call me simple, but I amused myself for about 10 minutes going "colours on... Colours off... Colours on... Colours off..." :D

jdMorgan

8:25 pm on Mar 3, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oh, on some sites, that "colors off" button is a life-saver for old eyes -- and on a few sites, it's a life-saver for even young eyes. Who on earth thinks that dark-red text on a dark-red background is a good idea? I don't know, but I was on a commercial site a few days ago, and a critical navigation link was displayed in just that manner! :o

Jim

Readie

11:27 pm on Mar 4, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I must admit to having been guilty of doing something along those lines at one point. But I have an excuse!

I was 13 years old, had only been doing HTML for a few months, had no clue CSS existed (link colour in the body tag) and didn't know how to use HTML tables.

On a commercial site... Yeah... Just bleh.