
Apache Web Server Forum

This 77-message thread spans 3 pages; this is page 3.
.htm to Extensionless URLs - Plus Renaming Files
.htaccess on deck
MarkOly




msg:4587286
 11:04 pm on Jun 24, 2013 (gmt 0)

After much deliberation, I've decided to convert from .htm extensions to extensionless URLs. I'm also changing the names of most pages and moving them to subfolders - about 80 pages. I've pieced together the .htaccess code based on the great examples I've cherry picked here.

RewriteEngine On
RewriteBase /

#1 - Redirect requests for old URLs to new URLs
RewriteRule ^old-page\.html?$ http://www.example.com/new-folder/new-page [R=301,L]
# Then repeat the above 80 times.

#2 - Redirect index.html or .htm in any directory to root of that directory and force www
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.html?[^\ ]*\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1? [R=301,L]

#3 - Redirect all .html requests to .htm on canonical host.
RewriteRule ^([^.]+)\.html$ http://www.example.com/$1.htm [R=301,L]

#4 - Redirect direct client request for old URL with .htm extension
# to new extensionless URL if the .htm file exists
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+\.htm\ HTTP/
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^(([^/]+/)*[^.]+)\.htm$ http://www.example.com/$1 [R=301,L]

#5 - Redirect any request for a URL with a trailing slash to extensionless URL
# without a trailing slash unless it is a request for an existing directory
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^.]+)/$ http://www.example.com/$1 [R=301,L]

#6 - Redirect requests for non-www/ftp/mail subdomain to www subdomain.
RewriteCond %{HTTP_HOST} !^(www|ftp|mail)\.example\.com$
RewriteRule ^([^.]+)$ http://www.example.com/$1 [R=301,L]

#7 - Internally rewrite extensionless URL request
# to .htm file if the .htm file exists
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.htm [L]


I'm wondering if it would be a good idea to trim some fat from this. For one thing, in the 80 specific URL redirects (#1), will matching .html as well as .htm be a significant extra burden? Considering that there are 80 lines to go through, would it be better to match only the .htm extension that's actually needed?

If there's one error I see more than any other in my logs, it's the .html requests. That's why I added #3 (redirect .html to .htm). I know you want to avoid multiple redirects, so I'll probably want to get rid of #3. I could probably combine it easily with #4 (redirect .htm to extensionless) if I could delete the file check line in #4 (RewriteCond %{REQUEST_FILENAME} -f). So I'm wondering how important that file check is. There's another file check in #7 (internal rewrite to .htm), so it doesn't seem that necessary. It looks like the file check would prevent Apache from cycling through again in the case of a bad file name. But from what I've read, the filename and directory checks use a lot of resources. So it seems like more resources would be used checking every request for a file name than would be spent on the occasional bad file name request cycling through once more. Am I missing something?

I'm also wondering how important #5 is (remove trailing slash from files), which requires a directory check line (RewriteCond %{REQUEST_FILENAME} !-d). I don't have problems with this error now. The .html requests are a lot more common. But if I were using extensionless URLs, I bet it would be a different story. Is this a common error once you convert to extensionless?

If you see anything else I should be concerned with, please let me know. Thanks for any help!

MarkOly

 

g1smd




msg:4597977
 10:49 am on Jul 31, 2013 (gmt 0)

These comments apply to the code in the examples posted a few hours ago, not to the original posting from several weeks ago. For me, with 30 posts per page, that's the code on the previous page (page 2), not the post shown immediately above (at the top of page 3).

Rule 1: I would remove the closing $ so that old .htm URLs whether requested as .htm or .html and with or without appended junk also redirect. Should this rule also strip parameters if they were requested? The Apache default action is to re-append them. Removing parameters is as simple as adding a question mark to the rule target.

Rule 2: The Condition has [^\ ]* that allows for appended trailing junk or parameters after the index filename. I would alter the Rule pattern to allow index requests with trailing junk to also redirect and for the junk to be stripped.

Rule 8: I would allow URLs with trailing junk to also be redirected to the new URL. I think I would also strip parameters in the redirect.

Rule 5: Should this rule also strip parameters in the redirect if they were requested?

Rule 6: I think the second Condition is redundant. Stripping parameters in this redirect may cause problems elsewhere without a lot of messing about. I'd put up with a redirection chain for some requests, as you have it now.

Rule 13: Is this meant to redirect requests for extensionless-URL pages to http, or should it redirect some other stuff as well? If it only needs to redirect extensionless requests, the Rule pattern can be changed from (.*) to something more specific and you can get rid of at least the second Condition. Should this rule also strip parameters in the redirect if they were requested?

Rule 7: I'm not sure whether the -f test is a good idea or not. Valid and non-valid requests trigger -f to look at the filesystem to see if the file exists. Valid requests then look at the filesystem a second time to fetch that file. The two filesystem accesses make valid requests slightly slower. If the Condition were removed, all requests would look at the filesystem only once, and the file would either be served or Apache would generate a 404 error to say it didn't exist. There's a difference in the error message though. With the -f test present the error would say that "/this-stuff" does not exist, but without the -f test the error would be that "/this-stuff.htm" does not exist, exposing that you're using rewrites to static .htm files.
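
As a rough sketch of the trade-off described above, the two forms of the rewrite might look like this. The patterns are copied from rule #7 in the original post; only one variant would be active at a time, and the difference in the error message is as described in this post, not something verified here.

```apache
# Variant A: test first. Each valid request touches the filesystem twice
# (once for -f, once to serve the file). A miss is left alone, so the
# 404 refers to the extensionless URL that was requested.
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.htm [L]

# Variant B: no test. Every matching request is rewritten unconditionally,
# so a miss produces a 404 that mentions the .htm filename.
# RewriteRule ^(([^/]+/)*[^./]+)$ /$1.htm [L]
```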

All the above is guesswork and I might have made an error in my thinking somewhere.

At some point, you'll renumber your blocks of rules. The convention I use is 11 onwards for rules that block access, 21 onwards for redirects and 31 onwards for rewrites. I also subdivide 11.a, 11.b, etc where merited.
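
A minimal sketch of how that numbering convention might be laid out in an .htaccess file. The individual rules are placeholders borrowed from elsewhere in this thread; only the grouping and the numbers are the point.

```apache
RewriteEngine On

# 11.a - Block access: refuse requests for an extension the site does not use
RewriteRule \.php$ - [F]

# 21.a - Redirect: one old URL to its new location
RewriteRule ^old-page\.htm http://www.example.com/new-folder/new-page? [R=301,L]

# 21.b - Redirect: non-canonical host to www
RewriteCond %{HTTP_HOST} !^(www|ftp|mail)\.example\.com$
RewriteRule ^([^.]+)$ http://www.example.com/$1 [R=301,L]

# 31.a - Rewrite: extensionless URL to the .htm file on disk
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.htm [L]
```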

You've commented your code reasonably well, so it shouldn't be too hard to figure things out when you need to add something extra to the code several years in the future.

MarkOly




msg:4598327
 6:15 am on Aug 1, 2013 (gmt 0)

Still working on this. I'm making a lot of progress though. Your suggestions helped a lot. I'll update tomorrow.

MarkOly




msg:4599204
 8:01 am on Aug 4, 2013 (gmt 0)

Well I took your suggestions and ran with them. I went through and took a fresh look at each rule. Before this, I wasn't really clear on using [^\ ]* for stripping parameters. So I took the challenge to add it to each rule and managed to do it for all but a few. I think every rule has been updated in some way. Instead of going back to reply to your earlier posts, I thought it would be easier to post the updated rules and reply to your posts in the applicable spots. Rules are in bold.

Rule 1: I would remove the closing $ so that old .htm URLs whether requested as .htm or .html and with or without appended junk also redirect. Should this rule also strip parameters if they were requested? The Apache default action is to re-append them. Removing parameters is as simple as adding a question mark to the rule target.

Removed the closing $. Added ? to target. Now it's removing appended junk and query strings.

#1 Redirect requests for old URLs to new URLs
RewriteRule ^old-page\.htm http://www.example.com/new-folder/new-page? [R=301,L]
# Then repeat the above 80 times.

Rule 2: The Condition has [^\ ]* that allows for appended trailing junk or parameters after the index filename. I would alter the Rule pattern to allow index requests with trailing junk to also redirect and for the junk to be stripped.

Done. My first attempt at this had it redirecting anything at all that came after index. So something like example.com/indexation would redirect. I fixed that and shortened it up.

#2 Redirect index requests in any directory to root of that directory, removing trailing parameters, forcing 'http://www.'
RewriteRule ^(([^/]+/)*)index([^\w\-]+[^\ ]*)?$ http://www.example.com/$1? [NC,R=301,L]

Rule 8: I would allow URLs with trailing junk to also be redirected to the new URL. I think I would also strip parameters in the redirect.

Done. The weird thing about this rule is that the condition and pattern are identical, so you would think I could just remove the condition. But when I try doing that, I get server type errors.

Looking at this rule
RewriteRule ^(([^/]+/)*[^.]+)\.html?$
I realized there's yet another possible malformed request:
example.com/blahblah//.html
So I guess the second grouping bracket needs to be [^./] after all. I don't know whether the server interprets // in this location as a null file (an error of some sort, surely?) or as a file called "/.html". In the specific case of .htm or .html you're in the clear because the server has already blocked requests beginning in .ht (I looked in MAMP's config file; that's the wording).

After updating for parameters [^\ ]* I think that fixed some things. So this:
example.com/valid-file//.html redirects to: example.com/valid-file/ which then redirects to: example.com/valid-file
If it's a directory:
example.com/valid-directory//.html gives me a 403 Forbidden. That happened no matter what changes I made.

#8 Redirect remaining .htm or .html requests to extensionless URL, removing trailing parameters, forcing 'http://www.'
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*[^.]+\.html?[^\ ]*\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*[^.]+)\.html?[^\ ]*$ http://www.example.com/$1? [NC,R=301,L]

Rule 5: Should this rule also strip parameters in the redirect if they were requested?

Did that. I moved this rule up. A certain pattern of trailing things redirected faster this way. I've forgotten exactly what that was though. This is one rule I couldn't add the [^\ ]* parameter to. I tried to add it after the slash and it caused server type errors. It doesn't seem to matter though because parameter redirecting and removal is still working on all files and folders. Which leads me to believe that I may not need one of these trailing rules - like maybe this one. But I don't want to mess with it. Everything is working so perfectly now with these three trailing rules. They're really picking almost everything up I can throw at them, even the multiple slashes and mixed trailing things.

I added a file exists check here so that invalid requests would not get the 301 to 404 treatment. I did this anywhere I could get away with it. My philosophy is that most of these rules are invoked so rarely, what's the big deal if it takes a few extra microbeats.

I don't know whether $1 is more efficient than %{REQUEST_URI}. I'd go with the longer form unless there's a big difference in server efficiency, just so I don't have to keep looking back "What $1? Which rule is this again?"

I swapped these out wherever I had the $1. For some reason, %{REQUEST_URI} wouldn't work for me. Strangely, it only worked in one place somewhere around here. And with that one, it wouldn't work with the $1 so I had to use %{REQUEST_URI} there. But everywhere else, I had to change it back to $1
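
One possible reason the two are not interchangeable, offered as a sketch and not as a diagnosis of the rules in this thread: in a root .htaccess file, the path that RewriteRule matches has its leading slash removed, while %{REQUEST_URI} always carries the leading slash and the complete requested path. So $1 is the natural choice when the redirect drops part of the path, and %{REQUEST_URI} fits when the path is passed through unchanged. The hostname follows the thread's own examples.

```apache
# Request: /folder/page/   (root .htaccess)
#   RewriteRule sees   folder/page/
#   $1 below becomes   folder/page
#   %{REQUEST_URI} is  /folder/page/

# Dropping the trailing slash: the target needs only the captured part,
# and supplies its own slash after the hostname.
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)/$ http://www.example.com/$1 [R=301,L]

# Changing only the hostname: the whole path is reused as-is, and no
# extra slash is added because %{REQUEST_URI} already starts with one.
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^ http://www.example.com%{REQUEST_URI} [R=301,L]
```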

#5 Redirect requests with trailing slash to extensionless URL, forcing 'http://www.', if request is a valid .htm file, excluding specific folders
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*[^.]+/\ HTTP/
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^(([^/]+/)*[^.]+)/$ http://www.example.com/$1 [R=301,L]

I wanted to see if I could make a new version of the trailing invalid characters rule with a [^\ ]* in it. It took a while but I did it:

#11 Redirect requests with trailing invalid characters to extensionless URL, removing trailing parameters, forcing 'http://www.', excluding specific file types and folders
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*[\w\-/]*)?[^\w\-/\ ]+[^\ ]*\ HTTP/
RewriteCond %{REQUEST_URI} !\.(css|gif|jpe?g|png|js|ico|xml|txt)$ [NC]
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteRule ^((([^/]+/)*[\w\-/]*)?)[^\w\-/\ ]+[^\ ]*$ http://www.example.com/$1? [R=301,L]

Added parameter redirecting and stripping [^\ ]* here too:

#9 Redirect requests with trailing query string to extensionless URL, removing trailing parameters, forcing 'http://www.', excluding specific folders
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?#\ ]*)\?[^\ ]*\ HTTP/
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteRule (.*) http://www.example.com/$1? [R=301,L]

Can't Rule 13 be expressed as ^([^.]*)$ so you don't have to put all those non-page extensions in a Condition? At this point you've already redirected all requests for .htm/.html

Did that. I moved #13 above #6. This only redirects page and directory requests. It's important for me to keep https images (and stylesheets) from changing to http because I have two different stylesheets - one http version and one https version. I have to do this so customers don't get the mixed content error when checking out.

Rule 13: Should this rule also strip parameters in the redirect if they were requested?

I couldn't find any trailing parameters that make it down this far. But it couldn't hurt to add it, right?

#13 Redirect https: requests to 'http://www.' if request is a valid .htm file or directory, excluding specific folders and file
RewriteCond %{SERVER_PORT} ^443$
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteCond $1 !^file1
RewriteCond %{REQUEST_FILENAME}.htm -f [NC,OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^([^.]*)[^\ ]*$ http://www.example.com/$1? [R=301,L]

Rule 6: I think the second Condition is redundant. Stripping parameters in this redirect may cause problems elsewhere without a lot of messing about. I'd put up with a redirection chain for some requests, as you have it now.

The second condition was for HTTP/1.0. I removed it and the rule still seems to work when testing as HTTP/1.0 at web-sniffer. I don't know how important that is because I haven't read up on HTTP/1.0 yet. I probably need to do this and test for it because I have no idea where I need to add a condition for this and where I don't.

I split up #6 to separate out the non-www to www redirects that are https and need to remain https. I saw where there is a combination non-www to www rule that works for both http and https. I don't think it will work here though because my conditions are different for each (http and https). #6a really seems to only cover image and css files that would have normally got sucked up by #13 above, but didn't because #13 only works for .htm files and directories. So below they remain as https images and css files. The odds of an https non-www image or css file request are beyond remote. But here it is:
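
For reference, the kind of combined rule mentioned above is often written along these lines. This is a generic sketch, not taken from this thread: it assumes the %{HTTPS} variable reports "on" or "off" correctly on the host, and it applies one set of conditions to both protocols, which is exactly why it may not suit a setup where http and https need different tests.

```apache
# Appending a literal "s" gives "ons" for https and "offs" for http.
# The pattern captures that "s" only for https; for http the empty
# alternative matches and %1 is empty.
RewriteCond %{HTTP_HOST} !^(www|webmail)\.example\.com$ [NC]
RewriteCond %{HTTPS}s ^on(s)|
RewriteRule ^ http%1://www.example.com%{REQUEST_URI} [R=301,L]
```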

#6a Redirect https: requests for non-www and non-webmail subdomains to 'https://www.' if request is a valid non-.htm file, excluding specific folders
RewriteCond %{HTTP_HOST} !^(www|webmail)\.example\.com$ [NC]
RewriteCond %{SERVER_PORT} ^443$
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule (.*) https://www.example.com/$1 [R=301,L]

#6b covers non-www to www redirects for http:

#6b Redirect http: requests for non-www and non-webmail subdomains to 'http://www.' if request is a valid .htm file, non-.htm file, or directory
RewriteCond %{HTTP_HOST} !^(www|webmail)\.example\.com$ [NC]
RewriteCond %{SERVER_PORT} !^443$
RewriteCond %{REQUEST_FILENAME}.htm -f [NC,OR]
RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

Rule 7: I'm not sure whether the -f test is a good idea or not…With the -f test present the error would say that "/this-stuff" does not exist, but without the -f test the error would be that "/this-stuff.htm" does not exist, exposing that you're using rewrites to static .htm files.

Well I did remove the -f test. If there's anywhere it makes sense to remove it, it would be this rule, the one that every single page request has to go through. Is that a problem though? - that is exposing that I'm using rewrites to static .htm files?

#7 Internally rewrite extensionless URL requests to .htm file
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[^.]+[^./]\ HTTP/
RewriteRule ^([^.]+[^./])$ /$1.htm [L]

At some point, you'll renumber your blocks of rules. The convention I use is 11 onwards for rules that block access, 21 onwards for redirects and 31 onwards for rewrites. I also subdivide 11.a, 11.b, etc where merited.

I'll do that. No Dewey Decimal though? :) JK

Now, what if someone comes in with a request for an extension you don't use at all? At one time I had a global [NS] block on requests ending in .php just because a 403 Forbidden is so much more satisfying than a 404.

Well, the parameter stripping [^\ ]* seems to be redirecting these to the extensionless URL. I probably need to do something about blocking php requests though, because I do see occasional php hack attempts in my logs. I'm sure there are plenty of anti-hack and anti-robot things I should address. I'll have to come back to it in a month or two to wrap up loose ends like that, which I didn't get to this time.
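
A minimal sketch of the kind of block described in the quoted post, assuming the site serves no .php files at all. The [F] flag returns 403 Forbidden, and [NS] keeps the rule from firing on internal subrequests. Placing it near the top of the file, before the redirects, is an assumption about rule order, not something stated in the thread.

```apache
# Refuse any request whose path ends in .php (case-insensitive).
# The query string is not part of what RewriteRule matches, so
# /anything.php?x=y is caught as well.
RewriteRule \.php$ - [F,NS,NC]
```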

So far, so good though. I've tested this a lot by now. This time, I also tested in IE8 to make sure I'm not getting the mixed content warning. I guess Firefox doesn't have an option for that anymore. I never even thought of that when I upgraded. The rules process lickety-split though, even the ones with multiple file checks. I need to test this on dial-up though. I wonder if that free dial-up option still comes with Roadrunner? It's been a few years.

g1smd




msg:4599216
 10:05 am on Aug 4, 2013 (gmt 0)

The [^\ ]* pattern tests for "not a space" and is only relevant when it is added to a RewriteCond that is testing THE_REQUEST. It's the wrong thing to add in other places. It's looking for the space before HTTP in the literal request from the browser:
GET /something.php?this=that&something=theother HTTP/1.1
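
To make that concrete, here is an annotated sketch of how a THE_REQUEST condition lines up against the request line above. The rule is illustrative only (the redirect target is a placeholder); the point is which part of the pattern consumes which part of the request, and why the RewriteRule pattern needs no [^\ ]* of its own.

```apache
# Request line:  GET /something.php?this=that&something=theother HTTP/1.1
#
#   ^[A-Z]+           GET
#   \                 the space after the method
#   /something\.php   the requested path
#   [^\ ]*            ?this=that&something=theother (everything up to the next space)
#   \ HTTP/           the space before the protocol, then "HTTP/"
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /something\.php[^\ ]*\ HTTP/
# The RewriteRule pattern is matched against the path alone (no query
# string, no protocol), so leaving off the closing $ is enough here.
RewriteRule ^something\.php http://www.example.com/something? [R=301,L]
```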


Rule 2 - This wouldn't redirect /indexation if there were a literal escaped period after "index" in the pattern.

The pattern ^<rest of pattern>index\.htm with no trailing $ matches URL requests with anything after htm - an l, or any type of appended junk, likewise the pattern ^<rest of pattern>index\. will match "index dot anything".

I would simplify the Rule pattern from ^(([^/]+/)*)index([^\w\-]+[^\ ]*)?$ to ^(([^/]+/)*)index\.htm with no trailing $. This matches .htm and .html and .htm<anything>.
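
Put together with the THE_REQUEST condition from the original version of this rule, the simplified form might read as follows. This is an assembled sketch, not code posted in the thread; note that, unlike the longer pattern, it does not match a bare /index with no extension.

```apache
# Redirect index.htm, index.html, or index.htm<anything> in any directory
# to that directory's root, dropping any query string.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.htm[^\ ]*\ HTTP/
RewriteRule ^(([^/]+/)*)index\.htm http://www.example.com/$1? [R=301,L]
```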



Rule 8 - the Condition purposely tests THE_REQUEST so that the pattern will be a match only when something was requested as a URL from somewhere out there on the web, and not as the result of matching a prior internal rewrite. This prevents an infinite loop.

The Rule pattern (Rule not Condition) can be simplified from ^(([^/]+/)*[^.]+)\.html?[^\ ]*$ to ^(([^/]+/)*[^.]+)\.htm with no trailing $. This matches .htm and .html and .htm<anything>.


Rule 6a and 6b - Is there any request that can lead to an infinite http-https-http-https loop? The rules don't look "symmetrical" and "opposite".

lucy24




msg:4599222
 10:34 am on Aug 4, 2013 (gmt 0)

If it's a directory:
example.com/valid-directory//.html gives me a 403 Forbidden. That happened no matter what changes I made.

I strongly suspect it's running into a config-level rule that says something like
<FilesMatch "^\.ht">
Order Allow,Deny
Deny from all
</FilesMatch>
This is normal in shared-hosting setups because they need to ensure that nobody gets into an .htaccess or .htpasswd file. Well, you'd do it on your own server too, only then you might not have htaccess files to protect.
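
The directives quoted above use the Apache 2.2 access-control syntax. On Apache 2.4 the equivalent protection is normally written with Require, roughly as below; which form a given shared host uses is not something this thread establishes.

```apache
# Apache 2.4 style: deny all access to files whose names start with ".ht"
<FilesMatch "^\.ht">
    Require all denied
</FilesMatch>
```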

This thread has been going on for quite a while, so I can no longer remember if you're testing in a WAMP-or-similar setup. If yes, take a closer look at the default config file. If you find a rule involving .ht try commenting it out and see if that affects your rule. If yes, it means that you can't go any further. If no, keep looking.

Or ignore the problem and proceed on the assumption that you will not get an awful lot of typo requests for /valid-directory//.html ;)

MarkOly




msg:4599297
 8:48 pm on Aug 4, 2013 (gmt 0)

The [^\ ]* pattern tests for "not a space" and is only relevant when it is added to a RewriteCond that is testing THE_REQUEST. It's the wrong thing to add in other places. It's looking for the space before HTTP in the literal request from the browser:

Okay that makes sense. Everywhere that I had put [^\ ]* in the condition, I also copied it in the pattern. So like you said, it looks like I can just remove the $ from the pattern in most cases.

Rule 8 - the Condition purposely tests THE_REQUEST so that the pattern will be a match only when something was requested as a URL from somewhere out there on the web, and not as the result of matching a prior internal rewrite. This prevents an infinite loop.

Okay that is what happened when I removed the condition - an infinite loop. But then why doesn't my index rule #2 require a condition to prevent the infinite loop?

The Rule pattern (Rule not Condition) can be simplified from ^(([^/]+/)*[^.]+)\.html?[^\ ]*$ to ^(([^/]+/)*[^.]+)\.htm with no trailing $. This matches .htm and .html and .htm<anything>.

Done.

#8 Redirect remaining .htm or .html requests to extensionless URL, removing trailing invalid characters and parameters, forcing 'http://www.'
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*[^.]+\.html?[^\ ]*\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*[^.]+)\.htm http://www.example.com/$1? [NC,R=301,L]

On #11, I also removed the [^\ ]* from the pattern. Also, in the pattern, I removed the space from this part [^\w\-/\ ]+ so now it matches the part right before it. I think that's important in this rule.

#11 Redirect requests with trailing invalid characters to extensionless URL, removing trailing parameters, forcing 'http://www.', excluding specific file types and folders
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*[\w\-/]*)?[^\w\-/\ ]+[^\ ]*\ HTTP/
RewriteCond %{REQUEST_URI} !\.(css|gif|jpe?g|png|js|ico|xml|txt)$ [NC]
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteRule ^((([^/]+/)*[\w\-/]*)?)[^\w\-/]+ http://www.example.com/$1? [R=301,L]

Rule 2 - This wouldn't redirect /indexation if there were a literal escaped period after "index" in the pattern.

Right. That's what I fixed. I didn't want the possibility of creating a file name that begins with "index", even though I wouldn't do that.

The pattern ^<rest of pattern>index\.htm with no trailing $ matches URL requests with anything after htm - an l, or any type of appended junk, likewise the pattern ^<rest of pattern>index\. will match "index dot anything".
I would simplify the Rule pattern from ^(([^/]+/)*)index([^\w\-]+[^\ ]*)?$ to ^(([^/]+/)*)index\.htm with no trailing $. This matches .htm and .html and .htm<anything>.

Well, I did want it to work for "anything after htm or html" and "index dot anything". But I also wanted it to work for just plain "index" and "index-then-one-character-that's-not-a-valid-URL-character-followed-by-anything". So it looks like what I have is good, except that maybe the [^\ ]* should be changed to (.*). But isn't the [^\ ]* the lesser of the two evils? I noticed that even if I add a space in the parameter, it still redirects. So yeah, the [^\ ]* doesn't even filter the space. But I don't think it hurts to have it there, does it? Especially if I'm avoiding (.*)?

So I still have this:

#2 Redirect index requests in any directory to root of that directory, removing trailing invalid characters and parameters, forcing 'http://www.'
RewriteRule ^(([^/]+/)*)index([^\w\-]+[^\ ]*)?$ http://www.example.com/$1? [NC,R=301,L]

...where these requests redirect to root:

http://www.example.com/index
http://www.example.com/index.
http://www.example.com/index/.
http://www.example.com/index./
http://www.example.com/index.htm
http://www.example.com/index,htm
http://www.example.com/index.abc
http://www.example.com/index/abc
http://www.example.com/index/.,/
http://www.example.com/index?;.?/

These requests give a 404:

http://www.example.com/indexhtm
http://www.example.com/index-htm
http://www.example.com/index33
http://www.example.com/indexation

You think maybe the rule is too loosey goosey?

Rule 6a and 6b - Is there any request that can lead to an infinite http-https-http-https loop? The rules don't look "symmetrical" and "opposite".

Hmm. I don't think http to https can happen, because the only rule that redirects to https (#6a) is prefaced by ^443$, so the request is already https.
I see what you're saying about the two rules not matching. The two conditions that are missing from #6a (the .htm -f check and the -d check) are handled above it in #13, because anything that triggers #6a is https, and #13 catches all https requests before #6a but releases non-.htm files to #6a. I know that sounds confusing. But I did spend a lot of time fine-tuning and testing it, especially on images and css files. I'll have to keep an eye on it.

This is normal in shared-hosting setups because they need to ensure that nobody gets into an .htaccess or .htpasswd file. Well, you'd do it on your own server too, only then you might not have htaccess files to protect.
This thread has been going on for quite a while, so I can no longer remember if you're testing in a WAMP-or-similar setup. If yes, take a closer look at the default config file. If you find a rule involving .ht try commenting it out and see if that affects your rule. If yes, it means that you can't go any further. If no, keep looking.
Or ignore the problem and proceed on the assumption that you will not get an awful lot of typo requests for /valid-directory//.html

Yeah, I am on a shared server. I wouldn't mind having access to the config file, but no such luck for the unwashed masses. :)

Thanks for the help, guys! It looks like I'm tantalizingly close to putting this to bed.

lucy24




msg:4599321
 9:41 pm on Aug 4, 2013 (gmt 0)

But then why doesn't my index rule #2 require a condition to prevent the infinite loop?

Internal requests that result from mod_dir activity have a special relationship with mod_rewrite. I've never found documentation, but I know from direct personal experimentation that directory-index requests are only evaluated against RewriteRules that create an internal rewrite, not against ones that create an external redirect. At least in Apache 2.2.

When in doubt, it can't hurt to stick on a [NS] "no subrequests" flag. This doesn't seem to apply to results of internal rewrites*, but it does apply to some other common actions: mod_dir requests for directory index files; auto-indexing; include files when done as SSI rather than php. (php includes don't seem to pass through htaccess at all. I do not understand this.)


* I will try some stuff on my test site and report back if I'm mistaken.

g1smd




msg:4599336
 10:32 pm on Aug 4, 2013 (gmt 0)

Rule 2 should be testing THE_REQUEST in a preceding RewriteCond. Your original code did this.

While some server configurations might get away without it for a while, it can't be guaranteed, and unless you're testing these requests regularly and/or studying the server error logs frequently you might not notice the problem for a long time.

MarkOly




msg:4599385
 4:09 am on Aug 5, 2013 (gmt 0)

Okay I'll just add back the RewriteCond. I removed it in the first place because it's identical to the pattern. So:

#2 Redirect index requests in any directory to root of that directory, removing trailing invalid characters and parameters, forcing 'http://www.'
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index([^\w\-]+[^\ ]*)?\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*)index([^\w\-]+[^\ ]*)?$ http://www.example.com/$1? [NC,R=301,L]

Thanks!

lucy24




msg:4599391
 5:32 am on Aug 5, 2013 (gmt 0)

I removed it in the first place because it's identical to the pattern.

It's supposed to be. You're not testing the content, you're testing its source.

"The request is for index.php AND this request originated on the outside, rather than inside the present server."

But honestly it seems as if [NS] would do the job, since the specific purpose of this flag is to weed out server-internal requests.

MarkOly




msg:4599520
 4:49 pm on Aug 5, 2013 (gmt 0)

I removed it in the first place because it's identical to the pattern.

It's supposed to be. You're not testing the content, you're testing its source.

"The request is for index.php AND this request originated on the outside, rather than inside the present server."

But honestly it seems as if [NS] would do the job, since the specific purpose of this flag is to weed out server-internal requests.

Okay that's starting to make sense. I obviously have some reading to do on this. When I come back to this for round 2, I'll spend some extra time learning about the things that happen from a server standpoint when a request is made. It's easy to jump right to the coding part of it before establishing any kind of background.

So using the NS flag:

#2 Redirect index requests in any directory to root of that directory, removing trailing invalid characters and parameters, forcing 'http://www.'
RewriteRule ^(([^/]+/)*)index([^\w\-]+[^\ ]*)?$ http://www.example.com/$1? [NS,NC,R=301,L]

MarkOly




msg:4599628
 4:43 am on Aug 6, 2013 (gmt 0)

I wanted to post my final htaccess again so the corrections I made are all in one spot. I think that should be a wrap this time. :) I'll be checking my error logs every day. So if any new errors crop up, I'll post back again to correct them. Speaking of that, Apache didn't like my [NC] on the -f check lines: ([warn] RewriteCond: NoCase option for non-regex pattern '-f' is not supported and will be ignored.) That shows how important it is to check your logs immediately after making changes, at least for someone on my level who thinks you can add [NC] wherever you want.
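
For anyone hitting the same warning, the fix is simply to drop [NC] from the file-test lines while leaving any [OR] in place. A before-and-after sketch, using a trimmed-down set of the conditions from rule 13:

```apache
# Before: -f is a file test, not a regex, so [NC] is ignored and logged as a warning
# RewriteCond %{REQUEST_FILENAME}.htm -f [NC,OR]

# After: same behaviour, no warning
RewriteCond %{SERVER_PORT} ^443$
RewriteCond %{REQUEST_FILENAME}.htm -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^([^.]*)$ http://www.example.com/$1? [R=301,L]
```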

If there are any disclaimers to make, I think they would be about the last rules (13, 6a, and 6b), which relate to redirecting https to http and non-www subdomains to www. It's probably overkill and unorthodox, and most people probably don't need all that. If you have separate https version stylesheets like I do, it might be more relevant. I've tested it all. But it could probably use more testing to make sure there are no infinite loops, as g1smd mentioned.

RewriteEngine On
RewriteBase /

#1 Redirect requests for old URLs to new URLs
RewriteRule ^old-page\.htm http://www.example.com/new-folder/new-page? [R=301,L]
# Then repeat the above 80 times.

#2 Redirect index requests in any directory to root of that directory, removing trailing
# invalid characters and parameters, forcing 'http://www.'
RewriteRule ^(([^/]+/)*)index([^\w\-]+[^\ ]*)?$ http://www.example.com/$1? [NS,NC,R=301,L]

#8 Redirect remaining .htm or .html requests to extensionless URL, removing trailing
# invalid characters and parameters, forcing 'http://www.'
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*[^.]+\.html?[^\ ]*\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*[^.]+)\.htm http://www.example.com/$1? [NC,R=301,L]

#5 Redirect requests with trailing slash to extensionless URL, forcing 'http://www.',
# if request is a valid .htm file, excluding specific folders
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*[^.]+/\ HTTP/
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^(([^/]+/)*[^.]+)/$ http://www.example.com/$1 [R=301,L]

#11 Redirect requests with trailing invalid characters to extensionless URL, removing
# trailing parameters, forcing 'http://www.', excluding specific file types and folders
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*[\w\-/]*)?[^\w\-/\ ]+[^\ ]*\ HTTP/
RewriteCond %{REQUEST_URI} !\.(css|gif|jpe?g|png|js|ico|xml|txt)$ [NC]
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteRule ^((([^/]+/)*[\w\-/]*)?)[^\w\-/]+ http://www.example.com/$1? [R=301,L]

#9 Redirect requests with trailing query string to extensionless URL, removing trailing
# invalid characters, forcing 'http://www.', excluding specific folders
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?#\ ]*)\?[^\ ]*\ HTTP/
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteRule (.*) http://www.example.com/$1? [R=301,L]

#13 Redirect https: requests to 'http://www.' if request is a valid .htm file or directory,
# excluding specific folders and file
RewriteCond %{SERVER_PORT} ^443$
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteCond $1 !^file1
RewriteCond %{REQUEST_FILENAME}.htm -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^([^.]*)$ http://www.example.com/$1? [R=301,L]

#6a Redirect https: requests for non-www and non-webmail subdomains to 'https://www.'
# if request is a valid non-.htm file, excluding specific folders
RewriteCond %{HTTP_HOST} !^(www|webmail)\.example\.com$ [NC]
RewriteCond %{SERVER_PORT} ^443$
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule (.*) https://www.example.com/$1 [R=301,L]

#6b Redirect http: requests for non-www and non-webmail subdomains to 'http://www.'
# if request is a valid .htm file, non-.htm file, or directory
RewriteCond %{HTTP_HOST} !^(www|webmail)\.example\.com$ [NC]
RewriteCond %{SERVER_PORT} !^443$
RewriteCond %{REQUEST_FILENAME}.htm -f [OR]
RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

#7 Internally rewrite extensionless URL requests to .htm file
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[^.]+[^./]\ HTTP/
RewriteRule ^([^.]+[^./])$ /$1.htm [L]

lucy24




msg:4599631
 5:48 am on Aug 6, 2013 (gmt 0)

:: mopping brow ::

It looks as if you have Fought The Good Fight. Go have a beer. If any residual issues come trickling in, you can deal with them later. In the last few days of testing, you have probably fed the server more bad requests than it sees in a year in real life :)

MarkOly




msg:4599745
 3:13 pm on Aug 6, 2013 (gmt 0)

It looks as if you have Fought The Good Fight. Go have a beer. If any residual issues come trickling in, you can deal with them later. In the last few days of testing, you have probably fed the server more bad requests than it sees in a year in real life :)

Yeah it got hard to sort through the chaos. It's all clean now though. Except for those pesky apple icon errors. Outlook keeps reminding me about it and I keep hitting snooze. Couldn't take me much longer than an hour or two to throw one together.

Well thanks for all the help Lucy! I hope you've had a chance to catch up on your Zupitza. Thanks g1smd!

lucy24




msg:4599866
 11:00 pm on Aug 6, 2013 (gmt 0)

Except for those pesky apple icon errors.

Huh? What apple icon errors? Have you mentioned them before?

The apple-touch-icon can cause trouble because it's got the same .png extension as an ordinary image file, but sometimes you want to treat it the same as favicon.ico. (My favicon has an Allow from all to help flag human users who got locked out by mistake.) In fact I've been meaning to add a <FilesMatch> envelope for mine.
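
A rough sketch of what such an envelope might look like, using the Apache 2.2 access directives that "Allow from all" implies. The filename pattern is an assumption for illustration, not something tested in this thread. On Apache 2.4 the equivalent directive would be "Require all granted".

```apache
# Hypothetical pattern: favicon.ico plus any apple-touch-icon variant,
# with an optional size suffix and an optional -precomposed suffix.
<FilesMatch "^(favicon\.ico|apple-touch-icon(-\d+x\d+)?(-precomposed)?\.png)$">
    Order Allow,Deny
    Allow from all
</FilesMatch>
```

The idea is only to treat the touch icons as a group with the favicon, so they don't fall under whatever rules apply to ordinary .png images.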

MarkOly




msg:4599897
 2:51 am on Aug 7, 2013 (gmt 0)

Huh? What apple icon errors? Have you mentioned them before?

File does not exist: /var/www/vhosts/example.com/httpdocs/apple-touch-icon-precomposed.png
File does not exist: /var/www/vhosts/example.com/httpdocs/apple-touch-icon.png

Sorry, I should have been more specific, because it's completely unrelated to the topic at hand. I don't have an apple touch icon, so I need to create one. It's been on my todo list forever. And it's probably important. That's why I elevated it to 'annoy me with a daily alarm reminder' status.

lucy24




msg:4599916
 6:50 am on Aug 7, 2013 (gmt 0)

Oh, yes. There's one basic principle about apple-touch icons: If you have three kinds on your site, intending to cover all bets, the visitor will ask for one of the other five.

apple-touch-icon
same-57x57
same-72x72
same-144x144
... and all four again, inserting -precomposed-
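
One way to cover all eight names with a single physical file is an internal rewrite. This is a sketch only. It assumes one real file named apple-touch-icon.png in the site root, and that the variants differ only by an optional size suffix and an optional -precomposed suffix. The pattern is an assumption and has not been tested against the ruleset posted above.

```apache
# Internally serve any sized or precomposed variant from the one real file.
# The bare name apple-touch-icon.png is deliberately not matched, so the
# rule cannot rewrite a request onto itself.
RewriteRule ^apple-touch-icon(-\d+x\d+(-precomposed)?|-precomposed)\.png$ /apple-touch-icon.png [L]
```

Because this is an internal rewrite and not a redirect, it would belong after the external redirects, alongside the final extensionless-to-.htm rewrite.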
