.htm to Extensionless URLs - Plus Renaming Files
.htaccess on deck
MarkOly




msg:4587286
 11:04 pm on Jun 24, 2013 (gmt 0)

After much deliberation, I've decided to convert from .htm extensions to extensionless URLs. I'm also changing the names of most pages and moving them to subfolders - about 80 pages. I've pieced together the .htaccess code based on the great examples I've cherry picked here.

RewriteEngine On
RewriteBase /

#1 - Redirect requests for old URLs to new URLs
RewriteRule ^old-page\.html?$ http://www.example.com/new-folder/new-page [R=301,L]
# Then repeat the above 80 times.

#2 - Redirect index.html or .htm in any directory to root of that directory and force www
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.html?[^\ ]*\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1? [R=301,L]

#3 - Redirect all .html requests to .htm on canonical host.
RewriteRule ^([^.]+)\.html$ http://www.example.com/$1.htm [R=301,L]

#4 - Redirect direct client request for old URL with .htm extension
# to new extensionless URL if the .htm file exists
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+\.htm\ HTTP/
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^(([^/]+/)*[^.]+)\.htm$ http://www.example.com/$1 [R=301,L]

#5 - Redirect any request for a URL with a trailing slash to extensionless URL
# without a trailing slash unless it is a request for an existing directory
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^.]+)/$ http://www.example.com/$1 [R=301,L]

#6 - Redirect requests for non-www/ftp/mail subdomain to www subdomain.
RewriteCond %{HTTP_HOST} !^(www|ftp|mail)\.example\.com$
RewriteRule ^([^.]+)$ http://www.example.com/$1 [R=301,L]

#7 - Internally rewrite extensionless URL request
# to .htm file if the .htm file exists
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.htm [L]


I'm wondering if it would be a good idea to trim some fat from this. For one thing, on the 80 specific URL redirects (#1), will matching the optional .html extension (the html? in the pattern) be a big extra burden? Considering that there are 80 lines to go through, would it be better to match only the plain .htm extension where that's all that's necessary?

If there's one error I see more than any other in my logs, it's the .html requests. That's why I added #3 (redirect .html to .htm). I know you want to avoid multiple redirects, so I'll probably want to get rid of #3. I could easily combine it with #4 (redirect .htm to extensionless) - if I could delete the file check line in #4 (RewriteCond %{REQUEST_FILENAME} -f). So I'm wondering how important that file check is. There's another file check in #7 (the internal rewrite to .htm), so it doesn't seem that necessary. It looks like the file check would prevent Apache from cycling through again in the case of a bad file name. But from what I've read, the -f and -d filesystem checks use a lot of resources. So it seems like more resources would be spent checking every request against the filesystem than would be spent on the occasional bad-filename request cycling through once more. Am I missing something?
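For what it's worth, a minimal sketch of such a combined redirect, assuming the file-exists check is simply dropped, so any direct client request for .htm or .html goes to the extensionless URL in one hop:

# Combined #3/#4 (sketch): redirect direct client requests for .htm OR .html
# to the extensionless URL in a single hop, no filesystem check
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+\.html?\ HTTP/
RewriteRule ^(([^/]+/)*[^.]+)\.html?$ http://www.example.com/$1 [R=301,L]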

I'm also wondering how important #5 is (remove trailing slash from files), which requires a directory check line (RewriteCond %{REQUEST_FILENAME} !-d). I don't have problems with this error now. The .html requests are a lot more common. But if I were using extensionless URLs, I bet it would be a different story. Is this a common error once you convert to extensionless?

If you see anything else I should be concerned with, please let me know. Thanks for any help!

MarkOly

 

lucy24




msg:4597086
 9:20 pm on Jul 27, 2013 (gmt 0)

:: thinking about what I'd do if I were making this from scratch ::

Good characters: alphanumeric, hyphen, directory slash
Bad characters: everything else, except that there might be an extension in the form .xtn
Can we assume for the sake of discussion that your name is not apache dot org and that, therefore, your directory names do not contain literal periods? (Yes, it's legal, but frankly if it had been me I would have said 2_2 2_4 and so on.) And no other nonsense like , or ~ or any of the many other things that are technically legal? Otherwise, adjust your Good and Bad groups accordingly.

Use a pipe-delimited list of only those extensions that actually occur in page names. Include .pdf if you happen to have any. Assume for the sake of discussion that you don't want people making cold requests for images, so if someone asks for a file in ".jpg;" that's their tough luck and they deserve a 404.

RewriteCond %{REQUEST_URI} !index\.php
RewriteRule ^((?:[^/.]+/)*(?:[^/.]+\.(?:html?|php))?)[^a-zA-Z0-9].* http://www.example.com/$1 [R=301,L]

Exclude index.php (or whatever filename you use for your actual index pages) because that redirect comes later. And, in the index redirect, leave off the closing anchor because then you can concurrently redirect requests for
/directory/index.phpmore-garbage-here
You have to code for one or more characters after the captured part so a request doesn't redirect to itself.
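A minimal sketch of such an index redirect with the closing anchor left off, assuming index.php really is the index filename (swap in index.htm or whatever applies); the literal "index.php" after the capture is the "one or more characters" that keeps the target from matching the rule again:

# Redirects /dir/index.php, /dir/index.php?junk and /dir/index.phpmore-garbage-here to /dir/
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.php
RewriteRule ^(([^/]+/)*)index\.php http://www.example.com/$1? [R=301,L]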

RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/

What's the story here? Do these two directories have wonky URLs?


Edit:
Oops, we're onto a second page.
The only issue I've found so far is that something like example.com/bogus. gets a 301 to example.com/bogus first, and then example.com/bogus gets its 404.

The problems with redirecting to something other than a clean 200 are:
#1 extra work for the server: redirecting someone who will end up being blocked, or a request for a nonexistent file, or one redirect leading to a second, different one
#2 potential loss of googlejuice resulting from multiple redirects

The multiple-redirect problem can be handled by putting rules in the right order.

The 301-to-4xx problem is a lesser-of-two-evils situation: either the server occasionally has to do a little extra work redirecting a bad request, or it has to do extra work every time by looking up the request before issuing a redirect.

How often do you get malformed requests for bogus URLs? Or, more exactly: how often do you get them from visitors who haven't already been blocked by UA and/or IP and/or referer*? Pull up a few days' logs, look for requests fitting the Rule 11 pattern, and see how many of them are simply logged as 403 already.


* I have some blocks in place for bogus referers, but in practice all of these requests come in from Ukrainian robots, so it's strictly belt-and-suspenders.

JD_Toims




msg:4597134
 3:34 am on Jul 28, 2013 (gmt 0)

RewriteCond $1 !\.(css|gif|jpe?g|bmp|png|js|ico|xml|txt)$ [NC]

...

It does work - and for multiple trailing chars after I got daring and added the + before the $ in the pattern. But am I skating on thin ice here?

I haven't read the whole thread or the file being posted, but I did read the last couple of posts and thought I'd jump in to say that I usually get all those types of files out of the rewrites/redirects as soon as I can, so my first or "nearly first" rule is some variation of:

RewriteRule \.(css|gif|jpe?g|bmp|png|js|ico|xml|txt)$ - [L]

Who really cares if those file types canonicalize or not? I don't...

If someone makes a bad request they get a 404 immediately, and I don't have to run those file types through the entire .htaccess file on every single request for any one of them, so I save some processing by just eliminating them from everything else in the file.
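A minimal placement sketch, assuming the usual static extensions - the pass-through sits at or near the top of the file so everything below it only ever sees page requests:

RewriteEngine On

# Static assets: serve as-is and stop; a bad request 404s immediately
RewriteRule \.(css|gif|jpe?g|bmp|png|js|ico|xml|txt)$ - [L]

# ... canonicalization redirects and the extensionless rewrite follow here ...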

Dideved




msg:4597143
 6:19 am on Jul 28, 2013 (gmt 0)

^(.*)[^/0-9a-z]+$

vs

^([\w/-]+(\.\w+)?)?[^a-zA-Z\d].*


Am I the only one who noticed that the OP's original pattern was simpler *and* already worked correctly *and* didn't ignore a huge swath of valid URLs?

MarkOly




msg:4597184
 3:54 pm on Jul 28, 2013 (gmt 0)

Am I the only one who noticed that the OP's original pattern was simpler *and* already worked correctly *and* didn't ignore a huge swath of valid URLs?

Are you trying to say we're spinning our wheels? :)

I think the problem with my first version of #11 is that it's greedy and promiscuous. So we've been looking for something that handles trailing punctuation without forcing every request to go through this thing: (.*)

So far, the rule that accomplishes this is the one I posted a couple posts above:

# Redirect URL containing valid characters to remove trailing invalid characters
RewriteCond $1 !\.(css|gif|jpe?g|bmp|png|js|ico|xml|txt|any-other-file-type-thats-not-a-page)$ [NC]
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteRule ^([/0-9a-z_\-]*)[^/0-9a-z_\-]+$ http://www.example.com/$1 [NC,R=301,L]

Now this is skating on thin ice, because if any request comes through for a new file type I haven't included in RewriteCond 1, then things could get ugly. That's why I added RewriteCond 2 to block out my shopping cart and stats folders. Why not take them out of harm's way and never have to think about it again?

This rule works for me because I'm using extensionless URLs and at this point in the rules, there are no more valid page requests coming through that contain a period. So this rule is no good for a site with htm or html extensions - and probably dangerous for a large or complex site or one with multiple people making changes to it.

This one here I modified a bit to fit my extensionless URLs. It works, but it also redirects css files. It's weird: I added the same RewriteCond as in the rule above to block out css requests, but it still redirects css files:

# Redirect URL containing valid characters to remove trailing invalid characters
RewriteCond %{REQUEST_URI} !index\.com
RewriteCond $1 !\.(css|gif|jpe?g|bmp|png|js|ico|xml|txt|any-other-file-type-thats-not-a-page)$ [NC]
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteRule ^((?:[^/.]+/)*(?:[^/.]+(?:\.pdf)?)?)[^a-zA-Z0-9].* http://www.example.com/$1 [R=301,L]

As Dideved pointed out, we have a working rule that's simple yet powerful. It's just a little promiscuous. I thought this might fix that (change (.*) to (^\ )):

# Redirect URL containing valid characters to remove trailing punctuation
RewriteRule ^(^\ )[^/0-9a-z]+$ http://www.example.com/$1 [NC,R=301,L]

Only problem: that doesn't work for trailing periods. Dang! Now if there were a RewriteCond I could use on this rule (the original version with the (.*)), then it wouldn't be so promiscuous. But I think that RewriteCond would inevitably require using (.*), so what's the point?

Well I think we are spinning our wheels here. I'm going to plug back in the rule I mentioned at the top of this post and keep an eye on it.

Thanks for all the help! I really do appreciate it.

Dideved




msg:4597198
 6:11 pm on Jul 28, 2013 (gmt 0)

I think the problem with my first version of #11 is that it's greedy and promiscuous.


That sounds like something you would have learned from these forums. :) Here's an alternative view for you to consider.

Take a look at every framework, every CMS, basically every open source project out there, and I think you'll find that .* is used liberally. Then take a look at the official Apache documentation (https://httpd.apache.org/docs/2.4/rewrite/remapping.html), and I think you'll find that .* is used liberally there too.

I'd venture to say that use of .* is standard practice, even for the authors of Apache itself. It's actually just a small minority who are vehemently against it, so much so that they go out of their way to write longer, more complex, and less robust patterns just to avoid it.

If it's the right tool for the job, you don't need to be afraid to use it. ;)

JD_Toims




msg:4597203
 6:58 pm on Jul 28, 2013 (gmt 0)

Take a look at every framework, every CMS, basically every open source project out there, and I think you'll find that .* is used liberally.

That doesn't mean it's best, just easiest.

When you really start digging into regex processing and the recursion often caused by the use of .*, especially when you're not "just matching everything" on the line, I think you'll find many times almost anything else is more efficient.

Also, when you're working with mod_rewrite in the context of .htaccess, the file is reprocessed from the beginning whenever a rule matches. So if .* only generates 5 rounds of recursion for each rule (it's easy to add hundreds or thousands with multiple .* patterns on the same line), but the last rule in a 10-rule file matches (never mind conditions using the same pattern), that's two passes over 10 rules - 20 evaluations at 5 rounds each, or 100 match attempts - so you've created 80 extra, unnecessary match attempts compared to the 20 that would be necessary with a pattern that would "break" immediately on a negative pattern match.

And using "every CMS" as an example of why it's OK, when WP ships some of the slowest, most inefficient code around - especially in its .htaccess file - doesn't lend much weight to the argument.

If WP and other open source code had it right there wouldn't be the need for threads like these: Change the default .htaccess file and make your WP site faster. [webmasterworld.com...]

Dideved




msg:4597207
 7:10 pm on Jul 28, 2013 (gmt 0)

When you really start digging into regex processing and the recursion often caused by the use of .*, especially when you're not "just matching everything" on the line, I think you'll find many times almost anything else is more efficient.


Ahh, the performance argument. So far as I can tell, I'm the only person on these forums who has ever bothered to actually run a benchmark. The performance difference we're talking about is nanoseconds of difference. It's so infinitesimally small that for all practical purposes, there is no performance difference. And to boot, the non-.* patterns aren't even always faster.

JD_Toims




msg:4597217
 7:40 pm on Jul 28, 2013 (gmt 0)

So far as I can tell, I'm the only person on these forums who has ever bothered to actually run a benchmark.

Where did you post the benchmarking test and results?

And if it was an .htaccess test, which botnet did you use to slam the server to simulate a large number of simultaneous requests for large variety of resources?

* I find it very hard to believe your test is completely valid when the use of .* goes against all the regex optimization information I've read or seen, including the info on the php, javaworld, M$ and many other websites... They all recommend "the fastest break possible", not necessarily because it speeds up matches, but because it speeds up non-matches, which can take many times longer to process if there's a large amount of "backing up" necessary before it can be determined there's not a match. I guess it's possible you know something about regexes that none of them do, but I tend to doubt it.

Dideved




msg:4597219
 7:55 pm on Jul 28, 2013 (gmt 0)

Where did you post the benchmarking test and results?


It looks like our profile pages list only a limited number of past posts. Is there a way to see my full list of posts?

And if it was an .htaccess test, which botnet did you use to slam the server to simulate a large number of simultaneous requests for large variety of resources?


I used siege (http://linux.die.net/man/1/siege).

JD_Toims




msg:4597221
 8:07 pm on Jul 28, 2013 (gmt 0)

It looks like our profile pages list only a limited number of past posts. Is there a way to see my full list of posts?

No, unfortunately.

And just to make sure you know I'm not "arguing just to argue": you're correct that there's likely little to no difference in "processing time" when there's a pattern match... Where the performance gains are usually made is when there is not a pattern match, because by "breaking the check" quickly (one pass) when there's no match, it's possible to save hundreds, thousands, or tens of thousands (for really inefficient patterns [obviously lol]) of "possible match checks" compared to "greedy grab and test it all" patterns.

Dideved




msg:4597226
 8:30 pm on Jul 28, 2013 (gmt 0)

Where the performance gains are usually made is when there's not a match, because by "breaking the check" quickly it's possible to save hundreds or thousands of "possible match checks" over "grab and test it all".


I may re-run the siege test to double check that scenario. Though, if I use the suggested solutions from this thread, I strongly suspect that the non-.* alternatives won't fare very well. They don't "break early" like you would want them to.

For example, the OP's goal was to match any URL that ends in punctuation, such as "/some/path,". Which means of course that a non-matching URL would be just "/some/path". Here's Lucy's non-.* pattern:

^([\w/-]+(\.\w+)?)?[^a-zA-Z\d].*

...or the OP's cleaned up (and working) version...

^([/0-9a-z_\-]*)[^/0-9a-z_\-]+$

The [\w/-]+ or [/0-9a-z_\-]* parts, respectively, right at the beginning would still match a non-matching URL all the way to the end before it realizes it needs to backtrack. It hasn't solved or improved anything. I suspect the .* pattern will be the performance winner here.

And even then, we're still talking about micro-optimizations. The real question we should be asking is, "Which pattern is shorter, simpler, and works correctly for all valid URLs?"

JD_Toims




msg:4597228
 8:46 pm on Jul 28, 2013 (gmt 0)

Oh, I haven't really paid much attention to the code in this thread. I've skimmed parts to see what's going on, but the posts are too long for me to sit and read (lol). I was just commenting on a couple of things. I don't think I would write the rule you're talking about the way either one of them has it... I think I'd do something more like:

RewriteCond %{REQUEST_URI} ^/([/0-9a-z_\-]*)
RewriteRule [^/0-9a-z]$ http://www.example.com/%1 [R=301,L]

Fairly sure the . (dot) is already being stripped for an extensionless site, so I think the only likely valid ending characters are letters, numbers, or / on this particular site.

I mean, why match and store everything for every request when we can "implicitly match" (no storage, no back-tracking, works for all URL patterns), then check the end of the line for an invalid character, and if we find one, "grab the good stuff" from the beginning of the URL and redirect?

In the ruleset above, we never hit the condition unless there's something other than a letter, number or / at the end of the line, so if there's no invalid character we get a single-pass break, and if there is, we just grab everything up to it in a single pass and redirect.
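A rough walk-through with hypothetical URLs, to show what those two lines do:

# GET /widgets/blue-widget   -> the rule pattern [^/0-9a-z]$ fails on the final "t",
#                               so the condition is never evaluated: one pass, no redirect
# GET /widgets/blue-widget,  -> the rule pattern matches the trailing ",", the condition
#                               captures "widgets/blue-widget" into %1, and the request
#                               is redirected to http://www.example.com/widgets/blue-widget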

EDITED > DELETED THE EDIT
Was right the first time lol

Dideved




msg:4597233
 9:10 pm on Jul 28, 2013 (gmt 0)

In the ruleset above, we don't ever hit the condition unless there's something other than a letter, number or / at the end of a line, so if there's not an invalid character we have a single pass break and if there is, then we just grab everything up to it in a single pass and redirect.


Very clever indeed. Later today I'll test it with siege and see what kind of difference it makes.

JD_Toims




msg:4597234
 9:16 pm on Jul 28, 2013 (gmt 0)

I should probably add:

One of the reasons I'd write it that way is that requests without an invalid character (99.9999*% will not have one) should "break the match" very rapidly, and I don't mind slowing down a "blip" to redirect when there's a request with an invalid character. IOW: I'd rather keep good requests as fast as I can and slow down a bit, if I have to, for the few that will need to be redirected.

Very clever indeed. Later today I'll test it with siege and see what kind of difference it makes.

Thanks, and I'm definitely interested in knowing the results of what you find.

lucy24




msg:4597243
 10:43 pm on Jul 28, 2013 (gmt 0)

It works, but it also redirects css files. It's weird, I added the same RewriteCond as above rule to block out css requests, but it still redirects css files.

I would strongly recommend doing it the other way around. (I think I said so in a post, but it may have been lost in the shuffle.)

Rather than excluding non-page extensions in a RewriteCond ("blacklisting"), set up your RewriteRule to only include pages in the first place ("whitelisting"). This becomes vastly easier when you've gone extensionless-- a detail I'd forgotten when I posted-- because then all page URLs come down to
^([^.]*)$
assuming, as already noted, that you have no literal periods in directory names.

Conditions are only evaluated if the body of the rule is a potential match. Two steps forward, one back. So don't put anything in a condition that you could put in the body of the rule. This particularly applies to conditions of the form "the requested URL is such-and-such".

I think somewhere along the line phranque created a spinoff thread for discussion of .* so I expect he'll be along shortly to do some further pruning.

[edited by: bill at 4:37 am (utc) on Jul 29, 2013]
[edit reason] typo fix [/edit]

Dideved




msg:4597249
 11:45 pm on Jul 28, 2013 (gmt 0)

I think somewhere along the line phranque created a spinoff thread for discussion of .* so I expect he'll be along shortly to do some further pruning.


Hopefully not, since the benefits and drawbacks being discussed are relevant to the OP's choices.

Dideved




msg:4597260
 1:09 am on Jul 29, 2013 (gmt 0)

Thanks, and I'm definitely interested in knowing the results of what you find.


Full details in text file below. Results are... interesting.

For non-matching URLs, such as "/some/path", which as you said will be the case 99% of the time:
- OP's original pattern: Average Transaction rate: 1,229.44 trans/sec
- Your revised pattern: Average Transaction rate: 1,223.09 trans/sec
OP's original pattern was faster by about 4 microseconds per transaction.

For matching URLs, such as "/some/path,":
- OP's original pattern: Average Transaction rate: 1,188.03 trans/sec
- Your revised pattern: Average Transaction rate: 1,200.55 trans/sec
Your revised pattern was faster by about 9 microseconds per transaction.

I know you expected the numbers to be reversed. You expected your revised pattern to be faster when the URL *didn't* match. There should be enough details in the text file for you to repeat the process and see if you get the same kind of numbers.

https://cdn.anonfiles.com/1375058355718.txt

Though, as interesting as this is, I do want to reiterate that we're talking about single-digit microseconds. For all real-world, practical purposes, the performance is indistinguishable, and the best option is to pick the solution that is the simplest and most correct.

JD_Toims




msg:4597263
 1:18 am on Jul 29, 2013 (gmt 0)

Interesting, thanks for testing and sharing!
And yes, we are talking about microseconds, but it's a fun discussion :)

The difference could have to do with the .htaccess having to be compiled for every request, and with mine having a condition. It would be interesting to see if there's a difference with the rules in the httpd.conf file... And the differences are so small it seems like they could almost be latency variation during the testing.
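A sketch of what that httpd.conf comparison might look like, assuming a typical name-based vhost (the ServerName and DocumentRoot values are placeholders): the same two lines moved into the server config, where the patterns are parsed and compiled once at startup rather than re-read and re-compiled on every request the way an .htaccess file is.

<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example

    RewriteEngine On
    # Unanchored patterns behave the same here; only ^-anchored rule patterns
    # would need adjusting, since server-context rules see the leading slash.
    RewriteCond %{REQUEST_URI} ^/([/0-9a-z_\-]*)
    RewriteRule [^/0-9a-z]$ http://www.example.com/%1 [R=301,L]
</VirtualHost>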

Did you run it for the .* pattern also?
That's actually the one I expected mine would "for sure" be faster than when there wasn't a match.

lucy24




msg:4597267
 2:18 am on Jul 29, 2013 (gmt 0)

The difference could have to do with the .htaccess having to be compiled for every request

This is actually a much more important question than request-processing speed alone. It isn't simply "good RegEx" vs. "bad RegEx"; by itself that would be a no-brainer. It's "execution time plus compile time of good RegEx" vs. "execution time plus compile time of bad RegEx".

It would be horrible to discover that you're better off using a bad RegEx in htaccess, because the mere act of compiling a good one uses more server resources than are saved in execution. Assorted real-life analogies present themselves, but let's stick with something like: a long messy string of [OR]-delimited conditions vs. a single tidy pipe-delimited condition.
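For concreteness, a sketch of the two styles side by side, with hypothetical host names:

# Messy: one condition per host, chained with [OR]
RewriteCond %{HTTP_HOST} ^example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^old-example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^example\.net$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

# Tidy: a single pipe-delimited alternation
RewriteCond %{HTTP_HOST} ^(example\.com|old-example\.com|example\.net)$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]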

Unfortunately I don't see any way to test this other than on a production server, because every variable counts.

JD_Toims




msg:4597272
 2:36 am on Jul 29, 2013 (gmt 0)

It would be horrible to discover that you're better off using a bad RegEx in htaccess...

In some cases (limited) "bad" isn't as bad as it may seem, for instance, from the PHP website:

An optimization catches some of the more simple cases such as (a+)*b where a literal character follows. Before embarking on the standard matching procedure, PCRE checks that there is a "b" later in the subject string, and if there is not, it fails the match immediately. However, when there is no following literal this optimization cannot be used. You can see the difference by comparing the behaviour of (a+)*\d with the pattern above. The former gives a failure almost instantly when applied to a whole line of "a" characters, whereas the latter takes an appreciable time with strings longer than about 20 characters.

[php.net...]

And from the JavaWorld website:

So for example, say you want to optimize a sub-expression like ".*a". If the character a is located near the end of the input string it is better to use the greedy quantifier "*". If the character is located near the beginning of the input string it would be better to use the reluctant quantifier "*?" and change the sub-expression to ".*?a". Generally, I've noticed that the lazy quantifier is a little faster than its greedy counterpart.

Another tip is to be specific when writing a regular expression. Use general sub-constructs like ".*" sparingly because they can backtrack a lot, especially when the rest of the expression can't match the input string. For example, if you want to retrieve everything between two as in an input string, instead of using "a(.*)a", it's much better to use "a([^a]*)a".

[javaworld.com...]

NOTE: I do think compile time is definitely a factor in the testing being done, and the exact engine used for processing could be another factor in which pattern turns out to be the most efficient.

NOTE 2 (ADDED): I still don't like the .* pattern, because like I said previously, there are almost always more efficient ways overall, and since so much depends on the specifics of a URL structure, including length, I think it's best to avoid it whenever possible, especially when posting in public where most people don't understand all the ins and outs of efficiency... I can't remember finding a pattern I couldn't "match a better way", so overall, in any situation, the alternative is close to as fast (nanoseconds off), faster, or much faster than using .*

NOTE 3 (ADDED): In fairness, there are times when .* can be used, but over the years I've read here it's caused a number of efficiency issues that were solved with alternate patterns, so I think the general "oh, it works" use should be discouraged. I've seen sites where people thought they were fast, and through adjusting their .htaccess I've had them to the point where you sometimes had to look twice to notice the page had changed... So, although in some situations .* is "ok" or may even be a bit better, generally, much like the JavaWorld post says, it should be used sparingly at most, especially with the "reprocessing from the top" after a rule match that takes place in the .htaccess context.

[edited by: JD_Toims at 3:04 am (utc) on Jul 29, 2013]

Dideved




msg:4597274
 2:56 am on Jul 29, 2013 (gmt 0)

It would be horrible to discover that you're better off using a bad RegEx in htaccess...


Personally, I think your idea of good vs bad regex is very one dimensional. There are many factors that make any piece of code good or bad. Is it easy to write? Easy to understand? Easy to maintain? Does it perform well? Is it correct? Is it robust?

If I were to rank these in importance: Correctness and robustness are far and away the most important. Easy to understand and easy to maintain are the next most significant. Performance -- especially nanosecond performance -- is most definitely the least important.

But your idea of good vs bad seems to be the exact opposite. You sacrifice correctness and robustness and simplicity in favor of nanosecond performance.

JD_Toims




msg:4597278
 3:09 am on Jul 29, 2013 (gmt 0)

Easy to understand and easy to maintain are the next most significant.

I do have to agree, this is a good point to be made...

I had someone work on one of my most efficient .htaccess files after I left the site (it was totally dynamic, php-based, and looked static - meaning .html URLs with full server headers including modified, expires, content-length, etc.).

It was so screaming fast you could barely ever see the page change, and they "made corrections" to the .htaccess because they couldn't understand it. They ended up duplicating the content from one directory to another and at the same time eliminated the content that should have been in the directory now displaying the duplicate content. So I have to "go with" understandability being important if someone who does not "really know mod_rewrite" may ever work on a site, because someone not being able to understand one of my files and trying to "fix that which was not broken" caused the site to totally tank for a large number of queries, since a full 2,000+ page directory of content wasn't present any more...

ADDED NOTE: When I say the file was efficient, I mean it was very efficient... It was nearly 300 lines of code in the .htaccess, not 10 or 11 rules, and it processed so fast that if I clicked from one page to another where the template looked exactly the same and only the content changed, there were times when I'd completely miss that I was on the page I had clicked to. Being the one who built the whole thing from scratch, I've gotta wonder: if it was that fast for me, how fast did it seem to the average visitor?

[edited by: JD_Toims at 3:30 am (utc) on Jul 29, 2013]

MarkOly




msg:4597283
 3:14 am on Jul 29, 2013 (gmt 0)

RewriteCond %{REQUEST_URI} ^/([/0-9a-z_\-]*)
RewriteRule [^/0-9a-z]$ http://www.example.com/%1 [R=301,L]

I mean why match and store everything for every request when we can "implicitly match" (no storage, no back-tracking, works for all URL patterns) then check the end of the line for an invalid character and if we find one we can "grab the good stuff" from the beginning of the URL and redirect?

Yeah that is pretty slick. It works. I've been testing it for the past 20 mins or so. Thanks! I think this is a keeper.

Rather than excluding non-page extensions in a RewriteCond ... This becomes vastly easier when you've gone extensionless-- a detail I'd forgotten when I posted-- because then all page URLs come down to
^([^.]*)$

Okay, I was wondering how to express the extensionless "extension", because .htm and .html are easy. But when it came down to no valid extensions whatsoever, I got stuck. Thanks!

So don't put anything in a condition that you could put in the body of the rule. This particularly applies to conditions of the form "the requested URL is such-and-such".

Yeah I found myself doing this on one of them. I wanted to add a condition just to add a condition. Then I thought wait a second, this is exactly what the body says. I'm just making Apache do the same thing twice.

I'm going to test the whole lot some more and report back. Thanks for all the help guys!

JD_Toims




msg:4597284
 3:41 am on Jul 29, 2013 (gmt 0)

Yeah that is pretty slick. It works. I've been testing it for the past 20 mins or so. Thanks! I think this is a keeper.

Glad it's working for you!

I didn't test it, so I'm getting a bit better at "coding off the top of my head" again... I used to be able to do it like it was nothing, but I took a break from coding on-the-fly and "lost a bit" while I was away from it, so it's great to hear it's working and I'm "getting it back" a bit.

lucy24




msg:4597288
 4:07 am on Jul 29, 2013 (gmt 0)

RewriteCond %{REQUEST_URI} ^/([/0-9a-z_\-]*)
RewriteRule [^/0-9a-z]$ http://www.example.com/%1 [R=301,L]

This is kind of ingenious because the Condition technically isn't a condition at all: That is, it can't fail. (I assume the - and _ that are present in the condition but absent from the rule are typos.) It simply shifts the act of capturing from the Rule to the Condition. And since capturing by itself uses server resources, that means you only have to do it in the (very) rare case where the rule will apply.

This is also one reason you use non-capturing groups when possible, even though all those ?: make the code look messier. The other reason is so you don't have to keep counting parentheses and say, OK, I'm using 1 and 2 but not 3 and 4 which are inside of 2, and then 5 but again not 6, and...
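For example, a sketch of the thread's .htm/.html redirect with the inner repeating group made non-capturing, so only the outer group is stored and counted as $1:

RewriteRule ^((?:[^/]+/)*[^.]+)\.html?$ http://www.example.com/$1 [NC,R=301,L]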

JD_Toims




msg:4597289
 4:17 am on Jul 29, 2013 (gmt 0)

That is, it can't fail.

Very true... If the rule applies the condition must match!

That is, it can't fail. (I assume the - and _ that are present in the condition but absent from the rule are typos.)

They're in the condition and not the rule because they're valid URL characters, but not valid endings of a URL (I'm assuming) and I don't know the exact construction of the URLs on the site we're dealing with well enough to eliminate them... I'd limit the condition or the rule more if I could.

E.g. if I knew no URL should end in a number, I'd eliminate the 0-9 from the rule, and if I knew the site was truly extensionless (I strip everything on mine, meaning no trailing / allowed -- I take you to the-index without a / instead), I'd eliminate the / from the rule too... In the condition, I'd personally eliminate the _ on the sites I work on, because none of them use it, but since I don't know the site we're talking about well enough, I can't do that in this situation.

lucy24




msg:4597302
 6:24 am on Jul 29, 2013 (gmt 0)

Most of the rules in this thread are Insurance Rules: the kind you don't need until you need them. Doubled directory slashes, extraneous path info, punctuation at the end of an URL, some other stuff which I've forgotten. What the OP is trying to do is construct an htaccess for the ages, so that when he becomes the next YouTube he doesn't have to keep adding more rules to deal with typos.

Punctuation at the end of an URL is something you might really get in a human request, for example an auto-link from a forum.

Edit:
Oh, wait. If you set up the rule to match only one character, then you can no longer use the self-same rule to get rid of extraneous path info :( Once you've got an extension, anything after it will normally be garbage. But this is the thread that started out with mixed html and htm, wasn't it? That was what made everything so complicated.

MarkOly




msg:4597425
 4:24 pm on Jul 29, 2013 (gmt 0)

That is, it can't fail. (I assume the - and _ that are present in the condition but absent from the rule are typos.)

They're in the condition and not the rule because they're valid URL characters, but not valid endings of a URL (I'm assuming) and I don't know the exact construction of the URLs on the site we're dealing with well enough to eliminate them... I'd limit the condition or the rule more if I could.

I'm glad you mentioned that, because it was causing redirect loops in the event of an accidental URL ending in - or _.
So I added the - and _ to the RewriteRule as well. So now:

# Redirect URL containing valid characters to remove trailing invalid characters
RewriteCond %{REQUEST_URI} ^/([/\w\-]*)
RewriteRule [^/\w\-]$ http://www.example.com/%1 [R=301,L]

That seems to clear up the issue. I'll test it some more.

Most of the rules in this thread are Insurance Rules: the kind you don't need until you need them. Doubled directory slashes, extraneous path info, punctuation at the end of an URL, some other stuff which I've forgotten. What the OP is trying to do is construct an htaccess for the ages, so that when he becomes the next YouTube he doesn't have to keep adding more rules to deal with typos.

And if you remember, I started out wanting to be as lean as possible and use only 4 basic rules. Then the more I read...
Oh, wait. If you set up the rule to match only one character, then you can no longer use the self-same rule to get rid of extraneous path info :( Once you've got an extension, anything after it will normally be garbage. But this is the thread that started out with mixed html and htm, wasn't it?

No, I just converted from .htm to extensionless. I had the 80 .htm URLs that were renamed and had to be redirected to their new extensionless URLs as the first rule. The rule's working for multiple trailing junk characters on extensionless URLs, directories, and pdf files. Lemme know if you think of anything I should test.

MarkOly




msg:4597918
 4:56 am on Jul 31, 2013 (gmt 0)

I wanted to post my final combined htaccess for posterity's sake. :) I've tested this inside and out, testing all the rules in combinations of 2 up to 4 rules together. It works for most combinations of errors, except where it gets into odd combinations. I didn't see any cases where it had to cycle through twice, except where it's unavoidable. So it seems like I have the order of rules correct, but I could be wrong.

If anybody sees anything egregious, please let me know. Otherwise, I'm making it official on my end.

Thanks for all the help I got on this. Thanks Lucy!

RewriteEngine On
RewriteBase /

#1 Redirect requests for old URLs to new URLs
RewriteRule ^old-page\.htm$ http://www.example.com/new-folder/new-page [R=301,L]
# Then repeat the above 80 times.

#2 Redirect index requests in any directory to root of that directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index(\.[a-z0-9]+)?[^\ ]*\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*)index(\.[a-z0-9]+)?$ http://www.example.com/$1? [NC,R=301,L]

#8 Redirect remaining .htm or .html requests to extensionless URL
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+\.html?\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*[^.]+)\.html?$ http://www.example.com/$1 [NC,R=301,L]

#9 Redirect URLs containing valid characters to remove query string except for specific folders
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?#\ ]*)\?[^\ ]*\ HTTP/
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteRule (.*) http://www.example.com/$1? [R=301,L]

#11 Redirect URLs containing valid characters to remove trailing invalid characters
RewriteCond %{REQUEST_URI} ^/([/\w\-]*)
RewriteRule [^/\w\-]$ http://www.example.com/%1 [R=301,L]

#5 Redirect requests with trailing slash to extensionless URL if .htm file exists
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+/\ HTTP/
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^(([^/]+/)*[^.]+)/ http://www.example.com/$1 [R=301,L]

#6 Redirect requests for non-www and non-webmail subdomains to www subdomain
RewriteCond %{HTTP_HOST} !^(www|webmail)\.example\.com$ [NC]
RewriteCond %{HTTP_HOST} example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

#13 Redirect https requests to http except for specific file types, folders, and file
RewriteCond %{SERVER_PORT} ^443$
RewriteCond $1 !\.(css|gif|jpe?g|bmp|png|js|ico|xml|txt)$ [NC]
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteCond $1 !^file1
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

#7 Internally rewrite extensionless URL requests to .htm file if .htm file exists
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[^.]+[^./]\ HTTP/
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^([^.]+[^./])$ /$1.htm [L]

lucy24




msg:4597934
 7:30 am on Jul 31, 2013 (gmt 0)

Tralala. The form [^/\w\-] is confusing because looking at it you'd be prepared to swear it's loaded down with unnecessary escapes-- but there really aren't any!

Can't Rule 13 be expressed as ^([^.]*)$ so you don't have to put all those non-page extensions in a Condition? At this point you've already redirected all requests for .htm/.html.
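Something like this, perhaps - a sketch only, keeping the folder and file exclusions but dropping the extension blacklist, since ^([^.]*)$ can't match anything with an extension:

#13 Redirect https requests for extensionless (page) URLs to http
RewriteCond %{SERVER_PORT} ^443$
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteCond $1 !^file1
RewriteRule ^([^.]*)$ http://www.example.com/$1 [R=301,L]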

I don't know whether $1 is more efficient than %{REQUEST_URI}. I'd go with the longer form unless there's a big difference in server efficiency, just so I don't have to keep looking back: "What $1? Which rule is this again?"

Looking at this rule
RewriteRule ^(([^/]+/)*[^.]+)\.html?$
I realized there's yet another possible malformed request:
example.com/blahblah//.html
So I guess the second grouping bracket needs to be [^./] after all. I don't know whether the server interprets // in this location as a null file-- an error of some sort, surely? --or as a file called "/.html". In the specific case of .htm or .html you're in the clear, because the server has already blocked requests beginning in .ht (I looked in MAMP's config file; that's the wording).
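That is, something like this (a sketch), so the part before the extension can no longer swallow a bare slash:

RewriteRule ^(([^/]+/)*[^./]+)\.html?$ http://www.example.com/$1 [NC,R=301,L]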

Now, what if someone comes in with a request for an extension you don't use at all? At one time I had a global [NS] block on requests ending in .php just because a 403 is so much more satisfying than a 404.

g1smd




msg:4597977
 10:49 am on Jul 31, 2013 (gmt 0)

Comments apply to the code shown in the examples posted a few hours ago, not to the original posting from several weeks ago - i.e. for me, with 30 posts per page, that's the code on the previous page (page 2), not in the post shown immediately above (at the top of page 3).

Rule 1: I would remove the closing $ so that old .htm URLs - whether requested as .htm or .html, and with or without appended junk - also redirect. Should this rule also strip parameters if they were requested? The Apache default action is to re-append them. Removing parameters is as simple as adding a question mark to the rule target.
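A sketch of Rule 1 with the closing anchor removed and the parameter-stripping question mark added to the target:

#1 Redirect requests for old URLs to new URLs - also catches .html and appended junk; the "?" strips any query string
RewriteRule ^old-page\.htm http://www.example.com/new-folder/new-page? [R=301,L]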

Rule 2: The Condition has [^\ ]*, which allows for appended trailing junk or parameters after the index filename. I would alter the Rule pattern to allow index requests with trailing junk to also redirect, and for the junk to be stripped.

Rule 8: I would allow URLs with trailing junk to also be redirected to the new URL. I think I would also strip parameters in the redirect.

Rule 5: Should this rule also strip parameters in the redirect if they were requested?

Rule 6: I think the second Condition is redundant. Stripping parameters in this redirect may cause problems elsewhere without a lot of messing about. I'd put up with a redirection chain for some requests, as you have it now.

Rule 13: Is this meant to redirect requests for extensionless-URL pages to http, or should it redirect some other stuff as well? If it only needs to redirect extensionless requests, the Rule pattern can be changed from (.*) to something more specific, and you can get rid of at least the second Condition. Should this rule also strip parameters in the redirect if they were requested?

Rule 7: I'm not sure whether the -f test is a good idea or not. Valid and non-valid requests both trigger -f to look at the filesystem to see if the file exists. Valid requests then look at the filesystem a second time to fetch that file. The two filesystem accesses make valid requests slightly slower. If the Condition were removed, all requests would look at the filesystem only once, and the file would either be served or Apache would generate a 404 error to say it didn't exist. There's a difference in the error message though: with the -f test present the error would say that "/this-stuff" does not exist, but without the -f test the error would be that "/this-stuff.htm" does not exist, exposing that you're using rewrites to static .htm files.
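For comparison, a sketch of Rule 7 with the -f Condition removed, trading the less revealing error message for a single filesystem access per request:

#7 Internally rewrite extensionless URL requests to .htm file
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[^.]+[^./]\ HTTP/
RewriteRule ^([^.]+[^./])$ /$1.htm [L]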

All the above is guesswork and I might have made an error in my thinking somewhere.

At some point, you'll renumber your blocks of rules. The convention I use is 11 onwards for rules that block access, 21 onwards for redirects, and 31 onwards for rewrites. I also subdivide into 11.a, 11.b, etc. where merited.

You've commented your code reasonably well, so it shouldn't be too hard to figure things out when you need to add something extra to the code several years in the future.
