regex efficiency with the ambiguous, greedy and promiscuous .*
Dideved

Msg#: 4566749 posted 11:33 am on Apr 19, 2013 (gmt 0)


System: The following 10 messages were cut out of thread at: http://www.webmasterworld.com/apache/4565193.htm [webmasterworld.com] by phranque - 5:43 pm on Apr 20, 2013 (utc -7)


> Rules with (.*) at the beginning or the middle of a pattern can be
> optimised a LOT.

A lot? The difference is less than a millisecond. It's less than even a microsecond. It's a textbook micro-optimization. When you tell people that this will make a big difference, you're blatantly misinforming them.

 

Dideved

Msg#: 4566749 posted 8:03 pm on Apr 19, 2013 (gmt 0)

> Here it's a big help if none of your directory names contain literal
> periods.

Ignoring literal periods doesn't help the problem. That pattern would actually be shorter and simpler -- and work correctly in all scenarios -- if you used .*. And the performance difference, for all practical purposes, is non-existent.

[edited by: incrediBILL at 11:24 pm (utc) on Apr 19, 2013]
[edit reason] Edit for TOS #4 [/edit]

incrediBILL

Msg#: 4566749 posted 11:46 pm on Apr 19, 2013 (gmt 0)

The difference is less than a millisecond. It's less than even a microsecond


Which can add up to a lot of microseconds when people run thousands of rules processing tens of thousands of visitors per hour. I'm all for any optimization possible, because good technique should always be encouraged over slow and sloppy code.

jdMorgan, the prior moderator of this forum, used to make optimizations that I thought weren't that special -- until I installed them on my high-volume site and could often notice a little extra snappiness all of a sudden.

Also consider that many sites run on big shared servers with hundreds or thousands of sites on a single machine, or now in the cloud. If the Apache files are optimized for all the sites on the server(s), it frees up a bit of extra resource across the board. Remember, most have huge .htaccess files with tons of bot-blocking rules, sometimes tons of redirects, etc., so any improvement is a good thing and the most optimal method is always best.

FYI, Google rates sites based on performance these days, so responding even a fraction of a second faster than the competition is an improvement that may make you outrank them, and it's not to be overlooked IMO.

Dideved

Msg#: 4566749 posted 12:11 am on Apr 20, 2013 (gmt 0)

> Which can add up to a lot of microseconds when people run thousands of
> rules processing tens of thousands of visitors per hour.

After a whole day of tens of thousands of visitors per hour, this optimization would have saved you a grand total of about 10 milliseconds... in total... for the whole day.
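
To make that arithmetic concrete, here's a back-of-the-envelope sketch in PHP. The ~10 ns per-match saving is an assumption based on Lucy's own "nanoseconds" description, and the traffic figure is hypothetical:

<?php
// Hypothetical figures: ~40,000 visitors/hour around the clock, one rewrite
// match per request, and an assumed saving of ~10 ns per match.
$requestsPerDay = 40000 * 24;   // 960,000 requests
$savingPerMatch = 10e-9;        // 10 nanoseconds, expressed in seconds
$savingPerDay   = $requestsPerDay * $savingPerMatch;

printf("Total saved per day: %.1f ms\n", $savingPerDay * 1000); // ~9.6 ms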

The prevailing wisdom of the web community is that micro-optimizations are a bad reason to complicate our code, and they're definitely a bad reason to introduce bugs. Lucy's and g1's patterns knowingly ignore valid scenarios in favor of this micro-optimization.

Of course, everyone is entitled to their opinion, but to try to justify that opinion with false information is bad for everyone. The claim that this is a big optimization is false, and to imply that it is a widely accepted practice is also false.

incrediBILL

Msg#: 4566749 posted 12:16 am on Apr 20, 2013 (gmt 0)

All I know is that jdMorgan, the previous moderator of this forum and an Apache guru second to none, pushed these optimizations, so if the guru said it made a difference I take that seriously. I know I've read those discussions a time or two, but I'm not digging them up at the moment.

Whether others use that syntax or not doesn't impress me at all, because most people don't study the code or the manual in detail; some get the finer points and others don't.

jdMorgan obviously had reasons, and he was the guy who originally published a more optimal WordPress .htaccess section here, one which skipped images and such -- a massive improvement over the stock garbage WP and Joomla still publish.

However, if I had the time to really profile the code and verify the savings were as infinitesimally small as you claim, then I'd probably drop the code if the implementation overly complicated it with no realistic gain.

Anyway, optimization isn't the topic so let's get back to the thread topic.

Dideved

Msg#: 4566749 posted 1:11 am on Apr 20, 2013 (gmt 0)

> However, if I had the time to really profile the code and verify the
> savings were as infinitesimally small as you claim...

Even Lucy herself used the word "nanoseconds" to describe the performance difference. It really is that infinitesimally small.

Dideved

Msg#: 4566749 posted 8:35 am on Apr 20, 2013 (gmt 0)

Hopefully some objective numbers from repeatable tests can put this issue to rest. (We can split this discussion off into another thread if you'd like.)

I tested in two ways: with Siege [linux.die.net], an HTTP stress tester, and with a run-of-the-mill PHP loop to fetch and benchmark.

The comparison is between a pattern such as this:

^([^/]+/)*foo\.html$

That's the kind of pattern lucy24/g1smd have frequently recommended, vs a pattern such as this:

^(.*/)?foo\.html$

The PHP test looks like this:

$nRequests = 100;

$start = microtime(true);
for ($i = 0; $i < $nRequests; $i++) {
    file_get_contents('http://localhost/foo.html');
}
$stop = microtime(true);

$totalTime = $stop - $start;                      // total seconds for all requests
$averageTimePerRequest = $totalTime / $nRequests; // seconds per request
$averageTimePerRequestInMicroseconds = $averageTimePerRequest * 1000000; // 1 s = 1,000,000 us

echo "$averageTimePerRequestInMicroseconds microseconds";


For the first pattern, the average time was 11.5 microseconds. For the second pattern, the average time was 11.775 microseconds. Not even one microsecond difference (pretty much the textbook definition of a micro-optimization).
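
To isolate the regex cost from the HTTP overhead, here's a minimal sketch that times preg_match directly. (PHP's preg functions use PCRE, the same regex library Apache uses; the sample paths and iteration count are made up.)

<?php
// Time the two candidate patterns directly, without HTTP in the way.
$patterns = [
    'segment-based' => '#^([^/]+/)*foo\.html$#',
    'dot-star'      => '#^(.*/)?foo\.html$#',
];
$samples    = ['foo.html', 'a/foo.html', 'a/b/c/foo.html', 'a/b/bar.html'];
$iterations = 100000;

foreach ($patterns as $name => $pattern) {
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        foreach ($samples as $path) {
            preg_match($pattern, $path);
        }
    }
    $elapsed = microtime(true) - $start;
    // Average cost of a single match, in nanoseconds.
    printf("%-14s %.1f ns/match\n", $name,
           $elapsed / ($iterations * count($samples)) * 1e9);
}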

Next, I ran Siege for one minute simulating 25 concurrent users (siege -c25 -t1M http://localhost/foo.html). The first pattern averaged 32.87 requests per second. The second pattern averaged 33.02 requests per second. Interestingly, in this test, the second pattern performed better.

Bottom line: Let's stop pretending this is an issue. This is one of the microest of micro optimizations ever. If we have to complicate our pattern, even just a little, then it's not worth it. If we have to introduce bugs, then it's _definitely_ not worth it.

phranque

Msg#: 4566749 posted 10:53 am on Apr 20, 2013 (gmt 0)

http://localhost/foo.html


there's minimal backtracking in your single test case.


^(.*)/(.*)/page-(.*).html$


this regular expression is one of several rules that may fire on every request in john28uk's actual configuration, and it would be a much more interesting and informative regular expression to test.
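
One way to see the backtracking cost would be a sketch along these lines; the deep, non-matching path is hypothetical, and the segment-anchored alternative shown is just one possible rewrite:

<?php
// On a long path that does NOT match, the multi-.* pattern must backtrack
// through many ways of splitting the string between its groups, while the
// segment-anchored version fails quickly because [^/]* cannot cross slashes.
$greedy   = '#^(.*)/(.*)/page-(.*)\.html$#';
$anchored = '#^([^/]*)/([^/]*)/page-([^/]*)\.html$#';

$path       = str_repeat('dir/', 25) . 'page-1.xhtml';  // matches neither pattern
$iterations = 5000;

foreach (['greedy' => $greedy, 'anchored' => $anchored] as $name => $p) {
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_match($p, $path);
    }
    printf("%-9s %.2f us/match\n", $name,
           (microtime(true) - $start) / $iterations * 1e6);
}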

Dideved

Msg#: 4566749 posted 4:24 pm on Apr 20, 2013 (gmt 0)

> there's minimal backtracking in your single test case.

This isn't the most crucial optimization you'll ever make. Don't knowingly ignore valid requests just to take advantage of this micro-optimization.

> ^(.*)/(.*)/page-(.*).html$

And what pattern am I comparing it against? I'm guessing you may want me to test it against this alternative:

^([^/]*)/([^/]*)/page-([^.]*).html$

But this is logically a different rule. If the behavior of the original was correct, then this one is buggy. We need to compare two rules that logically do the same thing.
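
To illustrate, here are the exact patterns from above run over a couple of hypothetical URLs where the two rules disagree:

<?php
// The alternative fails on paths with more than two segments, and on page
// names containing a period, both of which the original accepts.
$original    = '#^(.*)/(.*)/page-(.*).html$#';
$alternative = '#^([^/]*)/([^/]*)/page-([^.]*).html$#';

foreach (['a/b/page-1.2.html', 'a/b/c/page-1.html'] as $path) {
    printf("%-18s original: %d  alternative: %d\n", $path,
           preg_match($original, $path),
           preg_match($alternative, $path));
}
// Both sample paths: original -> 1, alternative -> 0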

[edited by: incrediBILL at 4:07 am (utc) on Apr 21, 2013]
[edit reason] TOS #4 - DISCUSS THE TOPIC - NOT THE MEMBERS [/edit]

phranque

Msg#: 4566749 posted 1:34 pm on Apr 27, 2013 (gmt 0)

Don't knowingly ignore valid requests just to take advantage of this micro-optimization.

the optimal regular expression will match all valid requests, a fact which is irrelevant to the test case.

Dideved

Msg#: 4566749 posted 11:05 pm on Apr 27, 2013 (gmt 0)

My comment about not knowingly ignoring valid requests was in response to some members, who shall remain nameless, whose patterns used [^.] to match paths and filenames even though they knew full well that a period is a perfectly legal character, one that appears even in real-life, high-profile URLs. Yet even after I pointed this out, they persisted in using that pattern because, it would seem, they consider this micro-optimization more important than a bug-free pattern.

MickeyRoush

Msg#: 4566749 posted 6:03 am on May 4, 2013 (gmt 0)

I was taught that if you don't have to use something greedy like .* then you shouldn't. Because if you use it incorrectly, like not providing an exit from a redirection, you'll always end up in a loop. Also, I really think it's strange when people do this:

^.*something.*$

Which would probably be better off just using this:

something

Why be greedy when you don't have to?
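
A minimal check that the two behave identically for plain matching (single-line test strings, made up for illustration):

<?php
// For a yes/no match on single-line input, ^.*something.*$ and a bare
// "something" accept exactly the same strings; the anchors and .* add nothing.
foreach (['XsomethingY', 'something', 'nothing here'] as $s) {
    printf("%-14s anchored: %d  bare: %d\n", $s,
           preg_match('#^.*something.*$#', $s),
           preg_match('#something#', $s));
}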

lucy24

Msg#: 4566749 posted 6:25 am on May 4, 2013 (gmt 0)

Why be greedy when you don't have to?

Words to live by :)

No, I don't know why the Apache docs repeatedly use such atrocious Regular Expressions in their examples. But they never claimed to be a RegEx tutorial.

Dideved

Msg#: 4566749 posted 8:19 am on May 4, 2013 (gmt 0)

> I was taught that if you don't have to use something greedy
> like .* then you shouldn't.

Though, this disagreement isn't actually about greedy vs non-greedy. The alternatives still use "*" and "+", both greedy quantifiers. The part they actually change is the ".". They would rather change it to something like [^.], even when doing so will introduce a bug.

> Because if you use it incorrectly, like not providing an
> exit from a redirection, you'll always end up in a loop.

This is actually a completely different rationale from what others here have expressed. A few people here advocate avoiding .* for performance reasons. Though, this turned out to be an insignificant micro-optimization.

Your reason for avoiding .*, however, seems to be that it can be used incorrectly. And I would agree that beginners often don't realize they've written themselves into a loop. But it seems to me that this should be a reason to emphasize the teaching of .* and rewrite conditions for beginners. It shouldn't be a reason to ban .* for veterans.

> Also, I really think it's strange when people do this:
> ^.*something.*$

I agree. There's no good reason for that. But if we're in a situation where .* is indeed the right tool for the job, then we shouldn't be afraid to use it.

[edited by: Dideved at 8:24 am (utc) on May 4, 2013]

Dideved

Msg#: 4566749 posted 8:22 am on May 4, 2013 (gmt 0)

> I don't know why the Apache docs repeatedly use such
> atrocious Regular Expressions in their examples.

The alternative explanation, of course, is that the people at Apache write perfectly good regular expressions, and that it's actually just a few people at these forums who became a wee bit obsessed with a micro-optimization.

lucy24

Msg#: 4566749 posted 9:05 pm on May 4, 2013 (gmt 0)

introduce a bug

IF
#1 the [^.] construction occurs within the htaccess packaged with a CMS
#2 the same CMS permits file and/or directory names containing literal periods
and
#3 the program is not produced or distributed by Microsoft*

THEN
one might legitimately speak of a bug in the program -- not in the Regular Expression, but in the distribution as a whole.


* Punch line suppressed because everyone already knows it

Dideved

Msg#: 4566749 posted 12:05 am on May 5, 2013 (gmt 0)

I don't understand what you're trying to say here. But here's the bottom line: If the goal is to match a path or filename, both of which can legitimately contain periods, then it's an error to match on [^.].

Dideved

Msg#: 4566749 posted 1:42 am on May 5, 2013 (gmt 0)

And to clarify: It isn't some CMS that "permits" literal periods. The Internet standards themselves are what permit literal periods in URL paths.

[ietf.org...]

DrDoc

Msg#: 4566749 posted 4:08 pm on May 6, 2013 (gmt 0)

^(.*)/(.*)/page-(.*).html$

vs

^([^/]*)/([^/]*)/page-([^.]*).html$


It should probably be:
^([^/]*)/([^/]*)/page-(.*).html$

Sometimes you can't avoid greedy match-all patterns. But when you can, you should.

Dideved

Msg#: 4566749 posted 6:16 pm on May 6, 2013 (gmt 0)

I agree, DrDoc, that whoever originally wrote that pattern probably intended to match single path segments, in which case you're right that [^/]* is correct and .* is not.

The issue that's the topic of this thread is that a few people advocate *always* avoiding .*, even when it's logically the correct pattern, and instead they often suggest [^.]*, even when matching paths and filenames, despite the fact that a path can legitimately contain periods.

Originally, they rationalized this by saying that avoiding .* was a big optimization. But benchmarks revealed that it's actually a micro-optimization at best. Lately, at least one person has switched to a different rationalization and is now saying that periods don't belong in the URL at all.

DrDoc

Msg#: 4566749 posted 8:09 pm on May 6, 2013 (gmt 0)

There's also a big difference between something being allowed and something being good practice.

Despite periods being allowed in path names and file names, one can argue whether it is a good idea to employ periods in the names. And, if one never uses periods in such names, [^.] becomes acceptable (and, indeed, preferred ... for that particular user or circumstance).

While I agree that micro optimizations may not always seem worth it, the greater issue here is being able to always write optimized regular expressions. There are some expressions which, when executed on a short string, run very quickly but when run in a different scenario take f.o.r.e.v.e.r to execute. The problem is the prevailing habit of copy-and-paste coupled with the mindset of "it worked in this instance, so it must work in the other".

I think it is important to always write optimized code/regex/whatever, simply because it is a good habit, even if the gain is not immediately (or ever) realized. There simply is no drawback to an optimized version (assuming it is in all other aspects identical in serving its purpose).

Dideved

Msg#: 4566749 posted 9:15 pm on May 6, 2013 (gmt 0)

> And, if one never uses periods in such names, [^.] becomes
> acceptable

True. But a few people don't consider the "if" part of your sentence. They use it as a general purpose pattern to match any path. For example, one poster wanted to match every request ending in ".html". (There was no mention of periods being forbidden.) The pattern someone had suggested to him was:

^([^.]+\.html)$

The person who offered this pattern claimed it would correctly capture all .html requests... but of course that's not true. It *won't* capture all .html requests. It's buggy. And the author did that deliberately, all in the name of an imperceptible, insignificant micro-optimization.
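
For the record, the failure mode is easy to demonstrate (the sample paths are hypothetical):

<?php
// ^([^.]+\.html)$ rejects any .html request whose path contains another
// period, even though such URLs are perfectly legal.
foreach (['foo/bar.html', 'foo/v1.2/bar.html', 'jquery-1.9.1.html'] as $path) {
    printf("%-20s matches: %d\n", $path,
           preg_match('#^([^.]+\.html)$#', $path));
}
// foo/bar.html -> 1; the two other perfectly legal .html URLs -> 0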

> I think it is important to always write optimized
> code/regex/whatever, simply for the fact that it is a good
> habit

Of course everyone is entitled to their opinion, but the prevailing wisdom of the web community is that micro-optimizations are, at best, a waste of the developer's time. If a micro-optimization adds any extra length or complexity to our code, then it goes from being a waste of time to a bad idea. If a micro-optimization requires that we deliberately introduce a bug, then it's *DEFINITELY* a bad idea.

Buggy behavior and added complexity certainly qualify as drawbacks.

lucy24

Msg#: 4566749 posted 10:03 pm on May 6, 2013 (gmt 0)

I don't understand what you're trying to say here.

... so let's just go ahead and disagree with it on principle.

Apache and Regular Expressions both place an extremely high premium on using the correct terminology and syntax. A single misplaced comma or unescaped space can bring the whole server crashing to the ground.

My issue here is with the recurring use of the terms "bug" or "buggy" as if they were synonymous with "leading to unintended results".

the prevailing wisdom of the web community

Ah, now we're getting somewhere. A statement of fact that can be investigated and thereby confirmed or refuted. All we need to do is (1) identify the "web community" and (2) locate some reliable sources for its "prevailing wisdom".

Oh, and (3) establish that "normal" in the statistical sense is equivalent to "normal" in the diagnostic sense.

Dideved

Msg#: 4566749 posted 10:54 pm on May 6, 2013 (gmt 0)

> My issue here is with the recurring use of the terms "bug"
> or "buggy" as if they were synonymous with "leading to
> unintended results".

So now we're debating the definition of a bug? How then do you define a bug?

If a pattern intends to, and claims to, capture all .html requests, but fails for certain URLs, what would you call that?

> Ah, now we're getting somewhere. A statement of fact that
> can be investigated and thereby confirmed or refuted. All we
> need to do is (1) identify the "web community" and (2)
> locate some reliable sources for its "prevailing wisdom".

Off the top of my head: Doug Crockford, widely known and widely respected in the JavaScript arena. The authors behind Zend and the authors behind Symfony. And even Steve Souders, the chief performance officer first at Yahoo and now at Google.

Dideved

Msg#: 4566749 posted 6:29 am on May 7, 2013 (gmt 0)

I'd be negligent if I left Knuth off that list. To my knowledge, he's the one who coined the widely known phrase:

"There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

Dideved

Msg#: 4566749 posted 10:26 am on Jun 29, 2013 (gmt 0)


System: The following 7 messages were cut out of thread at: http://www.webmasterworld.com/apache/4588670.htm [webmasterworld.com] and spliced on to this thread by phranque - 4:59 am on Jun 30, 2013 (utc -7)


(d) the expression .* should never be used in mid-pattern if you can possibly help it. If you can't get by with a simple
%{THE_REQUEST} index\.html
then use
/([^/]+/)*index\.html


:sigh:

For the benefit of the OP, it needs to be mentioned that -- although Lucy's other recommendations are all great -- this particular recommendation is not widely accepted. The use of .* is standard practice, and it appears frequently even in the official documentation.

I have the benefit of knowing that Lucy's motivation for discouraging .* is to save a few nanoseconds of performance. But since I'm apparently the only one around here who bothers to run benchmarks, I also have the benefit of knowing that her alternative isn't any faster. Not that a few nanoseconds would matter anyway.

[edited by: phranque at 12:04 pm (utc) on Jun 30, 2013]
[edit reason] clarify System tracks [/edit]

g1smd

Msg#: 4566749 posted 2:36 pm on Jun 29, 2013 (gmt 0)

Some years ago, jdMorgan showed that for servers under high load, these things did make a difference.

The fact that .* is "greedy, promiscuous and ambiguous" should serve as a warning that it is often not the right thing to use.

I'll repeat again. When designing a RegEx pattern, aim to have it parse cleanly from left to right in a single pass without any back-tracking.

Dideved

Msg#: 4566749 posted 3:15 pm on Jun 29, 2013 (gmt 0)

Some years ago, jdMorgan showed that for servers under high load, these things did make a difference.


You'd place more faith in one person's word than you would in objective, repeatable, verifiable tests?

The fact that .* is "greedy, promiscuous and ambiguous"...


That's not a fact. It's your opinion. And it's not a widely held opinion.

I'll repeat again. When designing a RegEx pattern, aim to have it parse cleanly from left to right in a single pass without any back-tracking.


So you've got repeatable and verifiable tests to back up your grandiose claims of improved performance, right? Because it seems like I'm the only one who's been fact-checking around here, and the facts don't back up your claims.

lucy24

Msg#: 4566749 posted 7:05 pm on Jun 29, 2013 (gmt 0)

That's not a fact.

It is a fact. The words "greedy", "promiscuous" and "ambiguous" are all technical terms with precise meanings.

For arcane historical reasons, a great deal of Regular Expressions terminology has to do with food, so for example one speaks of "flavors" rather than dialects.

Regular Expressions are greedy by nature. (The opposite term is mercifully "stingy" rather than, say, "finicky" ;)) The notation
.*a
therefore means
"capture as many characters as you possibly can, so long as there is an 'a' left over"
rather than
"capture some characters, stopping as soon as you hit the first 'a'".

You can look up "promiscuous" and "ambiguous" for yourself.

Lots of people get by with imperfect rules and slapdash patterns. It is much less common for someone to actively encourage and recommend doing things sloppily-- not simply because you're lazy and "it's good enough for me" but because you think it's better to do a half-### job.

When you're first composing a rule, it may take a little longer to devise the best possible format. (Remember, though, that a nanosecond for the server is in no way comparable to a nanosecond of human programming time.) This, however, pales by comparison with the absolutely colossal amount of time wasted in this forum in recent months, arguing over what ought to be ordinary common sense. Frankly I'm surprised I have not yet been asked to "justify" the recommendation to list conditions in order of likelihood-to-fail.

Edit:
Oh yes and...

There is one extremely powerful argument that could be made in favor of the ".*a" formulation. The fact that this argument has not been made tends to speak for itself.

Dideved

Msg#: 4566749 posted 9:56 pm on Jun 29, 2013 (gmt 0)

It is a fact. The words "greedy", "promiscuous" and "ambiguous" are all technical terms with precise meanings.


"Greedy" is a technical term that describes the quantifier *, not .* as a whole. And since your alternatives also use this greedy quantifier, it's not a compelling argument. "Promiscuous" and "ambiguous", on the other hand, are *not* technical terms. They are not used by the documentation, and they do not have any special, technical meaning for regular expressions. These words are only your and g1's opinionated adjectives, an opinion which is not shared by the wider web community nor by the creators of Apache itself.

Lots of people get by with imperfect rules and slapdash patterns. It is much less common for someone to actively encourage and recommend doing things sloppily-- not simply because you're lazy and "it's good enough for me" but because you think it's better to do a half-### job.


Your alternative patterns are unnecessarily complicated and frequently buggy. It's just plain stupid to believe that a more complicated and/or buggy pattern is better. All so you can gain -- at best -- a few insignificant nanoseconds. Sometimes your alternatives aren't even faster at all, which you would know if you ever bothered to run a benchmark.

And what's worse is that I've posted objective, repeatable, verifiable tests to demonstrate all this to you and g1, and both of you have just stuck your heads in the sand.
