
Forum Moderators: coopster & jatar k

preg match all - getting too many results?

using preg_match_all to get results from tags

     
1:53 pm on Sep 26, 2017 (gmt 0)

Junior Member


joined:Nov 29, 2015
posts:62
votes: 16


Hi all,

I'm not great at RegEx - but I'm trying :-)

My file is like this:

{pageName}Page 1{/}
{template}template1{/}
{title}Welcome to Page 1!{/}
{block1}This is block 1{/}
{block2}This is block 2{/}


And I've come up with this:

preg_match_all('/{(.*?)}(.*?){\/}/', $file, $result);


It kind of works - except it's creating a multidimensional array - and the first array[0] is the "complete match" - like this:

Array
(
[0] => Array
(
[0] => {pageName}Page 1{/}
[1] => {template}template1{/}
[2] => {title}Welcome to Page 1!{/}
[3] => {block1}This is block 1{/}
[4] => {block2}This is block 2{/}
)

[1] => Array
(
[0] => pageName
[1] => template
[2] => title
[3] => block1
[4] => block2
)

[2] => Array
(
[0] => Page 1
[1] => template1
[2] => Welcome to Page 1!
[3] => This is block 1
[4] => This is block 2
)

)


When really, I only want the matches from array[1] and array[2].

Do I just remove the first element from the array with array_shift($result)? Or have I done something wrong in the regex?

It seems a waste to have the server match everything, then just throw it away with array_shift?


Thanks for the help!
4:30 pm on Sep 26, 2017 (gmt 0)

Full Member


joined:Apr 11, 2015
posts: 306
votes: 21


Well, basically, that is just "the way it works". As with most regex flavours, group 0 is the "complete match", and that is what's being returned here in the 0-index of the $result array. The underlying PCRE regex engine returns all the groups and PHP "simply" packages them into an array.

You can change the structure of the returned array, but you will always get the "complete match" returned somewhere.

Personally, I've never known this to be a problem. It isn't really much of a "waste" unless the input data/match is huge - but then you'd probably be looking at alternative methods anyway if it really became a problem. If you specifically need an array with just the two capture-group sub-arrays, you can use array_shift() as you suggest.
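A minimal sketch of that approach, using the sample data from the first post (the array_combine() step is my addition - it turns the two remaining sub-arrays into a handy name => value map):

```php
<?php
// Sample file contents in the {name}value{/} format from the thread.
$file = "{pageName}Page 1{/}\n{template}template1{/}\n{title}Welcome to Page 1!{/}";

preg_match_all('/{(.*?)}(.*?){\/}/', $file, $result);

array_shift($result); // discard $result[0], the array of complete matches

// Keys from capture group 1, values from capture group 2.
$fields = array_combine($result[0], $result[1]);

print_r($fields); // pageName, template, title mapped to their values
```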
5:20 pm on Sep 26, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14426
votes: 576


{(.*?)}(.*?){\/}
Cue for me to run around screaming “Nooooo!”

I think what you actually mean is
{([^{}]*)}([^{}]*){\/}
Of course it's really [^}] the first time and [^{] the second time, but let's be double careful. Depending on where, exactly, you’re applying the RegEx, you may even want anchors around the whole thing:
^{([^{}]*)}([^{}]*){\/}$


Is it really possible for either of the first two pieces to be empty, while the whole thing remains legitimate?
{}blahblah{/}
{blahblah}{/}
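If the regex is applied to the whole file at once, the anchored version needs the m (multiline) modifier so ^ and $ match at each line boundary - a quick sketch with the sample data from the first post:

```php
<?php
// Sample file in the {name}value{/} format, one entry per line.
$file = "{pageName}Page 1{/}\n{template}template1{/}\n{title}Welcome to Page 1!{/}";

// /m makes ^ and $ anchor at line starts/ends, not just the string ends.
preg_match_all('/^{([^{}]*)}([^{}]*){\/}$/m', $file, $result);

print_r($result[1]); // the names:  pageName, template, title
print_r($result[2]); // the values: Page 1, template1, Welcome to Page 1!
```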
8:17 pm on Oct 4, 2017 (gmt 0)

Moderator

WebmasterWorld Administrator ergophobe is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 25, 2002
posts:8562
votes: 245


Actually, leebow's solution is safer than Lucy24's solutions for general use because it allows for internal braces.

For example, say your data looks like this

{block1}You can use a simple quantifier like \d{2} to say "match two digits" {/}


Leebow's solution will correctly match, returning block1 for the first capture group and You can use a simple quantifier like \d{2} to say "match two digits" for the second.

Lucy24's first solution will return 2 for $1 and to say "match two digits" for $2

Lucy24's more restrictive solution will fail to match at all and will return nothing.
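A quick sketch running all three patterns from the thread against that example line, so you can see the three outcomes side by side:

```php
<?php
$line = '{block1}You can use a simple quantifier like \d{2} to say "match two digits" {/}';

$patterns = [
    'lazy'     => '/{(.*?)}(.*?){\/}/',         // leebow's original
    'negated'  => '/{([^{}]*)}([^{}]*){\/}/',   // lucy24's first version
    'anchored' => '/^{([^{}]*)}([^{}]*){\/}$/', // lucy24's anchored version
];

$results = [];
foreach ($patterns as $name => $pattern) {
    // Record [$1, $2] on a match, or null when the pattern fails entirely.
    $results[$name] = preg_match($pattern, $line, $m) ? [$m[1], $m[2]] : null;
    echo $name . ': ' . var_export($results[$name], true) . "\n";
}
```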

I am guessing the scream "Nooo!" is because the general match and lazy operators are inefficient? On a modern regex engine, for a simple pattern like this, I doubt there's much difference in efficiency between the negated character class and the lazy operator. It amounts to about the same thing.

As things get more complex, however, lazy patterns, like lookaheads and lookbehinds, do get expensive and demand a lot of work from the regex engine. Effectively, a lazy pattern *is* a simple lookahead, which forces the engine to look ahead and backtrack, while a negated character class is a greedy expression that gobbles up what it finds as it goes along until it hits a roadblock, then just moves on in the evaluation.

So what to do?

Option 1: unless this seems to be slowing things down dramatically, leave well enough alone.

Option 2: get really complicated with your regular expressions and use the regex engine to the fullest. In that case you'll end up with something like this

{([^}]*)}((?:[^{]++|{(?!\/}))*+){\/}


{([^}]*)} - matches the opening delimiter and captures the contents as $1
((?:[^{]++|{(?!\/}))*+) - this is the complex stuff

(?:) - creates a non-capturing group, meaning it groups for purposes like alternation, but doesn't capture.
++ and *+ create possessive matches, meaning match until failure, but once you fail, don't give up matched characters and don't backtrack. Backtracking is the expensive part of regular expressions, so by limiting that, we gain efficiencies.
(?!) - creates a negative lookahead, meaning, match X not followed by Y

So if we start putting those together, starting from the center, we have
1. (?:[^{]++|{(?!\/})) - a non-capturing group that matches one or more non-{ characters possessively (no backtracking on failure), OR a { that is not followed by /}. It's non-capturing because a capturing group here would add an unwanted third capture group to the results.
2. We then apply *+ to #1, taking as many of those matches as we can get, chained together, without backtracking.

{\/} - matches your closing delimiter with no capture
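Putting the full pattern to work - a minimal sketch against the example line with internal braces from earlier in the thread:

```php
<?php
$line = '{block1}You can use a simple quantifier like \d{2} to say "match two digits" {/}';

// The "explicit greed" pattern: possessive runs of non-{ characters,
// or a { that isn't the start of the closing {/} delimiter.
preg_match('/{([^}]*)}((?:[^{]++|{(?!\/}))*+){\/}/', $line, $m);

echo $m[1] . "\n"; // the name - block1
echo $m[2] . "\n"; // the full value, internal \d{2} braces included
```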

Of course, your code is now almost impossible to read except by the most dedicated regex experts around. In my opinion, not worth it.

You can benchmark it if you want. I've benchmarked things like this in PHP and they tend to show tiny differences unless you are performing thousands of iterations. Rex over at the RexEgg site on regular expressions has benchmarked a very similar case: over 10,000 iterations the total difference was about 300ms, which works out to a few hundredths of a millisecond per iteration. The difference does increase with longer strings, but personally, I would need a very strong performance case to go down that road.
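If you do want to benchmark it yourself, something along these lines works - the patterns are the ones from this thread, the timing scaffold is mine, and the numbers will vary by machine and PHP version, so treat them as illustrative only:

```php
<?php
$line  = '{block1}This is block 1{/}';
$tests = [
    'lazy'       => '/{(.*?)}(.*?){\/}/',
    'negated'    => '/{([^{}]*)}([^{}]*){\/}/',
    'possessive' => '/{([^}]*)}((?:[^{]++|{(?!\/}))*+){\/}/',
];

$times = [];
foreach ($tests as $name => $pattern) {
    $start = microtime(true);
    for ($i = 0; $i < 10000; $i++) {
        preg_match($pattern, $line, $m);
    }
    // Elapsed wall-clock time in milliseconds for 10,000 iterations.
    $times[$name] = (microtime(true) - $start) * 1000;
    printf("%-10s %.1f ms\n", $name, $times[$name]);
}
```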

Still, it's fun to play with.
This regex tester will break it all down for you and you can see what happens when you change the regex or the string you're processing
[regex101.com...]

And Rex explains, as clearly as anyone can, his "explicit greed" technique
[rexegg.com...]
9:46 pm on Oct 4, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14426
votes: 576


In my opinion, not worth it.
Yes, I get a headache just looking at it :)

A lot of times, the exact wording of a rule will be influenced by what patterns might actually occur in your search string. (Simple example: If you know that your URLs never contain literal periods, then ^[^.]+ captures everything except the extension. But if literal periods are a possibility, it won't work.) Or, in this case: can either of the strings-to-be-captured contain braces? Come to think of it, the [^{}] locution might then be an asset:
{[^{}]*(?:{[^{}]*}[^{}]*)*}{/}$
and so on. Only don't quote me.

I'm still wondering about the null option implied by .* though. I could see the middle part being empty by accident, say if you forget to name a page--but the first part? Isn't some content essential there?
10:57 pm on Oct 4, 2017 (gmt 0)

Moderator

WebmasterWorld Administrator ergophobe is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 25, 2002
posts:8562
votes: 245


null option implied by .*


In general, I avoid .* if possible. If you don't want to match the zero case, don't use the *
3:56 am on Oct 5, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14426
votes: 576


Yeah, that was my point. If the null case represents an error, you shouldn't be matching for it at all, but instead require a .+ at minimum.

But, once again, so much depends on what actually occurs.
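For what it's worth, a tiny sketch of that last point: swapping * for + in the name group turns the empty-name case from a silent empty capture into a non-match.

```php
<?php
// Name group uses + instead of *, so it must contain at least one character.
$strict = '/{([^{}]+)}([^{}]*){\/}/';

var_dump((bool) preg_match($strict, '{pageName}Page 1{/}')); // a valid line
var_dump((bool) preg_match($strict, '{}blahblah{/}'));       // the empty-name error case
```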