PHP Regex help

Forum Moderators: coopster

darkmasta

4:56 am on Jan 26, 2005 (gmt 0)

I want a regex that would work as follows: I have a string like {hey.yo.wat}. The string is by itself and must always start with a { and end with a }. The first backreference would contain 'hey.yo.', the second 'yo.' and the third 'wat'. If the string is only something like {hey.yo}, you would get 'hey.' for the first and second backreferences and 'yo' would be the third backreference. How should I build the regex?

coopster

2:17 am on Jan 30, 2005 (gmt 0)

Welcome to WebmasterWorld, darkmasta.

Have you had a look at the Learning PHP - Books, Tutorials and Online Resources [webmasterworld.com] thread in our PHP Forum Library [webmasterworld.com]? There are some good REGEX links in there. Try writing it out first and if you get stuck, post what you have so far and we'll try to help you get over the issues.

darkmasta

1:33 am on Feb 2, 2005 (gmt 0)

No, I have a very complete understanding of PHP, down to the internal workings of the Zend Engine. It is simply my inability to write regexes that has made me post.

StupidScript

4:54 am on Feb 2, 2005 (gmt 0)

Interesting. I wonder about the structure of the query-to-follow. i.e. WHY would you want to break something like that up into regex? Wouldn't it be more simple to grab and parse the string?

Given:

{hey.yo.wat}

Your regex begins with:

\{hey\.yo\.wat\}

Now, how to grab the various references?

\{([A-Z]{0,3}\.)([A-Z]{0,2}\.?)([A-Z]{0,3})?\}

Doesn't cut it according to the requirements you posted, because it doesn't give a compound ref1 + ref2.

Can we get sub-references using standard regex syntax, illustrating:

\{(([A-Z]{0,3})\.([A-Z]{0,2})\.?)([A-Z]{0,3})?\}

Nope.

However, why not test for the combination in another way? (to paraphrase)

if ( \1 + "." + \2 + "." == "trueCondition")

Doing other, similar parsing operations to determine the values of the string's elements.

Here's one of many decent regex references: Regular-Expressions.info [regular-expressions.info]

Most engines (PHP, PERL, EditPad Pro, etc.) use the common syntax, so if you can figure the subreferences out, go nuts. I'd take a look at the logic of your program, first, though, IMHO ... :)

darkmasta

2:06 am on Feb 3, 2005 (gmt 0)

What I really need is just the first and the third. Essentially, the whole bunch of items in {hey.yo.waha.sa.s.df} should be split into hey.yo.waha.sa.s and df. That is really all I need. I don't need the second ref that I originally posted.

darkmasta

2:48 am on Feb 3, 2005 (gmt 0)

I made some success! \{([a-z0-9\-_\.]+?)([a-z0-9\-_]+?)\} does exactly what I want except for one thing.. If the subject is something like {BLOCK}, it makes one of the refs B and the other LOCK :( Help!
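
A minimal sketch of why {BLOCK} splits that way. The post shows no delimiters or flags, so the / delimiters and the /i flag below are assumptions, and split_token() is a hypothetical helper. Both groups use lazy quantifiers, so group 1 settles for a single character and group 2 has to stretch to reach the closing brace. When dots are present, group 2 cannot cross a dot, which forces group 1 to grow.

```php
<?php
// Pattern from the post above; the delimiters and the /i flag are assumed.
$pattern = '/\{([a-z0-9\-_\.]+?)([a-z0-9\-_]+?)\}/i';

// Hypothetical helper: returns [group1, group2], or null when nothing matches.
function split_token(string $pattern, string $subject): ?array
{
    if (preg_match($pattern, $subject, $m) !== 1) {
        return null;
    }
    return [$m[1], $m[2]];
}

// No dots: lazy group 1 takes one character, group 2 takes the rest.
$block = split_token($pattern, '{BLOCK}');

// With dots: group 2 cannot contain a dot, so group 1 must grow
// until only the last segment is left for group 2.
$dotted = split_token($pattern, '{hey.yo.wat}');

print_r($block);
print_r($dotted);
```

Requiring a literal dot at the end of group 1 is one way to make the no-dot case fail instead of splitting in an arbitrary place.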

ergophobe

2:56 am on Feb 3, 2005 (gmt 0)

Apologies in advance that this may cause some sidescroll on small monitors.

Can I cheat a bit and use named groups and a couple of extra regex statements?

$pattern =
'/(\{(([^.]*\.)[^.]*\.)([^\}]*)\})|(\{(?P<2>(?P<3>[^.]*\.))(?P<4>[^\}]*)\})/';

preg_match($pattern, $text, $matches);

In the case of {hey.yo.wat}

print_r($matches) yields

Array
(
[0] => {hey.yo.wat}
[1] => {hey.yo.wat}
[2] => hey.yo.
[3] => hey.
[4] => wat
)

In the case of {hey.yo}
print_r($matches) yields

Array
(
[0] => {hey.yo}
[1] =>
[2] => hey.
[3] => hey.
[4] => yo
[5] => {hey.yo}
[6] => hey.
[7] => hey.
[8] => yo
)

Unfortunately, if you want to actually use this with preg_replace, it won't work because you would end up with conflicts in the indexing and it just chokes. What will work, though, is to split the pattern in two and test with preg_match and then replace with preg_replace

$pattern1 = '/\{(([^.]*\.)[^.]*\.)([^\}]*)\}/';
$pattern2 = '/\{(?P<1>(?P<2>[^.]*\.))(?P<3>[^\}]*)\}/';

if (preg_match($pattern1, $text))
{
$newtext = preg_replace($pattern1, '$3-$2-$1', $text);
}

if (preg_match($pattern2, $text))
{
$newtext = preg_replace($pattern2, '$3-$2-$1', $text);
}

In this case, because we don't have one capture group used up in defining our OR as in ()|(), we don't have to shift the index and can use numbers 1, 2 and 3.

echo "Replacing $text with : $newtext";

Replacing {hey.yo.wat} with : wat-hey.-hey.yo.
Replacing {hey.yo} with : yo-hey.-hey.

That gets you where you want to be, but it requires two regex calls if the first case is met, and three if it fails and goes to the second condition. I couldn't get it to work with just one regex call, but I'm sure someone who is a lot better at regex than I am could handle it.

Tom

darkmasta

2:58 am on Feb 3, 2005 (gmt 0)

I was able to get it to work :)

darkmasta

3:17 am on Feb 3, 2005 (gmt 0)

btw, this was my pattern : '/\\{([a-z0-9\\-_]+?\\.+?)([a-z0-9\\-_]+?)\\}/si'

ergophobe

7:11 pm on Feb 4, 2005 (gmt 0)

Ahh, reading your message #5, you don't want a regex at all. You want this

$string = substr($string, 0, -1); // shave off }
$string = substr($string, 1); // shave off {

$array = explode('.', $string);

or some such thing.
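
A minimal sketch of that non-regex route, producing the two pieces asked for in message #5 (everything up to and including the last dot, then the last segment). The function name split_last_segment() and the decision to return null for input without braces or without a dot are assumptions made for illustration.

```php
<?php
// Hypothetical helper: splits "{a.b.c}" into ["a.b.", "c"] without a regex.
// Returns null when the braces are missing or there is no dot to split on.
function split_last_segment(string $token): ?array
{
    if (strlen($token) < 2 || $token[0] !== '{' || substr($token, -1) !== '}') {
        return null;
    }

    // Shave off the leading { and the trailing } in one call.
    $inner = substr($token, 1, -1);

    $parts = explode('.', $inner);
    if (count($parts) < 2) {
        return null; // e.g. {BLOCK}: nothing to split
    }

    // The last element is the final segment; the rest is the prefix.
    $last = array_pop($parts);
    return [implode('.', $parts) . '.', $last];
}

print_r(split_last_segment('{hey.yo.waha.sa.s.df}'));
```

Note that this does no validation of the individual segments, so empty segments such as {hey..yo} pass straight through.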

As for the regex, because I enjoy thinking these through...

First, I should have used + instead of * in my previous example.

More to the point, are you sure about your solution? It doesn't meet the conditions you laid out in your initial post and it doesn't meet the conditions you laid out in message 5 either as far as I can tell. For example, it matches {hey.yo} but not {hey.yo.wat}. Essentially, your regex says

"match all strings that begin with a curly brace which is followed by one or more characters in the range "a-z0-9-_" (as few as possible), followed by one or more periods (again as few as possible). If this group matches, meaning we have found some characters from our set and then at least one period, move on and try to match the next group. If we find one or more characters in our character set, match those in another group, which must be followed directly by the closing brace".

'/\\{([a-z0-9\\-_]+?\\.+?)([a-z0-9\\-_]+?)\\}/si'

Your second + applies to the \. not to the entire capture group, so you get
{...yo} - no match (the set needs at least one character before the first period)
{hey.yo} - match
{hey..yo} - match
{hey.yo.wat} - no match!
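
A short sketch for checking those four cases directly with preg_match(). The pattern is the one posted, written with single backslashes (equivalent to the doubled form inside a single-quoted PHP string); the $results array is just an illustrative way to collect the outcome.

```php
<?php
// darkmasta's pattern as posted, with the /s and /i flags kept as-is.
$pattern = '/\{([a-z0-9\-_]+?\.+?)([a-z0-9\-_]+?)\}/si';

$subjects = ['{...yo}', '{hey.yo}', '{hey..yo}', '{hey.yo.wat}'];

$results = [];
foreach ($subjects as $subject) {
    // preg_match() returns 1 on a match and 0 when nothing matches.
    $results[$subject] = preg_match($pattern, $subject, $m) === 1
        ? [$m[1], $m[2]]
        : null;
}

foreach ($results as $subject => $groups) {
    echo $subject, ' => ',
        $groups === null ? 'no match' : implode(' | ', $groups), "\n";
}
```

The lazy +? still requires at least one character, which is why {...yo} is rejected, and neither group can span a second dot-separated segment, which is why {hey.yo.wat} is rejected.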

Why don't you match the last one? Because your match works like this:

\{ - get a {. If found, match and move on.

( - start a capture group

[a-z0-9\\-_]+? - if we find one or more characters in the set [], match and move on. This is not the same as [a-z0-9\\-_]* because the ? after a quantifier makes it lazy (it takes as few characters as it can) rather than optional, so at least one character is still required. My earlier regex with * was wrong for the opposite reason: * really does allow zero characters.

\.+? - if we find one or more periods, match and move on. Again, the ? only makes the match lazy.

) close our first capture group. Everything we've found is group one. Possible matches would be
hey...
hey.

These values would not match
....
hey
hey.yo
.yo
.you.

There is no repetition of the group, so we move on to the next group.

( - start group

[a-z0-9\\-_]+? if we find one or more characters in the set [], match and move on.

) - close our capture group. Possible values are
hey
yo
wat

The following would not match
yo.
.wat

So your regex as a whole matches {hey.yo} but not {hey.yo.wat} because the capture group with the \. is not repeated. So we can redo your regex so it matches. I have simplified it to checking for "not period" rather than an a-z range etc., but mostly I've changed the repetition so that it matches your criteria

\{([^\.]+\.)+([^\.]+)\}

This matches strings
- at least one group that is followed by a dot
- one and only one group that is not followed by a dot

The problem here is that you would only have two capture groups under all circumstances.

So {hey.yo} returns
1. hey.
2. yo

{hey.yo.wat} returns
1. yo.
2. wat

{hey.yo.dsf.wat} returns
1. dsf.
2. wat

Because when it goes around and rematches the group, it throws away the previous value until the match fails and it moves on. So that won't work either. If you only want the first and last terms in all circumstances, then you could use a pattern like

\{([^\.]+\.)([^\.]+\.)*([^\.]+)?\}

That would always give you
- the first set of chars that are followed by a dot
- the last set of chars that are followed by a dot
- the last set of chars provided there is no dot

{hey.yo} would return
1. hey.
2. (empty)
3. yo

{hey.yo.wat}
1. hey.
2. yo.
3. wat

{hey.yowat.poi.wodj.snred}
1. hey.
2. wodj.
3. snred
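
A small sketch that reproduces the behaviour described above: a repeated capture group only keeps its last iteration, and an optional group in the middle that never ran comes back as an empty string from preg_match(). The variable names are illustrative only.

```php
<?php
// Repeated group: group 1 is overwritten on every iteration.
$repeated = '/\{([^\.]+\.)+([^\.]+)\}/';

// First segment, last dotted segment (if any), final segment.
$firstLast = '/\{([^\.]+\.)([^\.]+\.)*([^\.]+)?\}/';

preg_match($repeated, '{hey.yo.dsf.wat}', $a);
// $a[1] holds only the last "segment." the group matched.

preg_match($firstLast, '{hey.yo}', $b);
// Group 2 never ran, so $b[2] is an empty string.

preg_match($firstLast, '{hey.yowat.poi.wodj.snred}', $c);

print_r($a);
print_r($b);
print_r($c);
```

If every dotted segment is needed, wrapping the repeated group in an outer capture group is the usual way to keep the whole run.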

A quick and easy way to test your regex against multiple conditions and see what matches and groups are being returned is to use the tool at

[fileformat.info...]

Have fun!

darkmasta

4:38 am on Feb 5, 2005 (gmt 0)

That output is not what I wanted, I thank you greatly for your help in the matter however.

'/\\{([a-z0-9\\-_\\.]+?)(\\$)?([a-z0-9\\-_]+?)\\}/si'

is a bit cleaner; however, it's just that if there is no period, it continues...
What I want is this: {hey.yo} gives hey. and yo; {hey.das.yo} gives hey.das. together and yo separately. If I have {hey.das.go.as.yo}, I get hey.das.go.as. and yo.

darkmasta

4:47 am on Feb 5, 2005 (gmt 0)

Finally finished. I used an atomic group.

'/\\{([a-z0-9\\-_\\.]+?)(?>\\.+)(\\$)?([a-z0-9\\-_]+?)\\}/si'

It could probably be done more elegantly.

darkmasta

5:00 am on Feb 5, 2005 (gmt 0)

Atomic group did not like me much. Final version: '/\\{([a-z0-9\\-_\\.]+?\\.+)([a-z0-9\\-_]+?)\\}/si'

It does it all :) It might be inefficient in one way or another, but it is more efficient than my original creation.

ergophobe

5:00 pm on Feb 5, 2005 (gmt 0)

"That output is not what I wanted,"

I got a bit lost in it all. I forgot about the "hey.yo" requirement somewhere along the line.

Anyway, you still have some problems in your regex.

1. Not a problem really, but note that "+?" is not the same as "*". "+" is one or more occurrences, and the trailing "?" makes it lazy (match as few characters as possible); it does not make the match optional.

2. Where this becomes a problem is that the dot is inside your first character class and you have no requirement for matching anything in particular at the beginning of your regex, so your pattern matches fine with

{.....yo.wat}
and returns
0. {.....yo.wat}
1. .....yo.
2. wat
It also matches on
{hey......yo.wat}
and returns
0. {hey......yo.wat}
1. hey......yo.
2. wat

I think what you're really looking for is

\{(([a-z0-9\-_]+\.)+)([a-z0-9\-_]+)\}

This yields

{hey.yo}
0. {hey.yo}
1. hey.
2. hey.
3. yo

{hey.yo.wat}
0. {hey.yo.wat}
1. hey.yo.
2. yo.
3. wat

{hey.yo.dsa.wat}
0. {hey.yo.dsa.wat}
1. hey.yo.dsa.
2. dsa.
3. wat

NO MATCH ON ANY OF THE FOLLOWING

{...yo.wat}
{hey......yo.wat}
{hey.yo.wat.....}
{hey.}
{hey}
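
The lists above can be checked with a short script. This is only a sketch: capture_groups() is a hypothetical helper, and the pattern is used without the /i flag, exactly as written above, so it only accepts lowercase input.

```php
<?php
// Outer group 1 keeps the whole dotted prefix; inner group 2 keeps only
// the last "segment." it matched; group 3 is the final segment.
$pattern = '/\{(([a-z0-9\-_]+\.)+)([a-z0-9\-_]+)\}/';

// Hypothetical helper: returns [group1, group2, group3] or null.
function capture_groups(string $pattern, string $subject): ?array
{
    return preg_match($pattern, $subject, $m) === 1 ? array_slice($m, 1) : null;
}

$accepted = ['{hey.yo}', '{hey.yo.wat}', '{hey.yo.dsa.wat}'];
$rejected = ['{...yo.wat}', '{hey......yo.wat}', '{hey.yo.wat.....}', '{hey.}', '{hey}'];

foreach (array_merge($accepted, $rejected) as $subject) {
    $groups = capture_groups($pattern, $subject);
    echo $subject, ' => ',
        $groups === null ? 'no match' : implode(' | ', $groups), "\n";
}
```

Because every segment must contain at least one character from the class before its dot, empty segments are rejected anywhere in the string.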

darkmasta

8:04 pm on Feb 5, 2005 (gmt 0)

All of those number twos are useless though. They are a waste of clock cycles and RAM.

ergophobe

11:11 pm on Feb 5, 2005 (gmt 0)

Uhh... From message #1

"If the string is only something like {hey.yo}, you would get 'hey.' for the first and second backreferences and 'yo' would be the third backreference"

Anyway, the problems with your regex that I noted in the previous post still exist. I couldn't guess, but I would suspect that the inefficiencies of being permissive would outweigh those of a second capture group that's thrown away. I wish I had a good regex profiler. It's an interesting question.

ergophobe

1:00 am on Feb 6, 2005 (gmt 0)

I benchmarked the following three regex using xdebug and WinCacheGrind:

1. $y = preg_replace('/\{(([^\.]+\.)+)([^\.]+)\}/', '$1 :: $2 :: $3', $string);

2. $y = preg_replace('/\{(([a-z0-9\-_]+\.)+)([a-z0-9\-_]+)\}/', '$1 :: $2 :: $3', $string);

3. $y = preg_replace('/\{([a-z0-9\-_\.]+?\.+)([a-z0-9\-_]+?)\}/', '$1 :: $2', $string);

I got rid of the /si modifiers in #3 to even things out, but that actually didn't make much difference.

The differences were minuscule between the methods, but method 1 was consistently the fastest and method 3 the slowest.

Some typical numbers for 2 sets of 20K iterations of the preg_replace call on {hey.yo.wat}

1. 919ms
2. 970ms
3. 1001ms

I reversed the call order every other time to avoid the effects of invoking the regex engine and so on (in other words, I called each regex twice, once in the order 123 and then again 321). I also called preg_replace() once before and once after the measurements to minimize any effects of overhead (i.e. caching or garbage collection).

So the difference is tiny. I ran this about 10 times and there were a couple of occasions where #2 was fastest, but that was before I put a preg_replace() before and after the measuring, which meant that #1 took an unfair hit being first and last. I got more consistent results once I did that.

The order is the same for {hey.yo.dsf.wan.top.wat} and there's about the same spread. With {hey.yo} the order could be anything and with a lot of variation (sometimes #1 was 50% slower than #3 and sometimes the exact opposite), so that's probably attributable to the fact that with such a small string, there are many other factors.

So in brief, I don't think performance or CPU time should be the reason for choosing one over the other as I so often find when I benchmark things.

The grain of salt: Jatar_K has argued that when I benchmark things with so many iterations of such simple cases that it isn't necessarily that representative, because that's not how the code is used. Those who want to get into the more complex cases are welcome to 'em!

darkmasta

2:14 pm on Feb 6, 2005 (gmt 0)

<?php
function microtime_float()
{
  list($usec, $sec) = explode(" ", microtime());
  return ((float)$usec + (float)$sec);
}
$string = '{hey.yo.dude}';
$time_start = microtime_float();
preg_match_all('/\{(([^\.]+\.)+)([^\.]+)\}/', $string, $array);
$time_ends = microtime_float();
$time_dies = $time_ends-$time_start;
echo $time_dies."\n";
$time_news = microtime_float();
preg_match_all('/\{(([a-z0-9\-_]+\.)+)([a-z0-9\-_]+)\}/', $string, $arays);
$time_stop = microtime_float();
$time_twos = $time_stop-$time_news;
echo $time_twos."\n";
$time_begin = microtime_float();
preg_match_all('/\{([a-z0-9\-_\.]+?\.+)([a-z0-9\-_]+?)\}/', $string, $arrys);
$time_nots = microtime_float();
$time_ones = $time_nots-$time_begin;
echo $time_ones."\n";
?>

is what I used to benchmark. The middle set was always the slowest, the last set always the fastest.
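
A single timed call per pattern is very noisy, so here is a hedged sketch of a loop-based variant along the lines described earlier in the thread (20K iterations, a warm-up call, and running the patterns in both orders). The helper time_pattern() and the labels in $patterns are made up for illustration, and microtime(true) is used in place of the hand-rolled microtime_float().

```php
<?php
// Hypothetical helper: seconds taken by $iterations preg_match_all() calls.
function time_pattern(string $pattern, string $subject, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_match_all($pattern, $subject, $unused);
    }
    return microtime(true) - $start;
}

$subject  = '{hey.yo.dude}';
$patterns = [
    'negated'    => '/\{(([^\.]+\.)+)([^\.]+)\}/',
    'ranged'     => '/\{(([a-z0-9\-_]+\.)+)([a-z0-9\-_]+)\}/',
    'permissive' => '/\{([a-z0-9\-_\.]+?\.+)([a-z0-9\-_]+?)\}/',
];

// Warm-up: let each pattern compile once before anything is measured.
foreach ($patterns as $pattern) {
    preg_match_all($pattern, $subject, $unused);
}

// Run in forward and then reverse order so no pattern is always first.
$names  = array_keys($patterns);
$totals = array_fill_keys($names, 0.0);
foreach ([$names, array_reverse($names)] as $order) {
    foreach ($order as $name) {
        $totals[$name] += time_pattern($patterns[$name], $subject, 20000);
    }
}

foreach ($totals as $name => $seconds) {
    printf("%-10s %.1f ms\n", $name, $seconds * 1000);
}
```

The absolute numbers depend on the machine and PHP build, so only the relative order over several runs says anything.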

darkmasta

2:17 pm on Feb 6, 2005 (gmt 0)

Nope, '/\{(([^\.]+\.)+)([^\.]+)\}/' is the fastest... But it can take anything you give it. I can do {\.her.she}, that would be wrong... very wrong...

ergophobe

6:40 pm on Feb 6, 2005 (gmt 0)

Yeah, that's why I said you could only use it if you were sure of your data and didn't need any sort of validation.

If you do, you need your range with my pattern. That's the second fastest and is the most restrictive (it doesn't allow hey...yo.wat) as your previous version did.

ergophobe

7:48 pm on Feb 6, 2005 (gmt 0)

Interesting. I didn't notice your first benchmarking result.

I used the xdebug profiler which is designed for profiling PHP scripts. You just change a setting in your php.ini and it will output all kinds of profiling data.

So did you find that my [^\.] was fastest, then your version, then my other version? Or did you get the same order as I did in the end?

One thing that I was wondering about is whether there would be a Win/Lin difference. Since the PCRE engine is part of PHP not the OS, I wasn't sure (I would assume there would be a Win/Lin difference with the ereg functions).

I was wanting to run the xdebug profiler under Linux because then you can look at the data with KCacheGrind which will estimate clock cycles and everything, but I just didn't have time.

darkmasta

5:47 am on Feb 8, 2005 (gmt 0)

[^\.] is the fastest because it is the least strict.

darkmasta

5:56 am on Feb 8, 2005 (gmt 0)

\{([^\.][a-z0-9\-_\.]+?\.+)([a-z0-9\-_]+?)\}, will protect me from {..het.sad.sdas}, however, it will not save me from {hey..arg.ssda}...

mincklerstraat

6:27 am on Feb 8, 2005 (gmt 0)

Interesting thread here, darkmasta and ergophobe. Thanks for the heads-up re. wincachegrind, ergophobe. You know, I googled it and only about four links show, your thread here being third.

darkmasta

11:54 pm on Feb 9, 2005 (gmt 0)

{(?<!\.)([a-z0-9\-_\.]+\.+)([a-z0-9\-_]+)\} is better than my last one as it does not constantly check the beginning. However, I still have the previous {haey.sad..asd} issue

darkmasta

12:08 am on Feb 10, 2005 (gmt 0)

{(?<!\.)([a-z0-9\-_\.]+\.)([a-z0-9\-_]+)\} even more efficient..

darkmasta

12:12 am on Feb 10, 2005 (gmt 0)

{(?<!\.)([a-z0-9\-_\.]+(?<!\.)\.)([a-z0-9\-_]+)\} Woot, this is perfect. I can't see any inefficiencies.. I heart it. :) Any suggestions? Criticisms?

darkmasta

12:32 am on Feb 10, 2005 (gmt 0)

Damn, why can't I get this correct!
{(?!\.)([a-z0-9\-_\.]+(?<!\.)\.)([a-z0-9\-_]+)\}

{.ha.asd} fails
{hey..asdsa.asd} passes
{hey.sad..das} fails

I want the middle one to fail, I have to somehow get it outside of the character class... Help!

darkmasta

12:49 am on Feb 10, 2005 (gmt 0)

{(?!\.|[a-z0-9\-_]+\.\.)([a-z0-9\-_\.]+(?<!\.)\.)([a-z0-9\-_]+)\}

This should work, how many times have I said that anyway? Regardless, it should, and from all the checks I did, it does. I would think that there would be a more efficient way of doing this however..
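
A quick sketch for exercising that pattern against the cases from the last few posts. The / delimiters, the escaped opening brace and the /i flag are assumptions, and try_pattern() is a hypothetical helper. One input that seems to slip through is a doubled dot that sits after the first segment but not directly before the last one, since the lookahead only inspects the first segment.

```php
<?php
// Pattern from the post above; delimiters, \{ escaping and /i are assumed.
$pattern = '/\{(?!\.|[a-z0-9\-_]+\.\.)([a-z0-9\-_\.]+(?<!\.)\.)([a-z0-9\-_]+)\}/i';

// Hypothetical helper: returns [prefix, last] or null when nothing matches.
function try_pattern(string $pattern, string $subject): ?array
{
    return preg_match($pattern, $subject, $m) === 1 ? [$m[1], $m[2]] : null;
}

$subjects = [
    '{hey.yo}',
    '{hey.das.go.as.yo}',
    '{.ha.asd}',        // leading dot: rejected by the lookahead
    '{hey..asdsa.asd}', // doubled dot after the first segment: rejected
    '{hey.sad..das}',   // doubled dot before the last segment: rejected
    '{a.b..c.d}',       // doubled dot in the middle: still accepted
];

foreach ($subjects as $subject) {
    $groups = try_pattern($pattern, $subject);
    echo $subject, ' => ',
        $groups === null ? 'no match' : implode(' | ', $groups), "\n";
}
```

If empty segments must be rejected everywhere, a repeated segment group of the form ([a-z0-9\-_]+\.)+ avoids the need for lookarounds altogether.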

ergophobe

6:57 pm on Feb 10, 2005 (gmt 0)

heh heh.... Just curious. Having gone through all this, how many iterations of this regex do you expect to do per request?

Anyway, in terms of matches, your latest matches (or fails) correctly for every case I could think to test.

As for the benchmark results, though, I think the complexity of the lookaheads and conditionals slows it down for simple searches. As the strings get longer, though, it seems to do better and better until it finally wins.

Same testing as before - 20K iterations for each trial, times in ms, testing with the xdebug profiling module compiled into PHP (er, actually this is still on Windows, so it's a .dll, not compiled in) and using WinCacheGrind to look at the results.

test string: {hey.yo.asdf.wer.ouod.wat}
trials: 6
regex2: 914, 865, 879, 854, 892, 851
regex3: 889, 848, 988, 891, 973, 815

regex2
- hi: 914
- low: 851
- difference: 63
- avg: 876

regex3
- hi: 988
- low: 815
- difference: 173
- avg: 901

test string: {hey.yo}
trials: 6
regex2: 850, 815, 831, 795, 862, 846
regex3: 905, 850, 927, 856, 910, 908

regex2
- hi: 862
- low: 795
- difference: 67
- avg: 833

regex3
- hi: 927
- low: 850
- difference: 77
- avg: 893

So on average we're talking only about 1-3ms per 1000 iterations (differences of 25ms and 60ms over 20K iterations). Pretty small difference. But it appears that the simpler search that throws away the second group is faster. I don't know about the impact on memory usage.

The more complex cases seem to favor regex2 by a narrower margin. That made me wonder what happens with

string:

{hey.yo.wat.we.asd.asdf.wer.dsfg.t3sd.324s.w43sd.6yh.6uh.7yuj.sad.dasdf.ase.dasdf.ed.faed.a}

There regex3 wins not only over regex2, but also over the originally fastest regex1.

regex3: 966, 941
regex1: 1091, 1081
regex2: 1133, 1118

At that point, the extra work of taking a second capture group and throwing it away seems to take more effort than the additional overhead of the more complex pattern.

I'm not sure any of this has any practical value, but it's an interesting exercise. I suppose the challenge now would be to find something faster than regex2 for the simplest cases and faster than regex3 for the longer strings.

"[^\.] is the fastest because it is the least strict."

Is it because it's least restrictive or because it's just a very simple test? I think the latter. From a test point of view, it is the equivalent of looking for a single character. So instead of

(>= a and <= z) or (>= A and <= Z) or (>= 0 and <= 9) or = _ or = -

you just have

!= .

Like ".*" that's a very fast test, but of course you can only use it if you don't need to check the data for any other conditions.

Now I'm curious, but I think I've spent enough time benchmarking this, so I'll have to let that go until I'm really bored!
