Have you had a look at the Learning PHP - Books, Tutorials and Online Resources [webmasterworld.com] thread in our PHP Forum Library [webmasterworld.com]? There are some good REGEX links in there. Try writing it out first and if you get stuck, post what you have so far and we'll try to help you get over the issues.
Given:
{hey.yo.wat}
Your regex begins with:
\{hey\.yo\.wat\}
Now, how to grab the various references?
\{([A-Z]{0,3}\.)([A-Z]{0,2}\.?)([A-Z]{0,3})?\}
Doesn't cut it according to the requirements you posted, because it doesn't give a compound ref1 + ref2.
Can we get sub-references using standard regex syntax, illustrating:
\{(([A-Z]{0,3})\.([A-Z]{0,2})\.?)([A-Z]{0,3})?\}
Nope.
However, why not test for the combination in another way? (to paraphrase)
if ( \1 + "." + \2 + "." == "trueCondition")
Doing other, similar parsing operations to determine the values of the string's elements.
Here's one of many decent regex references: Regular-Expressions.info [regular-expressions.info]
Most engines (PHP, PERL, EditPad Pro, etc.) use the common syntax, so if you can figure the subreferences out, go nuts. I'd take a look at the logic of your program, first, though, IMHO ... :)
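One way to sketch the "test the combination another way" idea in PHP: capture each reference separately, then build the compound reference in code rather than asking the regex for it. The pattern and the 'hey.yo.' comparison value are assumptions for illustration, not the original poster's actual requirements.

```php
<?php
// Hypothetical pattern: two or three dot-separated references inside braces.
$pattern = '/\{([a-z]+)\.([a-z]+)(?:\.([a-z]+))?\}/i';
$text = '{hey.yo.wat}';

$compound = null;
if (preg_match($pattern, $text, $m)) {
    // Rebuild ref1 + ref2 in code instead of capturing it as a sub-reference.
    $compound = $m[1] . '.' . $m[2] . '.';
    if ($compound === 'hey.yo.') {
        echo "compound reference matched: $compound\n";
    }
}
```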
Can I cheat a bit and use named groups and a couple of extra regex statements?
$pattern =
'/(\{(([^.]*\.)[^.]*\.)([^\}]*)\})|(\{(?P<2>(?P<3>[^.]*\.))(?P<4>[^\}]*)\})/';
preg_match($pattern, $text, $matches);
In the case of {hey.yo.wat}
print_r($matches) yields
Array
(
[0] => {hey.yo.wat}
[1] => {hey.yo.wat}
[2] => hey.yo.
[3] => hey.
[4] => wat
)
In the case of {hey.yo}
print_r($matches) yields
Array
(
[0] => {hey.yo}
[1] =>
[2] => hey.
[3] => hey.
[4] => yo
[5] => {hey.yo}
[6] => hey.
[7] => hey.
[8] => yo
)
Unfortunately, if you want to actually use this with preg_replace, it won't work, because you end up with conflicts in the group indexing and it just chokes. What will work, though, is to split the pattern in two, test with preg_match, and then replace with preg_replace:
$pattern1 = '/\{(([^.]*\.)[^.]*\.)([^\}]*)\}/';
$pattern2 = '/\{(?P<1>(?P<2>[^.]*\.))(?P<3>[^\}]*)\}/';
// Use elseif: $pattern2 would also match {hey.yo.wat} and overwrite the first result.
if (preg_match($pattern1, $text))
{
    $newtext = preg_replace($pattern1, '$3-$2-$1', $text);
}
elseif (preg_match($pattern2, $text))
{
    $newtext = preg_replace($pattern2, '$3-$2-$1', $text);
}
In this case, because we don't have one capture group used up in defining our OR as in ()|(), we don't have to shift the index and can use numbers 1, 2 and 3.
echo "Replacing $text with : $newtext";
Replacing {hey.yo.wat} with : wat-hey.-hey.yo.
Replacing {hey.yo} with : yo-hey.-hey.
That gets you where you want to be, but it requires two regex calls if the first case is met, and three if it fails and goes to the second condition. I couldn't get it to work with just one regex call, but I'm sure someone who is a lot better at regex than I am could handle it.
Tom
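For what it's worth, a single call seems possible if the middle term is made optional inside a non-capturing group, so the group numbers stay the same in both cases. This is only a sketch: it assumes the terms never contain a dot or a closing brace, and swapRefs is a made-up helper name.

```php
<?php
// Group 1 = leading compound ref, group 2 = first term, group 3 = last term.
// (?: ... )? makes the middle term optional without adding a capture group.
function swapRefs(string $text): string
{
    $pattern = '/\{(([^.}]+\.)(?:[^.}]+\.)?)([^.}]+)\}/';
    return preg_replace($pattern, '$3-$2-$1', $text);
}

echo swapRefs('{hey.yo.wat}'), "\n"; // wat-hey.-hey.yo.
echo swapRefs('{hey.yo}'), "\n";     // yo-hey.-hey.
```

Strings with only one term, such as {hey}, do not match and are returned unchanged.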
$string = substr($string, 0, -1); // shave off }
$string = substr($string, 1); // shave off {
$array = explode('.', $string);
or some such thing.
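A slightly fuller sketch of the non-regex route, assuming the input is always a single brace-wrapped, dot-separated token (the variable names are illustrative):

```php
<?php
$string = '{hey.yo.wat}';

// trim() with a character list strips the braces from both ends in one call.
$parts = explode('.', trim($string, '{}'));

// Rebuild the compound reference from the first two terms, if present.
$compound = count($parts) >= 2 ? $parts[0] . '.' . $parts[1] . '.' : null;

print_r($parts);
```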
As for the regex, because I enjoy thinking these through...
First, I should have used + instead of * in my previous example.
More to the point, are you sure about your solution? As far as I can tell, it meets neither the conditions you laid out in your initial post nor those in message 5. For example, it matches {hey.yo} but not {hey.yo.wat}. Essentially, your regex says
"match all strings that begin with a curly brace, followed by one or more characters in the set "a-z0-9-_" (as few as possible), followed by one or more periods (again as few as possible); that is the first group. Then match one or more characters from the same set as a second group, followed by a closing curly brace".
'/\\{([a-z0-9\\-_]+?\\.+?)([a-z0-9\\-_]+?)\\}/si'
Your second + applies to the \. not to the entire capture group, so you get
{...yo} - no match (group one must start with at least one character from the set)
{hey.yo} - match
{hey..yo} - match
{hey.yo.wat} - no match!
Why don't you match the last one? Because your match works like this:
\{ - get a {. If found, match and move on.
( - start a capture group
[a-z0-9\\-_]+? - match one or more characters in the set [], taking as few as possible. Note that this is not the same as [a-z0-9\\-_]*: the ? after + only makes the quantifier lazy, it does not make the match optional. My earlier regex with * instead of + was wrong because * also allows zero characters.
\.+? - match one or more periods, as few as possible.
) - close our first capture group. Everything we've found is group one. Possible matches would be
hey.
hey...
These values would not match
....
hey
hey.yo
.yo
.you.
There is no repetition of the group, so we move on to the next group.
( - start group
[a-z0-9\\-_]+? - match one or more characters in the set [], as few as possible.
) - close our capture group. Possible values are
hey
yo
wat
The following would not match
yo.
.wat
\} - get a }.
So your regex as a whole matches {hey.yo} but not {hey.yo.wat} because the capture group with the \. is not repeated. So we can redo your regex so it matches. I have simplified it to checking for "not period" rather than an a-z range etc., but mostly I've changed the repetition so that it matches your criteria
\{([^\.]+\.)+([^\.]+)\}
This matches strings
- at least one group that is followed by a dot
- one and only one group that is not followed by a dot
The problem here is that you would only have two capture groups under all circumstances.
So {hey.yo} returns
1. hey.
2. yo
{hey.yo.wat} returns
1. yo.
2. wat
{hey.yo.dsf.wat} returns
1. dsf.
2. wat
Because when it goes around and rematches the group, it throws away the previous value until the match fails and it moves on. So that won't work either. If you only want the first and last terms in all circumstances, then you could use a pattern like
\{([^\.]+\.)([^\.]+\.)*([^\.]+)?\}
That would always give you
- the first set of chars that are followed by a dot
- the last set of chars that are followed by a dot
- the last set of chars provided there is no dot
{hey.yo} would return
1. hey.
2. (empty)
3. yo
{hey.yo.wat}
1. hey.
2. yo.
3. wat
{hey.yowat.poi.wodj.snred}
1. hey.
2. wodj.
3. snred
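The "last iteration wins" behaviour of a repeated capture group is easy to check with preg_match. This is a small sketch using the two patterns above (with the dot left unescaped inside the character class, which is equivalent); the variable names are illustrative.

```php
<?php
// A repeated capture group keeps only the text from its final iteration.
$repeated  = '/\{([^.]+\.)+([^.]+)\}/';
// First dotted term, last dotted term (if any), and the final term.
$firstLast = '/\{([^.]+\.)([^.]+\.)*([^.]+)?\}/';

preg_match($repeated, '{hey.yo.dsf.wat}', $a);
echo $a[1], ' ', $a[2], "\n"; // dsf. wat

preg_match($firstLast, '{hey.yowat.poi.wodj.snred}', $b);
echo $b[1], ' ', $b[2], ' ', $b[3], "\n"; // hey. wodj. snred
```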
A quick and easy way to test your regex against multiple conditions and see what matches and groups are being returned is to use the tool at
[fileformat.info...]
Have fun!
'/\\{([a-z0-9\\-_\\.]+?)(\\$)?([a-z0-9\\-_]+?)\\}/si'
is a bit cleaner; however, it's just that if there is no period, it continues...
That output is not what I wanted,
I got a bit lost in it all. I forgot about the "hey.yo" requirement somewhere along the line.
Anyway, you still have some problems in your regex.
1. Not a problem really, but note that "+?" is a lazy "+", not an optional one: it still requires one or more occurrences and just matches as few as possible. It is not the same as "*", which means zero or more occurrences.
2. Where this becomes a problem is that you have no requirement for matching anything in particular at the beginning of your regex, so your pattern matches fine with
{.....yo.wat}
and returns
0. {.....yo.wat}
1. .....yo.
2. wat
It also matches on
{hey......yo.wat}
and returns
0. {hey......yo.wat}
1. hey......yo.
2. wat
I think what you're really looking for is
\{(([a-z0-9\-_]+\.)+)([a-z0-9\-_]+)\}
This yields
{hey.yo}
0. {hey.yo}
1. hey.
2. hey.
3. yo
{hey.yo.wat}
0. {hey.yo.wat}
1. hey.yo.
2. yo.
3. wat
{hey.yo.dsa.wat}
0. {hey.yo.dsa.wat}
1. hey.yo.dsa.
2. dsa.
3. wat
NO MATCH ON ANY OF THE FOLLOWING
{...yo.wat}
{hey......yo.wat}
{hey.yo.wat.....}
{hey.}
{hey}
If the string is only something like {hey.yo}, you would get 'hey.' for the first and second backreferences and 'yo' would be the third backreference.
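A quick way to confirm those results is to run the pattern over a handful of the samples. This is a sketch; the $samples list and the $results array are just for illustration.

```php
<?php
// Group 1 = all dotted terms, group 2 = last dotted term, group 3 = final term.
$pattern = '/\{(([a-z0-9\-_]+\.)+)([a-z0-9\-_]+)\}/';

$samples = ['{hey.yo}', '{hey.yo.wat}', '{hey......yo.wat}', '{hey.}', '{hey}'];

$results = [];
foreach ($samples as $s) {
    // Keep only the capture groups; null marks a string that did not match.
    $results[$s] = preg_match($pattern, $s, $m) ? array_slice($m, 1) : null;
    echo $s, ' => ', $results[$s] ? implode(' | ', $results[$s]) : 'no match', "\n";
}
```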
Anyway, the problems with your regex that I noted in the previous post still exist. I couldn't guess, but I would suspect that the inefficiencies of being permissive would outweigh those of a second capture group that's thrown away. I wish I had a good regex profiler. It's an interesting question.
1. $y = preg_replace('/\{(([^\.]+\.)+)([^\.]+)\}/', '$1 :: $2 :: $3', $string);
2. $y = preg_replace('/\{(([a-z0-9\-_]+\.)+)([a-z0-9\-_]+)\}/', '$1 :: $2 :: $3', $string);
3. $y = preg_replace('/\{([a-z0-9\-_\.]+?\.+)([a-z0-9\-_]+?)\}/', '$1 :: $2', $string);
I got rid of the /si modifiers in #3 to even things out, but that actually didn't make much difference.
The differences between the methods were minuscule, but method 1 was consistently the fastest and method 3 the slowest.
Some typical numbers for 2 sets of 20K iterations of the preg_replace call on {hey.yo.wat}
1. 919ms
2. 970ms
3. 1001ms
I reversed the call order every other time to avoid the effects of invoking the regex engine and so on (in other words, I called each regex twice, once in the order 123 and then again in the order 321). I also called preg_replace() once before and once after the measurements to minimize any effects of overhead (i.e., caching or garbage collection).
So the difference is tiny. I ran this about 10 times and there were a couple of occasions where #2 was fastest, but that was before I put a preg_replace() before and after the measuring, which meant that #1 took an unfair hit being first and last. I got more consistent results once I did that.
The order is the same for {hey.yo.dsf.wan.top.wat} and there's about the same spread. With {hey.yo} the order could be anything and with a lot of variation (sometimes #1 was 50% slower than #3 and sometimes the exact opposite), so that's probably attributable to the fact that with such a small string, there are many other factors.
So in brief, I don't think performance or CPU time should be the reason for choosing one over the other, which is what I so often find when I benchmark things.
The grain of salt: Jatar_K has argued that when I benchmark things with so many iterations of such simple cases that it isn't necessarily that representative, because that's not how the code is used. Those who want to get into the more complex cases are welcome to 'em!
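The procedure described above (a warm-up call, many iterations, and running the candidates in both 1-2-3 and 3-2-1 order) might look roughly like this. It is a reconstruction under assumptions, not the script that produced the numbers quoted; benchPatterns is a made-up helper, and absolute timings will differ by machine and PHP version.

```php
<?php
// Times each [pattern, replacement] pair over $iterations calls, once in the
// given order and once reversed, and returns total milliseconds per label.
function benchPatterns(array $cases, string $subject, int $iterations): array
{
    $labels = array_keys($cases);
    $totals = array_fill_keys($labels, 0.0);
    preg_replace('/x/', 'y', $subject); // warm-up call, not measured

    foreach ([$labels, array_reverse($labels)] as $order) {
        foreach ($order as $label) {
            [$pattern, $replacement] = $cases[$label];
            $start = microtime(true);
            for ($i = 0; $i < $iterations; $i++) {
                preg_replace($pattern, $replacement, $subject);
            }
            $totals[$label] += (microtime(true) - $start) * 1000;
        }
    }
    return $totals;
}

$cases = [
    'method1' => ['/\{(([^\.]+\.)+)([^\.]+)\}/', '$1 :: $2 :: $3'],
    'method2' => ['/\{(([a-z0-9\-_]+\.)+)([a-z0-9\-_]+)\}/', '$1 :: $2 :: $3'],
    'method3' => ['/\{([a-z0-9\-_\.]+?\.+)([a-z0-9\-_]+?)\}/', '$1 :: $2'],
];
print_r(benchPatterns($cases, '{hey.yo.wat}', 20000));
```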
<?php
function microtime_float()
{
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}
$string = '{hey.yo.dude}';
$time_start = microtime_float();
preg_match_all('/\{(([^\.]+\.)+)([^\.]+)\}/', $string, $array);
$time_ends = microtime_float();
$time_dies = $time_ends-$time_start;
echo $time_dies."\n";
$time_news = microtime_float();
preg_match_all('/\{(([a-z0-9\-_]+\.)+)([a-z0-9\-_]+)\}/', $string, $arays);
$time_stop = microtime_float();
$time_twos = $time_stop-$time_news;
echo $time_twos."\n";
$time_begin = microtime_float();
preg_match_all('/\{([a-z0-9\-_\.]+?\.+)([a-z0-9\-_]+?)\}/', $string, $arrys);
$time_nots = microtime_float();
$time_ones = $time_nots-$time_begin;
echo $time_ones."\n";
?>
is what I used to benchmark. The middle set was always the slowest, the last set always the fastest.
I used the xdebug profiler which is designed for profiling PHP scripts. You just change a setting in your php.ini and it will output all kinds of profiling data.
So did you find that my [^\.] was fastest, then your version, then my other version? Or did you get the same order as I did in the end?
One thing that I was wondering about is whether there would be a Win/Lin difference. Since the PCRE engine is part of PHP not the OS, I wasn't sure (I would assume there would be a Win/Lin difference with the ereg functions).
I was wanting to run the xdebug profiler under Linux because then you can look at the data with KCacheGrind which will estimate clock cycles and everything, but I just didn't have time.
Anyway, in terms of matches, your latest matches (or fails) correctly for every case I could think to test.
As for the benchmark results, though, I think the complexity of the lookaheads and conditionals slows it down for simple searches. As the strings get longer, though, it seems to do better and better until it finally wins.
Same testing as before - 20K iterations for each trial, times in ms, testing with the xdebug profiling module compiled into PHP (er, actually this is still on Windows, so it's a .dll, not compiled in) and using WinCacheGrind to look at the results.
test string: {hey.yo.asdf.wer.ouod.wat}
trials: 6
regex2: 914, 865, 879, 854, 892, 851
regex3: 889, 848, 988, 891, 973, 815
regex2
- hi: 914
- low: 851
- difference: 63
- avg: 876
regex3
- hi: 988
- low: 815
- difference: 173
- avg: 901
test string: {hey.yo}
trials: 6
regex2: 850, 815, 831, 795, 862, 846
regex3: 905, 850, 927, 856, 910, 908
regex2
- hi: 862
- low: 795
- difference: 67
- avg: 833
regex3
- hi: 927
- low: 850
- difference: 77
- avg: 893
So on average we're talking only about 1-3ms per 1000 iterations (the averages differ by 25ms and 60ms per 20K). Pretty small difference. But it appears that the simpler search that throws away the second group is faster. I don't know about the impact on memory usage.
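For anyone who wants to re-check the hi/low/avg figures, the arithmetic can be sketched as follows. The timings are copied from the trials above; the 'long' and 'short' labels are only shorthand for the two test strings.

```php
<?php
// Trial timings in ms per 20K iterations, copied from the results above.
$trials = [
    'long regex2'  => [914, 865, 879, 854, 892, 851],
    'long regex3'  => [889, 848, 988, 891, 973, 815],
    'short regex2' => [850, 815, 831, 795, 862, 846],
    'short regex3' => [905, 850, 927, 856, 910, 908],
];

$summary = [];
foreach ($trials as $label => $ms) {
    $summary[$label] = [
        'hi'   => max($ms),
        'low'  => min($ms),
        'diff' => max($ms) - min($ms),
        'avg'  => (int) round(array_sum($ms) / count($ms)),
    ];
}
print_r($summary);

// Gap between the averages, scaled from 20K iterations down to 1000.
$gapLong  = ($summary['long regex3']['avg'] - $summary['long regex2']['avg']) / 20;
$gapShort = ($summary['short regex3']['avg'] - $summary['short regex2']['avg']) / 20;
echo "gap per 1000 iterations: $gapLong ms (long), $gapShort ms (short)\n";
```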
The more complex cases seem to favor regex2 by a narrower margin. That made me wonder what happens with
string:
{hey.yo.wat.we.asd.asdf.wer.dsfg.t3sd.324s.w43sd.6yh.6uh.7yuj.sad.dasdf.ase.dasdf.ed.faed.a}
There regex3 wins not only over regex2, but also over the originally fastest regex1.
regex3: 966, 941
regex1: 1091, 1081
regex2: 1133, 1118
At that point, the extra work of taking a second capture group and throwing it away seems to take more effort than the additional overhead of the more complex pattern.
I'm not sure any of this has any practical value, but it's an interesting exercise. I suppose the challenge now would be to find something faster than regex2 for the simplest cases and faster than regex3 for the longer strings.
[^\.] is the fastest because it is the least strict.
Is it because it's least restrictive or because it's just a very simple test? I think the latter. From a test point of view, it is the equivalent of looking for a single character. So instead of
(>= a and <= z) or (>= A and <= Z) or (>= 0 and <= 9) or = _ or = -
you just have
!= .
Like ".*" that's a very fast test, but of course you can only use it if you don't need to check the data for any other conditions.
Now I'm curious, but I think I've spent enough time benchmarking this, so I'll have to let that go until I'm really bored!