RegExp Help!

FlashPack

3:30 pm on May 3, 2010 (gmt 0)

I made a PHP script to read the "error_log" file, and I use the pattern "/\[.*\]/" to grab the time of the error.
It works fine if the logged error looks like this:

[03-May-2010 02:21:36] PHP Warning: mysql_fetch_array(): supplied argument is not a valid MySQL result resource in /home/user/public_html/example.com/core/mysql.class.php on line 65

where there is only one pair of brackets.
But if there are several pairs of brackets, as in the following, it doesn't work:

[02-May-2010 21:40:31] PHP Warning: framework::include(controllers/article.php) [<a href='function.framework-include'>function.framework-include</a>]: failed to open stream: No such file or directory in /home/user/public_html/example.com/core/framework.class.php on line 78
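The behaviour described in this post can be reproduced with a minimal sketch like the one below. It assumes nothing beyond preg_match; the two strings are shortened copies of the log lines above, and the variable names are made up for illustration.

```php
<?php
// Shortened copies of the two log lines from the post (paths trimmed).
$single = "[03-May-2010 02:21:36] PHP Warning: mysql_fetch_array(): supplied argument is not a valid MySQL result resource";
$multi  = "[02-May-2010 21:40:31] PHP Warning: framework::include(controllers/article.php) "
        . "[<a href='function.framework-include'>function.framework-include</a>]: failed to open stream";

// .* is greedy: it runs to the end of the line, then backs up to the LAST "]".
preg_match('/\[.*\]/', $single, $m1);
preg_match('/\[.*\]/', $multi, $m2);

echo $m1[0], "\n"; // [03-May-2010 02:21:36]
echo $m2[0], "\n"; // runs from the first "[" through the "]" after "</a>"
```

With a single bracket pair the last "]" is also the first one, which is why the pattern only appears to break on the second kind of line.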

Readie

3:41 pm on May 3, 2010 (gmt 0)

/\[(\d\d?\-[a-z]{3}\-[\d]{4})\s([\d]{2}:[\d]{2}:[\d]{2})\]/im

Should do it.

Back reference #1 is the date, and back reference #2 is the time.
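A short sketch of how the two back references might be read out with preg_match. The pattern is copied from the post; the variable names and the shortened log line are assumptions for illustration.

```php
<?php
// Shortened copy of the multi-bracket log line from the first post.
$line = "[02-May-2010 21:40:31] PHP Warning: framework::include(controllers/article.php) "
      . "[<a href='function.framework-include'>function.framework-include</a>]: failed to open stream";

$pattern = '/\[(\d\d?\-[a-z]{3}\-[\d]{4})\s([\d]{2}:[\d]{2}:[\d]{2})\]/im';

$date = null;
$time = null;
if (preg_match($pattern, $line, $m)) {
    $date = $m[1]; // first parenthesised group
    $time = $m[2]; // second parenthesised group
}

echo "date=$date time=$time\n"; // date=02-May-2010 time=21:40:31
```

Because the pattern spells out the exact date and time shape, the second bracket pair on the line cannot match it.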

FlashPack

3:51 pm on May 3, 2010 (gmt 0)

Thanks Readie

FlashPack

3:52 pm on May 3, 2010 (gmt 0)

May I ask what the "im" at the end of the pattern is for?

Readie

3:57 pm on May 3, 2010 (gmt 0)

i - case insensitive
m - multi-line (^ and $ match at the start and end of every line, not just of the whole string)
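A rough sketch of what the two modifiers change. The two-entry log string is a made-up sample, and a ^ anchor has been added to the pattern so that the effect of m is visible; the pattern posted earlier has no ^ or $, so for it only the i modifier makes a difference (it lets [a-z]{3} match "May").

```php
<?php
// Hypothetical two-entry log held in one string; note the differing case of the month.
$log = "[03-may-2010 02:21:36] PHP Warning: first\n"
     . "[02-MAY-2010 21:40:31] PHP Warning: second";

$pattern = '\[\d\d?-[a-z]{3}-\d{4}\s\d{2}:\d{2}:\d{2}\]';

// No modifiers: ^ matches only at the start of the string, and [a-z] is case-sensitive.
$plain = preg_match_all('/^' . $pattern . '/', $log);

// i alone: case no longer matters, but ^ still matches only at the start of the string.
$insensitive = preg_match_all('/^' . $pattern . '/i', $log);

// i and m together: ^ also matches right after each newline.
$both = preg_match_all('/^' . $pattern . '/im', $log);

echo "$plain $insensitive $both\n"; // 1 1 2
```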

rocknbil

5:05 pm on May 3, 2010 (gmt 0)

Many ways to do it . . . your bracket is the first item in the line, so this should work as well. Your problem was that .* means "zero or more of any character" (which includes ]) and it is greedy, so it slurped up the line all the way to the last ] found.

$this_time = preg_replace('/^\[([^\]]+)\]?.*/',"$1",$line);


An explanation:

() = This is what we will store in $1, just the time.

^ = in this context, denotes the beginning of a line which will always be the case in an error log.

\[ = the literal character [, see class below.

[^\]]+ = one or more of any character that is NOT a ]. The [] denotes a character class, and when ^ is the first character in a class, it means "anything NOT in this class." So this is what we store in (), just the time.

\] = the ending bracket.

? = a quantifier to stop "greedy pattern matching," so it doesn't slurp up the second instance (but the way it's matching, probably won't anyway.)

.* = zero or more of any character. The way it's matching, this may not be necessary***, especially since it's all being discarded (or not, if you store this for some association with the time).

You **may** need to add the multiline modifier,

'/^\[([^\]]+)\]?.*/m'

But I don't think so, error logs generally output on one line and only appear to be multiline in the buffer.

*** If you *want* to store the error,

$this_time = preg_replace('/^\[([^\]]+)\]?(.*)/',"$1 Error: $2",$line);

Should do nicely.

May contain errors, but that's the logic . . .
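One detail in the explanation above is worth checking by hand: as far as PCRE is concerned, a ? placed after a single token such as \] makes that token optional, while a ? placed after a quantifier (*? or +?) is what makes the quantifier lazy. The sketch below compares the variants; the shortened log line and variable names are assumptions for illustration.

```php
<?php
$line = "[02-May-2010 21:40:31] PHP Warning: include(x.php) [<a href='f'>f</a>]: failed";

// Negated class: stops at the first "]" with or without the trailing "?".
preg_match('/^\[([^\]]+)\]?/', $line, $a);
preg_match('/^\[([^\]]+)\]/', $line, $b);

// "?" after the quantifier makes .+ lazy, so it also stops at the first "]".
preg_match('/^\[(.+?)\]/', $line, $c);

// "?" after "\]" does not tame a greedy .+ ; it only makes the "]" optional.
preg_match('/^\[(.+)\]?/', $line, $d);

echo $a[1], "\n"; // 02-May-2010 21:40:31
echo $b[1], "\n"; // 02-May-2010 21:40:31
echo $c[1], "\n"; // 02-May-2010 21:40:31
echo $d[1], "\n"; // everything after the first "[", to the end of the line
```

In other words, it is the negated class [^\]]+ that keeps the match inside the first bracket pair.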

FlashPack

8:34 pm on May 4, 2010 (gmt 0)

Thanks rocknbil.
This pattern also works: /\[([\w\s\:\-]+)\]/
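A quick sketch of why this whitelist pattern picks out only the timestamp: the second bracket pair contains characters such as < and ' that are not in the class. The shortened log line and variable names are assumptions.

```php
<?php
$line = "[02-May-2010 21:40:31] PHP Warning: framework::include(controllers/article.php) "
      . "[<a href='function.framework-include'>function.framework-include</a>]: failed to open stream";

// Only word characters, whitespace, ":" and "-" are allowed between the brackets.
$count = preg_match_all('/\[([\w\s\:\-]+)\]/', $line, $m);

echo $count, "\n";   // 1
echo $m[1][0], "\n"; // 02-May-2010 21:40:31
```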

IanKelley

11:07 pm on May 4, 2010 (gmt 0)

Why not keep it simple and efficient?

/^\[([^]])\]/

This would match everything following the initial bracket except a closing bracket.

rocknbil

1:50 am on May 5, 2010 (gmt 0)

Yes it does, but per the O.P., it slurps up the entire line . . .

<?php
header('Content-Type: text/html');
$string = '[03-May-2010 02:21:36] PHP Warning: mysql_fetch_array(): blah blah [<a href=\'function.framework-include\'>function.framework-include</a>]:';
// works: prints only the date and time
$reg = '/^\[([^\]]+)\]?.*/';
$blah = preg_replace($reg, '$1', $string);
echo "One: '$reg' <br> $blah<br><br>";
// prints the entire line
$reg = '/^\[([^]])\]/';
$blah = preg_replace($reg, '$1', $string);
echo "Two: '$reg'<br> $blah<br><br>";
// save both strings in $1 and $2
$reg = '/^\[([^\]]+)\]?(.*)/';
$blah = preg_replace($reg, '$1 Error: $2', $string);
echo "Three, save error: '$reg'<br> $blah<br><br>";
?>

IanKelley

3:11 am on May 5, 2010 (gmt 0)

I'm confused by your reply; I think I must be misreading something.

Why are you using preg_replace, for one thing?

The example regexp I posted actually wouldn't work:

/^\[([^]])\]/


Because I forgot to include a + or *. So, to translate: the above requires an opening bracket at the beginning of the string, then exactly one character that is not a closing bracket, then a closing bracket.

It is certainly not capable of matching the entire string.

The following (only change being the addition of the *) does work:

$string = '[03-May-2010 02:21:36] PHP Warning: mysql_fetch_array(): blah blah [<a href=\'function.framework-include\'>function.framework-include</a>]:';

preg_match('/^\[([^]]*)\]/',$string,$out);

// $out[1] now = '03-May-2010 02:21:36';


I'm fairly sure it's the least processor intensive way to do it, for whatever that's worth.

rocknbil

6:26 pm on May 5, 2010 (gmt 0)

IanKelley wrote: "I'm confused by your reply"


Who, me, confusing? :-) The original problem is that

FlashPack wrote: "where there is only one pair of brackets but if there are many pair or brackets , like the following , it doesn't work :"


Greedy pattern matching causes the regexp to go all the way to the second pair of brackets, encompassing the entire line from the date bracket to the second bracket pair. That's the original problem. Running the code in the previous post clearly demonstrates the issue.

IanKelley wrote: "Why are you using preg_replace, for one thing?"


As an example of extracting ONLY the date or usage of preg to extract the date and other sub strings, and to demonstrate proper functioning of the regex.

IanKelley wrote: "The example regexp I posted actually wouldn't work:"


It does, but as said, it's because it's missing the quantifier which stops greedy pattern matching.

There are two differences between your mod

'/^\[([^]]*)\]/'

and mine

'/^\[([^\]]+)\]?.*/'

The first is that + means "one or more of the preceding" and * means "zero or more of the preceding", which will allow it to match on

[]

Which, of course, is an impossibility in an error log (I think . . . ). In the spirit of TMTOWTDI (there's more than one way to do it), your solution should be fine.

The second difference is the quantifier which stops greedy pattern matching, ?, and the .* which only becomes useful if you wish to store anything after the initial bracketed date, as in "Three, save error:", the last example above.

Always TMTOWTDI, always fun. :-)

IanKelley

7:41 pm on May 5, 2010 (gmt 0)

rocknbil wrote: "Always TMTOWTDI, always fun. :-)"


I agree there :-)

However, the expression I originally posted is non-greedy by definition, because a closing bracket stops it from matching. It cannot match past the first closing bracket it encounters.

rocknbil

1:21 am on May 6, 2010 (gmt 0)

<?php
header('Content-Type: text/html');
$string = '[03-May-2010 02:21:36] PHP Warning: mysql_fetch_array(): blah blah [<a href=\'function.framework-include\'>function.framework-include</a>]:';
$reg = '/^\[([^]]*)\]/';
$blah = preg_replace($reg, '$1', $string);
echo "Four using just *: '$reg'<br> $blah<br><br>";
?>

outputs

Four using just *: '/^\[([^]]*)\]/'
03-May-2010 02:21:36 PHP Warning: mysql_fetch_array(): blah blah [function.framework-include]:

Unless something's goofy with my server.

TheMadScientist

8:36 am on May 6, 2010 (gmt 0)

Looks like you have a set number of characters you want to match after that little opening brace, IDK, but if that's the standard 3 letter month format, maybe: '#^\[(.{20})#' would work too?

Readie

9:34 am on May 6, 2010 (gmt 0)

#^\[(.{20})#

I understand that the period character is to be avoided when doing regex as it has a fairly high overhead. A better choice there would be
/^\[([^\]]{20})/
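Both fixed-width suggestions can be compared with a sketch like this one. It assumes the timestamp is always exactly 20 characters long (a zero-padded day, a three-letter month, a four-digit year, and an HH:MM:SS time), which holds for the sample lines in this thread but is not verified here for every log format.

```php
<?php
$line = "[03-May-2010 02:21:36] PHP Warning: mysql_fetch_array(): supplied argument is not a valid MySQL result resource";

// Dot version: any 20 characters after the opening bracket.
preg_match('#^\[(.{20})#', $line, $dot);

// Negated-class version: 20 characters, none of which may be "]".
preg_match('/^\[([^\]]{20})/', $line, $neg);

echo strlen('03-May-2010 02:21:36'), "\n"; // 20
echo $dot[1], "\n"; // 03-May-2010 02:21:36
echo $neg[1], "\n"; // 03-May-2010 02:21:36
```

The negated-class version has one practical difference: it fails outright if a "]" turns up inside the first 20 characters, while the dot version would capture it.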

TheMadScientist

3:07 pm on May 6, 2010 (gmt 0)

Where did you get that?
It's the * and + expressions that make the . a bad option.

The reason for using a negated class is so that a greedy catch-all pattern stops at the delimiter rather than matching to the end of the line. The . should not be a problem here, because it only has to match the next 20 characters.
.* = high overhead.
.+ = high overhead.

(They're probably also the most used patterns.)

If you have some info I don't on the dot, please let me know, but it's the * and the + you want to avoid when you can and if not, then it's better to use a negative pattern than a dot.

Readie

3:34 pm on May 6, 2010 (gmt 0)

Hmm, perhaps I am remembering it out of context then.

* and + are fine if used in conjunction with ? to make them lazy.
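As an illustration of the lazy form, here is a sketch that applies .*? to the unanchored pattern from the first post. The shortened log line and variable names are assumptions.

```php
<?php
$line = "[02-May-2010 21:40:31] PHP Warning: framework::include(controllers/article.php) "
      . "[<a href='function.framework-include'>function.framework-include</a>]: failed to open stream";

// Lazy .*? expands one character at a time and stops at the first "]" it can.
preg_match('/\[(.*?)\]/', $line, $first);

// With preg_match_all, each bracket pair becomes its own match.
$count = preg_match_all('/\[(.*?)\]/', $line, $all);

echo $first[1], "\n";  // 02-May-2010 21:40:31
echo $count, "\n";     // 2
echo $all[1][1], "\n"; // <a href='function.framework-include'>function.framework-include</a>
```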

IanKelley

6:16 pm on May 6, 2010 (gmt 0)

rocknbil wrote: "Unless something's goofy with my server."


Not sure, it's a mystery to me. You'll note that the example code I posted above works as expected, and the 'everything except' (^) character class modifier is pretty straightforward.

Maybe it's something to do with preg_replace. I know it returns the unmodified string when no matches are found, which explains the output of the expression with the * missing earlier in the thread, but I have no idea what's going on with the latest version you posted.

rocknbil

7:19 pm on May 6, 2010 (gmt 0)

Did you run it, what did you get?

And it occurs to me, the possible misunderstanding of the intent: the original post shows the problem as "if it has a second set of brackets, it doesn't work." In a simple match, this will work, because it matches:

$reg = '/^\[([^]]*)\]/';

but by him/her saying "it doesn't work" it can only mean they are indeed trying to extract ONLY the date and it's pulling in the second set of brackets.

As mentioned, all that's needed is a quantifier --> ?

$reg = '/^\[([^]]*)\]?/';

Which bears a striking resemblance to

'/^\[([^\]]+)\]?.*/'

Zero or more (*) matches on [], one or more (+) can't.

IanKelley

7:40 pm on May 6, 2010 (gmt 0)

I did. Your code returns the entire string, and I have no idea why.

Though I didn't spend much time thinking about it... After all, the expression plainly cannot continue to match beyond a closing bracket. I highly doubt PHP has managed to make it to version 5+ with an incorrect implementation of Perl regular expressions :-)

It works perfectly using the preg_match sample code I posted. It also works perfectly when run from Perl.

I'm sure there's some simple explanation as to why it's not working with preg_replace. Anyone?

TheMadScientist

8:06 pm on May 6, 2010 (gmt 0)

'/^\[([^]]*)\]/' doesn't return the entire string for me...

It removes the opening and closing [] from around the time in the string, which is expected, because they are not part of the parenthesized pattern. The pattern matches the string inside the [] and then does not replace anything else in the string, because it does not match anything else in the string.

It's behaving exactly as I would expect it to when I'm running it. The pattern matches only this: an opening [ at the start of the string, then anything that is not ], then a closing ]. The [ ] are not part of the back-reference pattern, so they are left out of the output, but the date string is included because it's referenced. The rest of the string is included because there's nothing in the regex matching it, so it's left unchanged.
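The explanation above can be checked with a small sketch. It assumes only preg_replace and preg_match; the shortened log line and the variable names are invented for illustration.

```php
<?php
$line = "[03-May-2010 02:21:36] PHP Warning: mysql_fetch_array(): blah blah [x]:";
$pattern = '/^\[([^]]*)\]/';

// preg_replace swaps only the matched part ("[...]") for $1; the tail is untouched.
$replaced = preg_replace($pattern, '$1', $line);

// Appending .* makes the match cover the whole line, so only $1 survives.
$onlyDate = preg_replace('/^\[([^]]*)\].*/', '$1', $line);

// With no quantifier the pattern never matches, so the string comes back unchanged.
$unchanged = preg_replace('/^\[([^]])\]/', '$1', $line);

// preg_match just reports the capture and leaves the subject alone.
preg_match($pattern, $line, $m);

echo $replaced, "\n"; // 03-May-2010 02:21:36 PHP Warning: mysql_fetch_array(): blah blah [x]:
echo $onlyDate, "\n"; // 03-May-2010 02:21:36
echo $m[1], "\n";     // 03-May-2010 02:21:36
```

So when only the timestamp is wanted, either preg_match or a pattern that also consumes the rest of the line would seem to be the safer choice.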

Readie

8:13 pm on May 6, 2010 (gmt 0)

Scientist:

/^\[([^]]*)\]/

You have an unescaped closing square bracket in the character class there, does it work if you escape it?

I think you should also probably change the * to *?

TheMadScientist

8:22 pm on May 6, 2010 (gmt 0)

I was testing the pattern from earlier in this thread that was said to be behaving unexpectedly to see what results I got... It's not the pattern I would use anyway. I would use the one I posted earlier personally.

IanKelley

8:41 pm on May 6, 2010 (gmt 0)

Readie wrote: "You have an unescaped closing square bracket in the character class there, does it work if you escape it?"


As part of a class it doesn't need to be escaped.

IanKelley

8:43 pm on May 6, 2010 (gmt 0)

TheMadScientist wrote: "The rest of the string is included because there's nothing in the regex matching it, so it's left unchanged."


There's the "duh" explanation I was looking for, thanks :-)

Readie

9:34 pm on May 6, 2010 (gmt 0)

IanKelley wrote: "As part of a class it doesn't need to be escaped."

Hmm. I didn't know that, though I can't help but wonder: is there any chance for it to be misinterpreted by the engine due to ambiguity? I'd not like to have an unescaped one in there anyways because it makes it a little more confusing to read.
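On the ambiguity question, a sketch like the following may help. In PCRE, a ] is taken literally when it is the first character of a class (directly after [ or [^); anywhere later in the class it closes the class unless escaped. The sample strings and variable names are made up.

```php
<?php
$line = "[03-May-2010 02:21:36] PHP Warning: blah [x]:";

// "]" directly after "[^" is a literal member of the class ...
preg_match('/^\[([^]]*)\]/', $line, $bare);

// ... so the escaped spelling matches exactly the same text.
preg_match('/^\[([^\]]*)\]/', $line, $escaped);

// Later in the class, an unescaped "]" ends it:
// [^a]] is "one character other than a" followed by a literal "]".
preg_match('/[^a]]/', 'ab]', $m);

echo $bare[1], "\n";    // 03-May-2010 02:21:36
echo $escaped[1], "\n"; // 03-May-2010 02:21:36
echo $m[0], "\n";       // b]
```

Escaping the bracket is therefore a readability choice rather than a functional one in this position.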