help with regexp - smallest match in string - Perl Server Side CGI Scripting forum at WebmasterWorld

g00dlife

2:38 am on Mar 12, 2003 (gmt 0)

I need some help with a simple regexp.

Say I have a string, for example:

$str = "start start start match1 finish finish finish";

I would like to extract the smallest string that is bounded by "start" and "finish", i.e. in this case I would like to extract the string "match1".

My first guess was to use:

$pat = "start (.*?) finish";

However, this captures the string "start start match1".

I know I can use the pattern:

$pat = ".*start (.*?) finish";

and get what I want; however, there's a performance issue with using the greedy matcher .* at the start of the pattern. (This is not important, but I'm actually tackling a problem caused by the Java package Jakarta ORO: big performance issues when using combinations of greedy and non-greedy matches on large strings.)

Is there another single regular expression that will extract "match1" from the above $str when you only know it's bounded by "start" and "finish"?
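
To make the difference concrete, here is a minimal Perl sketch of the two patterns above run against the sample string. The variable names $lazy and $prefixed are illustrative only.

```perl
use strict;
use warnings;

my $str = "start start start match1 finish finish finish";

# Lazy capture alone: the match begins at the FIRST "start", so the
# capture still contains the later "start" words.
my ($lazy) = $str =~ /start (.*?) finish/;

# Greedy prefix: .* runs to the end of the string and backtracks to the
# LAST "start" that can still be followed by " finish".
my ($prefixed) = $str =~ /.*start (.*?) finish/;

print "lazy:     '$lazy'\n";      # 'start start match1'
print "prefixed: '$prefixed'\n";  # 'match1'
```

The non-greedy quantifier only keeps the right-hand end of the capture short; it does not move the left-hand starting point, which is why the first pattern overshoots.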

Any help would be greatly appreciated.

nosanity

9:04 pm on Mar 12, 2003 (gmt 0)

I would actually use something like this...


<?php
$str = "start start start test finish finish finish";
preg_match("/[\w\s]*start[\s]?(.*?)[\s]?finish/i", $str, $matches);
print_r($matches);
?>

noSanity

andreasfriedrich

9:26 pm on Mar 12, 2003 (gmt 0)

If your match consists entirely of word characters, then /start (\w*?) finish/ will work. Alternatively, you might want to try a positive or negative character class specifying the allowed characters for your match. The more specific you are, the less need there is for a combination of greedy and non-greedy matches.

If you cannot be more specific about the boundary and your matches, it will be hard to come up with a better solution. For some more ideas of what is possible, you might want to consult Jeffrey Friedl's Mastering Regular Expressions.
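
As a rough illustration of that suggestion, the sketch below runs the \w pattern and one possible negated character class against the sample string from the first post. The negated class [^\s] is an assumption about the data (that the wanted text contains no whitespace), not something stated in the thread.

```perl
use strict;
use warnings;

my $str = "start start start match1 finish finish finish";

# \w never matches a space, so an attempt that begins at an earlier
# "start" cannot stretch across the following words and simply fails.
my ($word) = $str =~ /start (\w*?) finish/;

# Negated class (assumption: the wanted text contains no whitespace).
my ($no_space) = $str =~ /start ([^\s]*?) finish/;

print "word class:    '$word'\n";
print "negated class: '$no_space'\n";
```

Both captures come out as match1 here because the class cannot cross the space that separates the repeated "start" words.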

HTH Andreas

g00dlife

11:34 pm on Mar 12, 2003 (gmt 0)

Thanks for your help. I'll be a little bit more specific, as requested.

The actual string I'm performing a pattern match on looks something like this:


$str = "<a href=\"http://www.website.com/page1.html\">Link Info 1</a><!-- blah blah blah -->
<a href=\"http://www.website.com/page2.html\">Page 2 Info</a><br>";

(... very simplified... the pages I am parsing average 100K... and there's a lot of pages with lots of links.)

My goal is to extract the link where the data within the anchor tags matches 'Page 2 Info'... all I know is the info within the anchor tags.

Using the following regexp:


$str =~ /<a href=\"(.*?)\">Page 2 Info<\/a>/;

returns:


$1: http://www.website.com/page1.html">Link Info 1</a><!-- blah blah blah --><a href="http://www.website.com/page2.html

Which is not what I want. Using this simple pattern:


$str =~ /.*<a href=\"(.*?)\">Page 2 Info<\/a>/;

returns what I want:


$1: http://www.website.com/page2.html

However, there's a performance issue with using the .* greedy matcher, as I have mentioned.

The best I can do is the following:


$str =~ /[^<a href=].*?<a href=\"(http.*?)\">Page 2 Info<\/a>/;

And that also gives me what I want... with a little performance gain... however I'm convinced there's a better solution.

This one is for you regexp masters out there. :)

Thanks again for your help.

andreasfriedrich

12:11 am on Mar 13, 2003 (gmt 0)

The first two expressions will work. The first one is more exact, while the second one is shorter, albeit slightly incorrect. The third one would be nice, but I do not know how to reference the first backreference in that character class, since inside a character class \1 is not treated as a backreference.


m!<a href=(?:"([^"]+)"¦'([^']+)')>Page 2 Info</a>!;
m!<a href=(["'])([^'"]+)\1>Page 2 Info</a>!;
m!<a href=(["'])([^\1]+)\1>Page 2 Info</a>!;

RE #2 will match the link in $2. It looks for a href= followed by either double or single quote. This is stored as our first backreference. Then we match one or more characters that are not a single or double quote followed by what we matched in $1, i.e. either a single or double quote.

RE #1 does this more explicitly by first trying double quotes and then single quotes. In Perl a simple $1 ¦¦ $2 will give you the link no matter where it matched.

RE #3 does not work as intended for the reasons given above. It would be nice though.
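
For anyone who wants to try RE #1 and RE #2, here is a small sketch that runs them against the sample string from the earlier post, written with a solid vertical pipe. The variable names $link1 and $link2 are illustrative, and the string is built without a newline between the two anchors.

```perl
use strict;
use warnings;

my $str = '<a href="http://www.website.com/page1.html">Link Info 1</a><!-- blah blah blah -->'
        . '<a href="http://www.website.com/page2.html">Page 2 Info</a><br>';

# RE #1: one alternative per quote style; the link lands in $1 or $2.
my $link1;
if ($str =~ m!<a href=(?:"([^"]+)"|'([^']+)')>Page 2 Info</a>!) {
    $link1 = $1 || $2;
}

# RE #2: the quote character is captured first, so the link is in $2.
my $link2;
if ($str =~ m!<a href=(["'])([^'"]+)\1>Page 2 Info</a>!) {
    $link2 = $2;
}

print "RE #1: $link1\n";
print "RE #2: $link2\n";
```

Because neither class can match a quote character, an attempt that starts at the first anchor cannot run past its closing quote, so only the second anchor can satisfy the "Page 2 Info" part.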

HTH Andreas

Note: Make sure to replace "¦" with a solid vertical pipe.

g00dlife

1:45 am on Mar 13, 2003 (gmt 0)

Thank you andreasfriedrich for your reply.

The key to my problem is not using the .* (or the .+) greedy matcher. The use of [^"]+ is basically a greedy match, which causes major performance problems in my particular scenario. I should have been more specific with my question.

My question is simply: Is it possible to find the last occurrence of "start (.*?) finish" in any given string without using the .* (or .+) greedy matcher?

The regexp should extract 'match1' in each of the following:


$str1 = "start match1 finish";
$str2 = "start start match1 finish finish";
$str3 = "start start no match start start match1 finish finish";

Any help with this would be very much appreciated.
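
As a quick way to check candidate patterns against these three strings, a small harness along the following lines may help. It only uses patterns already mentioned in this thread; the labels and the %result hash are illustrative. Note that the \w version passes only because "match1" happens to consist of word characters.

```perl
use strict;
use warnings;

my @strings = (
    "start match1 finish",
    "start start match1 finish finish",
    "start start no match start start match1 finish finish",
);

# Labels are illustrative; the patterns themselves come from the thread.
my @patterns = (
    [ 'lazy only',     qr/start (.*?) finish/   ],
    [ 'greedy prefix', qr/.*start (.*?) finish/ ],
    [ 'word class',    qr/start (\w*?) finish/  ],
);

my %result;    # label => list of captures, one per test string
for my $p (@patterns) {
    my ($label, $re) = @$p;
    for my $s (@strings) {
        my ($got) = $s =~ $re;
        push @{ $result{$label} }, defined $got ? $got : '(no match)';
    }
    printf "%-14s %s\n", $label,
        join(' | ', map("'$_'", @{ $result{$label} }));
}
```

A pattern meets the requirement if its row shows 'match1' three times; the plain lazy pattern does so only for the first string.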

andreasfriedrich

1:56 am on Mar 13, 2003 (gmt 0)

Now I get it (hopefully): you don't want any greedy match anywhere :). Then just don't use it. The rules work equally well when using non-greedy matching.


m!<a href=(?:"([^"]+?)"¦'([^']+?)')>Page 2 Info</a>!;

HTH Andreas

g00dlife

2:12 am on Mar 13, 2003 (gmt 0)

andreasfriedrich, I've tried your regexp in the following script:


$str = "<a href=\"http://www.website.com/page1.html\">Link Info 1</a><!-- blah blah blah -->
<a href=\"http://www.website.com/page2.html\">Page 2 Info</a><br>";
$str =~ m!<a href=(?:"([^"]+?)"¦'([^']+?)')>Page 2 Info</a>!;
print "1: '$1'\n";
print "2: '$2'\n";

but I get no results... what am I doing wrong?

(N.B. my $str has no embedded \n characters.)

andreasfriedrich

2:45 am on Mar 13, 2003 (gmt 0)

I just copied the exact code you posted to a file, replaced the broken pipe symbol with a solid pipe and everything worked fine:


1: 'http://www.website.com/page2.html' 
2: ''
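
A short sketch of why the pipe character matters, assuming the same sample string: the broken bar (U+00A6, written here as \x{A6} so the script stays plain ASCII) is just a literal character in a regexp, whereas the solid pipe is the alternation operator. The names $broken, $solid, $dq and $sq are illustrative.

```perl
use strict;
use warnings;

my $str = '<a href="http://www.website.com/page1.html">Link Info 1</a><!-- blah blah blah -->'
        . '<a href="http://www.website.com/page2.html">Page 2 Info</a><br>';

# Broken bar: a literal character, so the pattern demands a double-quoted
# link, then the bar, then a single-quoted link. Nothing in $str has that.
my $broken = qr!<a href=(?:"([^"]+?)"\x{A6}'([^']+?)')>Page 2 Info</a>!;

# Solid pipe: the two quote styles become alternatives.
my $solid  = qr!<a href=(?:"([^"]+?)"|'([^']+?)')>Page 2 Info</a>!;

my $broken_hit = ($str =~ $broken) ? 1 : 0;
my ($dq, $sq)  = $str =~ $solid;
$sq = '' unless defined $sq;    # the single-quote branch did not take part

print "broken bar matched: $broken_hit\n";
print "1: '$dq'\n";
print "2: '$sq'\n";
```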

Andreas

g00dlife

3:24 am on Mar 13, 2003 (gmt 0)

Note: Make sure to replace "¦" with a solid vertical pipe.

I wasn't paying attention. :)

Thanks very much for your help.

andreasfriedrich

3:28 am on Mar 13, 2003 (gmt 0)

>>Thanks very much for your help.

You're welcome. It took some time but finally worked out ok, I guess :).

Andreas