help with regexp - smallest match in string

         

g00dlife

2:38 am on Mar 12, 2003 (gmt 0)

10+ Year Member



I need some help with a simple regexp.

Say I have a string, for example:

$str = "start start start match1 finish finish finish";

I would like to extract the smallest string that is bounded by "start" and "finish", i.e. in this case I would like to extract the string "match1".

My first guess was to use:

$pat = "start (.*?) finish";

However, this captures the string "start start match1".

I know I can use the pattern:

$pat = ".*start (.*?) finish";

and get what I want. However, there's a performance issue with using the greedy matcher .* at the start of the pattern. (This is not important, but I'm actually tackling a problem caused by the Java package Jakarta ORO: big performance issues when using combinations of greedy and non-greedy matches on large strings.)

Is there another single regular expression that will extract "match1" from the above $str when you only know it's bounded by "start" and "finish"?

Any help would be greatly appreciated.
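
To make the behaviour described above concrete, here is a minimal sketch in Python's `re` module (chosen only because it is easy to run; the thread's own code is Perl and PHP, and the variable names are illustrative). A lazy quantifier shortens the right-hand end of a match, but the engine still starts at the leftmost position that can succeed:

```python
import re

text = "start start start match1 finish finish finish"

# Lazy quantifier alone: the match still begins at the FIRST "start",
# so the group swallows the later "start" words.
lazy = re.search(r"start (.*?) finish", text)
print(lazy.group(1))      # start start match1

# A leading greedy .* pushes the match start as far right as possible,
# at the cost of scanning to the end of the string and backtracking.
anchored = re.search(r".*start (.*?) finish", text)
print(anchored.group(1))  # match1
```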

nosanity

9:04 pm on Mar 12, 2003 (gmt 0)

10+ Year Member



I would actually use something like this...


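<?php

$str = "start start start test finish finish finish";

// The leading greedy class skips ahead to the last "start";
// $matches[1] then holds the text between "start" and "finish".
preg_match('/[\w\s]*start\s?(.*?)\s?finish/i', $str, $matches);

print_r($matches);

?>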

noSanity

andreasfriedrich

9:26 pm on Mar 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If your match consists entirely of word characters, then /start (\w*?) finish/ will work. Alternatively, you might want to try a positive or negative character class specifying the allowed characters for your match. The more specific you are, the less need there is for a combination of greedy and non-greedy matches.

If you cannot be more specific about the boundary and your matches, it will be hard to come up with a better solution. For some more ideas of what is possible, you might want to consult Jeffrey Friedl's Mastering Regular Expressions.

HTH Andreas
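
As a rough illustration of the suggestion above, sketched in Python's `re` rather than Perl: restricting what the group may contain makes attempts that begin at an early "start" fail quickly, so no leading .* is needed. The second pattern assumes the wanted text never contains a space, which is an assumption made for this example only.

```python
import re

text = "start start start match1 finish finish finish"

# \w cannot cross a space, so attempts beginning at the first two
# "start" words fail and the engine moves on to the third.
word_only = re.search(r"start (\w*?) finish", text)
print(word_only.group(1))  # match1

# A negated class does the same job when the payload is not purely \w.
no_space = re.search(r"start ([^ ]*) finish", text)
print(no_space.group(1))   # match1
```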

g00dlife

11:34 pm on Mar 12, 2003 (gmt 0)

10+ Year Member



Thanks for your help. I'll be a little more specific, as requested.

The actual string I'm performing a pattern match on looks something like this:


$str = "<a href=\"http://www.website.com/page1.html\">Link Info 1</a><!-- blah blah blah -->
<a href=\"http://www.website.com/page2.html\">Page 2 Info</a><br>";

(... very simplified... the pages I am parsing average 100K... and there's a lot of pages with lots of links.)

My goal is to extract the link where the data within the anchor tags matches 'Page 2 Info'... all I know is the info within the anchor tags.

Using the following regexp:


$str =~ /<a href=\"(.*?)\">Page 2 Info<\/a>/;

returns:

$1: http://www.website.com/page1.html">Link Info 1</a><!-- blah blah blah --><a href="http://www.website.com/page2.html

Which is not what I want. Using this simple pattern:

$str =~ /.*<a href=\"(.*?)\">Page 2 Info<\/a>/;

returns what I want:

$1: http://www.website.com/page2.html

However, there's a performance issue with using the greedy .* matcher, as I have mentioned.

The best I can do is the following:


$str =~ /[^<a href=].*?<a href=\"(http.*?)\">Page 2 Info<\/a>/;

And that also gives me what I want... with a little performance gain... however I'm convinced there's a better solution.

This one is for you regexp masters out there. :)

Thanks again for your help.
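
A note on the third pattern, with a small Python sketch (the extra page0 link is made up for the illustration). `[^<a href=]` is a character class, so it matches one character that is none of `<`, `a`, space, `h`, `r`, `e`, `f`, `=`. It does not mean "anything other than the string `<a href=`". On the sample string it happens to match the first `"`, which places the match start just past the first anchor's `<a href=`. With one more link in front, the pattern overshoots again:

```python
import re

html = ('<a href="http://www.website.com/page1.html">Link Info 1</a>'
        '<!-- blah blah blah -->'
        '<a href="http://www.website.com/page2.html">Page 2 Info</a><br>')
pattern = r'[^<a href=].*?<a href="(http.*?)">Page 2 Info</a>'

# The class consumes a single character: the opening quote at offset 8.
first = re.search(r'[^<a href=]', html)
print(first.start(), first.group())

m = re.search(pattern, html)
print(m.group(1))  # http://www.website.com/page2.html

# Hypothetical third link in front: the group now starts at page1.html.
longer = '<a href="http://www.website.com/page0.html">Zero</a>' + html
m2 = re.search(pattern, longer)
print("page1.html" in m2.group(1))  # True
```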

andreasfriedrich

12:11 am on Mar 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The first two expressions will work. The first one is more exact, while the second one is shorter, albeit slightly incorrect. The third one would be nice, but I do not know how to reference the first backreference in that character class, since it will interpret the \ and 1 separately instead of taking it for the \1 it is supposed to be.


m!<a href=(?:"([^"]+)"¦'([^']+)')>Page 2 Info</a>!;

m!<a href=(["'])([^'"]+)\1>Page 2 Info</a>!;

m!<a href=(["'])([^\1]+)\1>Page 2 Info</a>!;

RE #2 will match the link in $2. It looks for a href= followed by either double or single quote. This is stored as our first backreference. Then we match one or more characters that are not a single or double quote followed by what we matched in $1, i.e. either a single or double quote.

RE #1 does this more explicitly by first trying double quotes and then single quotes. In Perl, a simple $1 ¦¦ $2 will give you the link no matter where it matched.

RE #3 does not work as intended for the reasons given above. It would be nice though.

HTH Andreas


Note: Make sure to replace "¦" with a solid vertical pipe.
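
For readers who want to try these, here is a hedged sketch of RE #1 and RE #2 in Python's `re`, with solid pipes. The sample string is a variation on the one in the thread: the second link uses single quotes, purely to exercise both branches. The last pattern is not from the post. It is one possible stand-in for what RE #3 is aiming at, relying on the fact that a backreference is allowed inside a negative lookahead even though it is not inside a character class.

```python
import re

html = ('<a href="http://www.website.com/page1.html">Link Info 1</a>'
        '<!-- blah blah blah -->'
        "<a href='http://www.website.com/page2.html'>Page 2 Info</a><br>")

# RE #1: one alternative per quote style; the link lands in group 1 or 2.
re1 = re.search(r'''<a href=(?:"([^"]+)"|'([^']+)')>Page 2 Info</a>''', html)
print(re1.group(1) or re1.group(2))

# RE #2: capture the quote, forbid both quote characters in the value.
re2 = re.search(r'''<a href=(["'])([^'"]+)\1>Page 2 Info</a>''', html)
print(re2.group(2))

# Stand-in for RE #3: each character must not be the captured quote.
re3 = re.search(r'''<a href=(["'])((?:(?!\1).)+)\1>Page 2 Info</a>''', html)
print(re3.group(2))
```

All three print http://www.website.com/page2.html for this input.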

g00dlife

1:45 am on Mar 13, 2003 (gmt 0)

10+ Year Member



Thank you andreasfriedrich for your reply.

The key to my problem is not using the .* (or the .+) greedy matcher. The use of [^"]+ is basically using a greedy matcher, which causes major performance problems in my particular scenario. I should have been more specific with my question.

My question is simply: Is it possible to find the last occurrence of "start (.*?) finish" in any given string without using the .* (or .+) greedy matcher?

The regexp should extract 'match1' in each of the following:


$str1 = "start match1 finish";
$str2 = "start start match1 finish finish";
$str3 = "start start no match start start match1 finish finish";

Any help with this would be very much appreciated.

andreasfriedrich

1:56 am on Mar 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Now I get it (hopefully): You don't want any greedy match anywhere :). Then just don't use it. The rules work equally well when using non-greedy matching.


m!<a href=(?:"([^"]+?)"¦'([^']+?)')>Page 2 Info</a>!;

HTH Andreas
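
For the abstract start/finish version of the question, one approach that avoids a greedy quantifier is a lazy "any character" guarded by a negative lookahead, so the group can never step onto another "start ". This is a sketch in Python's `re`, not something proposed in the thread, and it says nothing about how Jakarta ORO performs with it. It finds the first innermost start/finish pair, which coincides with the last "start" in the three test strings.

```python
import re

# Lazy "any character" that refuses to step onto another "start ".
pattern = re.compile(r"start ((?:(?!start ).)*?) finish")

samples = [
    "start match1 finish",
    "start start match1 finish finish",
    "start start no match start start match1 finish finish",
]
results = [pattern.search(s).group(1) for s in samples]
print(results)  # ['match1', 'match1', 'match1']
```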

g00dlife

2:12 am on Mar 13, 2003 (gmt 0)

10+ Year Member



andreasfriedrich I've tried your regexp in the following script:


$str = "<a href=\"http://www.website.com/page1.html\">Link Info 1</a><!-- blah blah blah -->
<a href=\"http://www.website.com/page2.html\">Page 2 Info</a><br>";

$str =~ m!<a href=(?:"([^"]+?)"¦'([^']+?)')>Page 2 Info</a>!;

print "1: '$1'\n";
print "2: '$2'\n";

but I get no results... What am I doing wrong?

(nb. my $str has no implicit \n characters)

andreasfriedrich

2:45 am on Mar 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just copied the exact code you posted to a file, replaced the broken pipe symbol with a solid pipe and everything worked fine:


1: 'http://www.website.com/page2.html'
2: ''

Andreas

g00dlife

3:24 am on Mar 13, 2003 (gmt 0)

10+ Year Member



Note: Make sure to replace "¦" with a solid vertical pipe.

I wasn't paying attention. :)

Thanks very much for your help.

andreasfriedrich

3:28 am on Mar 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>Thanks very much for your help.

You're welcome. It took some time but finally worked out ok, I guess :).

Andreas