Say I have a string, for example:
$str = "start start start match1 finish finish finish";
I would like to extract the smallest string that is bounded by "start" and "finish", i.e. in this case I would like to extract the string "match1".
My first guess was to use:
$pat = "start (.*?) finish";
However, this captures the string "start start match1".
I know I can use the pattern:
$pat = ".*start (.*?) finish";
and get what I want; however, there's a performance issue with using the greedy matcher .* at the start of the pattern. (This is not important, but I'm actually tackling a problem caused by the Java package Jakarta ORO... big performance issues when using combinations of greedy and non-greedy matches on large strings.)
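For anyone following along, the difference between the two patterns above can be reproduced with a short script. This is only a sketch; the variable names $lazy and $prefixed are made up for the illustration.

```perl
use strict;
use warnings;

my $str = "start start start match1 finish finish finish";

# Non-greedy capture: the match still begins at the FIRST "start",
# so the capture runs from there up to the first " finish".
my ($lazy) = $str =~ /start (.*?) finish/;

# Greedy prefix: .* first consumes the whole string, then backtracks
# to the last "start " that is still followed by " finish".
my ($prefixed) = $str =~ /.*start (.*?) finish/;

print "lazy:     '$lazy'\n";      # 'start start match1'
print "prefixed: '$prefixed'\n";  # 'match1'
```

The non-greedy quantifier only shortens the right-hand end of the match; it does not move the starting point, which is why the first pattern picks up the extra "start" words.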
Is there another single regular expression that will extract "match1" from the above $str when you only know it's bounded by "start" and "finish"?
Any help would be greatly appreciated.
If you cannot be more specific about the boundary and your matches, it will be hard to come up with a better solution. For some more ideas of what is possible, you might want to consult Jeffrey Friedl's Mastering Regular Expressions.
HTH Andreas
The actual string I'm performing a pattern match on looks something like this:
$str = "<a href=\"http://www.website.com/page1.html\">Link Info 1</a><!-- blah blah blah -->
<a href=\"http://www.website.com/page2.html\">Page 2 Info</a><br>";
(... very simplified... the pages I am parsing average 100K... and there's a lot of pages with lots of links.)
My goal is to extract the link where the data within the anchor tags matches 'Page 2 Info'... all I know is the info within the anchor tags.
Using the following regexp:
$str =~ /<a href=\"(.*?)\">Page 2 Info<\/a>/;
$1: http://www.website.com/page1.html">Link Info 1</a><!-- blah blah blah --><a href="http://www.website.com/page2.html
$str =~ /.*<a href=\"(.*?)\">Page 2 Info<\/a>/;
$1: http://www.website.com/page2.html
The best I can do is the following:
$str =~ /[^<a href=].*?<a href=\"(http.*?)\">Page 2 Info<\/a>/;
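The three attempts above can be compared side by side with a small script. This is a sketch that assumes the sample markup is a single line with no embedded newline; the names $first, $second and $third are invented for the illustration.

```perl
use strict;
use warnings;

# The sample markup, joined into one line (assumption: no embedded newline).
my $str = '<a href="http://www.website.com/page1.html">Link Info 1</a><!-- blah blah blah -->'
        . '<a href="http://www.website.com/page2.html">Page 2 Info</a><br>';

# 1. Non-greedy only: starts at the first <a href= and runs to the target text.
my ($first)  = $str =~ /<a href=\"(.*?)\">Page 2 Info<\/a>/;

# 2. Greedy .* prefix: backtracks to the last <a href= before the target text.
my ($second) = $str =~ /.*<a href=\"(.*?)\">Page 2 Info<\/a>/;

# 3. Leading character class: [^<a href=] matches ONE character that is
#    none of '<', 'a', ' ', 'h', 'r', 'e', 'f', '='.
my ($third)  = $str =~ /[^<a href=].*?<a href=\"(http.*?)\">Page 2 Info<\/a>/;

print "1: $first\n";
print "2: $second\n";
print "3: $third\n";
```

On this sample the third pattern gives the page2 link mainly because the match is forced to begin after the first `<a href=`. With more links ahead of the target that may no longer hold, since the bracket expression is a set of single characters rather than a "not this substring" test.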
This one is for you regexp masters out there. :)
Thanks again for your help.
m!<a href=(?:"([^"]+)"|'([^']+)')>Page 2 Info</a>!;   # RE #1
m!<a href=(["'])([^'"]+)\1>Page 2 Info</a>!;          # RE #2
m!<a href=(["'])([^\1]+)\1>Page 2 Info</a>!;          # RE #3
RE #2 will match the link in $2. It looks for a href= followed by either double or single quote. This is stored as our first backreference. Then we match one or more characters that are not a single or double quote followed by what we matched in $1, i.e. either a single or double quote.
RE #1 does this more explicitly by first trying double quotes and then single quotes. In Perl a simple $1 || $2 will give you the link no matter where it matched.
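RE #3 does not work as intended: inside a character class, \1 is not treated as a backreference (Perl reads it as an octal character escape), so [^\1] does not mean "anything but the quote matched earlier". It would be nice though.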
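A quick way to try RE #1 and RE #2 is sketched below. The two sample strings (one with double-quoted hrefs, one with single-quoted hrefs) and the variable names are assumptions made for the illustration, not taken from the real pages.

```perl
use strict;
use warnings;

# Sample with double-quoted attributes.
my $dq_str = '<a href="http://www.website.com/page1.html">Link Info 1</a><!-- blah blah blah -->'
           . '<a href="http://www.website.com/page2.html">Page 2 Info</a><br>';

# Same markup with single-quoted attributes (hypothetical variant).
my $sq_str = "<a href='http://www.website.com/page1.html'>Link Info 1</a>"
           . "<a href='http://www.website.com/page2.html'>Page 2 Info</a><br>";

my @urls;
for my $s ($dq_str, $sq_str) {
    # RE #1: explicit alternation; the link lands in either $1 or $2.
    my ($d, $q) = $s =~ m!<a href=(?:"([^"]+)"|'([^']+)')>Page 2 Info</a>!;
    my $from_re1 = $d || $q;

    # RE #2: capture the quote character, reuse it with \1; the link is in $2.
    my (undef, $from_re2) = $s =~ m!<a href=(["'])([^'"]+)\1>Page 2 Info</a>!;

    push @urls, $from_re1, $from_re2;
    print "RE #1: $from_re1\nRE #2: $from_re2\n";
}
```

Because [^"]+ and [^'"]+ cannot cross a quote character, a failed attempt at the first link is abandoned at its closing quote instead of scanning on through the rest of the page.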
HTH Andreas
The key with my problem is not using the .* (or the .+) greedy matcher. The use of [^"]+ is basically using a greedy matcher... which causes major performance problems in my particular scenario. I should have been more specific with my question.
My question is simply: Is it possible to find the last occurrence of "start (.*?) finish" in any given string without using the .* (or .+) greedy matcher?
The regexp should extract 'match1' in each of the following:
$str1 = "start match1 finish";
$str2 = "start start match1 finish finish";
$str3 = "start start no match start start match1 finish finish";
Any help with this would be very much appreciated.
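One possible approach, not taken from this thread, is to guard every character of the capture with a negative lookahead so that the capture can never run across another "start". The sketch below assumes the regex engine supports Perl-style (?!...) lookahead; whether that is available or performs acceptably in Jakarta ORO is not verified here. Strictly speaking it finds the first "start" that reaches "finish" without another "start" in between, which gives 'match1' for the three test strings.

```perl
use strict;
use warnings;

# Each "." is only allowed at positions where the text does not begin
# another "start", so the capture cannot span a second "start".
my $pat = qr/start ((?:(?!start).)*?) finish/;

my @samples = (
    "start match1 finish",
    "start start match1 finish finish",
    "start start no match start start match1 finish finish",
);

my @found;
for my $s (@samples) {
    my ($m) = $s =~ $pat;
    push @found, $m;
    print "'$m'\n";    # prints 'match1' for each sample
}
```

There is no leading .* here, so the engine does not have to swallow the whole string first; the lookahead does add a check at every character, though, so it would need timing on the real 100K pages before drawing conclusions.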
$str = "<a href=\"http://www.website.com/page1.html\">Link Info 1</a><!-- blah blah blah -->
<a href=\"http://www.website.com/page2.html\">Page 2 Info</a><br>";
$str =~ m!<a href=(?:"([^"]+?)"¦'([^']+?)')>Page 2 Info</a>!;
print "1: '$1'\n";
print "2: '$2'\n";
but I get no results... what am I doing wrong?
(nb. my $str has no implicit \n characters)
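One possible cause, assuming the pattern was run exactly as shown: the character between the two alternatives is a broken bar (¦) rather than the ASCII pipe (|). Perl treats ¦ as an ordinary literal character, not as alternation, so the pattern would then require both quoted forms one after the other. A sketch with a real pipe, using the sample string joined into one line:

```perl
use strict;
use warnings;

my $str = '<a href="http://www.website.com/page1.html">Link Info 1</a><!-- blah blah blah -->'
        . '<a href="http://www.website.com/page2.html">Page 2 Info</a><br>';

my $url;
# Same pattern as posted, but with an ASCII pipe (|) as the alternation operator.
if ($str =~ m!<a href=(?:"([^"]+?)"|'([^']+?)')>Page 2 Info</a>!) {
    # Only one of the two groups is set, depending on the quote style.
    $url = defined $1 ? $1 : $2;
    print "url: '$url'\n";
}
```

If the real script already has a plain pipe, the cause lies elsewhere and this sketch does not apply.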