
Forum Moderators: coopster & jatar k & phranque


Question - haven't done this in a while

Perl, script

   
5:12 am on Feb 7, 2010 (gmt 0)

10+ Year Member



Dear fellows!
I must be really tired, or I am missing something. I am trying to figure out why in the world this would not match:

#!/usr/bin/perl -w
$n = 'some test here whatever...
<a href="read.cgi?do.htm"><img src="images/431.png" border=0 width=109 height=250</a>';
$n =~ s/(<a )(.*?)(<\/a>)/$&/; $r = "$&";
$n =~ s/$r/IMAGE2/s;


The whole link with the image does not match in $n.
Any ideas?
Thanks!

[edited by: phranque at 6:29 am (utc) on Feb 7, 2010]
[edit reason] disabled graphic smileys ;) [/edit]

7:04 pm on Feb 8, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



well, you'd need to escape the content of $r ... otherwise the ? in the URL is treated as a regex quantifier in s/// (and each . as a wildcard), so the pattern no longer matches the literal text ...
$r = quotemeta("$&");
should work fine ...
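
To make the difference concrete, here is a minimal sketch, assuming the sample string from the question as posted. The variable names ($html, $link, $safe and so on) are made up for illustration, and it uses a capture group rather than $&. It tries the replacement unescaped, with quotemeta, and with the equivalent inline \Q...\E form:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample text from the question, exactly as posted.
my $html = 'some test here whatever...
<a href="read.cgi?do.htm"><img src="images/431.png" border=0 width=109 height=250</a>';

# Grab the first <a ...>...</a> block; /s lets . match newlines too.
my ($link) = $html =~ /(<a .*?<\/a>)/s
    or die "no link found\n";

# Unescaped: "i?" in "read.cgi?do.htm" now means "an optional i",
# so the pattern cannot match the literal "?" in the text.
(my $unescaped = $html) =~ s/$link/IMAGE2/;

# Escaped: quotemeta() backslashes every regex metacharacter in $link.
my $safe = quotemeta($link);
(my $escaped = $html) =~ s/$safe/IMAGE2/;

# \Q...\E applies the same escaping inline.
(my $inline = $html) =~ s/\Q$link\E/IMAGE2/;

print "unescaped: ", ($unescaped eq $html ? "no change" : "replaced"), "\n";
print "escaped:   ", ($escaped   eq $html ? "no change" : "replaced"), "\n";
print "inline:    ", ($inline    eq $html ? "no change" : "replaced"), "\n";
```

Only the unescaped attempt leaves the string untouched; the other two swap the whole link for IMAGE2.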
8:42 pm on Feb 10, 2010 (gmt 0)

5+ Year Member



/*
s/(<a )(.*?)(<\/a>)/$&/;
*/
should be s/(<a )(.*)?(<\/a>)/$&/;

[edited by: phranque at 6:51 am (utc) on Feb 11, 2010]
[edit reason] disabled graphic smileys ;) [/edit]

2:18 am on Feb 11, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member




should be s/(<a )(.*)?(<\/a>)/$&/;

no. (.*)? does compile, but it doesn't do what you want: .* means any character 0 or more times, and putting the ? outside the brackets means "match that whole group 0 or 1 time" (that's pretty much what ? means after an item ... bbba?bbb matches bbbbbb and bbbabbb, but not bbbcbbb). since .* can already match nothing, the outer ? changes nothing and the .* stays greedy. put inside the brackets, right after the *, the ? makes the unlimited * ungreedy, basically saying "match any character 0 or more times, but as few times as possible". that is what's needed here, because a greedy .* would happily run past the first </a> all the way to the last one in the string.
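
a quick sketch of the difference, assuming a made-up string with two links ($text and the other variable names are only for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input with two links, so greedy and ungreedy differ.
my $text = '<a href="1.htm">one</a> and <a href="2.htm">two</a>';

# (.*?) ungreedy: stops at the first </a>.
my ($lazy) = $text =~ /((<a )(.*?)(<\/a>))/;

# (.*) greedy: runs on to the last </a>.
my ($greedy) = $text =~ /((<a )(.*)(<\/a>))/;

# (.*)? makes the group optional, but the .* inside is still greedy.
my ($optional) = $text =~ /((<a )(.*)?(<\/a>))/;

print "lazy:     $lazy\n";
print "greedy:   $greedy\n";
print "optional: $optional\n";

# ? after a single item: that item may appear 0 or 1 time.
my %hit;
for my $s (qw(bbbbbb bbbabbb bbbcbbb)) {
    $hit{$s} = $s =~ /^bbba?bbb$/ ? 1 : 0;
    printf "%-8s %s\n", $s, $hit{$s} ? "matches" : "no match";
}
```

the ungreedy version returns only the first link, while both the greedy and the (.*)? versions swallow everything up to the last </a>.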

in general: unless you're hacking stuff together for a quick fix, it's usually not the best idea to operate on markup languages like html or xml with regexps. not only is it hard to match what you want, you also cannot express complex patterns that are easy to define with something like HTML::TreeBuilder [search.cpan.org]. its look_down method lets you simply look for all a-nodes that contain a b-node and an img-node which, itself, has a src matching a certain url-pattern.
regexps on html can fix easy problems, but are generally a bad idea, because they tend to break stuff 5 months from now when nobody remembers they're in effect.
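
a rough sketch of the look_down approach, assuming HTML::TreeBuilder is installed from CPAN (it is not a core module). the sample markup is the string from the question with the img tag closed and a wrapping p added, the src pattern is just an example, and all variable names are made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module, not part of core Perl

# Sample markup based on the question (img tag closed, wrapped in <p>).
my $html = '<p>some test here whatever...
<a href="read.cgi?do.htm"><img src="images/431.png" border=0 width=109 height=250></a></p>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# All <a> elements containing an <img> whose src matches the pattern.
my @links = $tree->look_down(
    _tag => 'a',
    sub { defined $_[0]->look_down(_tag => 'img', src => qr{^images/.*\.png$}) },
);
my $count = scalar @links;

# Replace each matching link with plain text, then free the old node.
$_->replace_with('IMAGE2')->delete for @links;

my $out = $tree->as_HTML;
print "$count link(s) replaced\n";
print "$out\n";

$tree->delete;
```

no escaping is needed here, since the link is found by its structure and not by its literal text.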

[edited by: phranque at 6:53 am (utc) on Feb 11, 2010]
[edit reason] disabled graphic smileys ;) [/edit]