Question - haven't done this in a while

Perl, script

     
5:12 am on Feb 7, 2010 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 7, 2001
posts: 48
votes: 0


Dear fellows!
I must be really tired, or there is something I am missing. I am trying to figure out why in the world this would not match:

#!/usr/bin/perl -w
$n = 'some test here whatever...
<a href="read.cgi?do.htm"><img src="images/431.png" border=0 width=109 height=250</a>';
$n =~ s/(<a )(.*?)(<\/a>)/$&/; $r = "$&";
$n =~ s/$r/IMAGE2/s;


The whole link with the image does not get matched (and replaced) in $n.
Any ideas?
Thanks!

[edited by: phranque at 6:29 am (utc) on Feb 7, 2010]
[edit reason] disabled graphic smileys ;) [/edit]

7:04 pm on Feb 8, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 31, 2008
posts:661
votes: 0


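well, you'd need to escape the content of $r ... otherwise the ? from the href is treated as a regex quantifier (and each . as "any character") inside s///, so the pattern won't match the literal text ...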
$r = quotemeta("$&");
should work fine ...
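
To make the point above concrete, here is a minimal sketch of the escaped and unescaped variants side by side. The sample string is modelled on the original post (with the img tag closed), and the variable names $unescaped_matches and $out are only illustrative. \Q...\E inside a pattern has the same effect as calling quotemeta() on the interpolated string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input modelled on the original post (the <img> tag is closed here).
my $n = qq{some test here whatever...\n}
      . q{<a href="read.cgi?do.htm"><img src="images/431.png" border=0 width=109 height=250></a>};

# Capture the first <a ...>...</a> span; /s lets . cross newlines.
my ($r) = $n =~ /(<a .*?<\/a>)/s
    or die "no link found";

# Unescaped, the "?" in $r makes the preceding "i" optional instead of
# matching a literal "?", so the pattern fails on the text it came from.
my $unescaped_matches = ($n =~ /$r/) ? 1 : 0;

# \Q...\E applies quotemeta() to the interpolated string.
(my $out = $n) =~ s/\Q$r\E/IMAGE2/;

print "unescaped pattern matched: $unescaped_matches\n";
print $out, "\n";
```

With this input the unescaped pattern does not match, while the \Q...\E version replaces the whole link with IMAGE2.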
8:42 pm on Feb 10, 2010 (gmt 0)

New User

5+ Year Member

joined:Feb 10, 2010
posts: 9
votes: 0


/*
s/(<a )(.*?)(<\/a>)/$&/;
*/
should be s/(<a )(.*)?(<\/a>)/$&/;

[edited by: phranque at 6:51 am (utc) on Feb 11, 2010]
[edit reason] disabled graphic smileys ;) [/edit]

2:18 am on Feb 11, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 31, 2008
posts:661
votes: 0



should be s/(<a )(.*)?(<\/a>)/$&/;

no. (.*)? does compile, but it doesn't do what you want: .* means any character 0 or more times, and putting the ? outside of the parentheses means matching that whole group 0 or 1 times (that's pretty much what ? means after an atom ... bbba?bbb matches bbbbbb and bbbabbb, but not bbbcbbb). the .* inside stays greedy. if the ? is put inside the parentheses, directly after the *, it makes the unlimited * ungreedy, basically saying "match any character 0 or more times, but as few times as possible". in this case that is necessary, because a greedy .* would happily run past the first closing </a> all the way to the last one in the string.
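
A small sketch of the difference between the three forms, using a made-up string with two links (the input and variable names are assumptions for illustration only). As far as I know, (.*)? compiles in Perl and just makes the group optional, so it behaves like the greedy version here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input with two links on one line.
my $html = '<a href="one.htm">first</a> and <a href="two.htm">second</a>';

# .*? is non-greedy: it stops at the first </a>.
my ($lazy) = $html =~ /(<a .*?<\/a>)/;

# .* is greedy: it runs on to the last </a>.
my ($greedy) = $html =~ /(<a .*<\/a>)/;

# (.*)? makes the whole group optional; the .* inside stays greedy.
my ($optional) = $html =~ /(<a (.*)?<\/a>)/;

print "non-greedy: $lazy\n";
print "greedy:     $greedy\n";
print "optional:   $optional\n";
```

Only the non-greedy form isolates the first link; the other two capture everything from the first <a to the last </a>.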

in general: unless you're hacking stuff together for a quick fix, it's usually not the best idea to operate on markup languages like html or xml with regexps. not only is it hard to match what you want, you also cannot define complex patterns that you could easily define with something like HTML::TreeBuilder [search.cpan.org]. it offers look_down, where you could simply look for all a-nodes that contain a b-node and an img-node which, itself, has a src matching a certain url-pattern.
regexps on html are able to fix easy problems, but are generally a bad idea, because they tend to break stuff 5 months from now when nobody remembers they're in effect.
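
For illustration, a rough sketch of the look_down approach. It assumes HTML::TreeBuilder (from the HTML-Tree distribution on CPAN) is installed. The sample markup, the src pattern and the variable names are made up for this example, and it only checks for the img-node, not the b-node.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module (HTML-Tree distribution), not core Perl

# Hypothetical sample markup, loosely based on the question above.
my $html = <<'HTML';
<p>some test here whatever...
<a href="read.cgi?do.htm"><img src="images/431.png" width="109" height="250"></a>
<a href="other.htm">plain text link</a></p>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Find every <a> element that contains an <img> whose src matches a pattern.
my @image_links = $tree->look_down(
    _tag => 'a',
    sub {
        my $img = $_[0]->look_down(_tag => 'img', src => qr/^images\//);
        return defined $img;
    },
);

# Replace each matching link with a plain text node.
$_->replace_with('IMAGE2') for @image_links;

my $count  = scalar @image_links;
my $result = $tree->as_HTML;

print "replaced $count link(s)\n";
print $result, "\n";

$tree->delete;   # free the tree explicitly
```

Note that as_HTML serialises the whole parsed tree, so the output is wrapped in the html, head and body elements the parser adds.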

[edited by: phranque at 6:53 am (utc) on Feb 11, 2010]
[edit reason] disabled graphic smileys ;) [/edit]
