Does anyone have a regex handy to match whole anchors?
I am trying to create a script that parses text and I need to match all <a href>link</a> tags.
The main problem is that the links are going to be in different formats and have other attributes.
Examples:
<a href='htp://site.com' target='_blank'>link</a>
<a target='_blank' href='htp://site.com' >link</a>
<a href="htp://site.com" target="_blank">link</a>
etc...
I have searched high and low for this but have yet to find what I need.
Thanks in advance!
<[aA] .*<\/[aA]>
i'll break it down for you:
< - the pattern starts with a '<'
[aA] and a space - followed by an upper or lower case 'a', then a literal space
.* - followed by zero or more of any character (any character except a newline, unless you add the /s modifier)
< - followed by a '<'
\/ - followed by a '/' (escaped by the '\')
[aA] - followed by an upper or lower case 'a' (no space this time)
> - followed by a '>'
in short, "<a (some stuff)</a>"
i just now realized while reviewing this that because ".*" is greedy, two anchors on the same line will be swallowed into a single match: everything from the first "<a " to the last "</a>".
if you have no tags in your anchor text, you could fix this by using:
<[aA] [^<]*<\/[aA]>
the difference is:
[^<]* - this means zero or more characters that are not a '<'
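To see the difference between the two patterns side by side, here is a minimal Perl sketch. The sample line and the variable names are made up for illustration; only the two patterns come from the post above.

```perl
use strict;
use warnings;

# Hypothetical sample line holding two anchors.
my $line = q{<a href='one.html'>one</a> and <a href='two.html'>two</a>};

# Greedy .* runs to the last </a> on the line, so both anchors
# come back as one match.
my @greedy = $line =~ /(<[aA] .*<\/[aA]>)/g;

# [^<]* cannot cross the '<' of the closing tag, so each anchor
# is matched separately.
my @strict = $line =~ /(<[aA] [^<]*<\/[aA]>)/g;

print scalar(@greedy), " match(es) with .*\n";
print scalar(@strict), " match(es) with [^<]*\n";
```

The second pattern only helps while the anchor text itself contains no tags, as noted above.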
if you want to use this in a loop on found patterns you could try something like this:
while ($yourstring =~ m/(<\s*a\s[^<]*<\s*\/a\s*>)/igs) {
    print "$1\n";  # do something with $1, the matched anchor
}
this pattern should let you get sloppy with case and blank/tab usage.
I think I left some info out of my first post.
I need to backreference the URL too.
Here's what I'm doing:
It's actually a PHP script using preg_replace_callback() but I posted here since it's a regex question.
I am parsing text rss files and removing links but only after comparing them to an "allowed url" list.
Quick question: Would I be better suited to just use HTML::TokeParser?
I need to backreference the URL too.
not sure if php uses any different regexp syntax, but
this might do it:
while ($yourstring =~ m/(<\s*a\s[^>]*href\s*=\s*['"]([^'"]*)['"][^<]*<\s*\/a\s*>)/igs) {
    print "anchor: $1\n";  # $1 holds the whole anchor tag
    print "url: $2\n";     # $2 holds the url only
}
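For the "remove links unless they are on an allowed list" job, a substitution with the /e modifier plays roughly the same role in Perl that preg_replace_callback() plays in PHP. The following is only a sketch under assumptions: the %allowed hosts, the helper names filter_anchor and strip_links, the sample input, and the choice to drop a rejected anchor entirely (rather than keep its link text) are all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical allowed-host list; replace with your own.
my %allowed = map { $_ => 1 } qw(site.com example.org);

# Return the anchor unchanged if its host is allowed, otherwise
# return an empty string so the anchor is removed.
sub filter_anchor {
    my ($anchor, $url) = @_;
    my ($host) = $url =~ m{^https?://([^/:]+)}i;
    return (defined $host && $allowed{lc $host}) ? $anchor : '';
}

# Run every anchor in the text through filter_anchor().
sub strip_links {
    my ($text) = @_;
    $text =~ s/(<\s*a\s[^>]*href\s*=\s*['"]([^'"]*)['"][^<]*<\s*\/a\s*>)/filter_anchor($1, $2)/igse;
    return $text;
}

my $clean = strip_links(
    q{<a href="http://site.com/x">ok</a> <a target='_blank' href='http://bad.example/'>no</a>}
);
print "$clean\n";
```

Comparing on the host rather than the full url mirrors the "allowed url" idea loosely; adjust filter_anchor() if you need exact url matches.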
use strict;
use warnings;
use HTML::TokeParser;

my $html = "(contents of an entire HTML file)";

my $p = HTML::TokeParser->new(\$html);

while (my $token = $p->get_token()) {
    # start tags arrive as ["S", $tag, \%attr, \@attrseq, $text];
    # skip everything else (end tags, text, comments)
    next unless $token->[0] eq "S";

    my $tokenType = $token->[1] || "";
    # take the href from the parsed attributes, so the order of
    # the attributes and the quote style no longer matter
    my $href = lc($token->[2]{href} || "");

    if ($tokenType eq "a") {
        if (index($href, "http://") == 0) {
            $href =~ s|^http://||;
            my @split_url = split('/', $href);
            #
            # compare $split_url[0] to your allowed list
            # and do what you like with the data.
            #
        }
    }
}
NOTE: WebmasterWorld swaps the pipe character for a broken bar when posting, so if you see "¦" anywhere in the code, change it back to "|".
BTW - if the files you want to parse are online (not on your local machine), simply marry up LWP::UserAgent to HTML::TokeParser.
Well, I ended up sticking with the regex, rather than HTML::TokeParser.
Here is what I came up with:
'/<a[^>]*\shref=["\'][^"\'](.*)["\']*\s?>.*?<\/a>/si'
Pretty lengthy eh? It seems to work well though. If anyone is looking, I'd appreciate any tips on shortening it if it can be done.
Thanks again!
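One possible tightening, offered as a sketch rather than a tested drop-in: in the pattern above, the [^"\'] before the group appears to eat the first character of the URL, and the greedy (.*) can run past the closing quote into later attributes or later anchors. Capturing the opening quote and requiring the same quote to close avoids both. The sample text and variable names below are made up; the pattern should carry over to PHP's preg functions largely unchanged, though that is an assumption worth checking against your own data.

```perl
use strict;
use warnings;

# Hypothetical input with two anchors on one line.
my $text = q{<a href="http://one.example/">1</a> <a class='x' href='http://two.example/'>2</a>};

# (["']) remembers the opening quote, (.*?) lazily captures the URL,
# and \1 requires the same quote character to close it.
my $re = qr/<a\s[^>]*?href=(["'])(.*?)\1[^>]*>.*?<\/a>/si;

my @urls;
while ($text =~ /$re/g) {
    push @urls, $2;   # $2 is the URL; $1 is only the quote character
}
print join("\n", @urls), "\n";
```

Note the URL moves to the second backreference here, since the first one holds the quote.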