Regular Expression problem

Trying to extract all the links from a page

Phoog

8:30 pm on Nov 28, 2005 (gmt 0)

Hello WebmasterWorld!

I'm trying to extract all the links from a string, and I don't know the exact number of links.

This works fine, but it only extracts the first link on the page. Is it possible to loop over all the links, or do I need to rebuild it?

if($content =~ /<a href=\"(.+?)\">(.+?)<\/a>/i){
print "Url: $1<br>Text: $2";
}else{
print "Nothing found";
}

skinter

12:13 am on Nov 29, 2005 (gmt 0)

You could use a for loop that will terminate when it reaches the EOF.

skinter

1:27 am on Nov 29, 2005 (gmt 0)

(sorry for double post, but it was too late to edit)

A while loop would probably be better.
Hope that helps.

Phoog

10:43 am on Nov 29, 2005 (gmt 0)

One step closer to solving my problem. Now I can extract all the links and put them in an array, like this:

@hrefs = ($content =~ m|<a\s[^>]*?href\s*=\s*\"(.+?)\">(.+?)</a>|ig);
foreach $href (@hrefs){
print "$href<hr>";
}

But the next problem is that the link text and the URL are stored in different elements of the array, so the result looks like this:

link url #1
--------------------------------------------------------------
link text #1
--------------------------------------------------------------
link url #2
--------------------------------------------------------------
link text #2

And that isn't really what I want :/

simon2263

11:32 am on Nov 29, 2005 (gmt 0)

You probably want something like:

$found = 0;
while ($content =~ /<a href=\"(.+?)\">(.+?)<\/a>/ig) {
print "Url: $1<br>Text: $2";
$found = 1;
}
if ($found == 0) {
print "Nothing found";
}

# haven't tested this, but the key thing is the 'g' flag
# at the end of the regexp operator to make the match
# global
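
To make the effect of the 'g' flag concrete, here is a minimal sketch. The sample $content string and the helper name extract_links are made up for illustration and are not from the original post. It assumes simple, double-quoted href attributes, like the pattern above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input; any string containing anchors would do.
my $content = '<a href="/one">First</a> and <a href="/two">Second</a>';

# Hypothetical helper: collect [url, text] pairs.
# In scalar context, each /g match resumes where the previous one
# stopped, so $1 and $2 always belong to the same link.
sub extract_links {
    my ($html) = @_;
    my @links;
    while ($html =~ /<a href="(.+?)">(.+?)<\/a>/ig) {
        push @links, [$1, $2];
    }
    return @links;
}

# In list context, the same match returns every capture from every
# match as one flat list: url, text, url, text, ...
my @flat = $content =~ /<a href="(.+?)">(.+?)<\/a>/ig;

print "Url: $_->[0], Text: $_->[1]\n" for extract_links($content);
print scalar(@flat), " captures in the flat list\n";
```

Without the 'g' flag the while loop would never advance past the first link, which is why the original if-statement only ever reported one match.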

simon2263

11:36 am on Nov 29, 2005 (gmt 0)

Why not write

@hrefs = ($content =~ m|<a\s[^>]*?href\s*=\s*\"(.+?)\">(.+?)</a>|ig);
while (@hrefs){
$href = shift @hrefs;
$text = shift @hrefs;
print "$href : $text\n";
}

wruppert

5:37 pm on Nov 29, 2005 (gmt 0)

Sample script using HTML::LinkExtractor that plucks links from a page.


#!/usr/bin/perl
 
use strict;
use warnings;
use HTML::LinkExtractor;
use LWP::Simple qw(get);
 
# get a page to test
my $html = get('http://search.cpan.org/recent') or die "Could not fetch the page\n";
 
# setup the parser
my $LX = new HTML::LinkExtractor();
$LX->strip(1); # just anchor text, not entire tag
$LX->parse(\$html);
 
for my $Link (@{$LX->links}) {
  my $tag = $$Link{tag};
  # only regular links
  next unless $tag eq 'a';
  my $href = $$Link{href};
  my $text = $$Link{_TEXT};
  print $text, " -> ", $href, "\n";
}
 
undef $LX;

bennymack

6:08 pm on Nov 29, 2005 (gmt 0)

Yet Another Solution would be to put the info into an array of hashes (my fave):


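@pairs = ($content =~ m|<a\s[^>]*?href\s*=\s*\"(.+?)\">(.+?)</a>|ig);
# captures come back as a flat list (url, text, url, text, ...), so pair them up
@hrefs = map { +{ url => $pairs[2*$_], text => $pairs[2*$_+1] } } 0 .. scalar(@pairs)/2 - 1;
foreach $href (@hrefs){
print "Link: $href->{url}, Text: $href->{text}<hr /><br />\n";
}

Warning: untested (but should work).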

Phoog

7:21 pm on Nov 29, 2005 (gmt 0)

You are awesome, thanks!