More verbose HTML::LinkExtor or parser?

Hi, I am new to Perl; I program full-time in Objective-C. Part of the program I am writing requires parsing HTML that is fetched dynamically and fed to a parser, so I began writing one myself. Yesterday I started looking into existing utilities and quickly found Perl, HTML::LinkExtor, and CPAN. Using CamelBones (a Cocoa-Perl bridge) and some basic Perl, by last night I had a Perl script embedded in my app that retrieves a web page and returns an array of links. Kudos to Perl for being really robust in terms of available utilities and ease of installing them, and for terseness. My Perl code is all of 4 lines.

Thanks for bearing with me. Trouble is, I need ALL the data between the <a> and </a> tags to be extracted, not just the address. To me, why feeding

<a href="/reviews/hardware/e1405.ars">Dell Inspiron e1405 laptop</a>

would give

(a, href, "/reviews/hardware/e1405.ars")

is a real stumper. Where is the title? So, I figure I'm either using the wrong class or I have an option disabled.

Is there any way to get it to give me the raw <a> </a> block?
If I give some utility a raw block, can I get out all of its features, no matter how arbitrary?

Another example:

<a href="../index.html"><img width="425" height="50" border="0"
 src="../i/perldoc_banner.gif" alt="Welcome to Perldoc.com"></a>

gives

(a, href, "../index.html")
(img, src, "../i/perldoc_banner.gif")

...2 links? And nothing for the 'alt' attribute? That's way off, for me anyway. I don't see how to programmatically guarantee, just looking at those two links, that they are in fact the same link as seen in a web browser. I don't want to just arbitrarily associate a link with the nearest image, though I absolutely want to account for image links.

Is there a way to just get a dictionary (key-value pairs) of everything between the A tags?

Anyway, thanks a lot for reading all this. I'm hoping I'm just missing something, since I feel "almost there". It would be really sweet to leverage all this latent perlness and move on, without...well, you know...reinventing the...something... :)

(a, (href, http://images.google.com/imghp?hl=en), (img, (src, /intl/en_ALL/images/images_res.gif), (width, 150), (height, 58), (alt, "Go to Google Image Search Home"), (border, 0), (vspace, 12)), Image Home)

#!/usr/bin/perl

use strict;
use warnings;
use HTML::LinkExtractor;
use LWP::Simple qw(get);

# get a page to test (host/path from the command line, or a default)
my $page = shift || "search.cpan.org/recent";
my $html = get("http://$page");
die "Could not fetch http://$page\n" unless defined $html;

# set up the parser
my $LX = HTML::LinkExtractor->new();
$LX->strip(1); # just anchor text, not entire tag
$LX->parse(\$html);

# print anchor text and href
for my $Link (@{ $LX->links }) {
    # only regular links
    next unless $Link->{tag} eq 'a';
    my $href = $Link->{href};
    my $text = $Link->{_TEXT};
    print $text, " -> ", $href, "\n";
}

undef $LX;
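
If the goal is the nested structure asked for in the question (the anchor's attributes, the attributes of anything inside it such as an <img>, the link text, and the raw <a>...</a> block), one possible approach is to walk the token stream with HTML::TokeParser instead of a link extractor. The following is only a sketch under that assumption: the sub name extract_anchors and the hash keys attr, children, text and raw are made up for illustration and are not part of any module. Text is kept as it appears in the source, so entities are not decoded.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use HTML::TokeParser;

# extract_anchors is a hypothetical helper, not a module function.
# It returns one hash ref per <a>...</a> block with these keys:
#   attr     => hash ref of the <a> tag's own attributes
#   children => list of { tag, attr } for every start tag inside the anchor
#   text     => concatenated text between <a> and </a> (entities not decoded)
#   raw      => the original source text of the whole block
sub extract_anchors {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Could not create parser\n";

    my @anchors;
    while (my $start = $p->get_tag('a')) {
        # get_tag returns [ tag, attr hash, attr order, original text ]
        my %anchor = (
            attr     => $start->[1],
            children => [],
            text     => '',
            raw      => $start->[3],
        );

        while (my $tok = $p->get_token) {
            my $type = $tok->[0];
            if ($type eq 'S') {          # start tag inside the anchor
                push @{ $anchor{children} },
                    { tag => $tok->[1], attr => $tok->[2] };
                $anchor{raw} .= $tok->[4];
            }
            elsif ($type eq 'T') {       # plain text
                $anchor{text} .= $tok->[1];
                $anchor{raw}  .= $tok->[1];
            }
            elsif ($type eq 'E') {       # end tag; stop at </a>
                $anchor{raw} .= $tok->[2];
                last if $tok->[1] eq 'a';
            }
        }
        push @anchors, \%anchor;
    }
    return @anchors;
}

# Try it on the Google image link quoted in this thread.
my $sample = '<a href=http://images.google.com/imghp?hl=en>'
           . '<img src=/intl/en_ALL/images/images_res.gif width=150 height=58 '
           . 'alt="Go to Google Image Search Home" border=0 vspace=12>Image Home</a>';

for my $link (extract_anchors($sample)) {
    print "a: $_ = $link->{attr}{$_}\n" for sort keys %{ $link->{attr} };
    for my $child (@{ $link->{children} }) {
        print "  $child->{tag}: $_ = $child->{attr}{$_}\n"
            for sort keys %{ $child->{attr} };
    }
    print "text: $link->{text}\n";
    print "raw:  $link->{raw}\n";
}
```

Because the <img> is recorded as a child of the anchor it sits in, an image link stays a single link rather than showing up as two unrelated entries. Comments and other token types inside the anchor are ignored in this sketch.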

Array
(
[0] => Array
(
[0] => <a href=http://images.google.com/imghp?hl=en><img src=/intl/en_ALL/images/images_res.gif width=150 height=58 alt="Go to Google Image Search Home" border=0 vspace=12>Image Home</a>
)

[1] => Array
(
[0] => [images.google.com...]
)

[2] => Array
(
[0] => <img src=/intl/en_ALL/images/images_res.gif width=150 height=58 alt="Go to Google Image Search Home" border=0 vspace=12>Image Home
)

)
