More verbose HTML::LinkExtor or parser?

How can all possible info between two html tags be extracted?

         

omnius

4:04 pm on Sep 9, 2006 (gmt 0)

10+ Year Member



Hi, I am new to Perl, and actually program full-time in Objective-C. A portion of the program I am writing requires parsing HTML which is dynamically fetched and fed to the parser. So I began writing the parser. I started looking into existing utilities yesterday and quickly found Perl, HTML::LinkExtor, and CPAN. Using CamelBones (a Cocoa-Perl bridge) and some basic Perl, by last night I had a Perl script running embedded in my app which retrieves a webpage and returns an array of links. Kudos to Perl for being really robust in terms of available utilities and ease of installing them, and for terseness. My Perl code is all of 4 lines.

Thanks for bearing with me. Trouble is, I need ALL the data between the <a> and </a> tags to be extracted, not just the address. For me, why feeding

<a href="/reviews/hardware/e1405.ars">Dell Inspiron e1405 laptop</a>

would give

(a, href, "/reviews/hardware/e1405.ars")

is a real stumper. Where is the title? So, I figure I'm either using the wrong class or I have an option disabled.

Is there any way to get it to give me the raw <a> </a> block?
If I give some utility a raw block, can I get out all of its features, no matter how arbitrary?

Another example:

<a href="../index.html"><img width="425" height="50" border="0"
src="../i/perldoc_banner.gif" alt="Welcome to Perldoc.com"></a>

gives

(a, href, "../index.html")
(img, src, "../i/perldoc_banner.gif")

...2 links? And nothing for the 'alt' attribute? That's way off, for me anyway. I don't see how to programmatically guarantee, just looking at those two links, that they are in fact the same link as seen in a web browser. I don't want to just arbitrarily associate a link with the nearest image, though I absolutely want to account for image links.

Is there a way to just get a dictionary (key-value pairs) of everything between the A tags?

Anyway, thanks a lot for reading all this. I'm hoping I'm just missing something, since I feel "almost there". It would be really sweet to leverage all this latent perlness and move on, without...well, you know...reinventing the...something... :)

Little_G

4:36 pm on Sep 9, 2006 (gmt 0)

10+ Year Member



Hi,

I'm not a Perl person so you may have to adapt this a little, but here goes.
The following piece of PHP seems to do what I think you want:

$pattern = "/a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>/";
preg_match_all($pattern, file_get_contents("http://www.google.com"), $matches);
var_dump($matches[1],$matches[2]);

This is a regular expression that looks for strings inside the 'href' attribute, inside an 'a' tag. It's supposed to be a 'perl compatible' regular expression, so you may be able to use it.

Andrew

omnius

9:01 pm on Sep 9, 2006 (gmt 0)

10+ Year Member



Hi Andrew, thanks. Unfortunately, not all websites use quotes when declaring attributes. Witness Google:

<a href=http://images.google.com/imghp?hl=en><img src=/intl/en_ALL/images/images_res.gif width=150 height=58 alt="Go to Google Image Search Home" border=0 vspace=12></a>

I am really hoping this tool already exists, because it is a perfect illustration of the kind of hair-splitting I really don't want to be involved in. Every time I come up with a possible rule for my string scanner I find an exception to that rule. Why can't html just be easy to dissect?

Little_G

12:03 am on Sep 10, 2006 (gmt 0)

10+ Year Member



Hi,

I understand it can be annoying to try and find ways of adapting to other people's 'eccentric' use of HTML, but as far as I can see the only way of stopping the following RegEx from finding every a/href in a web page is to code it so badly that no browser could parse it.

$pattern = "/<a[\s]+[^>]*?href[\s]?\=[\s\"\']*([\w:?=@&\/#._;-]+)[\s\"\']*.*?\>([^<]+|.*?)?<\/a>/";
preg_match_all($pattern, file_get_contents("http://www.google.com"), $matches);
echo htmlentities(print_r($matches,true));

I also found this link that may be useful to you. [webmasterworld.com ]

Andrew

omnius

3:11 pm on Sep 10, 2006 (gmt 0)

10+ Year Member



Hey G, I will try and integrate your code into my program and see if I can get it to work. But are you claiming that that block will turn

<a href=http://images.google.com/imghp?hl=en><img src=/intl/en_ALL/images/images_res.gif width=150 height=58 alt="Go to Google Image Search Home" border=0 vspace=12>Image Home</a>

into

(a, 
(href, http://images.google.com/imghp?hl=en),
(img, (src, /intl/en_ALL/images/images_res.gif), (width, 150), (height, 58), (alt, "Go to Google Image Search Home"), (border, 0), (vspace, 12)),
Image Home)

?

omnius

3:25 pm on Sep 10, 2006 (gmt 0)

10+ Year Member



Oh and I am wading through the WWW::Mech docs right now. thanks...

wruppert

3:37 pm on Sep 10, 2006 (gmt 0)

10+ Year Member



Here is a script I wrote to try out a different module that extracts anchor text as well as URL.

#!/usr/bin/perl

use strict;
use warnings;
use HTML::LinkExtractor;
use LWP::Simple qw(get);

# get a page to test
my $page = shift || "search.cpan.org/recent";
my $html = get("http://$page") or die "Could not fetch http://$page\n";

# setup the parser
my $LX = HTML::LinkExtractor->new();
$LX->strip(1); # just anchor text, not entire tag
$LX->parse(\$html);

# print anchor text and href
for my $Link (@{ $LX->links }) {
    my $tag = $Link->{tag};
    # only regular links
    next unless $tag eq 'a';
    my $href = $Link->{href};
    my $text = $Link->{_TEXT};
    print $text, " -> ", $href, "\n";
}

undef $LX;

Little_G

4:38 pm on Sep 10, 2006 (gmt 0)

10+ Year Member



Hi,

In response to your question, no, my script will turn:

<a href=http://images.google.com/imghp?hl=en><img src=/intl/en_ALL/images/images_res.gif width=150 height=58 alt="Go to Google Image Search Home" border=0 vspace=12>Image Home</a>

into

Array 
(
[0] => Array
(
[0] => <a href=http://images.google.com/imghp?hl=en><img src=/intl/en_ALL/images/images_res.gif width=150 height=58 alt="Go to Google Image Search Home" border=0 vspace=12>Image Home</a>
)

[1] => Array
(
[0] => [images.google.com...]
)

[2] => Array
(
[0] => <img src=/intl/en_ALL/images/images_res.gif width=150 height=58 alt="Go to Google Image Search Home" border=0 vspace=12>Image Home
)

)

The first array contains the full match, the second the URL, and the third the string from in between the anchor tags.
Hopefully Mechanize will do what you need it to. I could continue to develop the RegEx, but the longer it gets the slower it gets, and you may find a Perl extension faster.

Andrew

omnius

5:38 pm on Sep 12, 2006 (gmt 0)

10+ Year Member



Thanks guys. I have been trying to integrate these modules and code blocks into my app, but it's a little difficult to get reliable output due to the Cocoa-Perl bridge. Since I was not getting the output I wanted anyway, I decided yesterday to give another shot at writing the parser in Cocoa. It works (so far... we'll see how it handles really bad HTML... which there is A LOT of). I may have to come back to this anyway for WWW::Mechanize's ability to fill out forms, for example. We'll see.

Thanks again but I needed the attributes from within the tags. Getting the tags is the easy part.
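
For anyone who lands here needing the attributes as well: a tokenizing parser, rather than a link extractor, can hand back every attribute of the <a> tag and of anything nested inside it. Below is a minimal sketch, assuming HTML::TokeParser (part of the HTML::Parser distribution on CPAN) is installed. The helper name extract_anchors and the hash layout (attr, children, text) are made up for illustration; nobody in this thread posted them, and the sketch does nothing special for anchors that are never closed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;    # ships with the HTML::Parser distribution

# Collect every <a>...</a> block as a hash holding the anchor's own
# attributes, the attributes of each tag nested inside it, and the text.
# extract_anchors() is a hypothetical helper name for this sketch.
sub extract_anchors {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser: $!";
    my (@anchors, $current);

    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {                        # start tag
            my ($tag, $attr) = @{$token}[1, 2];    # names come back lowercased
            if ($tag eq 'a') {
                $current = { attr => {%$attr}, children => [], text => '' };
            }
            elsif ($current) {
                push @{ $current->{children} },
                    { tag => $tag, attr => {%$attr} };
            }
        }
        elsif ($type eq 'T' && $current) {         # text inside the anchor
            $current->{text} .= $token->[1];
        }
        elsif ($type eq 'E' && $token->[1] eq 'a' && $current) {
            push @anchors, $current;               # anchor closed, keep it
            undef $current;
        }
    }
    return @anchors;
}

# The unquoted Google example from earlier in the thread.
my $html = '<a href=http://images.google.com/imghp?hl=en>'
         . '<img src=/intl/en_ALL/images/images_res.gif width=150 height=58 '
         . 'alt="Go to Google Image Search Home" border=0 vspace=12>'
         . 'Image Home</a>';

for my $link (extract_anchors($html)) {
    print "a: $_=$link->{attr}{$_}\n" for sort keys %{ $link->{attr} };
    for my $child (@{ $link->{children} }) {
        print "$child->{tag}: $_=$child->{attr}{$_}\n"
            for sort keys %{ $child->{attr} };
    }
    print "text: $link->{text}\n";
}
```

Because the image sits inside the same hash as the anchor that wraps it, there is no need to guess which image belongs to which link, and quoted and unquoted attribute values come out the same way.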