Forum Moderators: coopster & phranque

How to fetch a URL and extract URLs

perluser1

9:05 am on Mar 16, 2004 (gmt 0)

10+ Year Member



How do you fetch a web page and then extract URLs in Perl?

Thanks.

perluser1

9:39 am on Mar 16, 2004 (gmt 0)

10+ Year Member



By the way, the URLs I want to extract are the ones in a href attributes.

Birdman

1:39 pm on Mar 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try the LWP module.

[perldoc.com...]

Birdman

SeanW

5:12 pm on Mar 16, 2004 (gmt 0)

10+ Year Member



LWP is good, but WWW::Mechanize is even easier (it uses LWP), since it provides functions like find_all_links(). The man page has examples.
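For what it's worth, here is a minimal sketch of the find_all_links() approach. The helper name page_links and the command-line argument are just placeholders for illustration, and it assumes WWW::Mechanize is already installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Return the absolute URL of every <a href> link on a page.
# page_links is a hypothetical helper name, not part of WWW::Mechanize.
sub page_links {
    my ($page_url) = @_;

    my $mech = WWW::Mechanize->new( autocheck => 1 );    # die on HTTP errors
    $mech->get($page_url);

    # find_all_links() returns WWW::Mechanize::Link objects;
    # tag => 'a' limits the result to anchor tags.
    return map { $_->url_abs->as_string } $mech->find_all_links( tag => 'a' );
}

# The starting page is passed on the command line (placeholder usage).
if (@ARGV) {
    print "$_\n" for page_links( $ARGV[0] );
}
```

url_abs() resolves each link against the page it came from, so relative links come back as full URLs without any extra work.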

Sean

perluser1

11:04 pm on Mar 16, 2004 (gmt 0)

10+ Year Member



Thanks,

I took a look at WWW::Mechanize.

I have never installed a Perl module or run a makefile before. I tried to find this information for WWW::Mechanize but could not locate any instructions. Could someone point me to them?

What I need to do is fetch a URL and get several relative links (a href). It needs some logic to skip the absolute links. Then, with use URI, I rebuild the relative links into absolute links. Then I need to go to each of these URLs, extract the title from the page, and save the title together with the URL to a file. Would WWW::Mechanize be the best way to do this? I am just a newbie at Perl. Is this going to require some advanced scripting?

I started some basic code below.

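#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple;

# "url" is a placeholder for the address of the page to fetch.
# get() returns undef on failure, so $! is not a useful error message here.
my $content = get( "url" );
die "Could not fetch the page\n" unless defined $content;

# Loop over every href="..." value in the page
while ( $content =~ m{href="(.*?)"}ig ) {
    my $url = $1;
    # Placeholder prefix; a relative link really needs the page's base URL
    my $wholeLink = "http://" . "$url\n";
    print $wholeLink;
}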

SeanW

12:06 am on Mar 17, 2004 (gmt 0)

10+ Year Member



You can install a module with CPAN:

# cpan
cpan> install WWW::Mechanize

It'll follow dependencies if you let it, so it'll install whatever is needed.
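Once the install finishes, a quick way to confirm that perl can actually load the module is to require it and print its version. This is only a sketch; module_version is a made-up helper name, and the list of modules is just the ones mentioned in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the installed version of a module, or undef if it cannot be loaded.
# module_version is a hypothetical helper name used for illustration.
sub module_version {
    my ($module) = @_;

    # Turn WWW::Mechanize into WWW/Mechanize.pm so require can find it
    ( my $file = "$module.pm" ) =~ s{::}{/}g;
    return unless eval { require $file; 1 };

    my $version = $module->VERSION;
    return defined $version ? $version : 'unknown';
}

for my $module (qw(LWP::Simple URI WWW::Mechanize)) {
    my $version = module_version($module);
    printf "%-16s %s\n", $module, defined $version ? $version : 'not installed';
}
```

Any module reported as "not installed" either failed to build or went into a directory that this perl does not search.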

I would really look at Mechanize for what you're doing. With LWP you have to build the absolute URL yourself (URI module), not to mention parse the page (HTML::TokeParser is good for this). Mech provides built-in functions to extract links from a page, and it still gives you access to the underlying LWP object.
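To give an idea of the do-it-yourself route, here is a rough sketch that pulls the relative a href links out of a chunk of HTML with HTML::TokeParser and turns them into absolute URLs with URI. The function name relative_links_abs, the sample HTML, and the example.com base URL are all made up for illustration; in a real script the HTML would come from an LWP fetch.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Return absolute URLs for the relative <a href> links in $html.
# $base is the URL the page was fetched from.
sub relative_links_abs {
    my ( $html, $base ) = @_;

    my $parser = HTML::TokeParser->new( \$html )
        or die "Cannot parse HTML\n";

    my @links;
    while ( my $tag = $parser->get_tag('a') ) {
        my $href = $tag->[1]{href};          # attribute hash of the <a> tag
        next unless defined $href && length $href;

        # A link that already has a scheme (http:, mailto:, ...) is absolute
        next if defined URI->new($href)->scheme;

        push @links, URI->new_abs( $href, $base )->as_string;
    }
    return @links;
}

# Hypothetical page content and base URL, for illustration only
my $html = '<a href="docs/intro.html">Intro</a> <a href="http://example.org/">Other</a>';
print "$_\n" for relative_links_abs( $html, 'http://example.com/site/index.html' );
```

With the sample input above, only the first link is kept, and it is printed as http://example.com/site/docs/intro.html.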

I've been using LWP and associated modules for years, but I only discovered Mech a few months ago and it is *so* much easier. Check out my home page in my profile; I wrote a brief article comparing LWP and Mech for web scraping.

If you want to get further into it, O'Reilly publishes a couple of great books: Perl & LWP, and Spidering Hacks. The former is all about LWP and HTML parsing; the latter uses a variety of techniques to grab information off the web. Again, a review of Perl & LWP is on my website.

Sean

perluser1

1:52 am on Mar 17, 2004 (gmt 0)

10+ Year Member



How about installing this way?

CPAN:

% perl -MCPAN -e shell [as root]
> install WWW::Mechanize
> quit

perluser1

8:31 am on Mar 17, 2004 (gmt 0)

10+ Year Member



I ran the above and everything seemed OK, but at the end I got this error message. Any help is appreciated. Thanks.

Checking if your kit is complete...
Looks good
Writing Makefile for WWW::Mechanize
make: *** No rule to make target `/usr/lib/perl5/5.8.3/i386-linux-thread-multi/CORE/config.h', needed by `Makefile'. Stop.
/usr/bin/make -- NOT OK
Running make test
Can't test without successful make
Running make install
make had returned bad status, install seems impossible
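One thing that can be checked from Perl itself is whether the config.h that make is asking for actually exists where this perl expects it. The sketch below is only a diagnostic and does not fix anything; core_header_path is a made-up helper name. A missing file would point to the perl headers not being installed, or to a Makefile generated for a different perl, but that is an assumption and not something confirmed in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Config;
use File::Spec;

# Build the path where this perl expects one of its C headers (CORE directory).
# core_header_path is a hypothetical helper name used for illustration.
sub core_header_path {
    my ($name) = @_;
    return File::Spec->catfile( $Config{archlibexp}, 'CORE', $name );
}

my $header = core_header_path('config.h');
my $status = ( -e $header ) ? 'present' : 'missing';

print "perl version:    $Config{version}\n";
print "expected header: $header\n";
print "status:          $status\n";
```

If the path printed here differs from the one in the make error, the Makefile was probably written by a different perl than the one running this script.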