Forum Moderators: coopster & jatar k & phranque

Perl for retrieving information from a webpage (javascript/html)

Trying to find the best set of tools to retrieve information from a webpage

4:22 pm on Jun 5, 2014 (gmt 0)

New User

joined:June 5, 2014
posts: 1
votes: 0


I am fairly new to Perl, and I would like to know what you would advise me to use to retrieve information located on a couple of webpages.

Given [HTML] tables [on web pages], what would be the best way to automatically access the pages, retrieve those tables, and manipulate them? Given that they are both in HTML, would it be enough to just parse the tables?

I know I'm asking this in a Perl forum but if MySQL would be more fitting, please let me know.

Note: I am not asking for you to do any coding, I'm just looking for advice on packages, tools and whatnot that I may use for this.

Thank you very much!

[edited by: coopster at 1:34 pm (utc) on Jun 9, 2014]
[edit reason] no site specifics please and thank you! [/edit]

1:37 pm on June 9, 2014 (gmt 0)


WebmasterWorld Administrator coopster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:July 31, 2003
votes: 2

Welcome to WebmasterWorld, sosippus.

If the pages are on the same server, you can simply use the standard file open, read, and write functions.
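
For the same-server case, a minimal sketch using only core Perl might look like the following. The helper names slurp_file and write_file, and the UTF-8 encoding, are assumptions made for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Read a whole file into one string (UTF-8 is assumed here).
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
    local $/;                 # undefined record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Write a string to a file, replacing any existing content.
sub write_file {
    my ($path, $content) = @_;
    open my $fh, '>:encoding(UTF-8)', $path or die "Cannot open $path: $!";
    print {$fh} $content;
    close $fh or die "Cannot close $path: $!";
    return;
}
```

Calling slurp_file on a saved page would return its HTML as a single string, ready to hand to whichever parser you choose.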

If the pages are on an external server, then the most popular tool is cURL. You can also open your own sockets and read from and write to them, but the cURL library is quite extensive and a handy tool.
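
As a rough sketch of the external-server case in Perl itself: the example below swaps in the core HTTP::Tiny module rather than cURL, purely so that it runs without extra installs. The helper name body_or_die and the 10-second timeout are assumptions for illustration.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Return the body of an HTTP::Tiny response hash, or die with the status line.
sub body_or_die {
    my ($res) = @_;
    die "Request failed: $res->{status} $res->{reason}\n" unless $res->{success};
    return $res->{content};
}

# Fetch the URL given on the command line, if any, and print its body.
if (@ARGV) {
    my $res = HTTP::Tiny->new(timeout => 10)->get($ARGV[0]);
    print body_or_die($res);
}
```

Run it with a URL as the only argument; with no argument it does nothing, so the helper can be reused elsewhere.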

As far as parsing the tables goes, you may want to investigate the "tidy" library.

8:35 pm on Aug 14, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 27, 2001
posts: 2548
votes: 0

Use Mojo::UserAgent and/or Mojo::DOM. For example (untested; see the Mojolicious documentation for more info):

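use strict;
use warnings;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;

# Fetch the page (placeholder URL) and collect every table row
my $rows = $ua->get('http://www.foo.com')->res->dom->find('table tr');

# Print each row's cells as one tab-separated line.
# map('text') replaces pluck('text'), which newer Mojolicious releases no longer provide.
for my $r ($rows->each) {
    print $r->find('td')->map('text')->join("\t"), "\n";
}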
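
To try the table-parsing step without touching the network, something like the sketch below may help. It assumes Mojolicious is installed; the sample markup and the helper name table_rows are made up for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup standing in for a fetched page.
my $html = <<'HTML';
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
HTML

# Return a list of array refs, one per table row that contains <td> cells.
sub table_rows {
    my ($markup) = @_;
    my @rows;
    for my $tr (Mojo::DOM->new($markup)->find('table tr')->each) {
        my $cells = $tr->find('td')->map('text')->to_array;
        push @rows, $cells if @$cells;    # skip rows that hold only <th> cells
    }
    return @rows;
}

# Print each data row as one tab-separated line.
print join("\t", @$_), "\n" for table_rows($html);
```

Once the rows are plain Perl arrays, they can be sorted, filtered, or loaded into a database such as MySQL like any other data.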