News harvester - Perl Server Side CGI Scripting forum at WebmasterWorld - WebmasterWorld

News harvester

how to amend script to make it work for other sites

dawlish

11:58 pm on Nov 30, 2001 (gmt 0)

I have the following script, which grabs news headlines from the BBC. I was hoping someone could give me some advice on how to modify it to work for other sites, e.g. [itv.com...]

I'm very new to scripting and can't really work out what's happening, so any advice would be welcome.

#!/usr/bin/perl

print "Content-type: text/html\n\n";
use LWP::Simple;
$doc = get "http://news.bbc.co.uk/hi/english/uk/default.stm";# Read the BBCNews Retriever
@bbc = split(/\n/, $doc);
$flag = 0;
$next = 0;
foreach $line (@bbc) { # Look for the headlines (usually five)
    if ($line =~ /<DIV CLASS="bodytext">/) {
        if ($flag eq 0) {
            $flag = 1;
        }
    } elsif ($line =~ /<A href="/) { # Get the URL
        if ($flag eq 1) {
            $buffer = $line;
            $flag = 2;
        }
    } elsif ($line =~ /<B class="h/) { # Get the description
        if ($flag eq 2) {
            $buffer = $buffer.$line;
            $flag = 3;
            $next = 1;
        }
    } elsif ($next eq 1) { # Get the summary (usually the next three lines)
        if ($flag eq 3) {
            $story = $line;
            $next = 2;
        }
    } elsif ($next eq 2) {
        if ($flag eq 3) {
            $story = $story.$line;
            $next = 3;
        }
    } elsif ($next eq 3) {
        if ($flag eq 3) {
            $story = $story.$line;
            &format(); # Format the data
            $news = $news.$buffer.$story;
            $flag = 0;
            $next = 0;
        }
    } else {
        # Do nothing!
    }
}

print qq~

<dl>$news</dl>

~;

exit;

sub format {

    # This cleans the lines so that the HTML is displayed correctly (i.e. HTML 4.01)

    $story =~ s/<br clear=all>//i;
    $story =~ s/<.a>//i;
    $story =~ s/<.b>//i;
    $story =~ s/<br[^>]*>//i;
    $story =~ s/<.div>//i;
    $story =~ s/\t//g;
    $story =~ s/\r//g;
    $title = "@@@".$story;
    $title =~ s/\.//;
    $title =~ s/@@@ //;
    $title =~ s/@@@//;
    $title =~ s/"/''/g;
    $buffer =~ s/<a href[^>]//i;
    $buffer =~ s/"//i;
    $buffer =~ s/<b class[^>]*>//i;
    $link = $buffer."@@@";
    $link =~ s/>[^@@@]*@@@//i;
    $buffer =~ s/>/ target="_blank">/i; # Can add target="_blank" here if you want a new page opened
    $buffer =~ s/\t//g;
    $buffer =~ s/\r//g;
    $story = "$story<br><br>";
    $buffer = "<a href=\"http://news.bbc.co.uk".$buffer."</a>\n";
}

sugarkane

7:35 pm on Dec 2, 2001 (gmt 0)

Hmm, I think it'd be very hard to simply modify this for another site. Apart from the first few lines, the script deals with the particular way the BBC headlines are laid out in terms of HTML.

The generic version of the script would only consist of

[perl]

#!/usr/bin/perl

print "Content-type: text/html\n\n";
use LWP::Simple;
$doc = get "http://somesite.com";
@lines = split(/\n/, $doc);
foreach $line (@lines) {
# parse the page
}
[/perl]

...which is pretty useless as it only fetches the page but then does nothing with it. The code within the foreach {} loop would have to be custom written for each individual news source you wanted to use...
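
To give an idea of what that custom code inside the loop might look like, here is a minimal sketch. The sample markup, the "headline" class and the extract_headlines helper are all made up for illustration; a real site's HTML will differ, and the pattern would have to be rewritten to match it.

```perl
use strict;
use warnings;

# Hypothetical sample markup standing in for a fetched page.
my $html = <<'HTML';
<div class="headline"><a href="/news/1.html">First story</a></div>
<p>Some unrelated text</p>
<div class="headline"><a href="/news/2.html">Second story</a></div>
HTML

# Return [url, title] pairs for every link inside a "headline" div.
# $base is prepended to each (relative) href found on the page.
sub extract_headlines {
    my ($page, $base) = @_;
    my @found;
    foreach my $line (split /\n/, $page) {
        if ($line =~ m{<div class="headline"><a href="([^"]+)">([^<]+)</a>}i) {
            push @found, [ $base . $1, $2 ];
        }
    }
    return @found;
}

foreach my $h (extract_headlines($html, "http://somesite.com")) {
    print qq{<a href="$h->[0]">$h->[1]</a><br>\n};
}
```

Line-by-line matching like this breaks as soon as the site changes its layout, which is why each news source needs its own version.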

seriesint

10:19 pm on Dec 2, 2001 (gmt 0)

Since you're new to scripting, I wouldn't even use that script as a starting point. A much better solution would be to look for RSS feeds at sites like www.syndic8.com or newsisfree.com. (I believe they offer a BBC feed; it might be from moreover.com.) Figure out how to parse an RSS file using Perl and you have access to headlines from several thousand sites, where you can just pass in a URL and output the headlines.
It's more complicated to learn to use XML::Simple or the RSS modules, but that code to match sections of the page just looks brutal for someone new to scripting. This might look just as brutal, but it will work with almost any RSS 0.9 file. It's currently used with some miscellaneous Moreover.com feeds. From a list of 10 news feeds, it selects one using the rand function. It then simply grabs the titles and links from the RSS file and prints them out to a file, along with the current feed's title. (That is kind of important to note here: if you use this code, you will be better off inserting the URL for the RSS feed.) Here's the code. You of course won't have rsslist.txt, since that's my list of newsfeeds.
***begin script***

#!C:\perl\bin\perl.exe rsspage.pl
###Creates a simple RSS table that could be
###used via a SSI call.
use LWP::Simple;
use XML::Simple;

$limit = 6;              # maximum number of headlines to print
$rssout = 'rssout.inc';  # HTML fragment pulled in by the SSI call
$foo = "page2.xml";      # temporary local copy of the feed

# Read the list of feed URLs, one per line.
open (LIST, "rsslist.txt") || die "no rss list\n";
while (<LIST>) {
    chomp;
    push (@list, $_) if /\S/;
}
close LIST;

# Pick one feed at random; int(rand(@list)) is always a valid index.
$rss = $list[int(rand(@list))];
# getstore returns the HTTP status code, so test it with is_success.
is_success(getstore($rss, $foo)) || die "Get failed\n";

# ForceArray keeps {item} an array even if the feed has only one item.
$fd = new XML::Simple(ForceArray => ['item'], KeyAttr => []);
my $simp = $fd->XMLin($foo);
$content = $simp->{channel}->{title};
@items = @{ $simp->{channel}->{item} || [] };
for ($i = 0; $i < $limit && $i < @items; $i++) {
    $url = $items[$i]->{link};
    $text = $items[$i]->{title};
    $body .= "<tr><td><a href=\"$url\" class=\"blk\">";
    $body .= "$text</a></td></tr>";
}
open (OUT, ">$rssout") || die "Couldn't open out file\n";
print OUT "<h3> $content </h3>";
print OUT "<table>";
print OUT $body;
print OUT "</table>";

close OUT;
unlink ($foo);

****End Script
**Example Moreover links in a file, one per line.******

[p.moreover.com...]

***End Sample******
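
The step that selects one feed from the list can be sketched on its own. This is only an illustration: pick_random is a made-up helper and the example.com URLs are placeholders for the lines of the list file. The point is that int(rand(@list)) always produces a valid index, whatever the length of the list.

```perl
use strict;
use warnings;

# Return one element of the list, chosen at random (undef for an empty list).
sub pick_random {
    my @list = @_;
    return undef unless @list;
    # rand(@list) yields a number in [0, number of elements);
    # int() truncates it to a valid array index.
    return $list[ int(rand(@list)) ];
}

# Hypothetical feed URLs standing in for the lines of the list file.
my @feeds = map { "http://example.com/feed$_.rss" } 1 .. 10;
print pick_random(@feeds), "\n";
```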

Now that might look bad, but most of it is explained in the perldoc page for XML::Simple, and there is an IBM developerWorks page for XML that uses the same basic principles. And of course you can always get some examples of RSS and XML at O'Reilly's site.
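
To see the structure XML::Simple hands back without fetching anything, here is a rough sketch that parses a feed held in a string. The sample feed and the headlines helper are invented for the example, and it assumes the items are nested inside the channel element, as the script expects. ForceArray and KeyAttr are real XML::Simple options; ForceArray keeps item an array even when the feed has a single entry.

```perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical sample feed; items are assumed to sit inside <channel>.
my $xml = <<'XML';
<rss version="0.91">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First headline</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>
XML

# Return up to $limit [title, link] pairs from the feed text.
sub headlines {
    my ($xml_text, $limit) = @_;
    # XMLin accepts a string of XML as well as a filename.
    my $feed  = XMLin($xml_text, ForceArray => ['item'], KeyAttr => []);
    my @items = @{ $feed->{channel}{item} || [] };
    splice(@items, $limit) if @items > $limit;
    return map { [ $_->{title}, $_->{link} ] } @items;
}

foreach my $h (headlines($xml, 6)) {
    print "$h->[0] => $h->[1]\n";
}
```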
One last note: all the links are assigned the CSS class "blk" (for block), so you can add a CSS ruleset such as .blk { display: block; } plus whatever other styles the links themselves need. You could assign a ruleset to the table instead, but I just wrap the SSI call in a DIV with its own class and styles defined.

HTH
later

<fixed smileys - sugarkane>

(edited by: sugarkane at 10:28 pm (gmt) on Dec. 2, 2001)

sugarkane

10:34 pm on Dec 2, 2001 (gmt 0)

Hi seriesint, welcome to wmw

Thanks for that, it looks a good solution :)