Forum Moderators: coopster & jatar k & phranque

Perl script processing XML problem

     
3:33 pm on Apr 2, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 12, 2001
posts:83
votes: 0


Hi,

I have a Perl script which processes an XML feed. It creates a cache file of the XML and presents the output on my website, formatted in the way outlined below. Although I'm no expert, the script below seems to work on the basis of there being line breaks (or spaces) within the cache file. Recently the line breaks in the XML feed were removed, so the script no longer processes the feed. I am only a novice when it comes to XML and Perl, and I found this script on the web some time ago.

I would be really grateful if someone could advise what I need to change below in order for the script to process the file, now that there are no line breaks or spaces in the cache file.

use strict;
use CGI qw(:all);
#use Fcntl qw(:flock);
use LWP::Simple qw(get);

my $xml_url = "http://www.website.com/cgi-bin/xmlfeed.exe?&s=books&chan=xf";

my $newscache = "cache.xml";
########################################
# enter number of articles to include
my $howmanyarticles = "80";

########################################
# write the cache file or
# if the cache file is older than 1 hour
# then re-write it.

#get_lock();
if ((not -e $newscache) or (-M $newscache > .5)) { # (not -e): the cache file doesn't exist. -M gives the file's age in days since it was last modified: 1 is 1 day, .5 is 12 hours, 0.04 is about 1 hour.
my $newsdoc = get($xml_url);# uses the LWP module "get" function to get the XML file.
if (defined $newsdoc) {
open (CACHEFILE, ">$newscache") or die "Writing to Cache : $!";
print CACHEFILE $newsdoc;
close (CACHEFILE);
}
}
#release_lock();

########################################
# now print the contents of the XML file

print header;
print "<table width=\"100%\" align= center border=\"0\" cellpadding=\"4\" cellspacing=\"0\">";

open (CF, "$newscache") or die "Unable to open $newscache : $!";
my ($productname, $productvenue);
my $counter =0;
while (<CF>) {
if (m,<product_desc>(.*)</product_desc>,) {
$productname = $1;
$productname =~ s/&apos;/'/g;
}
if (m,<venue_desc>(.*)</venue_desc>,) {
$productvenue = $1;
$productvenue =~ s/&apos;/'/g;

}

if (m,<crypto_block>(.*)</crypto_block>,) {
print "
<tr>

<td valign=\"top\"><a href=\"\/cgi-bin/go.cgi?$1\"><b>$productname</b></a></td><td>$productvenue</td>
";
$counter++;
last if $counter == $howmanyarticles;
}
}
close(CF);

print "
<tr>
<td width=\"100\%\" colspan=\"4\" height=\"10\"></td>
</tr>
</table>
";
# END

Many thanks for any help / assistance.

5:03 pm on Apr 2, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 1, 2003
posts:815
votes: 0


The construction I see several times in the file:

m,<somexmltag>(.*)</somexmltag>,

is faulty. It will grab everything between the first opening tag and the last closing tag on the line, because .* is "greedy."

Replace it with:
m,<somexmltag>(.*?)</somexmltag>,

Previously, the line breaks most likely kept the "greedy" pattern from matching too much.
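
To see the difference on a feed with no line breaks, here is a small sketch. The sample string and its values are made up for illustration; only the tag name comes from the script above.

```perl
use strict;
use warnings;

# Hypothetical one-line feed: two records, no line breaks.
my $feed = '<product_desc>Book A</product_desc><product_desc>Book B</product_desc>';

# Greedy: (.*) runs on to the LAST closing tag on the line.
my ($greedy) = $feed =~ m,<product_desc>(.*)</product_desc>,;

# Non-greedy: (.*?) stops at the FIRST closing tag.
my ($lazy) = $feed =~ m,<product_desc>(.*?)</product_desc>,;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture swallows the tags in the middle of the line, while the non-greedy one returns just the first record's text.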

5:24 pm on Apr 2, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 12, 2001
posts:83
votes: 0


timster,
Thanks for the suggestion. I made the change but the script now only outputs and formats one entry. It only processes the first set of tags - it doesn't go on to process the other lines.

4:14 am on Apr 3, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 1, 2004
posts:137
votes: 0


The <> operator works by bringing in a line delimited by a \n (that's the Perl code for line break), so since they've taken out the line breaks you won't be able to process the file that way.

You'll need to find the xml tag that delimits each record, then split the line based on that. So instead of:


while (<CF>)

use something like:


$a = <CF>;                                 # the whole feed now arrives as one line
@xml = split (/<\/record_end_tag>/, $a);   # one list element per record
for (@xml) {                               # each record lands in $_
s/<record_start_tag>//;

and then continue on with your normal processing. That's really just a patch though. To deal with XML, use an XML parsing module. Check search.cpan.org and search for an XML module that does what you want (XML::Parser might be a good place to start).
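
Here is a rough, self-contained sketch of that split approach. It assumes <event> is the tag that wraps each record, and the sample values are invented. An in-memory filehandle stands in for the cache file so it runs on its own.

```perl
use strict;
use warnings;

# Hypothetical feed with no line breaks; <event> as the record tag is an assumption.
my $feed = '<event><product_desc>Book A</product_desc><crypto_block>abc</crypto_block></event>'
         . '<event><product_desc>Book B</product_desc><crypto_block>def</crypto_block></event>';

# Read from an in-memory filehandle so the sketch needs no cache file.
open(my $fh, '<', \$feed) or die "Unable to open in-memory feed: $!";
my $data = do { local $/; <$fh> };   # with $/ undefined, <$fh> reads everything at once
close($fh);

# One list element per record; split consumes the closing tag.
my @records = split m{</event>}, $data;

my @rows;
for my $record (@records) {
    my ($name)  = $record =~ m{<product_desc>(.*?)</product_desc>};
    my ($block) = $record =~ m{<crypto_block>(.*?)</crypto_block>};
    next unless defined $name and defined $block;   # skip incomplete records
    push @rows, "$name => $block";
}
print "$_\n" for @rows;
```

Reading the whole file in one go also keeps working if the feed provider puts the line breaks back.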

3:29 pm on Apr 3, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 12, 2001
posts:83
votes: 0


VectorJ, thanks for the advice. However I'm a real novice when it comes to perl and I have tried introducing the changes you suggest, but all I get now is an internal server error with the following in the error logs:

Global symbol "@xml" requires explicit package name

The modified script is below. Any further advice most welcome.

use strict;
use CGI qw(:all);
#use Fcntl qw(:flock);
use LWP::Simple qw(get);

my $xml_url = "http://www.website.com/cgi-bin/xmlfeed.exe?&s=books&chan=xf";

my $newscache = "cache.xml";
########################################
# enter number of articles to include
my $howmanyarticles = "80";

########################################
# write the cache file or
# if the cache file is older than 1 hour
# then re-write it.

#get_lock();
if ((not -e $newscache) or (-M $newscache > .5)) { # (not -e): the cache file doesn't exist. -M gives the file's age in days since it was last modified: 1 is 1 day, .5 is 12 hours, 0.04 is about 1 hour.
my $newsdoc = get($xml_url);# uses the LWP module "get" function to get the XML file.
if (defined $newsdoc) {
open (CACHEFILE, ">$newscache") or die "Writing to Cache : $!";
print CACHEFILE $newsdoc;
close (CACHEFILE);
}
}
#release_lock();

########################################
# now print the contents of the XML file

print header;
print "<table width=\"100%\" align= center border=\"0\" cellpadding=\"4\" cellspacing=\"0\">";

open (CF, "$newscache") or die "Unable to open $newscache : $!";
my ($productname, $productvenue);
my $counter =0;
$a = <CF>;
@xml = split (/<\/event>/, $a);
for $xml_line(@xml) {
s/<event>//;
{
if (m,<product_desc>(.*)</product_desc>,) {
$productname = $1;
$productname =~ s/&apos;/'/g;
}
if (m,<venue_desc>(.*)</venue_desc>,) {
$productvenue = $1;
$productvenue =~ s/&apos;/'/g;

}

if (m,<crypto_block>(.*)</crypto_block>,) {
print "
<tr>

<td valign=\"top\"><a href=\"\/cgi-bin/go.cgi?$1\"><b>$productname</b></a></td><td>$productvenue</td>
";
$counter++;
last if $counter == $howmanyarticles;
}
}
close(CF);

print "
<tr>
<td width=\"100\%\" colspan=\"4\" height=\"10\"></td>
</tr>
</table>
";
# END

12:40 pm on Apr 5, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 1, 2003
posts:815
votes: 0


When you "use strict" you need to declare your variables, like so:

my @xml = split (/<\/event>/, $a);

This is no big deal, but you might also want to replace your "END" comment with an actual end line, like so:

__END__
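
For what it's worth, here is a sketch of how the whole loop might look once everything is declared. The sample line and its values are made up, and the rows are collected in an array rather than printed inside the loop, just to keep it short. Note that a named loop variable has to be declared too, and the matches then have to be run against it explicitly. The braces in the posted script would also need balancing.

```perl
use strict;
use warnings;

# Hypothetical cache contents on a single line; the values are invented.
my $line = '<event><product_desc>Tom&apos;s Book</product_desc><venue_desc>Shop A</venue_desc><crypto_block>abc</crypto_block></event>'
         . '<event><product_desc>Book B</product_desc><venue_desc>Shop B</venue_desc><crypto_block>def</crypto_block></event>';

my $howmanyarticles = 80;
my ($productname, $productvenue);
my $counter = 0;
my @rows;    # collected here instead of printed, to keep the sketch short

my @xml = split(/<\/event>/, $line);    # "my" satisfies strict
for my $xml_line (@xml) {               # the loop variable needs "my" too
    if ($xml_line =~ m,<product_desc>(.*?)</product_desc>,) {
        ($productname = $1) =~ s/&apos;/'/g;
    }
    if ($xml_line =~ m,<venue_desc>(.*?)</venue_desc>,) {
        ($productvenue = $1) =~ s/&apos;/'/g;
    }
    if ($xml_line =~ m,<crypto_block>(.*?)</crypto_block>,) {
        push @rows, "<tr><td><a href=\"/cgi-bin/go.cgi?$1\"><b>$productname</b></a></td><td>$productvenue</td></tr>";
        $counter++;
        last if $counter == $howmanyarticles;
    }
}
print "$_\n" for @rows;
```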

7:08 pm on May 17, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:May 14, 2004
posts:41
votes: 0


You might also find some help in researching XML::TreeBuilder to parse the XML for you. A lot of the complications of parsing XML and tag-based data have already been thought through. Don't duplicate the work that others have done and given to the community.

XML::TreeBuilder (or XML::Simple) for smaller XML files and SAX processing for bigger jobs... at least that seems to have worked for me.
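
As a rough idea of what that looks like with XML::Simple: the sketch below assumes the module is installed from CPAN (it is not part of core Perl), and the <events> root element and the sample values are guesses, since the real feed isn't shown in this thread.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);   # CPAN module, not part of core Perl

# Hypothetical feed; the <events> root element and the values are assumptions.
my $feed = '<events>'
         . '<event><product_desc>Tom&apos;s Book</product_desc><venue_desc>Shop A</venue_desc><crypto_block>abc</crypto_block></event>'
         . '<event><product_desc>Book B</product_desc><venue_desc>Shop B</venue_desc><crypto_block>def</crypto_block></event>'
         . '</events>';

# ForceArray keeps <event> as a list even when the feed holds one record;
# KeyAttr => [] turns off XML::Simple's array-to-hash folding.
my $data = XMLin($feed, ForceArray => ['event'], KeyAttr => []);

my @rows;
for my $event (@{ $data->{event} }) {
    # Hash slice: pull the three fields out of each record in order.
    push @rows, join ' | ', @{$event}{qw(product_desc venue_desc crypto_block)};
}
print "$_\n" for @rows;
```

The parser decodes entities such as &apos; itself, so the hand-written substitutions are no longer needed, and line breaks in the feed stop mattering.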