Forum Moderators: coopster & jatar k & phranque

Perl script processing XML problem

     
3:33 pm on Apr 2, 2004 (gmt 0)

10+ Year Member



Hi,

I have a Perl script which processes an XML feed. It creates a cache file of the XML and presents the output on my website, formatted in the way outlined below. Although I'm no expert, the script appears to work on the basis of there being line breaks (or spaces) within the cache file. Recently the line breaks in the XML feed were removed, so the script no longer processes the feed. I am only a novice when it comes to XML and Perl, and I found this script on the web some time ago.

I would be really grateful if someone could advise what I need to change below in order for the script to process the file, now that there are no line breaks or spaces in the cache file.

use strict;
use CGI qw(:all);
#use Fcntl qw(:flock);
use LWP::Simple qw(get);

my $xml_url = "http://www.website.com/cgi-bin/xmlfeed.exe?&s=books&chan=xf";

my $newscache = "cache.xml";
########################################
# enter number of articles to include
my $howmanyarticles = "80";

########################################
# write the cache file or
# if the cache file is older than 1 hour
# then re-write it.

#get_lock();
if ((not -e $newscache) or (-M $newscache > .5)) { # (not -e): the cache file doesn't exist; -M gives the file's age in days since it was last modified (1 is 1 day, 0.04 is about 1 hour), so .5 refreshes after 12 hours; use 0.04 for a 1 hour cache.
my $newsdoc = get($xml_url);# uses the LWP module "get" function to get the XML file.
if (defined $newsdoc) {
open (CACHEFILE, ">$newscache") or die "Writing to Cache : $!";
print CACHEFILE $newsdoc;
close (CACHEFILE);
}
}
#release_lock();

########################################
# now print the contents of the XML file

print header;
print "<table width=\"100%\" align= center border=\"0\" cellpadding=\"4\" cellspacing=\"0\">";

open (CF, "$newscache") or die "Unable to open $newscache : $!";
my ($productname, $productvenue);
my $counter =0;
while (<CF>) {
if (m,<product_desc>(.*)</product_desc>,) {
$productname = $1;
$productname =~ s/&apos;/'/g;
}
if (m,<venue_desc>(.*)</venue_desc>,) {
$productvenue = $1;
$productvenue =~ s/&apos;/'/g;

}

if (m,<crypto_block>(.*)</crypto_block>,) {
print "
<tr>

<td valign=\"top\"><a href=\"\/cgi-bin/go.cgi?$1\"><b>$productname</b></a></td><td>$productvenue</td>
";
$counter++;
last if $counter == $howmanyarticles;
}
}
close(CF);

print "
<tr>
<td width=\"100\%\" colspan=\"4\" height=\"10\"></td>
</tr>
</table>
";
# END

Many thanks for any help / assistance.

5:03 pm on Apr 2, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The construction I see several times in the file:

m,<somexmltag>(.*)</somexmltag>,

is faulty. It will grab everything between the first opening tag and the last closing tag on the line, because .* is just plain "greedy."

Replace it with:
m,<somexmltag>(.*?)</somexmltag>,

Previously, the line breaks most likely kept the "greedy" pattern from matching too much: by default . does not match a newline, so each match had to stay within one line.
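
To make the difference concrete, here is a minimal sketch using a made-up one-line sample (the record contents are assumptions, not taken from the real feed). It compares the greedy and non-greedy patterns, and also shows that adding /g in list context returns every match on the line rather than just the first.

```perl
use strict;
use warnings;

# Hypothetical sample: two records on one line, as in a feed with no line breaks.
my $line = '<product_desc>Book A</product_desc><product_desc>Book B</product_desc>';

# Greedy: (.*) runs on to the LAST closing tag on the line.
my ($greedy) = $line =~ m,<product_desc>(.*)</product_desc>,;

# Non-greedy: (.*?) stops at the FIRST closing tag.
my ($lazy) = $line =~ m,<product_desc>(.*?)</product_desc>,;

# With /g in list context, the non-greedy pattern returns every match.
my @all = $line =~ m,<product_desc>(.*?)</product_desc>,g;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "all:    ", join(' / ', @all), "\n";
```

Note that the non-greedy pattern on its own still finds only the first record on a line; something has to walk through the remaining records.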

5:24 pm on Apr 2, 2004 (gmt 0)

10+ Year Member



timster,
Thanks for the suggestion. I made the change, but the script now outputs and formats only one entry. It processes the first set of tags and doesn't go on to process the others.

4:14 am on Apr 3, 2004 (gmt 0)

10+ Year Member



The <> operator works by bringing in one line at a time, delimited by a \n (that's the Perl code for line break). Since they've taken out the line breaks, the whole file now arrives as a single line and the loop body runs only once, so you won't be able to process the file that way.

You'll need to find the xml tag that delimits each record, then split the line based on that. So instead of:


while (<CF>)

use something like:


$a = <CF>;
@xml = split (/<\/record_end_tag>/, $a);
for (@xml) {
s/<record_start_tag>//;

and then continue on with your normal processing. That's really just a patch though. To deal with XML, use an XML parsing module. Check search.cpan.org and search for an XML module that does what you want (XML::Parser might be a good place to start).
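
A related option, sketched here rather than taken from the advice above, is to change what <> treats as the end of a record. Perl's $/ variable holds the input record separator, so setting it to the closing record tag makes each read return one record. The tag name </event> and the sample data are assumptions; the in-memory filehandle only stands in for cache.xml.

```perl
use strict;
use warnings;

# Hypothetical feed: two records on a single line, no line breaks.
my $feed = '<event><product_desc>Book A</product_desc><venue_desc>Hall 1</venue_desc>'
         . '<crypto_block>abc</crypto_block></event>'
         . '<event><product_desc>Book B</product_desc><venue_desc>Hall 2</venue_desc>'
         . '<crypto_block>def</crypto_block></event>';

# Read the string through an in-memory filehandle, standing in for cache.xml.
open(my $cf, '<', \$feed) or die "Unable to open in-memory feed: $!";

my @rows;
{
    # Make <> return one record per read: everything up to each closing tag.
    local $/ = '</event>';
    while (my $record = <$cf>) {
        my ($name)   = $record =~ m,<product_desc>(.*?)</product_desc>,;
        my ($venue)  = $record =~ m,<venue_desc>(.*?)</venue_desc>,;
        my ($crypto) = $record =~ m,<crypto_block>(.*?)</crypto_block>,;
        next unless defined $crypto;    # skip any chunk that holds no record
        push @rows, "$name | $venue | $crypto";
    }
}
close($cf);

print "$_\n" for @rows;
```

This keeps the original while-loop shape and avoids holding the whole feed in one variable, but it is still pattern matching on markup, so the advice to move to a real XML parser stands.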

3:29 pm on Apr 3, 2004 (gmt 0)

10+ Year Member



VectorJ, thanks for the advice. However I'm a real novice when it comes to perl and I have tried introducing the changes you suggest, but all I get now is an internal server error with the following in the error logs:

Global symbol "@xml" requires explicit package name

The modified script is below. Any further advice most welcome.

use strict;
use CGI qw(:all);
#use Fcntl qw(:flock);
use LWP::Simple qw(get);

my $xml_url = "http://www.website.com/cgi-bin/xmlfeed.exe?&s=books&chan=xf";

my $newscache = "cache.xml";
########################################
# enter number of articles to include
my $howmanyarticles = "80";

########################################
# write the cache file or
# if the cache file is older than 1 hour
# then re-write it.

#get_lock();
if ((not -e $newscache) or (-M $newscache > .5)) { # (not -e): the cache file doesn't exist; -M gives the file's age in days since it was last modified (1 is 1 day, 0.04 is about 1 hour), so .5 refreshes after 12 hours; use 0.04 for a 1 hour cache.
my $newsdoc = get($xml_url);# uses the LWP module "get" function to get the XML file.
if (defined $newsdoc) {
open (CACHEFILE, ">$newscache") or die "Writing to Cache : $!";
print CACHEFILE $newsdoc;
close (CACHEFILE);
}
}
#release_lock();

########################################
# now print the contents of the XML file

print header;
print "<table width=\"100%\" align= center border=\"0\" cellpadding=\"4\" cellspacing=\"0\">";

open (CF, "$newscache") or die "Unable to open $newscache : $!";
my ($productname, $productvenue);
my $counter =0;
$a = <CF>;
@xml = split (/<\event>/, $a);
for $xml_line(@xml) {
s/<event>//;
{
if (m,<product_desc>(.*)</product_desc>,) {
$productname = $1;
$productname =~ s/&apos;/'/g;
}
if (m,<venue_desc>(.*)</venue_desc>,) {
$productvenue = $1;
$productvenue =~ s/&apos;/'/g;

}

if (m,<crypto_block>(.*)</crypto_block>,) {
print "
<tr>

<td valign=\"top\"><a href=\"\/cgi-bin/go.cgi?$1\"><b>$productname</b></a></td><td>$productvenue</td>
";
$counter++;
last if $counter == $howmanyarticles;
}
}
close(CF);

print "
<tr>
<td width=\"100\%\" colspan=\"4\" height=\"10\"></td>
</tr>
</table>
";
# END

12:40 pm on Apr 5, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



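When you "use strict" you need to declare your variables with "my", like so:

my @xml = split (/<\/event>/, $a);

The same goes for the loop variable: for my $xml_line (@xml). Note the \/ as well: the closing tag is </event>, and the slash has to be escaped inside /.../.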

This is no big deal, but you might also want to replace your "END" comment with an actual end line, like so:

__END__
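
The declaration rule can be checked with a small sketch. The one-line sample and the <event> tag name are assumptions; the string eval is there only to capture the compile error so it can be printed, and is not something the real script needs.

```perl
use strict;
use warnings;

# Hypothetical one-line feed; the <event> tag name is an assumption.
my $data = '<event>one</event><event>two</event>';

# Undeclared @xml: under strict this fails to compile.
# The string eval only serves to capture the compile error for display.
my $compiled = eval q{ @xml = split /<\/event>/, $data; 1 };
my $error = $compiled ? '' : $@;
print "without my: $error";

# Declared with "my", the same kind of split compiles and runs.
my @records = split /<\/event>/, $data;
s/<event>// for @records;    # strip the opening tag from each record
print "with my: @records\n";
```

The escaped slash matters too: inside a Perl regex \e means the escape character, so a pattern written as /<\event>/ would not match a literal </event> tag.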

7:08 pm on May 17, 2004 (gmt 0)

10+ Year Member



You might also find some help in researching XML::TreeBuilder to parse the XML for you. A lot of the complications of parsing XML and tag-based data have already been thought out for you. Don't duplicate the work that the module authors have done and given to the community.

XML::TreeBuilder (or XML::Simple) for smaller XML files and SAX processing for bigger jobs... at least that seems to have worked for me.

 
