XML coding using Perl script

XML coding

         

krishnanr

1:30 pm on Nov 23, 2007 (gmt 0)

10+ Year Member



Hi,

I am new to Perl.

I have been given the task of automating XML coding with a Perl script. The task is explained below with an example.

Example:

We use short delimiters in the doc file, like this:

<NL>First Item
Second Item
....
Last Item<NL>

It should be converted into XML like this:

<orderedlist><listitem>First Item</listitem>
<listitem>Second Item</listitem>
...
<listitem>Last Item</listitem></orderedlist>

Can anybody help me with how to achieve this?

Thanks,
Krishnan

balam

3:53 pm on Nov 23, 2007 (gmt 0)

10+ Year Member



Welcome to WebmasterWorld, krishnanr!

Myself, I would tackle this problem using regular expressions, as that would seem to be the easiest solution.

Just what is in this "doc file" you mention? Is it only a list of items with your <NL> delimiters, or is there (a bunch more) data before or after the list? That's more of a rhetorical question, since this solution can handle both situations...

[perl]
#!/usr/bin/perl -w

use strict;
use warnings;

# Declare vars
#
my $filename = 'data.txt'; # Filename of input 'doc file'
my $all_data; # Complete contents of 'doc file'
my $data_before; # Data in 'doc file' before list
my $old_list; # The list before modification
my $data_after; # Data in 'doc file' after list
my $new_list; # The list after modification to XML
my $new_file; # The new XML file

# Load whole 'doc file' into scalar var
#
{
    local $/; # Slurp mode: read the whole file in one go
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    $all_data = <$fh>;
    close $fh;
}

# Separate file into three parts - data before list, list, data after list
#
$all_data =~ /^(.*?<NL>)(.*?)(<NL>.*)$/s
    or die "No <NL>...<NL> list found in $filename\n";
$data_before = $1;
$old_list = $2;
$data_after = $3;

# Convert $data_before & $data_after to XML
#
$data_before =~ s/<NL>$/<orderedlist>/;
$data_after =~ s/^<NL>/<\/orderedlist>/;

# Convert $old_list to XML
#
$old_list =~ s/\n/<\/listitem>\n<listitem>/g;
$new_list = "<listitem>$old_list</listitem>";

# Put it all together
#
$new_file = $data_before . $new_list . $data_after;

print $new_file;
[/perl]
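The script above converts only the first <NL>...<NL> block it finds. If the file holds several such lists, one possible variant is a single global substitution. This is a minimal sketch, not a tested production script: make_list and convert_lists are hypothetical helper names, and it assumes every list is wrapped in a pair of <NL> delimiters exactly as in the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build one <orderedlist> from the text between
# a pair of <NL> delimiters, one <listitem> per non-blank line.
sub make_list {
    my ($body) = @_;
    my @items = grep { /\S/ } split /\n/, $body;
    return '<orderedlist>'
         . join("\n", map("<listitem>$_</listitem>", @items))
         . '</orderedlist>';
}

# Hypothetical helper: replace every <NL>...<NL> pair in the text.
# /s lets . match newlines, /g repeats the substitution for each pair,
# and /e evaluates the replacement as Perl code.
sub convert_lists {
    my ($text) = @_;
    $text =~ s{<NL>(.*?)<NL>}{ make_list($1) }gse;
    return $text;
}

my $input = "Intro\n<NL>First Item\nSecond Item\nLast Item<NL>\nOutro\n";
print convert_lists($input);
```

Note that the item text is copied as-is. If items can contain characters such as & or <, they would need to be escaped before the result is valid XML.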

rocknbil

10:34 pm on Nov 23, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Although it's been said that it's "slow," I've never seen a major performance difference using the module XML::Simple. It will create XML for output and parse incoming XML.
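As a rough illustration of the output side, here is a minimal sketch that builds the ordered list from the first example with XMLout. It assumes XML::Simple is installed from CPAN (it is not part of core Perl), and the item strings are simply the ones from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;   # CPAN module; exports XMLin and XMLout

my @items = ('First Item', 'Second Item', 'Last Item');

# An array reference stored under the key 'listitem' is written as
# repeated <listitem> elements; RootName sets the enclosing element.
my $xml = XMLout(
    { listitem => \@items },
    RootName => 'orderedlist',
);

print $xml;
```

XMLout adds its own indentation and line breaks, so the layout will not match the hand-written example character for character.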

krishnanr

7:12 am on Nov 24, 2007 (gmt 0)

10+ Year Member



Hi,

Thanks for your reply. It works great.

I need help with one more thing.

In my text file there are lots and lots of delimiters like this. For example:

<BM>
<H1>First First level title</H1>
<H2>First second level title</H2>
<BL>Bulleted list item first
Bulleted list item middle
Bulleted list item last</BL>

<H1>Second First level title</H1>
<NL>Numbered list item first
Numbered list item middle
Numbered list item last</NL>
</BM>

This should be converted into a structured XML document like this:

<section role='bm'> ----------- "Start of BM tag"
<section id='head1.1'> ------- "Start of First H1 tag"
<title>First First level title</title>
<section id='head1.1.1'>
<title>First second level title</title>
</section>
<itemizedlist>
<listitem>Bulleted list item first</listitem>
<listitem>Bulleted list item middle</listitem>
<listitem>Bulleted list item last</listitem>
</itemizedlist>
</section> ------- "End of First H1 tag"
<section id='head1.2'>
<title>Second First level title</title>
<orderedlist>
<listitem>Numbered list item first</listitem>
<listitem>Numbered list item middle</listitem>
<listitem>Numbered list item last</listitem>
</orderedlist>
</section>
</section> ---------------- "End of BM tag"

Could you please give me an idea of how to tokenize the elements to get the above structure?

Thanks,
Krishnan

phranque

11:24 am on Nov 24, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



welcome to WebmasterWorld [webmasterworld.com], krishnanr!

the HTML::Content::HTMLTokenizer Perl module [search.cpan.org] might be helpful...
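for markup as regular as the sample above, a tokenizer can also be sketched with a plain regular expression and no extra module. this is only a sketch under several assumptions read off the sample: convert_markup is a hypothetical helper name, the first number in each id is taken to be a running count of <BM> blocks, an <H2> section is closed right after its title (as in the requested output), and every <H1>, <H2>, <BL> and <NL> has a matching closing tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map list delimiters to the XML list elements used in the sample output.
my %list_tag = (BL => 'itemizedlist', NL => 'orderedlist');

# Hypothetical helper: turn the shorthand markup into nested <section>s.
sub convert_markup {
    my ($text) = @_;
    my @out;
    my ($bm, $h1, $h2) = (0, 0, 0);   # counters used to build the ids
    my $h1_open = 0;                  # is an H1 <section> still open?

    # Tokenize: each pass matches either a <BM> or </BM> tag, or one
    # complete <H1>, <H2>, <BL> or <NL> element with its content.
    while ($text =~ m{<(/?)BM>|<(H1|H2|BL|NL)>(.*?)</\2>}gs) {
        my ($closing, $tag, $body) = ($1, $2, $3);

        if (!defined $tag) {                        # <BM> or </BM>
            if ($closing) {
                push @out, '</section>' if $h1_open;
                $h1_open = 0;
                push @out, '</section>';
            }
            else {
                $bm++;
                $h1 = 0;
                push @out, "<section role='bm'>";
            }
        }
        elsif ($tag eq 'H1') {
            push @out, '</section>' if $h1_open;    # close the previous H1
            $h1++;
            $h2 = 0;
            $h1_open = 1;
            push @out, sprintf("<section id='head%d.%d'>", $bm, $h1),
                       "<title>$body</title>";
        }
        elsif ($tag eq 'H2') {
            $h2++;
            push @out, sprintf("<section id='head%d.%d.%d'>", $bm, $h1, $h2),
                       "<title>$body</title>", '</section>';
        }
        else {                                      # BL or NL list
            my @items = grep { /\S/ } split /\n/, $body;
            push @out, "<$list_tag{$tag}>",
                       map("<listitem>$_</listitem>", @items),
                       "</$list_tag{$tag}>";
        }
    }
    return join("\n", @out) . "\n";
}

my $sample = "<BM>\n<H1>First title</H1>\n<BL>one\ntwo</BL>\n</BM>\n";
print convert_markup($sample);
```

text outside the recognised tags is silently skipped and nothing is XML-escaped, so real input would need more checking than this.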