The issue is splitting the array containing the input so that each HTML tag is on a separate line. Using \n fixed the printing format, but I still need to split the lines themselves before they are put into the second array.
my @data;
#... file open etc. S is the file handle
while(<S>){
$line = $_;
# regex to convert closing part of HTML tag
# add to newline character
$line =~ s/>/>\n/g;
# problem is here, I need a way to break at each \n
# so that each $line is only ONE line.
push (@data, $line);
}
I got most of the URL finding and format conversion working. It's just this one little split/regex problem with newlines and arrays that has me stumped.
Basically, the regex produces something like this at the moment (I'll leave out the <> part for clarity):
@data={ " tag1\n tag2\n tag3\n", "tag4\n tag5\n" }
but I need:
@data={ " tag1\n"," tag2\n"," tag3\n", "tag4\n"," tag5\n" }
That is, each HTML tag should have its own entry in the array.
Opinions vary, but my favorite is HTML::TokeParser. If you're into writing callbacks, then HTML::TreeBuilder is the way to go.
If you post some HTML and examples of what you want to get out of it, I could probably be convinced to supply you with some working code.
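For reference, here is a minimal sketch of what the HTML::TokeParser route might look like for pulling link URLs out of a page. It assumes the module is installed from CPAN; the sample HTML string and the names $html, $parser and @links are made up for illustration and are not taken from the pages mentioned in this thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up sample input with no newlines between tags
my $html = '<html><body><a href="http://example.com/a">A</a>'
         . '<img src="x.gif"><a href="/b">B</a></body></html>';

# The constructor accepts a reference to a string holding the document
my $parser = HTML::TokeParser->new(\$html)
    or die "Could not create parser: $!";

my @links;
# get_tag('a') skips ahead to the next <a> start tag;
# element 1 of the returned array ref is a hash of its attributes
while (my $tag = $parser->get_tag('a')) {
    my $href = $tag->[1]{href};
    push @links, $href if defined $href;
}

print "$_\n" for @links;
```

Because the parser works on tokens rather than lines, it should not matter whether the page contains any newline characters at all.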
Besides, I have almost got it working now; I just need to solve this little problem. Plus, I would like to know how to do this, as I will need something similar for some other non-HTML parsing code.
I'll take a look at HTML::TokeParser and HTML::TreeBuilder anyway. Thanks for at least giving me the module names.
As for the pages I'm trying to dismantle:
[images.google.com...]
(has practically no newline chars in the HTML :O )
[altavista.com...]
(even more messy)
Going from this:
@data={ " tag1\n tag2\n tag3\n", "tag4\n tag5\n" }
to this:
@data={ " tag1\n"," tag2\n"," tag3\n", "tag4\n"," tag5\n" }
may not be easy; it all depends on the real data. If there really are newlines as in the first example above, you could split the array elements on each newline and any whitespace that follows it:
my @data = ("<tag1>\n <tag2>\n <tag3>\n", "<tag4>\n <tag5>\n");
my @new = ();
foreach my $lines (@data) {
my @temp = map {"$_\n"} split(/\n\s*/,$lines);
push @new, @temp;
}
print "$_" for @new;
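The same split can also be done inside the read loop, so that only one-tag entries are ever pushed onto @data. This is a rough sketch under a few assumptions: split_tags is a hypothetical helper name, the in-memory string $html stands in for the real file handle S, and the sample HTML is invented. Like the regex in the original loop, it breaks after every ">", so a tag that spans two input lines would still end up in two entries.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one input line into one entry per tag
sub split_tags {
    my ($line) = @_;
    chomp $line;            # drop the line's own trailing newline
    $line =~ s/>/>\n/g;     # put a newline after every closing '>'
    # Split on each newline plus any whitespace after it, skip empty
    # pieces, and put a single newline back on the end of each piece
    return map { "$_\n" } grep { length } split /\n\s*/, $line;
}

my @data;

# Stand-in for the real file: read from an in-memory string
my $html = "<html><body><a href=\"x\">link</a>\n<p>text</p></body></html>\n";
open my $fh, '<', \$html or die "Cannot open in-memory file: $!";

while (my $line = <$fh>) {
    push @data, split_tags($line);   # push a list, one entry per piece
}
close $fh;

print for @data;
```

The key point is that push takes a list, so the result of split can be pushed directly and each piece becomes its own array element. The same pattern should carry over to non-HTML input by changing the substitution and the split pattern.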