arrays, new lines and html tags

reformat html and process links and image urls found

         

fabricator

5:24 pm on Apr 28, 2006 (gmt 0)

10+ Year Member




Basically I need to convert an HTML document so that I can store the data in another file format. I am mainly interested in the link URLs and image URLs.

The issue is splitting the array containing the input so that each HTML tag is on a separate line. Using \n fixed the printing format, but I still need to split the lines themselves before they are put into the second array.

my @data;

# ... file open etc. S is the file handle

while (<S>) {
    my $line = $_;
    # regex to append a newline character
    # after the closing '>' of each HTML tag
    $line =~ s/>/>\n/g;
    # problem is here, I need a way to break at each \n
    # so that each $line is only ONE line.
    push(@data, $line);
}

I have most of the URL finding and format conversion working. It's just this one little split/regex problem with newlines and arrays that has me stumped.

perl_diver

6:38 pm on Apr 28, 2006 (gmt 0)

10+ Year Member



you will be much better off using an HTML parsing module when dealing with messy HTML code, but this might help you anyway:

$line =~ s/\r?\n//g;
$line =~ s/>/>\n/g;

first it removes any existing line endings from the string, either \r\n (Windows) or just \n, then it adds a \n after every ">", which is hopefully where you want the line breaks to be.
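As a rough illustration of those two substitutions, here is a minimal sketch run on a made-up sample string (the contents of $line are an assumption, not taken from the real page). Note that the result is still one string with several newlines in it, so it would still need splitting before each tag can get its own array entry.

```perl
use strict;
use warnings;

# Hypothetical sample input: two tags on one line with a Windows line ending.
my $line = "<a href=\"x.html\"><img src=\"y.png\">\r\n";

$line =~ s/\r?\n//g;    # drop existing line endings (\r\n or just \n)
$line =~ s/>/>\n/g;     # add a newline after every '>'

print $line;
```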

fabricator

8:07 am on Apr 29, 2006 (gmt 0)

10+ Year Member



Perhaps it would help if I show an example text string.

Basically the regex produces something like this at the moment (I'll leave out the <> part for clarity):

@data = ( " tag1\n tag2\n tag3\n", "tag4\n tag5\n" )

but I need:
@data = ( " tag1\n", " tag2\n", " tag3\n", "tag4\n", " tag5\n" )

that is each HTML tag should have its own entry in the array.

bennymack

5:29 pm on Apr 29, 2006 (gmt 0)

10+ Year Member



Attempting to parse HTML with a regex is an exercise in futility. There are modules that make this MUCH easier.

Opinions vary, but my favorite is HTML::TokeParser. If you're into writing callbacks, then HTML::TreeBuilder is the way to go.

If you post some HTML and examples of what you want to get out of it I could probably be convinced to supply you with some working code..
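For anyone reading later, a minimal sketch of the HTML::TokeParser approach for pulling out link and image URLs might look like the following. It assumes the module is installed from CPAN (it ships with the HTML-Parser distribution), and the sample $html string and the @urls array are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # part of the HTML-Parser distribution on CPAN

# Hypothetical sample document; a real script would read the page from a file.
my $html = '<p><a href="page.html">link</a><img src="pic.png" alt="x"></p>';

my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my @urls;
# get_tag() skips ahead to the next <a> or <img> start tag.
while (my $tag = $parser->get_tag('a', 'img')) {
    my ($name, $attr) = @$tag;    # tag name and hash ref of its attributes
    my $url = $name eq 'a' ? $attr->{href} : $attr->{src};
    push @urls, $url if defined $url;
}

print "$_\n" for @urls;
```

Because the parser works on tags rather than lines, it should not matter whether the page contains any newlines at all.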

fabricator

1:44 pm on May 1, 2006 (gmt 0)

10+ Year Member



I'm trying to parse the HTML with regexes because it is a pretty basic page. I would also like to avoid being tied to a particular page design.

Besides, I have almost got it working now; I just need to solve this little problem. I would also like to know how to do this because I will need something similar for some other, non-HTML parsing code.

I'll take a look at HTML::TokeParser and HTML::TreeBuilder anyway. Thanks for giving me the module names.

As for the pages I'm trying to dismantle:
[images.google.com...]
(has practically no newline chars in the HTML :O )

[altavista.com...]
(even more messy)

perl_diver

5:45 pm on May 1, 2006 (gmt 0)

10+ Year Member



how to get from this:

@data = ( " tag1\n tag2\n tag3\n", "tag4\n tag5\n" )

to this:

@data = ( " tag1\n", " tag2\n", " tag3\n", "tag4\n", " tag5\n" )

may not be easy; it all depends on the real data. If there really are newlines as in the first example above, then you could split the array elements on each newline and any whitespace that follows it:


my @data = ("<tag1>\n <tag2>\n <tag3>\n", "<tag4>\n <tag5>\n");
my @new = ();
foreach my $lines (@data) {
my @temp = map {"$_\n"} split(/\n\s*/,$lines);
push @new,@temp,
}
print "$_" for @new;
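A possible variation on the same idea, sketched under the assumption that every tag ends in ">": split each line straight after the ">" while reading, so the intermediate array is not needed. The @input array here is a made-up stand-in for the lines that would come from the file handle.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lines read from the file handle.
my @input = ("<tag1><tag2> <tag3>\r\n", "<tag4> <tag5>\n");

my @data;
for my $line (@input) {
    $line =~ s/\r?\n//g;    # drop existing line endings
    # Split right after each '>' and swallow any whitespace that follows it,
    # then put a single newline back on the end of every piece.
    push @data, map { "$_\n" } grep { length } split /(?<=>)\s*/, $line;
}

print for @data;
```

Any text sitting between two tags stays attached to the front of the tag that follows it, which may or may not be what you want for real pages.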

fabricator

10:59 am on May 4, 2006 (gmt 0)

10+ Year Member



Thank you for that, perl_diver. All solved now.

Oh, and this line had a comma at the end instead of a semicolon (for those reading this post later):

push @new,@temp;

perl_diver

4:20 pm on May 4, 2006 (gmt 0)

10+ Year Member



oops.... sowwy about that :)