
Perl Server Side CGI Scripting Forum

    
Regular expression matches too much
"?" for non-greedy doesn't solve the problem
MichaelBluejay




msg:4155959
 11:55 am on Jun 21, 2010 (gmt 0)

I'm trying to find <td>'s followed by <center> so I can turn them into <td align=center>

Of course the <td> could have attributes, like <td valign=top> or <td style="color:green">, so I have to preserve that. Finally, my old wysiwyg editor sometimes inserted a lot of whitespace, so I have to account for that when I'm trying to match.

Let's take this sample string: $data = q[<TD VALIGN=top WIDTH=120></TD><TD> <CENTER>];

This idea doesn't work:

while ($data=~ /<td.*?>\s*?<center>/is) {
$data =~ s|(<td.*?>)\s*<center>|$1 align=center>|is;
}


This matches the whole string, not just the second <td>. It's looking for something that starts with "<td", any characters, then a ">", then any whitespace, and then ends with "center". So it grabs the first "<td" and matches through to the end. The ">" that's matched is the one for the second <td>, not the first one.
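
A minimal sketch of the behavior described above, using the sample string from this post. The non-greedy `?` only limits how far `.*?` expands; it does not move the starting point, which stays at the leftmost `<td` from which any match is possible. The variable name `$matched` is only for illustration.

```perl
use strict;
use warnings;

my $data = q[<TD VALIGN=top WIDTH=120></TD><TD> <CENTER>];

# The engine starts at the first "<TD" and keeps expanding .*? until the
# rest of the pattern (">", optional whitespace, "<center>") can match.
my ($matched) = $data =~ /(<td.*?>\s*?<center>)/is;

print "matched: $matched\n";
print "whole string matched: ", ($matched eq $data ? "yes" : "no"), "\n";
```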

It occurred to me that the bad result contains a </td>, so I could just add a negative test for that:

while ($data=~ /<td.*?>\s*?<center>/is && $& !~ m|</td>|i) {


However, once I'm ready to do the substitution, I'm stuck again, because I have to use the original regexp which matches the whole string.

So I found a new operator which lets me do negative lookaheads, and so I tried this to negate the </td>:

while ($data=~ m|<td.*?>(?!.*</td>)\s*?<center>|is) {


But it still matched the whole string.

This is way above my skillset. Is there an easy solution for this?

 

phranque




msg:4155965
 12:16 pm on Jun 21, 2010 (gmt 0)

i didn't study this too carefully so i may have missed part of your problem but did you try using a character class?
such as ...<td([^>]*)>...
which means "...less than, followed by td, followed by zero or more non-greater-than, followed by greater than..." and any possible td tag attributes and values are captured in a group and available for substitution as $1.

for completeness you might add a check to see if the align attribute was already specified and for validation you might add quotes around the attribute value in the substitution.
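
A rough sketch of the suggestion above, not a definitive fix. It assumes the sample string from the first post, plus a hypothetical third cell that already has an align attribute. The negative lookahead used for the "align already specified" check is one possible way to do it, and `$fixed` is an illustrative name.

```perl
use strict;
use warnings;

# Sample string from the thread, extended with a hypothetical cell that
# already carries an align attribute.
my $data = q[<TD VALIGN=top WIDTH=120></TD><TD> <CENTER><td align=left><center>];

# [^>]* cannot cross a ">", so a match cannot begin in one tag and end in
# another. (?![^>]*\balign\s*=) rejects tags that already have align=;
# the \b keeps VALIGN from counting as align.
(my $fixed = $data) =~ s/<td(?![^>]*\balign\s*=)([^>]*)>\s*<center>/<td$1 align="center">/ig;

print "$fixed\n";
```

Only the second cell is rewritten; the first has no `<center>` right after it, and the third is skipped because it already has an align attribute.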

MichaelBluejay




msg:4155973
 12:38 pm on Jun 21, 2010 (gmt 0)

Thank you very much for the reply. It's funny, while I was continuing to work on this on my own, I hit on that same idea of using <td([^>]*?)> myself. But the problem remains: It's still gonna grab the ">" from the second <td>, because that's the only legal match, since the first set of <td>'s doesn't contain a <center>.

Good idea about checking for "align=" already being set, but first things first! So...any other ideas?

phranque




msg:4155975
 12:44 pm on Jun 21, 2010 (gmt 0)

i'm not sure i understand the problem:
[^>]* stops before the first >

[^>] means anything that is not >

phranque




msg:4155977
 12:50 pm on Jun 21, 2010 (gmt 0)

i also just noticed you are using multiple quantifiers:
[^>]*?

that actually would match "zero or more non-'>' followed by a '?'"

MichaelBluejay




msg:4155985
 1:10 pm on Jun 21, 2010 (gmt 0)

Thanks again for your help.

I'm not sure I understand your questions.

I didn't think [^>]* stops before the first >. I thought it meant "any number of characters that aren't >, including no characters".

I didn't think [^>]*? meant "zero or more non-'>' followed by a '?'". I thought ? was the anti-greedy operator, not a literal character.

Maybe I'd understand better if I could see an example. Can you suggest a regexp that matches the first <td></td> set in my original string?
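
As a sketch for the request above, and assuming the sample string from the first post: a character class that excludes `>` stops at the first `>` it meets, so the pattern below stays inside the first cell. The names `$first_set` and `$attrs` are illustrative.

```perl
use strict;
use warnings;

my $data = q[<TD VALIGN=top WIDTH=120></TD><TD> <CENTER>];

# [^>]* can only consume characters that are not ">", so it halts at the
# first ">"; the ">" after it then closes that same opening tag.
my ($first_set, $attrs) = $data =~ m{(<td([^>]*)>.*?</td>)}is;

print "first set: $first_set\n";
print "attributes:$attrs\n";
```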

jdMorgan




msg:4155987
 1:13 pm on Jun 21, 2010 (gmt 0)

The new pattern finds the end of the first tag, in this case, the "<td>".

The next step is to search the string starting at that point for "<center"

So, "the problem" seems to be a disconnect between what the new pattern does, and an expectation of a "one-line" solution, which is not what was proposed.

Jim
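
One way the two-step idea above could be sketched, assuming the sample string from the first post. It relies on Perl remembering the end of the last `/g` match: the `\G` anchor continues from that point, and the `c` flag keeps the position when the second match fails. The array `@hits` is a hypothetical name for the collected offsets.

```perl
use strict;
use warnings;

my $data = q[<TD VALIGN=top WIDTH=120></TD><TD> <CENTER>];
my @hits;

# Step 1: find the end of each opening <td ...> tag.
while ($data =~ /<td[^>]*>/ig) {
    my $tag_start = $-[0];    # offset where this <td ...> begins

    # Step 2: look for <center> exactly where the tag ended. With /c a
    # failed attempt leaves the search position alone, so the outer
    # loop carries on instead of restarting from the beginning.
    if ($data =~ /\G\s*<center>/igc) {
        push @hits, $tag_start;
    }
}

print "cells followed by <center> start at offset(s): @hits\n";
```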

MichaelBluejay




msg:4155993
 1:18 pm on Jun 21, 2010 (gmt 0)

Okay, how do I search the string starting at the end of the first <td>?

rocknbil




msg:4156120
 4:51 pm on Jun 21, 2010 (gmt 0)

I'm trying to find <td>'s followed by <center> so I can turn them into <td align=center>


Here's what I would try.

while ($data =~ /<td[^>]*>\s*<center>/isg) {
$data =~ s/<td[^>]*>\s*<center>/<td class="center-align">/isg;
$data =~ s/<\/center>//isg;
}

I don't even know that you need the while.

$data = 'Some big block of multiline text';
$data =~ s/<td[^>]*>\s*<center>/<td class="center-align">/img;
$data =~ s/<\/center>//img;

MichaelBluejay




msg:4156229
 7:33 pm on Jun 21, 2010 (gmt 0)

Thanks, rocknbil. But I'm not sure that will work because of another problem I realized: I can't kill </center> tags as a separate regexp, because it could kill the *wrong* </center> tag. For example, I've got nested tables and some of the tables themselves are centered with <center></center>.

I'm starting to think there is no easy answer, and that I'll have to construct something that takes several lines with multiple if/then's, if I can even wrap my head around *that* one.

chorny




msg:4156260
 8:31 pm on Jun 21, 2010 (gmt 0)

HTML::TreeBuilder::XPath is a good tool to find all occurrences. For manipulating you will need to study HTML::Element.

jdMorgan




msg:4156289
 9:22 pm on Jun 21, 2010 (gmt 0)

The "while" needs to start after the <td> is found, and be defined as "while not </td>" so that only center tags entirely contained within the current <td> are matched. And consider that an unclosed <center> may be present; you have to decide whether to ignore it, or to try to fix it up like most browsers would (by closing it with the </td>).

Jim

[edited by: phranque at 9:59 pm (utc) on Jun 21, 2010]
[edit reason] disabled graphic smileys ;) [/edit]
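
A hedged sketch of the "while not </td>" idea, using a hypothetical three-cell sample rather than real page markup. It does not handle nested tables: an inner </td> would end the outer cell early. The names `@closed` and `@unclosed` are illustrative.

```perl
use strict;
use warnings;

# Hypothetical sample: a plain cell, a cell with a closed <center>, and a
# cell whose <center> is never closed.
my $data = q[<td>a</td><td>b <center>c</center></td><td><center>d</td>];

my (@closed, @unclosed);

# (?:(?!</td>).)* takes one character at a time, but only "while not
# </td>", so each captured body stays inside a single cell.
while ($data =~ m{<td[^>]*>((?:(?!</td>).)*)}gis) {
    my $body = $1;
    next unless $body =~ /<center>/i;
    if ($body =~ m{</center>}i) { push @closed,   $body }
    else                        { push @unclosed, $body }
}

print "closed: @closed\n";
print "unclosed: @unclosed\n";
```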

MichaelBluejay




msg:4156562
 5:34 am on Jun 22, 2010 (gmt 0)

jdMorgan, that concept makes sense, but I really don't have a clue as to how to implement it.

rocknbil




msg:4156932
 4:38 pm on Jun 22, 2010 (gmt 0)

For example, I've got nested tables and some of the tables themselves are centered with <center></center>.


Ugh, and nested too . . . . well . . . the center tag must die, you could also sub out the table tags for a margin:auto style. :-) These are always "fun" projects.

I'm on a project now like this. I just found it easier to rebuild a template, visit the existing page and copy it from within the browser, avoiding the source code like the plague. Seems more tedious, but a lot less stressful.

MichaelBluejay




msg:4157320
 3:08 am on Jun 23, 2010 (gmt 0)

I had an epiphany. I can easily loop through *only* the <td></td>'s like so:
while ($data =~ m|<td.+?</td>|g)

That way I'm sure to not get more than one <td></td> pair at a time to work with

The question then becomes, how do I get the fixed text back into the string after I've parsed it? The match is set to $&, and I can run a s/// regexp on it without fear of it including more than one <td>. But once I fix it, it's in a variable separate from the original string. How to put it back?
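
One possible answer to the question above, offered as a sketch: the `e` flag on `s///` evaluates the replacement as Perl code, so each matched cell can be handed to a helper and the helper's return value is written back in place. `fix_cell` and `$fixed` are hypothetical names, the sample string is an assumption, and the nested-table caveat raised earlier in the thread still applies.

```perl
use strict;
use warnings;

my $html = q[ <TD VALIGN=top WIDTH=120>cell-1</TD><TD> <CENTER>cell-2</center></td>];

# Hypothetical helper: receives exactly one <td>...</td> chunk and returns
# the repaired version.
sub fix_cell {
    my ($cell) = @_;
    # Only touch cells whose opening tag is directly followed by <center>.
    if ($cell =~ s{^(<td[^>]*)>\s*<center>}{$1 align="center">}i) {
        # Drop only the final </center> that sits right before </td>.
        $cell =~ s{</center>\s*(</td>)\z}{$1}i;
    }
    return $cell;
}

# /e runs fix_cell() for every match and substitutes its result.
(my $fixed = $html) =~ s{(<td.+?</td>)}{ fix_cell($1) }gise;

print "$fixed\n";
```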

janharders




msg:4158237
 8:12 am on Jun 24, 2010 (gmt 0)

I'd seriously advise to give chorny's post another read:
HTML::TreeBuilder::XPath is a good tool to find all occurrences. For manipulating you will need to study HTML::Element.


HTML::Element is great for both navigating and manipulating DOM trees. With the look_down [search.cpan.org] method you can run pretty heavy searches that should solve your problems with a much smaller risk of missing something. It might seem like a bit of overhead for simple tasks, but regexps usually either get way too complicated or start to fail as soon as someone comes along with
<img src="" alt="look here >" />
Before you start throwing loops together and end up writing your own DOM parser, give HTML::Element [search.cpan.org] and its helper HTML::TreeBuilder [search.cpan.org] a chance.
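
To illustrate the failure case above, here is a small sketch that runs a typical character-class pattern over that exact img tag. The variable names are illustrative.

```perl
use strict;
use warnings;

my $html = q[<img src="" alt="look here >" />];

# [^>]* knows nothing about quotes, so it stops at the ">" inside the alt
# value instead of the one that really ends the tag.
my ($tag) = $html =~ /(<img[^>]*>)/;

print "regexp saw: $tag\n";
print "real tag:   $html\n";
```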

phranque




msg:4158265
 9:28 am on Jun 24, 2010 (gmt 0)

i will third what chorny and janharders are suggesting.
html structure, especially when it is not validated and well-formed, is extremely difficult to handle with regular expressions.

MichaelBluejay




msg:4158316
 11:30 am on Jun 24, 2010 (gmt 0)

I did take a look at those. They are way beyond my level of expertise. I'd have to spend many hours trying to make sense of them, and even then I'm not confident that I'd be able to build a solution out of them, so I'm especially wary. Yes, it does seem like overkill to me for such a simple problem I'm trying to solve, especially since I don't know whether I could even get it working anyway.

As per my last post, I realized that I can easily loop through *only* the <td></td>'s like so:
while ($data =~ m|<td.+?</td>|g)


That way I'm sure to not get more than one <td></td> pair at a time to work with. Then I just need a way to get the fixed text back into the main string. It seems like there should be some way to do it, I just don't know how.

janharders




msg:4158424
 2:32 pm on Jun 24, 2010 (gmt 0)

Get into it, it'll save you a lot of time, not just now but in the future.
Here's something to get you started:


use strict;
use HTML::TreeBuilder;
my $content = '<table><tr><TD VALIGN=top WIDTH=120></TD><TD> <CENTER>centered!</center></td> <TD><CENTER>also centered?</center></td></tr></table>';
my $tree = HTML::TreeBuilder->new;
$tree->parse($content);
$tree->eof();

print "Before: " . $tree->as_HTML;

print "\n\n";

for my $centertd ($tree->look_down('_tag' => 'td', sub { $_[0]->descendants() > 0 && ($_[0]->descendants())[0]->tag() eq 'center' }) ) {
   $centertd->attr('align', 'center');
   $centertd->push_content( ($centertd->descendants())[0]->content_list() );
   ($centertd->descendants())[0]->detach();
   
}

print "After: " . $tree->as_HTML;


which outputs


Before: <html><head></head><body><table><tr><td valign="top" width="120"></td><td><center>centered!</center></td><td><center>also centered?</center></td></tr></table></body></html>


After: <html><head></head><body><table><tr><td valign="top" width="120"></td><td align="center">centered!</td><td align="center">also centered?</td></tr></table></body></html>


Note that HTML::TreeBuilder even repairs faulty Trees (which, of course, you can tell it not to do).

MichaelBluejay




msg:4159075
 9:29 am on Jun 25, 2010 (gmt 0)

Thanks, janharders, but this is still way, way, way over my head. Anyway, I figured out a way to do it using regexps. I'm sure there's a simpler way to do it, but at least this works, and I can understand the code. What I did was to isolate each <td></td> set, apply the substitution, save that set into an array, tag the original string so I could tell where to reinsert the fixed sets, and then run another loop to reinsert the fixed sets.
$html = q[ <TD VALIGN=top WIDTH=120>cell-1</TD><TD> <CENTER>cell-2</center></td>];
$tagged = $html;

$counter=0;
while ($html =~ m|<td.+?</td>|isg) {
   $counter++;
   ($TD[$counter] = $&) =~ s|(<td.*?)>\s*<center>(.*?)</td>|$1 align=center>$2</td>|isg;
   $TD[$counter] =~ s|</center>\s*</td>|</td>|isg; # Kill only the final </center> before the </td>, not any earlier </center>'s within the <td></td>
   $tagged =~ s|([^#])<td|$1<flag$counter>#<td|is; # Tag it
}
$total = $counter;

for $counter (1..$total) {
   $tagged =~ s|<flag$counter>#<td.+?</td>|$TD[$counter]|is;
}
$html = $tagged;

janharders




msg:4159512
 8:54 pm on Jun 25, 2010 (gmt 0)

Glad you got it working.
What are you not understanding about the code I posted? As usual with Perl, I think it's pretty straightforward and you'll understand most lines if you just read them as plain English.
As I said, regexps are not really the right tool for the job. Yes, you might get it done, but you might also hit a few cases where it doesn't work, and chances are you won't notice because they only show up in special scenarios.

phranque




msg:4159534
 9:18 pm on Jun 25, 2010 (gmt 0)

.*?
.+?

i'm not sure these are doing what you are intending in those regular expressions.
the first means "zero or more of any character followed by a question mark" and the second means "one or more of any character followed by a question mark".
in other words those question marks are literal characters, not quantifiers.

janharders




msg:4159562
 10:19 pm on Jun 25, 2010 (gmt 0)


.*?
.+?
i'm not sure these are doing what you are intending in those regular expressions.
the first means "zero or more of any character followed by a question mark" and the second means "one or more of any character followed by a question mark".


Mh, unless I'm totally wrong (which, I must add, is quite possible, I've had a few drinks during and after the football-games), you've mixed up something. the question mark in that scenario just changes the greediness, a ? alone will never match a literal ?, but rather either quantify the previous character to be matched zero or one time or, in case it follows a quantifier like * or +, change the greediness.
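
A small sketch of the distinction described above, on a made-up test string. It compares a greedy quantifier, the same quantifier made non-greedy with `?`, an escaped `\?` (the only form that matches a literal question mark), and `?` used on its own as "zero or one".

```perl
use strict;
use warnings;

my $s = 'a?b>c>';

my ($greedy)  = $s =~ /(.*>)/;     # "*" alone: as much as possible
my ($lazy)    = $s =~ /(.*?>)/;    # "?" after "*": as little as possible
my ($literal) = $s =~ /(.*\?)/;    # escaped "\?": a literal question mark
my $optional  = 'color' =~ /colou?r/;   # "?" after a character: zero or one "u"

print "greedy:  $greedy\n";
print "lazy:    $lazy\n";
print "literal: $literal\n";
print "colou?r matches 'color': ", ($optional ? "yes" : "no"), "\n";
```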

phranque




msg:4159639
 1:47 am on Jun 26, 2010 (gmt 0)

my bad!
even though it was mentioned in the thread i completely forgot about non-greedy quantifiers.
embarrassed to admit my perl has been getting rusty.
rtfm...

MichaelBluejay




msg:4159693
 5:12 am on Jun 26, 2010 (gmt 0)

What are you not understanding about the code I posted?


Oh man, where do I start?

(1) The nesting is very deep. (Lots of nested ()'s.) Makes the meaning hard to understand, for me.
(2) I don't know what it means when a subroutine doesn't have a name.
(3) I don't know what it means when a subroutine is embedded directly into a FOR loop.
(4) I don't know what $_[0] means, at least not in this context.
(5) I'm not sure what the -> notation means here. It's not something I've used, at least not for a long time.
(6) I have no earthly idea what the code actually does.

I think it's pretty straightforward and you'll understand most lines if you just read them as plain English.


Are you kidding me?! That doesn't even *approach* plain English. None of my friends has ever used $_[0] or -> in a conversation. The code isn't even similar to the simple style of *Perl* that I'm used to.

Please don't try to explain it, I don't want to waste your time, especially as I'm loath to rack my brain trying to understand something I don't need right now. I was just trying to explain that my Perl skills are pretty meager.

I did come up with a regexp to solve my problem all by myself, though. :)

janharders




msg:4159774
 9:46 am on Jun 26, 2010 (gmt 0)

(2) I don't know what it means when a subroutine doesn't have a name.


It's an anonymous subroutine. Mostly, that's used when you want to stuff that routine into a code reference, like in this case.

(3) I don't know what it means when a subroutine is embedded directly into a FOR loop.


It's not. It's embedded into the search for the td-tags. It's a lot easier to read like this:

$tree->look_down( # look down the DOM-tree of HTML elements
'_tag' => 'td', # look for td-tags
sub { # only find those td-tags that ...
$_[0]->descendants() > 0 && # have child elements
($_[0]->descendants())[0]->tag() eq 'center' # and the first child is a center-tag
}
)


that method will return all matching td-tags, which are then used in the for-loop.

(4) I don't know what $_[0] means, at least not in this context.


It means the same as it does in normal subroutines: it's just the first passed parameter. In this case, look_down passes a possibly matching element into that anonymous sub so the sub can decide whether it is a match or not. I've been lazy and haven't added the return, so it should rather be
sub { # only find those td-tags that ...
return $_[0]->descendants() > 0 && # have child elements
($_[0]->descendants())[0]->tag() eq 'center' # and the first child is a center-tag
}

look_down will discard this possible hit if my sub returns false.

(5) I'm not sure what the -> notation means here. It's not something I've used, at least not for a long time.

It's basically the dereferencing operator, and here it's used for OOP to call a method of an object.
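
For points (2), (4) and (5), a stripped-down sketch in plain Perl with no HTML modules involved. `select_matching` is a hypothetical stand-in that only imitates how a method like look_down might call the sub it is given; it is not part of HTML::Element.

```perl
use strict;
use warnings;

# Hypothetical stand-in for look_down: returns the items for which the
# callback returns a true value.
sub select_matching {
    my ($callback, @items) = @_;
    return grep { $callback->($_) } @items;    # "->" calls the code reference
}

my @tags = ('td', 'center', 'td', 'p');

# An anonymous sub passed as a code reference. Inside it, $_[0] is the
# first argument, here one tag name at a time.
my @tds = select_matching(sub { return $_[0] eq 'td' }, @tags);

# "->" also reaches into a reference, here a hash reference.
my $cell = { tag => 'td', align => 'center' };

print scalar(@tds), " td tags, align=", $cell->{align}, "\n";
```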

And now for the part that matters and actually does stuff to the html-tree:
the code in the for-loop. $centertd is always a td that was found by look_down, a table cell with <center> as its first child.

$centertd->attr('align', 'center'); # turn <td> into <td align="center">
$centertd->push_content( ($centertd->descendants())[0]->content_list() ); # copy all the children of the
# td's first child (which is <center>) directly into the td
($centertd->descendants())[0]->detach(); # delete the now obsolete first child


But yeah, HTML::Element is not totally intuitive because of its great power. I use it all the time to automatically convert evil old HTML into shiny new XHTML. A recent example of what it fixed for me:
<!--START HEADER-->
<table class="header"><tr><td>
<h1> a very important heading </h1>
</td></tr></table>
<!--END HEADER-->

<!--START TEXT-->
<table class="imageleft"><tr><td>
<img src="/images/myimage.jpg" alt="LOOK AT IT" width="264" height="250">
<p class="center"><small><i>that image is great</i></small></p>
</td></tr></table>


You just don't want to tackle a beast like that with regular expressions...
