Regular expression matches too much

I'm trying to find <td> tags that are followed by <center> so I can turn them into <td align=center>.

Of course the <td> could have attributes, like <td valign=top> or <td style="color:green">, so I have to preserve those. Also, my old WYSIWYG editor sometimes inserted a lot of whitespace between the two tags, so I have to allow for that when matching.

Let's take this sample string:

$data = q[<TD VALIGN=top WIDTH=120></TD><TD> <CENTER>];

This idea doesn't work:

while ($data =~ /<td.*?>\s*?<center>/is) {
    $data =~ s|(<td.*?>)\s*<center>|$1 align=center>|is;
}

This matches the whole string, not just the second <td>. It's looking for something that starts with "<td", then any characters, then a ">", then any whitespace, and then ends with "<center>". So it grabs the first "<td" and matches through to the end. Even though ".*?" is non-greedy, the ">" that ends up matched is the one for the second <td>, not the first one, because "<center>" doesn't follow the first one.
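To make the behaviour described above concrete, here is a minimal sketch using the sample string. It compares the non-greedy pattern with a variant that uses the negated character class [^>]* (my suggestion, not something from the original post), which cannot run past a ">" and therefore cannot span two tags. The variable names $lazy and $bounded are just for illustration.

```perl
use strict;
use warnings;

my $data = q[<TD VALIGN=top WIDTH=120></TD><TD> <CENTER>];

# Non-greedy .*? still starts at the leftmost "<td". It then grows one
# character at a time until the rest of the pattern can match, which
# means it happily runs past the first ">" and the </TD>.
my ($lazy) = $data =~ /(<td.*?>\s*<center>)/is;

# [^>]* can never consume a ">", so the attempt at the first <TD fails
# and the engine moves on to the second <TD>.
my ($bounded) = $data =~ /(<td[^>]*>\s*<center>)/is;

print "lazy:    $lazy\n";      # the whole sample string
print "bounded: $bounded\n";   # <TD> <CENTER>
```

The trade-off is that [^>]* would stop early if an attribute value itself contained a ">", which is rare in hand-written or editor-generated HTML but possible.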

It occurred to me that the bad result contains a </td>, so I could just add a negative test for that:

while ($data =~ /<td.*?>\s*?<center>/is && $& !~ m|</td>|i) {

However, once I'm ready to do the substitution, I'm stuck again, because I have to use the original regexp which matches the whole string.
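One way around the "test first, then substitute" problem is to let a single substitution do both jobs. The sketch below assumes the [^>]* approach (not part of the original post) and captures the opening tag without its closing ">", so the new attribute lands inside the tag. Capturing the ">" as well, as in the loop above, would produce "<TD> align=center>". The name $fixed is only for illustration.

```perl
use strict;
use warnings;

my $data = q[<TD VALIGN=top WIDTH=120></TD><TD> <CENTER>];

# $1 holds "<td" plus any attributes, but not the closing ">".
# \b keeps the pattern from matching tags that merely begin with "td".
# /g rewrites every matching cell in one pass, so no while loop and
# no separate test are needed.
(my $fixed = $data) =~ s/(<td\b[^>]*)>\s*<center>/$1 align=center>/gi;

print "$fixed\n";   # <TD VALIGN=top WIDTH=120></TD><TD align=center>
```

This only handles the opening tags; the matching </center> would still have to be removed separately.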

So I found a new operator which lets me do negative lookaheads, and so I tried this to negate the </td>:

while ($data=~ m|<td.*?>(?!.*</td>)\s*?<center>|is) {

But it still matched the whole string.
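As far as I can tell, the lookahead fails to help because it is checked only once, at the position right after whichever ">" the ".*?" currently stops at. After the first ">" there is a </TD> ahead, so that attempt is rejected, but ".*?" then simply grows to the ">" of the second <TD>, where no </td> follows in the sample. The sketch below shows that, plus a variant that repeats the lookahead before every character so the match cannot cross a </td>. The names $once and $each are illustrative.

```perl
use strict;
use warnings;

my $data = q[<TD VALIGN=top WIDTH=120></TD><TD> <CENTER>];

# Lookahead tested once, after the ">" that .*? ends on. From the
# second <TD>'s ">" onward there is no </td>, so the test passes.
my ($once) = $data =~ m|(<td.*?>(?!.*</td>)\s*<center>)|is;

# Lookahead tested before each character: the match may not step
# onto a "</td>", so it cannot start in one cell and end in the next.
my ($each) = $data =~ m|(<td(?:(?!</td>).)*?>\s*<center>)|is;

print "once: $once\n";   # the whole sample string
print "each: $each\n";   # <TD> <CENTER>
```

On a real page the first pattern would usually match nothing at all, because some </td> almost always follows later in the document.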

This is way above my skillset. Is there an easy solution for this?

Before: <html><head></head><body><table><tr><td valign="top" width="120"></td><td><center>centered!</center></td><td><center>also centered?</center></td></tr></table></body></html>

After: <html><head></head><body><table><tr><td valign="top" width="120"></td><td align="center">centered!</td><td align="center">also centered?</td></tr></table></body></html>
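For comparison, the Before/After pair above can also be reproduced with a single regex substitution, at least for this simple input. This is only a sketch under the assumption that <center> wraps the entire cell content and is never nested; anything more complicated is better left to an HTML parser. The variable names are illustrative.

```perl
use strict;
use warnings;

my $before = q[<html><head></head><body><table><tr><td valign="top" width="120"></td><td><center>centered!</center></td><td><center>also centered?</center></td></tr></table></body></html>];

(my $after = $before) =~ s{
    (<td\b[^>]*) > \s*               # opening td tag, captured without its ">"
    <center>
    ( (?: (?! </?center\b ) . )* )   # cell content containing no center tags
    </center> \s* </td>              # center must close right before the cell ends
}{$1 align="center">$2</td>}gisx;

print "$after\n";
```

Cells where text follows the closing </center> are deliberately left alone, since moving the alignment onto the <td> would change how that trailing text is rendered.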

$tree->look_down(    # look down the DOM-tree of HTML elements
    '_tag' => 'td',  # look for td-tags
    sub {            # only find those td-tags that ...
        $_[0]->descendants() > 0 &&                      # have child elements
        ($_[0]->descendants())[0]->tag() eq 'center'     # and the first child is a center-tag
    }
)

# turn <td> into <td align="center">
$centertd->attr('align', 'center');

# copy all the children of the td's first child (which is <center>)
# directly into the td
$centertd->push_content( ($centertd->descendants())[0]->content_list() );

# delete the now obsolete first child
($centertd->descendants())[0]->detach();

<table class="header"><tr><td> <h1> a very important heading </h1> </td></tr></table>

<table class="imageleft"><tr><td> <img src="/images/myimage.jpg" alt="LOOK AT IT" width="264" height="250"> <p class="center"><small><i>that image is great</i></small></p> </td></tr></table>

"?" for non-greedy doesn't solve the problem
