Regular expression matches too much

"?" for non-greedy doesn't solve the problem

     

MichaelBluejay

11:55 am on Jun 21, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm trying to find <td>'s followed by <center> so I can turn them into <td align=center>

Of course the <td> could have attributes, like <td valign=top> or <td style="color:green">, so I have to preserve that. Finally, my old wysiwyg editor sometimes inserted a lot of whitespace, so I have to account for that when I'm trying to match.

Let's take this sample string: $data = q[<TD VALIGN=top WIDTH=120></TD><TD> <CENTER>];

This idea doesn't work:

while ($data =~ /<td.*?>\s*?<center>/is) {
    $data =~ s|(<td.*?)>\s*<center>|$1 align=center>|is;
}


This matches the whole string, not just the second <td>. It's looking for something that starts with "<td", then any characters, then a ">", then any whitespace, and then ends with "<center>". So it grabs the first "<td" and matches through to the end. The ">" that's matched is the one for the second <td>, not the first one.
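The over-match described above is easy to reproduce. This is a minimal sketch using the sample string from the post; the variable name $matched is just for illustration.

```perl
use strict;
use warnings;

my $data = q[<TD VALIGN=top WIDTH=120></TD><TD> <CENTER>];

# A lazy .*? still starts at the leftmost "<td" and keeps growing until the
# rest of the pattern can succeed, so it runs straight across the first cell.
my ($matched) = $data =~ /(<td.*?>\s*<center>)/is;

print "matched: $matched\n";
print "whole string? ", ($matched eq $data ? "yes" : "no"), "\n";
```

Non-greedy only means "prefer the shortest extension from this starting point". It never moves the starting point to the right if a match is possible from the left.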

It occurred to me that the bad result contains a </td>, so I could just add a negative test for that:

while ($data =~ /<td.*?>\s*?<center>/is && $& !~ m|</td>|i) {


However, once I'm ready to do the substitution, I'm stuck again, because I have to use the original regexp which matches the whole string.

So I found a new operator which lets me do negative lookaheads, and so I tried this to negate the </td>:

while ($data=~ m|<td.*?>(?!.*</td>)\s*?<center>|is) {


But it still matched the whole string.
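One possible reading of why the lookahead doesn't help: it is tested at the position right after the ">", so it only restricts what follows that ">", not what the ".*?" already swallowed. The sketch below contrasts that attempt with a "tempered dot", where the lookahead is checked before every single character. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $data = q[<TD VALIGN=top WIDTH=120></TD><TD> <CENTER>];

# Attempt from the post: the lookahead runs once, after the ">".
# Once .*? has stretched past </TD>, there is no </td> left ahead to reject.
my ($with_lookahead) = $data =~ m{(<td.*?>(?!.*</td>)\s*?<center>)}is;

# Tempered dot: each character is consumed only if "</td>" does not start
# at that position, so the match can never cross a cell boundary.
my ($tempered) = $data =~ m{(<td(?:(?!</td>).)*?>\s*<center>)}is;

print "lookahead version: $with_lookahead\n";
print "tempered version:  $tempered\n";
```

For this particular job a plain [^>]* inside the tag is simpler and does the same thing; the tempered form is mainly useful when the forbidden thing is a whole string rather than a single character.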

This is way above my skillset. Is there an easy solution for this?

phranque

12:16 pm on Jun 21, 2010 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



i didn't study this too carefully so i may have missed part of your problem but did you try using a character class?
such as ...<td([^>]*)>...
which means "...less than, followed by td, followed by zero or more non-greater-than, followed by greater than..." and any possible td tag attributes and values are captured in a group and available for substitution as $1.

for completeness you might add a check to see if the align attribute was already specified and for validation you might add quotes around the attribute value in the substitution.
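A rough sketch of the character-class idea, including the "align already set" check and the quoted attribute value. The function name add_align is hypothetical, and this only rewrites the opening tags; any closing </center> is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "<td ...> <center>" into "<td ... align="center">".
sub add_align {
    my ($html) = @_;
    $html =~ s{<td\b(?![^>]*\balign\s*=)([^>]*)>\s*<center>}{<td$1 align="center">}gi;
    return $html;
}

print add_align(q[<TD VALIGN=top WIDTH=120></TD><TD> <CENTER>]), "\n";
print add_align(qq[<td valign=top>\n <center>x]), "\n";
print add_align(q[<td align=left><center>x]), "\n";
```

Because [^>]* cannot cross a ">", the attempt that starts at the first <TD fails outright and the engine moves on to the second one. The \b in front of "align" keeps "valign=" from being mistaken for "align=".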

MichaelBluejay

12:38 pm on Jun 21, 2010 (gmt 0)

Thank you very much for the reply. It's funny, while I was continuing to work on this on my own, I hit on that same idea of using <td([^>]*?)> myself. But the problem remains: It's still gonna grab the ">" from the second <td>, because that's the only legal match, since the first set of <td>'s doesn't contain a <center>.

Good idea about checking for "align=" already being set, but first things first! So...any other ideas?

phranque

12:44 pm on Jun 21, 2010 (gmt 0)

i'm not sure i understand the problem:
[^>]* stops before the first >

[^>] means anything that is not >

phranque

12:50 pm on Jun 21, 2010 (gmt 0)

i also just noticed you are using multiple quantifiers:
[^>]*?

that actually would match "zero or more non-'>' followed by a '?'"

MichaelBluejay

1:10 pm on Jun 21, 2010 (gmt 0)

Thanks again for your help.

I'm not sure I understand your questions.

I didn't think [^>]* stops before the first >. I thought it meant "any number of characters that aren't >, including no characters".

I didn't think [^>]*? meant "zero or more non-'>' followed by a '?'". I thought ? was the anti-greedy operator, not a literal character.

Maybe I'd understand better if I could see an example. Can you suggest a regexp that matches the first <td></td> set in my original string?
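Here is a small sketch that may settle both points, run against the sample string from the first post. In Perl, a "?" directly after a quantifier makes it non-greedy rather than matching a literal "?", and [^>]* simply cannot extend past a ">". The variable names are illustrative only.

```perl
use strict;
use warnings;

my $data = q[<TD VALIGN=top WIDTH=120></TD><TD> <CENTER>];

# [^>]* cannot cross a ">", so the match ends at the first tag's own ">".
my ($open_tag) = $data =~ /(<td[^>]*>)/i;

# "*?" is the lazy form of "*"; here it finds exactly the same text.
my ($lazy_tag) = $data =~ /(<td[^>]*?>)/i;

# The first <td>...</td> pair: opening tag, then as little as possible
# up to the nearest </td>.
my ($first_cell) = $data =~ m{(<td[^>]*>.*?</td>)}is;

print "open tag:   $open_tag\n";
print "lazy tag:   $lazy_tag\n";
print "first cell: $first_cell\n";
```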

jdMorgan

1:13 pm on Jun 21, 2010 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



The new pattern finds the end of the first tag, in this case, the "<td>".

The next step is to search the string starting at that point for "<center".

So, "the problem" seems to be a disconnect between what the new pattern does, and an expectation of a "one-line" solution, which is not what was proposed.

Jim

MichaelBluejay

1:18 pm on Jun 21, 2010 (gmt 0)

Okay, how do I search the string starting at the end of the first <td>?
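One way to do this in Perl, sketched under the assumption that a two-step search is what was meant: a /g match in scalar context remembers where it stopped (see pos()), and \G anchors the next match at exactly that spot. The array @centered is just an illustrative place to collect results.

```perl
use strict;
use warnings;

my $data = q[<TD VALIGN=top WIDTH=120></TD><TD> <CENTER>];
my @centered;   # start offsets of <td> tags directly followed by <center>

# Step 1: find each opening <td ...> tag, one at a time.
while ($data =~ /<td[^>]*>/gi) {
    my $tag_start = $-[0];   # offset where this tag begins

    # Step 2: \G starts right where the tag ended. The /c flag keeps the
    # search position in place if this second match fails.
    if ($data =~ /\G\s*<center>/gci) {
        push @centered, $tag_start;
    }
}

print "centered cells start at offset(s): @centered\n";
```

Both patterns share the same position because pos() belongs to the string, not to the regex.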

rocknbil

4:51 pm on Jun 21, 2010 (gmt 0)

WebmasterWorld Senior Member rocknbil is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I'm trying to find <td>'s followed by <center> so I can turn them into <td align=center>


Here's what I would try.

while ($data =~ /<td[^>]*>\s*<center>/i) {
    # ([^>]*) keeps whatever attributes the <td> already had
    $data =~ s/<td([^>]*)>\s*<center>/<td$1 class="center-align">/ig;
    $data =~ s/<\/center>//ig;
}

I don't even know that you need the while.

$data = 'Some big block of multiline text';
# \s already matches newlines, so /g and /i are the only flags needed
$data =~ s/<td([^>]*)>\s*<center>/<td$1 class="center-align">/ig;
$data =~ s/<\/center>//ig;

MichaelBluejay

7:33 pm on Jun 21, 2010 (gmt 0)

Thanks, rocknbil. But I'm not sure that will work because of another problem I realized: I can't kill </center> tags as a separate regexp, because it could kill the *wrong* </center> tag. For example, I've got nested tables and some of the tables themselves are centered with <center></center>.

I'm starting to think there is no easy answer, and that I'll have to construct something that takes several lines with multiple if/then's, if I can even wrap my head around *that* one.

chorny

8:31 pm on Jun 21, 2010 (gmt 0)

5+ Year Member


HTML::TreeBuilder::XPath is a good tool to find all occurrences. For manipulating you will need to study HTML::Element.

jdMorgan

9:22 pm on Jun 21, 2010 (gmt 0)

The "while" needs to start after the <td> is found, and be defined as "while not </td>" so that only center tags entirely contained within the current <td> are matched. And consider that an unclosed <center> may be present; you have to decide whether to ignore it, or to try to fix it up like most browsers would (by closing it with the </td>).

Jim

[edited by: phranque at 9:59 pm (utc) on Jun 21, 2010]
[edit reason] disabled graphic smileys ;) [/edit]

MichaelBluejay

5:34 am on Jun 22, 2010 (gmt 0)

jdMorgan, that concept makes sense, but I really don't have a clue as to how to implement it.
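One way the "while not </td>" idea might be expressed in a single substitution, as a sketch rather than a complete solution. The names $inside and center_cells are made up for the example. It chooses to ignore unclosed <center> tags, and it only touches a cell when the </center> sits right before that cell's </td>.

```perl
use strict;
use warnings;

# Content allowed between <center> and </center>: any run of characters that
# never opens or closes a <td> or a <center>, so the scan stays in one cell.
my $inside = qr{(?:(?!</?td\b|</?center\b).)*}is;

# Hypothetical helper: move a cell-wide <center> pair into align="center".
sub center_cells {
    my ($html) = @_;
    $html =~ s{<td\b([^>]*)>\s*<center>($inside)</center>(?=\s*</td>)}
              {<td$1 align="center">$2}gi;
    return $html;
}

print center_cells(q[<td valign=top> <center>Hi <b>there</b></center></td>]), "\n";
# A <center> that wraps a nested table is left alone:
print center_cells(q[<td><center><table><tr><td>x</td></tr></table></center></td>]), "\n";
```

The opening and closing tags are removed in the same substitution, so a </center> belonging to some other element is never touched on its own.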

rocknbil

4:38 pm on Jun 22, 2010 (gmt 0)

For example, I've got nested tables and some of the tables themselves are centered with <center></center>.


Ugh, and nested too . . . . well . . . the center tag must die, you could also sub out the table tags for a margin:auto style. :-) These are always "fun" projects.

I'm on a project now like this. I just found it easier to rebuild a template, visit the existing page and copy it from within the browser, avoiding the source code like the plague. Seems more tedious, but a lot less stressful.

MichaelBluejay

3:08 am on Jun 23, 2010 (gmt 0)

I had an epiphany. I can easily loop through *only* the <td></td>'s like so:
while ($data =~ m|<td.+?</td>|gis)

That way I'm sure to not get more than one <td></td> pair at a time to work with.

The question then becomes, how do I get the fixed text back into the string after I've parsed it? The match is set to $&, and I can run a s/// regexp on it without fear of it including more than one <td>. But once I fix it, it's in a variable separate from the original string. How to put it back?
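One common answer to "how to put it back" is the /e flag on s///. The replacement side is then run as Perl code, and whatever it returns is written into the string in place of the match. This is a sketch; fix_cell is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: receives exactly one <td>...</td> chunk and returns it,
# rewritten only if the cell's whole content is wrapped in one <center> pair.
sub fix_cell {
    my ($cell) = @_;
    $cell =~ s{^<td\b([^>]*)>\s*<center>((?:(?!</?center\b).)*)</center>(\s*</td>)\z}
              {<td$1 align="center">$2$3}is;
    return $cell;
}

my $data = q[<tr><td valign=top>plain</td><td> <center>Hi</center></td></tr>];

# /e evaluates the replacement as code, so each matched cell is handed to
# fix_cell() and the return value is spliced into $data where the cell was.
$data =~ s{(<td\b.*?</td>)}{ fix_cell($1) }gise;

print "$data\n";
```

One caveat: the lazy ".*?</td>" stops at the first </td> it meets, so a cell that contains a nested table would be cut off at the inner cell's closing tag.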

janharders

8:12 am on Jun 24, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



I'd seriously advise giving chorny's post another read:
HTML::TreeBuilder::XPath is a good tool to find all occurrences. For manipulating you will need to study HTML::Element.


HTML::Element is great for both navigating and manipulating DOM trees. With the look_down method [search.cpan.org] you can run pretty heavy searches that should solve your problems with a much smaller risk of missing something. It might seem like a bit of overhead for simple tasks, but regexps usually either get way too complicated or start to fail as soon as someone comes along with
<img src="" alt="look here >" />

Before you start throwing loops together and end up writing your own DOM parser, give HTML::Element [search.cpan.org] and its helper HTML::TreeBuilder [search.cpan.org] a chance.
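For a sense of scale, here is roughly what the parser route might look like for this task. It is only a sketch: it assumes HTML::TreeBuilder is installed from CPAN, the sample markup is invented, and it handles only cells whose sole non-whitespace child is a <center> element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module, not part of core Perl

my $html = q[<table><tr><td valign="top" width="120"></td><td> <center>Hello</center></td></tr></table>];

my $tree = HTML::TreeBuilder->new_from_content($html);

for my $td ($tree->look_down(_tag => 'td')) {
    # Ignore whitespace-only text so "<td> <center>" still counts.
    my @kids = grep { ref $_ || /\S/ } $td->content_list;
    next unless @kids == 1 && ref $kids[0] && $kids[0]->tag eq 'center';

    # Leave an existing align attribute alone.
    $td->attr(align => 'center') unless defined $td->attr('align');

    # Remove the <center> wrapper but keep everything that was inside it.
    $kids[0]->replace_with_content;
}

my $out = $tree->as_HTML;
print $out;
```

Note that as_HTML serializes the whole parsed document, so the output is normalized HTML (with html, head and body elements) rather than a byte-for-byte copy of the input. A <center> wrapped around a nested table is left alone unless it is the only thing in its cell.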

phranque

9:28 am on Jun 24, 2010 (gmt 0)

i will third what chorny and janharders are suggesting.
html structure, especially when it is not validated and well-formed, is extremely difficult to handle with regular expressions.

MichaelBluejay

11:30 am on Jun 24, 2010 (gmt 0)

I did take a look at those. They are way beyond my level of expertise. I'd have to spend many hours trying to make sense of them, and even then I'm not confident that I'd be able to build a solution out of them, so I'm especially wary. Yes, it does seem like overkill to me for such a simple problem I'm trying to solve, especially since I don't know whether I could even get it working anyway.

As per my last post, I realized that I can easily loop through *only* the <td></td>'s like so:
while ($data =~ m|<td.+?</td>|gis)


That way I'm sure to not get more than one <td></td> pair at a time to work with. Then I just need a way to get the fixed text back into the main string. It seems like there should be some way to do it, I just don't know how.
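If the while loop is kept as is, one possible way to write the fixed text back is to record where the match sat in the string (the @- and @+ arrays hold its start and end offsets) and overwrite that span with 4-argument substr. This is a sketch with an invented sample string, not a tested solution for real pages.

```perl
use strict;
use warnings;

my $data = q[<tr><td>a</td><td> <center>b</center></td></tr>];

while ($data =~ m{<td\b.*?</td>}gis) {
    my ($start, $end) = ($-[0], $+[0]);   # offsets of this match in $data
    my $fixed = substr($data, $start, $end - $start);

    # Work on the copy; skip cells that don't need fixing.
    next unless $fixed =~ s{^<td\b([^>]*)>\s*<center>((?:(?!</?center\b).)*)</center>(\s*</td>)\z}
                           {<td$1 align="center">$2$3}is;

    # 4-argument substr replaces that exact span in the original string.
    substr($data, $start, $end - $start, $fixed);

    # Modifying $data resets the /g search position, so set it explicitly
    # to just past the cell that was replaced.
    pos($data) = $start + length $fixed;
}

print "$data\n";
```

The offsets are saved before any other regex runs, because the inner substitution overwrites @- and @+.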
