Replacing forum code with regular expressions

Help with preg_replace needed


Hester

2:57 pm on Jul 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member


Caution: newbie to regular expressions here (but not newbie to PHP).

I'm trying to replace forum code (such as "[b]" and "[quote]") with the HTML equivalents. So far so good - my script works. But it can lead to invalid markup if forum code tags are nested in the wrong order. I cannot see a way round this.

My guess is that my expressions are too basic. I'm filtering the text and matching opening and closing tags. But what I need to do is look for opening tags and *only* replace them if they are followed directly by the identical closing tag to match.

Here's what I've got so far:

<style>i {color:red;}</style>

<?php

$text = "[b][i]A[/b][/i] [i]B[/i] b [b]C[/b] [b][i]D

d[/i][/b] E [/b] [i]F

[/i]G[/i]";

$pattern = "/\[b](.+)\[\/b]/Uis";
$pattern2 = "/\[i](.+)\[\/i]/Uis";

$text2 = preg_replace($pattern, "<b>\\1</b>", $text);
$text3 = preg_replace($pattern2, "<i>\\1</i>", $text2);

echo "<pre>
$text3
</pre>";

?>

This works, but the start of the output looks like this:

<b><i>A</b></i>

As you can see, the tags are misplaced. I tried filtering the text a third time to remove the problem, but it took out tags near the end of the text as well.

Is there a way to make a regular expression that uses something like 'non-greedy quantifiers' or 'lookahead assertions' to check for a closing tag ahead? I'm not sure exactly how to use those methods. I have looked at sample scripts but find them mind-bogglingly complex.

Hope someone can help.
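
One possible way to replace a code only when it is followed by its own closing tag is a backreference (\1) plus a loop. This is a minimal sketch, not code from the thread: the function name convert_valid_pairs is made up for illustration, it only handles [b] and [i], and it assumes the text between a matching pair contains no other square brackets. Each pass converts the innermost valid pairs, and anything wrongly nested is left in square brackets.

```php
<?php
// Hypothetical helper: convert only well-formed [b]/[i] pairs, innermost first.
function convert_valid_pairs(string $text): string
{
    // \1 must repeat the code captured in the opening bracket. The body may not
    // contain "[" or "]", so only an innermost, correctly closed pair can match.
    $pattern = '/\[(b|i)\]([^\[\]]*)\[\/\1\]/i';

    do {
        // $count receives the number of replacements made in this pass.
        $text = preg_replace($pattern, '<$1>$2</$1>', $text, -1, $count);
    } while ($count > 0);   // converted inner pairs may expose valid outer pairs

    return $text;
}

echo convert_valid_pairs("[b][i]D[/i][/b] [b][i]A[/b][/i]");
// <b><i>D</i></b> [b][i]A[/b][/i]
```

The wrongly ordered "[b][i]A[/b][/i]" stays as typed, so the user can see which codes were rejected.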

timster

5:21 pm on Jul 16, 2004 (gmt 0)




$test4 = preg_replace('/(<(\w)>[^<]+)(<\/\w>)(<\/\2>)/', "$1$4$3", $text3);

And look, here comes some nice programmer to explain this code.
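
For anyone wanting that explanation, here is the same pattern spread over several lines. It is only a sketch: the x flag and the ~ delimiter are used here so the groups can be labelled without escaping the slashes, and the sample string is the bad output from the first post.

```php
<?php
// Same four groups as the one-liner above, written with the x flag
// (whitespace in the pattern is ignored, "#" starts a comment).
$pattern = '~
    ( <(\w)> [^<]+ )   # $1: an opening tag plus the text after it; $2 is its one-letter name
    ( </\w> )          # $3: a one-letter closing tag
    ( </\2> )          # $4: the closing tag whose name repeats $2
~x';

$html = '<b><i>A</b></i>';

// "$1$4$3" writes the groups back with the two closing tags swapped.
echo preg_replace($pattern, '$1$4$3', $html);   // <b><i>A</i></b>
```

The match starts at "<i>" rather than "<b>", because [^<]+ needs at least one character of text straight after the opening tag. Correctly nested markup has no "</x></i>" sequence for the pattern to find, so it is left alone.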


Hester

9:14 am on Jul 19, 2004 (gmt 0)




Ah, thanks. You've taken a different approach - to re-order the tags. I was trying to replace the bad italic ones back to square brackets, so the user would know they had got them wrong.

Did you come up with this code yourself? I always try to credit people who've helped my code, so I would like to add a credit for you. (And if you did the code, are you not willing to explain it a little? I can see what some of it is doing, but the "$1$4$3" is intriguing. Is that re-ordering the words?)

Hester

2:01 pm on Jul 19, 2004 (gmt 0)




STUCK. If I use an HTML tag longer than one character, the script fails to correct wrongly ordered markup. So bold and italic are fine, but blockquote not.

I've studied the line of code and cannot see any alterations I should make to get it to work with longer tags.

Please help!

timster

8:19 pm on Jul 20, 2004 (gmt 0)



$test4 = preg_replace('/(<(\w[^>]+)>[^<]+)(<\/\w[^>]+>)(<\/\2>)/', "$1$4$3", $text3);

This should allow for longer tags.

Sorry, it's a hopeless bunch of chicken scratch. But yes, I did write it.

Yes, $1$4$3 reorders the tags. You can look up "backreferences" for an explanation of what's going on here.

Be aware, this line will fall apart if there are more than 2 incorrect nested tags, e.g.:

[block][b][i]Stuff[/block][/b][/i]

Hester

9:49 am on Jul 21, 2004 (gmt 0)



Thanks for that. Sadly the script doesn't appear to be working. Here's what I got:

<style>
i {color:red;}
b {color:blue;}
i b, b i {color:purple;}
blockquote {border:1px solid #ccc;}
</style>

<?php
$text = "[i][b]A[/i][/b] [b][i]B[/i][/b] b [b]C $9.99[/b] [b][i]D

d[/b][/i] E [/b] [i]F

[/i]G[/i] [q][b]This [b]is[/b] a [b][i]quote[/b][/i] here [i]like[/i] $ this.[/q]
Text
[q][i]hello[/q][/i]
More text";

$text = preg_replace("!" . '\x24' . "!", '\\$', $text); //replace dollars

$pattern = "/\[b](.+)\[\/b]/Uis";
$pattern2 = "/\[i](.+)\[\/i]/Uis";
$pattern3 = "/\[q](.+)\[\/q]/Uis";

$text2 = preg_replace($pattern, "<b>\\1</b>", $text);
$text3 = preg_replace($pattern2, "<i>\\1</i>", $text2);
$text4 = preg_replace($pattern3, "<blockquote>\\1</blockquote>", $text3);

//by timster - (URL snipped) - corrects wrongly nested tags
$text5 = preg_replace('/(<(\w[^>]+)>[^<]+)(<\/\w[^>]+>)(<\/\2>)/', "$1$4$3", $text4);

echo "<pre>
$text5
</pre>";
?>

timster

1:08 pm on Jul 21, 2004 (gmt 0)




Oops, I meant:

$test4 = preg_replace('/(<(\w[^>]*)>[^<]+)(<\/\w[^>]*>)(<\/\2>)/', "$1$4$3", $text3);

(Still won't work if there are 3+ nested tags.)
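
The only change from the earlier line is the quantifier after \w. With [^>]+ a tag name needs at least two characters, so <b> and <i> never match; with [^>]* one character is enough. A small comparison, using a made-up sample string:

```php
<?php
$html = '<b><i>A</b></i> <blockquote><i>hello</blockquote></i>';

$plus = '/(<(\w[^>]+)>[^<]+)(<\/\w[^>]+>)(<\/\2>)/';  // earlier version
$star = '/(<(\w[^>]*)>[^<]+)(<\/\w[^>]*>)(<\/\2>)/';  // corrected version

// Earlier version: the tag directly before the text is <i>, which is too short
// for \w[^>]+, so nothing matches and the string comes back unchanged.
echo preg_replace($plus, '$1$4$3', $html), "\n";

// Corrected version: both wrongly ordered pairs are swapped.
echo preg_replace($star, '$1$4$3', $html), "\n";
// <b><i>A</i></b> <blockquote><i>hello</i></blockquote>
```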

Hester

1:39 pm on Jul 21, 2004 (gmt 0)




Excellent! Thanks for that code. Is there any way to make it work with 3 or more matches though?

I'm thinking now I need to go through the script and build an array. Then replace only valid codes. Something like this:

  1. if opening bracket found, increase an array value for the code by 1. (Each code has a separate value stored.)
  2. if another opening bracket found, increase its value as well (allowing for nested tags)
  3. if a closing tag is found, check current values stored in array to see if it's valid to close the tag there or not. If not, ignore it.
  4. carry on until end of text reached. If final values held for opening and closing tags do not match, add the correct codes to close them, while keeping the order right (valid HTML).
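
The steps above could be sketched roughly as follows. Note the swap: this version keeps a stack of open codes instead of one counter per code, because a stack also remembers the order in which codes were opened. The function name forum_to_html and the code-to-HTML map are assumptions for illustration. A closing code that does not match the innermost open code is left as visible text, and anything still open at the end is closed innermost first.

```php
<?php
// Hypothetical helper; the name and the tag map are illustrative assumptions.
function forum_to_html(string $text): string
{
    $map = ['b' => 'b', 'i' => 'i', 'q' => 'blockquote'];

    // Split into plain text and [code] / [/code] tokens, keeping the tokens.
    $tokens = preg_split('/(\[\/?(?:b|i|q)\])/i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $open = [];   // stack of currently open codes, innermost last
    $out  = '';

    foreach ($tokens as $token) {
        if (preg_match('/^\[(\/?)(b|i|q)\]$/i', $token, $m)) {
            $code = strtolower($m[2]);
            if ($m[1] === '') {                    // steps 1-2: opening code
                $open[] = $code;
                $out .= '<' . $map[$code] . '>';
                continue;
            }
            if ($open && end($open) === $code) {   // step 3: valid close
                array_pop($open);
                $out .= '</' . $map[$code] . '>';
                continue;
            }
            // invalid close: fall through and keep it as visible text
        }
        $out .= $token;
    }

    // Step 4: close whatever is still open, innermost first.
    while ($open) {
        $out .= '</' . $map[array_pop($open)] . '>';
    }
    return $out;
}

echo forum_to_html('[b][i]A[/b][/i]');   // <b><i>A[/b]</i></b>
```

The text is processed in a single pass, so it should not matter how many codes are nested.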

What do you think?

Also, does anyone know of a way to move through a string of text in such a way as to compare characters as you go? I always filter text by grabbing each line and splitting it (using the file pointer to progress). Is there a way to 'jump' to each opening bracket in the text to speed it up?
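
One way to jump from bracket to bracket is strpos() with its third argument, which sets where the search starts. A minimal sketch (bracket_offsets is a made-up name) that collects the position of every "[" in a string:

```php
<?php
// Hypothetical helper: return the offset of every "[" in $text.
function bracket_offsets(string $text): array
{
    $offsets = [];
    $pos = 0;
    // strpos() resumes searching at $pos, so the text between brackets is skipped.
    while (($pos = strpos($text, '[', $pos)) !== false) {
        $offsets[] = $pos;
        $pos++;   // step past this bracket before searching again
    }
    return $offsets;
}

print_r(bracket_offsets('ab[b]cd[/b]'));   // offsets 2 and 7
```

Single characters can then be compared with string offsets, for example $text[$pos + 1] === '/' to tell a closing code from an opening one.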