Regular expression - How to replace all &'s within an HTML comment - PHP Server Side Scripting forum at WebmasterWorld

Forum Moderators: coopster

Using preg_replace()

penders

3:08 pm on Apr 28, 2008 (gmt 0)

I have a variable $CONTENT which contains HTML source code, and I wish to replace all '&' (ampersands) that occur inside HTML comments (<!-- ... -->) with something else (i.e. '[[AMPERSAND]]').

Either the match is too greedy (from the first "<!--" to the last "-->" and matching everything) or it only replaces the first '&'; I can't get it right.

Using preg_replace(), I have:

$CONTENT = preg_replace('/<!--([\w\W]*?)&([\w\W]*?)-->/', '<!--$1[[AMPERSAND]]$2-->', $CONTENT);

Or is there a better way? Any comments appreciated.

RonPK

8:37 am on Apr 29, 2008 (gmt 0)

You might try using a callback function:

$content = preg_replace_callback('/(<!-- [\w\W]*?&[\w\W]*? -->)/',  
  create_function('$m', 'return str_replace("&", "[ampersand]", $m[1]);'),  
  $content);

That won't get you into regexp heaven but it will do the job :)

penders

11:55 pm on May 7, 2008 (gmt 0)

Thanks RonPK, that certainly got me onto the right track. However, I still had the problem initially of the regexp being too greedy (matching everything from the first "<!--" to the last "-->"). I was able to get round this with the following, which does the job...

$CONTENT = preg_replace_callback('/<!--[\w\W]+?-->/', create_function('$m', 'return str_replace("&", "#AMPERSAND#", $m[0]);'), $CONTENT);

This replaces all '&' (ampersands) that occur within HTML comments with the character sequence '#AMPERSAND#'.
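For readers on current PHP: create_function() was deprecated in PHP 7.2 and removed in PHP 8.0, so the same idea has to be written with an anonymous function. The following is a minimal sketch under that assumption; the function name maskAmpersandsInComments and the default placeholder are illustrative and not from the thread.

```php
<?php
// Replace every '&' inside HTML comments with a placeholder.
// The function name and default placeholder are illustrative assumptions.
function maskAmpersandsInComments(string $html, string $placeholder = '#AMPERSAND#'): string
{
    $result = preg_replace_callback(
        '/<!--.*?-->/s', // lazy match of one whole comment; "s" lets "." span newlines
        function (array $m) use ($placeholder): string {
            // $m[0] is the full comment, including its <!-- and --> delimiters
            return str_replace('&', $placeholder, $m[0]);
        },
        $html
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

echo maskAmpersandsInComments('<p>a &amp; b</p><!-- c & d & e -->'), "\n";
// <p>a &amp; b</p><!-- c #AMPERSAND# d #AMPERSAND# e -->
```

The pattern only locates each comment; the callback does the actual search and replace inside it, which is why every ampersand in the comment is handled.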

1. Are there other ways of doing this?

------------------------------------------------------------------------------

2. Ultimately my goal is actually the opposite... to replace all '&' that DO NOT appear within HTML comments with '&amp;'. Is there an elegant regexp way of doing this?!

(Instead I replace all '&' that DO occur in HTML comments with 'X'. Then replace ALL '&' with '&amp;'. And finally replace ALL 'X' with '&' to correct the HTML comments.)
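The three-step swap described in the parenthesis above could be sketched roughly as follows. The function name and the NUL-delimited placeholder are assumptions made for illustration; the approach is only safe if the placeholder never occurs in the content.

```php
<?php
// Encode '&' as '&amp;' everywhere except inside HTML comments,
// using a temporary placeholder (hypothetical name and placeholder).
function encodeAmpersandsBySwap(string $html): string
{
    // Assumption: this byte sequence does not appear in the content.
    $placeholder = "\x00AMP\x00";

    // Step 1: hide the ampersands that sit inside comments.
    $masked = preg_replace_callback(
        '/<!--.*?-->/s',
        function (array $m) use ($placeholder): string {
            return str_replace('&', $placeholder, $m[0]);
        },
        $html
    ) ?? $html;

    // Step 2: encode every ampersand that is still visible.
    $encoded = str_replace('&', '&amp;', $masked);

    // Step 3: restore the hidden ampersands inside the comments.
    return str_replace($placeholder, '&', $encoded);
}

echo encodeAmpersandsBySwap('<p>a & b &lt;</p><!-- c & d -->'), "\n";
// <p>a &amp; b &amp;lt;</p><!-- c & d -->
```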

coopster

1:41 pm on May 8, 2008 (gmt 0)

RonPK nailed it. If you have a comment such as:

<!-- I have a comment here & here & here -->

The regular expression you have is only going to match and replace the first ampersand occurrence:

<!-- I have a comment here [[AMPERSAND]] here & here -->

If you step back and look at the regular expression, it says that every match of any ampersand must begin with the comment code (<!--), so it hits that first one and changes it because it falls between the comment tags. It does not say to repeat within the comment itself, though; it says to move on and find the next beginning comment and start again. Therefore the callback routine is required to further analyze the data between the comments in order to catch and replace multiple occurrences of ampersands between the comment begin/end tags.
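A small check of this behaviour, using the pattern from the first post and the sample comment above. The $count argument of preg_replace() shows that the whole comment is consumed by a single match, so only one ampersand is ever replaced.

```php
<?php
$comment = '<!-- I have a comment here & here & here -->';
$pattern = '/<!--([\w\W]*?)&([\w\W]*?)-->/';

// The fifth argument receives the number of replacements performed.
$result = preg_replace($pattern, '<!--$1[[AMPERSAND]]$2-->', $comment, -1, $count);

echo $result, "\n"; // <!-- I have a comment here [[AMPERSAND]] here & here -->
echo $count, "\n";  // 1
```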

The only issue you ran into is what is called greediness. Quantifiers are "greedy" but if a quantifier is followed by a question mark, then it ceases to be greedy, and instead matches the minimum number of times possible. There are two ways to overcome the greediness in PHP. First, as was just stated: you can make them "ungreedy" by following them with a question mark. PHP also has a "U" modifier that makes them all ungreedy for that particular pattern.
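To illustrate the difference, here is a short sketch comparing a greedy quantifier, the same quantifier followed by a question mark, and the "U" modifier. The sample string is made up for the example.

```php
<?php
$html = '<!-- one --> text <!-- two -->';

// Greedy: ".*" grabs as much as possible, so the match ends at the LAST "-->".
preg_match('/<!--.*-->/s', $html, $greedy);

// Lazy: ".*?" stops at the FIRST "-->" it can reach.
preg_match('/<!--.*?-->/s', $html, $lazy);

// "U" modifier: inverts the default, so the plain ".*" behaves lazily here.
preg_match('/<!--.*-->/sU', $html, $ungreedy);

echo $greedy[0], "\n";   // <!-- one --> text <!-- two -->
echo $lazy[0], "\n";     // <!-- one -->
echo $ungreedy[0], "\n"; // <!-- one -->
```

Note that under "U" a quantifier followed by a question mark becomes greedy again, so mixing the two in one pattern is easy to get wrong.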

To get where you are going, to actually modify the ampersands outside of the comments, you need to start your pattern just the opposite. The ending comment tag would be the beginning of your pattern and the beginning comment tag would be the end of your pattern. Watch out though so you don't end up changing entities that are already encoded! For example, if there is a &lt; in your HTML you want to make sure you don't change it to &amp;lt; now!
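One possible way to touch only the ampersands outside comments is sketched below. It is a different technique from the reversed pattern described above: the regex matches either a whole comment or a single ampersand, and the callback leaves comments unchanged. The function name and the $skipEntities flag are assumptions for illustration; the flag uses a negative lookahead to leave things that already look like entities alone.

```php
<?php
// Encode '&' as '&amp;' outside HTML comments (hypothetical helper).
function encodeAmpersandsOutsideComments(string $html, bool $skipEntities = false): string
{
    // Optionally skip ampersands that already start a named or numeric entity.
    $amp = $skipEntities
        ? '&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)'
        : '&';

    // Comments are tried first, so ampersands inside them are never matched alone.
    $pattern = '/<!--.*?-->|' . $amp . '/s';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            // A lone '&' gets encoded; a matched comment is returned untouched.
            return $m[0] === '&' ? '&amp;' : $m[0];
        },
        $html
    );

    return $result ?? $html;
}

$html = '<p>&nbsp; & x</p><!-- a & b -->';
echo encodeAmpersandsOutsideComments($html), "\n";
// <p>&amp;nbsp; &amp; x</p><!-- a & b -->
echo encodeAmpersandsOutsideComments($html, true), "\n";
// <p>&nbsp; &amp; x</p><!-- a & b -->
```

Because the comment alternative simply fails when there are no comments, nothing needs to be made optional for comment-free content.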

chorny

10:36 pm on May 8, 2008 (gmt 0)

penders, better to use an HTML parser.

penders

5:59 pm on May 11, 2008 (gmt 0)

coopster wrote: "RonPK nailed it."

@coopster: Many thanks for your explanation. Yep, RonPK was pretty much on the button; however, the regular expression was still too greedy (as you mention). I think this was because an HTML comment might not contain an '&' at all (the character we are looking for) but the regexp wanted at least one. So, in the following example, all '&' between the first '<!--' and the last '-->' would end up being converted.

<!-- An HTML Comment --> 
<p>&nbsp; &amp; &gt; &lt;</p> 
<!-- 
<p>&nbsp; &amp; &gt; &lt;</p> 
-->

I realise now that this could be solved by simply placing a '?' after the '&' in RonPK's original regexp to make the '&' optional. Or simply match the whole comment (which I have done) since it's the callback function which actually does the search and replace, passing the container to search in.
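The over-matching can be reproduced with a compact, single-spaced variant of the sample above (an assumption made so that the literal spaces in RonPK's pattern line up). The callback is written as an anonymous function, since create_function() is not available in PHP 8.

```php
<?php
$html = "<!-- An HTML Comment -->\n<p>&nbsp; &amp;</p>\n<!-- <p>&nbsp; &amp;</p> -->";

$mask = function (array $m): string {
    return str_replace('&', '[ampersand]', $m[0]);
};

// Requires an '&' inside the match, so it runs past the first "-->"
// and swallows the paragraph that sits between the two comments.
$overMatched = preg_replace_callback('/<!-- [\w\W]*?&[\w\W]*? -->/', $mask, $html);

// Matches each whole comment, whether or not it contains an '&'.
$perComment = preg_replace_callback('/<!--[\w\W]*?-->/', $mask, $html);

echo $overMatched, "\n\n";
// <!-- An HTML Comment -->
// <p>[ampersand]nbsp; [ampersand]amp;</p>
// <!-- <p>[ampersand]nbsp; [ampersand]amp;</p> -->

echo $perComment, "\n";
// <!-- An HTML Comment -->
// <p>&nbsp; &amp;</p>
// <!-- <p>[ampersand]nbsp; [ampersand]amp;</p> -->
```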

coopster wrote: "To get where you are going, to actually modify the ampersands outside of the comments, you need to start your pattern just the opposite. The ending comment tag would be the beginning of your pattern..."

Ha, yeah, not sure why I didn't think of that! Thanks. :) Will give that a go sometime... the closing/opening comment tags will need to be optional, as there might not be any comments... hhmmm...

coopster wrote: "Watch out though so you don't end up changing entities that are already encoded!"

Actually, that is exactly what I am trying to do. A bit of background... I'm not trying to correct invalid HTML, rather preserve valid HTML for display on the page. This is all part of an online Text/HTML editor (a CMS if you like). The HTML content is loaded into a <textarea> to be edited. HTML entities, however, get converted by the browser into the actual characters for display, so the HTML entity is lost. So &lt; is converted into &amp;lt; so that it actually displays in the textarea as &lt;. However, I've found that in IE and FF, HTML entities that appear within HTML comments do not get converted into the actual characters; they stay as HTML entities (hence this thread), except in Opera, where they do (hhmmmm)!
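For comparison, the plain case of preparing stored HTML for a <textarea>, where everything is encoded regardless of comments, can be sketched with PHP's built-in htmlspecialchars(). The variable names are illustrative, and this does not reproduce the per-browser comment behaviour described above.

```php
<?php
// Stored HTML that should appear in the editor exactly as written.
$stored = '<p>1 &lt; 2 &amp; 3</p>';

// Encode &, <, > and quotes so the browser shows the original source text.
$forTextarea = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');

echo '<textarea>' . $forTextarea . '</textarea>', "\n";
// <textarea>&lt;p&gt;1 &amp;lt; 2 &amp;amp; 3&lt;/p&gt;</textarea>
```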

chorny wrote: "penders, better use HTML parser"

@chorny: The content is actually passed through an XHTML parser afterwards to attempt validation, but the editor itself is really only intended to be a plain text editor; the users are all HTML whizzes - well, kinda! :)