Forum Moderators: coopster
Either the match is too greedy (running from the first "<!--" to the last "-->" and matching everything in between) or it only replaces the first '&'. I can't get it right.
Using preg_replace(), I have:
$CONTENT = preg_replace('/<!--([\w\W]*?)&([\w\W]*?)-->/', '<!--$1[[AMPERSAND]]$2-->', $CONTENT);
Or is there a better way? Any comments appreciated.
$CONTENT = preg_replace_callback('/<!--[\w\W]+?-->/', create_function('$m', 'return str_replace("&", "#AMPERSAND#", $m[0]);'), $CONTENT);
This replaces every '&' (ampersand) that occurs within an HTML comment with the character sequence '#AMPERSAND#'.
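For readers on current PHP: create_function() was deprecated in PHP 7.2 and removed in PHP 8.0, so the call above needs an anonymous function there. The following is only a sketch of the same idea, assuming PHP 7 or later. The function name maskAmpersandsInComments and its $placeholder parameter are made up for illustration, and the pattern uses ".*?" with the "s" modifier in place of "[\w\W]+?" (so an empty comment also matches).

```php
<?php
// Hypothetical helper: masks every '&' inside HTML comments with a placeholder.
function maskAmpersandsInComments(string $content, string $placeholder = '#AMPERSAND#'): string
{
    $result = preg_replace_callback(
        // Lazy match of one whole comment; the "s" modifier lets "." span newlines.
        '/<!--.*?-->/s',
        function (array $m) use ($placeholder): string {
            // $m[0] is the full comment, so every '&' in it is replaced.
            return str_replace('&', $placeholder, $m[0]);
        },
        $content
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $content;
}

echo maskAmpersandsInComments('a & b <!-- c & d & e --> f & g'), "\n";
// prints: a & b <!-- c #AMPERSAND# d #AMPERSAND# e --> f & g
```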
1. Are there other ways of doing this?
------------------------------------------------------------------------------
2. Ultimately my goal is actually the opposite: to replace all '&' that DO NOT appear within HTML comments with '&amp;'. Is there an elegant regexp way of doing this?
(Instead I replace all '&' that DO occur in HTML comments with 'X', then replace ALL '&' with '&amp;', and finally replace ALL 'X' with '&' to restore the HTML comments.)
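The three-step workaround described in the parentheses could look roughly like this. It is a sketch only, assuming PHP 7 or later. The function name and the placeholder string are hypothetical, and the placeholder is assumed never to occur in real content (the 'X' from the description would collide with ordinary text).

```php
<?php
// Hypothetical helper implementing the mask / encode / unmask workaround.
function encodeAmpersandsViaPlaceholder(string $html): string
{
    // Assumption: this byte sequence never appears in the edited HTML.
    $placeholder = "\x00AMP\x00";

    // Step 1: hide every '&' that sits inside an HTML comment.
    $masked = preg_replace_callback(
        '/<!--.*?-->/s',
        function (array $m) use ($placeholder): string {
            return str_replace('&', $placeholder, $m[0]);
        },
        $html
    );
    if ($masked === null) {
        return $html; // PCRE error: leave the input untouched
    }

    // Step 2: encode all remaining ampersands (these are outside comments).
    $encoded = str_replace('&', '&amp;', $masked);

    // Step 3: restore the original ampersands inside the comments.
    return str_replace($placeholder, '&', $encoded);
}

echo encodeAmpersandsViaPlaceholder('a & b <!-- c & d --> &lt; e'), "\n";
// prints: a &amp; b <!-- c & d --> &amp;lt; e
```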
<!-- I have a comment here & here & here -->
<!-- I have a comment here [[AMPERSAND]] here & here -->
<!--) so it hits that first one and changes it because it falls between the comments. It does not say to repeat within the comment itself though, it says to move on and find the next beginning comment and start again. Therefore the callback routine is required to further analyze the data between the comments in order to catch and replace multiple occurrences of ampersands between the comment begin/end tags.
The only issue you ran into is what is called greediness. Quantifiers are "greedy" but if a quantifier is followed by a question mark, then it ceases to be greedy, and instead matches the minimum number of times possible. There are two ways to overcome the greediness in PHP. First, as was just stated: you can make them "ungreedy" by following them with a question mark. PHP also has a "U" modifier that makes them all ungreedy for that particular pattern.
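A small demonstration of the difference, using a made-up sample string: the same pattern is run greedy, with a trailing question mark, and with the "U" modifier. Note that under "U" the meaning of the question mark is inverted, so it is usually best not to combine the two.

```php
<?php
// Sample input with two separate comments (illustrative only).
$html = '<!-- one --> text <!-- two -->';

// Greedy: ".*" grabs as much as possible, so both comments become one match.
preg_match_all('/<!--.*-->/s', $html, $greedy);

// Lazy: ".*?" stops at the first "-->", giving one match per comment.
preg_match_all('/<!--.*?-->/s', $html, $lazy);

// "U" modifier: every quantifier in the pattern is ungreedy by default.
preg_match_all('/<!--.*-->/sU', $html, $ungreedy);

echo count($greedy[0]), "\n";   // prints: 1
echo count($lazy[0]), "\n";     // prints: 2
echo count($ungreedy[0]), "\n"; // prints: 2
```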
To get where you are going, to actually modify the ampersands outside of the comments, you need to start your pattern just the opposite. The ending comment tag would be the beginning of your pattern and the beginning comment tag would be the end of your pattern. Watch out though so you don't end up changing entities that are already encoded! For example, if there is a &lt; in your HTML, you want to make sure you don't change it to &amp;lt; now!
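If double-encoding is not wanted, one common way to guard against it is a negative lookahead that skips any '&' which already starts an entity. This is a sketch under assumptions: the function name is hypothetical, the entity check is purely syntactic (a name or a numeric reference followed by ';') and does not validate that the named entity actually exists, and the sketch does not treat comments specially.

```php
<?php
// Hypothetical helper: encodes only "bare" ampersands, leaving entities alone.
function encodeBareAmpersands(string $html): string
{
    // '&' NOT followed by: name; or #digits; or #xhex;
    $pattern = '/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/';

    $result = preg_replace($pattern, '&amp;', $html);

    // preg_replace() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

echo encodeBareAmpersands('Tom & Jerry &lt; 3 &#169; AT&T'), "\n";
// prints: Tom &amp; Jerry &lt; 3 &#169; AT&amp;T
```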
RonPK nailed it.
@coopster: Many thanks for your explanation. Yep, RonPK was pretty much on the button, however, the regular expression was still too greedy (as you mention). I think this was because an HTML comment might not contain an '&' at all (the character we are looking for) but the regexp wanted exactly 1. So, in the following example, all '&' would end up being converted between the first '<!--' and the very last '-->'.
<!-- An HTML Comment -->
<p> &amp; &gt; &lt;</p>
<!--
<p> &amp; &gt; &lt;</p>
-->
I realise now that this could be solved by simply placing a '?' after the '&' in RonPK's original regexp to make the '&' optional. Or simply match the whole comment (which I have done) since it's the callback function which actually does the search and replace, passing the container to search in.
To get where you are going, to actually modify the ampersands outside of the comments, you need to start your pattern just the opposite. The ending comment tag would be the beginning of your pattern...
Ha, yeah, not sure why I didn't think of that! Thanks. :) Will give that a go sometime... the closing/opening comment tags will need to be optional, as there might not be any comments... hhmmm...
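One way to sidestep the "comments might not exist" problem is not the reversed pattern suggested above but an alternation: match either a whole comment or a single '&', and let the callback return comments untouched. A sketch only, assuming PHP 7 or later; the function name is hypothetical.

```php
<?php
// Hypothetical helper: encodes '&' everywhere except inside HTML comments.
function encodeAmpersandsSkippingComments(string $html): string
{
    $result = preg_replace_callback(
        // First alternative consumes a whole comment, second a lone ampersand.
        '/<!--.*?-->|&/s',
        function (array $m): string {
            // A comment is returned unchanged; a bare '&' is encoded.
            return $m[0] === '&' ? '&amp;' : $m[0];
        },
        $html
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

echo encodeAmpersandsSkippingComments('a & b <!-- c & d --> e & f'), "\n";
// prints: a &amp; b <!-- c & d --> e &amp; f
```

Because the comment alternative is tried first at each position, ampersands inside a comment are swallowed as part of the comment match and never reach the second alternative. Input without any comments simply has every '&' encoded.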
Watch out though so you don't end up changing entities that are already encoded!
Actually, that is exactly what I am trying to do. A bit of background... I'm not trying to correct invalid HTML, rather preserve valid HTML for display on the page. This is all part of an online Text/HTML editor (a CMS if you like). The HTML content is loaded into a <textarea> to be edited. HTML entities, however, get converted by the browser into the actual characters for display, so the HTML entity is lost. So &lt; is converted into &amp;lt; so that it actually displays in the textarea as &lt;. However, I've found that in IE and FF, HTML entities that appear within HTML comments do not get converted into the actual characters; they stay as HTML entities (hence this thread), except in Opera, where they do get converted (hhmmmm)!
penders, better use HTML parser
@chorny: The content is actually passed through an XHTML parser afterwards to attempt validation, but the editor itself is really only intended to be a plain text editor; the users are all HTML whizzes - well, kinda! :)