I am trying to build a regular expression to scrub incoming text and remove the markup that comes into the WYSIWYG editor from Word (even though we explicitly tell users to copy their text into Notepad first).
We end up with stuff like this:
<li class="MsoNormal">
<span style="font-size: 9pt; font-family: Arial">
What I want to do is remove the offending markup but leave the original HTML tag in place. So:
<li class="MsoNormal">
<span style="font-size: 9pt; font-family: Arial">
would become:
<li>
<span>
I've been trying to get my head around regular expressions for a few days, but can't quite get it. I *think* that this is the expression:
^[[:space:]](class|font|style)=\".\"$
but if I enter any of the above examples and eregi them, it doesn't find the pattern I'm looking for.
Can anyone see what I'm doing wrong? TIA
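For what it's worth, here is a rough sketch of why that expression cannot match a whole tag. The `^` and `$` anchors require the entire string to be nothing but the attribute, `[[:space:]]` matches exactly one whitespace character, and `.` matches exactly one character between the quotes. The sketch uses preg_match with PCRE delimiters as a stand-in for eregi (the ereg functions were removed in PHP 7); the variable names are made up for illustration.

```php
<?php
// PCRE rendering of the attempted pattern; the trailing "i" stands in for eregi's case-insensitivity.
$attempt = '/^[[:space:]](class|font|style)="."$/i';

$tag = '<li class="MsoNormal">';

// Fails: the string starts with "<", not whitespace, and the value is longer than one character.
$strictOnTag = preg_match($attempt, $tag);

// Matches only when the whole string is one space plus an attribute with a one-character value.
$strictOnBare = preg_match($attempt, ' class="x"');

// Unanchored variant: one or more spaces, then the attribute with any double-quoted value.
$loose = '/[[:space:]]+(class|font|style)="[^"]*"/i';
$looseOnTag = preg_match($loose, $tag);

echo $strictOnTag, ' ', $strictOnBare, ' ', $looseOnTag, "\n"; // 0 1 1
```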
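// Keep the tag name and drop everything up to the closing ">".
// Note: this strips ALL attributes (href included), not just class/font/style.
$pattern = '/(<[^\s>]+)[^>]*(>)/';
$subject = preg_replace($pattern, '$1$2', $subject);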
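A quick sketch of what a catch-all pattern of this kind does to mixed markup. The input string and the `page.html` link are invented for illustration. The tag-name part is written as `[^\s>]+` so that it stops at the first `>` as well as at whitespace.

```php
<?php
// Catch-all: capture "<tagname", skip any attributes, capture the closing ">".
$pattern = '/(<[^\s>]+)[^>]*(>)/';

// Hypothetical input: one Word-style class plus a link we would rather keep.
$in  = '<li class="MsoNormal"><a href="page.html" style="color: red">Link</a>';
$out = preg_replace($pattern, '$1$2', $in);

echo $out, "\n"; // <li><a>Link</a>
```

The href disappears along with the class and style, so this approach is probably only suitable when no attribute needs to survive.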
$pattern = '/\s+(class|font|style)="[^"]+"/';
$subject = preg_replace($pattern, '', $subject);
You can always extend it to handle single quotation marks or even missing quotation marks, but I think MS Word always puts double quotes in.
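As a minimal sketch of that extension, the pattern below accepts double-quoted, single-quoted, or unquoted values and ignores case. The function name `strip_word_attributes` is hypothetical, not something from the posts above. Treat it as a starting point: it works on raw text, so it would also remove something like ` style="x"` appearing outside a tag.

```php
<?php
// Hypothetical helper: remove class/font/style attributes however they are quoted.
function strip_word_attributes($html)
{
    // \s+                    whitespace before the attribute
    // (?:class|font|style)   attribute names to drop
    // \s*=\s*                equals sign with optional spaces
    // "..." | '...' | bare   double-quoted, single-quoted, or unquoted value
    $pattern = '/\s+(?:class|font|style)\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)/i';
    return preg_replace($pattern, '', $html);
}

echo strip_word_attributes('<li class="MsoNormal">'), "\n";
// <li>
echo strip_word_attributes("<P dir=\"ltr\" STYLE='margin-right: 0px' align=\"left\">"), "\n";
// <P dir="ltr" align="left">
echo strip_word_attributes('<span class=MsoNormal>'), "\n";
// <span>
```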
Unfortunately, that last pattern is not working. No text is being matched as "illegal" and removed.
I'm using this code:
$pattern = '/\s+(class|font|style)="[^"]+"/';
$subject = preg_replace($pattern, '', $subject);
if (eregi($pattern, $subject)) echo "illegal formatting found!";
else echo 'formatting looks good!';
echo '<br /><br />NEW TEXT:<br /> <textarea name="new" cols="75" rows="4">'.$subject.'</textarea><br /><br />';
The "new text" is always the same as the original text:
<a href="RetrievePage?site=endicott&page=IntlStudyAbIntlIntern">International Internships</a><BR>
<span style="font-size: 9pt; font-family: Arial">Assistive Technology Journal Co-founder, 1988 </span>
<li class="MsoNormal">
<P dir="ltr" STYLE='margin-right: 0px' align="left">
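Two things in the test code stand out, and the sketch below tries to account for both. First, eregi is a POSIX regex function and does not understand the `/.../` delimiters or `\s`, so preg_match is the matching counterpart to preg_replace. Second, the check runs after the replacement, when there would be nothing left to find. The sketch assumes the pipe characters in the pattern are real `|` characters and adds the `i` flag so `STYLE` would be caught too; it uses two of the sample lines as input and does not cover single-quoted values.

```php
<?php
// Two lines taken from the sample text above.
$subject = '<span style="font-size: 9pt; font-family: Arial">Assistive Technology Journal Co-founder, 1988 </span>' . "\n"
         . '<li class="MsoNormal">';

$pattern = '/\s+(class|font|style)="[^"]+"/i';

// Test BEFORE replacing; afterwards there is nothing left to match.
$found = preg_match($pattern, $subject) === 1;

$clean = preg_replace($pattern, '', $subject);

echo $found ? 'illegal formatting found!' : 'formatting looks good!', "\n";
// Escape the result before echoing it into a textarea.
echo htmlspecialchars($clean), "\n";
```

If the text still comes back unchanged with this, it may be worth dumping the raw `$subject` to see exactly what characters arrive from the form.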