Regular Expression to Scrub Text Copied From Word

Forum Moderators: coopster

bomburmusicmallet

8:45 pm on Feb 26, 2007 (gmt 0)

Hi,

I am trying to build a regular expression to scrub incoming text and remove the markup that comes into the WYSIWYG editor from Word (even though we explicitly tell the users to copy their text into Notepad first).

We end up with stuff like this:

What I want to do is to remove the offensive mark-up, but leave the original HTML tag there. So:

would become:

<li>

I've been trying to get my head around regular expressions for a few days, but can't quite get it. I *think* that this is the expression:

^[[:space:]](class¦font¦style)=\".\"$

but if I enter any of the above examples and eregi them, it doesn't find the pattern I'm looking for.

Can anyone see what I'm doing wrong? TIA

coopster

9:05 pm on Feb 26, 2007 (gmt 0)

If you only want the element itself you could go with something like ...

$pattern = '/(<[^\s>]+)[^>]*(>)/';
$subject = preg_replace($pattern, '$1$2', $subject);

The pattern says to match an opening marker (<) followed by one or more characters that are neither whitespace nor a closing marker (the tag name), followed by zero or more characters that are not a closing marker (the attributes), followed by the closing marker (>). The two sets of parentheses capture the pieces we want to keep, and the replacement refers to them as $1 and $2.
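To make the effect of this pattern concrete, here is a minimal sketch that wraps it in a function and runs it on a sample. The function name `strip_all_attributes` and the sample markup are hypothetical, and excluding `>` from the tag-name character class is an assumption added so the tag name cannot run past the end of a tag such as `<b>`.

```php
<?php
// Strip every attribute from every tag, keeping only the tag name.
// Excluding ">" from the first character class is an assumption made here
// so that the tag name cannot swallow text after a tag like "<b>".
function strip_all_attributes(string $html): string
{
    return preg_replace('/(<[^\s>]+)[^>]*(>)/', '$1$2', $html);
}

// Hypothetical sample in the style of Word-pasted markup.
$sample = '<li class="MsoNormal" style="margin: 0in">Item</li>';
echo strip_all_attributes($sample), "\n";
```

Because the pattern discards everything between the tag name and the closing marker, it cannot tell a wanted attribute from an unwanted one.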

bomburmusicmallet

9:26 pm on Feb 26, 2007 (gmt 0)

That works, but it also removes links, alignment and image stuff too.

<a href="../../includes/error.php">Enter junk text</a>

became:

<a>Enter junk text</a>

coopster

9:49 pm on Feb 26, 2007 (gmt 0)

OK, that's why I asked. Then you weren't too far off in your first attempt. The main thing you missed was to match one or more of anything that is not a double quotation mark between the quotation marks that enclose the attribute value.

$pattern = '/\s+(class|font|style)="[^"]+"/';
$subject = preg_replace($pattern, '', $subject);

You can always add to it to check for single quotation marks or even missing quotation marks, but I think MS Word always drops the doubles in.
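As an illustration of the targeted approach, the following sketch applies a pattern of that shape, typed with plain `|` characters, to a sample that mixes a Word class with a link. The function name `scrub_word_attributes` and the sample string are hypothetical.

```php
<?php
// Remove only double-quoted class, font, and style attributes.
// Other attributes, such as href, are left alone.
function scrub_word_attributes(string $html): string
{
    return preg_replace('/\s+(class|font|style)="[^"]+"/', '', $html);
}

// Hypothetical sample combining Word residue with a link worth keeping.
$sample = '<li class="MsoNormal"><a href="../../includes/error.php">Enter junk text</a></li>';
echo scrub_word_attributes($sample), "\n";
```

Note that the pattern does not check that it is inside a tag, so a literal ` style="..."` in ordinary text would be removed as well.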

bomburmusicmallet

9:38 pm on Mar 5, 2007 (gmt 0)

I'm just getting back to this problem at work...

Unfortunately, that last pattern is not working. No text is being matched as "illegal" and removed.

I'm using this code:

$pattern = '/\s+(class¦font¦style)="[^"]+"/';
$subject = preg_replace($pattern, '', $subject);

if (eregi($pattern, $subject)) echo "illegal formatting found!";
else echo 'formatting looks good!';

echo ' NEW TEXT: <textarea name="new" cols="75" rows="4">'.$subject.'</textarea> ';

The "new text" is always the same as the original text:

<a href="RetrievePage?site=endicott&page=IntlStudyAbIntlIntern">International Internships</a> 
Assistive Technology Journal Co-founder, 1988 
<li class="MsoNormal">

coopster

10:05 pm on Mar 5, 2007 (gmt 0)

Works fine for me, except that it missed the uppercase STYLE attributes, for two reasons. First, the pattern is case-sensitive, which is easily fixed by adding the case-insensitive modifier (i). Second, those attributes use single quotation marks as delimiters; as I said before, you'll have to modify the pattern to catch those as well if necessary.
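One possible way to cover both points is sketched below: the `i` modifier handles uppercase attribute names, and the value part accepts double-quoted, single-quoted, or unquoted values. The constant and function names and the sample string are hypothetical. The check uses `preg_match` because `eregi` is a POSIX-regex function that does not understand `/.../` delimiters or `\s` (it was removed in PHP 7.0), and it runs before the replacement, since a check made afterwards would have nothing left to find.

```php
<?php
// Case-insensitive pattern that accepts double-quoted, single-quoted,
// or unquoted values for class, font, and style attributes.
const WORD_ATTR_PATTERN =
    '/\s+(?:class|font|style)\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)/i';

// True when the markup still contains one of the unwanted attributes.
function has_word_attributes(string $html): bool
{
    return preg_match(WORD_ATTR_PATTERN, $html) === 1;
}

// Remove the unwanted attributes regardless of case or quoting style.
function scrub_word_attributes_any_quotes(string $html): string
{
    return preg_replace(WORD_ATTR_PATTERN, '', $html);
}

// Hypothetical sample with an uppercase, single-quoted STYLE attribute.
$sample = "<P STYLE='margin: 0in' class=MsoNormal>Co-founder, 1988</P>";
echo has_word_attributes($sample) ? "illegal formatting found!\n" : "formatting looks good!\n";
echo scrub_word_attributes_any_quotes($sample), "\n";
```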

bomburmusicmallet

3:06 pm on Mar 6, 2007 (gmt 0)

Thank you! I got it to work once I made my pipes all the same.

coopster

3:31 pm on Mar 6, 2007 (gmt 0)

Ah yes, I forgot to remind you that the forum converts the pipe character (|) to a broken bar (¦), so you need to rekey the pipes after copying code from a post! Glad you got it sorted.
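The pitfall can be reproduced with a small sketch. It assumes the pasted character is U+00A6 (BROKEN BAR) encoded as UTF-8; the constant `BROKEN_BAR` and the helper `rekey_pipes` are hypothetical names. To the regex engine a broken bar is just a literal character, so the group no longer acts as an alternation.

```php
<?php
// U+00A6 BROKEN BAR, written as UTF-8 bytes so the source file encoding
// does not matter. It looks like a pipe but is an ordinary literal to PCRE.
const BROKEN_BAR = "\xC2\xA6";

// Hypothetical helper: turn broken bars in a pasted pattern back into pipes.
function rekey_pipes(string $pattern): string
{
    return str_replace(BROKEN_BAR, '|', $pattern);
}

$pasted = '/\s+(class' . BROKEN_BAR . 'font' . BROKEN_BAR . 'style)="[^"]+"/';
$html   = '<li class="MsoNormal">';

var_dump(preg_match($pasted, $html));              // literal broken bars: no match
var_dump(preg_match(rekey_pipes($pasted), $html)); // real alternation: match
```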

phranque

2:31 am on Mar 7, 2007 (gmt 0)

"Ah yes, I forgot to remind you that the forum breaks the pipes and you would need to rekey them! Glad you got it sorted."

I wonder if it is possible to include this disclaimer at the bottom of any post or reply in which the pipe symbol has been filtered/converted.

coopster

5:25 am on Mar 8, 2007 (gmt 0)

That would be a great idea, except that not all posts are using code. Know what I mean?