Regular expression to strip certain HTML tags - PHP Server Side Scripting forum at WebmasterWorld - WebmasterWorld

Forum Moderators: coopster

Regular expression to strip certain HTML tags

premasagar

3:13 pm on Mar 8, 2004 (gmt 0)

Hello all,
I am new to the world of PHP and regular expressions and have been scratching my head trying to solve this one...

I wish to match and remove all html tags except a few used for basic text formatting:
b, i, strong, em, p, font, br, div, span, acronym

So the following regex matches the opening and closing tags I stated (including the XHTML form, <br />):

'</?(b|strong|i|em|p|font|div|span|acronym|br(?/)?)(?=>)[^<>]*>'

But I want to match all tags that are not those tags! I've looked at regex references and there doesn't seem to be a simple NOT function. There's the [^...] construct, but that negates a set of single characters rather than a whole word, so it doesn't seem to be what I need.

E.g. something like:
'</?[^((b)|(strong)|(i)|(em)|(p)|(font)|(div)|(span)|(acronym)|(br(?/)?))(?=>)][^<>]*>'

I also thought this might work:
'</?((b|strong|i|em|p|font|div|span|acronym|br(?/)?)(?=>)){0}[^<>]*>'

...but it didn't.

If anyone can point me in the right direction, I'd be so happy. Then my brain could rest.

Thank you!
Prem
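
A note on the question above: PCRE, which backs PHP's preg_* functions, has a negative lookahead, (?!...), that can play the role of the missing NOT. The following is only a minimal sketch, assuming the tag list from the post; the variable names and the sample input are made up for illustration. It also leaves any attributes on the allowed tags untouched.

```php
<?php
// Tag names to keep (taken from the post above).
$allowed = 'b|strong|i|em|p|font|div|span|acronym|br';

// (?!...) is a negative lookahead: the match is abandoned when an allowed
// tag name, ending at a word boundary, follows "<" or "</".
// [a-z] then requires a real tag name, so "</p>" cannot slip through by
// treating the slash as part of the name.
$pattern = '#</?(?!(?:' . $allowed . ')\b)[a-z][^<>]*>#i';

// Hypothetical sample input.
$input  = '<p>Hello <script>alert(1)</script><b>world</b><br /></p>';
$output = preg_replace($pattern, '', $input);

echo $output, "\n"; // <p>Hello alert(1)<b>world</b><br /></p>
```

Only the tags themselves are removed; the text that was inside a removed tag, such as alert(1) here, stays in the output.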

lorax

3:52 pm on Mar 8, 2004 (gmt 0)

Welcome to WebmasterWorld!
You may want to check into strip_tags() [us4.php.net].
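
For reference, a minimal sketch of strip_tags() with its second argument, the list of allowable tags. The tag list is the one from the first post; the variable names and the sample input are made up for illustration.

```php
<?php
// Allowable tags are passed as one string of opening tags.
// "<br>" also covers the XHTML form "<br />".
$allowed = '<b><i><strong><em><p><font><br><div><span><acronym>';

// Hypothetical sample input.
$input = '<p>Hello <script>alert(1)</script><b>world</b><br /></p>';

// Every tag not named in $allowed is removed; the text between tags is kept.
$clean = strip_tags($input, $allowed);

echo $clean, "\n";
```

Note that strip_tags() does not look inside the tags it keeps, so attributes on an allowed tag pass through unchanged.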

premasagar

4:14 pm on Mar 8, 2004 (gmt 0)

Aha! Isn't it just like that! You spend ages trying to solve something, then find a simple solution ready and waiting.

I think I did actually see that function, but overlooked the "allowable tags" option, and so didn't use it.

Any comments on whether the tags I have selected to allow are the best and only ones to permit formatted text from a user, but no nasties?

Prem.

premasagar

4:50 pm on Mar 8, 2004 (gmt 0)

Come to think of it, if I use the strip_tags function but allow tags such as <p>, that would still allow such things as:
<p onclick="myEvilFunction;">

I guess I could strip out onclick and other events with preg_replace, but are there other security implications with the method I plan to use?

Why is it that WebmasterWorld and other sites completely disable HTML tags in users' posts, in favour of proprietary font formatting tags, rather than the method I'm taking to allow simple, harmless tags?

Is it just so that text in posts will appear as text (such as I intended in the first paragraph of this reply) and not processed as HTML?

Thanks for any thoughts on this,
Prem.
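
The onclick concern above can be sketched as follows: run strip_tags() first, then rewrite every remaining tag to its bare form with preg_replace(), so that no attribute survives. This is one possible approach under stated assumptions, not a complete sanitizer. It assumes that dropping all attributes is acceptable (a font tag loses its color, for example), and the pattern can be fooled by an attribute value that contains ">". The variable names and the sample input are made up for illustration.

```php
<?php
$allowed = '<b><i><strong><em><p><font><br><div><span><acronym>';

// Hypothetical sample input with an event handler on an allowed tag.
$input = '<p onclick="myEvilFunction();">Hi <a href="x">there</a><br /></p>';

// Step 1: drop every tag that is not on the allow list.
$kept = strip_tags($input, $allowed);

// Step 2: rebuild each surviving tag from its optional leading slash ($1),
// its name ($2) and an optional trailing " /" ($3); everything else inside
// the angle brackets, i.e. the attributes, is discarded.
$clean = preg_replace('#<(/?)([a-z][a-z0-9]*)\b[^<>]*?(\s*/)?>#i', '<$1$2$3>', $kept);

echo $clean, "\n"; // <p>Hi there<br /></p>
```

Removing every attribute, rather than only the on* handlers, also covers other carriers of script such as style attributes.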

webdevjim

5:01 pm on Mar 8, 2004 (gmt 0)

Hey,
How about this:
The script first goes through the file and changes all the tags you want to keep to something unique.
Example: change all <b> tags to STARTBOLD and all </b> tags to ENDBOLD.

Then strip out all the remaining HTML.

As the last step, change all the STARTBOLD strings, etc. back into the correct HTML tags.
Example: replace("STARTBOLD", "<b>")

/Webdevjim
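
The three steps above could look roughly like this in PHP. It is only a sketch: the function name keepOnlyTags, the [[...]] placeholder format (used here instead of STARTBOLD and ENDBOLD so that one pattern covers every tag) and the sample input are all made up for illustration.

```php
<?php
// Hypothetical helper: keep only the bare tags named in $allowed.
function keepOnlyTags(string $html, array $allowed): string
{
    $names = implode('|', array_map('preg_quote', $allowed));

    // Step 1: replace each allowed tag with a placeholder, e.g. <b> -> [[b]].
    // Only tags without attributes are marked.
    $marked = preg_replace('#<(/?)(' . $names . ')\s*>#i', '[[$1$2]]', $html);

    // Step 2: strip all the HTML that is left.
    $stripped = strip_tags($marked);

    // Step 3: turn the placeholders back into real tags.
    return preg_replace('#\[\[(/?)(' . $names . ')\]\]#i', '<$1$2>', $stripped);
}

$result = keepOnlyTags(
    '<B>bold</B> and <script>alert(1)</script><em>em</em>',
    ['b', 'i', 'strong', 'em', 'p']
);

echo $result, "\n"; // <B>bold</B> and alert(1)<em>em</em>
```

Because only attribute-free tags are marked in step 1, a tag such as <p onclick="..."> is stripped along with everything else, which can leave an unmatched </p> behind.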

premasagar

3:25 pm on Mar 11, 2004 (gmt 0)

Thanks for your replies Lorax and Webdevjim.

I'm still wondering about the security problems with allowing these formatting tags and why sites such as WebmasterWorld don't allow them (I'm sure I have something to learn from them).

Any suggestions?