Forum Moderators: coopster

Stripping HTML tags

DFrag

12:35 am on Feb 20, 2004 (gmt 0)

10+ Year Member



Hello all,

Is there a function for stripping HTML tags from a string? I am trying to read in an HTML page and want to strip the tags from it. Is there an easy way of doing this?

Thanks
DFrag

DrDoc

3:48 am on Feb 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



$string = preg_replace("/<[^>]+>/","",$string);

I don't know exactly how you're planning on using this... But, if it's for a form (and you want to strip HTML out of the posts) you can also render it harmless by doing something like:

$string = preg_replace(array("/</","/>/"),array("&lt;","&gt;"),$string);
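For anyone trying the two lines above, here is a minimal sketch that runs both on the same input. The sample markup is made up for illustration. The htmlspecialchars() call is not part of the original suggestion; it is shown only as the built-in way to do the escaping, and it also escapes & and quotes.

```php
<?php
// Sample input; the markup is invented for this example.
$string = '<p>Hello <b>world</b></p>';

// Approach 1: remove anything that looks like a tag.
$stripped = preg_replace("/<[^>]+>/", "", $string);

// Approach 2: keep the markup visible but harmless by escaping the angle brackets.
$escaped = preg_replace(array("/</", "/>/"), array("&lt;", "&gt;"), $string);

// Built-in alternative (an addition, not from the post above).
$builtin = htmlspecialchars($string);

echo $stripped, "\n"; // Hello world
echo $escaped, "\n";  // &lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;
echo $builtin, "\n";  // same as $escaped for this input
```

Note that the order of the two patterns in approach 2 is safe because the replacement "&lt;" contains no ">" for the second pattern to catch.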

RonPK

8:41 am on Feb 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



preg gives you all the flexibility you could possibly need. If you're looking for a built-in function, ready to use, try strip_tags($string). It, erm, strips tags from the input string.
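A short sketch of strip_tags() in use, assuming the page has already been read into a string (the sample markup is invented). The second argument, a list of tags to keep, is optional.

```php
<?php
// Sample input; in practice this could come from file_get_contents().
$html = '<p>Hello <b>world</b></p>';

// Strip every tag.
$plain = strip_tags($html);

// Strip every tag except <b>.
$keepBold = strip_tags($html, '<b>');

echo $plain, "\n";    // Hello world
echo $keepBold, "\n"; // Hello <b>world</b>
```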

dcrombie

5:30 pm on Feb 20, 2004 (gmt 0)



What about just using strip_tags [php.net]?

webadept

11:37 pm on Feb 20, 2004 (gmt 0)

10+ Year Member



I think you will find that stripping a whole page of HTML with reliable accuracy is more than the simple regex given can handle, or even the more functional strip_tags. The strip_tags manual page will show you that. There is a great deal to HTML these days: DOM tags, inline JavaScript, all kinds of stuff.

Here's a regex I use. It is not perfect either; like all those on the PHP strip_tags page, it eventually misses something. But it is fast, and it works a good percentage of the time.
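To make the point concrete, here is a small sketch with an invented snippet that trips up the simple regex from earlier in the thread: a ">" inside an attribute value ends the match early, and the contents of a script block are left behind. strip_tags() also removes only the tags, so the script body stays in its output too.

```php
<?php
// Invented input with a ">" inside an attribute and an inline script.
$html = '<a title="a > b">link</a><script>var x = 1;</script>';

// The simple regex stops at the first ">", which is inside the attribute value.
$naive = preg_replace("/<[^>]+>/", "", $html);

// strip_tags() drops the <script> tags but keeps the JavaScript between them.
$builtin = strip_tags($html);

echo $naive, "\n"; //  b">linkvar x = 1;
echo $builtin, "\n";
```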

s{
  <                   # open tag
  (?:                 # open group (A)
    (!--) |           # comment (1) or
    (\?) |            # another comment (2) or
    (?i:              # open group (B) for /i
      ( TITLE |       # one of start tags
        SCRIPT |      # for which
        APPLET |      # must be skipped
        OBJECT |      # all content
        STYLE         # to correspond
      )               # end tag (3)
    ) |               # close group (B), or
    ([!/A-Za-z])      # one of these chars, remember in (4)
  )                   # close group (A)
  (?(4)               # if previous case is (4)
    (?:               # open group (C)
      (?!             # and next is not : (D)
        [\s=]         # \s or "="
        ["`']         # with open quotes
      )               # close (D)
      [^>] |          # and not close tag or
      [\s=]           # \s or "=" with
      `[^`]*` |       # something in quotes ` or
      [\s=]           # \s or "=" with
      '[^']*' |       # something in quotes ' or
      [\s=]           # \s or "=" with
      "[^"]*"         # something in quotes "
    )*                # repeat (C) 0 or more times
  |                   # else (if previous case is not (4))
    .*?               # minimum of any chars
  )                   # end if previous char is (4)
  (?(1)               # if comment (1)
    (?<=--)           # wait for "--"
  )                   # end if comment (1)
  (?(2)               # if another comment (2)
    (?<=\?)           # wait for "?"
  )                   # end if another comment (2)
  (?(3)               # if one of tags-containers (3)
    </                # wait for end
    (?i:\3)           # of this tag
    (?:\s[^>]*)?      # skip junk to ">"
  )                   # end if (3)
  >                   # tag closed
}{}gsx
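The regex above is written as a Perl substitution. Since this is the PHP forum, here is a hedged sketch of the same pattern run through preg_replace(). The function name strip_html_regex is made up for this example, and "~" is used as the delimiter because the pattern itself contains "/". The alternation bars are written as plain "|". Treat it as a starting point, not a tested port.

```php
<?php
// Hypothetical helper: the commented regex from the post, used via preg_replace().
function strip_html_regex($html)
{
    // Nowdoc keeps the quotes, backticks and backslashes in the pattern literal.
    $body = <<<'REGEX'
<                     # open tag
(?:
  (!--) |             # comment (1)
  (\?) |              # processing instruction (2)
  (?i:
    ( TITLE | SCRIPT | APPLET | OBJECT | STYLE )  # container tags (3)
  ) |
  ([!/A-Za-z])        # any other tag start (4)
)
(?(4)                 # ordinary tag: consume attributes, honouring quotes
  (?:
    (?! [\s=] ["`'] ) [^>] |
    [\s=] `[^`]*` |
    [\s=] '[^']*' |
    [\s=] "[^"]*"
  )*
|                     # comment, PI or container: shortest run of anything
  .*?
)
(?(1) (?<=--) )       # a comment must end with "--"
(?(2) (?<=\?) )       # a PI must end with "?"
(?(3)                 # a container runs to its own closing tag
  </ (?i:\3) (?:\s[^>]*)?
)
>                     # tag closed
REGEX;

    // s: dot matches newlines; x: whitespace and # comments are ignored.
    return preg_replace('~' . $body . "\n~sx", '', $html);
}

$page = '<title>T</title><p>Hello <b>world</b></p>'
      . '<script>alert("x > y")</script><!-- c -->';
echo strip_html_regex($page), "\n"; // Hello world
```

Unlike strip_tags(), this removes the contents of title, script, applet, object and style elements along with the tags, which is usually what you want when extracting readable text.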

DFrag

10:19 pm on Feb 22, 2004 (gmt 0)

10+ Year Member



Thanks for the replies people. I think the strip_tags will suffice for the time being.