Forum Moderators: coopster & jatar k

Extracting text enclosed between HTML tags

10:56 am on Oct 22, 2009 (gmt 0)

5+ Year Member

Hi there, I'm a big PHP rookie, so I'm still experimenting with regular expressions.

My question is: what code should I use to extract each string of text enclosed between any possible HTML tags?

12:30 pm on Oct 22, 2009 (gmt 0)

5+ Year Member

echo strip_tags(preg_replace(
    array(
        // Elements removed together with everything inside them
        '@<head[^>]*?>.*?</head>@siu',
        '@<style[^>]*?>.*?</style>@siu',
        '@<script[^>]*?>.*?</script>@siu',
        '@<object[^>]*?>.*?</object>@siu',
        '@<embed[^>]*?>.*?</embed>@siu',
        '@<applet[^>]*?>.*?</applet>@siu',
        '@<noframes[^>]*?>.*?</noframes>@siu',
        '@<noscript[^>]*?>.*?</noscript>@siu',
        '@<noembed[^>]*?>.*?</noembed>@siu',
        // Block-level tags: a newline is inserted in front of each one
        '@<((br)|(hr))@iu',
        '@</?((address)|(blockquote)|(center)|(del))@iu',
        '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
        '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
        '@</?((table)|(th)|(td)|(caption))@iu',
        '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
        '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
        '@</?((frameset)|(frame)|(iframe))@iu',
    ),
    array(
        // One replacement per pattern above, in the same order
        ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
        "\n\$0", "\n\$0", "\n\$0", "\n\$0",
        "\n\$0", "\n\$0", "\n\$0", "\n\$0",
    ),
    $html
));

12:32 pm on Oct 22, 2009 (gmt 0)

5+ Year Member

Let me know if it works for you. Also, if any broken pipes (¦) show up in the patterns, replace them with solid ones (|); the forum converts them when a post is displayed.

I'm pretty sure that's the right piece, though I might have accidentally left something out, because my filter is extremely long (I put all the code on one line instead of building it up in memory).
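
To illustrate why the snippet pre-processes the markup before calling strip_tags(), here is a reduced sketch with only two of the patterns. The sample `$html` string and the variable names are made up for the example; the behaviour shown is just what preg_replace() and strip_tags() do with this input, not a claim about the poster's full filter.

```php
<?php
// Sample input invented for this illustration.
$html = '<div>first</div><p>second</p><script>var x = 1;</script>';

// strip_tags() alone only removes the tags themselves: the words run
// together and the script body is kept as plain text.
$naive = strip_tags($html);

// Reduced version of the filter: drop <script> blocks entirely, then put
// a newline in front of block-level tags so their text stays separated.
$patterns = array(
    '@<script[^>]*?>.*?</script>@siu',
    '@</?((div)|(p))@iu',
);
$replacements = array(' ', "\n\$0");

$text = strip_tags(preg_replace($patterns, $replacements, $html));

// Split on whitespace to get the individual pieces of text.
$pieces = preg_split('/\s+/', trim($text));

var_dump($naive);
print_r($pieces);
```

The arrays are applied pairwise and in order, so each pattern needs a replacement at the same position in the second array.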

12:45 pm on Oct 22, 2009 (gmt 0)

5+ Year Member

The thing is, I need to manipulate the text strings one by one, without their HTML tags, because after manipulating each piece of text I need to put it back where it was.


<div class="someclass">this is enclosed text</div>
<p> other text to be manipulated here</p>

-- do the job --
<div class="someclass">this is the new enclosed text</div>
<p> other text has been manipulated here</p>

Basically I should (pseudocode):
- parse the HTML code
- for each piece of text enclosed in HTML tags, do:
----- check whether the text is inside a suitable tag; for example, skip (but keep) text inside <meta>, <script>, <link> & <style> tags
----- manipulate the enclosed text if needed
- move on to the next match and repeat
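
The steps above could be sketched with PHP's DOM extension instead of regular expressions, since a parser keeps track of where each piece of text belongs. This is only a rough sketch under some assumptions: PHP 7 or later with the DOM extension enabled, an HTML fragment rather than a full page as input, and a made-up helper name (`transformTextNodes`) and wrapper id (`fragment-root`) that do not come from the thread.

```php
<?php
// Hypothetical helper: runs $callback on every text node that is not
// inside one of the $skip tags, and returns the modified markup.
function transformTextNodes(string $html, callable $callback, array $skip = array('script', 'style')): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings quietly; real-world markup is rarely perfect.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog is a common trick to make loadHTML() treat the input
    // as UTF-8. The wrapper div gives the fragment a single known parent.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="fragment-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $node) {
        $parent = strtolower($node->parentNode->nodeName);
        // Skip (but keep) text inside excluded tags and pure whitespace.
        if (in_array($parent, $skip, true) || trim($node->data) === '') {
            continue;
        }
        // The text is changed in place, so it stays where it was.
        $node->data = $callback($node->data);
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage with the example from the post, upper-casing each piece of text.
$sample = '<div class="someclass">this is enclosed text</div>' . "\n"
        . '<p> other text to be manipulated here</p>'
        . '<script>var x = 1;</script>';
echo transformTextNodes($sample, 'strtoupper');
```

Because each text node is edited in place, there is no separate "put it back" step. Tags such as <meta> and <link> contain no text nodes, so only <script> and <style> need to be listed in the skip array.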