Removing HTML tags and the code in between - PHP Server Side Scripting forum at WebmasterWorld - WebmasterWorld

Removing HTML tags and the code in between

glimbeek

9:42 am on Sep 9, 2010 (gmt 0)

10+ Year Member

Hi,

I have the following HTML code:


A line of text, which is of any length. Ohw yes it is!<br /><br />
<h2>And a header</h2>
<img src="image.jpg">With an image and a <a href="/">link</a>. With some more text.

After "cleaning" the code, I want to end up with the following:


A line of text, which is of any length. Ohw yes it is! With an image and a <a href="/">link</a>. With some more text.

In other words, I want to remove all the HTML tags except the link.

I searched on Google, this forum and others.
I tried regular expressions [en.wikipedia.org...] combined with preg_
I tried [simplehtmldom.sourceforge.net...]
I tried [php.net...]
Nothing seemed to work.

In the end I ended up with the following:


$string    = $this->item->fulltext;
$search    = array('/\<(.+)>(.*)>/si'); //Strip out HTML tags
$string    = preg_replace($search, '', $string);

I do believe that a regular expression is the way to go, but I'm struggling to come up with one that covers everything I need.

Any help would be greatly appreciated.

With kind regards,
George

morehawes

10:29 am on Sep 9, 2010 (gmt 0)

10+ Year Member

Nothing seemed to work.

Hi, what did you get after using strip_tags($text, '<a>')?

glimbeek

10:32 am on Sep 9, 2010 (gmt 0)

10+ Year Member

Thank you for the reply.

Haven't tried that, but looking at how strip_tags works... I guess I would get all the text and the link. So also the text between the <h2> tags, which I don't want.
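
To make the guess above concrete, here is a small sketch using the HTML from the first post (the closing quote on the img src is an assumption about what was intended). strip_tags() drops the tags but keeps the text inside them, so the heading text survives:

```php
<?php
// Sample input from the first post; the closing quote on src is assumed.
$html = 'A line of text, which is of any length. Ohw yes it is!<br /><br />' . "\n"
      . '<h2>And a header</h2>' . "\n"
      . '<img src="image.jpg">With an image and a <a href="/">link</a>. With some more text.';

// Keep only <a>; every other tag is dropped, but its inner text is not.
$result = strip_tags($html, '<a>');

echo $result;
```

The output still contains "And a header", which is the part that has to be removed separately.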

wrightee

11:00 am on Sep 9, 2010 (gmt 0)

10+ Year Member

If it's just the h2 you need to remove the content from, why not strip that out first with a regex, then use strip_tags for the rest, retaining the <a>'s?

$t=strip_tags(preg_replace("/<h2>[^>]+>/","",$t),'<a>');

The h2 regex needs a bit of work, of course, to handle spaces, case etc.
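
As a rough sketch of that extra work: the pattern above stops at the first > after <h2>, so it relies on the heading having no attributes and no nested tags. The variant below allows attributes, any letter case and content spanning several lines. The function name strip_h2_keep_links is made up for this example and is not from the thread.

```php
<?php
// Hypothetical helper name; not from the thread.
function strip_h2_keep_links(string $html): string
{
    // <h2 ...> with optional attributes, any case, content across newlines,
    // up to the first closing </h2> (lazy match).
    $withoutH2 = preg_replace('/<h2\b[^>]*>.*?<\/h2\s*>/is', '', $html);

    // Drop all remaining tags except links; their text content stays.
    return strip_tags($withoutH2, '<a>');
}

$t = '<H2 class="title">A <em>styled</em> title</H2>Text with a <a href="/">link</a> and <b>bold</b>.';
echo strip_h2_keep_links($t);
```

This should print: Text with a <a href="/">link</a> and bold.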

glimbeek

11:57 am on Sep 9, 2010 (gmt 0)

10+ Year Member

It's just the h2 in the above example, but in my original post it's also the image, and obviously there are more HTML tags to think of.

wrightee

12:04 pm on Sep 9, 2010 (gmt 0)

10+ Year Member

You seemed to say you wanted: All the text, except that between the h2, without any HTML other than the link - that's what the line above does.

What would you want from this string:

<h2>A title for</h2><p>Some text with <b>tags</b> and <a href="x">links</a> with</p><p>odd paras and <img src='x'> images.</p><h1>Bigger title</h1><div>for some stuff</div>

glimbeek

12:25 pm on Sep 9, 2010 (gmt 0)

10+ Year Member

I didn't think this through... I want to keep a lot more than I initially thought.

From your example, wrightee, I would want:
<p>Some text with <b>tags</b> and <a href="x">links</a> with</p><p>odd paras and images.</p>for some stuff

Please note though that the content I'm checking is added with an editor, the JCE editor for Joomla, so it's "clean". It's set up so there shouldn't be any <p> or <div> tags.

Could trying to achieve the above get really complicated?
What if we do it the other way around?
Check for the tags I don't want with a regular expression, by putting the things I don't want in an array of some sort?

wrightee

12:41 pm on Sep 9, 2010 (gmt 0)

10+ Year Member

Sounds like you want to remove a few items and keep a lot, in which case just use a regex to strip the ones you don't want - either a complicated one to do them all in one go, or a loop over an array.
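
A minimal sketch of the "all in one go" option, assuming the unwanted elements are listed by name in an array. The function name remove_elements and the sample list are invented for illustration.

```php
<?php
// Hypothetical helper: remove the listed elements together with their content.
function remove_elements(string $html, array $tags): string
{
    // Escape each name and join into an alternation, e.g. h1|h2.
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '/');
    }, $tags));

    // \1 makes the closing tag match the same name as the opening tag.
    $pattern = '/<(' . $names . ')\b[^>]*>.*?<\/\1\s*>/is';

    return preg_replace($pattern, '', $html);
}

$html = '<h2>A title for</h2><p>Some text with <b>tags</b> and <a href="x">links</a> with</p>'
      . '<h1>Bigger title</h1><div>for some stuff</div>';

echo remove_elements($html, ['h1', 'h2']);
```

This only suits elements that have a closing tag and are not nested inside themselves; something like <img> has no closing tag and is easier to leave to strip_tags.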

glimbeek

2:09 pm on Sep 9, 2010 (gmt 0)

10+ Year Member

Not sure this is 100% foolproof, but I managed to come up with the following:

$string = $this->item->fulltext;
$text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
$text = strip_tags($text, '<b><i><em><strong><a>');

This removes the H2, including its contents, from the string,
and then it removes any other tags except the ones I want to keep. The next step would be to also have it remove h3, h4, etc.
Can I put those in an array and pass that array to preg_replace?
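
For what it's worth, preg_replace() does accept an array of patterns and applies them one after another. A small sketch along the lines of the code above, with a made-up sample string and the tag patterns tightened slightly (\b[^>]* instead of .*), might look like this:

```php
<?php
// Invented sample input for illustration.
$string = '<h2>Heading two</h2>Intro with <b>bold</b> text.'
        . '<h3 class="sub">Heading three</h3>More <a href="/">link</a> text<img src="x.jpg">.';

// One pattern per element whose content should disappear as well.
$patterns = array(
    '|<h2\b[^>]*>.*</h2>|isU',
    '|<h3\b[^>]*>.*</h3>|isU',
);

// preg_replace() applies each pattern in turn when given an array.
$text = preg_replace($patterns, '', $string);
$text = strip_tags($text, '<b><i><em><strong><a>');

echo $text;
```

With a single replacement string, as here, the same replacement is used for every pattern in the array.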

wrightee

4:01 pm on Sep 9, 2010 (gmt 0)

10+ Year Member

Try:

$t=strip_tags(preg_replace("/<h[0-9]>[^>]+>/","",$t),'<b><i><em><strong><a>');

to remove H1,H2..H9 and leave b,i,em,strong,a

rocknbil

6:26 pm on Sep 9, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Give this a whirl. Threw some classes in there to see if it works for everything, looks like it does. :-)

<?php
header("content-type:text/html");

$sampletext= '
<h2>This is some text</h2>
<p>This is a link to <a style="some-style" href="https://www.example.com">example.com</a>.
I used <strong class="red">https</strong> so it doesn\'t <em class="bold">munge up</em> the code here.</p>
<p>Generally a <span style="color:#ff0000">not character</span> in a class is better than .* any character.</p>
<div id="some-div">Just a div to test other elements. <br> <br /> <br/> Looks like it works.</div>
<img src="xhtml-enthusiasts.jpg" alt=" you probably don\'t really even need XHTML" />
';
$reg = '/<\/*\b[^a][^>]*>/ims'; // The magic regex doing all the heavy lifting

echo "<h1>Original text</h1> $sampletext";
$cleansed = preg_replace($reg, '', $sampletext);
echo "<h1>Cleansed</h1> $cleansed";
?>

The output I get is

<h1>Original text</h1>
<h2>This is some text</h2>
<p>This is a link to <a style="some-style" href="https://www.example.com">example.com</a>.
I used <strong class="red">https</strong> so it doesn't <em class="bold">munge up</em> the code here.</p>
<p>Generally a <span style="color:#ff0000">not character</span> in a class is better than .* any character.</p>

<div id="some-div">Just a div to test other elements. <br> <br /> <br/> Looks like it works.</div>
<img src="xhtml-enthusiasts.jpg" alt=" you probably don't really even need XHTML" />
<h1>Cleansed</h1>
This is some text
This is a link to <a style="some-style" href="https://www.example.com">example.com</a>.
I used https so it doesn't munge up the code here.
Generally a not character in a class is better than .* any character.
Just a div to test other elements. Looks like it works.

Examination:

/ = regex delimiter

< = pattern starts with <

\/* = followed by zero or more /'s, which catches both opening and closing tags

\b = followed by a word boundary - doesn't work without it, because [^a] could then match the / in </a> and the closing tag would be stripped

[^a] = followed by any character but a (or A, because of the i modifier), so a is NOT the first character of the tag name. Putting it in the next character class would also skip span, class, etc.

[^>]* = followed by zero or more of anything NOT a >. You want *, zero or more, not +, one or more, or one-letter tags such as <p> and <b> won't match.

> = followed by the closing angle bracket

/ = ending regex delimiter

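ims = regex modifiers: i is case-insensitive, m is multiline, s lets . match newlines. Look 'em up; only i actually changes anything for this pattern, since it has no ^, $ or . in it. :-)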
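
One consequence of the [^a] step in the breakdown above is worth checking: it skips every tag whose name starts with a, not only <a>. The sketch below compares the pattern from the post with a variant that uses a negative lookahead. The lookahead version is my own substitution, not something from the post, and the sample string is invented.

```php
<?php
$html = '<p>An <abbr title="x">abbr</abbr> and a <a href="/">link</a> in <b>bold</b>.</p>';

// Pattern from the post: skips any tag whose name starts with a or A.
$original = '/<\/*\b[^a][^>]*>/ims';

// Assumed alternative: (?!a\b) rejects only the tag named exactly "a".
$lookahead = '/<\/?(?!a\b)[a-z][^>]*>/i';

$keptByOriginal  = preg_replace($original, '', $html);
$keptByLookahead = preg_replace($lookahead, '', $html);

echo $keptByOriginal, "\n";
echo $keptByLookahead, "\n";
```

The first line keeps both the <abbr> and the <a> tags; the second keeps only the link. Neither pattern removes the text inside the tags it strips.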

glimbeek

6:41 am on Sep 14, 2010 (gmt 0)

10+ Year Member

Hi rocknbil,

Thanks for the effort.
It seems to "work", in that it strips the tags. However, I don't want text like:
"This is some text"
It's in an <h2> tag, and the text nodes inside that tag need to be removed as well.

I came up with the following:


$string = $this->item->fulltext; 
$text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
$text = strip_tags($text, '<b><i><em><strong><a>');

But I'm not 100% sure it's the correct way to go about it. I "just" need to enhance this so it can easily remove more tags and the text nodes in those tags, for instance h3, h4 and <img. Although my code seems to remove the <img tag as well.

**EDIT**

The <img is removed by the strip_tags...
and for the h3 regex I can just copy the h2 one, so I end up with:


$string = $this->item->fulltext; 
$text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
$text = preg_replace('|\<h3.*\>(.*\n*)\</h3\>|isU', '', $text);
$text = strip_tags($text, '<b><i><em><strong><a>');

This should be ready to go by the looks of it.