
Removing HTML tags and the code in between

     
9:42 am on Sep 9, 2010 (gmt 0)

5+ Year Member



Hi,

I have the following HTML code:

A line of text, which is of any length. Ohw yes it is!<br /><br />
<h2>And a header</h2>
<img src="image.jpg">With an image and a <a href="/">link</a>. With some more text.


After "cleaning" the code, I want to end up with the following:

A line of text, which is of any length. Ohw yes it is! With an image and a <a href="/">link</a>. With some more text.


In other words, I want to remove all the HTML tags except the link, and also drop the text inside the header.

I searched on Google, this forum and others.
I tried regular expressions [en.wikipedia.org...] combined with preg_
I tried [simplehtmldom.sourceforge.net...]
I tried [php.net...]
Nothing seemed to work.

In the end I ended up with the following:

$string = $this->item->fulltext;
$search = array('/\<(.+)>(.*)>/si'); //Strip out HTML tags
$string = preg_replace($search, '', $string);


I do believe that a regular expression should be the way to go, but I'm struggling to come up with one that covers everything I need.

Any help would be greatly appreciated.

With kind regards,
George
10:29 am on Sep 9, 2010 (gmt 0)

5+ Year Member



Nothing seemed to work.


Hi, what did you get after using strip_tags($text, '<a>')?
10:32 am on Sep 9, 2010 (gmt 0)

5+ Year Member



Thank you for the reply.

Haven't tried that, but looking at how strip_tags works... I guess I would get all the text and the link. So also the text between the <h2> tags, which I don't want.
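A quick way to check that guess is to run strip_tags() on the sample from the first post. This is only a sketch: the $html variable is made up for illustration, and the img attribute quote has been closed.

```php
<?php
// Sample input from the first post (with the img attribute quote closed).
$html = 'A line of text, which is of any length. Ohw yes it is!<br /><br />' . "\n"
      . '<h2>And a header</h2>' . "\n"
      . '<img src="image.jpg">With an image and a <a href="/">link</a>. With some more text.';

// strip_tags() removes the tags themselves but keeps the text between them,
// so "And a header" is still present in the result.
$result = strip_tags($html, '<a>');

echo $result;
```

So the link survives, but so does the header text, which is why strip_tags() alone is not enough here.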
11:00 am on Sep 9, 2010 (gmt 0)

10+ Year Member



If it's just the h2 whose content you need to remove, why not strip that out first with a regex, then use strip_tags for the rest, retaining the <a> tags?

$t=strip_tags(preg_replace("/<h2>[^>]+>/","",$t),'<a>');

h2 regex needs a bit of work of course to handle spaces, cases etc.
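As a rough idea of what that extra work could look like, here is a sketch that tolerates attributes, upper-case tag names and nested tags inside the heading. The function name strip_h2_keep_links is hypothetical and not part of any library.

```php
<?php
// Hypothetical helper: remove <h2> elements together with their content,
// then strip every remaining tag except <a>.
function strip_h2_keep_links(string $html): string
{
    // <h2\b[^>]*>  opening tag, with optional attributes
    // .*?          the heading content, matched lazily
    // <\/h2\s*>    closing tag, allowing whitespace before ">"
    // i = case-insensitive, s = "." also matches newlines
    $withoutH2 = preg_replace('/<h2\b[^>]*>.*?<\/h2\s*>/is', '', $html);

    return strip_tags($withoutH2, '<a>');
}

echo strip_h2_keep_links('<H2 class="title">A <b>bold</b> title</H2><p>Text with a <a href="x">link</a>.</p>');
```

The lazy .*? matters when there are several headings in the text: a greedy .* would swallow everything from the first <h2> to the last </h2>.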
11:57 am on Sep 9, 2010 (gmt 0)

5+ Year Member



It's just the h2 in the above example, but in my original post it's also the image, and obviously there are more HTML tags to think of.
12:04 pm on Sep 9, 2010 (gmt 0)

10+ Year Member



You seemed to say you wanted: All the text, except that between the h2, without any HTML other than the link - that's what the line above does.

What would you want from this string:

<h2>A title for</h2><p>Some text with <b>tags</b> and <a href="x">links</a> with</p><p>odd paras and <img src='x'> images.</p><h1>Bigger title</h1><div>for some stuff</div>
12:25 pm on Sep 9, 2010 (gmt 0)

5+ Year Member



I didn't think this through... I want to keep a lot more than I initially thought.

From your example wrightee I would want:
<p>Some text with <b>tags</b> and <a href="x">links</a> with</p><p>odd paras and images.</p>for some stuff

Please note though that the content I'm checking is added with an editor, the JCE editor for Joomla. So it's "clean". It's set up so there shouldn't be any <p> or <div> tags.

Trying to achieve the above might get really complicated?
What if we do it the other way around?
Check for the tags I don't want with a regular expression, by putting the things I don't want in an array of some sort?
12:41 pm on Sep 9, 2010 (gmt 0)

10+ Year Member



Sounds like you want to remove a few items and keep a lot, in which case just use a regex to strip the ones you don't want - either a complicated one to do them all in one go, or a loop over an array.
2:09 pm on Sep 9, 2010 (gmt 0)

5+ Year Member



Not sure this is 100% foolproof, but I managed to come up with the following:

$string = $this->item->fulltext;
$text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
$text = strip_tags($text, '<b><i><em><strong><a>');

This removes the H2, including its text, from the string
and then it removes any other tags except the ones I want to keep. The next step would be to also have it remove h3, h4, etc.
Can I put those in an array and pass that array to preg_replace?
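For what it's worth, preg_replace() does accept an array of patterns as its first argument. A minimal sketch, assuming a made-up list $dropWithContent and a made-up sample string:

```php
<?php
// Hypothetical list of elements whose content should be dropped entirely.
$dropWithContent = array('h2', 'h3', 'h4');

// Build one pattern per tag name, in the same style as the h2 pattern above.
$patterns = array();
foreach ($dropWithContent as $tag) {
    $patterns[] = '|<' . $tag . '\b[^>]*>.*</' . $tag . '>|isU';
}

$string = '<h2>Title</h2>Intro with <b>bold</b> text.<h3 class="sub">Subtitle</h3><img src="x.jpg"> Done.';

// preg_replace() applies every pattern in the array in turn.
$text = preg_replace($patterns, '', $string);
$text = strip_tags($text, '<b><i><em><strong><a>');

echo $text;
```

Adding another element to remove is then a matter of adding its name to the array.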
4:01 pm on Sep 9, 2010 (gmt 0)

10+ Year Member



Try:

$t=strip_tags(preg_replace("/<h[1-6]>[^>]+>/","",$t),'<b><i><em><strong><a>');

to remove H1, H2 ... H6 (HTML defines no other heading levels) and leave b, i, em, strong, a
6:26 pm on Sep 9, 2010 (gmt 0)

WebmasterWorld Senior Member rocknbil is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Give this a whirl. Threw some classes in there to see if it works for everything, looks like it does. :-)

<?php
header("content-type:text/html");

$sampletext= '
<h2>This is some text</h2>
<p>This is a link to <a style="some-style" href="https://www.example.com">example.com</a>.
I used <strong class="red">https</strong> so it doesn\'t <em class="bold">munge up</em> the code here.</p>
<p>Generally a <span style="color:#ff0000">not character</span> in a class is better than .* any character.</p>
<div id="some-div">Just a div to test other elements. <br> <br /> <br/> Looks like it works.</div>
<img src="xhtml-enthusiasts.jpg" alt=" you probably don\'t really even need XHTML" />
';
$reg = '/<\/*\b[^a][^>]*>/ims'; // The magic regex doing all the heavy lifting

echo "<h1>Original text</h1> $sampletext";
$cleansed = preg_replace($reg, '', $sampletext);
echo "<h1>Cleansed</h1> $cleansed";
?>


The output I get is


<h1>Original text</h1>
<h2>This is some text</h2>
<p>This is a link to <a style="some-style" href="https://www.example.com">example.com</a>.
I used <strong class="red">https</strong> so it doesn't <em class="bold">munge up</em> the code here.</p>
<p>Generally a <span style="color:#ff0000">not character</span> in a class is better than .* any character.</p>

<div id="some-div">Just a div to test other elements. <br> <br /> <br/> Looks like it works.</div>
<img src="xhtml-enthusiasts.jpg" alt=" you probably don't really even need XHTML" />
<h1>Cleansed</h1>
This is some text
This is a link to <a style="some-style" href="https://www.example.com">example.com</a>.
I used https so it doesn't munge up the code here.
Generally a not character in a class is better than .* any character.
Just a div to test other elements. Looks like it works.



Examination:

/ = regex delimiter

< = pattern starts with <

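\/* = followed by zero or more /'s, which catches both opening and closing tags

\b = followed by a word boundary - doesn't work without it, because otherwise [^a] could match the / in </a> and the closing tag would be stripped

[^a] = followed by any character but a (or A, because of the i modifier); a is NOT the first character of the tag name. Keeping it out of the next character class means span, class, etc., which merely contain an a, are still matched. Note that any other tag whose name starts with a is skipped as well.

[^>]* = followed by zero or more of anything NOT a >. You want *, zero or more, not +, one or more, or single-letter tags such as <p> and <b> will not be replaced.

> = followed by the closing angle bracket

/ = ending regex delimiter

ims = regex modifiers: i is case-insensitive, m is multiline, s lets . match newlines. Only i actually affects this pattern, since it contains no ^, $ or . :-)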
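One thing to watch: because the pattern skips every tag whose name starts with "a", tags such as <abbr> are left in place too. The sketch below compares the pattern with a variant that uses a negative lookahead; the variant and the sample string are assumptions added for illustration, not something tested against real Joomla content.

```php
<?php
// Pattern from the post: skips every tag whose name starts with "a".
$original = '/<\/*\b[^a][^>]*>/ims';

// Assumed variant: a negative lookahead that excludes only the <a> element.
// (?!a\b) fails when the tag name is exactly "a", but lets "abbr" through.
$variant = '/<\/?(?!a\b)[a-z][^>]*>/i';

$html = '<p>See <abbr title="HyperText">HTML</abbr> and <a href="/">home</a>.</p>';

$fromOriginal = preg_replace($original, '', $html);
$fromVariant  = preg_replace($variant, '', $html);

// First line keeps <abbr>, second line keeps only the link.
echo $fromOriginal, "\n", $fromVariant, "\n";
```

Both versions still leave the text inside every removed tag, so neither drops heading content on its own.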
6:41 am on Sep 14, 2010 (gmt 0)

5+ Year Member



Hi rocknbil,

Thanks for the effort.
It seems to "work", in that it cleans out the tags. However, it leaves things like:
"This is some text"
which I don't want. That text is in an <h2> tag, and the text nodes inside the tag need to be removed as well.

I came up with the following:

$string = $this->item->fulltext;
$text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
$text = strip_tags($text, '<b><i><em><strong><a>');


But I'm not 100% sure it's the correct way to go about it. I "just" need to enhance this so it can easily remove more tags and the text nodes in those tags, for instance h3, h4 and <img. Although my code seems to remove the <img tag as well.

**EDIT**

The <img is removed by the strip_tags...
and for the h3 regex I can just copy the h2 one, so I end up with:

$string = $this->item->fulltext;
$text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
$text = preg_replace('|\<h3.*\>(.*\n*)\</h3\>|isU', '', $text);
$text = strip_tags($text, '<b><i><em><strong><a>');


This should be ready to go by the looks of it.
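If more heading levels turn up later, the two preg_replace lines could probably be folded into one pattern with a backreference. A sketch only; the function name clean_fulltext and the sample string are made up:

```php
<?php
// Hypothetical helper consolidating the steps above.
function clean_fulltext(string $html): string
{
    // (h[1-6]) captures the heading name; \1 requires the same name in the
    // closing tag, so <h2>...</h2> and <h3>...</h3> are handled by one pattern.
    $text = preg_replace('~<(h[1-6])\b[^>]*>.*?</\1\s*>~is', '', $html);

    // Remaining tags (img, br, p, ...) are stripped; only the listed ones survive.
    return strip_tags($text, '<b><i><em><strong><a>');
}

$string = "Intro.<br /><h2>Header</h2>\n<h3 id=\"s\">Sub\nheader</h3><img src=\"image.jpg\">An <em>image</em> and a <a href=\"/\">link</a>.";
echo clean_fulltext($string);
```

In the original code this would be called as clean_fulltext($this->item->fulltext).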
 
