

PHP Server Side Scripting Forum

    
Removing HTML tags and the code in between
glimbeek



 
Msg#: 4199212 posted 9:42 am on Sep 9, 2010 (gmt 0)

Hi,

I have the following HTML code:

A line of text, which is of any length. Ohw yes it is!<br /><br />
<h2>And a header</h2>
<img src="image.jpg">With an image and a <a href="/">link</a>. With some more text.


After "cleaning" the code, I want to end up with the following:

A line of text, which is of any length. Ohw yes it is! With an image and a <a href="/">link</a>. With some more text.


In other words, I want to remove all the HTML tags except the link.

I searched on Google, this forum and others.
I tried regular expressions [en.wikipedia.org...] combined with preg_
I tried [simplehtmldom.sourceforge.net...]
I tried [php.net...]
Nothing seemed to work.

In the end I ended up with the following:

$string = $this->item->fulltext;
$search = array('/\<(.+)>(.*)>/si'); //Strip out HTML tags
$string = preg_replace($search, '', $string);


I do believe that a regular expression is the way to go, but I'm struggling to come up with one that covers everything I need.

Any help would be greatly appreciated.

With kind regards,
George

 

morehawes

5+ Year Member



 
Msg#: 4199212 posted 10:29 am on Sep 9, 2010 (gmt 0)

Nothing seemed to work.


Hi, what did you get after using strip_tags($text, '<a>')?

glimbeek



 
Msg#: 4199212 posted 10:32 am on Sep 9, 2010 (gmt 0)

Thank you for the reply.

Haven't tried that, but looking at how strip_tags works... I guess I would get all the text and the link. So also the text between the <h2> tags, which I don't want.
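
For reference, a minimal sketch of what strip_tags() does with markup like the sample in the first post. The variable names are illustrative only. The tags are removed but the heading text stays, which matches the guess above.

```php
<?php
// Sample markup modelled on the first post; $html and $result are illustrative names.
$html = "A line of text.<br /><br />\n"
      . "<h2>And a header</h2>\n"
      . "<img src=\"image.jpg\">With an image and a <a href=\"/\">link</a>.";

// Keep only <a> tags. Every other tag is dropped, but the text inside it is kept.
$result = strip_tags($html, '<a>');

echo $result;
// A line of text.
// And a header
// With an image and a <a href="/">link</a>.
```

So strip_tags() alone cannot drop the heading text; something has to remove the whole <h2> element first.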

wrightee

5+ Year Member



 
Msg#: 4199212 posted 11:00 am on Sep 9, 2010 (gmt 0)

If it's just the h2 you need to remove the content from, why not strip that out first with a regex, then use strip_tags for the rest, retaining the <a> tags?

$t=strip_tags(preg_replace("/<h2>[^>]+>/","",$t),'<a>');

h2 regex needs a bit of work of course to handle spaces, cases etc.
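
One possible way to do that "bit of work" is sketched below. It assumes headings are not nested inside each other, and the sample string is made up for illustration. The pattern allows attributes and mixed case, and it matches up to the closing </h2> rather than stopping at the first >.

```php
<?php
// Made-up sample: upper-case tag, an attribute, and an inline tag inside the heading.
$t = 'Intro.<H2 class="title">A <em>nested</em> header</H2> Body with a <a href="/">link</a>.';

// <h2\b[^>]*>  opening tag with optional attributes
// .*?          heading content, non-greedy, so it stops at the first closing tag
// <\/h2\s*>    closing tag
// i = case-insensitive, s = let . match newlines
$noHeading = preg_replace('/<h2\b[^>]*>.*?<\/h2\s*>/is', '', $t);

// Then strip every remaining tag except <a>.
$clean = strip_tags($noHeading, '<a>');

echo $clean;
// Intro. Body with a <a href="/">link</a>.
```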

glimbeek



 
Msg#: 4199212 posted 11:57 am on Sep 9, 2010 (gmt 0)

It's just the h2 in the above example, but in my original post it's also the image, and obviously there are more HTML tags to think of.

wrightee

5+ Year Member



 
Msg#: 4199212 posted 12:04 pm on Sep 9, 2010 (gmt 0)

You seemed to say you wanted: All the text, except that between the h2, without any HTML other than the link - that's what the line above does.

What would you want from this string:

<h2>A title for</h2><p>Some text with <b>tags</b> and <a href="x">links</a> with</p><p>odd paras and <img src='x'> images.</p><h1>Bigger title</h1><div>for some stuff</div>

glimbeek



 
Msg#: 4199212 posted 12:25 pm on Sep 9, 2010 (gmt 0)

I didn't think this through... I want to keep a lot more than I initially thought.

From your example wrightee I would want:
<p>Some text with <b>tags</b> and <a href="x">links</a> with</p><p>odd paras and images.</p>for some stuff

Please note though that the content I'm checking is added with an editor, the JCE editor for Joomla. So it's "clean". It's set up so there shouldn't be any <p> or <div> tags.

Trying to achieve the above might get really complicated?
What if we do it the other way around?
Check for the tags I don't want with a regular expression, by putting the things I don't want in an array of some sort?

wrightee

5+ Year Member



 
Msg#: 4199212 posted 12:41 pm on Sep 9, 2010 (gmt 0)

Sounds like you want to remove a few items and keep a lot, in which case just use a regex to strip the ones you don't want - either a complicated one to do them all in one go, or a loop over an array.
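
A rough sketch of the loop-over-an-array option, run against the test string from earlier in the thread. The function name remove_elements and the tag list are hypothetical, and the pattern assumes the listed elements are not nested inside themselves.

```php
<?php
// Hypothetical helper: removes each listed element together with its contents.
function remove_elements(string $html, array $tags): string
{
    foreach ($tags as $tag) {
        $name = preg_quote($tag, '#');
        // Opening tag (with any attributes), non-greedy content, matching closing tag.
        $html = preg_replace('#<' . $name . '\b[^>]*>.*?</' . $name . '\s*>#is', '', $html);
    }
    return $html;
}

$in = '<h2>A title for</h2><p>Some text with <b>tags</b> and <a href="x">links</a> with</p>'
    . '<p>odd paras and <img src=\'x\'> images.</p><h1>Bigger title</h1><div>for some stuff</div>';

$out = remove_elements($in, ['h1', 'h2']);

echo $out;
// <p>Some text with <b>tags</b> and <a href="x">links</a> with</p><p>odd paras and <img src='x'> images.</p><div>for some stuff</div>
```

Elements without a closing tag, such as <img>, are not covered by this pattern; those can be left to strip_tags() afterwards.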

glimbeek



 
Msg#: 4199212 posted 2:09 pm on Sep 9, 2010 (gmt 0)

Not sure this is 100% foolproof, but I managed to come up with the following:

$string = $this->item->fulltext;
$text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
$text = strip_tags($text, '<b><i><em><strong><a>');

This removes the H2 and its contents from the string, and then it removes any other tags except the ones I want to keep. The next step would be to also have it remove h3, h4, etc.
Can I put those in an array and pass that array to preg_replace?
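
On the array question: preg_replace() does accept an array of patterns and applies them one after another. A minimal sketch, with a made-up input string and slightly simplified patterns rather than the exact ones above:

```php
<?php
// Made-up input; in the thread the text comes from $this->item->fulltext.
$string = "Intro text.<h2 class=\"x\">Heading two</h2>\n"
        . "<h3>Heading\nthree</h3>Body with <strong>bold</strong> and <span>span</span>.";

// One pattern per element to drop; preg_replace() runs them in order.
$patterns = array(
    '|<h2\b[^>]*>.*?</h2>|is',
    '|<h3\b[^>]*>.*?</h3>|is',
);

$text = preg_replace($patterns, '', $string);
$text = strip_tags($text, '<b><i><em><strong><a>');

echo $text;
// Intro text.
// Body with <strong>bold</strong> and span.
```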

wrightee

5+ Year Member



 
Msg#: 4199212 posted 4:01 pm on Sep 9, 2010 (gmt 0)

Try:

$t=strip_tags(preg_replace("/<h[0-9]>[^>]+>/","",$t),'<b><i><em><strong><a>');

to remove H1,H2..H9 and leave b,i,em,strong,a
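
Note that [^>]+ stops at the first >, so a heading that has attributes or contains an inline tag would be missed or only partly removed. A hedged alternative, assuming headings are not nested, captures the heading name and reuses it with a backreference. The sample string is invented for illustration.

```php
<?php
// Invented sample: one plain heading, one with an attribute and an inline tag.
$t = '<h1>Top</h1>Keep <b>this</b>.<h3 id="s">Sub <i>title</i></h3> And <a href="x">this</a><p>too</p>.';

// (h[1-6]) captures the heading name; \1 requires the same name in the closing tag.
$t = preg_replace('/<(h[1-6])\b[^>]*>.*?<\/\1\s*>/is', '', $t);

// Strip the remaining tags, keeping only the inline ones listed.
$t = strip_tags($t, '<b><i><em><strong><a>');

echo $t;
// Keep <b>this</b>. And <a href="x">this</a>too.
```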

rocknbil

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 4199212 posted 6:26 pm on Sep 9, 2010 (gmt 0)

Give this a whirl. Threw some classes in there to see if it works for everything, looks like it does. :-)

<?php
header("content-type:text/html");

$sampletext= '
<h2>This is some text</h2>
<p>This is a link to <a style="some-style" href="https://www.example.com">example.com</a>.
I used <strong class="red">https</strong> so it doesn\'t <em class="bold">munge up</em> the code here.</p>
<p>Generally a <span style="color:#ff0000">not character</span> in a class is better than .* any character.</p>
<div id="some-div">Just a div to test other elements. <br> <br /> <br/> Looks like it works.</div>
<img src="xhtml-enthusiasts.jpg" alt=" you probably don\'t really even need XHTML" />
';
$reg = '/<\/*\b[^a][^>]*>/ims'; // The magic regex doing all the heavy lifting

echo "<h1>Original text</h1> $sampletext";
$cleansed = preg_replace($reg, '', $sampletext);
echo "<h1>Cleansed</h1> $cleansed";
?>


The output I get is


<h1>Original text</h1>
<h2>This is some text</h2>
<p>This is a link to <a style="some-style" href="https://www.example.com">example.com</a>.
I used <strong class="red">https</strong> so it doesn't <em class="bold">munge up</em> the code here.</p>
<p>Generally a <span style="color:#ff0000">not character</span> in a class is better than .* any character.</p>

<div id="some-div">Just a div to test other elements. <br> <br /> <br/> Looks like it works.</div>
<img src="xhtml-enthusiasts.jpg" alt=" you probably don't really even need XHTML" />
<h1>Cleansed</h1>
This is some text
This is a link to <a style="some-style" href="https://www.example.com">example.com</a>.
I used https so it doesn't munge up the code here.
Generally a not character in a class is better than .* any character.
Just a div to test other elements. Looks like it works.



Examination:

/ = regex delimiter

< = pattern starts with <

\/* = followed by zero or more /'s, which catches both opening and closing tags

\b = followed by a word boundary - doesn't work without it, because [^a] would then match the / in </a> and the closing tag would be removed

[^a] = followed by any one character but a, so the tag name can NOT start with a. Putting the a in the next character class instead would also skip span, class, etc.

[^>]* = followed by zero or more of anything NOT a >. You want *, zero or more, not +, one or more, or single-letter tags like <p> and <b> won't match.

> = followed by the closing angle bracket

/ = ending regex delimiter

ims = regex modifiers: i is case-insensitive, m is multi-line, s lets . match newlines. Only i matters for this pattern, since it has no anchors or dots. :-)
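
One side effect worth knowing: because [^a] rejects any tag name that starts with a, tags such as <abbr> or <article> are left in place as well. A possible variant, offered as a sketch rather than a drop-in replacement, uses a negative lookahead so that only the tag named exactly a is kept. The sample string is invented.

```php
<?php
// Invented sample with several tags whose names start with "a".
$sample = '<p>See <a href="https://www.example.com">example.com</a>, '
        . 'an <abbr title="x">abbr</abbr> and an <article>article</article>.</p>';

// Pattern from the post: skips every tag whose name starts with "a".
$original = preg_replace('/<\/*\b[^a][^>]*>/ims', '', $sample);

// Variant: (?!a\b) rejects only the tag named exactly "a".
$variant = preg_replace('/<\/?(?!a\b)[a-z][^>]*>/i', '', $sample);

echo $original, "\n";
// See <a href="https://www.example.com">example.com</a>, an <abbr title="x">abbr</abbr> and an <article>article</article>.
echo $variant, "\n";
// See <a href="https://www.example.com">example.com</a>, an abbr and an article.
```

Like the pattern above, this only removes the tags themselves; the text inside them is kept.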

glimbeek



 
Msg#: 4199212 posted 6:41 am on Sep 14, 2010 (gmt 0)

Hi rocknbil,

Thanks for the effort.
It seems to "work", in that it cleans out the tags. However, I don't want things like:
"This is some text"
It's in an <h2> tag, and the text nodes inside the tag need to be removed as well.

I came up with the following:

$string = $this->item->fulltext;
$text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
$text = strip_tags($text, '<b><i><em><strong><a>');


But I'm not 100% sure it's the correct way to go about it. I "just" need to enhance this so it can easily remove more tags and the text nodes in those tags, for instance h3, h4 and <img. Although my code seems to remove the <img tag as well.

**EDIT**

The <img is removed by the strip_tags call,
and for the h3 regex I can just copy the h2 one, so I end up with:

$string = $this->item->fulltext;
$text = preg_replace('|\<h2.*\>(.*\n*)\</h2\>|isU', '', $string);
$text = preg_replace('|\<h3.*\>(.*\n*)\</h3\>|isU', '', $text);
$text = strip_tags($text, '<b><i><em><strong><a>');


This should be ready to go by the looks of it.

