Regex: Removing empty tags - PHP Server Side Scripting forum at WebmasterWorld - WebmasterWorld

Regex: Removing empty tags

Remove empty tags, regex, php, regular expressions

RyanM

1:42 am on Mar 10, 2005 (gmt 0)

10+ Year Member

Hi everyone, I am in need of yet more regex help.

What I want to do is remove any empty tags, or tags containing only whitespace, from my page. The regex that I have created is the following:

<[^/|^!|^input|^br|^img|^meta|^hr][^>]*>[\s]*<*/[^>]*>

Basically, it says: select any tag that is not an end tag, a comment, an input, a break, an image, a meta tag, or a horizontal rule, that contains only whitespace, together with its closing tag. This pattern works within Dreamweaver but does not work within PHP. An earlier version of this regex is:

<[^/][^>]*>[\s] *<*</[^>]*>

This works in PHP, but only where the tags contain no characters between them; in Dreamweaver, on the other hand, it works where there are more than two spaces.

If someone could point out where I am going wrong, that would be great. Also, if somebody could offer some ideas on how I could select a tag that does not close, or that runs into another open tag of the same sort, i.e.:

<strong>sdfsfdssfd<strong>sfdfs</strong>

would find the first strong because there is improper nesting; consequently, I would like to do the same on the other side.

Anyway, I can figure out how to do it with what I know of regexes; however, it would select all the text up to the opening tag, i.e. <strong>sdfsfdssfd, when what I want is simply <strong>.

Thanks

ironik

3:00 am on Mar 10, 2005 (gmt 0)

10+ Year Member

Removing whitespace and empty tags is fairly straightforward. I've just modified your regex a bit:

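<?php
$html = "<a></a><b>non-empty</b>";

// Remove any opening tag followed only by optional whitespace and a closing tag.
function removeEmptyTags($html_replace)
{
    // \s* matches the same text as ([\s]?)* without the nested quantifier.
    $pattern = '/<[^\/>]*>\s*<\/[^>]*>/';
    return preg_replace($pattern, '', $html_replace);
}

// Usage:
echo removeEmptyTags($html);
// Outputs '<b>non-empty</b>'
?>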

The duplicate tags thing is a little more complex, as your nested tags may or may not be different types, and they might be overlapping or whatever. Maybe someone else has a regex for this?

RyanM

5:11 am on Mar 10, 2005 (gmt 0)

10+ Year Member

Hi ironik, thanks for that, it works a treat.

However, if you look at the example that I gave, my first regex (the one that did not work in PHP) ignored any tags that do not require a closing tag, i.e. comments, inputs, breaks, images etc. I modified your regex to:

/<[^!|^input|^br|^img|^meta|^hr|^\/>]*>([\s]?)*<\/[^>]*>/

However, it does not seem to work. Any suggestions?

Thanks

Ryan

killroy

7:16 pm on Apr 3, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Hi,

Please first review my post here: [webmasterworld.com...] about the usage of [].

So what you really need to do is convert this:

/<[^!|^input|^br|^img|^meta|^hr|^\/>]*>([\s]?)*<\/[^>]*>/

to

/<(?!input|br|img|meta|hr|\/)[^>]*>\s*<\/[^>]*>/
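
To illustrate why the bracket version fails: inside [...], ^input|br is not a list of words but a set of single characters, so [^...]* rejects any tag whose name contains one of those letters (the t and r in <strong>, for example). A negative lookahead tests whole alternatives instead. Below is a minimal sketch of the lookahead pattern in use. The function name removeEmptyTagsExceptVoid is hypothetical, and the \b word boundary, the ! exclusion for comments, and the i flag are assumptions added on top of the pattern above, not part of it.

```php
<?php
// Hypothetical helper name, for illustration only.
function removeEmptyTagsExceptVoid($html)
{
    // (?!...) rejects tags that start with one of the listed names, "!" or "/".
    // Assumptions beyond the thread's pattern: \b stops "br" from also
    // excluding longer names that merely begin with "br", "!" skips
    // comments, and the i flag ignores case.
    $pattern = '/<(?!(?:input|br|img|meta|hr)\b|[!\/])[^>]*>\s*<\/[^>]*>/i';
    return preg_replace($pattern, '', $html);
}

// The empty <p> is removed; <br> followed by a stray </div> is left alone.
echo removeEmptyTagsExceptVoid('<p>  </p><br></div><b>kept</b>');
// Outputs '<br></div><b>kept</b>'
```

Note that a single pass does not remove tags that only become empty once their children are gone, such as <div><p></p></div>; that would need the replacement to be repeated.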

About nesting and overlapping tags: this is not a problem you can solve perfectly using regexes. What you need is a quasi-parser, i.e. use a regex to split the text into tags and text, and then use a recursive function or a stack to match up all the pairs. Allowing overlapping is also not easy, and it is one of the reasons it has been so difficult to get a browser that both complies with standards and is kind to shoddy HTML.
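
The quasi-parser idea could be sketched roughly as follows. This is only one possible reading of it: the function name findUnclosedTags, the list of void tags, and the returned array shape are all hypothetical, and it assumes no > characters appear inside attribute values or comments. It splits the markup with preg_split, pushes opening tags onto a stack, and reports every opening tag that never gets its own closing tag, along with its offset, so that just the tag (not the text after it) can be targeted.

```php
<?php
// Hypothetical helper: returns opening tags that are never closed,
// each as array('tag' => ..., 'offset' => ...).
function findUnclosedTags($html)
{
    $voidTags = array('input', 'br', 'img', 'meta', 'hr');

    // Split into tags and text; the capture group keeps the tags in the result,
    // and OFFSET_CAPTURE turns each token into array(string, offset).
    $flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE;
    $tokens = preg_split('/(<[^>]+>)/', $html, -1, $flags);

    $stack = array();
    $unclosed = array();

    foreach ($tokens as $token) {
        list($text, $offset) = $token;
        if (!preg_match('/^<(\/?)([a-z][a-z0-9]*)/i', $text, $m)) {
            continue; // plain text, comment or doctype
        }
        $name = strtolower($m[2]);
        if (in_array($name, $voidTags)) {
            continue; // tags that never take a closing tag
        }
        if ($m[1] === '') {
            // Opening tag: remember it until its closing tag turns up.
            $stack[] = array('name' => $name, 'tag' => $text, 'offset' => $offset);
            continue;
        }
        // Closing tag: look for the nearest open tag with the same name.
        for ($i = count($stack) - 1; $i >= 0; $i--) {
            if ($stack[$i]['name'] === $name) {
                break;
            }
        }
        if ($i < 0) {
            continue; // stray closing tag, ignored in this sketch
        }
        // Tags opened after the match were never closed (or overlap).
        $removed = array_splice($stack, $i);
        array_shift($removed); // drop the matched opening tag itself
        foreach ($removed as $entry) {
            $unclosed[] = array('tag' => $entry['tag'], 'offset' => $entry['offset']);
        }
    }

    // Whatever is still on the stack was never closed either.
    foreach ($stack as $entry) {
        $unclosed[] = array('tag' => $entry['tag'], 'offset' => $entry['offset']);
    }
    return $unclosed;
}

foreach (findUnclosedTags('<strong>sdfsfdssfd<strong>sfdfs</strong>') as $entry) {
    echo $entry['tag'] . ' at offset ' . $entry['offset'] . "\n";
}
// Outputs '<strong> at offset 0'
```

Matching a closing tag against the nearest open tag of the same name is a design choice; it is what makes the first <strong> in the example the one reported as unclosed. Stray closing tags could be collected the same way if the other side needs checking too.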

SN