Regex to close HTML tags left open

Andreas, if you're listening, this one's for you...

         

Donboy

7:19 pm on Jun 30, 2003 (gmt 0)

10+ Year Member



I asked for you because you posted something in an earlier thread that is now closed because it's more than a year old.

[webmasterworld.com...]

I'm looking for a regex similar to what you're showing for the PHP example, but I need to do it in Perl. Maybe the Perl example is what I need, but I'm not sure I'm following it.

What I'm looking for is a regex that will close HTML tags that are left open. For example, if my users start a tag with A HREF and just forget to put the closing /A at the end, I want to eliminate the first part completely so that the rest of my page content doesn't end up being a huge link.

I've seen tons of expressions that kill HTML tags completely, but I don't want to do that. I still want people to be able to use tags, but I just want to make sure they close them properly or otherwise have the script remove them.

If you (or somebody) can help me come up with a regex that targets a specific type of HTML tag, that would be fine too. I don't need a regex that's super-complicated that would encompass any tag you can imagine. If it's just for the A tag, that's fine. I can always duplicate the regex to look for other types of common tags that might be left open, like bold, underline, or font tags.

Any ideas?

andreasfriedrich

10:00 pm on Jul 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, contrary to popular belief I am still around, just not as much as I used to be. If I was busy as a student, I am super busy now. I'd never have imagined that there would be THAT much work to do. But it just keeps getting more and more.

BTW I finished my law degree about three weeks ago :-) and am really happy about that.

As for a Perl [perl.com] regular expression that would do what you asked I can't think of a simple one or even a complex one for that matter ;-). But then I haven't been into REs that much lately.

I believe you would be better off splitting your HTML string on tags and then checking whether you get a balanced result.
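A rough sketch of that split-and-check idea, for what it's worth. It assumes only a handful of tag names matter and that no attribute value contains a literal ">"; the helper name unbalanced_tags is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns tag name => (opens - closes) for every
# watched tag whose open and close counts differ.
sub unbalanced_tags {
    my ($html, @watch) = @_;
    my %watched = map { lc($_) => 1 } @watch;
    my %count;

    # The capturing group makes split() return the tags as well as the text.
    for my $piece (split /(<[^>]*>)/, $html) {
        next unless $piece =~ m{^<(/?)([a-zA-Z][a-zA-Z0-9]*)};
        my ($closing, $name) = ($1, lc $2);
        next unless $watched{$name};
        $count{$name} += $closing ? -1 : 1;
    }
    return map { $_ => $count{$_} } grep { $count{$_} != 0 } keys %count;
}

my %bad = unbalanced_tags('<p><a href="foo.html">blah <b>bold</b></p>', qw(a b i));
print "$_ is off by $bad{$_}\n" for sort keys %bad;   # prints "a is off by 1"
```

This only counts tags, so it reports that something is unbalanced without saying where the missing tag should have gone.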

Andreas

sugarkane

10:37 am on Jul 2, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I started to look at this and then suddenly noticed a couple of hours had gone by - too easy to get lost in regexes ;)

Anyhow this is as far as I've got:

$content=~s@(<a\s.*?>)(.*?)<(/a[^>]|/[^a]>|/[^a][^>]|[^/]a>|[^/]a[^>]|[^/][^a]>|[^/][^a][^>])@$1$2</a><$3@g;
$content=~s@</a></a>@</a>@g;

It's a kludge, and has a few problems but it might be a starting point. In the first regex, if an <a> tag is opened, and then another tag appears before the </a>, an </a> is inserted before the second tag found.

EG

<p><a href="foo.html">blah</p> will become <p><a href="foo.html">blah</a></p>

Unfortunately, I kept getting double </a>'s being inserted and couldn't work out why, so the second regex fixes that.

The problem comes when another tag is nested inside the <a></a> - it will produce something like this:
<a href="foo.html"></a><i>Link</i></a>

You could fix that with a third regex, but we're getting into real kludge territory now, and it'd be preferable to fix the original regex rather than trying to catch all the exceptions.

Anyone got any ideas to take that further?

Donboy

11:59 am on Jul 2, 2003 (gmt 0)

10+ Year Member



Hours of work!?! Man, you didn't have to go to that much trouble! And you may have made it more complicated than was needed. My idea was just to strip off the offending HTML tag if it didn't have a closing tag following it.

But regardless, I appreciate all your hard work on this. I'll definitely give it a try and report back here with my findings.
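The simpler strip-it-if-unclosed idea might look something like this. It is only a sketch: the helper name strip_unclosed_links is invented here, and it assumes that an opening <a ...> counts as closed only if a </a> shows up before the next opening <a ...>.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every opening <a ...> that has no </a>
# before the next opening <a ...> or the end of the text.
sub strip_unclosed_links {
    my ($html) = @_;
    $html =~ s{
        <a\b[^>]*>              # an opening anchor tag
        (?!                     # ... NOT followed by:
            (?: (?!<a\b) . )*?  #   text containing no further opening anchor
            </a>                #   and then a closing tag
        )
    }{}gisx;
    return $html;
}

print strip_unclosed_links('click <a href="blah"> this <b>link</b>'), "\n";
```

Stray </a> tags with no opener are left alone, and an attribute value containing ">" would still confuse it.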

Robber

12:43 pm on Jul 2, 2003 (gmt 0)

10+ Year Member



Could you use some sort of script that adds tags onto a stack and removes them when a closing tag is met - if the closing tag doesn't match the last tag on the stack then you have an offender?
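Something along these lines, perhaps. This is a minimal sketch, not a real parser: check_nesting and the short list of tags that never take a closing tag are assumptions made for the example, and attribute values containing ">" are not handled.

```perl
use strict;
use warnings;

# Tags that never take a closing tag, so they must not go on the stack
# (an illustrative list, not a complete one).
my %void = map { $_ => 1 } qw(br hr img input meta link);

# Hypothetical helper: returns a list of problem descriptions (empty if none).
sub check_nesting {
    my ($html) = @_;
    my (@stack, @problems);

    while ($html =~ m{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>}g) {
        my ($closing, $name) = ($1, lc $2);
        next if $void{$name};

        if (!$closing) {
            push @stack, $name;                     # opening tag
        }
        elsif (@stack && $stack[-1] eq $name) {
            pop @stack;                             # matches the last opener
        }
        else {
            push @problems, "unexpected </$name>";  # the offender
        }
    }
    push @problems, map { "unclosed <$_>" } @stack;
    return @problems;
}

print "$_\n" for check_nesting('<a href="blah"><i><b>hello</i></b><br/>');
```

For that input it reports the mis-nested </i> and then the <a> and <i> that were never closed.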

bird

1:36 pm on Jul 2, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You're fighting a recursive problem here, one that can't be solved with regular expressions. What you need is a real (if maybe simplistic) HTML parser.

Just consider the following example to illustrate the two most obvious issues:

<a href="/"><img src="/logo.gif" alt="<go home>"></a>

a) There may be other tags between your <a>...</a>.
b) Any attribute within a tag may also contain "<" and ">".
Those two combined make it impossible to reliably detect the end of a nested tag with a regular expression, which means that you'll likely end up removing valid links.
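To see point b) in action, here is a small demonstration using the usual naive "everything from < to the next >" pattern on the example above. The pattern is just the common textbook one, not something anyone in this thread proposed.

```perl
use strict;
use warnings;

my $html = '<a href="/"><img src="/logo.gif" alt="<go home>"></a>';

# A naive "tag" pattern: everything from "<" up to the next ">".
my @tags = $html =~ /(<[^>]*>)/g;
print "$_\n" for @tags;

# Prints:
#   <a href="/">
#   <img src="/logo.gif" alt="<go home>
#   </a>
```

The <img> tag is cut off at the ">" inside the alt attribute, and the trailing "> is left over as if it were ordinary text.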

DrDoc

4:04 am on Jul 3, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In other words... You need something that validates the page against W3C, parses the output to pinpoint any parsing errors, interprets the error description, and takes suitable action ;)

ShawnR

1:49 pm on Jul 4, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi All

I agree that in the most general case Donboy is fighting a recursive problem and would need to code up a full validator...

But, given the purpose that he wants to put the code to, you can make a few assumptions and simplify the problem...

The purpose is to prevent posters to his bulletin board who forget to close their <a> tags from stuffing up his site. You can't try to close them for them, because you don't know where they were meant to be closed... e.g. say a poster enters:

click <a href="blah"> this <b>link</b>

did they mean
click <a href="blah"> this</a> <b>link</b>

or
click <a href="blah"> this <b>link</b></a>


So I agree with Donboy that the best option is just to de-link it if there is a problem.

Now instead of worrying about the complexity of recursing and/or full parsing and matching arbitrary opening tags with their closing tags, let's just make a management decision that you can have at most 6 tags within an <a> tag. After all, we don't want to try to build a full validator.
e.g. let's allow:

<a href = "blah"><i><b>hello</i></b><br/><img ...></a>

but not allow
<a href = "blah"><i><b>hello</i></b><br/><img ...><p>
because that would make 7 tags (<i>, <b>, </i>, </b>, <br/>, <img> and <p>) without yet having found the closing </a> tag

(Note: we don't really care that the poster didn't nest their tags properly... that is an exercise for a different thread)

So here is what I suggest:


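$_ = $the_string;
# Is there an <a> that is closed within at most 6 intervening tags?
if (!m#(<a[^<]*?>)[^<]*?(<(/[^a]|/a[^<]+?|[^/a][^<]*?)>[^<]*?){0,6}?</a>#)
{
# No: drop every opening <a ...> tag, keeping the text that follows it.
s#<a[^<]*?>([^<]*?)((?:<(?:/[^a]|/a[^<]+?|[^/a][^<]*?)>[^<]*?){0,6}?)#$1$2#g;
}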

To change the arbitrary management decision of 6 to some other number, change the {0,6}

Sorry about the length of the post. If you're still reading, thanks. I must become more efficient with words.

Shawn

killroy

1:59 pm on Jul 4, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'll assume your real target is to protect your site, not beautify the post.

Simply count the <a> and </a> tags in the post, subtract the second from the first, and add that number of </a>'s at the end of the post.

You'll ensure any link spill will be limited to the post only.
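In Perl that could be as short as the sketch below. The helper name close_dangling_links is made up, and it assumes attribute values never contain ">".

```perl
use strict;
use warnings;

# Hypothetical helper: append one </a> for every <a ...> left open.
sub close_dangling_links {
    my ($post) = @_;

    # "= () =" forces list context, so each variable gets the match count.
    my $opens  = () = $post =~ /<a\b[^>]*>/gi;
    my $closes = () = $post =~ m{</a\s*>}gi;

    my $missing = $opens - $closes;
    $post .= '</a>' x $missing if $missing > 0;
    return $post;
}

print close_dangling_links('see <a href="foo.html">this page'), "\n";
# prints: see <a href="foo.html">this page</a>
```

Nothing is done about surplus </a> tags, since those can't turn the rest of the page into a link.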

A hack you asked for, a hack you got ;)

SN

ShawnR

2:15 pm on Jul 4, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hey, that's cheating ;) the rest of us were trying to do it by RE's (OK, my suggestion used an if statement, but still ;)

Damian

2:44 pm on Jul 19, 2003 (gmt 0)

10+ Year Member



Interesting problem. How about the one below? It sets the first word after an orphan <a tag as the link.

[perl]
#!E:\perl\bin\perl.exe
use strict;
use warnings;

print "Content-type: text/html\n\n";

# a test post
my $string = qq~
<a href="testpage1.html">I don't close this tag<b> <br>
This is some interesting bold text in between</b> to make this test look real. It should not become part of the link the user intended to end earlier.
<b><a href="testpage2.html">and I do add bold overlapping b/href link that I do close</b></a> <br>
This is more interesting text in between
<i><a href="testpage3.html">This link</i> is at the end and is not closed, it has overlapping italics <br>
This line should not become part of the link
<a href="testpage1.html">QQ
~;

# Flatten the post onto one line
$string =~ s/[\n\t\r]//g;

# Split on opening <a tags. The first piece is the text before the first
# link and is kept as-is; every later piece starts just after an "<a".
my ($before, @chunks) = split(/<a\b/, $string, -1);

foreach my $chunk (@chunks) {

# Do we have a closing a href tag here? If so, leave this link alone
next if $chunk =~ /<\/a>/i;

# If not, add it: only link the first word after the opening tag
next if $chunk =~ s/^(\s*\b.*?\b) /$1<\/a> /s;

# No space found: it's the last word in the post that misses a closing tag
$chunk .= '</a>';

}

# Put the pieces back together, restoring the "<a" that split() removed
$string = $before . join('', map { "<a$_" } @chunks);

print "$string";

[/perl]