[webmasterworld.com...]
I'm looking for a regex similar to what you're showing for the PHP example, but I need to do it in Perl. Maybe the Perl example is what I need, but I'm not sure I'm following it.
What I'm looking for is a regex that will close HTML tags that are left open. For example, if my users start a tag with A HREF and just forget to put the closing /A at the end, I want to eliminate the first part completely so that the rest of my page content doesn't end up being a huge link.
I've seen tons of expressions that kill HTML tags completely, but I don't want to do that. I still want people to be able to use tags, but I just want to make sure they close them properly or otherwise have the script remove them.
If you (or somebody) can help me come up with a regex that targets a specific type of HTML tag, that would be fine too. I don't need a regex that's super-complicated that would encompass any tag you can imagine. If it's just for the A tag, that's fine. I can always duplicate the regex to look for other types of common tags that might be left open, like bold, underline, or font tags.
Any ideas?
BTW I finished my law degree about three weeks ago :-) and am really happy about that.
As for a Perl [perl.com] regular expression that would do what you asked, I can't think of a simple one, or even a complex one for that matter ;-). But then I haven't been into REs that much lately.
I believe you would be better off splitting your HTML string on tags and then checking whether you get a balanced result.
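A rough sketch of that split-and-count idea, in case it helps. The helper name unclosed_anchor_count is made up for illustration, and it assumes no attribute value contains "<" or ">":

```perl
use strict;
use warnings;

# Hypothetical helper: returns how many <a> tags are still open at the end of $html.
# Assumption: attribute values never contain "<" or ">".
sub unclosed_anchor_count {
    my ($html) = @_;
    my $open = 0;
    # The capturing group makes split return the tags as well as the text between them.
    for my $piece (split /(<[^>]*>)/, $html) {
        if ($piece =~ m{^<a\b}i) {
            $open++;                  # opening <a ...>
        }
        elsif ($piece =~ m{^</a\s*>}i) {
            $open-- if $open > 0;     # closing </a>; a stray one is ignored
        }
    }
    return $open;
}

print unclosed_anchor_count('<p><a href="foo.html">blah</p>'), "\n";   # prints 1
print unclosed_anchor_count('<a href="foo.html">blah</a>'), "\n";      # prints 0
```

A non-zero result means the post is unbalanced, and the script can then decide whether to strip the tag or reject the post.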
Andreas
Anyhow this is as far as I've got:
$content=~s@(<a\s.*?>)(.*?)<(/a[^>]|/[^a]>|/[^a][^>]|[^/]a>|[^/]a[^>]|[^/][^a]>|[^/][^a][^>])@$1$2</a><$3@g;
$content=~s@</a></a>@</a>@g;
It's a kludge, and has a few problems but it might be a starting point. In the first regex, if an <a> tag is opened, and then another tag appears before the </a>, an </a> is inserted before the second tag found.
EG
<p><a href="foo.html">blah</p> will become <p><a href="foo.html">blah</a></p>
Unfortunately, I kept getting double </a>'s being inserted and couldn't work out why, so the second regex fixes that.
The problem comes when another tag is nested inside the <a></a> - it will produce something like this:
<a href="foo.html"></a><i>Link</i></a>
You could fix that with a third regex, but we're getting into real kludge territory now, and it'd be preferable to fix the original regex rather than trying to catch all the exceptions.
Anyone got any ideas to take that further?
But regardless, I appreciate all your hard work on this. I'll definitely give it a try and report back here with my findings.
Just consider the following example to illustrate the two most obvious issues:
<a href="/"><img src="/logo.gif" alt="<go home>"></a>
a) There may be other tags between your <a>...</a>.
b) Any attribute within a tag may also contain "<" and ">".
Those two combined make it impossible to reliably detect the end of a nested tag with a regular expression, which means that you'll likely end up removing valid links.
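To see point b) in practice, here is a small sketch that runs a naive tag pattern over the example. The pattern <[^>]*> is my assumption of what a simple approach would use; it is not taken from any of the regexes posted in this thread:

```perl
use strict;
use warnings;

my $html = '<a href="/"><img src="/logo.gif" alt="<go home>"></a>';

# Naive tag matcher: "<", anything up to the first ">", then ">".
my @tags = $html =~ /(<[^>]*>)/g;

print "$_\n" for @tags;
# prints:
# <a href="/">
# <img src="/logo.gif" alt="<go home>
# </a>
```

The img tag is cut off at the ">" inside the alt attribute, so anything built on top of that match is working from the wrong tag boundaries.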
I agree that in the most general case Donboy is fighting a recursive problem and would need to code up a full validator...
But, given the purpose that he wants to put the code to, you can make a few assumptions and simplify the problem...
The purpose is to prevent posters to his bulletin board who forget to close their <a> tags from stuffing up his site. You can't try to close them for them, because you don't know where they were meant to be closed... e.g. say a poster enters:
click <a href="blah"> this <b>link</b>
Did they mean:
click <a href="blah"> this</a> <b>link</b>
or:
click <a href="blah"> this <b>link</b></a>
Now instead of worrying about the complexity of recursing and/or full parsing and matching arbitrary opening tags with their closing tags, let's just make a management decision that you can have at most 6 tags within an <a> tag. After all, we don't want to try to build a full validator.
e.g. let's allow:
<a href = "blah"><i><b>hello</i></b><br/><img ...></a>
but not:
<a href = "blah"><i><b>hello</i></b><br/><img ...><p>
because that would make 7 tags (<i>, <b>, </i>, </b>, <br/>, <img> and <p>) without yet having found the closing </a> tag
(Note: we don't really care that the poster didn't nest their tags properly... that is an exercise for a different thread)
So here is what I suggest:
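$_ = $the_string;
# If there is no <a>...</a> pair with at most 6 other tags in between...
if (!m#(<a[^<]*?>)[^<]*?(<(/[^a]|/a[^<]+?|[^/a][^<]*?)>[^<]*?){0,6}?</a>#g)
{
# ...drop the opening <a ...> tags and keep the text and tags that followed them
s#(?:<a[^<]*?>)([^<]*?)((?:<(?:/[^a]|/a[^<]+?|[^/a][^<]*?)>[^<]*?){0,6}?)#$1$2#g;
}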
To change the arbitrary management decision of 6 to some other number, change the {0,6}
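If you'd rather not edit the regex each time, here is a sketch that passes the limit in as a parameter. The helper name has_closed_link is hypothetical, and the pattern is the test from the if above with the alternation bars written as ASCII | and the unused capture groups made non-capturing:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $html contains an <a>...</a> pair with
# at most $max_tags other tags between the opening and closing tag, else 0.
sub has_closed_link {
    my ($html, $max_tags) = @_;
    return $html =~ m#<a[^<]*?>[^<]*?(?:<(?:/[^a]|/a[^<]+?|[^/a][^<]*?)>[^<]*?){0,${max_tags}}?</a>#
        ? 1 : 0;
}

my $sample = '<a href = "blah"><i><b>hello</i></b><br/><img ...></a>';
print has_closed_link($sample, 6), "\n";   # prints 1 (6 tags inside the link)
print has_closed_link($sample, 5), "\n";   # prints 0 (limit is too low)
```

Like the original, this only says whether a closed link exists somewhere in the string; it does not check each link separately.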
Sorry about the length of the post. If you're still reading, thanks. I must become more efficient with words.
Shawn
Simply count the <a> and </a> tags in the post, subtract the second from the first, and add that number of </a>'s at the end of the post.
You'll ensure any link spill will be limited to the post only.
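Something along these lines, as a minimal sketch. The name close_anchors_at_end is just for illustration, and it assumes attribute values don't contain ">":

```perl
use strict;
use warnings;

# Hypothetical helper: appends one </a> to the post for every <a> left open.
sub close_anchors_at_end {
    my ($post) = @_;
    # "() =" forces list context, so each variable gets the number of matches.
    my $opened = () = $post =~ /<a\b[^>]*>/gi;
    my $closed = () = $post =~ /<\/a\s*>/gi;
    my $missing = $opened - $closed;
    $post .= '</a>' x $missing if $missing > 0;
    return $post;
}

print close_anchors_at_end('click <a href="blah"> this <b>link</b>'), "\n";
# prints: click <a href="blah"> this <b>link</b></a>
```

The rest of the post after the forgotten tag still becomes a link, but nothing after the post does.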
A hack you asked for, a hack you got ;)
SN
[perl]
#!E:\perl\bin\perl.exe
print "Content-type: text/html\n\n";
# a test post
$string = qq~
<a href="testpage1.html">I don't close this tag<b> <br>
This is some interesting bold text in between</b> to make this test look real. It should not become part of the link the user intended to end earlier.
<b><a href="testpage2.html">and I do add bold overlapping b/href link that I do close</b></a> <br>
This is more interesting text in between
<i><a href="testpage3.html">This link</i> is at the end and is not closed, it has overlapping italics <br>
This line should not become part of the link
<a href="testpage1.html">QQ
~;
$string =~ s/(\n|\t|\r)//g;
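@strings = split(/<a/,$string);
shift @strings; # the first chunk is whatever precedes the first <a, so it is never a link
foreach $between_ahref (@strings) {
# Do we have a closing a href tag here? If not add it, only link the first word
# (no /g on these tests: in scalar context it would make the elsif test resume after the first match)
if ($between_ahref !~ /<\/a>/is && $between_ahref =~ /^\s*\b(.*?)\b /) {
$firstword = $1;
# \Q...\E so that characters like "." and "?" in the URL are matched literally
$string =~ s/\Q$firstword\E/$firstword<\/a>/is;
}
elsif ($between_ahref !~ /<\/a>/is) { # it's the last word in the post that misses a closing tag
$string =~ s/\Q$between_ahref\E/$between_ahref<\/a>/;
}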
}
print "$string";
[/perl]