Myself, I strip out the opening and closing tags for "html", "body", "img", "link", "head", "script", "style", "object", "embed", and "applet", and I also remove any tag with a javascript: URL or any onWhatever attribute defined.
Is this overkill? Inadequate? Do you have another approach? I'm curious.
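For anyone who wants to poke at that kind of filter, here is a rough sketch of it. It is in Python rather than PHP so it can be run standalone, and it uses the standard library's html.parser instead of regular expressions. The tag list comes from the post above; the names TagFilter and clean are made up for illustration, and this is not the poster's actual code.

```python
from html.parser import HTMLParser

# Tag names taken from the post above; everything else here is hypothetical.
BLOCKED_TAGS = {"html", "body", "img", "link", "head", "script",
                "style", "object", "embed", "applet"}


class TagFilter(HTMLParser):
    """Re-emits the input, dropping blocked tags and risky start tags."""

    def __init__(self):
        # Keep entity/character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    @staticmethod
    def _is_dangerous(attrs):
        for name, value in attrs:
            if name.startswith("on"):          # onclick, onload, ...
                return True
            if value and value.strip().lower().startswith("javascript:"):
                return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED_TAGS or self._is_dangerous(attrs):
            return                              # drop the tag itself
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)        # <x/> is filtered like <x>

    def handle_endtag(self, tag):
        if tag not in BLOCKED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def clean(markup):
    parser = TagFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(clean('<p>ok</p><script>alert(1)</script>'))  # <p>ok</p>alert(1)
```

Two things to notice: only the tags are stripped, so the text inside a script element survives as plain text, and a start tag dropped for an onWhatever attribute leaves its end tag behind. HTML comments are dropped entirely because the sketch does not re-emit them.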
For most of the input I allow, I just run addslashes() and strip out any characters that might mess things up. I have a few fields, though, where I allow pretty much anything and deal with it when I actually display it.
Dingman, I think your approach is pretty good too.
If your site depends on user input, check everything, everywhere.
I really hate people who try to distort the page by adding custom HTML codes. If everyone did this, the page would be extremely distorted. A good example would be Neopets' noteboard.
Have you checked how your regular expressions handle HTML like <!-- <b>Perl</b> --> or <img src="ac.gif" alt="AC > PHP">? They might not really do what you expect.
Good point. The one that occurs to me first would produce 'Perl -->' or ' PHP">'. If I'm not allowing any HTML tags at all, though, I just run it through PHP's htmlentities() function. I figure for a user who was innocently typing 'P & !P --> false', that's what they wanted. And anyone who tries to insert malicious code gets exposed.
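Both behaviours are easy to reproduce. The sketch below is Python rather than PHP so it runs standalone. The pattern in naive_strip is a guess at "the one that occurs to me first", and html.escape stands in for htmlentities(). It only escapes &, <, > and quotes, which makes it closer to PHP's htmlspecialchars(). Function names are made up for illustration.

```python
import html
import re

# Assumed "first thing that comes to mind" pattern: "<" up to the next ">".
NAIVE_TAG = re.compile(r"<[^>]*>")


def naive_strip(text):
    """Remove anything that looks like a tag, with no real parsing."""
    return NAIVE_TAG.sub("", text)


def escape_all(text):
    """Allow no HTML at all: escape the markup characters for display."""
    return html.escape(text)


# The two tricky inputs from the thread.
print(repr(naive_strip("<!-- <b>Perl</b> -->")))               # 'Perl -->'
print(repr(naive_strip('<img src="ac.gif" alt="AC > PHP">')))  # ' PHP">'

# Escaping keeps innocent text readable and exposes attempted markup.
print(escape_all("P & !P --> false"))           # P &amp; !P --&gt; false
print(escape_all("<script>alert(1)</script>"))  # &lt;script&gt;alert(1)&lt;/script&gt;
```

The regex fails because it knows nothing about comments or quoted attribute values, so the first ">" it meets ends the "tag" wherever it happens to be.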
For places where you do want to allow some HTML, just not all, does anyone have a good method for checking to make sure tags are balanced? I can just see some yutz making half a page disappear because they didn't close a tag, and I don't have a check for it yet. Something stack-based, like your standard parenthesis checker with an additional check on the pop to make sure the thing you popped off was in fact the opening tag for the end tag that prompted you to pop it, perhaps?
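A minimal sketch of that stack idea, in Python so it runs standalone. It leans on the standard library's html.parser to find the tags, pushes each opening tag, and on each closing tag checks that what it pops is the matching opener. The VOID_TAGS list and the names BalanceChecker and is_balanced are assumptions for illustration; a real check would need a decision on which unclosed tags to tolerate.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (assumed, partial list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class BalanceChecker(HTMLParser):
    """Parenthesis-checker logic applied to HTML tags."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # The popped tag must be the opener for this closing tag.
        if not self.stack or self.stack.pop() != tag:
            self.balanced = False


def is_balanced(markup):
    checker = BalanceChecker()
    checker.feed(markup)
    checker.close()
    # Leftover entries on the stack are tags that were never closed.
    return checker.balanced and not checker.stack


print(is_balanced("<b><i>x</i></b>"))  # True
print(is_balanced("<b><i>x</b></i>"))  # False
print(is_balanced("<b>never closed"))  # False
```

This treats markup as strictly nested, so HTML that browsers accept anyway, such as a p or li with no closing tag, is reported as unbalanced. For user-submitted snippets that strictness is probably what you want.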