Especially this bit:
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p>
echo strip_tags($text, '<p>');
?>
The above example will output:
Test paragraph. Other text
<p>Test paragraph.</p> Other text
// $document should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.
$search = array ("'<script[^>]*?>.*?</script>'si", // Strip out javascript
"'<[\/\!]*?[^<>]*?>'si", // Strip out html tags
"'([\r\n])[\s]+'", // Strip out white space
"'&(quot|#34);'i", // Replace html entities
"'&(amp|#38);'i",
"'&(lt|#60);'i",
"'&(gt|#62);'i",
"'&(nbsp|#160);'i",
"'&(iexcl|#161);'i",
"'&(cent|#162);'i",
"'&(pound|#163);'i",
"'&(copy|#169);'i");
$replace = array ("",
"",
"\\1",
"\"",
"&",
"<",
">",
" ",
chr(161),
chr(162),
chr(163),
chr(169));
$text = preg_replace ($search, $replace, $document);
// The /e (evaluate as php) modifier was removed in PHP 7, so the
// remaining numeric entities are converted with a callback instead.
$text = preg_replace_callback ("'&#(\d+);'",
function ($m) { return chr((int) $m[1]); },
$text);
It would have to be modified to suit.
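One possible way to modify it, as a rough sketch only: let PHP's built-in html_entity_decode() handle the entities instead of a hand-written table. The function name html_to_text and the UTF-8 charset are assumptions for illustration, not part of the manual example.

```php
<?php
// Hypothetical helper: reduce an HTML document to plain text.
// Assumes the input and the desired output are UTF-8.
function html_to_text(string $document): string
{
    // Drop <script> blocks together with their contents.
    $text = preg_replace('#<script[^>]*>.*?</script>#si', '', $document);
    // Remove every remaining tag.
    $text = strip_tags($text);
    // Collapse whitespace that follows a line break.
    $text = preg_replace('/([\r\n])\s+/', '$1', $text);
    // Convert named and numeric entities to their characters.
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo html_to_text('<p>Caf&eacute; &amp; bar</p><script>alert(1)</script>'), "\n";
```

The result is plain text, not safe HTML: an input such as &lt;script&gt; comes back as a literal <script>, so the text still has to go through htmlspecialchars() before it is echoed into a page.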
[edited by: lmo4103 at 4:23 pm (utc) on Sep. 30, 2006]
An even bigger issue exists with event-handler attributes such as onMouseOver and onClick: strip_tags leaves the attributes of any tag you allow untouched, so you can still sneak a lot of javascript into a page protected by strip_tags. Then there are other XSS methods involving CSS abuse and other tricks.
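To make the attribute problem concrete, here is a minimal sketch (the input string is made up for illustration). strip_tags() removes the tags that are not allowed, but an allowed tag passes through with all of its attributes intact.

```php
<?php
// Hypothetical user input: an allowed <p> tag carrying an event handler.
$input = '<p onmouseover="alert(\'xss\')">Hover me</p><b>bold</b>';

// Only <p> is allowed, so <b> is removed, but the onmouseover
// attribute on <p> is kept exactly as submitted.
$output = strip_tags($input, '<p>');

echo $output, "\n";
```

The <b> tags are gone, yet the onmouseover handler is still in the output and would fire in a browser.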
Depending on the use, you might want to consider a bbcode library instead, or even writing a simple html->bbcode converter, then stripping all tags after it's run, _then_ running it through htmlspecialchars and back through bbcode->html again.
It's a pain, but it's really difficult to keep cross-site scripting out any other way. By doing a lossy conversion to bbcode, you ensure that anything not matching your regex gets stripped in the strip_tags stage. Anything remaining is converted to harmless HTML entities by htmlspecialchars(), and the formatting is put back through your bbcode->html regex, which you know produces clean HTML.
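The round trip described above could look roughly like this. It is a sketch under assumptions, not a finished sanitizer: the function name sanitize_via_bbcode and the three-tag whitelist (b, i, u) are invented for the example, and it does not check that tags are balanced.

```php
<?php
// Hypothetical helper illustrating html -> bbcode -> strip -> escape -> html.
// The whitelist (b, i, u) is an assumption chosen for the example.
function sanitize_via_bbcode(string $html): string
{
    // 1. Lossy html->bbcode: only bare tags without attributes match,
    //    so <b onclick="..."> is NOT converted and stays as HTML.
    $bb = preg_replace('#<(/?)(b|i|u)>#i', '[$1$2]', $html);

    // 2. Strip every tag that survived step 1.
    $bb = strip_tags($bb);

    // 3. Escape whatever special characters are left.
    $bb = htmlspecialchars($bb, ENT_QUOTES, 'UTF-8');

    // 4. bbcode->html: the only tags in the output are the ones built here.
    return preg_replace_callback(
        '#\[(/?)(b|i|u)\]#i',
        function ($m) {
            return '<' . $m[1] . strtolower($m[2]) . '>';
        },
        $bb
    );
}

echo sanitize_via_bbcode('<b>bold</b> <a href="#" onclick="evil()">link</a> Tom & Jerry'), "\n";
```

For that input the link tag and its onclick handler are dropped, the bold formatting survives, and the ampersand comes out as &amp;amp;. Anything the first regex does not recognise never reaches the final HTML as markup.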