Forum Moderators: coopster
I'm using strip_tags to clean up all the just-plain-outlawed tags; after that, however, I want to check the allowed tags to make sure no one's put something like, for example, <a onclick="eatBabies();"> or something.
I thought it would be so straight forward, but it's not. What do I need to do in order to leave the opening "<" and the tag name alone, but delete anything after the first occurrence of whitespace if it's, say, "on(click¦blur¦whatever)¦style" and leave the closing ">"? Everything I've tried has blown up in my face.
Many thanks for any assistance!
Regards,
Blake
I've set up a test page that basically looks something like this:
-----
$allowed = array('a', 'b', 'i', 'blockquote'); // Just an example!
// Only the bold and anchor tags should remain; the cite should be removed outright (it is), but the <a> should be sanitized.
$test = "<cite>This is a citation.</cite> <b>This is bold text.</b> <a href=\"javascript:void(0);\" onclick=\"eatBabies();\">Click here!</a>";
$test = trim(strip_tags($test), '<'.implode('>, <', $allowed).'>');
echo $test."<br />"; // Returns what you'd expect: cite is gone, b and a are intact -- a still has malicious code though.
$test = preg_replace('/(<[\/\!]*?[\d\w]+?)([\s]+?[\d\w]+[^<>][\s]*?)([^<>]>)/si', '${1}${3}', $test);
echo $test."<br />"; // Acts like the preg_replace didn't do anything (because it didn't, and I don't know why!)
-----
I've tried several variants on the search pattern, but nothing has worked correctly. Again, thanks for any assistance!
Regards,
Blake
[edited by: coopster at 6:29 pm (utc) on April 8, 2006]
[edit reason] Disable graphic smile faces [/edit]
$test = trim(strip_tags($test), '<'.implode('>, <', $allowed).'>');echo $test."<br />"; // Returns what you'd expect: cite is gone, b and a are intact -- a still has malicious code though.
Not quite, you have a parenthesis in the wrong spot there. I think you meant to have it as such:
$test = trim(strip_tags($test, '<'.implode('>, <', $allowed).'>')); $pattern = '/<([a-z!-]+)[^>]*(--)?>/is';
$test = preg_replace($pattern, "<$1>", $test);
$pattern = '/<([a-z]+)[^>]*>/is';I'm still not certain this is what you are going for though.
Here's the actual snippet I'm testing against:
-----
<?php
$allowed = array('a', 'b', 'blockquote', 'li', 'ol', 'strike', 'u', 'ul');
$string = "<cite>This is a citation</cite>. <a href=\"javascript:void(0);\" onclick=\"eatBabies('yes');\">This is anchored text</a>.";
$allowedTags = '<'.implode('>,<', $allowed).'>';
$tmpString = strip_tags($string, $allowedTags);
function doStuff($in) {
print_r($in) . "<br />\n";
}
preg_replace_callback('/(<[\/\!]*?[\w]+?)([^<>]*?)([^<>\"]*?>)/is', 'doStuff', $tmpString);
echo $tmpString . "<br />";
?>
-----
If you run this and look at what index 2 returns (in the first call), you'll see how it has both the href and the onclick, but all the other components (the opening and closing <>, the tag name [in this case, "a"]) are isolated...so I think I'm getting closer!
If I can find some way to allow the href component and excise anything else that ends up inside a tag (like onclick or style or something), I'll be where I want to be...
And yes, I'm using <a> as an example, but if someone tried to stick it in a <b> or something, I'd want that removed, too...
Phew! And it seems like such a trivial task...
Regards,
Blake
Oh yeah -- I know I'd have to actually assign $tmpString again ($tmpString = preg_replace_callback, etc.), but for testing, this works...well, shows me what I want to see, at least ;-) )
[edited by: coopster at 8:43 pm (utc) on April 8, 2006]
[edit reason] Disable graphic smile faces [/edit]
function removeAttributes($matches) {
$validAttributes = array(
'class',
'href',
'style'
);
return (in_array($matches[2], $validAttributes) ? ' ' . $matches[1] : '');
}
$pattern = '/\s*(([a-z]+)=[\'"]?[^\'">]+[\'">])/is';
$test = preg_replace_callback($pattern, 'removeAttributes', $test); name="value"pair and also slap a subpattern match on the
namepart only (note the second set of parentheses).
Now when we pass this to our callback all we have to do is make sure the
namewe grabbed is in our *valid attributes* list. If it is, it passes our test and we allow it through. If not, we pass back an empty string to remove it from our string.
You could do a lot more in this function now too. For instance, maybe we don't like that 'javascript' stuff in there. Peel it out.
function removeAttributes($matches) {
$validAttributes = array(
'class',
'href',
'style'
);
switch (true) {
case (!in_array($matches[2], $validAttributes)):
return '';
break;
case (stristr($matches[1], 'javascript')):
return '';
break;
default:
return ' ' . $matches[1];
}
}