preg_replace behaving badly?

Forum Moderators: coopster

Message Too Old, No Replies

preg_replace behaving badly?

I just want to prevent people from entering evil into allowable HTML tags!

Naphthalene

6:41 am on Apr 8, 2006 (gmt 0)

I'm relatively new to PHP -- and even newer to regex. I pretend to understand elements of it, but here's my latest problem.

I'm using strip_tags to clean up all the just-plain-outlawed tags; after that, however, I want to check the allowed tags to make sure no one's put something like, for example, <a onclick="eatBabies();"> or something.

I thought it would be so straight forward, but it's not. What do I need to do in order to leave the opening "<" and the tag name alone, but delete anything after the first occurrence of whitespace if it's, say, "on(click¦blur¦whatever)¦style" and leave the closing ">"? Everything I've tried has blown up in my face.

Many thanks for any assistance!

Regards,
Blake

coopster

2:19 pm on Apr 8, 2006 (gmt 0)

Welcome to WebmasterWorld, Naphthalene.

What have you tried so far? Let's see your best effort here and we'll help you figure out where it is failing you.

Naphthalene

6:19 pm on Apr 8, 2006 (gmt 0)

Thanks for the welcome, coopster! Here's some of what I've been trying...

I've set up a test page that basically looks something like this:

-----

$allowed = array('a', 'b', 'i', 'blockquote'); // Just an example!

// Only the bold and anchor tags should remain; the cite should be removed outright (it is), but the <a> should be sanitized.
$test = "<cite>This is a citation.</cite> This is bold text. <a href=\"javascript:void(0);\" onclick=\"eatBabies();\">Click here!</a>";

$test = trim(strip_tags($test), '<'.implode('>, <', $allowed).'>');

echo $test." "; // Returns what you'd expect: cite is gone, b and a are intact -- a still has malicious code though.

$test = preg_replace('/(<[\/\!]*?[\d\w]+?)([\s]+?[\d\w]+[^<>][\s]*?)([^<>]>)/si', '${1}${3}', $test);

echo $test." "; // Acts like the preg_replace didn't do anything (because it didn't, and I don't know why!)

-----

I've tried several variants on the search pattern, but nothing has worked correctly. Again, thanks for any assistance!

Regards,
Blake

[edited by: coopster at 6:29 pm (utc) on April 8, 2006]
[edit reason] Disable graphic smile faces [/edit]

Naphthalene

6:23 pm on Apr 8, 2006 (gmt 0)

Oops! Ignore the dollar sign before "implode" -- I got carried away! ;-)

coopster

6:53 pm on Apr 8, 2006 (gmt 0)

$test = trim(strip_tags($test), '<'.implode('>, <', $allowed).'>');
echo $test." "; // Returns what you'd expect: cite is gone, b and a are intact -- a still has malicious code though.

Not quite, you have a parenthesis in the wrong spot there. I think you meant to have it as such:

$test = trim(strip_tags($test, '<'.implode('>, <', $allowed).'>'));

Now it looks as though you are trying to not only scrub up <a> elements, but moreso than that. I'll assume that is the case from the way I understand the first message earlier. The next question is, do you want to just remove everything after the element name, meaning any and all attributes?

$pattern = '/<([a-z!-]+)[^>]*(--)?>/is'; 
$test = preg_replace($pattern, "<$1>", $test);

This pattern says to find any opening element (including a possible comment which won't matter because strip_tags is going to rip them out prior to this anyway) followed by zero or more of anything that is not the closing of the opening element (the > sign) followed by optional ending comment hyphens followed by the closing of the opening element. We could really rip the comment part out I think.

$pattern = '/<([a-z]+)[^>]*>/is';

I'm still not certain this is what you are going for though.

Naphthalene

7:57 pm on Apr 8, 2006 (gmt 0)

Hmm...That didn't quite do what I'd expected, either, but I appreciate the help!

Here's the actual snippet I'm testing against:
-----
<?php

$allowed = array('a', 'b', 'blockquote', 'li', 'ol', 'strike', 'u', 'ul');

$string = "<cite>This is a citation</cite>. <a href=\"javascript:void(0);\" onclick=\"eatBabies('yes');\">This is anchored text</a>.";

$allowedTags = '<'.implode('>,<', $allowed).'>';

$tmpString = strip_tags($string, $allowedTags);

function doStuff($in) {
print_r($in) . " \n";
}

preg_replace_callback('/(<[\/\!]*?[\w]+?)([^<>]*?)([^<>\"]*?>)/is', 'doStuff', $tmpString);

echo $tmpString . " ";

?>
-----

If you run this and look at what index 2 returns (in the first call), you'll see how it has both the href and the onclick, but all the other components (the opening and closing <>, the tag name [in this case, "a"]) are isolated...so I think I'm getting closer!

If I can find some way to allow the href component and excise anything else that ends up inside a tag (like onclick or style or something), I'll be where I want to be...

And yes, I'm using <a> as an example, but if someone tried to stick it in a or something, I'd want that removed, too...

Phew! And it seems like such a trivial task...

Regards,
Blake

Oh yeah -- I know I'd have to actually assign $tmpString again ($tmpString = preg_replace_callback, etc.), but for testing, this works...well, shows me what I want to see, at least ;-) )

[edited by: coopster at 8:43 pm (utc) on April 8, 2006]
[edit reason] Disable graphic smile faces [/edit]

coopster

10:14 pm on Apr 8, 2006 (gmt 0)

How about going with a callback? You could set an array of acceptable attributes and check for them as well. If they don't look good to you, trash them. You could always add to your callback function then too as opposed to having to deal with updating your regular expression all the time. Just grab your name/value pairs and go from there:

function removeAttributes($matches) { 
 $validAttributes = array( 
  'class', 
  'href', 
  'style' 
 ); 
 return (in_array($matches[2], $validAttributes) ? ' ' . $matches[1] : ''); 
} 
$pattern = '/\s*(([a-z]+)=[\'"]?[^\'">]+[\'">])/is'; 
$test = preg_replace_callback($pattern, 'removeAttributes', $test);

We'll match the whole

name="value"

pair and also slap a subpattern match on the

name

part only (note the second set of parentheses).

Now when we pass this to our callback all we have to do is make sure the

name

we grabbed is in our *valid attributes* list. If it is, it passes our test and we allow it through. If not, we pass back an empty string to remove it from our string.

You could do a lot more in this function now too. For instance, maybe we don't like that 'javascript' stuff in there. Peel it out.

function removeAttributes($matches) { 
 $validAttributes = array( 
  'class', 
  'href', 
  'style' 
 ); 
 switch (true) { 
 case (!in_array($matches[2], $validAttributes)): 
  return ''; 
  break; 
 case (stristr($matches[1], 'javascript')): 
  return ''; 
  break; 
 default: 
  return ' ' . $matches[1]; 
 } 
}