choice in regex, how do I filter for different cases in one line?

Do I use or, or can I use a pipe on it?

Baruch Menachem

9:37 pm on Jun 8, 2009 (gmt 0)

I have a little comment form on my web page. I have yet to get a real user, but I have started to get some comment spam.

What I have now is a separate pattern for each of the three bad cases:
$badregex="/blogosword/";
$theifpat="/<a href/";
$worseregex="/javascript/";

and a single response to all of them:

if ((preg_match("$badregex","$commentary" )) or (preg_match("$worseregex","$commentary")) or (preg_match("$theifpat","$commentary" )))
die ('#AO67FF: Radioactive ruby Gem error. Hazmat alert');

Can I have all three cases in one pattern and still have it work?

I don't like it this way, as I think it makes things noticeably slower.

eelixduppy

12:49 am on Jun 9, 2009 (gmt 0)

The pipe symbol (|) is used for this purpose within the pattern. E.g.:

$pattern = "/pattern1|pattern2|pattern3/";

Just note that the vertical bar (|) should be solid and not broken. The forum software breaks these when written.
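Applied to the three patterns from the original question, a minimal sketch might look like the following. The function name is_spam and the sample strings are made up for illustration, and the trailing i flag is an added assumption so that variants like "JavaScript" or "<A HREF" are caught as well.

```php
<?php
// Hypothetical helper: one pattern with three alternatives
// instead of three separate preg_match() calls.
function is_spam($commentary)
{
    // Each | separates one alternative; the trailing i makes the match case-insensitive.
    $badpattern = '/blogosword|<a href|javascript/i';
    return preg_match($badpattern, $commentary) === 1;
}

var_dump(is_spam('Nice post, see <a href="http://example.com">this</a>')); // bool(true)
var_dump(is_spam('I enjoyed reading this page.'));                          // bool(false)
```

Whether this is measurably faster than three separate calls depends on the input; for a handful of short patterns the difference is likely to be small either way.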

Baruch Menachem

1:26 am on Jun 9, 2009 (gmt 0)

Thanks.

Annoying that the spammers have found me before anyone else.

I know the spam mutates like fruit flies to get around solutions like this, and it does look like regex slows things down a little. What do you recommend as good patterns to block, and what is a reasonable limit on the number of expressions to look out for?

eelixduppy

1:46 am on Jun 9, 2009 (gmt 0)

Well it depends on what kind of spam you are getting in your form. If the spam is actually typed in by a person then it is going to be difficult to determine if a comment is spam or not and the only thing you'd be able to go by are some keywords or badwords that you could filter out. If there is some sort of bot hijacking your form, however, then there are some techniques you can use to minimize them.

[webmasterworld.com...]

This thread goes over a few ideas and can be found in the PHP forum's library. If you do a search around the boards [google.com] you should find much more information about this that you can try to implement to see if it works for your case.

Baruch Menachem

2:05 pm on Jun 9, 2009 (gmt 0)

So far it looks hand typed. Which is why I did the "radioactive ruby alert" foolishness.

I read that link. It was really useful. I am ready for the bots when they march in and over the cliff of my hidden fields.

I hate dealing with captcha, so I don't want to go that route, even if I could at this time. And I am desperate for someone to actually look at the thing and comment, so I don't want to chase away the legitimate viewers.

Makes you wonder about the mentality of the people who do this. Why work so hard at being hated?

rocknbil

7:00 pm on Jun 9, 2009 (gmt 0)

what is the reasonable limit of expressions to look out for?

First, you're sort of on the right track with this:

$theifpat="/<a href/";
$worseregex="/javascript/";

But what about

<a href=
[a href=
%5B a href
[url=
[link =
scri+pt

And on and on, ad nauseam?

What you have here is a never-ending chase of trying to plug the holes as they arise.

The short story: store your "bad patterns" in an array; look for them and exit immediately if found, but don't stop there. It's always an easier task to only accept what you want instead of constantly trying to stop what you don't want. Filter for what you want to allow, and out of whatever's left, swap out any potentially dangerous characters for harmless equivalents.

My philosophy is to first understand "the enemy," to find out what motivates them:

Makes you wonder about the mentality of the people who do this. Why work so hard at being hated?

I have always said that a good form processor has to do one thing: log all raw input data. This will be different than what you get in server logs. It's easy. Open a file in a secure non-public location, dump the input, THEN go on to cleanse. Review it regularly. After a time, particular patterns (as in, actions, not regexps) begin to form. This will lead to an answer to this question, over time.
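The logging step can be as small as the sketch below. The function name log_raw_input and the commented-out log path are hypothetical; any location outside the public web root that the web server can write to would do. It assumes PHP 5 or later, since it relies on file_put_contents().

```php
<?php
// Hypothetical helper: append one raw, unfiltered submission to a private log file.
function log_raw_input(array $fields, $logfile)
{
    $entry = date('Y-m-d H:i:s') . "\n";
    foreach ($fields as $key => $value) {
        // print_r() with true returns a string, so array-valued fields are logged too.
        $entry .= $key . ': ' . print_r($value, true) . "\n";
    }
    $entry .= "----\n";
    // FILE_APPEND adds to the end of the file; LOCK_EX avoids interleaved writes.
    return file_put_contents($logfile, $entry, FILE_APPEND | LOCK_EX) !== false;
}

// Assumed path for illustration only; keep the real file out of the public web root.
// log_raw_input($_POST, '/home/example/private/form_input.log');
```

Calling this before any filtering keeps the log faithful to what was actually submitted.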

The #1 reason for spamming forms IME: link drops. And it must work, or they wouldn't do it.

So I approach this as follows. Note that this uses methods for PHP 4 compatibility which doesn't have the advantage of 5's filter_input_array().

1. LOG raw input first. This is where you sift through input for potential malicious data. If found, return "no email sent" (or similar) to the browser. Spammers most surely log the responses to their attacks, since the people paying them wouldn't pay without some indication of effectiveness. This simple response is their clue that the attack won't work here and that they should move on.

2. Instead of storing bad patterns in variables, I suggest using an array. This way, when you go to add a new one, it only needs to be added in one place. In the following example, once the spammers figured out their spam resource was gone, they began spamming the form with "good site, admin" for no other reason than to annoy the crap out of them.

Note the liberal use of \s* (zero or more white space characters) in the regular expressions. This thwarts a lot of "workaround" bad patterns, such as 1 = 1, even though the standard SQL injection attack is 1=1.


$bad_patterns = Array ( 
 '\[\s*a\s*href.*\]*', 
 '\%5B\s*a\s*href.*(\%5D)*', 
 '\<\s*a\s*href.*\>*', 
 '\%3C\s*a\s*href.*(\%3E)*', 
 's\s*c\s*r\s*i\s*p\s*t', 
 // Some for sql injection 
 'drop\s+\w+', 
 'insert\s*into', 
 '\s*or\s*\d+\s*=*\s*\d*', 
 '\s*and\s*\d+\s*=*\s*\d*', 
 'update\s*', 
 'alter\s*', 
 // This list is actually 20 or so patterns, some 
 // removed to condense post 
 'good\s*site\s*,*\s*admin' 
);
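To tie this back to the original question: a list like the one above can also be collapsed into a single alternation and tested with one preg_match() call. The sketch below uses a shortened copy of the list so it runs on its own; the helper name matches_bad_pattern and the sample inputs are assumptions for illustration. It also shows how \s* lets the spaced-out variants match.

```php
<?php
// A shortened copy of the list above, so this sketch runs on its own.
$bad_patterns = array(
    '<\s*a\s*href',
    '\[\s*a\s*href',
    's\s*c\s*r\s*i\s*p\s*t',
    'insert\s*into',
);

// Wrap each entry in a non-capturing group, then join the groups with | .
$combined = '/(?:' . implode(')|(?:', $bad_patterns) . ')/i';

// Hypothetical helper name; returns true when any pattern matches.
function matches_bad_pattern($combined, $text)
{
    return preg_match($combined, $text) === 1;
}

var_dump(matches_bad_pattern($combined, '< a  href=http://example.com')); // bool(true)
var_dump(matches_bad_pattern($combined, 'try this s c r i p t'));         // bool(true)
var_dump(matches_bad_pattern($combined, 'Lovely photos, thank you.'));    // bool(false)
```

The trade-off is that a single combined pattern only says that something matched, not which entry did; looping over the array keeps that detail for the log.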

I realize filtering bad patterns is a contradiction of the "accept only what you want" philosophy, but since most of "what you want" can be abused to form malicious patterns, some chasing will always be required. The intent is to minimize the amount of pattern-chasing.

I trap these in the logging routine. If found, there is no reason to go any further, don't need to know anything else.


$spam_in = 0; 
$input_content = ''; // raw input, written to the log below 
$trap = '';          // record of which fields tripped a pattern 
foreach ($_POST as $key => $value) { 
 // Log each field once, before any filtering. 
 $input_content .= $key . ": " . $value . "\n"; 
 foreach ($bad_patterns as $v) { 
  if (preg_match("/$v/i", $value)) { 
   $trap .= "SPAM: $value found in " . $key . " field.\n"; 
   $spam_in = 1; 
   break; // one hit is enough for this field 
  } 
 } 
} 
// write $input_content to file here 
// If $spam_in == 1, terminate with error message previously described.

3. Continue on with an aggressive cleansing, removing anything but what you want. Unless your form is an "add your url" form, there is no legitimate reason for a url or html of any kind in your public forms. Not one. Simply put: nothing but letters, numbers, and basic punctuation is allowed (caveat: there are legitimate uses for other characters, such as é, add these to this simplified example.)

The following is a bit loose, as I go on to remove @ from anything but an email field, exchange % for the word "percent":


$allowed = 'A-Z0-9"\'\%\.\,\$\@\!\(\)\=\-\_\&\;\s'; 
foreach ($_POST as $key=>$value) { 
 $_POST[$key] = preg_replace("/[^$allowed]+/i",'',$value); 
}

That is: remove anything that is NOT in my $allowed pattern. There are functions in PHP that do this for you, but personally, I like to see what my code is doing instead of feeding it to a "black box."
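As a rough sketch of the same idea wrapped in a function: the name clean_field is hypothetical, and swapping A-Z0-9 for \p{L}\p{N} with the u flag is an assumption on my part, one possible way to handle the é caveat mentioned above for UTF-8 input.

```php
<?php
// Hypothetical helper: keep only whitelisted characters in one field.
function clean_field($value)
{
    // \p{L} (any Unicode letter) and \p{N} (any Unicode number) are an assumed
    // extension so that characters such as é survive; the u flag treats the
    // pattern and the input as UTF-8.
    $allowed = '\p{L}\p{N}"\'%.,$@!()=\-_&;\s';
    $cleaned = preg_replace('/[^' . $allowed . ']+/u', '', $value);
    // preg_replace() returns null on error (for example invalid UTF-8);
    // fall back to an empty string in that case.
    return $cleaned === null ? '' : $cleaned;
}

echo clean_field('Hello <b>world</b>!'), "\n"; // Hello bworldb!
```

Note that only the angle brackets and slash are removed, so tag names are left behind as plain letters; that is harmless, but it is worth knowing when reviewing the cleaned output.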

Yes it's not the PHP'ish approach. Yes it's "a little more work." But I can see more clearly what it's doing, which is why I do it this way.

Alter as you wish.

Baruch Menachem

5:17 pm on Jun 17, 2009 (gmt 0)

Thanks. That was a cool tutorial. A bit above where I am right now. <saving page> But good ideas.