Fun with regex: matching when the user is trying to work around the filter

Still playing with my profanity filter :-)

A common issue is users substituting alternate characters for recognized ones to get around the filter. Examples include using @ instead of a, $ instead of s, ! or l (lowercase L) instead of i, ! or I (uppercase i) instead of L, and so on.

I have a ton of regex written to filter out profanity, and I duplicate the same things in each one of them:

[s\$]
[a\@]
[il!]
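A minimal sketch of what that duplication looks like in practice. The words "sail" and "list" are hypothetical stand-ins rather than entries from the real filter list, and `is_filtered` is just an illustrative helper name:

```perl
use strict;
use warnings;

# Hypothetical stand-in words; note how each pattern repeats the same classes
my @filters = (
  qr/\A[s\$][a\@][il!]l\z/i,   # "sail"
  qr/\Al[il!][s\$]t\z/i,       # "list"
);

# Returns the number of filters that match the given word
sub is_filtered {
  my ($word) = @_;
  return scalar grep { $word =~ $_ } @filters;
}

print is_filtered('$@!l') ? "blocked\n" : "allowed\n";
# blocked
```

Adding a new substitute for i here means editing the `[il!]` class in every pattern that contains it.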

I recently had to deal with a new variation, and have spent most of my day modifying all of the filters to catch it :-/

As a long-term fix, I'm thinking of a way to apply these variations at the beginning of the filter, so that I only have to modify them once.

My initial thought is:

1. Create an associative array with all of the potential workarounds; e.g., '$' => 's', '@' => 'a', and so on.

2. Split the string by \s

3. Loop through the new array, apply the workarounds from the associative array to each word, then perform a substitution if the modified word matches a filtered word

Something like:

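use strict;
use warnings;

my %badwords = (
  'cat' => 1,
  'dog' => 1,
);

# Map each substitute character to the letter it stands in for
my %workarounds = (
  '$' => 's',
  '@' => 'a',
  'l' => 'i',
  '!' => 'i',
  'j' => 'i',
  '+' => 't',
);

my $text = 'this is a c@+';

my @arr = split(' ', $text);

foreach my $original (@arr) {
  my $word = lc $original;

  # Apply every workaround to all occurrences before checking the word
  foreach my $key (keys %workarounds) {
    $word =~ s/\Q$key\E/$workarounds{$key}/g;
  }

  # Mask the original spelling if the normalized word is on the list
  if (exists $badwords{$word}) {
    $text =~ s/\Q$original\E/****/;
  }
}

print $text;
# this is a ****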

This works, but given a post of 5,000 words and running each word over a ton of regex filters, this would be SUPER slow!
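One possible direction, offered as a rough sketch rather than a tested solution: keep the variations in a single table, generate each pattern from the plain word, and join everything into one regex that is compiled once and run over the whole post in a single pass. The names `%variants`, `word_to_pattern`, and `censor` are hypothetical, the table is keyed by letter (the reverse of `%workarounds`), and "word" is assumed to mean a whitespace-delimited token, as with the `split` above. I have not benchmarked this against the loop.

```perl
use strict;
use warnings;

# Hypothetical word list and substitution table, reusing the examples above.
# Each letter maps to the characters users substitute for it.
my @badwords = ('cat', 'dog');
my %variants = (
  's' => '$',
  'a' => '@',
  'i' => 'l!j',
  't' => '+',
);

# Turn a plain word into a pattern: each letter becomes a class of
# itself plus its known substitutes, e.g. "cat" -> c[a\@][t\+]
sub word_to_pattern {
  my ($word) = @_;
  return join '', map {
    my $extra = $variants{$_} // '';
    length($extra) ? '[' . quotemeta($_ . $extra) . ']' : quotemeta($_);
  } split //, lc $word;
}

# Build one alternation for the whole list and compile it a single time.
# The lookarounds require whitespace (or the string edge) on both sides.
my $alternation = join '|', map { word_to_pattern($_) } @badwords;
my $filter = qr/(?<!\S)(?:$alternation)(?!\S)/i;

# Replace every filtered word in the text with ****
sub censor {
  my ($text) = @_;
  $text =~ s/$filter/****/g;
  return $text;
}

print censor('this is a c@+'), "\n";
# this is a ****
```

With this layout, a new variation is a one-character change to `%variants`, and the word list stays in plain spelling.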

Any other suggestions?

csdude55
