Profanity Filter mostly finished, free to use and open for feedback

I've been working on this for almost 3 weeks, and I THINK it's all good now! It's successfully filtering on all of my tests, anyway.

Summary: I have a list of profane / vulgar terms with some simple regex patterns accepted. When a user submits data I run fields through this to convert any matches to ****.

I also try to prevent workarounds; eg, instead of typing "foo" they might type:

f0o
f-o-o
f<br>o<br>o

and so on.

To use, just modify @pattern to include your list of terms to filter, everything else should be ready to go. I allow for [...] in @pattern, but the script automatically checks for the alternates under %characterClasses so no need to duplicate it. I also allow for named groups and negative lookahead in @pattern.

Average processing time is 2.1129s over 1000 iterations. You can shave a little off of that by removing the line breaks in the expressions and then removing the /x modifier, but I left those here for readability. That's still a bit slower than I'd like, so please feel free to post back with any modifications to boost performance!

# Usage
# $filtered = filter($unfiltered);

sub filter {
 ($_) = @_;

 # Words to match
 #  here you can hard code your list as an array
 @pattern = (
  'foo',
  'bar',
  '(?<ONE>lorem)ipsum'
 );

 # create | delimited string, then add likely optional suffixes
 $pattern = '(?:' . join('|', @pattern) . ')(?:e[dr]|ing?|e?s|y)*';

 # Catch non-alpha and HTML tags
 #  if you use BB code, you might add [[^\]+] here
 $ph[0] = '(?:<.+?>|\W)*';

 # Letter alternatives
 %characterClasses = (
  'a' => '@*',
  'e' => '3',
  'i' => '!1y',
  'o' => '0',
  'u' => 'v',
  's' => '\$z',
  't' => '+'
 );

 $chars = join('', keys %characterClasses);

 $x = 1;

 # Convert [...] and (?!...) to ī$xī
 # add other things here that you choose to use in the
 # list of @pattern that you don't want to be modified
 while ($pattern =~ m{(
  # named groups
  \Q(?<\E.+?\Q>\E |

  # character classes
  \[.+?\]   |

  # negative lookahead
  \Q(?!\E.+?\Q)\E)}xg) {
   $ph[$x] = $1;

   $pattern =~ s/\Q$1\E/ī$xī/xg;
   $x++;
 }

 # Convert letter alternatives
 $pattern =~ s/([$chars])/[$1$characterClasses{$+}]/gi;

 #########################################
 # Substitute exact matches

 # Convert ī$xī back to original
 ($easy = $pattern) =~ s/ī(\d+)ī/$ph[$+]/g;

 # Substitute
 # [[:punct:]] =>
 #  [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]
 s{(\b|^|[[:punct:]]|[\s\$])(?:$easy)(\b|[[:punct:]]|\s|$)}
  {$1$+{ONE}****$+}gi;


 #########################################
 # Substitute \W or <.+?> delimited

 # Convert new [...] to ī$xī
 #  done in this order to prevent nested [...[...]...]
 while ($pattern =~ m{(\[.+?\])}g) {
  $ph[$x] = $1;

  $pattern =~ s/\Q$1\E/ī$xī/g;
  $x++;
 }

 # Prepend delimiter to optional group
 $pattern =~ s{
  (
   (?:
    ī\d+ī  |
    [a-z]
   )
   # removed +, I don't remember if I need this
   [?*]?
  )
  (
   \Q(?:\E
    [^|]*?
   \)
  )
  ([?*])
  (
   \| |
   $
  )
 }
 {$1(?:ī0ī$2)$3$4}xsg;

 # Add delimiter if the letter or group isn't repeated or optional
 $pattern =~ s{
  (
   ī\d+ī |
   [a-z]
  )
  (?!
   \Q(?:\Eī0ī  |
   \*  |
   \?  |
   \+  |

   # removed ? after )
   \)\| |
   \)$
  )
 }
 {$1ī0ī}xsg;

 # Add delimiter if it IS repeated or optional, but not at the end of the list
 $pattern =~ s{
  (
   ī\d+ī |
   [a-z]
  )
  ([*?+])
  (?!
   \Q(?:\E[^)]+\)[?*]\|  |
   \Q(?:\E[^)]+\)[?*]\$  |
   \)*\|  |
   \)*$
  )
 }
 {(?:$1ī0ī)$2}xg;

 # Convert ī$xī back to original
 $pattern =~ s/ī(\d+)ī/$ph[$+]/g;

 # Final substitution, auto-fix trailing \W
 my $new;
 while (/(?:\b|^|\W|\$)($pattern)\b/xgi) {
  ($temp = $1) =~ s/\W+$//;
 
  if ($new) { $new .= '|'; }
  $new .= quotemeta($temp);
 }

 if ($new) {
  s/(\b|^|\W|\$)(?:$new)\b/$1****/gi;
 }

 return $_;
 1;
}

Profanity Filter mostly finished, free to use and open for feedback

csdude55

csdude55

Join The Conversation

Moderators and Top Contributors

Hot Threads This Week