Still playing with my profanity filter :-)
A common issue is users substituting look-alike characters for the recognized ones to get around the filter. Examples include using @ instead of a, $ instead of s, ! or l (lowercase L) instead of i, ! or I (uppercase i) instead of L, and so on.
I have a ton of regexes written to filter out profanity, and I duplicate the same character classes in each one of them:
[s\$]
[a\@]
[il!]
I recently had to deal with a new variation, and have spent most of my day modifying all of the filters to catch it :-/
As a long-term fix, I'm thinking of a way to apply these variations once, at the beginning of the filter, so that I only have to modify them in one place.
My initial thought is:
1. Create an associative array with all of the potential workarounds; e.g., '$' => 's', '@' => 'a', and so on.
2. Split the string on whitespace (\s)
3. Loop over the resulting words, apply each workaround from the associative array, and mask the original word if the modified word matches a filtered word
Something like:
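use strict;
use warnings;

# Filtered words, stored as hash keys for fast lookup
my %badwords = (
    'cat' => 1,
    'dog' => 1,
);

# Look-alike character => the letter it stands in for
my %workarounds = (
    '$' => 's',
    '@' => 'a',
    'l' => 'i',
    '!' => 'i',
    'j' => 'i',
    '+' => 't',
);

my $text = 'this is a c@+';

foreach my $original (split ' ', $text) {
    my $word = $original;
    foreach my $key (keys %workarounds) {
        # Apply one workaround (all occurrences), then re-check the word list
        if ($word =~ s/\Q$key\E/$workarounds{$key}/gi && exists $badwords{lc $word}) {
            # Mask the original spelling in the full text
            $text =~ s/\Q$original\E/****/i;
        }
    }
}

print "$text\n";
# this is a ****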
This works, but for a post of 5,000 words, with every word run through every workaround and then a ton of regex filters, it would be SUPER slow!
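One direction I'm considering, as a rough sketch only: normalize the whole post once with tr///, then scan the normalized copy with a single precompiled regex instead of looping per word and per workaround. The normalize and censor subs and the $bad_re name are made up for this example, and the word list is the same toy list as above. It assumes every workaround is a single character standing in for a single letter, so offsets in the normalized copy line up with the original text.

```perl
use strict;
use warnings;

# Toy word list, same as in the example above.
my @badwords = qw(cat dog);

# Lowercase ASCII letters, then fold each look-alike character into the
# letter it stands in for. tr/// does not interpolate, so $ and @ are
# literal here, and it maps character-for-character, so the string
# length never changes.
sub normalize {
    my ($s) = @_;
    $s =~ tr/A-Z/a-z/;
    $s =~ tr/$@l!j+/saiiit/;
    return $s;
}

# The bad words go through the same normalization so both sides agree,
# then they are compiled into one alternation.
my $alternation = join '|', map { quotemeta normalize($_) } @badwords;
my $bad_re      = qr/(?<![a-z])(?:$alternation)(?![a-z])/;

sub censor {
    my ($text) = @_;
    my $norm = normalize($text);

    # Record offset and length of every hit in the normalized copy.
    my @spans;
    while ($norm =~ /$bad_re/g) {
        push @spans, [ $-[0], $+[0] - $-[0] ];
    }

    # Mask the original text from right to left so that earlier
    # offsets stay valid after each replacement.
    for my $span (reverse @spans) {
        substr($text, $span->[0], $span->[1], '****');
    }
    return $text;
}

print censor('this is a c@+'), "\n";
# this is a ****
```

With this layout a new workaround would only mean adding one character to each side of the second tr///. One weakness I can already see: a trailing ! is read as i, so "cat!" normalizes to "cati" and slips through.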
Any other suggestions?