Forum Moderators: coopster & phranque

Message Too Old, No Replies

Profanity Filter mostly finished, free to use and open for feedback

         

csdude55

12:38 am on Nov 3, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've been working on this for almost 3 weeks, and I THINK it's all good now! It's successfully filtering on all of my tests, anyway.

Summary: I have a list of profane / vulgar terms with some simple regex patterns accepted. When a user submits data I run fields through this to convert any matches to ****.

I also try to prevent workarounds; eg, instead of typing "foo" they might type:

f0o
f-o-o
f<br>o<br>o

and so on.

To use, just modify @pattern to include your list of terms to filter, everything else should be ready to go. I allow for [...] in @pattern, but the script automatically checks for the alternates under %characterClasses so no need to duplicate it. I also allow for named groups and negative lookahead in @pattern.

Average processing time is 2.1129s over 1000 iterations. You can shave a little off of that by removing the line breaks in the expressions and then removing the /x modifier, but I left those here for readability. That's still a bit slower than I'd like, so please feel free to post back with any modifications to boost performance!

# Usage
# $filtered = filter($unfiltered);

sub filter {
($_) = @_;

# Words to match
# here you can hard code your list as an array
@pattern = (
'foo',
'bar',
'(?<ONE>lorem)ipsum'
);

# create | delimited string, then add likely optional suffixes
$pattern = '(?:' . join('|', @pattern) . ')(?:e[dr]|ing?|e?s|y)*';

# Catch non-alpha and HTML tags
# if you use BB code, you might add [[^\]+] here
$ph[0] = '(?:<.+?>|\W)*';

# Letter alternatives
%characterClasses = (
'a' => '@*',
'e' => '3',
'i' => '!1y',
'o' => '0',
'u' => 'v',
's' => '\$z',
't' => '+'
);

$chars = join('', keys %characterClasses);

$x = 1;

# Convert [...] and (?!...) to ï$xï
# add other things here that you choose to use in the
# list of @pattern that you don't want to be modified
while ($pattern =~ m{(
# named groups
\Q(?<\E.+?\Q>\E |

# character classes
\[.+?\] |

# negative lookahead
\Q(?!\E.+?\Q)\E)}xg) {
$ph[$x] = $1;

$pattern =~ s/\Q$1\E/ï$xï/xg;
$x++;
}

# Convert letter alternatives
$pattern =~ s/([$chars])/[$1$characterClasses{$+}]/gi;

#########################################
# Substitute exact matches

# Convert ï$xï back to original
($easy = $pattern) =~ s/ï(\d+)ï/$ph[$+]/g;

# Substitute
# [[:punct:]] =>
# [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]
s{(\b|^|[[:punct:]]|[\s\$])(?:$easy)(\b|[[:punct:]]|\s|$)}
{$1$+{ONE}****$+}gi;


#########################################
# Substitute \W or <.+?> delimited

# Convert new [...] to ï$xï
# done in this order to prevent nested [...[...]...]
while ($pattern =~ m{(\[.+?\])}g) {
$ph[$x] = $1;

$pattern =~ s/\Q$1\E/ï$xï/g;
$x++;
}

# Prepend delimiter to optional group
$pattern =~ s{
(
(?:
ï\d+ï |
[a-z]
)
# removed +, I don't remember if I need this
[?*]?
)
(
\Q(?:\E
[^|]*?
\)
)
([?*])
(
\| |
$
)
}
{$1(?:ï0ï$2)$3$4}xsg;

# Add delimiter if the letter or group isn't repeated or optional
$pattern =~ s{
(
ï\d+ï |
[a-z]
)
(?!
\Q(?:\Eï0ï |
\* |
\? |
\+ |

# removed ? after )
\)\| |
\)$
)
}
{$1ï0ï}xsg;

# Add delimiter if it IS repeated or optional, but not at the end of the list
$pattern =~ s{
(
ï\d+ï |
[a-z]
)
([*?+])
(?!
\Q(?:\E[^)]+\)[?*]\| |
\Q(?:\E[^)]+\)[?*]\$ |
\)*\| |
\)*$
)
}
{(?:$1ï0ï)$2}xg;

# Convert ï$xï back to original
$pattern =~ s/ï(\d+)ï/$ph[$+]/g;

# Final substitution, auto-fix trailing \W
my $new;
while (/(?:\b|^|\W|\$)($pattern)\b/xgi) {
($temp = $1) =~ s/\W+$//;

if ($new) { $new .= '|'; }
$new .= quotemeta($temp);
}

if ($new) {
s/(\b|^|\W|\$)(?:$new)\b/$1****/gi;
}

return $_;
1;
}

csdude55

6:26 pm on Nov 3, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Eek, I forgot to copy a crucial step! I have this elsewhere in the script, but it needs to go somewhere before the first while() loop:

s/ï/i/g;


Since the function uses ï$xï as a placeholder, I convert all ï (with 2 dots, called an "umlaut") to i (no umlaut); essentially the same visually, but there's no chance of an accidental match.

If your users use umlauts then you can change the delimiter in ï$xï to something else that wouldn't likely be used that you can remove before it starts.