I've been working on this for almost 3 weeks, and I THINK it's all good now! It's successfully filtering on all of my tests, anyway.
Summary: I have a list of profane / vulgar terms with some simple regex patterns accepted. When a user submits data I run fields through this to convert any matches to ****.
I also try to prevent workarounds; eg, instead of typing "foo" they might type:
f0o
f-o-o
f<br>o<br>o
and so on.
To use, just modify @pattern to include your list of terms to filter, everything else should be ready to go. I allow for [...] in @pattern, but the script automatically checks for the alternates under %characterClasses so no need to duplicate it. I also allow for named groups and negative lookahead in @pattern.
Average processing time is 2.1129s over 1000 iterations. You can shave a little off of that by removing the line breaks in the expressions and then removing the /x modifier, but I left those here for readability. That's still a bit slower than I'd like, so please feel free to post back with any modifications to boost performance!
# Usage
# $filtered = filter($unfiltered);
sub filter {
($_) = @_;
# Words to match
# here you can hard code your list as an array
@pattern = (
'foo',
'bar',
'(?<ONE>lorem)ipsum'
);
# create | delimited string, then add likely optional suffixes
$pattern = '(?:' . join('|', @pattern) . ')(?:e[dr]|ing?|e?s|y)*';
# Catch non-alpha and HTML tags
# if you use BB code, you might add [[^\]+] here
$ph[0] = '(?:<.+?>|\W)*';
# Letter alternatives
%characterClasses = (
'a' => '@*',
'e' => '3',
'i' => '!1y',
'o' => '0',
'u' => 'v',
's' => '\$z',
't' => '+'
);
$chars = join('', keys %characterClasses);
$x = 1;
# Convert [...] and (?!...) to ï$xï
# add other things here that you choose to use in the
# list of @pattern that you don't want to be modified
while ($pattern =~ m{(
# named groups
\Q(?<\E.+?\Q>\E |
# character classes
\[.+?\] |
# negative lookahead
\Q(?!\E.+?\Q)\E)}xg) {
$ph[$x] = $1;
$pattern =~ s/\Q$1\E/ï$xï/xg;
$x++;
}
# Convert letter alternatives
$pattern =~ s/([$chars])/[$1$characterClasses{$+}]/gi;
#########################################
# Substitute exact matches
# Convert ï$xï back to original
($easy = $pattern) =~ s/ï(\d+)ï/$ph[$+]/g;
# Substitute
# [[:punct:]] =>
# [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]
s{(\b|^|[[:punct:]]|[\s\$])(?:$easy)(\b|[[:punct:]]|\s|$)}
{$1$+{ONE}****$+}gi;
#########################################
# Substitute \W or <.+?> delimited
# Convert new [...] to ï$xï
# done in this order to prevent nested [...[...]...]
while ($pattern =~ m{(\[.+?\])}g) {
$ph[$x] = $1;
$pattern =~ s/\Q$1\E/ï$xï/g;
$x++;
}
# Prepend delimiter to optional group
$pattern =~ s{
(
(?:
ï\d+ï |
[a-z]
)
# removed +, I don't remember if I need this
[?*]?
)
(
\Q(?:\E
[^|]*?
\)
)
([?*])
(
\| |
$
)
}
{$1(?:ï0ï$2)$3$4}xsg;
# Add delimiter if the letter or group isn't repeated or optional
$pattern =~ s{
(
ï\d+ï |
[a-z]
)
(?!
\Q(?:\Eï0ï |
\* |
\? |
\+ |
# removed ? after )
\)\| |
\)$
)
}
{$1ï0ï}xsg;
# Add delimiter if it IS repeated or optional, but not at the end of the list
$pattern =~ s{
(
ï\d+ï |
[a-z]
)
([*?+])
(?!
\Q(?:\E[^)]+\)[?*]\| |
\Q(?:\E[^)]+\)[?*]\$ |
\)*\| |
\)*$
)
}
{(?:$1ï0ï)$2}xg;
# Convert ï$xï back to original
$pattern =~ s/ï(\d+)ï/$ph[$+]/g;
# Final substitution, auto-fix trailing \W
my $new;
while (/(?:\b|^|\W|\$)($pattern)\b/xgi) {
($temp = $1) =~ s/\W+$//;
if ($new) { $new .= '|'; }
$new .= quotemeta($temp);
}
if ($new) {
s/(\b|^|\W|\$)(?:$new)\b/$1****/gi;
}
return $_;
1;
}