Forum Moderators: coopster & phranque

Profanity filter: matching when using a special character

         

csdude55

7:49 am on Jul 6, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is becoming a challenge, and I'm curious whether you guys and gals have any suggestions or feedback.

I'm specifically working on a profanity filter for my message board, replacing bad words with ****. People occasionally try to get around the filter, though, so I'm trying to figure out a way to intuitively filter when someone uses a special character in place of a real letter.

For example:

@ss
@$$
$h!+

Or worse:

s÷%t (in context the meaning is clear, but replacing the à with A just turns it to gibberish)

But I DON'T want to catch, for example:

@gmail
#foo
you're
No!This (no space after the !)

I'm already manually swapping some characters to letters, like so:

%asciiChars = (

# Upside down
'592' =>'a',
'596' =>'c',
'477' =>'e',
'607' =>'f',
'613' =>'h',
'305' =>'i',
'1592' =>'j',
'670' =>'k',
'1503' =>'l',
'623' =>'m',
'633' =>'r',
'647' =>'t',
'652' =>'v',
'653' =>'w',
'654' =>'y',

# Uppercase
'65' =>'A',
'66' =>'B',
'67' =>'C',
'68' =>'D',
'69' =>'E',
'70' =>'F',
'71' =>'G',
'72' =>'H',
'73' =>'I',
'74' =>'J',
'75' =>'K',
'76' =>'L',
'77' =>'M',
'78' =>'N',
'79' =>'O',
'80' =>'P',
'81' =>'Q',
'82' =>'R',
'83' =>'S',
'84' =>'T',
'85' =>'U',
'86' =>'V',
'87' =>'W',
'88' =>'X',
'89' =>'Y',
'90' =>'Z',

# Lowercase
'97' =>'a',
'98' =>'b',
'99' =>'c',
'100' =>'d',
'101' =>'e',
'102' =>'f',
'103' =>'g',
'104' =>'h',
'105' =>'i',
'106' =>'j',
'107' =>'k',
'108' =>'l',
'109' =>'m',
'110' =>'n',
'111' =>'o',
'112' =>'p',
'113' =>'q',
'114' =>'r',
'115' =>'s',
'116' =>'t',
'117' =>'u',
'118' =>'v',
'119' =>'w',
'120' =>'x',
'121' =>'y',
'122' =>'z',

# Special chars
'263' =>'c',
'347' =>'s'
);

foreach $key (keys %asciiChars) {
$mod = '&#' . $key . ';';
$text =~ s/$mod/$asciiChars{$key}/gi;
}
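
For what it's worth, the loop above runs one substitution per table entry. A minimal sketch of doing the same thing in a single pass follows; the %lookalike table is only a hypothetical subset of the one above, and decode_lookalikes is a made-up name. It assumes the entities always arrive in plain decimal form with no leading zeros.

```perl
use strict;
use warnings;

# Hypothetical subset of the lookup table above: code point => plain letter.
my %lookalike = (
    592 => 'a',
    477 => 'e',
    647 => 't',
    263 => 'c',
    347 => 's',
);

# Replace each decimal entity that is in the table; leave every other entity alone.
sub decode_lookalikes {
    my ($text) = @_;
    $text =~ s/&#(\d+);/exists $lookalike{$1} ? $lookalike{$1} : "&#$1;"/ge;
    return $text;
}

print decode_lookalikes('&#592;&#347;&#347; costs &#8364;5'), "\n";
```

Entities that are not in the table (the euro sign in the sample) pass through unchanged, so the table only needs the look-alikes.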


And I tried this tonight but it threw an error, so I need to play with it a little:

$text =~ s/ï/i/;
$text =~ s/ö/o/;
$text =~ s/[š$]/s/;
$text =~ s/¥/y/;


Before I keep going down this rabbit hole, trying to find every possible variation and swapping it, can you guys suggest a better way to find when the user is trying to get around the filter?

fishmonger

4:04 pm on Jul 6, 2019 (gmt 0)

5+ Year Member



I have not written or used any profanity filters, but this module might help you.
Regexp::Common::profanity -- provide regexes for profanity [metacpan.org ]

typomaniac

8:33 pm on Jul 6, 2019 (gmt 0)

10+ Year Member Top Contributors Of The Month



How about a filter which points out the words that need to be removed before the post can be made?

lucy24

9:00 pm on Jul 6, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How about a filter which points out the words that need to be removed before the post can be made?
Do you mean as an entire alternative to ### substitution, without resorting to “this post is being held for moderation” * ? Then you're just encouraging the user to go hunting through their lookalike characters in order to be clever and slip ### past the filter. But that's a whole different thread.


* So that by the time it's approved and posted, the whole discussion has moved on, and you feel like an idiot for saying something that would have fit right in last week.

typomaniac

9:41 pm on Jul 6, 2019 (gmt 0)

10+ Year Member Top Contributors Of The Month



No, it's like when you submit a form and you use a word you shouldn't use: upon submission you are informed that you said a no-no, and it tells you which form field and which words need to be fixed. Like:
The following word(s) were found in username field and must be removed before you can submit:
Foo, bar, etc

typomaniac

10:05 pm on Jul 6, 2019 (gmt 0)

10+ Year Member Top Contributors Of The Month



Here it is. I have removed some words because I did not want to offend anyone.

@badwordsa is the list of words whose letter sequence appears only once in the dictionary, no matter how they appear in a sentence or how someone tries to hide them inside another word, which means that array doesn't have to contain every word in the book.

@badwordsb holds the words that might appear inside another word. For example, the word ass also appears in pass, bass, amass, etc., so the combinations must be listed one by one, which presents no problem. I know it is possible to make this work better and welcome any input.

The best part is NO bad words make it to the post so that the reader doesn't have to use his/her imagination trying to figure out what the word actually was (yes, there are many people who waste time pondering). That said, here it is:

my @badwordsa=('#*$!','damn','piss');
my @badwordsb=('#*$!','asswipe','asskisser','asskiss','kissass','kiss ass','hell');

my @found; # bad words found in the personal (alias) name field

foreach my $c (@badwordsa) { # these must stand alone as whole words
    # \Q...\E keeps characters such as * and $ in a list entry from acting as regex syntax
    push @found, $c if $name =~ m/(?<!\w)\Q$c\E(?!\w)/i;
}
foreach my $d (@badwordsb) { # these match anywhere in the name
    push @found, $d if $name =~ m/\Q$d\E/i;
}

if (@found) { # check personal (alias) name for foul language
    $ec3 = 1;
    $te++; # count this field's error once
    print " <li>&nbsp;The Following Bad Words Were Found In The Name Field And Must Be Removed:</li>\n";
    foreach my $w (@found) {
        print " <span class=\"l\"><span class=\"rs\">$w</span></span> ";
    }
} #end alias



I tried to find the original post where I had asked for help putting this together. I'm pretty sure the real credit goes to phranque or rocknbill; I can't remember exactly, but they have set me straight on more than one thing in Perl, and it's only right that credit goes where it's due.

typomaniac

10:37 pm on Jul 6, 2019 (gmt 0)

10+ Year Member Top Contributors Of The Month



Forgot to mention, repeat for each field using different vars for each field. $ec was supposed to mean error code so I just added a different number to the end for each field.

Same @array used for all fields

csdude55

7:54 pm on Jul 7, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Like @lucy24 said, though, my problem lies with people that actively try to get around it. I guess they just insist that their point can't be made without the profanity or vulgarity? I currently filter it with ****, which is fine for the wide majority, but there are always a handful of people that insist on using unexpected characters.

And if I alerted them to something being filtered, it would just make it easier for them to keep trying until they get it to go through. These people would make a game of it.

For example, today I had "g0d@mm" get past the filter. The user intentionally swapped the o with a 0, the a with an @, and then added a second m instead of an n.

A few days ago, another user intentionally submitted "Þµš$¥".

So I guess that what I need is a list of every possible character that could be used to look like another character, convert each of them to the corresponding alpha character, then filter for profanity, then convert it back?
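
A rough sketch of that idea, under some loud assumptions: the %fold table and @bad list are tiny placeholders, mask_profanity is a made-up name, and every look-alike maps to exactly one letter. Because the folded copy has the same length as the original, the match offsets can be used to mask the original text directly, so nothing has to be converted back.

```perl
use strict;
use warnings;

# Hypothetical one-to-one look-alike table (each key is a single character).
my %fold = (
    '@' => 'a', '$' => 's', '0' => 'o',
    "\x{161}" => 's',   # s with caron
    "\x{a5}"  => 'y',   # yen sign
);

# Placeholder word list, for illustration only.
my @bad = ('ass', 'damn');

sub mask_profanity {
    my ($text) = @_;

    # Build a folded copy, one character out per character in.
    my $folded = '';
    for my $ch (split //, $text) {
        $folded .= exists $fold{$ch} ? $fold{$ch} : $ch;
    }
    $folded =~ tr/A-Z/a-z/;   # ASCII-only lowercasing keeps the length unchanged

    for my $word (@bad) {
        # Require a non-letter (or the string edge) on both sides, so "pass" is left alone.
        while ($folded =~ /(?<![a-z])\Q$word\E(?![a-z])/g) {
            my ($start, $len) = ($-[0], $+[0] - $-[0]);
            substr($text, $start, $len) = '*' x $len;   # mask the original, not the copy
        }
    }
    return $text;
}

print mask_profanity('what an @$$, pass it on'), "\n";   # prints: what an ***, pass it on
```

This still would not catch "g0d@mm", which folds to "godamm" and is a misspelling rather than a character swap. Folding characters that normally end a word, such as !, would also need more care than this sketch takes.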

lucy24

9:39 pm on Jul 7, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You could block entire unicode ranges, like the characters whose sole function is to look like some other letter. (Back when you could see search queries, I would see these in searchers from Japan who apparently laboriously typed in one lookalike after another, and then the poor search engine had to figure out what they meant.) Sorry, users, you're not allowed to say anything in Greek or Russian.

No matter what you do, you'll get false positives, like when the present site throws a fit if you have occasion to enter any common Thai name. (Not the ones that end in -korn. The ones with the other consonant.)

It may be more useful to look at patterns. There are really not many legitimate configurations even of {alphabetic} {non-alphabetic} and still fewer of {non-alphabetic} {alphabetic} in that order, while {alphabetic} {non-alphabetic} {alphabetic} can be reduced to a short list of specific patterns. Adjacent letters and numbers is another easy one. (I check for those when proofreading ebooks, though there it's to catch OCR glitches rather than willful malice.)
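
A minimal sketch of that kind of pattern check, assuming whitespace-separated tokens and a very short allowlist of legitimate shapes (suspicious_tokens is a made-up name, and the allowlist is a guess at what a message board would need, not a tested rule set):

```perl
use strict;
use warnings;

# Return the tokens that mix letters with digits or symbols in an unusual way.
sub suspicious_tokens {
    my ($text) = @_;
    my @hits;
    for my $token (split ' ', $text) {
        # Strip ordinary surrounding punctuation: quotes, brackets, sentence enders.
        (my $core = $token) =~ s/^[("']+|[)"'.,;:!?]+$//g;

        next unless $core =~ /[A-Za-z]/;                       # numbers, prices, bare symbols
        next if $core =~ /^[@#]?[A-Za-z]+(?:['-][A-Za-z]+)*$/; # word, @gmail, #foo, you're
        next if $core =~ /^[A-Za-z]+[.,;:!?][A-Z][A-Za-z]*$/;  # No!This (missing space)

        push @hits, $token;   # anything else has an unexpected shape
    }
    return @hits;
}

my @hits = suspicious_tokens(q{Mail me @gmail, you're #foo. No!This is $h!+ and g0d@mm});
print "@hits\n";   # prints: $h!+ g0d@mm
```

Shape alone cannot tell "@ss" from "@gmail", so this would only complement a word list, and it will flag things like "3rd" or "example.com" until more shapes are allowlisted.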

typomaniac

7:36 am on Jul 8, 2019 (gmt 0)

10+ Year Member Top Contributors Of The Month



So, how do we make a filter that won't allow misspelled dictionary words? Oh my, that means people would have to act civilized.

typomaniac

7:59 pm on Jul 8, 2019 (gmt 0)

10+ Year Member Top Contributors Of The Month



Do you have a reason for not disallowing special characters, other than in an email field?

csdude55

1:44 am on Jul 9, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The only applicable time that I see them is when someone copies an article from a news site that has a different encoding. So ‘ ’ “ ” are relatively common. And of course, regular punctuation.

Can you suggest a way that I could search my database for special characters, to see how often they're actually used? And if they're rare enough, is there an easier way to block them than:

if ($text =~ /[^\w\s\n\r!\@#\$%\^&\*\(\)_\+\-=\[\]"';:<>\?,\.\/]/) {
# contains special character
}

else {
# allowed
}
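
On the "how often are they actually used" part, one possible sketch is to tally every character outside printable ASCII and whitespace. It assumes the posts have already been fetched from the database and decoded into Perl character strings; tally_special and the sample rows are hypothetical.

```perl
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Count every character that is not printable ASCII or whitespace.
sub tally_special {
    my (@posts) = @_;
    my %count;
    for my $post (@posts) {
        $count{$1}++ while $post =~ /([^\x20-\x7E\s])/g;
    }
    return \%count;
}

# Hypothetical sample rows; in practice these would come from the database.
my $count = tally_special(
    "He said \x{201C}hi\x{201D}",
    "It\x{2019}s 5\x{a5}",
    "plain text",
);

# Most frequent first, with the code point shown for lookup.
for my $ch (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    printf "U+%04X %s %d\n", ord($ch), $ch, $count->{$ch};
}
```

Sorting by frequency should make it fairly obvious which characters are routine (curly quotes from pasted articles) and which are rare enough to block.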

lucy24

3:02 am on Jul 9, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<tangent>
\s\n\r
That's redundant. \n and \r (also \t, nbsp and so on) are all \s characters. And, further along, _ (lowline) is a \w character.
</tangent>

You shouldn't need to escape anything other than brackets [ ] (“square brackets” in some dialects) and hyphen - inside grouping brackets. And \ backslash, though I don't think you have one. (And probably don’t need one, since it doesn't occur naturally.) Not sure about non-initial ^ though there you could always play it safe.

Where you have ' " you should really have ' " “ ” ‘ ’ to allow for curly quotes. Some people have their systems set up to substitute them automatically; other people--ahem! cough-cough!--enter them manually. The same probably applies to — (dash) substituted for -- double hyphen. Oh, and … for ... three dots.
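
Put together, an allowed-character check along those lines might look like the sketch below. It is only one way to write it: has_special is a made-up name, the /a flag (Perl 5.14 or later) is my assumption so that \w and \s stay ASCII-only, and the text is assumed to be decoded already. In Perl, @ and $ still need a backslash inside the class because they would otherwise interpolate, and / needs one because it is the delimiter.

```perl
use strict;
use warnings;

# Ordinary characters: ASCII word characters and whitespace, ASCII punctuation,
# plus curly double quotes (201C, 201D), curly single quotes (2018, 2019),
# the long dash (2014) and the ellipsis (2026).
my $ordinary = qr/[\w\s!\@#\$%^&*()+=\[\]\-"';:<>?,.\/\x{201C}\x{201D}\x{2018}\x{2019}\x{2014}\x{2026}]/a;

# Return 1 if the text contains any character outside that set, else 0.
sub has_special {
    my ($text) = @_;
    return $text =~ /(?!$ordinary)./s ? 1 : 0;
}

print has_special("He said \x{201C}hi\x{201D}\x{2026}"), "\n";   # prints: 0
print has_special("caf\x{e9}"), "\n";                            # prints: 1
```

Without /a, \w would also match accented and non-Latin letters, which may or may not be what the board wants.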