More fun with regex, match \W but not the space at the end of the word

         

csdude55

10:49 pm on Oct 31, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is in Perl, but the issue is really with the regex. Here's my script:

# Catch non-alpha and HTML tags
$temp[0] = '(?:\W(?!\s\b)|<.+?>)*';

$str = 'foo s z ball';

# also want to match:
# $str = 'foo sz ball';
# $str = 'foo s-z ball';
# $str = 'foo s<br>z ball';

$pattern =
'(?:'.
'[s\$z]'. $temp[0] .
'[s\$z]'. $temp[0] .
')';

$str =~ s/\b$pattern\b/****/;

print $str;


I want the result to be "foo **** ball", but the second $temp[0] also matches the space after the z, so I get "foo ****ball".

I know that I could remove $temp[0] after the second [s\$z], but $pattern is actually being built in a loop, and going down that road caused more problems than it solved.

Instead, I'm hoping you can suggest a modification to $temp[0] so that it would stop before matching that last space?

csdude55

4:40 am on Nov 1, 2021 (gmt 0)


@lucy24 made a good point in another thread, but I wanted to copy it here for posterity:

And this
(?:\W(?!\s\b)
(et cetera). Since \W and \s are by definition non-word characters, the form \s\b would have no meaning.

My original was simply:

(?:\W|<.+?>)*

but then it was catching the last whitespace at the end of the word when I didn't want it to. My (?!\s\b) was an attempt to stop that, but now I see that it's not fixing that hiccup, either.

This comes closer:

(?:[^\w\s]|<.+?>)*

or more accurately, the other way around:

(?:<.+?>|[^\w\s])*

This accurately matches "sz", "s-z", and "s<br>z" and doesn't match the trailing whitespace, but if there's a whitespace in between the letters ("s z") then it doesn't match.

How do I say "match \W* unless it's at the end of the word"?
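One option not tried in this thread is a lazy quantifier on the filler. The following is only a minimal sketch under the assumptions of the test strings above: `$filler` and `censor` are names made up for illustration, and the only change from the original filler is `*?` in place of `*`. A lazy filler consumes just enough to reach the next letter (or the closing \b), so the final filler can match nothing and leave the trailing space alone.

```perl
use strict;
use warnings;

# Hypothetical filler: same alternatives as the original, but lazy (*?),
# so it takes only what is needed to reach the next letter or the \b.
my $filler  = '(?:<.+?>|\W)*?';
my $pattern = '(?:' . '[s\$z]' . $filler . '[s\$z]' . $filler . ')';

# Hypothetical helper: censor the first match in a copy of the string.
sub censor {
    my ($str) = @_;
    $str =~ s/\b$pattern\b/****/;
    return $str;
}

for my $str ('foo s z ball', 'foo sz ball', 'foo s-z ball', 'foo s<br>z ball') {
    print censor($str), "\n";
}
```

All four test strings should come out as "foo **** ball". This still relies on \b, so it would not help when the first or last character of the word is a symbol such as $.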

[edited by: phranque at 5:37 am (utc) on Nov 1, 2021]
[edit reason] disable graphic smile faces [/edit]

csdude55

12:52 am on Nov 2, 2021 (gmt 0)


So, I have something close to a solution. More of a workaround, I guess, but still.

First, I went back to:

(?:<.+?>|\W)*

Then I export (?) the results of the pattern to an array.

Then I loop through the array and remove any trailing whitespace, then add it to a new |-delimited string.

Then I do a substitution on that new pattern.

# Catch non-alpha and HTML tags
$temp[0] = '(?:<.+?>|\W)*';

$str = 'foo s z ball';

# also want to match:
# $str = 'foo sz ball';
# $str = 'foo s-z ball';
# $str = 'foo s<br>z ball';

$pattern =
'(?:'.
'[s\$z]'. $temp[0] .
'[s\$z]'. $temp[0] .
')';

while (/\b($pattern)\b/xgi) {
($temp = $1) =~ s/\W+$//;

if ($new) { $new .= '|'; }
$new .= quotemeta($temp);
}

$str =~ s/\b(?:$new)\b/****/;

print $str;


The original script would run 1000 iterations in 1.5691s, and this new one with a while() loop takes 1.7771s.
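For anyone wanting to reproduce a timing like that, here is a rough sketch using the core Time::HiRes module. It is an assumption about how such a measurement could be set up, not the harness actually used for the numbers above; `censor_two_pass` is a hypothetical wrapper around the two-pass approach, and the absolute times will differ by machine and by how much else the real script does.

```perl
use strict;
use warnings;
use Time::HiRes ();

my $filler  = '(?:<.+?>|\W)*';
my $pattern = '(?:' . '[s\$z]' . $filler . '[s\$z]' . $filler . ')';

# Hypothetical wrapper: collect matches, trim trailing non-word
# characters, then substitute using the trimmed alternatives.
sub censor_two_pass {
    my ($str) = @_;
    my $new = '';
    while ($str =~ /\b($pattern)\b/gi) {
        (my $temp = $1) =~ s/\W+$//;
        $new .= '|' if $new;
        $new .= quotemeta($temp);
    }
    $str =~ s/\b(?:$new)\b/****/ if $new;
    return $str;
}

my $iterations = 1000;                 # same count as quoted above
my $start      = Time::HiRes::time();
censor_two_pass('foo s z ball') for 1 .. $iterations;
printf "%d iterations: %.4fs\n", $iterations, Time::HiRes::time() - $start;
```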

csdude55

5:13 am on Nov 2, 2021 (gmt 0)


Oooooh, I was so close!

First, minor typo in that code above:

# Catch non-alpha and HTML tags
# changing this from $temp to $ph, since I
# use the variable $temp later
$ph[0] = '(?:<.+?>|\W)*';

$str = 'foo s z ball';

# also want to match:
# $str = 'foo sz ball';
# $str = 'foo s-z ball';
# $str = 'foo s<br>z ball';

$pattern =
'(?:'.
'[s\$z]'. $ph[0] .
'[s\$z]'. $ph[0] .
')';

# forgot to say "$str =~" here; my test
# was using $_, but I'd changed it to
# $str for the post so it wouldn't
# be confusing and ended up
# making it MORE confusing! LOL
while ($str =~ /\b($pattern)\b/xgi) {
($temp = $1) =~ s/\W+$//;

if ($new) { $new .= '|'; }
$new .= quotemeta($temp);
}

if ($new) {
$str =~ s/\b(?:$new)\b/****/;
}

print $str;


But it's all good EXCEPT for when they use non-alphanumeric symbols :'-(

For example, let's say that we have the following:

$str = 'foo #$$ ball';

$pattern =
'(?:'.
'[e#]'. $ph[0] .
'[s\$z]'. $ph[0] .
'[s\$z]'. $ph[0] .
')';


This time, #$$ doesn't match because the first (?:<.+?>|\W)* gobbles up all of the $, leaving nothing for [s\$z] to match.

And it doesn't help that \b only matches between a \w character and a \W character, so it never matches between a space and a symbol like # or $. That means even this doesn't match:

$str =~ s/\b[e#][s\$z][s\$z]\b/****/;
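To illustrate the \b behaviour described above, here is a small sketch. The lookaround version is an assumed substitute for \b ("not preceded and not followed by a non-space character"), not something taken from the original script.

```perl
use strict;
use warnings;

my $str = 'foo #$$ ball';

# \b needs a word character on exactly one side. Between ' ' and '#'
# both neighbours are non-word characters, so there is no boundary.
my $with_b = ($str =~ /\b#\$\$\b/) ? 1 : 0;

# Assumed substitute: whitespace-based lookarounds instead of \b.
my $with_lookaround = ($str =~ /(?<!\S)#\$\$(?!\S)/) ? 1 : 0;

print "with \\b: $with_b\n";                    # with \b: 0
print "with lookarounds: $with_lookaround\n";   # with lookarounds: 1
```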

So far, the "fix" is to double down:

# Catch non-alpha and HTML tags
$ph[0] = '(?:<.+?>|\W)*';

$str = 'foo #$$ ball';

$pattern =
'(?:'.
'[e#]' .
'[s\$z]'.
'[s\$z]'.
')';

$str =~
# match if it's surrounded with \b, \s, or
# common punctuation
s{(\b|^|[\s.,;:'"]|\$)(?:$pattern)(\b|[\s.,;:'"]|$)}
{$1****$2}gi;

# then move on to test again if there's a \W in there
$pattern =
'(?:'.
'[s\$z]'. $ph[0] .
'[s\$z]'. $ph[0] .
')';

while ($str =~ /\b($pattern)\b/xgi) {
($temp = $1) =~ s/\W+$//;

if ($new) { $new .= '|'; }
$new .= quotemeta($temp);
}

if ($new) {
$str =~ s/\b(?:$new)\b/****/;
}

print $str;

lucy24

5:36 pm on Nov 2, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



EXCEPT for when they use non-alphanumeric symbols
Yes, you might need to replace \W with [^\w$@] et cetera, retaining the non-alphanumerics that are most popularly substituted so you can use them as part of your word-specific tests.
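A rough sketch of what that could look like, under some loud assumptions: the set of stand-in symbols ($, @ and #) is a guess, `censor` is a hypothetical helper, the filler is made lazy, and lookarounds replace \b because \b cannot see symbols. It is not the script from this thread.

```perl
use strict;
use warnings;

# Assumption: '$', '@' and '#' are the symbols treated as letter stand-ins,
# so the filler must not swallow them.
my $filler  = '(?:<.+?>|[^\w\$\@#])*?';
my $pattern = '[e#]' . $filler . '[s\$z]' . $filler . '[s\$z]';

# Hypothetical helper: the lookarounds require that the match is not
# glued to another word character or stand-in symbol on either side.
sub censor {
    my ($str) = @_;
    $str =~ s/(?<![\w\$\@#])$pattern(?![\w\$\@#])/****/gi;
    return $str;
}

print censor($_), "\n" for 'foo #$$ ball', 'foo e-s s ball', 'chess club';
```

The first two strings should become "foo **** ball", while "chess club" is left alone because the "ess" there is preceded by a word character.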

otoh, I wouldn’t bother screening for locutions like #$$ because, heck, they’re already self-censored. And there’s a limit to variant spellings, otherwise you’ll find yourself inadvertently stomping on people using words like “feckless” * or, as on the present site, Thai surnames containing a certain common syllable. (In fact, Preview Post reveals that the present site also takes offense at my first example, which is--hmph!--a perfectly legitimate word.)


* I learned once in a linguistics class that English used to have the whole package of words in the form /f/+ short vowel + /k/ but over the centuries they fell away one by one. There’s a technical term, which I’ve forgotten.

csdude55

7:33 pm on Nov 2, 2021 (gmt 0)


otoh, I wouldn’t bother screening for locutions like #$$ because, heck, they’re already self-censored.

Over the years my focus has been "whatever King Google wants", but as Google RPM has decreased (and ad blockers have increased) I'm trying to focus more on direct-sale ads. These are photo-text, so most ad blockers let them slide, too.

But oh. My. Gah. You would be shocked at the hoops I'm expected to jump through!

"I see you have horoscopes. Well, that's Satanic and I can't possibly endorse a Satanic site."

"You once had a guy selling vintage Playboys in your classifieds. I can't possibly endorse a #*$!o site."

"I see people calling one another names like 'stupid' and 'idiot'. I can't possibly support a hate filled site like that."

"Someone said 'kiss my #$$', I can't possibly support such a vulgar site!"

These same people advertise on Fakebook (a site designed to make fun of teenage girls, and still filled with unchecked profanity, groups designed to make fun of people, and pages for terrorist organizations), they advertise on billboards (that have signs for strip clubs)... but can't see their own hypocrisy.

So I find myself going down the rabbit hole, trying to please a bunch of hypocrites that probably aren't going to advertise, anyway. But the alternative is to give up and not even try, so I guess I have to follow it to the end either way.

Anyway.

I think I'm about an inch away from being done with my script; I'm hoping to post the final working script in the Perl forum later this evening :-)