More fun with regex: \b throwing off results when string begins with $

csdude55

8:45 pm on Oct 16, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



OK, @lucy24, time to use your regex skills again! LOL. Anyone can answer, of course; I just know that Lucy loves her some regex :-P

What I'm finding is that when a word begins with $, my regex isn't matching it like I'm expecting. Examples:

# First test
$str = '$';

$str =~ s/\b\$/s/g;

print "Result: $str\n";
# Result: $

# Second test, using \Q .. \E to escape the $
$str = '$';

$str =~ s/\b\Q$\E/s/g;

print "Result: $str\n";
# Result: $

# Third test, text before $
$str = 'this is a dollar sign $';

$str =~ s/\b\$/s/g;

print "Result: $str\n";
# Result: this is a dollar sign $

# Fourth test, remove \b from pattern
$str = '$';

$str =~ s/\$/s/g;

print "Result: $str\n";
# Result: s


I expected ALL of them to match and replace, but the \b seems to be preventing the match in the first three.

Thoughts?

lucy24

12:30 am on Oct 17, 2021 (gmt 0)


when a word begins with $
In the land of RegEx, a word can't begin with $ because $ (i.e. \$) is a non-word character. The actual “word” would then begin at whatever comes immediately after the $ character.
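
To make that concrete, here is a minimal sketch (the helper name has_leading_boundary is made up for illustration). \b only matches where a \w character sits on one side and a non-word character or a string edge sits on the other. At the start of '$' there is a string edge on one side and a non-word character on the other, so there is no boundary to match.

```perl
use strict;
use warnings;

# Hypothetical helper: true if \b matches at the very start of the string.
sub has_leading_boundary {
    my ($text) = @_;
    return $text =~ /\A\b/ ? 1 : 0;
}

for my $text ('$', '$5', 'cash', '5$') {
    printf "%-6s leading \\b: %s\n", "'$text'",
        has_leading_boundary($text) ? 'yes' : 'no';
}

# In '$5' the boundary sits between the $ and the 5, not before the $.
print "boundary after \$ in '\$5': ",
    ('$5' =~ /\A\$\b5/ ? 'yes' : 'no'), "\n";
```

Only the strings that start with a letter or digit should report a leading boundary.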

You'd have to express the pattern as ($|\s)\$ or, to cover all possibilities, ($|[^\w$])\$

Yikes.

csdude55

6:59 am on Oct 17, 2021 (gmt 0)


Do you mean ^ instead of the first $? E.g., replacing \b with (^|[^\w\$])?

$str =~ s/(^|[^\w\$])\$/$1s/g;

Preliminary tests seem to work...
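
Here is a rough check of that substitution against a few inputs, as a sketch only. The sub name replace_leading_dollar and the sample strings are made up for illustration, and the capture is written as ${1} so it can't be misread next to the literal s.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution above.
sub replace_leading_dollar {
    my ($text) = @_;
    # (^|[^\w\$]) stands in for \b: start of string, or a character that
    # is neither a word character nor another $.
    $text =~ s/(^|[^\w\$])\$/${1}s/g;
    return $text;
}

for my $text ('$', 'this is a dollar sign $', 'a$b', '$$') {
    print "'$text' -> '", replace_leading_dollar($text), "'\n";
}
```

The first two strings get the replacement, 'a$b' is left alone because the $ follows a word character, and in '$$' only the first $ is replaced.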

If that's what you meant, is this a safe replacement for \b in any context? Because in my real script I'd have to use it to match against 70+ patterns.

lucy24

4:38 pm on Oct 17, 2021 (gmt 0)


Do you mean ^ instead of the first $?
Yes, I do, thank you. Oops. Brain fart.

Finding substitutes for \b is probably context-specific. On one side would definitely be ^ (or $ at--ahem, cough-cough--the end of a word) while on the other side you need to think about which non-word characters can actually occur, and whether you want to count them as part of a word. In particular, if your “words” include negative numbers, then you would need to add the minus sign to both your exclusions and your inclusions:

(^|[^\w$-])[\w$-]
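
A minimal sketch of how that could look in Perl, assuming the goal is to pull out whole “words” that may contain $ and - (the helper name extended_words and the sample sentence are made up for illustration). The $ is escaped inside the classes so Perl can't treat it as the start of a variable, and the - is kept last in each class so it stays literal.

```perl
use strict;
use warnings;

# Hypothetical helper: collect "words" that may contain $ and -.
sub extended_words {
    my ($text) = @_;
    my @words;
    # (?:^|[^\w\$-]) plays the role of the leading \b;
    # [\w\$-]+ is the widened idea of a word.
    while ($text =~ /(?:^|[^\w\$-])([\w\$-]+)/g) {
        push @words, $1;
    }
    return @words;
}

print join('|', extended_words('owes $5 or -10, not 3.50')), "\n";
```

With this sample, $5 and -10 come out as single words, while 3.50 is split at the decimal point because . is in neither class.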

Other things that might potentially arise are currencies like £ or €. And, conversely, if something can end in ¢ then you need further business at the other end of your string. And then you’ve got decimal numbers, where . comes in the middle of a string, and ....

Hence, “context-specific”. We’re not still talking about profanity filters, are we?

In some of these situations, \d might be a better option than \w.