Combining 3 regex to one

         

csdude55

7:33 pm on Oct 11, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I currently have a function with 3 regex in a row and I'm wondering if there's a way to compress it to a single one.

The steps:

1. I have a string ($data) that joins user-submitted $name and $text

2. check to see if $data matches specific words or phrases, and if so then I remove those words so that they won't match later (mainly used when someone has a name like "dick cheney" that I don't want to match the pattern that's trying to filter "dick")

3. create a | delimited pattern from MySQL

4. check to see if $data matches that pattern

5. make an exception (do nothing) if $name is a specific word / phrase AND the matched pattern's group name is a specific word

The actual code:

$data = join(' ', $name, $text);
# Result: "foo this is foo's test"

if (
    $name =~ m{^(
        foo |
        bar |
        lorem |
        ipsum
    )$}x
) { $data =~ s/$1 ?//gi; }
# "foo" matches so remove it
# Result: "this is 's test"

($pattern) = $dbh->selectrow_array("SELECT GROUP_CONCAT(contains SEPARATOR '|') FROM tableA");
# Result: something like $pattern = "this|that|(?<TOT>the.other.thing)"

if ($data =~ /($pattern)/si &&
    (
        $name ne 'blah' ||

        # don't let the (keys %+)[0] scare you, it's just the group name in the pattern match
        (keys %+)[0] ne 'TOT'
    )
) {
    # do something
}


I tend to find double-negatives confusing to read, so for the sake of clarification the if ($data =~ /($pattern)/si section could be more easily read as:

# match first so that %+ is populated before the group name is checked
if ($data =~ /($pattern)/si) {
    if (
        $name eq 'blah' &&
        (keys %+)[0] eq 'TOT'
    ) {
        # do nothing
    }
    else {
        # do something
    }
}


Can you suggest a way to do this without doing 3 separate regexes?
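
For anyone who wants to run the five steps without the database, here is a minimal sketch. It makes several assumptions: $pattern is a hard-coded stand-in for the GROUP_CONCAT result, should_act is a hypothetical helper name, and defined $+{TOT} is swapped in for (keys %+)[0] so there is no undef warning when no named group matched. It is not a drop-in replacement for the code above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the GROUP_CONCAT result from tableA.
my $pattern = 'this|that|(?<TOT>the.other.thing)';

# Returns 1 when the post should be acted on, 0 otherwise.
sub should_act {
    my ($name, $text) = @_;
    my $data = join ' ', $name, $text;

    # Step 2: allowed names are stripped from $data before filtering.
    # \Q...\E keeps any regex metacharacters in the name literal.
    if ($name =~ m{^(foo|bar|lorem|ipsum)$}) {
        my $safe = $1;
        $data =~ s/\Q$safe\E ?//gi;
    }

    # Steps 4-5: match, then exempt 'blah' only when the TOT group fired.
    if ($data =~ /$pattern/si) {
        return 0 if $name eq 'blah' && defined $+{TOT};
        return 1;
    }
    return 0;
}
```

With these sample values, should_act('blah', 'see the other thing') returns 0 because the TOT group matched, while should_act('blah', 'that is bad') returns 1.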

lucy24

8:23 pm on Oct 11, 2021 (gmt 0)

Do the elements you’re checking against always come in the same order? If yes, you can express the pattern as (this|that|other).+?(more|and|additional). It may not be worth it, though.

Some RegEx dialects do have a variety of concatenation options, though I think they tend to involve single characters (“teststring IS a \w word character but IS NOT one of a short specified list”, that kind of thing). And, again, it may really not be any faster than letting your perl code handle the options in sequence. Especially when you come back next year, look at your code and have to figure out what the heck you meant.
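
A minimal Perl sketch of the ordered pattern described above, assuming the two word lists are fixed strings (the lists, $ordered and ordered_hit are made up for illustration). It only matches when a word from the first list appears before a word from the second.

```perl
use strict;
use warnings;

# Hypothetical word lists; the real ones would come from the application.
my $first  = 'this|that|other';
my $second = 'more|and|additional';

# One regex: a word from $first, at least one character, then a word from $second.
my $ordered = qr/($first).+?($second)/si;

# Returns the two matched words, or an empty list when the order is not met.
sub ordered_hit {
    my ($text) = @_;
    return $text =~ $ordered ? ($1, $2) : ();
}
```

If the elements can arrive in either order, this single pattern misses the reversed case, which is one reason separate checks may be easier to maintain.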

csdude55

9:12 pm on Oct 12, 2021 (gmt 0)

After some poking around, I might have come up with a solution with a while() statement.

First I recognized that the original if() condition didn't have to come before creating $pattern. After moving it next to the second condition, it was clearer that I could do it after matching $data =~ /($pattern)/, provided I could make the condition ignore any reference to $name and rerun the match. That led me to:

($pattern) = $dbh->selectrow_array("SELECT GROUP_CONCAT(contains SEPARATOR '|') FROM tableA");
# Result: something like $pattern = "this|that|(?<TOT>the.other.thing)"

while ($data =~ /($pattern)/i) {
    # save the match now; a successful $name match below would reset $1
    my $match = $1;

    # Put the conditions here
    if (
        (
            $name eq 'blah' &&
            (keys %+)[0] eq 'TOT'
        ) ||

        # non-capturing group here; the saved match from $pattern is used instead
        $name =~ m{^(?:
            foo |
            bar |
            lorem |
            ipsum
        )$}x
    ) {
        $data =~ s/\Q$match\E ?//gi;
    }
    else {
        # do something

        last;
    }
}


This actually works better for me. In the original, if "blah" made a comment that matched "the.other.thing" and then matched "that", he would have been given a free pass when he shouldn't have.
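
Two Perl behaviors are worth keeping in mind with a loop like this one. An alternation with no group around it anchors only its first and last branches, and any later successful match replaces $1, even when that later regex has no capture groups. A minimal sketch, with sub names and sample strings made up for illustration:

```perl
use strict;
use warnings;

# Pitfall 1: without a group, ^ binds only to "foo" and $ only to "ipsum".
my $loose  = qr{^ foo | bar | lorem | ipsum $}x;
my $strict = qr{^ (?: foo | bar | lorem | ipsum ) $}x;

sub loose_ok  { return $_[0] =~ $loose  ? 1 : 0 }
sub strict_ok { return $_[0] =~ $strict ? 1 : 0 }

# Pitfall 2: a later successful match replaces $1, even one with no groups.
sub capture_after_second_match {
    my ($data, $name) = @_;
    return unless $data =~ /(this|that)/;
    my $saved = $1;    # copy before running another regex
    my $after = $name =~ $strict ? $1 : 'name did not match';
    return ($saved, $after);    # $after is undef when $name matched
}
```

Here loose_ok('barbarian') is true because "bar" matches anywhere in the string, while strict_ok('barbarian') is false. Copying $1 into a variable right after the $pattern match sidesteps the second problem.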

I usually avoid while() statements because of the potential of an infinite loop, so I should probably add a safety net, too:

($pattern) = $dbh->selectrow_array("SELECT GROUP_CONCAT(contains SEPARATOR '|') FROM tableA");
# Result: something like $pattern = "this|that|(?<TOT>the.other.thing)"

$count = split('\|', $pattern);
# Result: close to the number of delimited patterns, maybe a little high if there are nested |

$x = 0;
while ($data =~ /($pattern)/i) {
    $x++;

    # save the match now; a successful $name match below would reset $1
    my $match = $1;

    if (
        # safety net
        $x < $count &&

        # Put the conditions here
        (
            (
                $name eq 'blah' &&
                (keys %+)[0] eq 'TOT'
            ) ||

            # non-capturing group here; the saved match from $pattern is used instead
            $name =~ m{^(?:
                foo |
                bar |
                lorem |
                ipsum
            )$}x
        )
    ) {
        $data =~ s/\Q$match\E ?//gi;
    }
    else {
        # do something

        last;
    }
}
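
A small runnable sketch of the safety net idea, under stated assumptions: the pattern is a hard-coded stand-in for the database value, and piece_count and strip_matches are hypothetical helper names. It leaves out the $name conditions and only shows the counting and the bounded loop, with the bound placed in the while condition itself.

```perl
use strict;
use warnings;

# Hypothetical pattern; the nested | inside the group inflates the count.
my $pattern = 'this|that|(?<TOT>the.(?:other|next).thing)';

# Upper bound on loop passes: the number of |-separated pieces.
sub piece_count {
    my @pieces = split /\|/, $_[0];
    return scalar @pieces;
}

# Strip matches from $data one at a time, giving up after $limit passes.
sub strip_matches {
    my ($data, $limit) = @_;
    my $passes = 0;
    while ($passes < $limit && $data =~ /($pattern)/i) {
        my $hit = $1;                  # copy before the substitution runs
        $data =~ s/\Q$hit\E ?//gi;     # remove every copy of the matched text
        $passes++;
    }
    return ($data, $passes);
}
```

For this sample pattern piece_count returns 4 even though there are only 3 top-level alternatives, which matches the "maybe a little high" caveat and still works as an upper bound.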


Thoughts?

lucy24

10:07 pm on Oct 12, 2021 (gmt 0)

"I usually avoid while() statements because of the potential of an infinite loop"

It sounds like you're doing the same thing I resort to when I don't trust a While or Until type of condition. (i.e. always ;)) Conceptually:

counter = 0
WHILE {blahblah AND counter < some-large-but-not-unreasonable-number}
counter++
do stuff