Forum Moderators: phranque
$string = '(?:
# word one
mï0ï
r
# or
|
# word two
mï0ï
(?:
dï0ï
)+
rï0ï
)+
# word three
(?:
a |
b
)+';

$string = '(?:
(?:
# word one
mï0ï
r
# or
|
# word two
mï0ï
(?:
dï0ï
)+
r
)
ï0ï
)+
# word three
(?:
a |
b
)+';

@more =
$pattern =~ m{(
\Q(?:\E
(?:
(?>
[^()]+
)
|
(?1)
)*
\Q)\E
)}xsg;

@more =
$pattern =~ m{(
\Q(?:\E
(?:
(?>
[^()]+
)
|
(?1)
)*
\Qï0ï)\E
)}xsg;

@more =
$pattern =~ m{(
\Q(?:\E
(?:
(?!
\( |
\)
)+
|
(?1)
)*
\Q)\E
)}xg;
I tried using .* before and after \( and \) in every variation
Well, that’s definitely something to avoid if at all possible, since .* can represent absolutely anything, including the string you’re trying to match.
The description I found says that it "recurses to bracket 1 and tries again", but I have no clue where "bracket 1" is or how to define it.
Can you remember where you found this? It doesn't seem to be on the linked regex dot info page. Lacking context, I would assume “bracket 1” corresponds to whatever \1 (or $1) would be if you were capturing, typically the first open-parenthesis from the left.
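For what it's worth, here is a minimal sketch of that reading, assuming "bracket 1" simply means capture group 1. The variable names and the sample string are made up for illustration; (?1) and the possessive ++ need Perl 5.10 or later.

```perl
use strict;
use warnings;

# Group 1 is the first opening parenthesis counted from the left;
# (?1) re-enters that same group, so nested pairs are matched recursively.
my $balanced = qr{
    (               # start of capture group 1
        \(          # a literal opening parenthesis
        (?:
            [^()]++ # a run of non-parenthesis characters, never given back
          |
            (?1)    # recurse into group 1 for a nested pair
        )*
        \)          # the matching literal closing parenthesis
    )               # end of capture group 1
}x;

# Hypothetical sample: one balanced nest followed by an unclosed parenthesis.
my $text = 'x (a (b) c) y (d';
my ($first) = $text =~ $balanced;
print "$first\n";
```

This should print (a (b) c): the inner (b) is consumed by the recursion, and the unclosed (d at the end is never matched.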
This savings will be vital when your alternatives contain repeated tokens (not to mention repeated groups) that lead to catastrophic backtracking.
I like this language. It’s like when Apache docs say “unintended consequences” and you just know it really means the world as we know it will come to a crashing halt.
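A small sketch of what the (?>...) groups in the patterns in this thread are doing, on the assumption that the quoted sentence is about atomic grouping. The test string is invented; the point is only that an atomic group never hands characters back, which is what cuts off the backtracking.

```perl
use strict;
use warnings;

my $text = 'aaab';

# Ordinary group: a+ first takes all three a's, then backtracks one step
# so that the literal "ab" can still match.
my $plain  = $text =~ /^(?:a+)ab$/ ? 1 : 0;

# Atomic group: once a+ has taken all three a's it gives nothing back,
# so the literal "ab" that follows can never match.
my $atomic = $text =~ /^(?>a+)ab$/ ? 1 : 0;

print "plain=$plain atomic=$atomic\n";
```

The same refusal to backtrack is why (?>[^()]+) is safe inside a repeated group: each run of non-parenthesis characters is tried exactly one way.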
@more =
$pattern =~ m{(
\Q(?:\E
(
(?:
(?>[^()]+)
|
(?1)
)*
)
ï0ï\)
)}xg;
Can you remember where you found this? It doesn't seem to be on the linked regex dot info page. Lacking context, I would assume “bracket 1” corresponds to whatever \1 (or $1) would be if you were capturing, typically the first open-parenthesis from the left.
/
^ # start of line
( # start capture buffer 1
< # match an opening angle bracket
(?: # match one of:
(?> # don't backtrack over the inside of this group
[^<>]+ # one or more non angle brackets
) # end non backtracking group
| # ... or ...
(?1) # recurse to bracket 1 and try it again
)* # 0 or more times.
> # match a closing angle bracket
) # end capture buffer one
$ # end of line
/x
What are some example strings that you’re trying to match or not match? Looking only at the RegEx, I'm getting held up on things like, why isn't
(?:mï0ïr|mï0ï(?:dï0ï)+r)
simply
(?:mï0ï(?:dï0ï)*r)
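A quick way to check that suggestion is to run both forms over a handful of strings and count where they disagree. This is only a sketch: the sample strings are made up, and both patterns are anchored here so that the whole string has to match.

```perl
use strict;
use warnings;
use utf8;

# The two forms from the post above, anchored for a whole-string comparison.
my $long  = qr/^(?:mï0ïr|mï0ï(?:dï0ï)+r)$/;
my $short = qr/^(?:mï0ï(?:dï0ï)*r)$/;

# Hypothetical samples: some should match both forms, some neither.
my @samples = ('mï0ïr', 'mï0ïdï0ïr', 'mï0ïdï0ïdï0ïr', 'mr', 'mï0ïdr', 'mï0ïdï0ï');

# Count the samples on which the two patterns give different answers.
my $disagreements = 0;
for my $s (@samples) {
    my $hit_long  = $s =~ $long  ? 1 : 0;
    my $hit_short = $s =~ $short ? 1 : 0;
    $disagreements++ if $hit_long != $hit_short;
}
print "disagreements: $disagreements\n";
```

On these samples the count stays at zero, which is what you would expect: "zero or more d groups, then r" covers both the bare alternative and the one-or-more alternative.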
$pattern = '(?:mï0ïï3ïï0ïï4ïï0ïhï0ïï5ïï0ïr|mï0ïï6ïï0ï(?:dï0ï)+ï5ïï0ïrï0ï)?(?:f|pï0ïhï0ï)(?:ï1ïï0ï)+(?:(?:cï0ï)?k|qï0ï)+(?:ï6ïï0ïï5ï|r|ï4ïï0ïï7ïï0ïrï0ïd|cï0ïn|jï0ïï3ïï0ïï5ï|ï2ïï0ï(?:ï8ï|ï6ïï0ïrï0ï)dï0ïï5ïï0ïn)*';
$pattern =~ s{(
\Q(?:\E
(
(?:
[^()]+
|
(?1)
)*
)
ï0ï\)
)}
{(?:(?:$2)ï0ï)}x;
print $pattern;
[edited by: phranque at 7:56 pm (utc) on Oct 31, 2021]
[edit reason] disable graphic smile faces [/edit]
$pattern = 'ï3ï+(?:ï9ï|ï10ï(?:ï7ï)?ï7ï|hï6ï)?ï8ï|(?:ï7ï)?';
# should return anything that starts with (?: and ends with a matching )?
@more =
$pattern =~ m{(
\Q(?:\E
(?:
(?>[^()]+) |
(?1)
)*
\)\?
)}xsg;
for (@more) {
print "$_\n";
}
# Returns:
# (?:ï9ï|ï10ï(?:ï7ï)?ï7ï|hï6ï)?
# (?:ï7ï)?

$pattern = 'ï3ï+(?:ï9ï|ï10ï(?:ï7ï)?ï7ï|hï6ï)?ï8ï|(?:ï7ï)?';
# should return anything that starts with (?: and ends with a matching ï)?
@more =
$pattern =~ m{(
\Q(?:\E
(?:
(?>[^()]+) |
(?1)
)*
ï\)
)}xsg;
for (@more) {
print "$_\n";
}
Holy ###. What do all those ï (diacritic which doesn’t occur in most languages I can think of, and is exceedingly rare in the rest, like “oï” indicating two syllables rather than the usual diphthong) represent in real life?
[edited by: phranque at 5:34 am (utc) on Nov 1, 2021]
[edit reason] disable graphic smile faces [/edit]
ï3ï+(?:ï9ï|ï10ï
(et cetera). That part would never match, because ï+ has already gobbled up all the ï so there would be nothing left for the ï lookahead.
And this
(?:\W(?!\s\b)
(et cetera). Since \W and \s are by definition non-word characters, the form \s\b would have no meaning.
Final, disheartening thought: The time you spend devising all these tests and making everything foolproof . . . will be easily matched by a small subset of forum users moving heaven and earth to devise methods of bypassing the filters.
[edited by: phranque at 5:35 am (utc) on Nov 1, 2021]
[edit reason] disable graphic smile faces [/edit]
don't waste unnecessary time
Time spent playing with Regular Expressions is never wasted ;)
<.+?>
It might be preferable to say
<[^>]+>
But either way you have to consider people saying, er, &lt; (form I actually use in Disqus-based forums because they’re coded to auto-convert anything in <angle brackets> whether it’s an attested html tag or not).
but not match any trailing whitespace or punctuation
I don’t know if it would work to do it as packages instead. If your set of all possible intervening characters is \q--locution invented at random to represent
([\W_]|<[^<>]+>)--then you’re looking at
How long is your list of Bad Words? I mean the underlying words, not their disguises.
I think it may become necessary to strip away any and all <blahblah> first, because otherwise the “blahblah”--which most often will consist of nothing but word characters--will merge into the surrounding word:
foo<i>bar</i>
>>
fooibari
creating a false negative.
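To make that concrete, here is a rough sketch of the two orders of operation on the same input. The variable names are invented, and the tag pattern <[^<>]+> is just the one floated earlier in the thread, not a full HTML parser.

```perl
use strict;
use warnings;

my $post = 'foo<i>bar</i>';

# Removing only non-word characters leaves the tag names behind,
# so they merge into the surrounding word.
(my $naive = $post) =~ s/[\W_]+//g;

# Removing whole tags first, then any remaining non-word characters.
(my $clean = $post) =~ s/<[^<>]+>//g;
$clean =~ s/[\W_]+//g;

print "$naive\n$clean\n";
```

The first line printed is fooibari and the second is foobar, so only the tags-first version leaves something a word list could recognise.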
[edited by: phranque at 4:44 am (utc) on Nov 2, 2021]
[edit reason] disable graphic smile faces [/edit]
</?\w+>
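One more sketch on the tag patterns, since the lazy dot and the negated class are not interchangeable once something follows the closing bracket. The input string and the trailing x are invented purely to force the difference.

```perl
use strict;
use warnings;

my $text = '<a>b<c>x';

# Lazy dot: .+? may still run across a closing > when that is the only way
# for the rest of the pattern (the trailing x) to match.
my ($lazy)    = $text =~ /(<.+?>x)/;

# Negated class: [^>]+ can never cross a closing >, so only the last tag fits.
my ($negated) = $text =~ /(<[^>]+>x)/;

print "$lazy\n$negated\n";
```

Here the lazy version swallows the whole string, <a>b<c>x, while the negated class stops at <c>x. Lazy only means "try the short match first", not "never go past a >".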