Forum Moderators: coopster & phranque

Fun with regex, matching a ", ', or nothing

csdude55

6:00 am on Dec 13, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



When I'm matching against HTML code, I'll typically use something like whatever=("|')foo\1.

But there are a few times where there's no opening " or ', and then everything falls apart.

For example:

$_ = qq~
<div onMouseOver="this.style.background = 'white';"
onMouseOut=this.style.background = 'black'
onClick="console.log('test');">
~;

s#(?:onclick|onmouseover|onmouseout)=("|')?.*?\1##gsi;


This will remove the onMouseOver and onClick, but not the onMouseOut.

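If I make the \1 optional with \1?, then the lazy .*? matches nothing and \1? matches nothing either, so only the attribute name, the =, and any opening quote get removed, leaving me with:

<div this.style.background = 'white';"
this.style.background = 'black'
console.log('test');">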

So how do I make it give me the result I'm expecting?

Note that the onFoo here is just an example; in practice I also match styles, classes, width, id, etc.
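
To make the failure concrete, here is a minimal sketch (not part of the original post) that collects what the \1? version of the pattern consumes on the sample input. The variable names $html, $re and @consumed are made up for illustration.

```perl
use strict;
use warnings;

my $html = <<'HTML';
<div onMouseOver="this.style.background = 'white';"
onMouseOut=this.style.background = 'black'
onClick="console.log('test');">
HTML

# Same pattern as in the post, with the backreference made optional.
my $re = qr#(?:onclick|onmouseover|onmouseout)=("|')?.*?\1?#si;

my @consumed;
while ($html =~ /$re/g) {
    push @consumed, $&;    # the text the substitution would delete
}

# Each match stops right after the "=" and the optional opening quote,
# because the lazy .*? is satisfied with zero characters and \1? is
# allowed to match nothing.
print "[$_]\n" for @consumed;
```

On this input the three matches are onMouseOver=", onMouseOut= and onClick=", which is why the values are left behind.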

phranque

7:30 am on Dec 13, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



i think this will require two substitutions - one for when the attribute value is enclosed in quotes and another otherwise.

the "unquoted field boundary" case will mean you are looking for any possible next attribute name or a '>' (end-of-tag) so good luck with that one...
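
A rough sketch of the two-substitution idea, under one loud assumption that is not in the post above: an unquoted value ends at the first whitespace character or '>', so the sample uses an unquoted value without spaces. The names $html and $names are illustrative only.

```perl
use strict;
use warnings;

my $html = <<'HTML';
<div onMouseOver="this.style.background='white';"
onMouseOut=this.style.background='black'
onClick="console.log('test');">
HTML

my $names = qr/(?:onclick|onmouseover|onmouseout)/i;

# Pass 1: quoted values. The backreference \1 closes on the same quote
# character that opened the value.
$html =~ s/\s+$names\s*=\s*(["']).*?\1//gs;

# Pass 2: unquoted values. ASSUMPTION: the value runs up to the first
# whitespace character or '>' and does not start with a quote.
$html =~ s/\s+$names\s*=\s*[^\s"'>][^\s>]*//g;

print $html;    # prints "<div>" followed by a newline
```

Requiring \s+ in front of the attribute name keeps the pattern from firing in the middle of a longer name such as data-onclick. It does not protect against the same text appearing inside another attribute's quoted value.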

lucy24

5:56 pm on Dec 13, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is there a reason the onMouseOut line can’t be punctuated the same as the other two?

csdude55

8:32 pm on Dec 13, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@lucy24, I'm dealing with text that's submitted by users, and sometimes copied from other websites. It's pasted into a contenteditable, so whatever styling the site uses will come through, too. I have a pretty wide variety of messes that, if not handled correctly, can mess up the entire page.

I seriously have 120 regexes in the processing script!

phranque wrote: "i think this will require two substitutions - one for when the attribute value is enclosed in quotes and another otherwise."

@phranque, following your lead there, I've started writing a regex that will enforce a " or ' to surround the value, and then future regexes won't have to worry about it.

But, of course, it's getting complicated.

Here's where I am so far, and it seems to work with all of my tests:

# I discovered that this.style.background = 'white' (with whitespace) doesn't
# work anyway, so I'm assuming that the value does NOT contain whitespace. But
# a line COULD end with whitespace and a \n, so I'm allowing \s*. It's also very
# likely that the content will have more than one element like this, so I'm testing
# with 2 DIV elements
$_ = qq~
<div onMouseOver=this.style.background='white'
onMouseOut = this.style.background="black"
onClick=
"console.log('test');">

<div onClick =console.log('test');>
~;

# count how many = are in the string as a safety net for WHILE()
$count = () = /=/g;

$x = $y = 0;

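# I have to do another loop for EVERY regex inside of an element, or it will
# miss an element with 2 or more attributes to match
#
# $1 = everything from the tag start through the "=" (plus any whitespace after it)
# $2 = the unquoted value, which runs up to the next whitespace or ">"
while (
$x <= $count &&
m#<([^>]*?=\s*)([\w\d].*?)[\s>]#gsi) {
$attrName = $1;
$attrVal = $2;

# $1 has to include the "=" so that $patt matches the original text exactly
$patt = $1 . $2;

# wrap the value in whichever quote character it doesn't already contain
$qte = ($attrVal =~ /"/) ? "'" : '"';
$repl = $attrName . $qte . $attrVal . $qte;

s#\Q$patt\E#$repl#gsi;

$x++;
}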

while (
$y <= $count &&
m#<(?:[^>]*?)((?:onclick|onmouseover|onmouseout)\s*=\s*("|').*?\2\s*)(?:[^>]*?>)#gsi) {
$patt = $1;
s#\Q$patt\E##gsi;
$y++;
}

print;

So much for my pretty little one-liner! LOL I'm totally open to suggestions on improvements.
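
One possible simplification, offered as a sketch rather than a drop-in replacement: do the quoting in a single pass with s///ge, so no counter or outer while loop is needed. The helper name quote_attrs is hypothetical. It assumes that no attribute value contains a literal '>', and that an unquoted value ends at whitespace or '>'.

```perl
use strict;
use warnings;

# Quote every unquoted attribute value inside one tag (hypothetical helper).
sub quote_attrs {
    my ($tag) = @_;
    $tag =~ s{
        (\s[\w:-]+\s*=\s*)          # $1: attribute name, the = sign, whitespace
        (?: ("[^"]*" | '[^']*')     # $2: value that is already quoted
          | ([^\s"'>][^\s>]*) )     # $3: unquoted value
    }{
        defined $2 ? "$1$2"
                   : $1 . (index($3, '"') >= 0 ? "'$3'" : qq("$3"))
    }gex;
    return $tag;
}

my $html = <<'HTML';
<div onMouseOver=this.style.background='white'
onMouseOut = this.style.background="black"
onClick=
"console.log('test');">

<div onClick =console.log('test');>
HTML

# ASSUMPTION: no attribute value contains '>', so <[^>]*> spans a whole tag.
$html =~ s{(<[^>]*>)}{quote_attrs($1)}ge;

print $html;
```

Because already-quoted values are consumed whole by the first alternative, anything that looks like name=value inside them is left alone. Like the loop version, this cannot cleanly quote an unquoted value that contains both ' and ".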

lucy24

8:53 pm on Dec 13, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



csdude55 wrote: "I've started writing a regex that will enforce a " or ' to surround the value, and then future regexes won't have to worry about it"

Yup. I've found that “two steps forward, one step back” often ends up being the most straightforward and painless solution. Spend hours tearing out your hair to arrive at a one-line approach, or just dash off two lines and be done with it.

csdude55

4:58 am on Dec 14, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm lucky that

(a) I moved all of my displaying scripts to PHP, and just use Perl for processing. So 120 regexes aren't tragic; it's only a little slow after they submit.

(b) Perl has really improved its regex speed, so even 120 of them are relatively fast :-) Having a few of them in a loop is a little dangerous, though. Someone could submit something with hundreds of elements in it, which would result in tens of thousands of regexes being run!

In practice, though, I run most of the same regexes via JavaScript's onPaste, so this is really just a safety net in case something is missed anyway.

phranque

5:49 am on Dec 14, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



csdude55 wrote: "@phranque, following your lead there, I've started writing a regex that will enforce a " or ' to surround the value, and then future regexes won't have to worry about it."

normalizing the input seems like the best solution here...

csdude55

6:26 am on Dec 14, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Admittedly, coding it WAS fun! Albeit more time-consuming than I'd intended :-/ If I'm being honest, I'm using it as an excuse to procrastinate on the other work I should be doing, anyway.