Regex, finding multiple non-repeating instances within a string

csdude55

3:12 am on Jul 11, 2018 (gmt 0)

I'm trying to remove any data-XXXX attribute that occurs within tags, but the problem is that a tag can have several different ones, and they aren't next to each other.

For example, I might have:

<div id="whatever" data-foo style="margin: 50px" data-bar="bar" class="example" data-foo-bar="this is an example">an example</div>

How can I remove all instances of data-XXXX in this, while saving the <div, class, id, and style? I'm getting lost in a regex like this:

var a = $('#comment').html();

a = a.replace(/<([^>]*) data-[\S]+(=("|')\S+\3)*/gim, '<$1');

but of course that doesn't work, anyway, because it only catches one of the data-XXXX instead of all of them.
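
For anyone following along, here is a small sketch of why only one attribute goes away. It runs the pattern from the post against the example tag as a plain string (no jQuery); the variable names are only for illustration.

```javascript
// The example tag from the post above, as a plain string.
var a = '<div id="whatever" data-foo style="margin: 50px" data-bar="bar" ' +
        'class="example" data-foo-bar="this is an example">an example</div>';

// [^>]* is greedy: it runs to the end of the tag and backtracks only as far
// as the LAST " data-" it can find, so the earlier ones end up inside $1.
var once = a.replace(/<([^>]*) data-[\S]+(=("|')\S+\3)*/gim, '<$1');

// $1 is written straight back, earlier data- attributes included.
console.log(once.indexOf('data-foo ') !== -1);    // true: data-foo survives
console.log(once.indexOf('data-bar') !== -1);     // true: data-bar survives
console.log(once.indexOf('data-foo-bar') !== -1); // false: only the last one went
```

Because the whole tag up to that last attribute is consumed by a single match, the g flag has nothing left to find in the same tag. Note also that [\S]+ stops at the first space, so part of the quoted value is left behind.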

lucy24

5:21 am on Jul 11, 2018 (gmt 0)

I don't think this is really a RegEx question*; it's more of an individual-coding-style question.

data-foo style="margin: 50px" data-bar="bar" class="example" data-foo-bar="this is an example"

Assuming there are other possibilities, like name=blahblah or id=blahblah, each separate one is

(data(?:-\w+)+)(?: \w+)? ?= ?"[^"]*"

which you replace with whatever-it-is. But since you don't know how many there will be in any given statement, you'll need to put them inside a "while" loop:
expr = what-I-said-above
while (expr.test(a))
{ a = a.replace(expr, whatever-it-is); }
If there were always exactly the same number of matches, you could do it in one fell swoop. Or two or three fell swoops, if there might be exactly 4 or exactly 3 or exactly 2 but never any more or less.
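
A minimal sketch of that loop, under some assumptions: the function name stripDataAttrs is made up for illustration, and the pattern is one possible choice rather than the exact one from this thread. It removes one data- attribute per pass (name, plus an optional quoted or bare value) and assigns the result back, since replace returns a new string.

```javascript
// Hypothetical helper, not from the thread: strips data-* attributes in a loop.
function stripDataAttrs(html) {
  // $1 = everything in the tag before the attribute (kept).
  // Then: whitespace, data-name, optional =value ("...", '...' or bare).
  // The lookahead requires the attribute to end at whitespace, ">" or "/".
  var expr = /(<[^<>]*?)\s+data-[\w-]+(?:\s*=\s*(?:"[^"<>]*"|'[^'<>]*'|[^\s"'<>]+))?(?=[\s>\/])/;

  // No g flag: each pass removes exactly one attribute, then we test again.
  while (expr.test(html)) {
    html = html.replace(expr, '$1');
  }
  return html;
}

var cleaned = stripDataAttrs(
  '<div id="whatever" data-foo style="margin: 50px" data-bar="bar" ' +
  'class="example" data-foo-bar="this is an example">an example</div>'
);
console.log(cleaned);
// <div id="whatever" style="margin: 50px" class="example">an example</div>
```

This sketch assumes the quotes are balanced. A value with a missing closing quote is simply left alone, and text such as " data-x" inside another attribute's quoted value would be stripped too, so treat it as a starting point only.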

Edit: Oh, wait. Are they always consecutive, and you want to simply get rid of them? Then it becomes

(( data(?:-\w+)+)(?: \w+)? ?= ?"[^"]*")+

and it really is one fell swoop. But this won't work if there can be other stuff mixed in with the data-blahblah pieces. In your example, is "style" (following on a data-thingy) part of what you're getting rid of, while "class" (not following on a data-thingy) stays behind?

I should probably not have tried answering this question so close to bedtime :(

* Insert personal witticism ad lib.

csdude55

12:27 am on Jul 12, 2018 (gmt 0)

Thanks, Lucy, the while loop is what I was trying to think of :-) There's no rhyme or reason to the order of how data-XXXX attributes can come in, and the user can copy from other sites so I have no idea what they'll contain. But sometimes I get an error that messes everything up, like:

<div data-foo="this is a big ol mess>this doesn't show because the tag is missing the ending double-quote</div>

The best resolution is to strip those attributes, since they'll have no value on my page anyway.

Unless there's a recursive modifier or something, then I think the loop is the only option.

I'm confused by some of the regex, do you mind explaining some of this?

// this group is captured as \1, the ?: makes the -\w+ non-capturing. I get this
(data(?:-\w+)+)

// non-capturing the next \w, but why the space before \w or after the second 
// question mark?
(?: \w+)? 

// this one loses me... I don't understand what the ? before the = does. The 
// second ? makes the second whitespace match 0 or 1 time, right?
?= ? 

// I get this one; " followed by anything not a " until it gets to the next ". But 
// mine gets more complicated, I can have " or ' so I use ("|') and then \1 to
// find the end
"[^"]*"

"I should probably not have tried answering this question so close to bedtime :("

That's my MO, too... I work on something all day until just before bed, then I give up and ask for help. You guys probably think I'm an idiot, I'm sure most of my posts are pure gibberish by 3am! LOL

lucy24

4:02 am on Jul 12, 2018 (gmt 0)

Gosh. Something really took offense at your [ code ] markup there, didn't it.

What I said:

(data(?:-\w+)+)(?: \w+)? ?= ?"[^"]*"

What you asked:

why the space before \w

Oh, ###, I should have said
(?: \w+)*
with asterisk rather than ? because there might hypothetically be more than one. Or there might be none. The space is because that's how the "words" are separated, for example in
data-foo style=blahblah
the ( \w+) matches the " style" including its preceding space.

or after the second question mark?
I don't understand what the ? before the = does. The second ? makes the second whitespace match 0 or 1 time, right?
?= ?

The first space is actually not after the question mark but before the = sign. This is a CYA kind of rule: it might be any of
style=blahblah
style =blahblah
style= blahblah
style = blahblah
all syntactically permissible and all caught by
style ?= ?blahblah
In fact you could even say " *" both times. (Personal coding style again. I always put spaces around my equals signs, because my personal reading comfort is worth more than all those saved bytes ;))
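
A quick check of the four spacings listed above, as a sketch. The anchors ^ and $ are added here only so each test string has to match in full; "blahblah" is the placeholder from the post.

```javascript
// " ?" allows zero or one space on each side of the equals sign.
var eq = /^style ?= ?blahblah$/;

// " *" allows any number of spaces, as suggested above.
var eqStar = /^style *= *blahblah$/;

var variants = [
  'style=blahblah',
  'style =blahblah',
  'style= blahblah',
  'style = blahblah'
];

var results = variants.map(function (s) { return eq.test(s); });
console.log(results); // [ true, true, true, true ]

// Two spaces before the "=" only match the " *" form.
console.log(eq.test('style  = blahblah'));     // false
console.log(eqStar.test('style  = blahblah')); // true
```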

But mine gets more complicated, I can have " or ' so I use ("|') and then \1 to find the end
"[^"]*"

Yes, I see. So it becomes

(['"])[^"'>]*\1

assuming you can be confident the quotation marks always match even if the second one may be absent entirely. The possible missing close-quote makes an entirely different pattern, though:

(['"])[^"'>]*(?=>)

where the last thing inside the parentheses is a literal > character. But what happens if there's a missing quotation mark somewhere other than the last element inside the < > markup? It's definitely safe to say [^"'>] because then you know that, no matter what, your RegEx won't continue merrily capturing past the end of the markup.
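
A small sketch of the matched-quote version, (['"])[^"'>]*\1, tried against a few made-up attribute strings. The backreference \1 has to match the same quote character that opened the value.

```javascript
// Group 1 remembers which quote opened the value; \1 must close with the same one.
var quoted = /(['"])[^"'>]*\1/;

console.log(quoted.test('data-bar="bar"'));  // true: double quotes
console.log(quoted.test("data-bar='bar'"));  // true: single quotes
console.log(quoted.test('data-bar="bar\'')); // false: opening and closing quotes differ

// Side effect of excluding both quote characters from the value:
// an apostrophe inside a double-quoted value stops the match.
console.log(quoted.test('data-bar="it\'s"')); // false
```

The last case is a trade-off of the [^"'>] class: it keeps the match from running past the end of the markup, at the cost of not handling a value that legitimately contains the other kind of quote.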

But it kinda looks like you need to test two patterns. Javascript is pretty minimalist when it comes to lookarounds: lookaheads like (?=>) are supported, but lookbehinds historically were not. So as far as I know you're allowed to say

(['"])(?:[^"'>]*(?=>)|[^"'>]*\1)

though you can certainly try it and see if your code explodes. It means: after the first quote, continue until you either hit a matching quote, or something immediately followed by a close-bracket > (which can't be captured, so you have to use a lookahead).
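
For what it's worth, JavaScript accepts this pattern, since lookaheads are part of its regex syntax. A minimal sketch of how it behaves on the two cases from this thread (a closed value, and the value with the missing closing quote):

```javascript
// After the opening quote: either run up to a ">" (lookahead, not consumed),
// or run up to the matching closing quote.
var value = /(['"])(?:[^"'>]*(?=>)|[^"'>]*\1)/;

// Closed value: the first alternative fails (no ">" right after "bar"),
// so the second alternative matches through the closing quote.
var closed = 'data-bar="bar" class="example"'.match(value)[0];
console.log(closed); // "bar"

// Missing closing quote: the first alternative stops just before the ">".
var broken = '<div data-foo="this is a big ol mess>text</div>'.match(value)[0];
console.log(broken); // "this is a big ol mess
```

Because the ">" is only looked at and not consumed, it stays in the string when the match is deleted, so the tag still closes.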

Oh, and of course it won't really be \1 but some other number, depending on how many preceding parentheses there are, unless you go back and put ?: no-capture markup inside every blessed expression along the way. You're just deleting the whole thing, right? So you don't really need to capture anything. But all those ?: are sometimes more clutter than it's worth.
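
A short sketch of the numbering point, using the example attribute from the first post. Groups are numbered by their opening parenthesis from left to right, and (?: ... ) groups are skipped in the count.

```javascript
var tag = '<div data-foo-bar="this is an example">';

// The attribute name is captured as group 1, so the quote is group 2
// and the backreference has to be \2.
var withName = /(data(?:-\w+)+) ?= ?(['"])[^"'>]*\2/;
var m = withName.exec(tag);
console.log(m[1]); // data-foo-bar
console.log(m[2]); // "

// With the name made non-capturing, the quote is group 1 again and \1 works.
var noCapture = /(?:data(?:-\w+)+) ?= ?(['"])[^"'>]*\1/;
var n = noCapture.exec(tag);
console.log(n[0]); // data-foo-bar="this is an example"
console.log(n[1]); // "
```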

I don't think there is any difference between
('|")
and
(["'])
so use whichever form you are more likely to understand when you look at it a year from now.