Negative lookahead, why isn't this working? - Webmaster General forum at WebmasterWorld - WebmasterWorld

csdude55

5:00 am on Jul 24, 2018 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I know it's late at night and my brain is fried... I'm hoping you guys can look at this and tell me what I'm doing wrong?

This is in Javascript, but I'm putting it in General because it's really a general regexp question and not Javascript specific...


var text = '<font face="Comic Sans MS, cursive"><span class="rel-huge"><span class="rel-small"><span class="rel-large">Yeah</span></span></span></font><br>';

var font_match = /<span class=("|')rel-[^\1]+\1>\s*(<span class=("|')rel-[^\3]+\3>)((?!<\/span>)*)<\/span><\/span>/gi;
if (font_match.test(text))
 alert('1');

The problem is with ((?!<\/span>)*). I'm specifically trying to match anything that's not the ending </span>... isn't this right?

If I use ([^<]*) or (.*) then it matches, so I know that the problem is with the negative lookahead. But I need to specifically match </span> because it's possible that there could be other tags in there.

[edited by: phranque at 7:21 am (utc) on Jul 24, 2018]
[edit reason] Disable graphic smile faces for this post [/edit]
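
For readers following along: a lookahead is zero-width, so ((?!<\/span>)*) can only ever capture the empty string, however many times it repeats. A minimal sketch of that behavior, using a shortened sample string and variable names that are illustrative rather than taken from the post:

```javascript
// A lookahead only inspects what comes next; it never consumes characters.
const sample = 'Yeah</span>';

// Repeated negative lookahead, as in the original pattern: it captures nothing,
// so the literal </span> is then expected at the very start of the string.
const lookaheadOnly = /^((?!<\/span>)*)<\/span>/;

// A negated character class consumes one character per repetition.
const negatedClass = /^([^<]*)<\/span>/;

const lookaheadOnlyMatches = lookaheadOnly.test(sample);
const negatedClassCapture = negatedClass.exec(sample)[1];

console.log(lookaheadOnlyMatches); // false
console.log(negatedClassCapture);  // Yeah
```

This is why swapping in ([^<]*) or (.*) makes the pattern match: those consume the text between the tags, while the lookahead alone leaves the engine sitting where it started.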

lucy24

6:50 am on Jul 24, 2018 (gmt 0)

it's really a general regexp question and not Javascript specific...

Maybe yes, maybe no. Javascript is pretty limited in its RegEx functionality: above all, it can't do lookbehinds.

Are you sure about any of it? When I try

<span class=("|')rel-[^\1]+\1>\s*(<span class=("|')rel-[^\3]+\3>)((?!<\/span>)*)<\/span><\/span>

in my text editor it raises an “invalid RegEx” warning at three different points: the two
[^\1] [^\3]
and also the mysterious * which doesn't seem to apply to anything but the

(?!<\/span>)*

lookahead. Even if I sidestep the quotes issue by making two options

<span class="rel-[^"]+">\s*(<span class="rel-[^"]">)((?!<\/span>)*)<\/span><\/span>
<span class='rel-[^']+'>\s*(<span class='rel-[^']'>)((?!<\/span>)*)<\/span><\/span>

there's still the * to account for. What is it supposed to do?

Reconstructing what would be captured by this pattern, I get

<span class="rel-blahblah"> (<span class="rel-blahblah">)</span></span>

... which doesn't make sense, because the </span> occurs at exactly the point where the RegEx says that "</span>" must not occur. There needs to be something immediately before the lookahead:

<span class="rel-[^"]+">\s*(<span class="rel-[^"]">)(.*(?!<\/span>))<\/span><\/span>

Incidentally, why is it "rel-[^"]" ? Do you need to exclude classes that start in "rel" with no following hyphen, as well as classes that start in anything other than "rel"?

I'm also not sure the * is in the right place at all. Are you trying to allow for any number of nests?

csdude55

7:39 am on Jul 24, 2018 (gmt 0)

Maybe yes, maybe no. Javascript is pretty limited in its RegEx functionality: above all, it can't do lookbehinds.

Well shoot. Are you sure about that? I was basing some of my coding off of this:

[rexegg.com...]

It said that inline modifiers weren't recognized in Javascript, but I thought lookarounds were?

in my text editor it raises an “invalid RegEx” warning at three different points: the two
[^\1] [^\3]

I'm not sure why that's throwing an error... I've been working on it in JSFiddle, and what I posted above was a direct copy and paste.

Why would those be errors? Isn't ("|')rel-[^\1]+\1 saying "a double- or single-quote, followed by rel-, followed by anything that's not a matching double- or single-quote, until you get to the first matching double- or single-quote"?
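
One point that may explain the editor warning: a backreference cannot be used inside a character class. In JavaScript without the u flag, [^\1] is read as "any character except U+0001", so it does not stop at the matching quote. The sketch below compares it with a lookahead-based alternative; the sample string and variable names are made up for illustration.

```javascript
const html = 'class="rel-huge"><span class="rel-small">';

// Inside a class, \1 is not "the quote captured by group 1";
// it is treated as the single character U+0001.
const withClass = /class=("|')rel-[^\1]+\1>/;

// Alternative: test the backreference in a lookahead before each character.
const withLookahead = /class=("|')rel-(?:(?!\1)[\s\S])+\1>/;

const classMatch = withClass.exec(html)[0];
const lookaheadMatch = withLookahead.exec(html)[0];

console.log(classMatch === html); // true: the greedy class ran past the first closing quote
console.log(lookaheadMatch);      // class="rel-huge">
```

The first pattern still "works" in the sense that it matches, which is probably why no error showed up in JSFiddle, but it matches more than intended whenever a second quoted attribute follows.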

and also the mysterious * which doesn't seem to apply to anything but the
(?!<\/span>)*

Here's where I might be messing up. I'm trying to treat a string the same way you would treat a negative character class.

In my mind, (?!B) would be the same as [^B]... is that right? But where you can't use a string in place of B in the [ ], you can use a string in the negative lookahead.

So in this case, I want to match anything that's not </span>, until you get to the first matching </span>.

Am I correct in understanding that (.*(?!<\/span>)) is the magic trick that does that?

Incidentally, why is it "rel-[^"]" ? Do you need to exclude classes that start in "rel" with no following hyphen, as well as classes that start in anything other than "rel"?

I'm still playing with the contenteditable, and I'm doing my best to let the user change font sizes based on percentages, so that it's relative to the reader's default font size (which they set on the site).

The problem is that when the author selects text and changes the font size for that selection, window.getSelection() doesn't catch the surrounding tags. So there's the potential of having an unlimited number of nested tags, while the author is inadvertently trying to make it look right on his end. I've been trying for about a week now to figure out how to make it catch the surrounding tags, but now I'm giving up and just trying to repair the damage after it's done.

I created several classes of my own, like "rel-tiny", "rel-small", and so on. The "rel-" marker only applies to these font sizes, so I can safely remove the surrounding <span class="rel-whatever"> tags after they change the font size, and just leave the innermost tag.

[edited by: not2easy at 1:52 pm (utc) on Jul 24, 2018]
[edit reason] Disable graphic smile faces for this post [/edit]
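
The clean-up described above (drop the outer rel- wrappers and keep the innermost one) could be sketched with a loop like the following. This is not the poster's code: collapseRelSpans is a hypothetical helper, and it assumes double-quoted class attributes and outer spans that contain nothing but the inner span. A DOM-based approach would likely be more robust for arbitrary markup.

```javascript
// Matches an outer rel- span whose only content is an inner rel- span
// that itself contains no further <span> or </span> tags.
const nestedRel = /<span class="rel-[^"]+">\s*(<span class="rel-[^"]+">(?:(?!<\/?span\b)[\s\S])*<\/span>)\s*<\/span>/i;

function collapseRelSpans(html) {
  // Each pass removes one wrapper, working from the inside out.
  while (nestedRel.test(html)) {
    html = html.replace(nestedRel, '$1');
  }
  return html;
}

const text = '<font face="Comic Sans MS, cursive"><span class="rel-huge"><span class="rel-small"><span class="rel-large">Yeah</span></span></span></font><br>';

const collapsed = collapseRelSpans(text);
console.log(collapsed);
// <font face="Comic Sans MS, cursive"><span class="rel-large">Yeah</span></font><br>
```

The (?:(?!<\/?span\b)[\s\S])* part is the usual way to say "any run of characters that does not contain this string", which is the effect the original ((?!<\/span>)*) was aiming for.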

lucy24

6:09 pm on Jul 24, 2018 (gmt 0)

Why would those be errors?

I'm simply not sure you’re allowed to use a capture in this context. Evidently in SubEthaEdit you’re not; I haven't tried it in other RegEx engines.

So in this case, I want to match anything that's not </span>, until you get to the first matching </span>.

Am I correct in understanding that (.*(?!<\/span>)) is the magic trick that does that?

On sober consideration, probably not, because it's still saying “match stuff that is not immediately followed by </span>, and then grab a series of </span>”.

I can't remember if Javascript recognizes the .*? or .+? structure, meaning “stop as soon as you can” (where the RegEx default is to go on for as long as you can). The difference would be this. Given

<span class="rel-blahblah"> <span class="rel-blahblah"><span class="rel-blahblah">blahblah</span></span></span>

then this pattern

<span class="rel-[^"]+">.*(<\/span>)+

would capture

<span class="rel-blahblah"> <span class="rel-blahblah"><span class="rel-blahblah">blahblah</span></span></span>

while the pattern with ?

<span class="rel-[^"]+">.*?(<\/span>)+

(dammit, Forums, I SAID “Disable graphic smileys!”) would capture only

<span class="rel-blahblah"> <span class="rel-blahblah"><span class="rel-blahblah">blahblah</span></span></span>

--and I hope you are not color blind, because this was the best way I could find to illustrate the difference.
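
In plain text the two results above look identical, and the overall match is in fact the same for this input. What changes is how the match is divided between the dot part and the closing </span> group. A small sketch, with extra capture groups added only to make the split visible (JavaScript does support the lazy .*? form):

```javascript
const html = '<span class="rel-blahblah"> <span class="rel-blahblah"><span class="rel-blahblah">blahblah</span></span></span>';

// Greedy: .* grabs as much as it can and gives back only one </span>.
const greedy = /<span class="rel-[^"]+">(.*)((?:<\/span>)+)/;
// Lazy: .*? stops at the first </span>, leaving all three for the group.
const lazy = /<span class="rel-[^"]+">(.*?)((?:<\/span>)+)/;

const g = greedy.exec(html);
const l = lazy.exec(html);

console.log(g[2]);          // </span>
console.log(l[2]);          // </span></span></span>
console.log(g[0] === l[0]); // true: same overall match, different split
```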

csdude55

12:11 am on Jul 25, 2018 (gmt 0)

I'm simply not sure you’re allowed to use a capture in this context. Evidently in SubEthaEdit you’re not; I haven't tried it in other RegEx engines.

I just now ran the whole thing through ValidateJavascript.com and didn't have any notable errors there, so I think that might just be SubEthaEdit (one I've never used before).

I hope so, anyway, cause I used that style a LOT! LOL When people paste data, the browser often changes the " to '... and sometimes it adds ="" or ='' unnecessarily... and sometimes removes the quotes altogether. So I have to make allowances for all kinds of nonsense.

On sober consideration, probably not, because it's still saying “match stuff that is not immediately followed by </span>, and then grab a series of </span>”.

Ahh, I see. I'm guessing there's really not anything similar to a negative class set for strings, then :-(

I can't remember if Javascript recognizes the .*? or .+? structure, meaning “stop as soon as you can” (where the RegEx default is to go on for as long as you can).

Yes and no... the . doesn't match newlines in Javascript, and there's no /s modifier, so it's more common to use [\s\S]* or [\s\S]+.

The difference would be this.

I gotcha.

The whole thing is getting complicated and bug-prone, no matter which way I go! I'm concerned since this method matches the first <span> and first </span>, instead of the first <span> and LAST </span>... so it doesn't necessarily match up. But maybe that won't matter, since I'm removing nested <span> tags as soon as the child is added.
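
On the point about . and newlines, a quick sketch of the difference. The sample string is invented for illustration. An s (dotAll) flag was added to the language in ES2018, so whether it can be relied on depends on the browsers being targeted; [\s\S] works everywhere.

```javascript
const multiline = '<span class="rel-small">line one\nline two</span>';

// . refuses to cross the line break, so the closing tag is never reached.
const dot = /<span[^>]*>(.*?)<\/span>/;
// [\s\S] matches any character at all, including line breaks.
const anyChar = /<span[^>]*>([\s\S]*?)<\/span>/;

const dotMatches = dot.test(multiline);
const anyCharCapture = anyChar.exec(multiline)[1];

console.log(dotMatches);                        // false
console.log(anyCharCapture.split('\n').length); // 2: the capture spans both lines
```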

csdude55

3:07 am on Jul 31, 2018 (gmt 0)

Lucy, I'm still struggling here... this is a different bit of code, but basically the same subject, so I thought I'd just tag on here.

I'm trying to limit repeating characters, UNLESS they're within an HTML tag (I don't want a link to break). So this is where I am:

var a = "This is aaaaaaaaaaaaa <a href='http:\/\/example.com/bbbbbbbbbbbb'>teeeeeeeest<\/a>";

// I want to match aaaaaa and eeeeee, but not bbbbbbb
var match = /((?!<.*)((.)\3{2,})(?!>))/g;

while (match.test(a))
 a = a.replace(match, '$3$3');

I'm trying to return:
This is aa <a href='http://example.com/bbbbbbbbbbbb'>teest</a>

but instead it's still catching the bbbbbbb:
This is aa <a href='http://example.com/bb'>teest</a>

Any suggestions on how to make it ignore text that's between < >?
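
A note on why the attempt above misses: (?!<.*) and (?!>) are each checked at a single position, so they only say "the next character is not <" and "the next character is not >". Neither knows whether an earlier < is still open. One alternative (a different technique from the one in the post, offered as a sketch) is to look forward from the end of the run and reject it if a > turns up before any <. This assumes reasonably well-formed markup with no stray > in text or attribute values.

```javascript
const a = "This is aaaaaaaaaaaaa <a href='http://example.com/bbbbbbbbbbbb'>teeeeeeeest</a>";

// Original attempt: each lookahead only tests the position it sits at.
const original = /((?!<.*)((.)\3{2,})(?!>))/g;

// Alternative: skip a run if ">" is reached before any "<",
// which means the run sits inside a tag.
const outsideTags = /(.)\1{2,}(?![^<>]*>)/g;

const originalResult = a.replace(original, '$3$3');
const outsideResult = a.replace(outsideTags, '$1$1');

console.log(originalResult);
// This is aa <a href='http://example.com/bb'>teest</a>
console.log(outsideResult);
// This is aa <a href='http://example.com/bbbbbbbbbbbb'>teest</a>
```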

lucy24

3:27 am on Jul 31, 2018 (gmt 0)

I might do it the other way around: look for things that are not inside html tags. Leaving off the /m flag* means things will continue to work across line breaks, because from everything I've heard so far about your users' input, it is not safe to assume there will never be line breaks in weird places. The [^<>] is just-for-insurance; it really means [^<].

/((?:^|>)[^<>]*?)([^<>])\2{2,}/g

I ended up with \2 rather than \3. Are you allowed to use the open-ended {2,} construction in javascript? If not, say something like {2,20} instead. Then again, you could say
([^<>])\2\2+
--again meaning “three or more of the same character”--with exactly the same result.

* I find this terminology horribly confusing, because in SubEthaEdit “multiline” means “don’t stop at line breaks”, i.e. the precise opposite of what it means in javascript.
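
For what it's worth, the open-ended {2,} form is valid in JavaScript, and the two spellings of "three or more of the same character" behave the same way. A small check, with sample strings chosen only for illustration:

```javascript
const openEnded = /([^<>])\1{2,}/;  // one character, then two or more repeats
const spelledOut = /([^<>])\1\1+/;  // one character, one repeat, then one or more repeats

const samples = ['aa', 'aaa', 'aaaaaa', 'abab', 'xyzzzy'];

// Both patterns should give the same verdict on every sample.
const agree = samples.every((s) => openEnded.test(s) === spelledOut.test(s));

console.log(agree);                 // true
console.log(openEnded.test('aa'));  // false: only two in a row
console.log(openEnded.test('aaa')); // true: three in a row
```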

csdude55

5:11 am on Jul 31, 2018 (gmt 0)

Thanks for the help, Lucy!

When you said this:

(?:^|>)

did you mean this:

(?:<|>)

?

If not, what does the ^ in a group class represent? Or is it literal?

Are you allowed to use the open-ended {2,} construction in javascript?

I'm pretty sure it's OK. I've used it in the onPaste script for several years with no problems. But in the live script I limit the repeat to 6, not 2... I just did 2 for testing.

The code you posted is getting pretty close! But I'm still hitting a speed bump. The exact code I have is:


var match = /((?:<|>)[^<>]*?)([^<>])\2{2,}/g;

while (match.test(a))
 a = a.replace(match, '$2$2');

# Result
This is aaaaaaaaaaaaa bb'eest

For testing, I did this in place of the while():

a = a.replace(match, 
 function($match, $1, $2, $3, $4, $5) {
  var ret =
   '\n' +
   '1 - ' + $1 + '\n' +
   '2 - ' + $2 + '\n' +
   '\n';

  return ret;
 });

# Result
This is aaaaaaaaaaaaa 
1 - <a href="http://example.com/
2 - b
">teeeeeeest</a>

So I think it's backwards, it's only changing what's in the < >.
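
A sketch of why the result looks backwards: with < as an allowed starting point, the lazy [^<>]*? is free to walk through the inside of the tag, and a replacement of '$2$2' throws away everything captured in $1. The variable names below are illustrative.

```javascript
const a = "This is aaaaaaaaaaaaa <a href='http://example.com/bbbbbbbbbbbb'>teeeeeeeest</a>";

// Starting at "<" lets [^<>]*? run through the inside of the tag.
const startsAtEitherBracket = /((?:<|>)[^<>]*?)([^<>])\2{2,}/;
// Starting at "^" or ">" keeps the search in the text outside the tags.
const startsOutsideTag = /((?:^|>)[^<>]*?)([^<>])\2{2,}/;

const insideHit = startsAtEitherBracket.exec(a);
const outsideHit = startsOutsideTag.exec(a);

console.log(JSON.stringify(insideHit[1]));  // "<a href='http://example.com/"
console.log(insideHit[2]);                  // b
console.log(JSON.stringify(outsideHit[1])); // "This is "
console.log(outsideHit[2]);                 // a
```

Keeping $1 in the replacement ('$1$2$2') is what preserves the text leading up to the repeated run.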

lucy24

5:33 am on Jul 31, 2018 (gmt 0)

what does the ^ in a group class represent? Or is it literal?

Neither: it’s an anchor. When you wake up tomorrow morning, it will be with a resounding “D’oh!” as you remember that you have known this all along ;)

Here it means: begin your text search from a close-bracket--i.e. after the most recent html markup--or from the very beginning of the whole string, assuming your test strings don't always obligatorily start with an html tag. (If they do always, without exception, start with an html tag, then a simple > will suffice. But nothing is ever simple on your site, is it.)

Anyway, that's why the form has to be pipe-delimited; it's not a character class.

(?:^|>)

is not even remotely the same as

(?:[^>])

which in turn is not the same as

(?:[\^>])
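
The three forms can be told apart with a quick table. This sketch appends a literal x to each form so there is something to match; the inputs are arbitrary.

```javascript
const anchorOrBracket = /(?:^|>)x/;  // x at the start of the string, or right after ">"
const anyButBracket = /(?:[^>])x/;   // x after any one character except ">"
const caretOrBracket = /(?:[\^>])x/; // x after a literal "^" or ">"

const inputs = ['x', '>x', 'ax', '^x'];

// One row per input: [input, anchor form, negated class, literal caret class]
const table = inputs.map((s) => [
  s,
  anchorOrBracket.test(s),
  anyButBracket.test(s),
  caretOrBracket.test(s),
]);

console.log(table);
// [ 'x',  true,  false, false ]
// [ '>x', true,  false, true  ]
// [ 'ax', false, true,  false ]
// [ '^x', false, true,  true  ]
```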

csdude55

7:47 am on Jul 31, 2018 (gmt 0)

When you wake up tomorrow morning, it will be with a resounding “D’oh!” as you remember that you have known this all along ;)

Oohhhhhhhhhh! Duh :-P So "beginning of the string" or ">"... makes total sense now... sheesh, I need to go to bed! lol

So for those reading, here's the final script:

var match = /((?:^|>)[^<>]*?)([^<>])\2{2,}/;

while (match.test(a))
 a = a.replace(match, '$1$2$2');

I have another similar one where I only want to change the contents if it's not surrounded by an <a href=...>...</a> tag. It's kinda complicated, too, but I'm gonna see if I can use what you showed me to figure it out tomorrow. Thanks, Lucy!
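
For anyone who wants to try the final script as posted, here it is wrapped in a function and run against the test string from earlier in the thread. The function name limitRepeats is hypothetical; the pattern and replacement are the ones above.

```javascript
function limitRepeats(input) {
  // No g flag: each pass of the loop trims the leftmost run that is outside a tag.
  const match = /((?:^|>)[^<>]*?)([^<>])\2{2,}/;
  let a = input;
  while (match.test(a)) {
    a = a.replace(match, '$1$2$2');
  }
  return a;
}

const limited = limitRepeats("This is aaaaaaaaaaaaa <a href='http://example.com/bbbbbbbbbbbb'>teeeeeeeest</a>");

console.log(limited);
// This is aa <a href='http://example.com/bbbbbbbbbbbb'>teest</a>
```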

lucy24

5:00 pm on Jul 31, 2018 (gmt 0)

Unfortunately I have now realized that this pattern breaks if your test string does start with HTML markup, so it has to be

(?:^[^<]|>)

(i.e. initial character but only if it is not an opening bracket) ... handsomely illustrating the two different syntactic meanings of the non-literal ^ character :)
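
A quick check of the two variants, offered as a sketch rather than a verdict. With the inputs below, the (?:^|>) form already leaves a leading tag alone, because ([^<>]) can never consume the opening <. The (?:^[^<]|>) form consumes the first character of the string, so a run of exactly three at the very start is no longer trimmed. It may be that the breakage was seen with a different input. The helper name and sample strings are invented for illustration.

```javascript
function limitRepeats(input, pattern) {
  let a = input;
  while (pattern.test(a)) {
    a = a.replace(pattern, '$1$2$2');
  }
  return a;
}

const anchorOnly = /((?:^|>)[^<>]*?)([^<>])\2{2,}/;
const anchorPlusChar = /((?:^[^<]|>)[^<>]*?)([^<>])\2{2,}/;

const leadingTag = "<a href='http://example.com/bbbb'>teeeest</a>";
const leadingRun = 'aaa bbbb';

console.log(limitRepeats(leadingTag, anchorOnly));     // <a href='http://example.com/bbbb'>teest</a>
console.log(limitRepeats(leadingTag, anchorPlusChar)); // <a href='http://example.com/bbbb'>teest</a>
console.log(limitRepeats(leadingRun, anchorOnly));     // aa bb
console.log(limitRepeats(leadingRun, anchorPlusChar)); // aaa bb
```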