Another regex question. match anything EXCEPT

csdude55

2:08 am on Aug 15, 2018 (gmt 0)

Just another minor stumbling block...

I'm currently matching and removing data-, role, cite, and itxt attributes from elements that are pasted, using:

var data_match = /(<[^>]+)(?:data-[\w-]+|role|cite|itxt[\w-]*)=("|')[\s\S]+?\2([^>]*>)/gim;

buuuuut, now I have a few exceptions that I want to allow; specifically, data-quote and data-em.

My instinct is to change data-quote and data-em to data%2Dquote and data%2Dem, respectively, then run the regex above, then switch the %2D back to a hyphen. But before I do, is there any way to modify the regex to simply ignore those two values?

lucy24

3:15 am on Aug 15, 2018 (gmt 0)

My instinct is to change data-quote and data-em to data%2Dquote and data%2Dem, respectively, then run the regex above, then switch the %2D back to a hyphen.

You’d be surprised at how often it really does end up simplest to go two steps forward and one back.

Option B is the negative lookahead:
data-(?!em\b|quote\b)
The \b may or may not be necessary, depending on your possible data-blahblah values. Note the position of the pipe in the lookahead.
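
As a rough illustration of Option B, the lookahead can be dropped into the original pattern right after data-. This is a sketch under assumptions: the '$1$3' replacement string, the stripAttributes helper name, the repeat-until-stable loop, and the sample markup are not from the thread.

```javascript
// Original pattern with the negative lookahead added after "data-".
var data_match = /(<[^>]+)(?:data-(?!em\b|quote\b)[\w-]+|role|cite|itxt[\w-]*)=("|')[\s\S]+?\2([^>]*>)/gim;

// Hypothetical helper: one match swallows the whole tag, so the replacement
// is repeated until nothing changes, to catch several attributes per tag.
function stripAttributes(html) {
  var previous;
  do {
    previous = html;
    html = html.replace(data_match, '$1$3'); // assumed replacement string
  } while (html !== previous);
  return html;
}

// Made-up sample: data-em should survive, data-foo and role should not.
var sample = '<p data-foo="1" data-em="x" role="note">text</p>';
var cleaned = stripAttributes(sample);
console.log(cleaned);
```

Note that with the \b in place, a name such as data-em-foo is also left alone, because the hyphen after "em" counts as a word boundary.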

csdude55

12:13 am on Aug 16, 2018 (gmt 0)

Thanks, Lucy! I swear I'm never gonna get the hang of look-arounds, they trip me up every time :-(

I'm working on a similar-but-different regex, too... do you mind looking at this and telling me if I'm doing it right? It seems to work the way I want, but I'm nervous about putting it on a live site.

In this one I'm trying to find URLs and convert them to a link, but only if they're not already in a tag.

I originally just removed the link tag altogether, like:

replace(/<a href=("|')(\S+)\1[^>]*>[\s\S]+<\/a>/gi, '$2')

but there's the occasional time when someone copies an article with a link like <a href="http://www.foo.com">bar</a>, and then it just makes a mess. So I want to leave those intact (although I'll probably add a "_blank" target).

I've also had a few issues with sites that have URLs like http://www.foo.com/www.bar.com/, so while I want to fix links that are missing the http:// protocol, I also need to make sure that I don't catch links that are actually part of the REQUEST_URI.

So here's what I made, loosely based on some info you gave me before:

var a = 'www.example.com';

var www_match = /((?:^|[^<]|>)[^<>]*?[^\b\w]*?)(www\.\S+(\b|$))/gi;
a = a.replace(www_match, 'https://$2');

var link_match = /((?:^|[^<]|>)[^<>]*?)(https?:\/\/\S+)(\b|$)/gi;
a = a.replace(link_match, '<a href="$2" target="_blank">$2</a>');

Do you see any reason why this would catch URLs other than what I've intended?

lucy24

4:01 am on Aug 16, 2018 (gmt 0)

(?:^|[^<]|>)

I think an extra pipe sneaked in there. Wasn’t this the one where the object is to find strings only outside of <markup>? If so it needs to be

(?:^[^<]|>)

i.e. start each fresh search at the very beginning of your test string--but only if it doesn't start right in with <markup>, hence the ^[^<] locution--and then every time the <markup> closes.

csdude55

4:47 am on Aug 16, 2018 (gmt 0)

You're right, that's where we were talking about it before... when I was trying to match and limit repeating characters outside of a tag.

But in this one, when I don't include the pipe it doesn't match. Is it OK to link to a fiddle here? Assuming so:

[jsfiddle.net...]

csdude55

7:13 am on Aug 16, 2018 (gmt 0)

Update: I had to modify www_match to include the / and to require 1 or more characters instead of 0 or more:

var www_match = /((?:^|[^<]|>)[^<>]*?[^\b\w\/]+?)(www\.\S+(\b|$))/gi;

But since I'm trying to better learn lookarounds, am I correct that this would be better? It appears to work correctly in the fiddle:

(?<!https?:\/\/)

I.e.:

var www_match = /((?:^|[^<]|>)[^<>]*?(?<!https?:\/\/))(www\.\S+(\b|$))/gi;

lucy24

6:00 pm on Aug 16, 2018 (gmt 0)

Can you use ? in a lookbehind? (SubEthaEdit won’t let me: I have to say separately http:// and https:// because a lookbehind, unlike a lookahead, has to be of fixed length. Could be worse: some flavors require everything to be of fixed length.) I remember we talked before about lookbehinds in Javascript; I was under the impression you can’t use them [regular-expressions.info] (if the fragment gets eaten, scroll down to Important Notes). In practice I guess that means: be sure to try it out in MSIE < current-version to be safe.
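
For what it's worth, lookbehind was added to JavaScript in ES2018, and the JavaScript flavor does not require it to be fixed-length, which is consistent with (?<!https?:\/\/) working in the fiddle. Engines that predate it (Internet Explorer included) reject the syntax outright. A minimal sketch of a guarded check follows; the function names and the sample string are made up for illustration.

```javascript
// Building the pattern with new RegExp keeps the syntax error catchable.
// A regex literal containing a lookbehind would stop the whole script from
// parsing in an engine that lacks support.
function supportsLookbehind() {
  try {
    new RegExp('(?<!a)b');
    return true;
  } catch (e) {
    return false;
  }
}

// Hypothetical helper: find www. addresses that are not directly preceded
// by http:// or https://
function findBareWww(text) {
  if (!supportsLookbehind()) {
    return null; // caller would fall back to a non-lookbehind approach
  }
  var bare_www = new RegExp('(?<!https?://)www\\.\\S+', 'gi');
  return text.match(bare_www) || [];
}

console.log(findBareWww('see www.foo.com and https://www.bar.com'));
```

Returning null when support is missing is just one possible design; the point is that the pattern is never compiled unless the check passes.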

I remain uneasy about this

(?:^|[^<]|>)

because then what happens if the very first thing in your test string is <markup>? You'd then be capturing exactly what you don't want to capture. The [^<>] is meant to protect you, but do some further experimenting to make sure it really does.

When a Regular Expression itself contains literal slashes, I sometimes find it more readable to use the “new RegExp” formulation instead. (If it contains both slashes and quotation marks, you’re SOL ;) )

I think (\b|$) may be redundant, because “end of string” itself counts as a \b.
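
One case worth checking before dropping the $: end of string only counts as a \b when the last character is a word character. If the address ends in a slash, \b and (\b|$) stop in different places. A small sketch with made-up sample strings (the firstMatch helper is hypothetical):

```javascript
var with_boundary = /www\.\S+\b/;
var with_either = /www\.\S+(\b|$)/;

// Hypothetical helper: return the matched text, or null if there is no match.
function firstMatch(re, text) {
  var m = text.match(re);
  return m ? m[0] : null;
}

// Ends in a word character: both patterns agree.
console.log(firstMatch(with_boundary, 'www.example.com'));  // www.example.com
console.log(firstMatch(with_either, 'www.example.com'));    // www.example.com

// Ends in a slash: \b backs up to the last word character, $ does not.
console.log(firstMatch(with_boundary, 'www.example.com/')); // www.example.com
console.log(firstMatch(with_either, 'www.example.com/'));   // www.example.com/
```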

csdude55

7:16 pm on Aug 17, 2018 (gmt 0)

Can you use ? in a lookbehind? (SubEthaEdit won’t let me: I have to say separately http:// and https:// because a lookbehind, unlike a lookahead, has to be of fixed length. Could be worse: some flavors require everything to be of fixed length.)

You're right, I remember dealing with that one before... I'm not quite sure why it was working, but I did find that if I changed it to http: then it didn't match anymore.

I tried changing it to:

(?:(?<!https)|(?<!http):\/\/)

but that didn't match at all, so... blah.

I remember we talked before about lookbehinds in Javascript; I was under the impression you can’t use them [regular-expressions.info] (if the fragment gets eaten, scroll down to Important Notes). In practice I guess that means: be sure to try it out in MSIE < current-version to be safe.

Very good point, I've been testing in Chrome but IE is a real pain. I don't know how many times I've thought that everything was great, then I test on IE9 and have to virtually start over >:-(

I remain uneasy about this
(?:^|[^<]|>)
because then what happens if the very first thing in your test string is <markup>? You'd then be capturing exactly what you don't want to capture. The [^<>] is meant to protect you, but do some further experimenting to make sure it really does.

Excellent point, too. I'm not sure why it's not matching at all without the pipe, though.

But you're right, with the pipe it's not matching when it's the first thing in the text, either.

I think I've written a method that works without using lookarounds... it might be slower, but it works. More in a second...

When a Regular Expression itself contains literal slashes, I sometimes find it more readable to use the “new RegExp” formulation instead. (If it contains both slashes and quotation marks, you’re SOL ;) )

True, I've used "new RegExp" when I need a variable in the regex, but that's all.

I'd gotten so used to using ### as the delimiter in Perl that now it's hard for me to go back to using / and escaping it!

I think (\b|$) may be redundant, because “end of string” itself counts as a \b.

Great point there, too! And it did work the same without the $.

So, after working on it all night, in a last minute fit of rage I decided to give up on the lookarounds altogether and try an alternative. This might be a bit slower, but it seems to work.

The concept is to replace all <a...></a> tags with something unique, and instead save them to an array. Then I can go through and add the protocol and link tags without worrying about whether it's already in a link tag, and then go back and put the original links back in, untouched.

// Remove all targets; I'm mainly working with pasted data that can originate from
// other sites, so I don't want to accidentally have a target that conflicts with
// something I'm already using
var target_remove_match = /(<a[^>]*) target=("|')\w*\2/gi;
a = a.replace(target_remove_match, '$1');

// Replace all links with a placeholder; it's notable that I get a warning on the
// "function", though, and while I think it can be safely ignored, it's worth knowing:
// Functions declared within loops referencing an outer scoped variable may lead to 
// confusing semantics
var remove_link = /(<a[^>]+>[\s\S]+?<\/a>)/gi;
var save_link = [];
var x = 0;

while (remove_link.test(a))
  a = a.replace(remove_link, function($match, $1) {
    x++;
    save_link[x] = $1;

    // this could be anything unique as long as it has the x variable somewhere. Since
    // I'm not using <q> for emojis anymore, I'm thinking about using '<q>' + x + '</q>'
    // as a placeholder instead of '::chr(' + x + ')::'
    return '::chr(' + x + ')::';
  });

// Convert www to https://www; do I need to allow for more than ^|>|\s?
var www_match = /(^|>|\s)(www\.\S+\b)/gi;
a = a.replace(www_match, '$1https://$2');

// Add A element
var link_match = /(^|>|\s)(https?:\/\/(\S+))\b/gi;
a = a.replace(link_match, '$1<a href="$2">$3</a>');

// Reinstate links
var reinstate_link = /::chr\(([0-9]+)\)::/gi;
while (reinstate_link.test(a))
  a = a.replace(reinstate_link, function($match, $1) {
    return save_link[$1];
  });

// Add _blank target to all A elements
var target_match = /(<a [^>]*)>/gi;
a = a.replace(target_match, '$1 target="_blank">');

And here's the working fiddle; click the button and you'll see all of the conversions that I could think of for testing:

[jsfiddle.net...]
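
The same placeholder idea can be condensed into one function. This is only a sketch under assumptions: the linkify name, the zero-based placeholder numbering, the <a\b and [^\s<] tweaks (to avoid matching tags such as <abbr> and to stop a URL at the next tag), and the sample string are not from the post above. A global replace already visits every match, so the while loops are not needed, which also sidesteps the function-in-a-loop warning.

```javascript
// Hypothetical helper combining the steps from the post above.
function linkify(text) {
  var saved = [];

  // Remove existing target attributes from pasted links.
  text = text.replace(/(<a\b[^>]*) target=("|')\w*\2/gi, '$1');

  // Park existing links behind placeholders; /g handles every match in one pass.
  text = text.replace(/<a\b[^>]*>[\s\S]+?<\/a>/gi, function (match) {
    saved.push(match);
    return '::chr(' + (saved.length - 1) + ')::';
  });

  // Add the protocol to bare www addresses.
  text = text.replace(/(^|>|\s)(www\.[^\s<]+\b)/gi, '$1https://$2');

  // Wrap bare URLs in an A element.
  text = text.replace(/(^|>|\s)(https?:\/\/([^\s<]+))\b/gi, '$1<a href="$2">$3</a>');

  // Put the parked links back, untouched.
  text = text.replace(/::chr\((\d+)\)::/g, function (match, index) {
    return saved[Number(index)];
  });

  // Add a _blank target to every A element.
  return text.replace(/(<a\b[^>]*)>/gi, '$1 target="_blank">');
}

var input = 'Visit www.foo.com or <a href="http://bar.com" target="x">bar</a>';
console.log(linkify(input));
```

As in the original, the trailing \b drops a final slash or punctuation mark from the captured URL, so that behavior is worth testing against real pasted content.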

lucy24

8:21 pm on Aug 17, 2018 (gmt 0)

(?:(?<!https)|(?<!http):\/\/)

When a lookaround involves multiple options, put the whole thing in the same lookaround markup, and separate the options with a pipe:

(?<!http|https)

It looks unnerving when you're not used to it, but that's the syntax.
(?=ab|cd) = followed by ab OR followed by cd
(?!ab|cd) = followed by NEITHER ab NOR cd
and so on.
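
A quick way to see the pipe-inside-the-lookaround rule in action, using lookaheads so it runs in any JavaScript engine. The patterns and sample strings are made up for illustration.

```javascript
// x followed by ab OR followed by cd
var followed_by_either = /x(?=ab|cd)/;

// x followed by NEITHER ab NOR cd
var followed_by_neither = /x(?!ab|cd)/;

console.log(followed_by_either.test('xab'));  // true
console.log(followed_by_either.test('xcd'));  // true
console.log(followed_by_either.test('xef'));  // false

console.log(followed_by_neither.test('xef')); // true
console.log(followed_by_neither.test('xab')); // false
console.log(followed_by_neither.test('xcd')); // false
```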

The concept is to replace all <a...></a> tags with something unique, and instead save them to an array.

This sounds like another version of when you throw up your hands and decide that “two steps forward, one back” works out to less trouble in the long run.

csdude55

11:25 pm on Aug 17, 2018 (gmt 0)

When a lookaround involves multiple options, put the whole thing in the same lookaround markup, and separate the options with a pipe

Ohhhh, I see. I thought that since it was fixed-length then I couldn't use the pipe, either. But cool, I'll remember that :-)

This sounds like another version of when you throw up your hands and decide that “two steps forward, one back” works out to less trouble in the long run.

Haha, exactly! I've been working on what was supposed to be 5 lines of code for 3 days now, and it still wasn't right. I came up with that variation at around 3am and had it banged out and functional by 3:30, so even though using 2 loops and an array of unknown size might be a bit slower than the lookarounds, at least it works on all of the browsers I could test and I understand it...

This is one more thing added to my list of things to come back to when everything else is done...