Forum Moderators: phranque

Regex not matching as expected

         

csdude55

6:58 pm on Aug 28, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This question is not language specific, so I'm putting it under General.

I'm confused as to why this regex stops after the first match:

// string to match
str = '?id=123&h=456&s=789';

// regex to substitute
exp = /(\?|&)(h|z|s)=[^&]+&?/gi;

// result
?id=123&s=789


But this one matches everything (which is what I expected from the first):

// string to match
str = '?id=123&h=456&s=789';

// regex to substitute
exp = /(\?|&)(h|z|s)=[^&]+/gi;

// result
?id=123&&


The difference is that the second one excludes the trailing &?, so I end up with unnecessary & in places.

But what REALLY throws me off is that this one matches everything, too:

// string to match
str = '?h=456&id=123&s=789';

// regex to substitute
exp = /(\?|&)(h|z|s)=[^&]+&?/gi;

// result
?id=123&


The only difference between this third one and the first one is that I reverse the placement of "h" and "id" in "str". But in practice this is the query string, so I can't enforce that.

What am I misunderstanding here?

lucy24

8:12 pm on Aug 28, 2022 (gmt 0)




When you say “RegEx to substitute”, do you mean “RegEx to delete”, or has this been simplified for posting purposes? If the latter, I strongly suspect the problem is in the substitution rule. In particular, those extra ampersands shouldn't be cropping up.

"so I end up with unnecessary & in places"
I don't see why, since the pattern is, or should be, “question mark or ampersand plus the rest of the parameter-plus-value pair”. You definitely don't want to have & at both ends of the pattern--keeping in mind that Regular Expressions are greedy by nature, and will grab a & if they see one--or you could end with
?id=123s=789

Can I assume it's just personal coding style to say (\?|&) and (h|z|s) rather than [?&] and [hzs]? There don't seem to be any captures. But, again, something may have been left out for posting purposes.

I suggest going back and explaining what you're actually trying to do: strip part of the query? replace something with something else? (A question that is much more often asked in the Apache subforum: Yes, we definitely need to see what you've tried, but we also need to see, in plain English, what the rule is intended to do.)

:: dammit, Forums, I SAID disable graphic smileys (and why the heck does it think an ampersand is a semicolon?) ::

And let us stipulate that this is not happening in BBEdit, which has its own particular rules about ampersands in substitution patterns.

phranque

11:40 pm on Aug 28, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



"This question is not language specific..."

"...the substitution rule"

I would also like to see how str and exp are used in code to get the resulting string.

csdude55

4:45 am on Aug 29, 2022 (gmt 0)




I've run across this stumbling block in Perl, PHP, and JavaScript. What happens is: I try it using the /g flag and it doesn't work as expected, I change it to a loop and it works the way I originally expected, then I move on. So now I'm trying to figure out why that is.

In this case, I'm using JavaScript. I'm creating a variable based on location.pathname + location.search that I use as the key name for localStorage, but I want to remove a handful of parameters that change regularly, so that the page still matches when they do.

So the real script that's working is:

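// Query string, plus a pattern for the parameters that should be ignored
var qs = location.search,
    qsMatch = /(\?|&)(?:h|z|s|start(?:view)|return_here)=[^&]+(?:&(?:amp;)?)?/gi;

// Repeat until nothing matches, keeping the leading ? or & each time
while (qsMatch.test(qs))
    qs = qs.replace(qsMatch, '$1');

// Strip any ? or & left dangling at the end
var saveName = location.pathname + qs
    .replace(/[?&]+$/, '');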


but in retrospect, since I'm using the while() loop here, I guess I really don't need the /g flag.
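For what it's worth, the loop seems to get the job done with or without /g because every pass rescans the string from the start, so a parameter whose leading & was swallowed by the previous match gets picked up on the next pass. A minimal sketch, using the simplified pattern from the first post; the helper name stripWithLoop and the pass counter are made up for illustration:

```javascript
// Hypothetical helper: same idea as the while() loop above, minus the /g flag.
function stripWithLoop(qs) {
  const re = /(\?|&)(?:h|z|s)=[^&]+&?/i; // simplified pattern, no /g
  let passes = 0;
  // Without /g, test() and replace() always start scanning at index 0.
  while (re.test(qs)) {
    qs = qs.replace(re, '$1'); // replaces only the first match on each pass
    passes++;
  }
  // Drop any ? or & left dangling at the end.
  return { qs: qs.replace(/[?&]+$/, ''), passes };
}

const looped = stripWithLoop('?id=123&h=456&s=789');
console.log(looped.qs, looped.passes); // ?id=123 2
```

The second pass is what catches s=789: by then the string is ?id=123&s=789, and the & in front of s is available again.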

I took that to JSFiddle so that I could play around and figure it out, which is where I was using simplified code that I posted originally.

I think that I'm going to abandon this particular script in favor of URLSearchParams(), though, so at this point I'm just trying to learn why it's doing what it's doing for the sake of education.
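A rough sketch of what the URLSearchParams route might look like. The helper name stripParams and the list of names are assumptions for illustration. Note that parameter names are matched case-sensitively here (unlike the /i regex), and that toString() re-serializes the values, so the output may not be character-for-character identical to the original query string.

```javascript
// Hypothetical helper: drop the named parameters and rebuild the query string.
function stripParams(search, names) {
  const params = new URLSearchParams(search); // a leading "?" is ignored
  for (const name of names) {
    params.delete(name); // removes every occurrence of that name
  }
  const rest = params.toString();
  return rest ? '?' + rest : '';
}

// Assumed list, based on the names in the regex above.
const volatileNames = ['h', 'z', 's', 'start', 'startview', 'return_here'];

console.log(stripParams('?id=123&h=456&s=789', volatileNames)); // ?id=123
console.log(stripParams('?h=456&id=123&s=789', volatileNames)); // ?id=123
```

Because the parameters are parsed rather than pattern-matched, their order in the query string no longer matters.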

lucy24

5:13 am on Aug 29, 2022 (gmt 0)




Is the idea to remove part of the query string, but keep selected bits? It now occurs to me that you can't delete the initial ? --assuming one of the bits you're deleting happens to be the first parameter--because then you're left with a non-functional query string. So you'd need to not pick up the initial & or ?, but instead pick up the trailing & if present:
string = string.replace(/\b[hzs]=[^&]+&?/gi,"")

Now the trailing &? becomes appropriate, because the final parameter in any query string won't have a & after it, while all the others do. The \b at the beginning of the pattern is to ensure you're not mangling some harmless parameter like "push" or "biz", but only matching exact letters. The same still holds if you have a more complicated set of options:
\b(h|z|s|start(?:view)?|return_here)=
and so on, assuming you meant
(?:view)?
where "view" is optional. If so you could even say
s(?:tart(?:view)?)?
though I don't suppose the saved picosecond is really worth it.
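Putting those pieces together, a sketch of the \b version with the longer alternation. The function name stripVolatile is made up for illustration, the pattern assumes (?:view)? was what was intended, and the final cleanup of a dangling ? or & is borrowed from the loop version earlier in the thread:

```javascript
// Word boundary in front, optional & behind; no leading ? or & is consumed.
const volatileRe = /\b(?:h|z|s|start(?:view)?|return_here)=[^&]+&?/gi;

// Hypothetical helper wrapping the two replace() calls.
function stripVolatile(search) {
  return search
    .replace(volatileRe, '')   // delete each name=value pair plus a trailing &
    .replace(/[?&]+$/, '');    // tidy up a dangling ? or & at the end
}

console.log(stripVolatile('?id=123&h=456&s=789')); // ?id=123
console.log(stripVolatile('?h=456&id=123&s=789')); // ?id=123
console.log(stripVolatile('?push=1&biz=2'));       // ?push=1&biz=2
```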

csdude55

5:49 pm on Aug 30, 2022 (gmt 0)




"Is the idea to remove part of the query string, but keep selected bits?"

That's correct.

In this case, I'm using localStorage to store whatever they've typed so that if they leave the page and come back then it won't be lost. But these listed variables could change during that time even though the user didn't realize it.

For example, they might be in the message board, and I use the "h" param to match the timestamp of the last post. This way, if they click to read a thread a second time, it won't accidentally be loaded from cache. So they might be typing up a reply while someone else is replying, then click on Back for some reason. At that point the "h" param will have changed, so if they click to go back to the thread and the "h" param is still there then they'll lose whatever they typed.

So I want to store the URL without that specific variable.

Using \b worked perfectly, thanks! I'm not entirely sure that I understand WHY it worked when (\?|&) didn't, but knowing is half the battle :-)

lucy24

8:39 pm on Aug 30, 2022 (gmt 0)




"I'm not entirely sure that I understand WHY it worked"
Anchors do have their uses :) It's like the difference between
blahblah\n
and
blahblah$
The first version means “this stuff PLUS the following line break”, while the second means “this stuff IF it comes at the end of a line”.
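Applied to the strings from the first post, the same distinction seems to account for all three results: a trailing &? consumes the very ampersand that the next match would need as its leading (\?|&), and /g never revisits characters that have already been consumed. A small sketch, assuming the replacement string was '$1' as in the loop version (the first post does not show the replace call); listMatches is a hypothetical helper:

```javascript
// Hypothetical helper: list the full text of every match a /g regex finds.
function listMatches(re, str) {
  return [...str.matchAll(re)].map(m => m[0]);
}

const withTrailing = /(\?|&)(h|z|s)=[^&]+&?/gi;
const withoutTrailing = /(\?|&)(h|z|s)=[^&]+/gi;

const first = '?id=123&h=456&s=789';
const third = '?h=456&id=123&s=789';

// The & before s=789 is eaten by the first match, so s=789 has no leading & left.
console.log(listMatches(withTrailing, first));    // [ '&h=456&' ]
// Nothing trailing is consumed, so both parameters still have their leading &.
console.log(listMatches(withoutTrailing, first)); // [ '&h=456', '&s=789' ]
// id=123 sits between the two targets, so s=789 keeps its own leading &.
console.log(listMatches(withTrailing, third));    // [ '?h=456&', '&s=789' ]

console.log(first.replace(withTrailing, '$1'));    // ?id=123&s=789
console.log(first.replace(withoutTrailing, '$1')); // ?id=123&&
console.log(third.replace(withTrailing, '$1'));    // ?id=123&
```

With \b in front, nothing has to be consumed before the parameter name, so there is nothing for two neighbouring matches to fight over.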

\b can be used at either end of a pattern. It means “word boundary”: a position with a word character on one side and a non-word character (or the start or end of the string) on the other, where a word character is anything alphanumeric or _ lowline. It can't always be used, since things like - hyphen or ’ apostrophe are also non-word characters although they might easily occur as part of, er, words. And, worse, some RegEx engines (looking at you, recent versions of SubEthaEdit) don't recognize non-ASCII letters as word characters. But in the present situation, where you control your parameter names, it’s a very useful tool.
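A few quick checks of those caveats as they play out in JavaScript specifically, where \w covers only ASCII letters, digits and underscore. The sample parameter names are invented for illustration:

```javascript
const startsParamH = /\bh=/;

const checks = {
  plain:    startsParamH.test('?h=1'),          // "?" is a non-word character
  inside:   startsParamH.test('?push=1'),       // "s" before "h" is a word character
  hyphen:   startsParamH.test('?my-h=1'),       // "-" counts as a non-word character
  nonAscii: startsParamH.test('?caf\u00e9h=1')  // "é" is not in JavaScript's \w
};

console.log(checks);
// { plain: true, inside: false, hyphen: true, nonAscii: true }
```

So a parameter named my-h, or one ending in an accented letter followed by h, would still be caught by \bh=, which is only a concern if names like that can actually turn up.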