Regex not matching as expected

csdude55

1:15 am on Oct 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm doing this in Perl, but I don't think the language matters too much. Here's the code:

$_ = 'https://www.example.com/whatever.html?utm_source=blah1';

$pattern = '(\?[^&]*)(&(amp;)?)?(utm_\w+|source)=[^&]*(&(amp;)?)?';

while (m#$pattern#gi) {
s#$pattern#$1$2#gi;
}

print;


I'm expecting it to print https://www.example.com/whatever.html?, but instead I'm getting https://www.example.com/whatever.html?utm_ (note the utm_ in the query string).

I originally thought that the problem was in the order of operations, so I tried reversing the pattern to (source|utm_\w+). But that had no impact; I still had the same problem. Which makes sense in hindsight, because it would have removed "source" before "utm_source", and then "utm_" wouldn't match anything.

I tried changing $_ to subshare=blah1 and the pattern to (sub.+|share), but that left me with ?sub at the end of $_. So it's not the "\w", the "_", or the specific words "utm" or "source".

If I remove |source from the pattern then it deletes utm_source=blah1 as expected.

If I change $_ from utm_source=blah1 to utm_campaign=blah1 or source=blah1 then it works as expected. So it's not the (\?[^&]*)(&(amp;)?)? that's throwing it off.

Can you guys and gals suggest what's messing me up here? I've tried every variation I can think of!

[edited by: phranque at 1:58 pm (utc) on Oct 11, 2019]
[edit reason] disable graphic smile faces [/edit]

lucy24

1:39 am on Oct 11, 2019 (gmt 0)

I think you're running afoul of Regular Expressions' essential greediness. Given the chance, the pattern
(\?[^&]*)(&(amp;)?)?(utm_\w+|source)
(using [ code ] markup to suppress unwanted smileyfaces) will match as far as it can, meaning everything up to and including “utm_” because it can still grab a “source” after that.
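
A minimal sketch of that backtracking, using the URL and pattern from the first post. The helper name show_captures is made up for illustration; it only reports what group 1 (the part that gets written back) and group 4 (the parameter name that gets removed) captured.

```perl
use strict;
use warnings;

# Hypothetical helper: returns what group 1 (kept) and group 4 (removed
# parameter name) captured, or an empty list if the pattern does not match.
sub show_captures {
    my ($url, $pattern) = @_;
    return unless $url =~ m#$pattern#i;
    return ($1, $4);
}

my $url = 'https://www.example.com/whatever.html?utm_source=blah1';

# Original pattern: [^&]* is greedy, so it backs off only as far as it has to.
# It stops backing off as soon as "source=" can match after it.
my $both = '(\?[^&]*)(&(amp;)?)?(utm_\w+|source)=[^&]*(&(amp;)?)?';
my ($kept, $name) = show_captures($url, $both);
print "kept='$kept' name='$name'\n";    # kept='?utm_' name='source'

# Without the "source" alternative there is no early way out, so the
# engine has to backtrack until [^&]* is empty.
my $utm_only = '(\?[^&]*)(&(amp;)?)?(utm_\w+)=[^&]*(&(amp;)?)?';
($kept, $name) = show_captures($url, $utm_only);
print "kept='$kept' name='$name'\n";    # kept='?' name='utm_source'
```

Since the substitution writes group 1 back, the first case leaves ?utm_ in the URL, which is the result described in the question.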

csdude55

3:10 am on Oct 11, 2019 (gmt 0)

You're so smart, lucy <3

I guess I really need the regex to match ONLY if it's preceded by ?, &, or &amp;. This seems to work:

$pattern = '(\?|&(amp;)?)?(utm_\w+|source)=[^&]*(&(amp;)?)?';

while (m#$pattern#i) {
s#$pattern#$1#i;
}


I also found that I can remove source from the pattern and do it separately. But that feels unnecessarily slow and complicated... something that, in a few years, I'll look back at and be totally confused by.

$pattern = '(\?[^&]*)(&(amp;)?)?utm_\w+=[^&]*(&(amp;)?)?';

while (m#$pattern#i) {
s#$pattern#$1$2#i;
}

s#(\?[^&]*)(&(amp;)?)?source=[^&]*(&(amp;)?)?#$1$2#i;


Do you see any flaw with using the first one? Someone mentioned before that a string like http://www.example.com/some-path?utm_source=123&parameter2=xyz would result in http://www.example.com/some-path&parameter2=xyz (removing the ? but leaving the &), but in my 10 minutes of testing I don't have that problem.
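
A quick check of that concern, as a sketch only. Because the leading ? or & is captured as $1 and written back, the ? survives. One thing that does show up: the trailing ? on the first group makes the prefix optional, so "source" can also match inside a longer name such as "resource". The helper name strip_params and the $required variant are assumptions added for comparison; the later version in this thread drops that ? as well.

```perl
use strict;
use warnings;

# Pattern as posted: the prefix group is optional (note its trailing "?").
my $optional = '(\?|&(amp;)?)?(utm_\w+|source)=[^&]*(&(amp;)?)?';
# Assumed variant for comparison: same pattern with a mandatory prefix.
my $required = '(\?|&(amp;)?)(utm_\w+|source)=[^&]*(&(amp;)?)?';

# Hypothetical helper: repeat the substitution until nothing matches,
# writing back the captured prefix (or nothing, if there was none).
sub strip_params {
    my ($url, $pattern) = @_;
    1 while $url =~ s#$pattern#defined $1 ? $1 : ''#ie;
    return $url;
}

# The "?" is captured as $1 and restored, so it is not lost.
print strip_params('http://www.example.com/some-path?utm_source=123&parameter2=xyz', $optional), "\n";
# http://www.example.com/some-path?parameter2=xyz

# With the optional prefix, "source" inside "resource" also matches.
print strip_params('http://www.example.com/some-path?resource=abc', $optional), "\n";
# http://www.example.com/some-path?re

# With a mandatory prefix, "resource" is left alone.
print strip_params('http://www.example.com/some-path?resource=abc', $required), "\n";
# http://www.example.com/some-path?resource=abc
```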

[edited by: phranque at 1:59 pm (utc) on Oct 11, 2019]
[edit reason] disable graphic smile faces [/edit]

tangor

5:29 am on Oct 11, 2019 (gmt 0)

As a wee dolt in things regex I want to thank all who play the game for both queries/problems and solutions/answers...

I learn more here than out of any book or online tutorial.

Sincerely, thanks!

csdude55

12:48 pm on Oct 11, 2019 (gmt 0)

Same here... I have several books on it, but learned more from @lucy24 and a few others on here :-)

lucy24

5:47 pm on Oct 11, 2019 (gmt 0)

If we were allowed to use smileys, I would here insert this one [cosgan.de].

csdude, it's often easier to answer questions if they're approached from the other direction: What, precisely, are you trying to do? What result do you want to obtain? I say this more often in the Apache subforum--but really it's the identical question, because there it tends to be a matter of formulating a RewriteRule to achieve suchandsuch result, and this in turn comes down to formulating the appropriate Regular Expression.

csdude55

6:55 pm on Oct 11, 2019 (gmt 0)

It's the same thing we talked about a couple of weeks ago... users submit a comment or whatever that includes a link, and I convert the link to <a href='whatever'>whatever</a>. But I want to remove tracking parameters because they cause confusion for advertisers; e.g., someone sees something on Facebook, shares it to my site, 100 people click on the link on my site, and the tracking parameter makes the advertiser think they had 101 clicks from Facebook.

lucy24

1:27 am on Oct 12, 2019 (gmt 0)

Got it.

#1 you need to remove the parameter, rather than just let your perl/cgi/php/whatever ignore it (as they'd do by default, lacking instructions about what to do with it), because if GA sees the parameter at all, it will take actions you don't want it to take.

#2 the utm_source=blahblah parameter can potentially
a) be the only parameter
b) be followed by other parameters
c) come in the middle of a list of parameters
d) come last in the list
which creates three possible configurations (two, if your RegEx engine doesn't mind empty captures):
1a
\?utm_source=[^&]*$
1b
\?utm_source=[^&]*&(more-stuff-to-capture)
or alternatively
1a + 1b = 1
\?utm_source=[^&]*($|&more-stuff-to-capture)
2
\?(preliminary-stuff-to-capture)&utm_source=[^&]*(optionally-more-stuff-again)

It's configuration 2 that's troublesome because you really want to avoid saying
\?(.+?)&utm_source=[^&]*(optionally-more-stuff-again)
with the dreaded non-final .* or .+ pattern. How much do you know about what other parameters can occur? Even if you know only that no other parameter name can ever start in “u” that would be a big help, but GA is just laden with utm-blablah parameters isn't it?
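
One way to sidestep configuration 2 altogether, sketched here as an alternative and not something proposed in this thread: split the query string on & (or &amp;), drop the unwanted parameters by name, and join whatever is left. The sub name strip_tracking and the deliberately short name list are assumptions for illustration. The sketch also ignores #fragments and rejoins with a plain &, which may or may not be what you want when the URL ends up inside HTML.

```perl
use strict;
use warnings;

# Assumed, deliberately short list of parameter names to drop.
my $tracking = qr/^(?:utm_\w+|source)$/i;

# Hypothetical helper: split the query into parameters, keep the ones whose
# name is not a tracking name, and rebuild the URL.
sub strip_tracking {
    my ($url) = @_;
    my ($base, $query) = split /\?/, $url, 2;
    return $url unless defined $query;

    my @kept;
    for my $param (split /&(?:amp;)?/, $query) {
        next if $param eq '';                  # skip empty pieces (e.g. "&&")
        my ($name) = split /=/, $param, 2;     # text before the first "="
        push @kept, $param unless $name =~ $tracking;
    }
    return @kept ? $base . '?' . join('&', @kept) : $base;
}

# a) only parameter, b) first, c) middle, d) last
print strip_tracking('https://www.example.com/a.html?utm_source=x'), "\n";
print strip_tracking('https://www.example.com/a.html?utm_source=x&b=2'), "\n";
print strip_tracking('https://www.example.com/a.html?a=1&amp;utm_source=x&b=2'), "\n";
print strip_tracking('https://www.example.com/a.html?a=1&utm_source=x'), "\n";
```

Because each name is compared as a whole (anchored with ^ and $), a parameter like resource=1 is not touched, and the position of the tracking parameter in the list no longer matters.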

csdude55

4:01 am on Oct 12, 2019 (gmt 0)

For real! It becomes a bit of an issue, but it's one that's been making me lose money so I have to work on it :-( Remember the good ol' days when they just looked at the referrer? LOL

In this case, though, I'm printing the link to a database, so I can just manipulate it as a string before storing it. I'm not so concerned with what I see through my own URL, it's the click-through from my site to other sites that I have to worry about.

This is my final code so far:

@arr = (
'https://www.example.com/whatever.html?utm_content=blah4',

'https://www.example.com/whatever.html?something=abc&amp;utm_source=123&amp;utm_medium=blah2&amp;utm_campaign=blah3&amp;utm_content=blah4&foo=bar',

'https://www.example.com/whatever.html?utm_source=123&amp;utm_medium=blah2&amp;utm_campaign=blah3&amp;utm_content=blah4&foo=bar',

'https://www.example.com/whatever.html?utm_source=123&amp;something=abc&amp;utm_medium=blah2&amp;utm_campaign=blah3&amp;utm_content=blah4&foo=bar',

'https://www.example.com/whatever.html?something=abc&amp;utm_source=123&amp;utm_medium=blah2&amp;utm_campaign=blah3&amp;utm_content=blah4',

'https://www.example.com/whatever.html?utm_source=123&amp;utm_medium=blah2&amp;utm_campaign=blah3&amp;utm_content=blah4',

'https://www.example.com/whatever.html?utm_source=123&amp;ocid=boo&utm_medium=blah2&amp;utm_campaign=blah3&amp;utm_content=blah4'
);

$pattern = '(\?|&(amp;)?)(utm_\w+|(c|oc|trk|gcl|fbcl|mkw|pgr|pta)id|refer+er|share|mkt_tok|(sub|_)*source|usqp|ref_(src|url)|mtrref|gw[ht]|refcode[0-9]*)=[^&]*(&(amp;)?)?';

foreach (@arr) {
print $_ . "\n";

while (m#$pattern#i) {
s#$pattern#$1#i;
}

# just in case, collapse every run of repeated & (or &amp;) into a single &
s#(&(amp;)?){2,}#&#g;

# remove trailing ? or &
s#(\?|&(amp;)?)+$##;

print ' -> ' . $_ . "\n\n";
}

## Results
https://www.example.com/whatever.html?utm_content=blah4
-> https://www.example.com/whatever.html

https://www.example.com/whatever.html?something=abc&utm_source=123&utm_medium=blah2&utm_campaign=blah3&utm_content=blah4&foo=bar
-> https://www.example.com/whatever.html?something=abc&foo=bar

https://www.example.com/whatever.html?utm_source=123&utm_medium=blah2&utm_campaign=blah3&utm_content=blah4&foo=bar
-> https://www.example.com/whatever.html?foo=bar

https://www.example.com/whatever.html?utm_source=123&something=abc&utm_medium=blah2&utm_campaign=blah3&utm_content=blah4&foo=bar
-> https://www.example.com/whatever.html?something=abc&foo=bar

https://www.example.com/whatever.html?something=abc&utm_source=123&utm_medium=blah2&utm_campaign=blah3&utm_content=blah4
-> https://www.example.com/whatever.html?something=abc

https://www.example.com/whatever.html?utm_source=123&utm_medium=blah2&utm_campaign=blah3&utm_content=blah4
-> https://www.example.com/whatever.html

https://www.example.com/whatever.html?utm_source=123&ocid=boo&utm_medium=blah2&utm_campaign=blah3&utm_content=blah4
-> https://www.example.com/whatever.html


I tested with every variation that you mentioned and that I could think of, and so far they've all come out looking right :-)

lucy24

4:39 pm on Oct 12, 2019 (gmt 0)

Whew!