
Forum Moderators: phranque

Regex not matching as expected

     
1:15 am on Oct 11, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Mar 15, 2013
posts: 1205
votes: 120


I'm doing this in Perl, but I don't think the language matters too much. Here's the code:

$_ = 'https://www.example.com/whatever.html?utm_source=blah1';

$pattern = '(\?[^&]*)(&(amp;)?)?(utm_\w+|source)=[^&]*(&(amp;)?)?';

while (m#$pattern#gi) {
s#$pattern#$1$2#gi;
}

print;


I'm expecting it to print https://www.example.com/whatever.html?, but instead I'm getting https://www.example.com/whatever.html?utm_ (note the utm_ in the query string).

I originally thought that the problem was the order of the alternatives, so I tried reversing the pattern to (source|utm_\w+). But that had no impact; I still had the same problem. That makes sense in hindsight: it would have matched "source" before "utm_source", and the leftover "utm_" wouldn't match anything.

I tried changing $_ to subshare=blah1 and the pattern to (sub.+|share), but that left me with ?sub at the end of $_. So it's not the "\w", the "_", or the specific words "utm" or "source".

If I remove |source from the pattern then it deletes utm_source=blah1 as expected.

If I change $_ from utm_source=blah1 to utm_campaign=blah1 or source=blah1 then it works as expected. So it's not the (\?[^&]*)(&(amp;)?)? that's throwing it off.

Can you guys and gals suggest what's messing me up here? I've tried every variation I can think of!

[edited by: phranque at 1:58 pm (utc) on Oct 11, 2019]
[edit reason] disable graphic smile faces [/edit]

1:39 am on Oct 11, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15934
votes: 887


I think you're running afoul of Regular Expressions' essential greediness. Given the chance, the pattern
[^&]*)(&(amp;)?)?(utm_\w+|source)
(using [ code ] markup to suppress unwanted smileyfaces) will match as far as it can, meaning everything up to and including “utm_” because it can still grab a “source” after that.
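
To see the greediness concretely, here is a minimal sketch that prints what the first group captures for the original pattern and for the same pattern without the "source" alternative. The helper name first_capture is made up for illustration; both patterns are copied from the posts in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return what group 1 captured, or undef if no match.
sub first_capture {
    my ($url, $pattern) = @_;
    return $url =~ m#$pattern#i ? $1 : undef;
}

my $url = 'https://www.example.com/whatever.html?utm_source=blah1';

# Original pattern: [^&]* backtracks only as far as it must, and stops
# as soon as "source=" can match, so "utm_" ends up inside group 1.
my $greedy = '(\?[^&]*)(&(amp;)?)?(utm_\w+|source)=[^&]*(&(amp;)?)?';

# Same pattern without the "source" alternative: now [^&]* has to give
# back everything after the "?" before utm_\w+ can match.
my $utm_only = '(\?[^&]*)(&(amp;)?)?utm_\w+=[^&]*(&(amp;)?)?';

print first_capture($url, $greedy),   "\n";   # ?utm_
print first_capture($url, $utm_only), "\n";   # ?
```

Because the substitution puts group 1 back, whatever [^&]* swallowed survives in the output, which is where the stray utm_ comes from.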
3:10 am on Oct 11, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Mar 15, 2013
posts: 1205
votes: 120


You're so smart, lucy <3

I guess I really need the regex to match ONLY if it's preceded by ?, &, or &amp;. This seems to work:

$pattern = '(\?|&(amp;)?)?(utm_\w+|source)=[^&]*(&(amp;)?)?';

while (m#$pattern#i) {
s#$pattern#$1#i;
}


I also found that I can remove source from the pattern and do it separately. But that feels unnecessarily slow and complicated... something that, in a few years, I'll look back at and be totally confused by.

$pattern = '(\?[^&]*)(&(amp;)?)?utm_\w+=[^&]*(&(amp;)?)?';

while (m#$pattern#i) {
s#$pattern#$1$2#i;
}

s#(\?[^&]*)(&(amp;)?)?source=[^&]*(&(amp;)?)?#$1$2#i;


Do you see any flaw with using the first one? Someone mentioned before that a string like http://www.example.com/some-path?utm_source=123&parameter2=xyz would result in http://www.example.com/some-path&parameter2=xyz (removing the ? but leaving the &), but in my 10 minutes of testing I don't have that problem.
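
One possible flaw worth checking in the first pattern is that its leading (\?|&(amp;)?) group is optional, so a name can be matched from its middle. The sketch below compares it with a variant where the delimiter is required. The helper strip_params, the variable names, and the resource=abc test URL are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: repeatedly remove matches, putting group 1 back.
sub strip_params {
    my ($url, $pattern) = @_;
    while ($url =~ m#$pattern#i) {
        # Group 1 is undefined when the optional group did not take part.
        my $keep = defined $1 ? $1 : '';
        $url =~ s#$pattern#$keep#i;
    }
    return $url;
}

# Pattern from the post above: the leading delimiter is optional.
my $optional = '(\?|&(amp;)?)?(utm_\w+|source)=[^&]*(&(amp;)?)?';
# Variant with the delimiter required (trailing ? on the group removed).
my $required = '(\?|&(amp;)?)(utm_\w+|source)=[^&]*(&(amp;)?)?';

my $tricky = 'https://www.example.com/page?resource=abc&id=7';
print strip_params($tricky, $optional), "\n";   # https://www.example.com/page?reid=7
print strip_params($tricky, $required), "\n";   # unchanged

# The case someone mentioned before comes out fine with either pattern.
my $case = 'http://www.example.com/some-path?utm_source=123&parameter2=xyz';
print strip_params($case, $optional), "\n";     # http://www.example.com/some-path?parameter2=xyz
```

With the optional group, "source=abc&" is cut out of the middle of "resource=abc&", which mangles a parameter that should have been left alone.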

[edited by: phranque at 1:59 pm (utc) on Oct 11, 2019]
[edit reason] disable graphic smile faces [/edit]

5:29 am on Oct 11, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10563
votes: 1123


As a wee dolt in things regex I want to thank all who play the game for both queries/problems and solutions/answers...

I learn more here than out of any book or online tutorial.

Sincerely, thanks!
12:48 pm on Oct 11, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Mar 15, 2013
posts: 1205
votes: 120


Same here... I have several books on it, but learned more from @lucy24 and a few others on here :-)
5:47 pm on Oct 11, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15934
votes: 887


If we were allowed to use smileys, I would here insert this one [cosgan.de].

csdude, it's often easier to answer questions if they're approached from the other direction: What, precisely, are you trying to do? What result do you want to obtain? I say this more often in the Apache subforum--but really it's the identical question, because there it tends to be a matter of formulating a RewriteRule to achieve suchandsuch result, and this in turn comes down to formulating the appropriate Regular Expression.
6:55 pm on Oct 11, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Mar 15, 2013
posts: 1205
votes: 120


It's the same thing we talked about a couple of weeks ago... users submit a comment or whatever that includes a link, and I convert the link to <a href='whatever'>whatever</a>. But I want to remove tracking parameters because they cause confusion for advertisers; e.g., someone sees something on Facebook, shares it to my site, 100 people click on the link on my site, and the tracking parameter makes the advertiser think they had 101 clicks from Facebook.
1:27 am on Oct 12, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15934
votes: 887


Got it.

#1 you need to remove the parameter, rather than just let your perl/cgi/php/whatever ignore it (as they'd do by default, lacking instructions about what to do with it), because if GA sees the parameter at all, it will take actions you don't want it to take.

#2 the utm_source=blahblah parameter can potentially
a) be the only parameter
b) be followed by other parameters
c) come in the middle of a list of parameters
d) come last in the list
which creates three possible configurations (two, if your RegEx engine doesn't mind empty captures):
1a
\?utm_source=[^&]*$
1b
\?utm_source=[^&]*&(more-stuff-to-capture)
or alternatively
1a + 1b = 1
\?utm_source=[^&]*($|&more-stuff-to-capture)
2
\?(preliminary-stuff-to-capture)&utm_source=[^&]*(optionally-more-stuff-again)

It's #2 that's troublesome because you really want to avoid saying
\?(.+?)&utm_source=[^&]*(optionally-more-stuff-again)
with the dreaded non-final .* or .+ pattern. How much do you know about what other parameters can occur? Even if you know only that no other parameter name can ever start in “u” that would be a big help, but GA is just laden with utm-blablah parameters isn't it?
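
One way to sidestep the non-final .+ question is to avoid a single regex altogether: split the query string on its separators and filter the pieces by name. This is a different technique from the patterns above, sketched here under some assumptions: the helper name drop_params is made up, the kept parameters are rejoined with a plain & (so any &amp; separators are not preserved), and only utm_* and source are filtered.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every query parameter whose name matches $name_re.
sub drop_params {
    my ($url, $name_re) = @_;

    # Separate the part before "?", the query string, and any "#fragment".
    my ($base, $query, $fragment) =
        $url =~ m!^([^?#]*)(?:\?([^#]*))?(\#.*)?\z!s;
    return $url unless defined $query;

    # Split on & or &amp; and keep parameters whose whole name does not match.
    my @kept = grep { length($_) && !/^(?:$name_re)(?:=|\z)/i }
               split /&(?:amp;)?/, $query;

    my $out = $base;
    $out .= '?' . join('&', @kept) if @kept;
    $out .= $fragment if defined $fragment;
    return $out;
}

my $names = qr/utm_\w+|source/i;

# The configurations listed above: only, first, middle, last.
print drop_params('https://www.example.com/p?utm_source=x', $names), "\n";
print drop_params('https://www.example.com/p?utm_source=x&foo=bar', $names), "\n";
print drop_params('https://www.example.com/p?a=1&amp;utm_source=x&b=2', $names), "\n";
print drop_params('https://www.example.com/p?a=1&utm_source=x', $names), "\n";
```

Since every parameter is examined on its own, its position in the list no longer matters, and there is no leftover ? or & to clean up afterwards.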
4:01 am on Oct 12, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Mar 15, 2013
posts: 1205
votes: 120


For real! It becomes a bit of an issue, but it's one that's been making me lose money so I have to work on it :-( Remember the good ol' days when they just looked at the referrer? LOL

In this case, though, I'm printing the link to a database, so I can just manipulate it as a string before storing it. I'm not so concerned with what I see through my own URL, it's the click-through from my site to other sites that I have to worry about.

This is my final code so far:

@arr = (
'https://www.example.com/whatever.html?utm_content=blah4',

'https://www.example.com/whatever.html?something=abc&amp;utm_source=123&amp;utm_medium=blah2&amp;utm_campaign=blah3&amp;utm_content=blah4&foo=bar',

'https://www.example.com/whatever.html?utm_source=123&amp;utm_medium=blah2&amp;utm_campaign=blah3&amp;utm_content=blah4&foo=bar',

'https://www.example.com/whatever.html?utm_source=123&amp;something=abc&amp;utm_medium=blah2&amp;utm_campaign=blah3&amp;utm_content=blah4&foo=bar',

'https://www.example.com/whatever.html?something=abc&amp;utm_source=123&amp;utm_medium=blah2&amp;utm_campaign=blah3&amp;utm_content=blah4',

'https://www.example.com/whatever.html?utm_source=123&amp;utm_medium=blah2&amp;utm_campaign=blah3&amp;utm_content=blah4',

'https://www.example.com/whatever.html?utm_source=123&amp;ocid=boo&utm_medium=blah2&amp;utm_campaign=blah3&amp;utm_content=blah4'
);

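# A delimiter (?, & or &amp;) must come directly before the parameter name,
# so a name such as "resource" is never matched from its middle.
$pattern = '(\?|&(amp;)?)(utm_\w+|(c|oc|trk|gcl|fbcl|mkw|pgr|pta)id|refer+er|share|mkt_tok|(sub|_)*source|usqp|ref_(src|url)|mtrref|gw[ht]|refcode[0-9]*)=[^&]*(&(amp;)?)?';

foreach (@arr) {
print $_ . "\n";

# Loop rather than use /g: each match also consumes the separator that
# follows it, so the next parameter has to be matched on a fresh pass.
while (m#$pattern#i) {
s#$pattern#$1#i;
}

# just in case, collapse every run of repeating &
s#(&(amp;)?){2,}#&#g;

# remove trailing ? or &
s#(\?|&(amp;)?)+$##;

print ' -> ' . $_ . "\n\n";
}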

## Results
https://www.example.com/whatever.html?utm_content=blah4
-> https://www.example.com/whatever.html

https://www.example.com/whatever.html?something=abc&utm_source=123&utm_medium=blah2&utm_campaign=blah3&utm_content=blah4&foo=bar
-> https://www.example.com/whatever.html?something=abc&foo=bar

https://www.example.com/whatever.html?utm_source=123&utm_medium=blah2&utm_campaign=blah3&utm_content=blah4&foo=bar
-> https://www.example.com/whatever.html?foo=bar

https://www.example.com/whatever.html?utm_source=123&something=abc&utm_medium=blah2&utm_campaign=blah3&utm_content=blah4&foo=bar
-> https://www.example.com/whatever.html?something=abc&foo=bar

https://www.example.com/whatever.html?something=abc&utm_source=123&utm_medium=blah2&utm_campaign=blah3&utm_content=blah4
-> https://www.example.com/whatever.html?something=abc

https://www.example.com/whatever.html?utm_source=123&utm_medium=blah2&utm_campaign=blah3&utm_content=blah4
-> https://www.example.com/whatever.html

https://www.example.com/whatever.html?utm_source=123&ocid=boo&utm_medium=blah2&utm_campaign=blah3&utm_content=blah4
-> https://www.example.com/whatever.html


I tested with every variation that you mentioned and that I could think of, and so far they've all come out looking right :-)
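
To keep the long alternation readable a few years from now, the pattern can be assembled from a list of names instead of being typed as one string. This is only a sketch: the fragments are copied from the pattern above, but the names @names, $alternatives and strip_tracking are made up, and the inner groups are switched to non-capturing (?:...) so that $1 is the only capture.

```perl
use strict;
use warnings;

# Parameter-name fragments copied from the pattern above, one per line.
my @names = (
    'utm_\w+',
    '(?:c|oc|trk|gcl|fbcl|mkw|pgr|pta)id',
    'refer+er',
    'share',
    'mkt_tok',
    '(?:sub|_)*source',
    'usqp',
    'ref_(?:src|url)',
    'mtrref',
    'gw[ht]',
    'refcode[0-9]*',
);

my $alternatives = join '|', @names;

# Group 1 is the leading delimiter, which is put back by the substitution.
my $pattern = qr/(\?|&(?:amp;)?)(?:$alternatives)=[^&]*(?:&(?:amp;)?)?/i;

# Hypothetical wrapper around the same three steps as the code above.
sub strip_tracking {
    my ($url) = @_;
    1 while $url =~ s/$pattern/$1/;       # one parameter per pass
    $url =~ s/(?:&(?:amp;)?){2,}/&/g;     # collapse repeated separators
    $url =~ s/(?:\?|&(?:amp;)?)+$//;      # drop a trailing ? or &
    return $url;
}

print strip_tracking('https://www.example.com/whatever.html?utm_content=blah4'), "\n";
# https://www.example.com/whatever.html
```

Adding or removing a tracking parameter then means editing one line of the list rather than hunting through the alternation.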
4:39 pm on Oct 12, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15934
votes: 887


Whew!