Tracking IDs

Forum Moderators: phranque

maybe a list of ids to remove

csdude55

2:16 pm on Oct 2, 2019 (gmt 0)

When someone submits a URL to be saved in my database, I typically remove tracking IDs because:

1. They take up space unnecessarily; and

2. If someone is clicking a link on MY site then having a referrer from another site is misleading, and may hurt me financially.

Any interest in keeping a list of tracking ID parameters that can safely be removed?

Here are the ones I'm filtering (in regex format):

utm_\w+?
cid
ocid
trkid
gclid
fbclid
refer+er
share
mkt_tok
mkwid
pgrid
ptaid
_*source
amp_\w+?
usqp
ref_src
ref_url
mtrref
gwh
gwt
subsource
refcode[0-9]*

In Perl the actual regex looks like:

s#(\?|&(amp;)?)(utm_\w+?|...|refcode[0-9]*)=[^&]*#$1#gi

phranque

10:37 pm on Oct 2, 2019 (gmt 0)

I typically remove tracking IDs

how are you checking to ensure it is actually a tracking parameter?
i once worked on a cms where the cid parameter was the "content id".

s#(\?|&(amp;)?)(utm_\w+?|...|refcode[0-9]*)=[^&]*#$1#gi

have you tested this on urls with several parameters in the query string?
if i'm reading this correctly, this:
http://www.example.com/some-path?cid=123&parameter2=xyz
will result in this:
http://www.example.com/some-path&parameter2=xyz
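A quick way to check this is to run the substitution on that exact URL. The sketch below uses an abbreviated name list (an assumption for illustration, since the thread elides the full list with "..."). As far as I can tell, the leading separator is captured as $1 and put back, so the "?" survives and the URL is left with a stray "&" right after it rather than losing the "?" altogether.

```perl
use strict;
use warnings;

# Abbreviated name list for illustration; the thread's full list is longer.
my $names = qr/utm_\w+|cid|gclid|fbclid/i;

my $url = 'http://www.example.com/some-path?cid=123&parameter2=xyz';

# Same shape as the quoted substitution: the separator is captured as $1
# and put back, while "name=value" is dropped.
(my $stripped = $url) =~ s#(\?|&(amp;)?)($names)=[^&]*#$1#gi;

print "$stripped\n";   # http://www.example.com/some-path?&parameter2=xyz
```

Either way the result needs a cleanup pass for leftover separators, which is where the later posts in the thread end up.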

phranque

10:44 pm on Oct 2, 2019 (gmt 0)

this may be of interest:
(Google Analytics) Measurement Protocol Parameter Reference [developers.google.com]

lucy24

11:48 pm on Oct 2, 2019 (gmt 0)

refer+er
ref_src
ref_url
refcode[0-9]*

=
ref(er+er|_src|_url|code[0-9]*)
Sorry. I just couldn't help myself.

What's the significance of the two \w+? constructions? The ? ("capture the smallest number possible that still enables a match, leaving room for other stuff to follow") would seem to be superfluous, since = is in any case a non-word character.
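A small check of that claim, as a sketch: the query string below is made up, and both forms are followed by a literal "=" as in the original pattern. Since \w can never match "=", the lazy and greedy versions have to stop at the same place.

```perl
use strict;
use warnings;

# Made-up query string for illustration.
my $query = '?utm_source=newsletter&utm_medium=email&id=7';

# Both forms must stop at "=", which \w cannot match.
my @lazy   = $query =~ /(utm_\w+?)=/g;
my @greedy = $query =~ /(utm_\w+)=/g;

print "@lazy\n";     # utm_source utm_medium
print "@greedy\n";   # utm_source utm_medium
```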

Yes, I do realize this is not actually what you asked. But I know the feeling: if someone comes in with, say, a Facebook-generated URL that appends a query string, I will ruthlessly delete the query (which, honestly, FB itself should have done in the first place, since it's in no way needed for analytics).

csdude55

1:03 am on Oct 3, 2019 (gmt 0)

how are you checking to ensure it is actually a tracking parameter?
i once worked on a cms where the cid parameter was the "content id".

Good point, and I guess I really don't have a way. Going through about 11 years of data, though, I don't have any links where cid WASN'T a tracking ID, but that doesn't mean that it's always going to be that way. Especially since WordPress has gotten so much more popular!

What do you suggest as an alternative? I don't really want to do a CURL after each removal to make sure it still gives a 200 status code, and even if it did that wouldn't necessarily mean that it's retrieving the original page. Or do you think that cid is the only dangerous one out of the list?

have you tested this on urls with several parameters in the query string?
if i'm reading this correctly, this:
http://www.example.com/some-path?cid=123&parameter2=xyz
will result in this:
http://www.example.com/some-path&parameter2=xyz

Eek! Nope, I hadn't tested that. Do you have an alternate code in mind, other than fixing it after the fact? Like:

s#(\?|&(amp;)?)(utm_\w+?|...|refcode[0-9]*)=[^&]*#?$1#gi;
s#(\?\?)|(\?&)#?#;

ref(er+er|_src|_url|code[0-9]*)
Sorry. I just couldn't help myself.

Haha, I know! I almost went through it and simplified it for the post, but I decided to keep it more readable :-)

What's the significance of the two \w+? constructions? The ? ("capture the smallest number possible that still enables a match, leaving room for other stuff to follow") would seem to be superfluous, since = is in any case a non-word character.

Just bad coding on my part :-P You're right, they didn't cause a problem so I didn't even notice, but they're totally unnecessary.

tangor

6:13 am on Oct 3, 2019 (gmt 0)

Any interest in keeping a list of tracking ID parameters that can safely be removed?

Are you dealing with tracking ids, or just providing the actual true URL and all the rest is just a bad dream nightmare you should wake up from and say "whew!"

Else just disallow any URLs and don't look back.

Sometimes less is actually more.

csdude55

2:30 pm on Oct 3, 2019 (gmt 0)

Thoughts on modified code?

s#(\?[^&]*)(&(amp;)?)?(utm_\w+|...|refcode[0-9]*)=[^&]*#?$1#gi;

Are you dealing with tracking ids, or just providing the actual true URL and all the rest is just a bad dream nightmare you should wake up from and say "whew!"

@tangor, it started when I spoke to a local advertiser that noted more of his clicks came from Facebook than me. But looking at his data, he was confused because of the fbclid tag. Someone had copied a link from Facebook to my site, and even though he had 100 clicks on MY site it looked like he had 100 from Facebook instead of 1.

So I started trying to remove all unnecessary parameters to prevent that confusion.

I need the links there, I just need them to know that their clicks came from me.

tangor

1:27 am on Oct 4, 2019 (gmt 0)

We're talking about the same thing, just using different language ... and my initial reply was more humor than solution. I strip everything after the shared/submitted active url (though I track the added parameters elsewhere for other information purposes).

csdude55

3:19 pm on Oct 4, 2019 (gmt 0)

Well, just an update here... amp_\w+ shouldn't be on the list. I'm only seeing it in links like this:

[www-foo-bar.cdn.ampproject.org...]

So I don't think that these are tracking IDs, but have something to do with AMP itself.

But in every reference in my database, it looks like the link doesn't work at all, I just get a Google error! I don't know what's up with that, but for now I'm taking out the AMP link, like so:

s#https?://[\w-]+\.cdn\.ampproject\.org/v/s/(.*?)/amp/?.*#$1#i;
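For what it's worth, the substitution above captures only what follows /v/s/, so the result would come out without a scheme. Here is a hedged sketch of an alternative, assuming the usual AMP Cache layout of "https://<host-with-dashes>.cdn.ampproject.org/<c or v>/[s/]<origin host>/<path>", where the "s/" segment marks an HTTPS origin. The helper name unwrap_amp_cache and the www.foo.bar paths are made up for illustration; the actual links in the database are not shown in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild the origin URL from an AMP Cache URL.
# Assumes the layout .../<c|v>/[s/]<origin host>/<path>; anything that
# does not look like an AMP Cache URL is returned unchanged.
sub unwrap_amp_cache {
    my ($url) = @_;
    if ($url =~ m{^https?://[\w-]+\.cdn\.ampproject\.org/[cv]/(s/)?([^?#]+)}i) {
        my $scheme = $1 ? 'https' : 'http';   # "s/" means the origin is HTTPS
        return "$scheme://$2";                # query string and fragment are dropped
    }
    return $url;
}

print unwrap_amp_cache(
    'https://www-foo-bar.cdn.ampproject.org/v/s/www.foo.bar/news/story/amp/?amp_js_v=0.1'
), "\n";   # https://www.foo.bar/news/story/amp/
```

This sketch leaves a trailing /amp/ segment in place, since whether a site's canonical page lives one level up is site-specific.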

I get the feeling I'm on the edge of a much, much deeper rabbit hole than I intended...

tangor

3:38 am on Oct 5, 2019 (gmt 0)

Keep track of the breadcrumbs ... that's how you can get back! :)

The adventures never end!

csdude55

3:48 am on Oct 5, 2019 (gmt 0)

Right! LOL

Well, for future readers, here's the code I've finally ended up with. I use URI::Find to find URLs in the string, which creates the array @_. This array has 2 indexes in it that seem to always both be the link, so in my code I modify one to show and one for the coded link. I can show this if anyone is interested.

But here's how the filter we're talking about here looks:

$pattern = '(\?[^&]*)(&(amp;)?)?(utm_\w+|...|refcode[0-9]*)=[^&]*&?';

foreach (@_) {

 # 10/4/19, AMP links don't work, not sure why, so let's fix them here
 s#https?://[\w-]+\.cdn\.ampproject\.org/v/s/(http.*?)/amp/?.*#$1#i;

 # not sure that I should use /g here, I might be making it a tad slower for no reason
 while (m#$pattern#gi) {
  $_ =~ s#$pattern#$1$2#gi;
 }

 # shouldn't have any repeating &, but just in case
 s#(&(amp;)?&(amp;)?)+#&#;

 # might as well remove any trailing ? or &
 s#(\?|&(amp;)?)+$##;
}
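One thing to watch with the pattern above, as far as I can tell: because every match has to start at the "?" and (\?[^&]*) cannot cross an "&", a tracking parameter in the third position or later is never reached, and a name like "docid" can be cut in half because "cid" is allowed to match in the middle of it. A hedged alternative sketch is to split the query string and compare whole parameter names. The helper name strip_tracking is made up, and the name list is abbreviated and deliberately leaves out cid, given the "content id" concern raised earlier in the thread.

```perl
use strict;
use warnings;

# Abbreviated list for illustration; the anchors force a whole-name match.
my $tracking = qr/^(?:utm_\w+|gclid|fbclid|mkt_tok|ref_src|ref_url)$/i;

# Hypothetical helper: drop tracking parameters, keep the rest in order.
sub strip_tracking {
    my ($url) = @_;
    return $url unless defined $url && length $url;

    # Split off the fragment first, then the query string.
    my ($rest, $fragment) = split /#/,  $url,  2;
    my ($base, $query)    = split /\?/, $rest, 2;
    return $url unless defined $query;

    # Accept "&" or "&amp;" as separators; skip empty pieces and tracking names.
    my @kept = grep {
        my ($name) = split /=/, $_, 2;
        length($_) && !(defined $name && $name =~ $tracking);
    } split /&(?:amp;)?/, $query;

    my $clean = $base;
    $clean .= '?' . join('&', @kept) if @kept;
    $clean .= "#$fragment" if defined $fragment;
    return $clean;
}

print strip_tracking('http://www.example.com/p?a=1&b=2&fbclid=XYZ'), "\n";
# http://www.example.com/p?a=1&b=2
print strip_tracking('http://www.example.com/p?docid=5&utm_source=fb#top'), "\n";
# http://www.example.com/p?docid=5#top
```

Note that this rewrites any "&amp;" separators as plain "&", so it would need adjusting if the stored links must stay HTML-encoded.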

lucy24

3:23 pm on Oct 5, 2019 (gmt 0)

# not sure that I should use /g here, I might be making it a tad slower for no reason
while (m#$pattern#gi)

"while" and "g" does seem redundant. If it's a perfectly coded /g/ there should be no need for the repeating iterations of "while", and contrariwise if "while" is found to be the only way to do it, then the /g/ would seem to be superfluous (though I don't know if in this case it would actually affect processing speed).
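A small experiment suggests it is the second case. The sketch below keeps the same pattern shape with an abbreviated name list, and the test URL is made up. Every match has to begin at the "?", so a single /g pass can only fire once per URL, while repeating a plain substitution until it stops matching picks up the rest. The /e replacement with "// ''" is only there to avoid an undefined-value warning when the optional $2 group did not take part in the match.

```perl
use strict;
use warnings;

# Abbreviated name list for illustration; same pattern shape as the post above.
my $pattern = '(\?[^&]*)(&(amp;)?)?(utm_\w+|cid|gclid|fbclid)=[^&]*&?';

my $url = 'http://www.example.com/page?gclid=1&fbclid=2&id=7';

# One pass with /g: all matches must start at the "?", so only one fires.
(my $once = $url) =~ s#$pattern#$1 . ($2 // '')#gie;

# Repeat a plain substitution (no /g) until nothing is left to remove.
my $looped = $url;
1 while $looped =~ s#$pattern#$1 . ($2 // '')#ie;

print "$once\n";     # http://www.example.com/page?gclid=1&id=7
print "$looped\n";   # http://www.example.com/page?id=7
```

So the loop is doing the real work here, and the "1 while s///" idiom expresses it without a separate match test.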
