Forum Moderators: phranque

Tracking IDs

maybe a list of ids to remove

     
2:16 pm on Oct 2, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Mar 15, 2013
posts: 1193
votes: 119


When someone submits a URL to be saved in my database, I typically remove tracking IDs because:

1. They take up space unnecessarily; and

2. If someone is clicking a link on MY site then having a referrer from another site is misleading, and may hurt me financially.

Any interest in keeping a list of tracking ID parameters that can safely be removed?

Here are the ones I'm filtering (in regex format):

utm_\w+?
cid
ocid
trkid
gclid
fbclid
refer+er
share
mkt_tok
mkwid
pgrid
ptaid
_*source
amp_\w+?
usqp
ref_src
ref_url
mtrref
gwh
gwt
subsource
refcode[0-9]*


In Perl the actual regex looks like:

s#(\?|&(amp;)?)(utm_\w+?|...|refcode[0-9]*)=[^&]*#$1#gi
10:37 pm on Oct 2, 2019 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11842
votes: 242


I typically remove tracking IDs

how are you checking to ensure it is actually a tracking parameter?
i once worked on a cms where the cid parameter was the "content id".

s#(\?|&(amp;)?)(utm_\w+?|...|refcode[0-9]*)=[^&]*#$1#gi

have you tested this on urls with several parameters in the query string?
if i'm reading this correctly, this:
http://www.example.com/some-path?cid=123&parameter2=xyz
will result in this:
http://www.example.com/some-path&parameter2=xyz
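One way to check this reading is to run the substitution as posted. The sketch below spells out the alternation from the list in the first post, because the `...` in the one-liner is an elision and, left in place, would match any three characters. Traced by hand, `$1` holds the leading `?`, so the `?` appears to survive and what is left behind is a dangling `&` rather than a missing `?`. The variable names are only for illustration.

```perl
use strict;
use warnings;

# Parameter names spelled out from the list in the first post.
my $names = join '|', qw(
    utm_\w+? cid ocid trkid gclid fbclid refer+er share mkt_tok mkwid
    pgrid ptaid _*source amp_\w+? usqp ref_src ref_url mtrref gwh gwt
    subsource refcode[0-9]*
);

my $url = 'http://www.example.com/some-path?cid=123&parameter2=xyz';

# Same substitution as in the first post, applied to a copy of the URL.
(my $cleaned = $url) =~ s#(\?|&(amp;)?)($names)=[^&]*#$1#gi;

print "$cleaned\n";   # http://www.example.com/some-path?&parameter2=xyz
```

Most servers will probably tolerate the resulting `?&`, but it is still not the URL that was intended.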
10:44 pm on Oct 2, 2019 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11842
votes: 242


this may be of interest:
(Google Analytics) Measurement Protocol Parameter Reference [developers.google.com]
11:48 pm on Oct 2, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15871
votes: 869


refer+er
ref_src
ref_url
refcode[0-9]*
=
ref(er+er|_src|_url|code[0-9]*)
Sorry. I just couldn’t help myself.

What’s the significance of the two \w+? constructions? The ? (“capture the smallest number possible that still enables a match, leaving room for other stuff to follow”) would seem to be superfluous, since = is in any case a non-word character.
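A small sketch of that point, using a made-up query string: because the name must be followed by a literal `=`, the lazy and greedy forms end up capturing the same text.

```perl
use strict;
use warnings;

# Hypothetical query string, for illustration only.
my $query = '?utm_source=newsletter&id=7';

# The "=" after the name pins down where \w+ has to stop,
# so laziness makes no difference to what is captured.
my ($lazy)   = $query =~ /(utm_\w+?)=/;
my ($greedy) = $query =~ /(utm_\w+)=/;

print "lazy:   $lazy\n";     # utm_source
print "greedy: $greedy\n";   # utm_source
```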

Yes, I do realize this is not actually what you asked. But I know the feeling: if someone comes in with, say, a Facebook-generated URL that appends a query string, I will ruthlessly delete the query (which, honestly, FB itself should have done in the first place, since it’s in no way needed for analytics).
1:03 am on Oct 3, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Mar 15, 2013
posts: 1193
votes: 119


how are you checking to ensure it is actually a tracking parameter?
i once worked on a cms where the cid parameter was the "content id".

Good point, and I guess I really don't have a way. Going through about 11 years of data, though, I don't have any links where cid WASN'T a tracking ID, but that doesn't mean that it's always going to be that way. Especially since WordPress has gotten so much more popular!

What do you suggest as an alternative? I don't really want to do a CURL after each removal to make sure it still gives a 200 status code, and even if it did that wouldn't necessarily mean that it's retrieving the original page. Or do you think that cid is the only dangerous one out of the list?

have you tested this on urls with several parameters in the query string?
if i'm reading this correctly, this:
http://www.example.com/some-path?cid=123&parameter2=xyz
will result in this:
http://www.example.com/some-path&parameter2=xyz

Eek! Nope, I hadn't tested that. Do you have an alternate code in mind, other than fixing it after the fact? Like:

s#(\?|&(amp;)?)(utm_\w+?|...|refcode[0-9]*)=[^&]*#?$1#gi;
s#(?)|(?&)#?#;


ref(er+er|_src|_url|code[0-9]*)
Sorry. I just couldn’t help myself.

Haha, I know! I almost went through it and simplified it for the post, but I decided to keep it more readable :-)

What’s the significance of the two \w+? constructions? The ? (“capture the smallest number possible that still enables a match, leaving room for other stuff to follow”) would seem to be superfluous, since = is in any case a non-word character.

Just bad coding on my part :-P You're right, they didn't cause a problem so I didn't even notice, but they're totally unnecessary.
6:13 am on Oct 3, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10457
votes: 1091


Any interest in keeping a list of tracking ID parameters that can safely be removed?


Are you dealing with tracking ids, or just providing the actual true URL and all the rest is just a bad dream nightmare you should wake up from and say "whew!"

Else just disallow any URLS and don't look back.

Sometimes less is actually more.
2:30 pm on Oct 3, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Mar 15, 2013
posts: 1193
votes: 119


Thoughts on modified code?

s#(\?[^&]*)(&(amp;)?)?(utm_\w+|...|refcode[0-9]*)=[^&]*#?$1#gi;



Are you dealing with tracking ids, or just providing the actual true URL and all the rest is just a bad dream nightmare you should wake up from and say "whew!"

@tangor, it started when I spoke to a local advertiser that noted more of his clicks came from Facebook than me. But looking at his data, he was confused because of the fbclid tag. Someone had copied a link from Facebook to my site, and even though he had 100 clicks on MY site it looked like he had 100 from Facebook instead of 1.

So I started trying to remove all unnecessary parameters to prevent that confusion.

I need the links there, I just need them to know that their clicks came from me.
1:27 am on Oct 4, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10457
votes: 1091


We're talking about the same thing, just using different language ... and my initial reply was more humor than solution. I strip everything after the shared/submitted active URL (though I track the added parameters elsewhere for other information purposes).
3:19 pm on Oct 4, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Mar 15, 2013
posts: 1193
votes: 119


Well, just an update here... amp_\w+ shouldn't be on the list. I'm only seeing it in links like this:

[www-foo-bar.cdn.ampproject.org...]

So I don't think that these are tracking IDs, but have something to do with AMP itself.

But for every reference in my database, it looks like the link doesn't work at all; I just get a Google error! I don't know what's up with that, but for now I'm taking out the AMP link, like so:

s#https?://[\w-]+\.cdn\.ampproject\.org/v/s/(.*?)/amp/?.*#$1#i;


I get the feeling I'm on the edge of a much, much deeper rabbit hole than I intended...
3:38 am on Oct 5, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10457
votes: 1091


Keep track of the breadcrumbs ... that's how you can get back! :)

The adventures never end!
3:48 am on Oct 5, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Mar 15, 2013
posts: 1193
votes: 119


Right! LOL

Well, for future readers, here's the code I've finally ended up with. I use URI::Find to find URLs in the string, which creates the array @_. This array has 2 indexes in it that seem to always both be the link, so in my code I modify one to show and one for the coded link. I can show this if anyone is interested.

But here's how the filter we're talking about here looks:

$pattern = '(\?[^&]*)(&(amp;)?)?(utm_\w+|...|refcode[0-9]*)=[^&]*&?';

foreach (@_) {

    # 10/4/19, AMP links don't work, not sure why, so let's fix them here
    s#https?://[\w-]+\.cdn\.ampproject\.org/v/s/(http.*?)/amp/?.*#$1#i;

    # not sure that I should use /g here, I might be making it a tad slower for no reason
    while (m#$pattern#gi) {
        $_ =~ s#$pattern#$1$2#gi;
    }

    # shouldn't have any repeating &, but just in case
    s#(&(amp;)?&(amp;)?)+#&#;

    # might as well remove any trailing ? or &
    s#(\?|&(amp;)?)+$##;
}
3:23 pm on Oct 5, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15871
votes: 869


# not sure that I should use /g here, I might be making it a tad slower for no reason
while (m#$pattern#gi)
Combining “while” and “g” does seem redundant. If the /g substitution is coded perfectly, there should be no need for the repeated iterations of “while”; conversely, if “while” turns out to be the only way to do it, then the /g would seem to be superfluous (though I don’t know if in this case it would actually affect processing speed).
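For future readers, one way to sidestep the while-versus-/g question is to stop matching against the whole URL and instead split the query string into its parameters, test each name once, and rejoin what is left. The sketch below is only an illustration under stated assumptions: `strip_tracking` is a hypothetical helper name, the names are spelled out from the list in the first post (minus `amp_\w+`, which was dropped later in the thread), and `&amp;` separators are normalized to `&` on output. It does nothing about the `cid` concern raised earlier; whether a name is safe to remove is still a judgment call.

```perl
use strict;
use warnings;

# Names from the first post, minus amp_\w+ (dropped later in the thread).
my $names = join '|', qw(
    utm_\w+ cid ocid trkid gclid fbclid refer+er share mkt_tok mkwid
    pgrid ptaid _*source usqp ref_src ref_url mtrref gwh gwt
    subsource refcode[0-9]*
);

# Anchored at the start of a parameter, so "mycid=" is not treated as "cid=".
my $tracking = qr/\A(?:$names)=/i;

# strip_tracking() is a hypothetical helper name, not from the thread.
sub strip_tracking {
    my ($url) = @_;
    my ($base, $query) = split /\?/, $url, 2;
    return $url unless defined $query;

    # Set any #fragment aside so it is not mistaken for part of the last value.
    my $fragment = $query =~ s/(#.*)\z//s ? $1 : '';

    # Split on "&" or "&amp;", then drop empty pieces and tracking parameters.
    my @keep = grep { length($_) && !/$tracking/ } split /&(?:amp;)?/, $query;

    return $base . (@keep ? '?' . join('&', @keep) : '') . $fragment;
}

print strip_tracking('http://www.example.com/some-path?cid=123&parameter2=xyz'), "\n";
# http://www.example.com/some-path?parameter2=xyz
print strip_tracking('http://www.example.com/p?a=1&b=2&fbclid=abc#top'), "\n";
# http://www.example.com/p?a=1&b=2#top
```

Because each parameter is examined exactly once, a tracking ID is removed wherever it sits in the query string, and no cleanup pass for doubled `&` or a trailing `?` is needed.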