
Forum Moderators: phranque

Dealing with web addresses that use delimiter other than ? and &

     
2:48 am on Oct 29, 2019 (gmt 0)

Senior Member


joined:Mar 15, 2013
posts: 1205
votes: 120


I've been removing tracking IDs from website addresses that users post to my message boards and classifieds, which has admittedly gotten WAY more complicated than I intended. But I've recently run across a new one, so I'm curious how you guys and gals would suggest dealing with it.

In this example, the link looked like:

https://example.com/foo/bar|pcrid|391022977133|pkw||pmt||pdv|m|slid||product||pgrid|78378217177|ptaid||&pgrid=78378217177&ptaid=&source=WFP2019-DD-NATL-GD-US-BCON&subsource=78378217177---391022977133&refcode=WFP2019-DD-NATL-GD-US-BCON&refcode2=78378217177---391022977133&utm_source=Google&utm_campaign=WFP2019-DD-NATL-GD-US-BCON&utm_term=-391022977133&utm_medium=Display&gclid=EAIaIQobChMIpMLsjry95QIVQqFRCh3slA_eEAEYASAAEgLLc_D_BwE


I use Perl's URI::Find to find links in the text and convert each one to an <a href=...>...</a> tag, but it doesn't recognize the | delimiter so I end up with:

<a href="https://example.com/foo/bar">https://example.com/foo/bar</a>|pcrid|391022977133|pkw||pmt||pdv|m|slid||product||pgrid|78378217177|ptaid||&pgrid=78378217177&ptaid=&source=WFP2019-DD-NATL-GD-US-BCON&subsource=78378217177---391022977133&refcode=WFP2019-DD-NATL-GD-US-BCON&refcode2=78378217177---391022977133&utm_source=Google&utm_campaign=WFP2019-DD-NATL-GD-US-BCON&utm_term=-391022977133&utm_medium=Display&gclid=EAIaIQobChMIpMLsjry95QIVQqFRCh3slA_eEAEYASAAEgLLc_D_BwE


And since the rest of that isn't recognized as part of the link, my system doesn't remove any of those parameters, including the parts that are actually delimited by & (and it would usually remove all of them).
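For readers following along, here is a minimal sketch of the kind of parameter stripping described above, in core Perl only. The helper name strip_tracking and the blocklist (utm_* and gclid) are assumptions made for illustration; the thread does not show the poster's actual code or list. The sketch also ignores #fragments.

```perl
use strict;
use warnings;

# Assumed blocklist: a pair is dropped when its name is utm_<something> or gclid.
my $TRACKING = qr/^(?:utm_\w+|gclid)(?:=|\z)/i;

# Hypothetical helper: remove tracking pairs from a URL's query string.
sub strip_tracking {
    my ($url) = @_;

    # Split once on the first "?"; without one there is no query to clean.
    my ($base, $query) = split /\?/, $url, 2;
    return $url unless defined $query && length $query;

    # Keep every name=value pair that is not on the blocklist.
    my @kept = grep { $_ !~ $TRACKING } split /&/, $query;

    return @kept ? $base . '?' . join('&', @kept) : $base;
}

print strip_tracking('https://example.com/foo/bar?id=7&utm_source=Google&gclid=abc'), "\n";
# prints https://example.com/foo/bar?id=7
```

A URL with no ? at all, like the pipe-delimited one in this thread, passes through this sketch unchanged, because it never finds a query string to work on.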

I'm kind of at a loss on how to handle this one. I could use a regex to find http, followed by anything that's not a space, until it gets to a |, and then remove everything after and including that |. That's a bit dangerous since someone could realistically use a | in the parameter value that I wouldn't want to remove, though.

I guess that the regex would look something like:

$text =~ s#\b(https?://[^\s])\|[^\s]*\b#$1#i;


What do you all think?
4:15 am on Oct 29, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15934
votes: 887


I say: Crikey.
[^\s]
==
\S
unless there's something I am overlooking.

Did you mean [^\s]+ (i.e. \S+) ? That was my own question mark, but in fact you'd need to say
\S+?\|
in order to stop as soon as possible, i.e. before the first | character if there's more than one of them.
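To make the greedy-versus-lazy point concrete, here is a small sketch. The sample text is a shortened, hypothetical version of the URL in the first post, and the two substitutions differ only in \S+ versus \S+?.

```perl
use strict;
use warnings;

# Hypothetical sample text with a shortened version of the URL from the thread.
my $text = 'See https://example.com/foo/bar|pcrid|391022977133|pkw||&utm_source=Google for details';

# Lazy \S+? stops at the first "|"; the trailing \S* swallows the rest of the token.
(my $lazy = $text) =~ s{\b(https?://\S+?)\|\S*}{$1}gi;

# Greedy \S+ only backtracks as far as the LAST "|", so most of the junk stays in $1.
(my $greedy = $text) =~ s{\b(https?://\S+)\|\S*}{$1}gi;

print "$lazy\n";    # See https://example.com/foo/bar for details
print "$greedy\n";  # See https://example.com/foo/bar|pcrid|391022977133|pkw| for details
```

The original [^\s] with no quantifier matches exactly one character, so with that pattern the substitution would not match this text at all.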

Seems like you'd want something like
[\w/.,~-]+
for the path part, to constrain it to things that can reasonably occur. If you could be certain that the | is the only no-no that will ever show up, the pattern would be more like
[^\s?|]+([?|]blahblah)?
where ? and | don't need to be escaped inside a character class, i.e. within square brackets (but it does no harm if you do escape them).
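A rough sketch of how the constrained path class could be put to work. The helper name strip_pipe_tail is hypothetical, and the sketch assumes that everything from the first | up to the next whitespace is junk to be thrown away. It needs Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Hypothetical helper, assuming everything from the first "|" to the next
# whitespace is unwanted.
sub strip_pipe_tail {
    my ($text) = @_;

    # $1: scheme, host and path, limited to characters that can reasonably occur
    # $2: an optional ordinary query string introduced by "?"
    # Inside [^\s|] the "|" is a literal pipe, not alternation.
    $text =~ s{\b(https?://[\w/.,~-]+)(\?[^\s|]*)?\|\S*}{$1 . ($2 // '')}gie;

    return $text;
}

print strip_pipe_tail('https://example.com/foo/bar|pcrid|123|&gclid=abc'), "\n";
# prints https://example.com/foo/bar
```

URLs that contain no | are left alone, so the normal ?-and-& cleanup can still run on them afterwards. Whether a | inside a real query value should survive is a policy choice; this sketch cuts it off along with everything after it.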

I suppose you have already considered the possibility of telling your code to disregard (i.e. not make links from) URLs that don't follow the rules :( Your forum members really are an unruly bunch, aren't they?

Disclaimer: I am about to disappear for at least 24 hours, possibly longer (thank you very much, PG&E), so if I said something hopelessly misleading you will have to remain misled.
6:03 am on Oct 29, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10563
votes: 1123


Yikes! 24 hours without lucy24!

(get those folks out there to solve the power problems!)
5:30 pm on Oct 29, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15934
votes: 887


No such luck. We've been reprieved for 12 hours. Turns out the utility company has a service region called “Humboldt”, so named because it includes parts of Lake and Mendocino counties--but not Humboldt--so you never know whether they're talking about us or somewhere 200 miles to the south. And neither, apparently, do they themselves know.

The problem is not power supply. The problem is intentional shutoffs during times of extreme fire risk, necessitated by years of neglect so that the only way they can avoid a repeat of, say, last year's Paradise disaster is to shut off electricity to large parts of the state. Including--because of the way the grid is laid out--areas whose direct risk is approximately zero. The casino up the road is probably making a killing, as the rancheria* has its own micro-grid.

We now return to our Regularly scheduled Expressions.


* Like a reservation, only it belongs to more than one tribe. I don't know if they exist outside of California.