I'm going down the rabbit hole a little.
Many years ago I built a simple function to convert
http://www.example.com to
<a href='http://www.example.com' target='_new'>http://www.example.com</a>. I used the URI::Find module to isolate links, then a regex to convert each one.
Over time, things got more complicated: browsers started sending odd input, data copied from other sites contained unexpected markup, and so on. My simple function has become pretty complex, so now I'm rebuilding it to be friendlier towards future "additions".
Here's where I've gotten:
# emulate user-submitted data
$_ = qq~
<a href='https://www.example.com?utm_foo=jhkdf989874jhkkdjhk12&utm_bar=yuiytuwer786' title="something">Example</a>
<a href="https://www.lorem.com" target="_blank">https://www.ipsum.com</a>
http://www.foo.com
https://bar.com
www.new.com
~;
# Convert www to http://www, assuming that the destination can apply https if applicable
s#\b(?<!://)www\.([a-z])#http://www\.$1#gi;
# Remove optional attributes from an existing A HREF, then add the TARGET back in
# (\1 is not a backreference inside a character class, so match non-greedily up to the closing quote)
s#<a[^>]* href=(["'])(.*?)\1[^>]*>#<a href='$2' target='_new'>$2</a>#gi;
# Add A HREF if it's not already there
while (m#\b(https?://[\w.~!*();:@&=+$,/?\\\#%[\\\]'"-]+)\b#gi) {
    # $org is the URL I'm working with in this cycle
    $org = $1;
    # $modified is equal to $org, minus any utm_, ocid, trkid, gclid, fbid, data-, role, cite, or itxt
    # params. This also removes a trailing ? or &
    $modified = $org
        =~ s#\b(?:utm_\w+?|ocid|trkid|gclid|fbid|data-[\w-]+|role|cite|itxt[\w-]*)=[^&]+(?:&(?:amp;)?)?##gir
        =~ s#[?&]$##r;
    # if the displayed URL is too long, shorten it and put a ... in the middle; eg,
    # <a href='https://www.blahblahblahblahblahblah.com' target='_new'>https://www.blahb...ahblah.com</a>
    $repl = length($modified) > 30 ?
        substr($modified, 0, 17) . '...' . substr($modified, -10) :
        $modified;
    # do the final replacement
    s#\Q$org\E#<a href='$modified' target='_new'>$repl</a>#gi;
}
print;
I'm hitting 2 snags:
1. This becomes an infinite loop: the last replacement inserts text that still matches the while() condition, and because the replacement modifies $_, the next match starts from the beginning again.
2. In retrospect, when the text is something like
<a href='https://www.example.com?utm_foo=jhkdf989874jhkkdjhk12&utm_bar=yuiytuwer786' title="something">Example</a>, I'd rather just remove the optional attributes and keep the content; eg, I want the final result to be
<a href='https://www.example.com' target='_new'>Example</a>.
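Snag 1 can be reproduced in isolation with a stripped-down sketch. The sample text, the simplified URL pattern, and the three-pass safety stop are assumptions added for the demonstration and are not part of the real function. It relies on one documented Perl behavior: modifying a string resets its pos(), so the next m//g starts at offset 0.

```perl
use strict;
use warnings;

# Hypothetical sample input with a single bare URL
my $text  = "go to http://a.example now";
my $count = 0;

while ($text =~ m{\b(https?://[\w.]+)}g) {
    my $url = $1;
    printf "pass %d: matched %s\n", ++$count, $url;

    # Safety stop so the demo terminates; without it the loop never ends
    last if $count >= 3;

    # Modifying $text resets pos($text), so the next m//g restarts at offset 0
    # and finds the same URL again, now inside the href that was just inserted
    $text =~ s{\Q$url\E}{<a href='${url}'>${url}</a>};
}
```

Every pass reports the same URL, which suggests the fix has to avoid modifying the string that the while() condition is scanning.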
So I guess that I have 2 questions:
1. Can you suggest a way to make the while() loop move on to the next result after the replacement, instead of starting back at the beginning?
2. Can you suggest a modification to the while() condition regex so that it can tell whether the URL is already inside a tag?
The solution I'm working on would put all of the links in an array and then put them back in the right place in a second loop. That's getting to be a lot more complex than intended, though, so I thought I'd ask for a second set of eyes before going much further.
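For reference, the array idea could be sketched roughly as follows. This is only one possible shape, under several assumptions: the names linkify, clean_url, and shorten are hypothetical, the tracking-parameter list is abbreviated, the bare-URL pattern is simplified, and a NUL-delimited index is assumed to be a safe placeholder because NUL should not occur in submitted text. It needs Perl 5.14 or later for the /r flag.

```perl
use strict;
use warnings;

# Hypothetical helper: drop tracking parameters, then a trailing ? or &
sub clean_url {
    my ($url) = @_;
    my $clean = $url =~ s{\b(?:utm_\w+|ocid|trkid|gclid|fbid)=[^&]*(?:&(?:amp;)?)?}{}gir;
    $clean =~ s{[?&]\z}{};
    return $clean;
}

# Hypothetical helper: shorten long display text with ... in the middle
sub shorten {
    my ($url) = @_;
    return length($url) > 30
        ? substr($url, 0, 17) . '...' . substr($url, -10)
        : $url;
}

sub linkify {
    my ($text) = @_;
    my @links;

    # Store finished HTML in @links and return a placeholder for it
    my $stash = sub {
        my ($html) = @_;
        push @links, $html;
        return "\x00" . $#links . "\x00";
    };

    # Pass 1a: existing anchors. Keep the link text, rebuild the attributes.
    $text =~ s{<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>(.*?)</a>}{
        my ($href, $label) = ($2, $3);
        $stash->(sprintf("<a href='%s' target='_new'>%s</a>", clean_url($href), $label));
    }gise;

    # Bare www. hosts get a scheme; anchors are already hidden behind placeholders
    $text =~ s{\b(?<!://)www\.([a-z])}{http://www.$1}gi;

    # Pass 1b: bare URLs (simplified pattern; may include trailing punctuation)
    $text =~ s{\b(https?://[^\s<>"'\x00]+)}{
        my $url = clean_url($1);
        $stash->(sprintf("<a href='%s' target='_new'>%s</a>", $url, shorten($url)));
    }gie;

    # Pass 2: put the stored links back in place of the placeholders
    $text =~ s/\x00(\d+)\x00/$links[$1]/g;
    return $text;
}

print linkify("plain www.new.com and https://bar.com\n");
```

Because each s///ge runs in a single left-to-right pass and the inserted placeholders contain nothing that looks like a URL, no loop has to rescan its own output, and URLs inside existing tags are hidden before the bare-URL pass runs.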