Converting URL to link in user submitted string

I'm going down the rabbit hole a little.

Many years ago I built a simple function to convert http://www.example.com to <a href='http://www.example.com' target='_new'>http://www.example.com</a>. I used the URI::Find module to isolate links, then a regex to convert them.

Over time, things got more complicated: browsers started sending weird things, data copied from other sites contained unexpected code, and so on. My simple function has gotten pretty complex, so now I'm trying to rebuild it and make it a little friendlier towards future "additions".

Here's where I've gotten:

# emulate user-submitted data
$_ = qq~
<a href='https://www.example.com?utm_foo=jhkdf989874jhkkdjhk12&utm_bar=yuiytuwer786' title="something">Example</a>
<a href="https://www.lorem.com" target="_blank">https://www.ipsum.com</a>
http://www.foo.com
https://bar.com
www.new.com
~;

# Convert www to http://www, assuming that the destination can apply https if applicable
s#\b(?<!://)www\.([a-z])#http://www\.$1#gi;

# Remove optional attributes from an existing A HREF, then add the TARGET back in
s#<a[^>]* href=(["'])([^\1]*?)\1[^>]*>#<a href='$2' target='_new'>$2</a>#gi;

# Add A HREF if it's not already there
while (m#\b(https?://[\w.~!*();:@&=+$,/?\\\#%[\\\]'"-]+)\b#gi) {
 # $org is the URL I'm working with in this cycle
 $org = $1;

 # $modified is equal to $org, minus any utm_, ocid, trkid, gclid, fbid, data-, role, cite, or itxt
 # params. This also removes a trailing ? or &
 $modified = $org
  =~ s#\b(?:utm_\w+?|ocid|trkid|gclid|fbid|data-[\w-]+|role|cite|itxt[\w-]*)=[^&]+(?:&(?:amp;)?)?##gir
  =~ s#[?&]$##r;

 # if the displayed URL is too long, shorten it and put a ... in the middle; eg,
 # <a href='https://www.blahblahblahblahblahblah.com' target='_new'>https://www.blahb...ahblah.com</a>
 $repl = length($modified) > 30 ?
  substr($modified, 0, 17) . '...' . substr($modified, -10) :
  $modified;

 # do the final replacement
 s#\Q$org\E#<a href='$modified' target='_new'>$repl</a>#gi;
}

print;

I'm hitting 2 snags:

1. This becomes an infinite loop because that last replacement still plugs in something that matches the while() condition.

2. In retrospect, when the text is something like <a href='https://www.example.com?utm_foo=jhkdf989874jhkkdjhk12&utm_bar=yuiytuwer786' title="something">Example</a>, I'd rather just remove the optional attributes and keep the content; eg, I want the final result to be <a href='https://www.example.com' target='_new'>Example</a>.
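
For what it's worth, the first snag comes from how Perl tracks a scalar-context m//g match: the search position is stored with the string, and modifying the string resets it, so the next while() test starts over from the beginning and finds the URL inside the link that was just inserted. Below is a minimal sketch of one way around that, assuming it is acceptable to splice each replacement in by offset and then move pos() past it by hand. The sample text, the simplified URL pattern, and the variable names are illustrative only, not the original function.

```perl
use strict;
use warnings;

# illustrative sample text, not the original data
my $text = "see http://www.foo.com and https://bar.com today";

# simplified URL pattern, for the sketch only
while ($text =~ m#\b(https?://[\w./-]+)#g) {
    # offsets of the captured URL within $text
    my ($start, $end) = ($-[1], $+[1]);
    my $link = "<a href='$1' target='_new'>$1</a>";

    # splice the link in by offset instead of running a global s///
    substr($text, $start, $end - $start, $link);

    # the modification reset the search position, so resume just
    # past the link that was inserted
    pos($text) = $start + length $link;
}

print "$text\n";
```

Because pos() is set to the end of the inserted link, the URL inside the new tag is never matched again, and each URL is visited exactly once.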

So I guess that I have 2 questions:

1. Can you suggest a way to make the while() loop move on to the next result after the replacement, instead of starting back at the beginning?

2. Can you suggest a modification to the while() condition regex to make it know whether the URL is inside of a tag?

The solution I'm working on would put all of the links in an array and then put them back in the right place in a second loop. That's getting to be a lot more complex than intended, though, so I thought I'd ask for a second set of eyes before going much further.
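
An alternative to the array-and-placeholder idea might be a single s///ge pass with alternation: existing links are matched first, any other tag is matched next and passed through unchanged, and only what is left can match as a bare URL. Since s///g never rescans text it has already replaced, there is nothing to loop over and no placeholders to restore. This is only a sketch under assumptions: strip_tracking, shorten, render_match and linkify are hypothetical names, the URL pattern and the parameter list are simplified, and the www-to-http step is left out.

```perl
use strict;
use warnings;

# hypothetical helper: drop a few tracking params, then any trailing ? or &
sub strip_tracking {
    my ($url) = @_;
    $url =~ s#\b(?:utm_\w+|ocid|trkid|gclid|fbid)=[^&]*(?:&(?:amp;)?)?##gi;
    $url =~ s#[?&]$##;
    return $url;
}

# hypothetical helper: put ... in the middle of long display text
sub shorten {
    my ($s) = @_;
    return length($s) > 30
        ? substr($s, 0, 17) . '...' . substr($s, -10)
        : $s;
}

# builds the replacement for whichever alternative matched
sub render_match {
    my ($href, $body, $tag, $bare) = @_;
    if (defined $href) {
        my $clean = strip_tracking($href);
        return "<a href='$clean' target='_new'>$body</a>";
    }
    return $tag if defined $tag;    # other tags are left untouched
    my $clean = strip_tracking($bare);
    return "<a href='$clean' target='_new'>" . shorten($clean) . "</a>";
}

sub linkify {
    my ($text) = @_;
    my $url = qr#https?://[^\s<>"']+#i;    # simplified assumption

    $text =~ s{
        <a\b[^>]*?\bhref=(["'])($url)\1[^>]*>(.*?)</a>  # existing link: groups 2 and 3
      | (<[^>]+>)                                       # any other tag: group 4
      | \b($url)                                        # bare URL: group 5
    }{ render_match($2, $3, $4, $5) }gexis;

    return $text;
}

print linkify(qq~<a href='https://www.example.com?utm_foo=abc&utm_bar=def' title="something">Example</a>\nhttp://www.foo.com\n~);
```

The order of the alternatives is what keeps a URL inside a tag from being wrapped a second time: at any position that starts a tag, the tag alternatives consume the whole tag before the bare-URL branch gets a chance.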

# Convert www to http://www, assuming that the destination can apply https if applicable
s#\b(?<!://)www\.([a-z])#http://www\.$1#gi;

my @arr;
$x = 0;

# pattern that should match any URL string; I thought about using https?://\S+, but I dunno
$urlPattern = qr#https?://[\w.~!*();:@&=+$,/?\\\#%[\\\]'"-]+#;

# find existing A HREF tags, add them to @arr, then replace them with a placeholder
while (m#(<a href=("|')($urlPattern)\2[^>]*>(.*?)</a>)#gsi) {
 ($url, $show) = removeParams($3, $4);

 # modify existing tags to approved format, stripping unwanted attributes
 $arr[$x] = "<a href='$url' target='_new'>$show</a>";

 # I remove umlauts from the text earlier, so I can safely use � as a delimiter
 # without worrying about it being in the text
 s#\Q$1\E#�$x�#g;

 $x++;
}

# Modify URLs without tags
while (m#\b($urlPattern)\b#gi) {
 ($modified) = removeParams($1);

 # if displayed URL is longer than 40 characters, use ... to make it shorter
 # eg, https://www.blahb...ahblah.com
 $repl = "<a href='$modified' target='_new'>" .
  (length($modified) > 40 ?
   substr($modified, 0, 27) . '...' . substr($modified, -10) :
   $modified) .
  "</a>";

 $arr[$x] = $repl;

 s#\Q$1\E#�$x�#g;

 $x++;
}

# convert placeholders back to the expected text
for ($i = 0; $i <= $#arr; $i++) {
 s#�$i�#$arr[$i]#gi;
}

# this function removes unwanted params
sub removeParams {
 return
  # I know that I could make it shorter using (oc|trk|gc|fb)id, but I left it
  # like this and used /x for the post for readability
  map
   s#\b
    (?:
     utm_\w+? |
     ocid |
     trkid |
     gclid |
     fbid |
     data-[\w-]+ |
     role |
     cite |
     itxt[\w-]*
    )=[^&]+(?:&(?:amp;)?)?##gxir

   # remove trailing ? or &
   =~ s#[?&]$##r
  , @_;
}

# pattern that should match any URL string; might could just use https?://\S+
$urlPattern = qr#https?://[\w.~!*();:@&=+$,/?\\\#%[\\\]'"-]+#;

# Convert www to http://www, assuming that the destination can apply https if applicable
s#\b(?<!://)www\.([a-z])#http://www\.$1#gi;

# How many HTTP is in the text? This is a safety net
my $totalLinks = () = m#$urlPattern#g;

my @arr;
$x = 0;

# find existing IMG tags, add them to @arr, then replace them with a placeholder
while ($x <= $totalLinks && m#(<img[^>]+>)#gi) {
 $arr[$x] = $1;

 s#\Q$1\E#�$x�#g;

 $x++;
}

# find existing A HREF tags, add them to @arr, then replace them with a placeholder
while ($x <= $totalLinks && m#(<a href=("|')($urlPattern)\2[^>]*>(.*?)</a>)#gsi) {
 ($url, $show) = removeParams($3, $4);

 # modify existing tags to approved format, stripping unwanted attributes
 $arr[$x] = "<a href='$url' target='_new'>$show</a>";

 # I remove umlauts from the text earlier, so I can safely use � as a delimiter
 # without worrying about it being in the text
 s#\Q$1\E#�$x�#g;

 $x++;
}

# Modify URLs without tags
while ($x <= $totalLinks && m#\b($urlPattern)\b#gi) {
 ($modified) = removeParams($1);

 # if displayed URL is longer than 40 characters, use ... to make it shorter
 # eg, https://www.blahb...ahblah.com
 $repl = "<a href='$modified' target='_new'>" .
  (length($modified) > 40 ?
   substr($modified, 0, 27) . '...' . substr($modified, -10) :
   $modified) .
  "</a>";

 $arr[$x] = $repl;

 s#\Q$1\E#�$x�#g;

 $x++;
}

# convert placeholders back to the expected text
for ($i = 0; $i <= $#arr; $i++) {
 s#�$i�#$arr[$i]#gi;
}

# this function removes unwanted params
sub removeParams {
 return
  # could make it shorter using (oc|trk|gc|fb)id, but I left it
  # like this and used /x for readability
  map
   s#\b
    (?:
     utm_\w+? |
     ocid |
     trkid |
     gclid |
     fbid |
     data-[\w-]+ |
     role |
     cite |
     itxt[\w-]*
    )=[^&]+(?:&(?:amp;)?)?##gxir

   # remove trailing ? or &
   =~ s#[?&]$##r
  , @_;
}
