Converting URL to link in user submitted string

         

csdude55

1:00 am on Dec 4, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm going down the rabbit hole a little.

Many years ago I built a simple function to convert http://www.example.com to <a href='http://www.example.com' target='_new'>http://www.example.com</a>. I used the URI::Find module to isolate links, then regex to convert them.

Over time, things got more complicated. Browsers started sending weird things, copied data from other sites would have unexpected code, etc. My simple function has gotten pretty complex, so now I'm trying to rebuild it and make it a little friendlier towards future "additions".

Here's where I've gotten:

# emulate user-submitted data
$_ = qq~
<a href='https://www.example.com?utm_foo=jhkdf989874jhkkdjhk12&utm_bar=yuiytuwer786' title="something">Example</a>
<a href="https://www.lorem.com" target="_blank">https://www.ipsum.com</a>
http://www.foo.com
https://bar.com
www.new.com
~;

# Convert www to http://www, assuming that the destination can apply https if applicable
s#\b(?<!://)www\.([a-z])#http://www\.$1#gi;

# Remove optional attributes from an existing A HREF, then add the TARGET back in
s#<a[^>]* href=(["'])([^\1]*?)\1[^>]*>#<a href='$2' target='_new'>$2</a>#gi;

# Add A HREF if it's not already there
while (m#\b(https?://[\w.~!*();:@&=+$,/?\\\#%[\\\]'"-]+)\b#gi) {
    # $org is the URL I'm working with in this cycle
    $org = $1;

    # $modified is equal to $org, minus any utm_, ocid, trkid, gclid, fbid, data-, role, cite, or itxt
    # params. This also removes a trailing ? or &
    $modified = $org
        =~ s#\b(?:utm_\w+?|ocid|trkid|gclid|fbid|data-[\w-]+|role|cite|itxt[\w-]*)=[^&]+(?:&(?:amp;)?)?##gir
        =~ s#[?&]$##r;

    # if the displayed URL is too long, shorten it and put a ... in the middle; eg,
    # <a href='https://www.blahblahblahblahblahblah.com' target='_new'>https://www.blahb...ahblah.com</a>
    $repl = length($modified) > 30 ?
        substr($modified, 0, 17) . '...' . substr($modified, -10) :
        $modified;

    # do the final replacement
    s#\Q$org\E#<a href='$modified' target='_new'>$repl</a>#gi;
}

print;


I'm hitting 2 snags:

1. This becomes an infinite loop because that last replacement still plugs in something that matches the while() condition.

2. In retrospect, when the text is something like <a href='https://www.example.com?utm_foo=jhkdf989874jhkkdjhk12&utm_bar=yuiytuwer786' title="something">Example</a>, I'd rather just remove the optional attributes and keep the content; eg, I want the final result to be <a href='https://www.example.com' target='_new'>Example</a>.

So I guess that I have 2 questions:

1. Can you suggest a way to make the while() loop move on to the next result after the replacement, instead of starting back at the beginning?

2. Can you suggest a modification to the while() condition regex to make it know whether the URL is inside of a tag?

The solution I'm working on would put all of the links in an array and then put them back in the right place in a second loop. That's getting to be a lot more complex than intended, though, so I thought I'd ask for a second set of eyes before going much further.
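
On both questions, one commonly used approach is to do the whole job in a single s///ge pass. The substitution operator does not rescan text it has already replaced, so a freshly inserted <a> tag cannot match again, and putting the existing-link pattern first in an alternation means a URL inside an A HREF is consumed as part of the tag before the bare-URL branch sees it. The following is only a sketch under assumptions: the URL pattern is a simplified stand-in for the one in the post, the parameter list is shortened, and clean_url, render_link, and linkify are hypothetical names. It does not handle trailing punctuation after a bare URL or URLs inside other tags.

```perl
use strict;
use warnings;

# Simplified URL pattern (an assumption, not the pattern from the post):
# the scheme, then everything up to whitespace, a quote, or an angle bracket.
my $url_re = qr{https?://[^\s<>"']+}i;

# Hypothetical helper: drop a few tracking params, then a trailing ? or &.
sub clean_url {
    my ($url) = @_;
    return $url
        =~ s{\b(?:utm_\w+|ocid|trkid|gclid|fbid)=[^&]*(?:&(?:amp;)?)?}{}gir
        =~ s{[?&]$}{}r;
}

# Hypothetical helper: build the replacement for one match.
sub render_link {
    my ($href, $label, $bare) = @_;      # copy captures before any new match
    if (defined $bare) {                 # bare URL: the URL is its own label
        my $url  = clean_url($bare);
        my $show = length($url) > 40
            ? substr($url, 0, 27) . '...' . substr($url, -10)
            : $url;
        return "<a href='$url' target='_new'>$show</a>";
    }
    # existing link: keep its content, rebuild the tag
    return "<a href='" . clean_url($href) . "' target='_new'>$label</a>";
}

sub linkify {
    my ($text) = @_;
    $text =~ s{
        <a\b[^>]*?\bhref=(["'])($url_re)\1[^>]*>(.*?)</a>   # existing link
        |
        ($url_re)                                            # bare URL
    }{ render_link($2, $3, $4) }gsixe;
    return $text;
}

print linkify(qq{<a href="https://www.example.com?utm_foo=1" title="x">Example</a> and https://bar.com\n});
```

Because each match is replaced exactly once, no placeholder array or loop counter is needed in this version.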

csdude55

8:30 pm on Dec 4, 2022 (gmt 0)




Well, I went on down the rabbit hole and made this:

# Convert www to http://www, assuming that the destination can apply https if applicable
s#\b(?<!://)www\.([a-z])#http://www\.$1#gi;

my @arr;
$x = 0;

# pattern that should match any URL string; I thought about using https?://\S+, but I dunno
$urlPattern = qr#https?://[\w.~!*();:@&=+$,/?\\\#%[\\\]'"-]+#;

# find existing A HREF tags, add them to @arr, then replace them with a placeholder
while (m#(<a href=("|')($urlPattern)\2[^>]*>(.*?)</a>)#gsi) {
    ($url, $show) = removeParams($3, $4);

    # modify existing tags to approved format, stripping unwanted attributes
    $arr[$x] = "<a href='$url' target='_new'>$show</a>";

    # I remove umlauts from the text earlier, so I can safely use ï as a delimiter
    # without worrying about it being in the text
    s#\Q$1\E#ï$xï#g;

    $x++;
}

# Modify URLs without tags
while (m#\b($urlPattern)\b#gi) {
    ($modified) = removeParams($1);

    # if displayed URL is longer than 40 characters, use ... to make it shorter
    # eg, https://www.blahb...ahblah.com
    $repl = "<a href='$modified' target='_new'>" .
        (length($modified) > 40 ?
            substr($modified, 0, 27) . '...' . substr($modified, -10) :
            $modified) .
        "</a>";

    $arr[$x] = $repl;
    s#\Q$1\E#ï$xï#g;

    $x++;
}

# convert placeholders back to the expected text
for ($i = 0; $i <= $#arr; $i++) {
    s#ï$iï#$arr[$i]#gi;
}

# this function removes unwanted params
sub removeParams {
    return
        # I know that I could make it shorter using (oc|trk|gc|fb)id, but I left it
        # like this and used /x for the post for readability
        map s#\b
            (?:
                utm_\w+? |
                ocid |
                trkid |
                gclid |
                fbid |
                data-[\w-]+ |
                role |
                cite |
                itxt[\w-]*
            )=[^&]+(?:&(?:amp;)?)?##gxir

            # remove trailing ? or &
            =~ s#[?&]$##r
        , @_;
}

I'd appreciate any feedback, especially if you see weaknesses in the regex that might catch the wrong thing (or miss the right thing).

I'm not sure about processing time, but my original had bugs that I had to repair with later regexes AND the new one is at least 33% smaller (not including the size of the URI::Find module). So in theory, at least, it should be "better".
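
One way to look for weaknesses of the kind asked about is to run a few edge-case URLs through the parameter-stripping expression and read the results. The sketch below does that with a stand-alone copy of the expression; strip_params is a hypothetical name and the sample URLs are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-alone copy of the expression used in removeParams.
sub strip_params {
    my ($url) = @_;
    return $url
        =~ s{\b(?:utm_\w+?|ocid|trkid|gclid|fbid|data-[\w-]+|role|cite|itxt[\w-]*)
             =[^&]+(?:&(?:amp;)?)?}{}gxir
        =~ s{[?&]$}{}r;
}

my @samples = (
    'https://example.com/?id=5&utm_source=x',   # tracking param at the end
    'https://example.com/?role=admin#top',      # fragment after the value
    'https://example.com/?user-role=a',         # name that merely ends in "role"
);

printf "%-40s => %s\n", $_, strip_params($_) for @samples;
```

The first sample comes out as expected (https://example.com/?id=5). The second loses its #top fragment, because [^&]+ runs to the end of the string. The third becomes https://example.com/?user-, because \b also matches after a hyphen. If those cases matter, requiring a ?, & or ; immediately before the parameter name and excluding # from the value are possible tightenings.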

lucy24

5:43 pm on Dec 5, 2022 (gmt 0)




I'd look at alternatives to a WHILE loop because of its inherent perils*. Since the various changes can safely be done in a fixed order, you could instead try a series of separate IF conditions: IF it doesn't start in <a href, IF it contains certain non-allowed characters, IF it's longer than 30 characters and so on.

* I am so uneasy about WHILE (or UNTIL) that when setting up something for local use I tend to protect myself with
counter=0
while(condition AND counter < some-large-number) {
counter++
do stuff }
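
In Perl, that guard might look like the sketch below. The cap of 1000, the sample text, and the variable names are illustrative assumptions, and the loop body here only collects matches rather than rewriting anything.

```perl
use strict;
use warnings;

my $text    = 'one http://a.example two http://b.example three';
my $max     = 1000;   # arbitrary safety cap (an assumption)
my $counter = 0;
my @found;

# The loop ends when the pattern stops matching OR the cap is reached,
# so a body that keeps re-creating matches cannot spin forever.
while ($counter < $max && $text =~ m{(https?://\S+)}g) {
    $counter++;
    push @found, $1;
}

warn "hit the safety cap; input may be only partly processed\n"
    if $counter >= $max;

print "$_\n" for @found;
```

Testing the counter before the match means the regex is not even attempted once the cap is hit.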

csdude55

6:54 pm on Dec 5, 2022 (gmt 0)




I rarely use a WHILE loop, either. But since I'm dealing with user-submitted text, there's no way to know how many links I'm dealing with in advance.

Based on your suggestion, though, I added a regex to count the HTTP occurrences in the text, then used that count to limit the WHILE loops. I'll show my work below.
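
For readers who have not seen the counting step before: it relies on a standard Perl idiom in which a list assignment evaluated in scalar context yields the number of elements on its right-hand side. A minimal illustration, with made-up sample text and a shortened pattern:

```perl
use strict;
use warnings;

my $text = 'http://a.example and https://b.example and plain words';

# m//g in list context returns every match; assigning that list to ()
# and then to a scalar yields the number of elements on the right side.
my $count = () = $text =~ m{https?://}g;

print "$count\n";
```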

While playing with it, I also added an exclusion for IMG tags. These shouldn't actually appear in my text anywhere since they're supposed to be managed using onPaste() in JavaScript, so this is just a safety net to make sure this never happens:

<img src="<a href="https://www.example.com/blah.jpg" target="_new">https://www.example.com/blah.jpg</a>">

I already strip <script>, <link>, <form>, <iframe>, and <object> tags earlier in the script. I should probably strip <map> and <area>, too. Other than those and what I have here, I can't think of any other time that an HTTP would show up in copied text. Can you?
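
Rather than listing tags one at a time, a more general safety net is to treat every remaining tag as off-limits: match whole tags first in an alternation and put them back unchanged, so only URLs outside of tags reach the linking branch. This is a sketch under assumptions: the URL pattern is simplified, link_outside_tags is a hypothetical name, and it protects attributes only. The text between <a> and </a> still needs the whole-element handling that the A HREF loop provides.

```perl
use strict;
use warnings;

my $url_re = qr{https?://[^\s<>"']+}i;   # simplified pattern (assumption)

# Hypothetical helper: link bare URLs but pass every HTML tag through as-is.
sub link_outside_tags {
    my ($text) = @_;
    $text =~ s{
        (<[^>]*>)      # any complete tag: captured and put back unchanged
        |
        ($url_re)      # a URL that is not inside a tag
    }{
        defined $1 ? $1 : "<a href='$2' target='_new'>$2</a>"
    }gxe;
    return $text;
}

print link_outside_tags(q{<img src="https://www.example.com/blah.jpg"> https://www.example.com/page}), "\n";
```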


My code at this point is:

# pattern that should match any URL string; might could just use https?://\S+
$urlPattern = qr#https?://[\w.~!*();:@&=+$,/?\\\#%[\\\]'"-]+#;

# Convert www to http://www, assuming that the destination can apply https if applicable
s#\b(?<!://)www\.([a-z])#http://www\.$1#gi;

# How many HTTP is in the text? This is a safety net
my $totalLinks = () = m#$urlPattern#g;

my @arr;
$x = 0;

# find existing IMG tags, add them to @arr, then replace them with a placeholder
while ($x <= $totalLinks && m#(<img[^>]+>)#gi) {
    $arr[$x] = $1;
    s#\Q$1\E#ï$xï#g;

    $x++;
}

# find existing A HREF tags, add them to @arr, then replace them with a placeholder
while ($x <= $totalLinks && m#(<a href=("|')($urlPattern)\2[^>]*>(.*?)</a>)#gsi) {
    ($url, $show) = removeParams($3, $4);

    # modify existing tags to approved format, stripping unwanted attributes
    $arr[$x] = "<a href='$url' target='_new'>$show</a>";

    # I remove umlauts from the text earlier, so I can safely use ï as a delimiter
    # without worrying about it being in the text
    s#\Q$1\E#ï$xï#g;

    $x++;
}

# Modify URLs without tags
while ($x <= $totalLinks && m#\b($urlPattern)\b#gi) {
    ($modified) = removeParams($1);

    # if displayed URL is longer than 40 characters, use ... to make it shorter
    # eg, https://www.blahb...ahblah.com
    $repl = "<a href='$modified' target='_new'>" .
        (length($modified) > 40 ?
            substr($modified, 0, 27) . '...' . substr($modified, -10) :
            $modified) .
        "</a>";

    $arr[$x] = $repl;
    s#\Q$1\E#ï$xï#g;

    $x++;
}

# convert placeholders back to the expected text
for ($i = 0; $i <= $#arr; $i++) {
    s#ï$iï#$arr[$i]#gi;
}

# this function removes unwanted params
sub removeParams {
    return
        # could make it shorter using (oc|trk|gc|fb)id, but I left it
        # like this and used /x for readability
        map s#\b
            (?:
                utm_\w+? |
                ocid |
                trkid |
                gclid |
                fbid |
                data-[\w-]+ |
                role |
                cite |
                itxt[\w-]*
            )=[^&]+(?:&(?:amp;)?)?##gxir

            # remove trailing ? or &
            =~ s#[?&]$##r
        , @_;
}
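
The restore loop above runs one substitution over the whole text per stored link. Since each placeholder already carries its index, a single s///ge pass can do the same lookups, and text that has been restored is not scanned again. A sketch with made-up sample data, assuming the same ï delimiter and an array shaped like @arr:

```perl
use strict;
use warnings;
use utf8;   # the ï delimiter below is a literal in this source file

my @arr  = ("<a href='http://a.example' target='_new'>A</a>",
            "<a href='http://b.example' target='_new'>B</a>");
my $text = "first ï0ï then ï1ï";

# One pass: look each placeholder number up in @arr.
# An unknown index is left in place rather than replaced with nothing.
$text =~ s{ï(\d+)ï}{ $arr[$1] // "ï$1ï" }ge;

print "$text\n";
```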