Perl pattern match for url

mrealty

10:12 pm on Jul 21, 2009 (gmt 0)

I am trying to remove URLs from text strings in Perl. I have the following, but it does not seem to work:

$remarks =~ s/www\.\s+\.com//gi;

In English, I want to look for www. then I want to delete the www. and everything after it until I hit a space (but not including the space).

It's not even deleting a simple occurrence of www.example.com

Best way to do this?

Kind regards.

RudyS

5:32 pm on Jul 22, 2009 (gmt 0)

$example = 'want to remove the URL www.example.com and leave the space';
$example =~ s/www\..*\.com\s/ /gi;

print "$example";
print "\n$&";
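A side note on the snippet above, as a minimal sketch rather than a definitive fix: `.*` is greedy, so with two URLs in one string it can swallow everything between them, and the trailing `\s` means a URL at the very end of the string is not matched. The original request ("everything after it until I hit a space") maps more directly onto `\S+` (capital S, one or more non-whitespace characters). The sample strings and variable names below are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text containing two URLs.
my $text = 'see www.example.com and also www.example.com today';

# Greedy ".*" runs to the LAST ".com" that is followed by whitespace,
# so the words between the two URLs are deleted as well.
(my $greedy = $text) =~ s/www\..*\.com\s/ /gi;

# "\S+" stops at the first whitespace character, so each URL is
# removed on its own and the surrounding spaces are left alone.
(my $nonspace = $text) =~ s/www\.\S+//gi;

# A URL at the end of the string has no trailing whitespace;
# "\S+" still matches it.
(my $at_end = 'go to www.example.com') =~ s/www\.\S+//gi;

print "greedy:   '$greedy'\n";     # 'see  today'
print "nonspace: '$nonspace'\n";   # 'see  and also  today'
print "at end:   '$at_end'\n";     # 'go to '
```

Note that `\S+` does not insist on a `.com` ending, which is usually what you want when stripping anything that starts with `www.`.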

Gibble

5:55 pm on Jul 22, 2009 (gmt 0)

I recommend you get the tool called Expresso to help you write regular expressions.

your regex of /www\.\s+\.com/
matches www.(whitespace).com

It also has a few expressions in the library such as this one for URLs:
(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*

Since you aren't starting your match with "protocol://" but with "www.", it
can be changed to
/www\.(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@\/$,]*/
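For anyone trying the named-group form in Perl itself: named captures such as `(?<Domain>...)` are supported from Perl 5.10 onward and can be read back through the `%+` hash. The sketch below is an illustration under assumptions: the sample text and the second group name `Rest` are invented here, and `m{...}` delimiters are used so that `/` inside the pattern needs no escaping.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical sample text.
my $text = 'listing at www.example.com/homes?id=42 is open';

my ($domain, $rest) = ('', '');

# "Rest" is an illustrative group name, not part of the Expresso pattern.
# "@" and "$" are escaped so Perl does not try to interpolate them.
if ($text =~ m{www\.(?<Domain>[\w\@][\w.:\@]+)/?(?<Rest>[\w.?=%&\-\@/\$,]*)}i) {
    $domain = $+{Domain};   # named captures are exposed via %+
    $rest   = $+{Rest};
}

print "domain: $domain\n";  # example.com
print "rest:   $rest\n";    # homes?id=42
```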

mrealty

11:50 pm on Jul 22, 2009 (gmt 0)

Gibble: Dude, thank you so much for that (expresso). Before I posted my question, I did a search for "pattern matching utility" and didn't come up with anything...nor had I heard of one. I just downloaded expresso and now I'm on my way to pretending like I've got a PhD in pattern matching. :-) That program is awesome. I can test, build, and analyze complex expressions now. And now I don't have to come back here asking these newb pattern matching questions anymore.

I actually wanted something a little more complex than what I asked in my opening post. I wanted all of these matched and deleted:

http://example.com
http://www.example.com
www.example.com
example.com
example.com/something.html
example.com/123/something

So I copied all of those into the "sample text" window and kept plugging away at the expression until it showed that it hit them all. Guess I'm a little too excited. :-)
Thanks a ton.

With kind regards.
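The "paste the samples and keep tweaking" workflow can also be reproduced in plain Perl. This is only a sketch: it borrows the pattern mrealty settles on later in this thread, written with `qr{...}` delimiters, and the variable names are illustrative.

```perl
use strict;
use warnings;

# Pattern taken from later in the thread; qr{} avoids escaping "/".
my $url_re = qr{(http://|ftp://)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)/*[^\s]*}i;

# The six sample forms listed in the post above.
my @samples = qw(
    http://example.com
    http://www.example.com
    www.example.com
    example.com
    example.com/something.html
    example.com/123/something
);

my @leftovers;   # samples that were NOT removed completely
for my $sample (@samples) {
    (my $stripped = $sample) =~ s/$url_re//g;
    push @leftovers, $sample if length $stripped;
    printf "%-28s => '%s'\n", $sample, $stripped;
}

print @leftovers ? "not fully matched: @leftovers\n" : "all samples removed\n";
```

Adding more lines to `@samples` (for example a URL with a query string, or a TLD outside the com/net/org/gov/edu list) is a quick way to see where the pattern stops matching.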

Gibble

1:52 pm on Jul 23, 2009 (gmt 0)

Don't forget about URLs with a query string, e.g.:
http://domain.tld/page.ext?querystring&value

mrealty

5:49 pm on Jul 24, 2009 (gmt 0)

Gibble:

I've put together a masterful regex. Slight problem though: it works in Expresso, but when I put it in Perl, the interpreter chokes on it.

$remarks =~ s/(http:\/\/|ftp:\/\/)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)\/*[\w\.?=%&=\-@/$,]*//gi;

Perl error:
"Unmatched [ in regex; marked by <-- HERE in m/(http://|ftp://)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)/*[ <-- HERE \w\.?=%&=\-@/ at remark.pl line 71."

If I put this much of it in Perl, it works fine.

$remarks =~ s/(http:\/\/|ftp:\/\/)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)\/*//gi;

So the problem is after that. It's not an unmatched [ as it says. I tried escaping the $ but then it proceeds and gives a concatenation error while it's running my script so that's not it either.

Ideas?

Gibble

5:58 pm on Jul 24, 2009 (gmt 0)

I think you need to escape the ? in [\w\.?=%&=\-@/$,], so use [\w\.\?=%&=\-@/$,]

But, rather, since most anything is legal in the querystring I'd change it from [\w\.?=%&=\-@/$,] to [^\s] meaning anything but whitespace

phranque

6:10 pm on Jul 24, 2009 (gmt 0)

it's probably the dash in the square brackets.
escape it or move it to just after the opening bracket.
perl regular expression CHARACTER CLASSES [perldoc.perl.org]

mrealty

6:27 pm on Jul 24, 2009 (gmt 0)

Gibble:

I escaped the ? and Perl gave an error:

"Unmatched [ in regex; marked by <-- HERE in m/(http://|ftp://)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)/*[ <-- HERE \w\.\?=%&=\-@/ at remark.pl line 63."

(Ignore the line number - I've added some debugging statements)

phranque: The dash is escaped in my example.

Ok. After trying what Gibble said (use the [^\s]*) it worked beautifully. To summarize, download expresso so you can do complex regex stuff (my issue was with perl, not regex). But if you're lazy and just want what works for 99.9% of url pattern matching in a text string using perl, do this:

$remarks =~ s/(http:\/\/|ftp:\/\/)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)\/*[^\s]*//gi;

As an added bonus, if you want to strip email addresses from your string, you'd need to do it before you run the above regex. I borrowed this one from expresso's library:

$remarks =~ s/([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})//g;
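Why the order matters can be seen with a small experiment. The sketch below reuses the two patterns from this post (with `@` escaped and `qr` delimiters chosen for convenience); the sample text and variable names are invented for illustration.

```perl
use strict;
use warnings;

# Both patterns come from this post; only the delimiters differ.
my $url_re   = qr{(http://|ftp://)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)/*[^\s]*}i;
my $email_re = qr/([a-zA-Z0-9_\-\.]+)\@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})/;

# Hypothetical sample text with one address and one URL.
my $text = 'mail agent@example.com or see www.example.com';

# URL pattern first: it eats the domain half of the address and
# leaves "agent@" behind, which the email pattern can no longer match.
(my $url_first = $text) =~ s/$url_re//g;
$url_first =~ s/$email_re//g;

# Email pattern first, then URLs: both are removed cleanly.
(my $email_first = $text) =~ s/$email_re//g;
$email_first =~ s/$url_re//g;

print "URL first:   '$url_first'\n";    # 'mail agent@ or see '
print "email first: '$email_first'\n";  # 'mail  or see '
```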

Expresso is awesome - and free too!

Thanks again for all the help.
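A possible explanation for the "Unmatched [" error, offered as a hedged reading rather than a confirmed diagnosis: in the error text quoted earlier, the pattern Perl reports ends right before the `/` inside the character class. With `s/.../.../`, an unescaped `/` ends the pattern even inside `[...]`, so Perl appears to have seen a pattern with an opening `[` and no closing `]`. Switching to `[^\s]*` removed that `/`, which would explain why it started working. The sketch below reproduces the effect on a tiny stand-in pattern and shows two ways around it; the sample strings are invented.

```perl
use strict;
use warnings;

# Stand-in for the failing statement: an unescaped "/" inside [...]
# with "/" as the delimiter. It is compiled via string eval so that
# this script itself still runs.
my $broken = q{my $s = 'a/b'; $s =~ s/[a/b]+//g; $s};
my $result = eval $broken;
print "broken form did not compile\n" if !defined $result;

# Fix 1: escape the slash inside the character class.
(my $escaped = 'a/b') =~ s/[a\/b]+//g;

# Fix 2: pick delimiters that do not occur in the pattern.
(my $braced = 'a/b') =~ s{[a/b]+}{}g;

# The thread's final pattern written with s{...}{...}, so none of the
# slashes need a backslash.
my $remarks = 'Great house, see http://www.example.com/listing?id=7 for photos';
$remarks =~ s{(http://|ftp://)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)/*[^\s]*}{}gi;

print "escaped: '$escaped'\n";   # ''
print "braced:  '$braced'\n";    # ''
print "remarks: '$remarks'\n";   # 'Great house, see  for photos'
```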

Gibble

6:43 pm on Jul 24, 2009 (gmt 0)

doesn't look like that email regex allows a + sign before the @, it should, it's a valid character in an email address
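To make the `+` issue concrete, here is a small sketch. It compares the library pattern from the previous post with a variant that simply adds `+` to the local-part character class; that variant is an assumption made for illustration, not something taken from Expresso, and the sample address is invented.

```perl
use strict;
use warnings;

# Pattern as posted above ("@" escaped to avoid interpolation).
my $email_re = qr/([a-zA-Z0-9_\-\.]+)\@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})/;

# Illustrative variant: same pattern with "+" allowed before the "@".
my $plus_re  = qr/([a-zA-Z0-9_+\-\.]+)\@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})/;

# Hypothetical sample text.
my $text = 'write to user+tag@example.com today';

# The posted pattern only starts matching after the "+",
# so "user+" is left behind in the text.
(my $without_plus = $text) =~ s/$email_re//g;

# With "+" in the class the whole address is removed.
(my $with_plus = $text) =~ s/$plus_re//g;

print "posted pattern: '$without_plus'\n";  # 'write to user+ today'
print "with plus:      '$with_plus'\n";     # 'write to  today'
```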

mrealty

7:07 pm on Jul 24, 2009 (gmt 0)

gibble: you are absolutely correct. Let me qualify: that email match that was taken from the library will match 99.9% of typical email addresses out there. :-) According to the RFC, here are the allowable characters in the local part:

!#$%&'*+-/=?^_`.{|}~

Add to that, you have more rules on the placement and repetition of each, like no double dot in the local part, and you can't have the dot right before the @ symbol, just to mention a few. My needs don't require such a high degree of accuracy. If someone's needs require a higher degree of accuracy, then that someone would probably not come here for answers, they'd start at the RFC.

We should start a new thread and call it, "My pattern matching is more bullet proof than yours" and have some fun with it. :-)

With kind regards.

[edit]
To satisfy curiosity and RFC 3696, Section 3, for the email local part, here is what I found:

$combination = '[a-zA-Z0-9!#$%&\'*+\/=?^`{|}~.-]';
$localpart = "($combination(?:\\.$combination)?)+";

[end high degree of accuracy]