
Forum Moderators: coopster & jatar k & phranque


Perl pattern match for url

     
10:12 pm on Jul 21, 2009 (gmt 0)

New User

5+ Year Member

joined:May 13, 2009
posts:21
votes: 0


I'm trying to remove URLs from text strings in Perl. I have the following, but it does not seem to work:

$remarks =~ s/www\.\s+\.com//gi;

In English, I want to look for www. then I want to delete the www. and everything after it until I hit a space (but not including the space).

It's not even deleting a simple occurrence of www.example.com

Best way to do this?

Kind regards.

5:32 pm on July 22, 2009 (gmt 0)

New User

5+ Year Member

joined:Apr 5, 2008
posts:33
votes: 0


$example = 'want to remove the URL www.example.com and leave the space';
$example =~ s/www\..*\.com\s/ /gi;

print "$example";
print "\n$&";
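
The behavior described in the opening post (delete "www." and everything after it up to, but not including, the next space) can be sketched with \S, which matches non-whitespace. This is only a minimal sketch: the sample string is the one from the post above, and $stripped is a name made up for the illustration.

```perl
use strict;
use warnings;

my $example = 'want to remove the URL www.example.com and leave the space';

# \S* matches zero or more NON-whitespace characters, so the match stops
# at the space that follows the URL and that space is left in place.
(my $stripped = $example) =~ s/www\.\S*//gi;

print "$stripped\n";
```

Unlike .* this cannot run past the end of the URL, because it stops at the first whitespace character. The space before and the space after the URL both survive, so a doubled space remains where the URL was.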

5:55 pm on July 22, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 13, 2002
posts:662
votes: 0


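I recommend you get the tool called Expresso to help you write regular expressions.

Your regex of /www\.\s+\.com/
matches "www.", then one or more whitespace characters, then ".com". \s is the whitespace class; \S (capital S) is the one that matches non-whitespace.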

It also has a few expressions in the library such as this one for URLs:
(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*

Which since you aren't starting your match with "protocol://", but with "www."
can be changed to
/www\.(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*/
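
The (?<Name>...) named groups in the library expression are also understood by Perl 5.10 and later, where the captures are read back through the %+ hash. A rough sketch follows, assuming a made-up sample string and made-up variable names; it uses m{...} delimiters so the slashes need no escaping, and the character class is the library one with "$" and "@" escaped and "-" moved to the end.

```perl
use strict;
use warnings;

# Hypothetical sample text for illustration.
my $text = 'visit http://example.com/path?x=1 soon';

my ($protocol, $domain);
if ($text =~ m{(?<Protocol>\w+)://(?<Domain>[\w\@][\w.:\@]+)/?[\w.?=%&\@/\$,-]*}) {
    # Named captures are available in the %+ hash (Perl 5.10+).
    ($protocol, $domain) = ($+{Protocol}, $+{Domain});
    printf "protocol=%s domain=%s\n", $protocol, $domain;
}
```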

11:50 pm on July 22, 2009 (gmt 0)

New User

5+ Year Member

joined:May 13, 2009
posts:21
votes: 0


Gibble: Dude, thank you so much for that (expresso). Before I posted my question, I did a search for "pattern matching utility" and didn't come up with anything...nor had I heard of one. I just downloaded expresso and now I'm on my way to pretending like I've got a PhD in pattern matching. :-) That program is awesome. I can test, build, and analyze complex expressions now. And now I don't have to come back here asking these newb pattern matching questions anymore.

I actually wanted something a little more complex than what I asked in my opening post. I wanted all of these matched and deleted:

http://example.com
http://www.example.com
www.example.com
example.com
example.com/something.html
example.com/123/something

So I copied all of those into the "sample text" window and kept plugging away at the expression until it showed that it hit them all. Guess I'm a little too excited. :-)
Thanks a ton.

With kind regards.

1:52 pm on July 23, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 13, 2002
posts:662
votes: 0


don't forget about the
http://domain.tld/page.ext?querystring&value

[edited by: phranque at 9:08 pm (utc) on July 23, 2009]
[edit reason] unlinked url [/edit]

5:49 pm on July 24, 2009 (gmt 0)

New User

5+ Year Member

joined:May 13, 2009
posts:21
votes: 0


Gibble:

I've put together a masterful regex. Slight problem though, it works in expresso but when I put it in perl, the interpreter chokes on it.

$remarks =~ s/(http:\/\/|ftp:\/\/)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)\/*[\w\.?=%&=\-@/$,]*//gi;

PERL error:
"Unmatched [ in regex; marked by <-- HERE in m/(http://|ftp://)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)/*[ <-- HERE \w\.?=%&=\-@/ at remark.pl line 71."

If I put this much of it in perl, it works fine.

$remarks =~ s/(http:\/\/|ftp:\/\/)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)\/*//gi;

So the problem is after that. It's not an unmatched [ as it says. I tried escaping the $ but then it proceeds and gives a concatenation error while it's running my script so that's not it either.

Ideas?

5:58 pm on July 24, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 13, 2002
posts:662
votes: 0


I think you need to escape the ? in [\w\.?=%&=\-@/$,], so use [\w\.\?=%&=\-@/$,]

But, rather, since most anything is legal in the querystring I'd change it from [\w\.?=%&=\-@/$,] to [^\s] meaning anything but whitespace

6:10 pm on July 24, 2009 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10563
votes: 15


it's probably the dash in the square brackets.
escape it or move it to just after the opening bracket.
perl regular expression CHARACTER CLASSES [perldoc.perl.org]

6:27 pm on July 24, 2009 (gmt 0)

New User

5+ Year Member

joined:May 13, 2009
posts:21
votes: 0


Gibble:

I escaped the ? and perl gave an error:

"Unmatched [ in regex; marked by <-- HERE in m/(http://|ftp://)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)/*[ <-- HERE \w\.\?=%&=\-@/ at remark.pl line 63."

(Ignore the line number - I've added some debugging statements)

phranque: The dash is escaped in my example.

Ok. After trying what Gibble said (use the [^\s]*) it worked beautifully. To summarize, download expresso so you can do complex regex stuff (my issue was with perl, not regex). But if you're lazy and just want what works for 99.9% of url pattern matching in a text string using perl, do this:

$remarks =~ s/(http:\/\/|ftp:\/\/)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)\/*[^\s]*//gi;

As an added bonus, if you want to strip email addresses from your string, you'd need to do it before you run the above regex. I borrowed this one from expresso's library:

$remarks =~ s/([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})//g;

Expresso is awesome - and free too!

Thanks again for all the help.
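
A likely explanation for the "Unmatched [" error, offered as a reading of the message rather than something confirmed in the thread: the pattern in the error stops right at the "/" inside the bracketed class. Perl looks for the closing delimiter of s/// before it parses the regex, so an unescaped "/" inside [...] ends the pattern early and leaves the "[" unclosed. Switching to [^\s]* would have worked because it removed that slash. The sketch below keeps the original class but uses s{...}{} delimiters so "/" needs no escaping; the sample text and the name $cleaned are made up for the illustration, and "$" and "@" are escaped to keep Perl from treating them as variables.

```perl
use strict;
use warnings;

# Hypothetical sample text; $remarks is the variable name used in the thread.
my $remarks = 'see http://www.example.com/page.html?x=1 or example.org/123 today';

# Same shape as the thread's pattern. With braces as delimiters, the "/"
# inside the character class no longer terminates the pattern.
(my $cleaned = $remarks) =~
    s{(?:http://|ftp://)?(?:\.?\w+-*\w+)+\.(?:com|net|org|gov|edu)/*[\w.?=%&\@/\$,-]*}{}gi;

print "$cleaned\n";
```

Escaping the slash as \/ inside the class while keeping the s/// delimiters should work just as well; the braces simply avoid the escaping.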

6:43 pm on July 24, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 13, 2002
posts:662
votes: 0


doesn't look like that email regex allows a + sign before the @, it should, it's a valid character in an email address

7:07 pm on July 24, 2009 (gmt 0)

New User

5+ Year Member

joined:May 13, 2009
posts:21
votes: 0


gibble: you are absolutely correct. Let me qualify: that email match that was taken from the library will match 99.9% of typical email addresses out there. :-) According to the RFC, here are the allowable characters in the local part:

!#$%&'*+-/=?^_`.{|}~

Add to that, you have more rules on the placement and repetition of each, like no double dot in the local part, and you can't have the dot right before the @ symbol, just to mention a few. My needs don't require such a high degree of accuracy. If someone's needs require a higher degree of accuracy, then that someone would probably not come here for answers, they'd start at the RFC.

We should start a new thread and call it, "My pattern matching is more bullet proof than yours" and have some fun with it. :-)

With kind regards.

[edit]
To satisfy curiosity and the RFC3696 Sec 3 for the email local part, here is what I found:

$combination = '[a-zA-Z0-9!#$%&\'*+\/=?^`{|}~.-]';
$localpart = "($combination(?:\.$combination)?)+";

[end high degree of accuracy]
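
For anyone who wants the dot rules mentioned above (no leading, trailing, or doubled dot in the local part), here is one way they might be expressed with qr//. It is only a sketch: it handles unquoted local parts only, and the names $atext and $localpart and the test strings are assumptions made for the illustration, not taken from expresso's library.

```perl
use strict;
use warnings;

# Characters allowed in an unquoted local part, excluding the dot.
my $atext = qr/[a-zA-Z0-9!#\$%&'*+\/=?^_`{|}~-]/;

# One or more runs of allowed characters, separated by single dots.
my $localpart = qr/$atext+(?:\.$atext+)*/;

for my $candidate ('john+tag', 'first.last', 'double..dot', 'trailing.') {
    my $verdict = $candidate =~ /\A$localpart\z/ ? 'ok' : 'rejected';
    print "$candidate: $verdict\n";
}
```

Building the pieces with qr// rather than quoted strings keeps the backslashes intact, which matters for the \. separator.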

 
