$remarks =~ s/www\.\s+\.com//gi;
In English, I want to look for www. then I want to delete the www. and everything after it until I hit a space (but not including the space).
It's not even deleting a simple occurrence of www.example.com.
Best way to do this?
Kind regards.
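Your regex, /www\.\s+\.com/,
matches "www." followed by one or more whitespace characters and then ".com", because \s matches whitespace (\S is the non-whitespace class).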
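For the original request (delete "www." and everything after it up to, but not including, the next space), something along these lines should do it. This is only a minimal sketch: the sample text is made up, and $cleaned is a hypothetical name used so the original $remarks stays untouched.

```perl
use strict;
use warnings;

# Hypothetical sample text; $remarks is the variable name from the opening post.
my $remarks = 'Visit www.example.com today or WWW.Example.org/page now';

# \S* consumes non-whitespace characters only, so the match stops at the next space.
(my $cleaned = $remarks) =~ s/www\.\S*//gi;

print "$cleaned\n";   # prints "Visit  today or  now" (the spaces are left in place)
```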
Expresso also has a few expressions in its library, such as this one for URLs:
(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*
Since you aren't starting your match with "protocol://" but with "www.",
it can be changed to
/www\.(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*/
I actually wanted something a little more complex than what I asked in my opening post. I wanted all of these matched and deleted:
http://example.com
http://www.example.com
www.example.com
example.com
example.com/something.html
example.com/123/something
So I copied all of those into the "sample text" window and kept plugging away at the expression until it showed that it hit them all. Guess I'm a little too excited. :-)
Thanks a ton.
With kind regards.
I've put together a masterful regex. Slight problem though: it works in Expresso, but when I put it in Perl, the interpreter chokes on it.
$remarks =~ s/(http:\/\/|ftp:\/\/)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)\/*[\w\.?=%&=\-@/$,]*//gi;
Perl error:
"Unmatched [ in regex; marked by <-- HERE in m/(http://|ftp://)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)/*[ <-- HERE \w\.?=%&=\-@/ at remark.pl line 71."
If I put this much of it in Perl, it works fine.
$remarks =~ s/(http:\/\/|ftp:\/\/)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)\/*//gi;
So the problem is after that. It's not an unmatched [ as it says. I tried escaping the $ but then it proceeds and gives a concatenation error while it's running my script so that's not it either.
Ideas?
I escaped the ? and Perl gave an error:
"Unmatched [ in regex; marked by <-- HERE in m/(http://|ftp://)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)/*[ <-- HERE \w\.\?=%&=\-@/ at remark.pl line 63."
(Ignore the line number - I've added some debugging statements)
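For what it's worth, the likely culprit in both error messages is the bare / inside the character class. Perl locates the closing delimiter of s/// before it parses the regex, so the pattern gets cut off right after the @, which leaves the [ unclosed. An unescaped $ followed by a comma may also be read as Perl's $, variable. Here is a minimal sketch of one way around both, assuming braces as the delimiter; the sample text and the $cleaned name are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text for illustration.
my $remarks = 'See http://www.example.com/123/something?x=1 and example.org/a.html ok';

# s{...}{} uses braces as delimiters, so "/" needs no escaping anywhere.
# "\$" and "\@" are escaped so Perl does not try to interpolate variables.
(my $cleaned = $remarks) =~
    s{(http://|ftp://)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)/*[\w.?=%&\-\@/\$,]*}{}gi;

print "$cleaned\n";   # prints "See  and  ok"
```

Keeping the / delimiters and writing \/ inside the class should work as well; the braces just make the pattern easier to read.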
phranque: The dash is escaped in my example.
OK. After trying what Gibble said (use [^\s]*), it worked beautifully. To summarize: download Expresso so you can do complex regex stuff (my issue was with Perl, not regex). But if you're lazy and just want what works for 99.9% of URL pattern matching in a text string using Perl, do this:
$remarks =~ s/(http:\/\/|ftp:\/\/)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)\/*[^\s]*//gi;
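A quick way to check that expression against the sample URLs from earlier in the thread is to run each one through it and print what is left. This is just a sketch: the %after hash and the example.info entry are additions of mine, the latter to show that the fixed TLD list leaves other domains alone.

```perl
use strict;
use warnings;

# Sample URLs from earlier in the thread, plus one hypothetical non-match.
my @samples = (
    'http://example.com',
    'http://www.example.com',
    'www.example.com',
    'example.com',
    'example.com/something.html',
    'example.com/123/something',
    'example.info',    # hypothetical: TLD is not in the alternation
);

# Same pattern as above, precompiled with qr// so it can be reused.
my $url = qr/(http:\/\/|ftp:\/\/)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)\/*[^\s]*/i;

my %after;    # sample => what remains after stripping
for my $sample (@samples) {
    (my $stripped = $sample) =~ s/$url//g;
    $after{$sample} = $stripped;
    printf "%-28s => '%s'\n", $sample, $stripped;
}
```

The six thread samples should come out empty, while example.info is left as it is.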
As an added bonus, if you want to strip email addresses from your string, you'd need to do it before you run the above regex. I borrowed this one from Expresso's library:
$remarks =~ s/([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})//g;
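The order matters because the URL expression would otherwise eat the domain half of each address and leave the local part and the @ behind. A small sketch of both orders, using a made-up sample string and hypothetical variable names ($email_re, $url_re, $right, $wrong):

```perl
use strict;
use warnings;

# Hypothetical sample text for illustration.
my $remarks = 'mail bob.smith@example.com or see www.example.org now';

# The two expressions from this thread, precompiled ("@" escaped to be safe).
my $email_re = qr/([a-zA-Z0-9_\-\.]+)\@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})/;
my $url_re   = qr/(http:\/\/|ftp:\/\/)?(\.?\w+-*\w+)+\.(com|net|org|gov|edu)\/*[^\s]*/i;

# Emails first, then URLs: both are removed cleanly.
(my $right = $remarks) =~ s/$email_re//g;
$right =~ s/$url_re//g;

# URLs first: "example.com" is removed and "bob.smith@" is stranded.
(my $wrong = $remarks) =~ s/$url_re//g;
$wrong =~ s/$email_re//g;

print "emails first: $right\n";   # prints "emails first: mail  or see  now"
print "urls first:   $wrong\n";   # prints "urls first:   mail bob.smith@ or see  now"
```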
Expresso is awesome - and free too!
Thanks again for all the help.
!#$%&'*+-/=?^_`.{|}~
Add to that, you have more rules on the placement and repetition of each, like no double dot in the local part, and no dot right before the @ symbol, just to mention a few. My needs don't require such a high degree of accuracy. Anyone whose needs do require it would probably not come here for answers; they'd start at the RFC.
We should start a new thread and call it, "My pattern matching is more bulletproof than yours" and have some fun with it. :-)
With kind regards.
[edit]
To satisfy curiosity and the RFC3696 Sec 3 for the email local part, here is what I found:
$combination = '[a-zA-Z0-9!#$%&\'*+\/=?^`{|}~.-]';
$localpart = "($combination(?:\.$combination)?)+";
[end high degree of accuracy]
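For anyone who wants the dot rules mentioned above actually enforced (no leading dot, no trailing dot, no two dots in a row), here is a rough sketch of how that might look. It is my own variation rather than the snippet quoted above: the dot is left out of the character class and handled separately, the underscore from the list earlier in the thread is included, and the names $atext and is_valid_localpart are hypothetical. Quoted local parts and length limits are not handled.

```perl
use strict;
use warnings;

# Characters allowed in an unquoted local part, excluding the dot.
my $atext = qr/[a-zA-Z0-9!#\$%&'*+\/=?^_`{|}~-]/;

# One or more runs of those characters, separated by single dots.
my $localpart = qr/$atext+(?:\.$atext+)*/;

# Hypothetical helper: true if the whole string is an acceptable local part.
sub is_valid_localpart {
    my ($candidate) = @_;
    return $candidate =~ /\A$localpart\z/ ? 1 : 0;
}

for my $candidate ('john.smith', 'john..smith', '.john', 'john.') {
    my $verdict = is_valid_localpart($candidate) ? 'ok' : 'rejected';
    print "$candidate: $verdict\n";
}
```

Only john.smith should be reported as ok; the other three each break one of the dot rules.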