Forum Moderators: coopster


Regex ignore string not char class


         

Omala

8:08 am on Jan 14, 2008 (gmt 0)

10+ Year Member



Hi everyone,

I can usually find things out using an online regex reference, but I think I'm overlooking something here. It's probably simple, but I haven't been able to figure it out for the past couple of hours.

I have the following regex which works great:

preg_match_all("/href=\"?(.+?)[\" >]/i",$anchortitle_matches_str,$extintlink_matches);

It gives me all links on a site, whether they are in <a href="bla.html">, <a href=bla.html> or <a href=bla.html target=_blank> format.

So that's great.

Anyway, I'm trying to ignore ftp://, mailto:, javascript: etc.
I can do this by simply looping through my result array and skipping the unwanted results using strpos or similar. However, I know it has to be possible with regex.

Basically I'm trying to do this:

href=\"?mailto://|ftp://|javascript:(.+?)[\" >]

But the opposite. The above ONLY gives me links that DO contain ftp, mailto and javascript, but I'm trying to ignore those.

I can't figure out how to properly use ^ or otherwise negate my unwanted links using the above method. The same goes for character classes: whenever I use [ ], regex just treats the contents letter by letter, so:

href=\"?[^mailto://](.+?)[\" >]

This simply ignores ANY link with an m, a, i, l, t, o, : or / in it. That's not what I want; I want it to ignore links with mailto:// in them (hence the subject of this post: ignore a STRING!)

I hope that made sense.

Thanks in advance to anyone who can point me in the right direction!

Also I'm aware I could just look for links based on http:// or https:// but the problem is I'm also trying to find internal links so that's no solution :)
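To illustrate the character class behaviour described above, here is a minimal sketch. The test strings are made up, and the class is anchored with ^ purely to show the effect on the first character: a negated class such as [^mailto://] consumes exactly one character that is not in the set, so it never tests for the string "mailto://".

```php
<?php
// ~ is used as the delimiter so the slashes need no escaping.
// [^mailto://] means: ONE character that is not m, a, i, l, t, o, ':' or '/'.
$class = '~^[^mailto://]~';

$a = preg_match($class, 'page.html');             // 'p' is not in the set
$b = preg_match($class, 'index.html');            // 'i' is in the set
$c = preg_match($class, 'mailto:me@example.com'); // 'm' is in the set

var_dump($a, $b, $c); // int(1), int(0), int(0)
```

Note that index.html is rejected even though it has nothing to do with mailto, which is the letter-by-letter effect.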


PHP_Chimp

10:18 pm on Jan 14, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You can get part of the way by changing your pattern. As a URL will only have certain characters in it, you can use:
'%href="?([\w\./#&]+)[" ]?>%'
This pattern should get rid of mailto: and javascript: because the : is not going to be in a normal URL.

Getting rid of ftp is not something I managed.
I tried lookbehind assertions but they didn't seem to work :(

However maybe someone else will get an idea from my NOT WORKING code -


$pattern = '%href=?(?<!ftp\.)([\w\./#&]+)[" ]>%';
preg_match_all($pattern, $string, $matches, PREG_PATTERN_ORDER);
echo '<pre>';
print_r($matches);
echo '</pre>';

The (?<! doesn't seem to be matching the ftp.
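One likely reason the lookbehind does not help: (?<!...) inspects the text before the current position, which here is href=, never the scheme that follows. A negative lookahead, (?!...), checks what comes next instead. Below is a minimal sketch built on the original poster's pattern. The sample markup is invented, and the scheme list (mailto, ftp, javascript) is an assumption taken from the question. The lookahead is placed before the optional quote so the engine cannot backtrack around it.

```php
<?php
// Sample markup; the URLs are made up for illustration.
$html = '<a href="page.html"> <a href=other.html target=_blank> '
      . '<a href="mailto:someone@example.com"> <a href="ftp://example.com/file"> '
      . '<a href="javascript:void(0)"> <a href="http://example.com/">';

// (?!"?(?:mailto|ftp|javascript):) fails the match when href= is followed
// by an optional quote and one of the unwanted schemes plus a colon.
$pattern = '/href=(?!"?(?:mailto|ftp|javascript):)"?(.+?)[" >]/i';

preg_match_all($pattern, $html, $matches);
print_r($matches[1]); // page.html, other.html, http://example.com/
```

Matching on the scheme and colon only (mailto: rather than mailto://) also covers ordinary mailto links, which have no slashes.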

d40sithui

10:34 pm on Jan 14, 2008 (gmt 0)

10+ Year Member



Hi Omala,
Instead of finding and negating "mailto://, ftp://, javascript:", why not just come up with a pattern that only searches for valid links? Here's what I got while playing with it. By disallowing the colon ":", you eliminate your chances of getting mailto:, ftp:, and javascript:

<?php
// Start with = (not .=) so $str is defined before anything is appended to it.
$str  = "<a href=\"somewhere.php\">";
$str .= "<a href=\"somewhere.php\" target=\"_new\">";
$str .= "<a href=\"mailto://someone@somewhere.com\">";
$str .= "<a href=\"ftp://somewhere.com\">";
$str .= "<a href=\"somewhere.php\" target=\"_blank\">";

// Same pattern as before, minus the redundant escapes and {1}/{0,1} quantifiers.
// ":" and "@" are not in the allowed set, so the mailto and ftp links never match.
$pattern2 = '/<a\shref="?[a-zA-Z0-9%_\-?.\s&="\/]+"?>/si';
preg_match_all($pattern2, $str, $matches);

print_r($matches); // you'll need to "view source" to see the actual output, because the browser will parse the HTML tags.
?>

If this doesn't work for you, I'm sure there are other ways, such as writing a loop that goes through your $matches array, stores the valid entries in a temp array, and skips any that contain the strings you want to ignore.
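A minimal sketch of that loop approach follows. The $links array is a hypothetical stand-in for the captured group returned by preg_match_all(), and the prefix list is assumed from the original question.

```php
<?php
// Hypothetical result list, as preg_match_all() might return it in $matches[1].
$links = ['page.html', 'mailto:someone@example.com', 'ftp://example.com/file',
          'JavaScript:void(0)', '/docs/index.html'];

// Prefixes to drop (assumed from the question).
$unwanted = ['mailto:', 'ftp:', 'javascript:'];
$valid = [];

foreach ($links as $link) {
    $skip = false;
    foreach ($unwanted as $prefix) {
        // stripos() === 0 means the link starts with the prefix, ignoring case.
        if (stripos($link, $prefix) === 0) {
            $skip = true;
            break;
        }
    }
    if (!$skip) {
        $valid[] = $link; // keep everything that has no unwanted prefix
    }
}

print_r($valid); // page.html, /docs/index.html
```

The strict === 0 comparison matters here: stripos() returns false when the prefix is absent, and a loose == 0 would treat that as a match.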