Regular Expression Help Needed

Forum Moderators: coopster

Trying to extract the href from a URL

Karma

2:17 pm on Aug 13, 2009 (gmt 0)

Hi,

I have a string:

$string = "<a href='http://www.mydomain.tld/randompage.html'>widget page</a>";

I'm trying to extract just the href value:

$result = "http://www.mydomain.tld/randompage.html";

...but I can't seem to get this working.

Anyone?

Thanks

brotherhood of LAN

2:22 pm on Aug 13, 2009 (gmt 0)

If you're looking within whole HTML pages you may be better off with a parser, which would load all the <a> attributes into an array...
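A minimal sketch of the parser route, assuming PHP 7.1+ with the bundled DOM extension enabled. The helper name extract_hrefs is hypothetical and not from this thread:

```php
<?php
// Hypothetical helper (not from the thread): collect the href of every <a> in a page.
function extract_hrefs(string $html): array
{
    // Sloppy real-world HTML triggers libxml warnings; collect them quietly instead.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }
    return $hrefs;
}

$string = "<a href='http://www.mydomain.tld/randompage.html'>widget page</a>";
print_r(extract_hrefs($string));
```

The parser does not care whether the value is single-quoted, double-quoted, or unquoted, so none of that has to be handled by hand.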

But for the example you've given:

/\'([^\']+)/

That would match "all characters after a single quote until another single quote is reached".
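For what it's worth, here is one way that pattern could be used with preg_match. This is only a sketch for the exact string in the question; the variable names $matches and $result are just for illustration:

```php
<?php
$string = "<a href='http://www.mydomain.tld/randompage.html'>widget page</a>";
$result = null;

// The pattern sits in a single-quoted PHP string, so each ' inside it is escaped.
// The first parenthesised group ends up in $matches[1].
if (preg_match('/\'([^\']+)/', $string, $matches)) {
    $result = $matches[1];
}

echo $result, "\n"; // http://www.mydomain.tld/randompage.html
```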

rocknbil

5:46 pm on Aug 13, 2009 (gmt 0)

But . . . the value may not always be in single quotes, or may not be quoted at all. :-)

<?php

$string1 = "<a href='http://www.mydomain.tld/randompage.html'>widget page</a>";

$string2 = '<a href="http://www.mydomain.tld/randompage.html">widget page</a>';

$string3 = '<a href=http://www.mydomain.tld/randompage.html>widget page</a>';

$string4 = '<a
href="http://www.mydomain.tld/randompage.html">widget
page</a>';

echo "1 $string1 \n";
echo "2 $string2 \n";
echo "3 $string3 \n";
echo "4 with carriage returns $string4 \n";

// Couldn't figure out why newlines didn't work - so first strip them. A little homework for you. :-)
$string4 = preg_replace('/[\n\r]+/'," ",$string4);

// Note that the [] character classes contain a single quote ' next to a double quote "
// This may not be obvious in this forum's display

// Same pattern for all four strings, so define it once
$pattern = '/<.*?href\s*=\s*[\'"]*([^\'">]+)[\'"]*>.*<\/a>/i';

$string1 = preg_replace($pattern, '$1', $string1);

$string2 = preg_replace($pattern, '$1', $string2);

$string3 = preg_replace($pattern, '$1', $string3);

$string4 = preg_replace($pattern, '$1', $string4);

echo "1 $string1 \n";
echo "2 $string2 \n";
echo "3 $string3 \n";
echo "4 $string4 \n";
?>

Breakdown:

/ - pattern delimiter

< - start with an opening angle bracket

.*? - followed by zero or more of any character, with the ? making the quantifier lazy so it doesn't slurp up the whole string. This could include other attributes: class="myclass" title="mytitle"....

href - followed by href

\s*=\s* - followed by zero or more whitespace characters, an equals sign, and zero or more whitespace characters. This accounts for sloppy coding; spaces around the = are not recommended but are supported

[\'"]* - followed by zero or more of ' or ". Zero or more covers unquoted hrefs. I'm escaping ' because the pattern sits inside a single-quoted PHP string; escape " instead if you wrap the pattern in double quotes.

([^\'">]+) - followed by one or more (+) of anything that is NOT a ', ", or >, which should be the URL. Because it is surrounded by parentheses, it gets stored in $1

[\'"]* - followed by zero or more of ' or ", again, zero or more covers unquoted attributes

> - followed by the closing angle bracket

.* - followed by zero or more of any character (sloppy, sorta, but it works in conjunction with . . .)

<\/a> - followed by the closing anchor tag. Note this will fail for malformed HTML without a closing tag, but that should be fairly obvious

/i' - end of the pattern; i = case insensitive, so HrEf and </A> match too. The trailing ' closes the PHP string
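The pattern broken down above can also be used with preg_match to pull out just the captured group instead of replacing the whole tag. A rough sketch, assuming PHP 7.1+; the helper name href_of is made up for this example:

```php
<?php
// Hypothetical helper: returns the href value, or null when the pattern does not match.
function href_of(string $html): ?string
{
    $pattern = '/<.*?href\s*=\s*[\'"]*([^\'">]+)[\'"]*>.*<\/a>/i';
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

$samples = [
    "<a href='http://www.mydomain.tld/randompage.html'>widget page</a>", // single quotes
    '<a href="http://www.mydomain.tld/randompage.html">widget page</a>', // double quotes
    '<a href=http://www.mydomain.tld/randompage.html>widget page</a>',   // unquoted
];

foreach ($samples as $i => $html) {
    echo ($i + 1), ' ', href_of($html), "\n";
}
```

One thing to watch: the pattern expects the closing > straight after the value, so an anchor with more attributes after the href (for example a class) will not match and href_of returns null for it.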

It should have worked for string4 too; I don't know why it didn't, which is why I added the line that strips the carriage returns. I'm sure a regexp expert will point it out. :-)
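Nobody in the thread answered this, but a likely cause is that in PCRE the dot does not match a newline unless the s modifier is set, so both .*? and .* stop at the line breaks in string4. A small sketch, with the string rebuilt using explicit \n and the variable names chosen only for illustration:

```php
<?php
// string4 from the post above, with the line breaks written as explicit \n
$string4 = "<a\nhref=\"http://www.mydomain.tld/randompage.html\">widget\npage</a>";

$without_s = '/<.*?href\s*=\s*[\'"]*([^\'">]+)[\'"]*>.*<\/a>/i';
$with_s    = '/<.*?href\s*=\s*[\'"]*([^\'">]+)[\'"]*>.*<\/a>/is';

// Without s the dot cannot cross "\n", so nothing matches.
var_dump(preg_match($without_s, $string4)); // int(0)

// With s the dot matches newlines as well, so no stripping is needed first.
preg_match($with_s, $string4, $m);
echo $m[1], "\n"; // http://www.mydomain.tld/randompage.html
```

If that is indeed the cause, adding s after the i makes the newline-stripping line unnecessary.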

Gibble

5:56 pm on Aug 13, 2009 (gmt 0)

As I tell everyone, download Expresso.
It makes writing and testing regular expressions much, much easier.