RegEx for Grabbing Hyperlinks

PHP Flavo(u?)r

         

brotherhood of LAN

1:54 pm on Jan 18, 2003 (gmt 0)




I have this code to grab hyperlinks from a page. It seems to work fine for getting links, but it doesn't account for images that could be inside the <a> tags.

$linkvolume = preg_match_all("'href=\"?(.[^\"\>]*)\"?(.[^\>]*)>(.[^\<]*)</a>'im",$doc[1],$matches);

Right, it's probably a bit messy, with unneeded bits ;)

But does anyone know how I can deal with images inside hyperlinks? My preg_match quota is running out rapidly for this week ;) i.e. it's taking me ages.

I tried replacing images with something like "<img.[^\>]*\>" but can't seem to get rid of them.

Anyone with their regex head on that can sort me out would be great!

/added
$matches[1] holds the URL, $matches[3] the anchor text, and $matches[2] anything in between.

I also escaped < and >; not sure if that's needed.
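
To make the group numbering concrete, here is a small sketch that runs the pattern from the post against two made-up links (the example.com markup is an assumption, not taken from the original page). With the default ordering of preg_match_all, index 1 collects the URLs, index 2 the leftover attributes, and index 3 the anchor text. The second link shows the reported problem: the third group cannot cross a "<", so an image before </a> prevents any match.

```php
<?php
// The pattern from the post, unchanged.
$pattern = "'href=\"?(.[^\"\>]*)\"?(.[^\>]*)>(.[^\<]*)</a>'im";

// Hypothetical sample markup, for illustration only.
$plain = '<a href="http://example.com/" class="x">Example</a>';
$withImage = '<a href="http://example.com/" class="x">Example<img src="a.png"></a>';

// Plain link: one match, split across the three groups.
$plainCount = preg_match_all($pattern, $plain, $matches);
printf(
    "%d link: url=%s attrs=%s text=%s\n",
    $plainCount,
    $matches[1][0],
    $matches[2][0],
    $matches[3][0]
);

// Link with an image: group 3 stops at "<img", so "</a>" cannot follow it.
$imageCount = preg_match_all($pattern, $withImage, $imageMatches);
printf("%d links found when an image follows the text\n", $imageCount);
```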

brotherhood of LAN

2:32 pm on Jan 18, 2003 (gmt 0)




whoops, got it now

...(.[^\<i]*)</a>'im",$doc[1],$matches);

Feel free to nuke this thread if no one is doing this ;)
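
A caution about the class above, offered as a sketch rather than a correction of the poster's result: inside square brackets, "<i" is not read as a sequence. [^\<i] excludes the single characters "<" and "i" (and "I" under the i flag), so anchor text containing the letter i may stop matching. The full pattern below is an assumption, built by putting the new class into the pattern from the first post, because the snippet above elides its beginning. The sample links are made up.

```php
<?php
// Assumed full pattern: the first post's pattern with [^\<] changed to [^\<i].
$pattern = "'href=\"?(.[^\"\>]*)\"?(.[^\>]*)>(.[^\<i]*)</a>'im";

// Hypothetical links: one anchor text without the letter i, one with it.
$noLetterI = '<a href="http://example.com/" class="x">Example</a>';
$hasLetterI = '<a href="http://example.com/" class="x">Click here</a>';

$countWithout = preg_match_all($pattern, $noLetterI, $m1);
$countWith = preg_match_all($pattern, $hasLetterI, $m2);

// The class rejects "i" as a character, so "Click here" is cut off at "Cl".
printf("text without i: %d match, text with i: %d matches\n", $countWithout, $countWith);
```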

andreasfriedrich

2:52 pm on Jan 18, 2003 (gmt 0)




Have you tried matching this link?

<a href = [ac.com...]  >Aaron<img src="" /></ a >

or this one

<a href ='http://www.ac.com/'>Aaron<img src="" /></ a >

Your RE will not match those.
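
A quick way to check this, as a sketch: run the pattern from the first post against the second link above. The literal "href=" in the pattern leaves no room for the space before the equals sign, so nothing is found.

```php
<?php
// Pattern from the first post.
$pattern = "'href=\"?(.[^\"\>]*)\"?(.[^\>]*)>(.[^\<]*)</a>'im";

// The second link above: space before "=", single quotes, whitespace in the closing tag.
$link = "<a href ='http://www.ac.com/'>Aaron<img src=\"\" /></ a >";

$count = preg_match_all($pattern, $link, $matches);
printf("links found: %d\n", $count);
```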

"But does anyone know how I can deal with images inside hyperlinks?"

How do you want to deal with them? Ignore them? Match their URI?

Try this RE:

$pat = <<<END
{href\s*=\s*(["'])?([^'" >]+?)(\\1| )?\s*>([^<]*?)(<img[^>]+>)?</\s*a\s*>}si
END;
$text = <<<END
<a href = [ac.com...] >Aaron<img src="" /></ a >
END;
// $m[2] collects the URLs, $m[4] the anchor text, $m[5] any image tag
preg_match_all($pat,$text,$m);

  • match href
  • 0 or more whitespace
  • =
  • 0 or more whitespace
  • either " or ' and save them once or 0 times
  • one or more characters that are not ' or " or space or > non greedily and save them
  • what we matched in 1, i.e. either " or ', or space once or 0 times
  • 0 or more whitespace
  • >
  • ([^<]*?)(<img[^>]+>)?
    or
    (.*?)
    use the latter when you want the entire content of the a element
  • </
  • 0 or more whitespace
  • a
  • 0 or more whitespace
  • >
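
The steps above can also be written into the pattern itself. This is a sketch, assuming the x modifier (extended mode), which lets the pattern span several lines with a comment per step. A nowdoc is used so the backreference needs no double backslash, and the space alternative is written as [ ] because extended mode ignores bare whitespace. The test string is the single-quoted link from earlier in this post.

```php
<?php
// Same pattern as in the list, one step per line (x = extended mode).
$pat = <<<'END'
{
    href \s* = \s*      # href, then = with optional whitespace on both sides
    (["'])?             # group 1: an opening quote, if there is one
    ([^'" >]+?)         # group 2: the URL, matched non-greedily
    (\1|[ ])?           # group 3: the quote from group 1 again, or a space
    \s* >               # optional whitespace, then the end of the opening tag
    ([^<]*?)            # group 4: anchor text up to the next tag
    (<img[^>]+>)?       # group 5: one optional image
    </ \s* a \s* >      # closing tag, tolerating whitespace
}six
END;

$text = <<<'END'
<a href ='http://www.ac.com/'>Aaron<img src="" /></ a >
END;

$count = preg_match_all($pat, $text, $m);
printf("url=%s text=%s image=%s\n", $m[2][0], $m[4][0], $m[5][0]);
```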

I believe you are safer when you remove images from the content of the a element in a second step.

$a_content = preg_replace("'<img[^>]+>'", '', $a_content);
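
Put together, the two-step approach might look like the sketch below. It assumes the (.*?) variant of the pattern, so group 4 holds the entire content of the a element, and it adds the i flag to the image pattern so that upper-case tags are removed too. The sample markup and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample markup.
$html = '<a href="http://www.ac.com/">Aaron<img src="" /></a>';

// Step 1: capture the whole content of the a element in group 4.
$pat = '{href\s*=\s*(["\'])?([^\'" >]+?)(\1| )?\s*>(.*?)</\s*a\s*>}si';
$count = preg_match_all($pat, $html, $m);

// Step 2: strip image tags from every captured content string.
// preg_replace accepts an array subject and returns an array.
$texts = preg_replace("'<img[^>]+>'i", '', $m[4]);

printf("url=%s text=%s\n", $m[2][0], $texts[0]);
```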

Sorry for the small font but that's the forum software ;)

Andreas

brotherhood of LAN

3:37 pm on Jan 18, 2003 (gmt 0)




Thanks Andreas,

I should have held my breath for an hour or two :)

I have this now
$linkvolume = preg_match_all("'href\s?=\s?\"?\s?(.[^\">]*)\"?(.[^>]*)>(.[^<i]*)<\s*/\s*a\s*>'ims",$doc[1],$matches);

I will jump back offline and study your post

Cheers :)

/added
The images, yes, I wanted to remove them. I may want to get any available title="" for images, though I'll try to master the normal hyperlinks first.
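
For the title idea, one possible sketch: read the title out of the image tag before stripping it. The pattern and variable names here are assumptions for illustration, not something posted in the thread, and it only handles a quoted title value.

```php
<?php
// Hypothetical anchor content, as captured by a (.*?) group.
$a_content = 'Aaron<img src="a.png" title="Company logo" />';

// Group 1 is the quote character, group 2 the title text between the quotes.
$titlePat = '{<img\b[^>]*?\btitle\s*=\s*(["\'])(.*?)\1[^>]*>}si';
$title = preg_match($titlePat, $a_content, $t) ? $t[2] : null;

// Then remove the image as before.
$text = preg_replace("'<img[^>]+>'i", '', $a_content);

printf("text=%s title=%s\n", $text, $title);
```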