How do I retrieve 'URL text' from URLs?

Forum Moderators: coopster

How do I retrieve 'URL text' from URLs?

by use of cURL etc.

OutdoorMan

1:18 pm on Apr 6, 2008 (gmt 0)

Hi PHP experts,

I'm building a linkchecker but unfortunately I'm stuck.

I use the following code to retrieve URLs from a remote site:

// cURL code here...
// Parse the HTML into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);
// Grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    // $urltext is not set anywhere yet; that is the missing piece
    echo 'Link: <a href="' . $url . '" title="' . $urltext . '">' . $urltext . '</a><br>';
}

My next challenge is that I wish to extract the url text:

<a href="[URL]">[URL text]</a>

... but I don't know how to do this -- what to use and how to implement it into the code above.

Any help would be much appreciated, thanks.

[edited by: OutdoorMan at 1:37 pm (utc) on April 6, 2008]

coopster

3:46 pm on Apr 7, 2008 (gmt 0)

You could use the nodeValue property of the DOM object.

$dom = new DOMDocument();
@$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $anchor) {
    $url = $anchor->getAttribute('href');
    $urltext = $anchor->nodeValue;
    echo 'Link: <a href="' . $url . '" title="' . $urltext . '">' . $urltext . '</a><br>';
}
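For an element node, PHP's nodeValue holds the same string as textContent: the text of the element and all of its descendants, with the tags removed. Here is a minimal, self-contained sketch of that behaviour. The sample HTML and the example.com URL are made up for illustration, and the htmlspecialchars() calls are an addition so that the extracted text can be echoed safely.

```php
<?php
// Made-up sample markup: the link text contains a nested <b> element.
$html = '<html><body><a href="http://www.example.com/">Example <b>site</b></a></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ silences warnings about sloppy real-world HTML

$anchor  = $dom->getElementsByTagName('a')->item(0);
$url     = $anchor->getAttribute('href');
$urltext = $anchor->nodeValue; // text of the element and its descendants, tags stripped

// Escape before echoing, since link text may contain quotes or angle brackets.
echo 'Link: <a href="' . htmlspecialchars($url) . '" title="' . htmlspecialchars($urltext) . '">'
    . htmlspecialchars($urltext) . '</a><br>' . "\n";
```

In this sketch $urltext ends up as "Example site", with the text inside the nested element included.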

OutdoorMan

11:37 am on Apr 12, 2008 (gmt 0)

Great, coopster -- thanks :)

Though it works, I don't know why I get results like this:

...
Link 1:
Link 2: på Østmøn
Link 3:
Link 4:
...

(Sometimes I get these empty lines, and sometimes there are strange characters in some of the link text, as shown above.)

Any suggestions?

coopster

12:40 pm on Apr 16, 2008 (gmt 0)

Have you viewed the page source itself to see what is contained in the values? Perhaps the attributes are blank? The other problem looks like an encoding issue.

OutdoorMan

2:22 pm on Apr 18, 2008 (gmt 0)

Thanks again, coopster :)

The page source contains much more information. The empty lines are (of course) caused by empty URL text, for example whenever the script returns an 'a' element that contains only an 'img' element. (Doh! I should have thought of that...)

Do you by chance have any suggestions of how to filter the results like this?

if("a element contains an img element") {
// Write img name
echo 'Link: <a href="' . $url . '" title="' . $imgname . '">' . $imgname. '</a><br>';
}
else
{
// Write link text
echo 'Link: <a href="' . $url . '" title="' . $urltext . '">' . $urltext . '</a><br>';
}

And do you also know how to solve the encoding issue? Can this for example be solved by the use of a curl_setopt setting or something else?

I haven't been able to find a solution to either issue by searching on Google or php.net.

Thanks :)

OutdoorMan

6:34 pm on Apr 18, 2008 (gmt 0)

I've got the encoding issue solved by the use of utf8_decode (php.net).
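Some background on why this works: PHP's DOM extension hands back text as UTF-8, so a character like "å" arrives as two bytes and shows up as "Ã¥" when the results page is displayed as ISO-8859-1. utf8_decode() converts UTF-8 to ISO-8859-1, but it is deprecated as of PHP 8.2. The sketch below shows one possible replacement with mb_convert_encoding(), assuming the mbstring extension is available. The other option, left as a comment, is to keep the text as UTF-8 and declare that charset on the results page.

```php
<?php
// "på Østmøn" written out as UTF-8 bytes, which is how the DOM extension returns text.
$utf8 = "p\xC3\xA5 \xC3\x98stm\xC3\xB8n";

// Option 1: convert to ISO-8859-1 for a page that is served in that charset.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Option 2 (alternative): leave the text as UTF-8 and tell the browser so.
// header('Content-Type: text/html; charset=utf-8');

// Prints: 12 bytes as UTF-8, 9 bytes as ISO-8859-1
echo strlen($utf8), ' bytes as UTF-8, ', strlen($latin1), " bytes as ISO-8859-1\n";
```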

But I still haven't found a solution for separating the links into text links ('a href') and image links ('img').

g1smd

10:43 pm on Apr 18, 2008 (gmt 0)

You'll need to check if there is an <img> tag nested within the <a>, but I have no idea of the code you would need for that. Make that check only when no text is found (just in case you find a link with both an image and some text).
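That check can be done with the DOM itself rather than a regular expression. A possible sketch, not a definitive implementation: the sample markup is made up, and falling back to the alt attribute first and then the file name from src is just one assumed way of getting an "image name".

```php
<?php
// Made-up sample markup: a text link, an image-only link, and a link with both.
$html = '<html><body>
<a href="/text">Plain text link</a>
<a href="/image"><img src="/img/logo.png" alt="Company logo"></a>
<a href="/both"><img src="/img/icon.gif" alt="Icon"> Icon with text</a>
</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);

$labels = array();
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $url   = $anchor->getAttribute('href');
    $label = trim($anchor->nodeValue);

    // Look for a nested <img> only when the link has no text of its own.
    if ($label === '') {
        $imgs = $anchor->getElementsByTagName('img');
        if ($imgs->length > 0) {
            $img   = $imgs->item(0);
            $label = $img->getAttribute('alt');
            if ($label === '') {
                // No alt text: fall back to the file name of the image.
                $label = basename($img->getAttribute('src'));
            }
        }
    }

    $labels[$url] = $label;
    echo 'Link: <a href="' . htmlspecialchars($url) . '" title="' . htmlspecialchars($label) . '">'
        . htmlspecialchars($label) . '</a><br>' . "\n";
}
```

With this sample, the image-only link is labelled "Company logo", while the link that has both an image and text keeps its text.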

Further problem: what do you do with <a name=".."></a> anchors? Those will be mostly empty too.

OutdoorMan

8:52 pm on Apr 19, 2008 (gmt 0)

g1smd > Thanks. Yeah, I'll probably need a regular expression to filter the results (I think?). But so far hours and hours of searching and online reading haven't brought me closer to a solution.

I also need to handle absolute and relative URLs ('http://www.example.com/something', '/something' etc.). Otherwise the relative links resolve against my own site and show up as 'http://www.mysite.com/something' etc.
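One way to approach this is to resolve every href against the URL of the page it was found on before printing it. The resolveUrl() function below is a hypothetical helper written for this thread, not a built-in PHP function, and it is a simplified sketch rather than a full RFC 3986 resolver: it covers absolute, protocol-relative, root-relative and path-relative links, and ignores any query string on the base URL.

```php
<?php
// Hypothetical helper: turn $href into an absolute URL, using $base as the page it came from.
function resolveUrl($base, $href)
{
    // Already absolute: starts with a scheme such as http: or mailto:
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    $host   = isset($parts['host']) ? $parts['host'] : '';
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $path   = isset($parts['path']) ? $parts['path'] : '/';
    $root   = $scheme . '://' . $host . $port;

    // Protocol-relative: //cdn.example.com/x
    if (strpos($href, '//') === 0) {
        return $scheme . ':' . $href;
    }

    // Split off ?query and #fragment so they are not treated as path segments.
    $pos      = strcspn($href, '?#');
    $tail     = (string) substr($href, $pos);
    $hrefPath = (string) substr($href, 0, $pos);

    if ($hrefPath === '') {
        return $root . $path . $tail;           // e.g. "#top" or "?page=2"
    }
    if ($hrefPath[0] === '/') {
        return $root . $hrefPath . $tail;       // root-relative: /something
    }

    // Path-relative: resolve against the directory of the base path.
    $dir      = substr($path, 0, strrpos($path, '/') + 1);
    $combined = $dir . $hrefPath;
    $segments = array();
    foreach (explode('/', $combined) as $seg) {
        if ($seg === '..') {
            array_pop($segments);
        } elseif ($seg !== '.' && $seg !== '') {
            $segments[] = $seg;
        }
    }
    $trail = (substr($combined, -1) === '/' && $segments) ? '/' : '';

    return $root . '/' . implode('/', $segments) . $trail . $tail;
}

$base = 'http://www.example.com/dir/page.html';
echo resolveUrl($base, '/something'), "\n";
echo resolveUrl($base, 'other.html'), "\n";
```

In the loop, $url = resolveUrl($pageUrl, $anchor->getAttribute('href')) would then replace the plain getAttribute() call, where $pageUrl is assumed to hold the address that was passed to cURL.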

Quote (g1smd): Further problem: what do you do with <a name=".."></a> anchors? Those will be mostly empty too.

So far the script only grabs 'a' elements containing a 'href' attribute, according to this line:

$url = $anchor->getAttribute('href');

a elements like this one: <a name=".."></a> are invisible to the script :)
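A caveat on that point: getElementsByTagName('a') returns every 'a' element, and getAttribute('href') returns an empty string when the attribute is missing, so named anchors can still come through as links with an empty URL. A small sketch of filtering them out explicitly with an XPath query follows. The sample markup is made up for illustration.

```php
<?php
// Made-up sample markup: two real links, one named anchor, one <a> without href.
$html = '<html><body>
<a name="top"></a>
<a href="/one">One</a>
<a id="x">No href</a>
<a href="/two">Two</a>
</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);

// Every <a> element, with or without an href attribute.
$all = $dom->getElementsByTagName('a');

// Only the <a> elements that actually carry an href attribute.
$xpath    = new DOMXPath($dom);
$withHref = $xpath->query('//a[@href]');

$urls = array();
foreach ($withHref as $anchor) {
    $urls[] = $anchor->getAttribute('href');
}

echo $all->length, ' anchors in total, ', $withHref->length, " with an href\n";
```

Inside a foreach over getElementsByTagName('a'), an equivalent check would be if (!$anchor->hasAttribute('href')) { continue; }.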

[edited by: OutdoorMan at 8:53 pm (utc) on April 19, 2008]