Forum Moderators: coopster & jatar k

How do I retrieve 'URL text' from URLs?

by use of cURL etc.

     
1:18 pm on Apr 6, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Sept 18, 2006
posts:197
votes: 0


Hi PHP experts,

I'm building a linkchecker but unfortunately I'm stuck.

I use the following code to retrieve URLs from a remote site:


//cURL-code here...

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');

    echo 'Link: <a href="' . $url . '" title="' . $urltext . '">' . $urltext . '</a><br>';
}

My next challenge is that I wish to extract the url text:

<a href="[URL]">[URL text]</a>

... but I don't know how to do this -- what to use and how to implement it into the code above.

Any help would be much appreciated, thanks.

[edited by: OutdoorMan at 1:37 pm (utc) on April 6, 2008]

3:46 pm on Apr 7, 2008 (gmt 0)

Administrator

WebmasterWorld Administrator coopster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:July 31, 2003
posts:12533
votes: 0


You could use the nodeValue property of the DOM object.

$dom = new DOMDocument();
@$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $anchor) {
    $url = $anchor->getAttribute('href');
    $urltext = $anchor->nodeValue;
    echo 'Link: <a href="' . $url . '" title="' . $urltext . '">' . $urltext . '</a><br>';
}

11:37 am on Apr 12, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Sept 18, 2006
posts:197
votes: 0


Great, coopster -- thanks :)

Though it works, I don't know why I get results like this:

...
Link 1:
Link 2: på Østmøn
Link 3:
Link 4:
...

(Sometimes I get these empty lines, and sometimes there are strange characters in some of the link text, as shown above.)

Any suggestions?

12:40 pm on Apr 16, 2008 (gmt 0)

Administrator

WebmasterWorld Administrator coopster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:July 31, 2003
posts:12533
votes: 0


Have you viewed the page source itself to see what is contained in the values? Perhaps the attributes are blank? The other issue there looks to be an encoding issue.
2:22 pm on Apr 18, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Sept 18, 2006
posts:197
votes: 0


Thanks again, coopster :)

The page source contains much more information. The empty lines are (of course) caused by empty url text: for example whenever the script returns an 'a' element that contains an 'img' element or so (Doh! I should have thought of that...)

Do you by chance have any suggestions of how to filter the results like this?

if("a element contains an img element") {
// Write img name
echo 'Link: <a href="' . $url . '" title="' . $imgname . '">' . $imgname. '</a><br>';
}
else
{
// Write link text
echo 'Link: <a href="' . $url . '" title="' . $urltext . '">' . $urltext . '</a><br>';
}

And do you also know how to solve the encoding issue? Can this for example be solved by the use of a curl_setopt setting or something else?

I haven't been able to find a solution to either issue by searching on Google or php.net.

Thanks :)

6:34 pm on Apr 18, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Sept 18, 2006
posts:197
votes: 0


I've solved the encoding issue by using utf8_decode [php.net].
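
For reference, here is a minimal sketch of another way to avoid the garbled link text. It assumes the fetched page is UTF-8 and that the parser does not pick up its charset, in which case loadHTML() may read the bytes as a single-byte encoding and produce output like the sample above. The helper name loadUtf8Html is hypothetical, and the sketch needs the mbstring extension. It may also be worth knowing that utf8_decode is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: parse UTF-8 HTML without relying on charset detection.
// Assumption: $html really is UTF-8 (for example, fetched with cURL from a UTF-8 page).
function loadUtf8Html(string $html): DOMDocument
{
    // Turn every non-ASCII character into a numeric entity ("å" becomes "&#229;"),
    // so the parser only ever sees plain ASCII bytes.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, 'UTF-8');

    $dom = new DOMDocument();
    // Suppress warnings about sloppy real-world markup, as in the code above.
    @$dom->loadHTML($ascii);
    return $dom;
}

$dom = loadUtf8Html('<html><body><a href="/x">på Østmøn</a></body></html>');
$anchor = $dom->getElementsByTagName('a')->item(0);

// The DOM extension hands strings back as UTF-8.
echo $anchor->nodeValue, "\n";
```

With this approach the output stays UTF-8, so the page that prints the results should declare UTF-8 as well.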

But I still haven't found a solution for separating URLs as 'a href' and 'img'.

10:43 pm on Apr 18, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


You'll need to check if there is an <img> tag nested within the <a>, but I have no idea of the code you would need for that. Make that check only when no text is found (just in case you find a link with both an image and some text).

Further problem: what do you do with <a name=".."></a> anchors? Those will be mostly empty too.
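
A minimal sketch of the check described above, under a few assumptions: the link text is used when there is any, the nested image is consulted only when the text is empty, and the "img name" is taken to be the alt text or, failing that, the file name from src. The helper name linkLabel and the sample HTML are hypothetical.

```php
<?php
// Hypothetical helper: choose a label for a link.
// Prefers the visible text; falls back to a nested <img> only when there is none.
function linkLabel(DOMElement $anchor): string
{
    $text = trim((string) $anchor->nodeValue);
    if ($text !== '') {
        return $text;
    }
    // Only reached when the link has no text of its own.
    $img = $anchor->getElementsByTagName('img')->item(0);
    if ($img !== null) {
        $alt = trim($img->getAttribute('alt'));
        // Assumption: the file name is an acceptable stand-in when alt is empty.
        return $alt !== '' ? $alt : basename($img->getAttribute('src'));
    }
    return '';
}

$html = '<html><body>'
      . '<a href="/a">Plain text</a>'
      . '<a href="/b"><img src="/img/logo.png" alt=""></a>'
      . '<a href="/c"><img src="/img/map.png" alt="Map"></a>'
      . '<a name="top"></a>'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);

$labels = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    // Named anchors such as <a name="top"></a> have no href; skip them.
    if (!$anchor->hasAttribute('href')) {
        continue;
    }
    $labels[$anchor->getAttribute('href')] = linkLabel($anchor);
}

foreach ($labels as $url => $label) {
    // Escape both values before writing them back into HTML.
    $u = htmlspecialchars($url);
    $l = htmlspecialchars($label);
    echo 'Link: <a href="' . $u . '" title="' . $l . '">' . $l . '</a><br>' . "\n";
}
```

No regular expression is needed for this part, since the DOM already exposes the nested elements.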

8:52 pm on Apr 19, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Sept 18, 2006
posts:197
votes: 0


g1smd > Thanks. Yeah, I'll probably need some reg.exp. to filter the results (I think?). But so far hours and hours of search and online reading haven't brought me closer to a solution.

I also need to handle absolute and relative URLs ('http://www.example.com/something', '/something' etc.) otherwise the links show up as: 'http://www.mysite.com/something' etc.
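
A rough sketch of how relative links could be made absolute, assuming the URL of the fetched page is known and is a well-formed http(s) URL. The function name resolveUrl is hypothetical, and this is a simplified resolver rather than a complete implementation of the URL standard (it ignores user names and passwords in the base URL, for example).

```php
<?php
// Hypothetical helper: make a link found in an href attribute absolute.
// Assumption: $base is a well-formed absolute http(s) URL of the fetched page.
function resolveUrl(string $base, string $rel): string
{
    if ($rel === '') {
        return $base;
    }
    // Links that already have a scheme (http:, https:, mailto:, ...) stay as they are.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $rel)) {
        return $rel;
    }
    $b = parse_url($base);
    $origin = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    if (strpos($rel, '//') === 0) {   // protocol-relative: //host/path
        return $b['scheme'] . ':' . $rel;
    }
    if ($rel[0] === '?') {            // query only: keep the base path
        return $origin . $basePath . $rel;
    }
    if ($rel[0] === '#') {            // fragment only: keep base path and query
        return $origin . $basePath . (isset($b['query']) ? '?' . $b['query'] : '') . $rel;
    }
    // Root-relative links replace the whole path, others only the last segment.
    $path = $rel[0] === '/'
        ? $rel
        : substr($basePath, 0, strrpos($basePath, '/') + 1) . $rel;

    // Set ?query/#fragment aside so that only the path part is normalised.
    $cut = strcspn($path, '?#');
    $suffix = substr($path, $cut);
    $parts = explode('/', ltrim(substr($path, 0, $cut), '/'));
    $last = count($parts) - 1;
    $segments = [];
    foreach ($parts as $i => $seg) {
        if ($seg === '.' || $seg === '..') {
            if ($seg === '..') {
                array_pop($segments);   // step up one directory
            }
            if ($i === $last) {
                $segments[] = '';       // keep the trailing slash
            }
        } else {
            $segments[] = $seg;
        }
    }
    return $origin . '/' . implode('/', $segments) . $suffix;
}

$base = 'http://www.example.com/dir/page.html?x=1';
foreach (['http://other.example.org/a', '/something', 'other.html', '../img/a.png'] as $href) {
    echo $href, ' => ', resolveUrl($base, $href), "\n";
}
```

In the link checker, $base would be the URL passed to cURL, and resolveUrl() would be applied to each $url before it is printed.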

Further problem: what do you do with <a name=".."></a> anchors? Those will be mostly empty too.

So far the script only grabs 'a' elements containing a 'href' attribute, according to this line:

$url = $anchor->getAttribute('href');

a elements like this one: <a name=".."></a> are invisible to the script :)
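
One caveat may be worth checking here: getAttribute() returns an empty string when the attribute is missing, so getElementsByTagName('a') still returns named anchors; they just end up with an empty $url. A small sketch of filtering them out at selection time, assuming XPath is acceptable as in the first post (the sample HTML is made up):

```php
<?php
$html = '<html><body>'
      . '<a name="top"></a>'
      . '<a href="/something">Something</a>'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);

// getElementsByTagName() returns every <a>, with or without href.
$all = $dom->getElementsByTagName('a');

// The [@href] predicate keeps only anchors that carry an href attribute.
$xpath = new DOMXPath($dom);
$links = $xpath->query('//a[@href]');

echo $all->length, ' anchors, ', $links->length, " with href\n";
foreach ($links as $link) {
    echo $link->getAttribute('href'), "\n";
}
```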

[edited by: OutdoorMan at 8:53 pm (utc) on April 19, 2008]