
Forum Moderators: coopster & jatar k


How do I retrieve 'URL text' from URLs?

by use of cURL etc.

     

OutdoorMan

1:18 pm on Apr 6, 2008 (gmt 0)

5+ Year Member



Hi PHP experts,

I'm building a link checker, but unfortunately I'm stuck.

I use the following code to retrieve URLs from a remote site:


//cURL-code here...

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');

    echo 'Link: <a href="' . $url . '" title="' . $urltext . '">' . $urltext . '</a><br>';
}

My next challenge is to extract the URL text:

<a href="[URL]">[URL text]</a>

... but I don't know how to do this: what to use, and how to fit it into the code above.

Any help would be much appreciated, thanks.

[edited by: OutdoorMan at 1:37 pm (utc) on April 6, 2008]

coopster

3:46 pm on Apr 7, 2008 (gmt 0)

WebmasterWorld Administrator coopster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



You could use the nodeValue property of the DOM node:

$dom = new DOMDocument();
@$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $anchor) {
    $url = $anchor->getAttribute('href');
    $urltext = $anchor->nodeValue;
    echo 'Link: <a href="' . $url . '" title="' . $urltext . '">' . $urltext . '</a><br>';
}
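
For anyone reusing the snippet above, here is a minimal sketch of the same idea wrapped in a function, with the values escaped before they are echoed back into HTML. extractLinks() and the sample markup are made up for illustration; nothing here is specific to the original poster's cURL code.

```php
<?php
// Hypothetical helper: returns one ['url' => ..., 'text' => ...] entry per <a> element.
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them;
    // real-world pages are rarely valid markup.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $links[] = [
            'url'  => $anchor->getAttribute('href'),
            // nodeValue is the concatenated text of all descendant nodes.
            'text' => trim($anchor->nodeValue),
        ];
    }
    return $links;
}

// Sample markup (an assumption, not taken from the thread).
$sample = '<html><body><p><a href="/about">About <b>us</b></a></p></body></html>';

foreach (extractLinks($sample) as $link) {
    // Escape before output so quotes or angle brackets in the
    // remote page cannot break the generated markup.
    $url  = htmlspecialchars($link['url'], ENT_QUOTES);
    $text = htmlspecialchars($link['text'], ENT_QUOTES);
    echo 'Link: <a href="' . $url . '" title="' . $text . '">' . $text . "</a><br>\n";
}
```

Note that nodeValue flattens nested markup, so text inside <b> or <span> children ends up in the link text as well.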

OutdoorMan

11:37 am on Apr 12, 2008 (gmt 0)


Great, coopster -- thanks :)

Though it works, I don't know why I get results like this:

...
Link 1:
Link 2: på Østmøn
Link 3:
Link 4:
...

(Sometimes I get these empty lines, and sometimes there are strange characters in the link text, as shown above.)

Any suggestions?

coopster

12:40 pm on Apr 16, 2008 (gmt 0)


Have you viewed the page source itself to see what is contained in the values? Perhaps the attributes are blank? The other problem looks like an encoding issue.

OutdoorMan

2:22 pm on Apr 18, 2008 (gmt 0)


Thanks again, coopster :)

The page source contains much more information. The empty lines are (of course) caused by empty URL text, for example whenever the script returns an 'a' element that contains only an 'img' element. (Doh! I should have thought of that...)

Do you by chance have any suggestions for how to filter the results like this?

if ("a element contains an img element") {
    // Write img name
    echo 'Link: <a href="' . $url . '" title="' . $imgname . '">' . $imgname . '</a><br>';
} else {
    // Write link text
    echo 'Link: <a href="' . $url . '" title="' . $urltext . '">' . $urltext . '</a><br>';
}

And do you also know how to solve the encoding issue? Can this for example be solved by the use of a curl_setopt setting or something else?

I haven't been able to find a solution to either issue by searching Google or php.net.

Thanks :)

OutdoorMan

6:34 pm on Apr 18, 2008 (gmt 0)


I've solved the encoding issue by using utf8_decode [php.net].
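
A note for later readers: utf8_decode is deprecated as of PHP 8.2. One alternative is to tell the parser the page's encoding before it starts guessing. The sketch below assumes the fetched page really is UTF-8; loadUtf8Html() is a made-up helper name, and prepending a meta tag is a common workaround rather than an official DOMDocument feature.

```php
<?php
// Hypothetical helper: parse HTML that is assumed to be UTF-8 encoded.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();
    // Prepend an explicit charset declaration so the HTML parser reads the
    // following bytes as UTF-8 instead of falling back to a default.
    $hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    // @ hides warnings about the now slightly malformed markup.
    @$dom->loadHTML($hint . $html);
    return $dom;
}

// Sample page with Danish characters, written as raw UTF-8 bytes.
$html = "<html><body><a href=\"/x\">p\xC3\xA5 \xC3\x98stm\xC3\xB8n</a></body></html>";

$anchor = loadUtf8Html($html)->getElementsByTagName('a')->item(0);
// DOM properties such as nodeValue always return UTF-8 strings.
$text = $anchor->nodeValue;
```

If the page that prints the results is itself served as ISO-8859-1, the UTF-8 text still has to be converted on output, for example with mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8'), which needs the mbstring extension.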

But I still haven't found a solution for separating URLs as 'a href' and 'img'.

g1smd

10:43 pm on Apr 18, 2008 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



You'll need to check if there is an <img> tag nested within the <a>, but I have no idea of the code you would need for that. Make that check only when no text is found (just in case you find a link with both an image and some text).

Further problem: what do you do with <a name=".."></a> anchors? Those will be mostly empty too.
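
For reference, the check described above (look for a nested <img> only when the link has no text) can be done with the DOM alone, no regular expressions needed. This is just a sketch: linkLabel() is a made-up helper name, and using the alt text first and the image file name second is an assumption about what "img name" should mean.

```php
<?php
// Hypothetical helper: choose a label for a link.
// Prefers the link text; falls back to a nested image only when there is none.
function linkLabel(DOMElement $anchor): string
{
    $text = trim($anchor->nodeValue);
    if ($text !== '') {
        return $text;
    }

    // getElementsByTagName() on an element searches only its descendants.
    $images = $anchor->getElementsByTagName('img');
    if ($images->length === 0) {
        return ''; // no text and no image
    }

    $img = $images->item(0);
    $alt = trim($img->getAttribute('alt'));
    if ($alt !== '') {
        return $alt;
    }

    // Last resort: the file name part of the src attribute.
    $path = parse_url($img->getAttribute('src'), PHP_URL_PATH);
    return is_string($path) ? basename($path) : '';
}

// Sample markup (an assumption, not taken from the thread).
$dom = new DOMDocument();
@$dom->loadHTML(
    '<body>'
    . '<a href="/a"><img src="/img/logo.png?v=2"></a>'
    . '<a href="/b"><img src="x.gif" alt="Home"></a>'
    . '<a href="/c">Text <img src="y.gif"></a>'
    . '<a href="/d"></a>'
    . '</body>'
);

$labels = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $labels[$a->getAttribute('href')] = linkLabel($a);
}
```

A link with both text and an image keeps its text, which matches the "only when no text is found" advice.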

OutdoorMan

8:52 pm on Apr 19, 2008 (gmt 0)


g1smd > Thanks. Yeah, I'll probably need a regular expression to filter the results (I think?). But so far, hours and hours of searching and online reading haven't brought me closer to a solution.

I also need to handle absolute and relative URLs ('http://www.example.com/something', '/something' etc.). Otherwise relative links resolve against my own site and show up as 'http://www.mysite.com/something' etc.
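
A rough sketch of one way to turn href values into absolute URLs before printing them. resolveUrl() is a made-up helper name, the base is assumed to be the absolute http(s) address of the page that was fetched, and only the common cases are covered: this is not a full RFC 3986 resolver and it ignores any <base> tag on the page.

```php
<?php
// Hypothetical helper: resolve $href against the URL of the page it came from.
function resolveUrl(string $base, string $href): string
{
    if ($href === '') {
        return $base;
    }
    // Already absolute: has a scheme such as http:, https: or mailto:.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href)) {
        return $href;
    }

    $parts    = parse_url($base);
    $scheme   = $parts['scheme'] ?? 'http';
    $origin   = $scheme . '://' . ($parts['host'] ?? '')
              . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $basePath = $parts['path'] ?? '/';

    if (strpos($href, '//') === 0) {          // protocol-relative
        return $scheme . ':' . $href;
    }
    if ($href[0] === '?') {                   // query only
        return $origin . $basePath . $href;
    }
    if ($href[0] === '#') {                   // fragment only
        $query = isset($parts['query']) ? '?' . $parts['query'] : '';
        return $origin . $basePath . $query . $href;
    }

    // Split the path from any query string or fragment.
    $cut    = strcspn($href, '?#');
    $path   = substr($href, 0, $cut);
    $suffix = substr($href, $cut);

    if ($path[0] !== '/') {
        // Relative path: append to the directory of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }

    // Collapse "." and ".." segments.
    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.') {
            $segments[] = $segment;
        }
    }
    $path = '/' . ltrim(implode('/', $segments), '/');

    return $origin . $path . $suffix;
}

$base = 'http://www.example.com/dir/page.html?x=1';
$absolute = resolveUrl($base, '/something');
```

In the link loop, the result of resolveUrl($pageUrl, $url) would be echoed instead of the raw $url, where $pageUrl stands for whatever address was passed to cURL.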

Quoting g1smd: Further problem: what do you do with <a name=".."></a> anchors? Those will be mostly empty too.

So far the script only grabs 'a' elements containing a 'href' attribute, according to this line:

$url = $anchor->getAttribute('href');

a elements like this one: <a name=".."></a> are invisible to the script :)
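
One caveat worth checking: getAttribute('href') returns an empty string when the attribute is missing, so a loop over every 'a' element still visits named anchors and would print them with an empty href. A small sketch of skipping them at query time with XPath follows; the sample markup is made up for illustration.

```php
<?php
// Sample markup (an assumption): one named anchor, one real link.
$html = '<html><body><a name="top"></a><a href="/one">One</a></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);

// Every <a> element, with or without an href.
$allAnchors = $dom->getElementsByTagName('a');
$total = $allAnchors->length;

// A missing attribute comes back as an empty string.
$missing = $allAnchors->item(0)->getAttribute('href');

// Only <a> elements that actually carry an href attribute.
$xpath = new DOMXPath($dom);
$withHref = [];
foreach ($xpath->query('//a[@href]') as $anchor) {
    $withHref[] = $anchor->getAttribute('href');
}
```

An equivalent check inside an existing loop would be $anchor->hasAttribute('href').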

[edited by: OutdoorMan at 8:53 pm (utc) on April 19, 2008]

 
