I'm building a link checker, but unfortunately I'm stuck.
I use the following code to retrieve URLs from a remote site:
// cURL code here...

// parse the HTML into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo 'Link: <a href="' . $url . '" title="' . $urltext . '">' . $urltext . '</a><br>';
}
My next challenge is that I wish to extract the url text:
<a href="[URL]">[URL text]</a>
... but I don't know how to do this -- what to use and how to implement it into the code above.
Any help would be much appreciated, thanks.
[edited by: OutdoorMan at 1:37 pm (utc) on April 6, 2008]
Use the nodeValue property of the DOM object:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $anchor) {
$url = $anchor->getAttribute('href');
$urltext = $anchor->nodeValue;
echo 'Link: <a href="' . $url . '" title="' . $urltext . '">' . $urltext . '</a><br>';
}
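A side note on the echo line above: the link text and URL come from a remote page, so it is probably worth escaping them before writing them into your own HTML. Below is a minimal sketch, assuming UTF-8 output; renderLink is a made-up helper name, not something from the DOM extension.

```php
<?php
// Hypothetical helper: build one output line with the values escaped,
// so quotes or ampersands in the remote text cannot break the markup.
function renderLink(string $url, string $text): string
{
    $safeUrl  = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    $safeText = htmlspecialchars(trim($text), ENT_QUOTES, 'UTF-8');

    return 'Link: <a href="' . $safeUrl . '" title="' . $safeText . '">'
         . $safeText . '</a><br>';
}

// Sample values standing in for $url and $urltext from the loop above
$line = renderLink('/a?x=1&y=2', ' Say "hi" ');
echo $line, "\n";
```

Inside the foreach loop you would call renderLink($url, $urltext) in place of the hand-built string.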
The page source contains much more information. The empty lines are (of course) caused by empty URL text, for example whenever the script returns an 'a' element that contains only an 'img' element or similar (Doh! I should have thought of that...)
Do you by chance have any suggestions of how to filter the results like this?
if("a element contains an img element") {
// Write img name
echo 'Link: <a href="' . $url . '" title="' . $imgname . '">' . $imgname. '</a><br>';
}
else
{
// Write link text
echo 'Link: <a href="' . $url . '" title="' . $urltext . '">' . $urltext . '</a><br>';
}
And do you also know how to solve the encoding issue? Can this for example be solved by the use of a curl_setopt setting or something else?
I haven't been able to find a solution to either issue by searching on Google or php.net.
Thanks :)
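On the encoding question: the thread doesn't say exactly what goes wrong, so this is a guess. If the remote page is UTF-8 and accented characters come out garbled, the usual cause is that loadHTML() falls back to a single-byte charset when the markup carries no charset declaration. As far as I know no curl_setopt setting changes that (CURLOPT_ENCODING is about gzip/deflate, not character sets), so the hint has to be given at parse time. A minimal sketch, assuming the input really is UTF-8; loadUtf8Html is a made-up helper name.

```php
<?php
// Hypothetical helper: parse HTML that is assumed to be UTF-8.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();

    // Collect parser warnings instead of silencing them with @
    $previous = libxml_use_internal_errors(true);

    // The prepended meta tag only serves as a charset hint for the parser
    $hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    $dom->loadHTML($hint . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Sample markup with a non-ASCII character in the link text
$html = "<html><body><a href=\"/cafe\">Caf\u{00E9}</a></body></html>";
$dom  = loadUtf8Html($html);
$text = $dom->getElementsByTagName('a')->item(0)->nodeValue;
echo $text, "\n";
```

If the remote site uses some other charset, you would have to convert the string to UTF-8 first (for example based on the Content-Type header cURL reports).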
But I still haven't found a solution for separating URLs as 'a href' and 'img'.
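One possible way to tell the two cases apart is to look for an 'img' child when the anchor has no text of its own. This is only a sketch: linkLabel is a made-up helper, and "img name" is assumed here to mean the alt text, falling back to the file name from src.

```php
<?php
// Hypothetical helper: choose a label for an <a> element.
// Uses the link text if there is any, otherwise the first image's
// alt text, otherwise the image file name.
function linkLabel(DOMElement $anchor): string
{
    $text = trim($anchor->nodeValue);
    if ($text !== '') {
        return $text;
    }

    $images = $anchor->getElementsByTagName('img');
    if ($images->length > 0) {
        $img = $images->item(0);
        $alt = trim($img->getAttribute('alt'));
        return $alt !== '' ? $alt : basename($img->getAttribute('src'));
    }

    return ''; // no text and no image
}

// Sample markup covering the three cases
$html = '<html><body>'
      . '<a href="/a">Plain text</a>'
      . '<a href="/b"><img src="/img/logo.png" alt="Logo"></a>'
      . '<a href="/c"><img src="/img/photo.jpg"></a>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$labels = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $labels[] = linkLabel($anchor);
}
print_r($labels);
```

Checking the text first means an anchor that holds both an image and text keeps its text; swap the order if you would rather report the image.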
Further problem: what do you do with <a name=".."></a> anchors? Those will be mostly empty too.
I also need to handle absolute and relative URLs ('http://www.example.com/something', '/something' etc.); otherwise the links show up as 'http://www.mysite.com/something' etc.
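For the relative URLs, the href has to be resolved against the address of the page it was found on before it is printed. Here is a simplified sketch; resolveUrl is a made-up helper, the base URL is a sample value, and it deliberately skips parts of full RFC 3986 resolution (for instance it does not collapse "../" segments).

```php
<?php
// Hypothetical helper: resolve $href against the URL of the fetched page.
// Simplified; not a complete RFC 3986 implementation.
function resolveUrl(string $base, string $href): string
{
    if ($href === '') {
        return $base;
    }
    // Already absolute: starts with a scheme such as http: or mailto:
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '')
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path   = $parts['path'] ?? '/';

    if (strpos($href, '//') === 0) {             // protocol-relative
        return $scheme . ':' . $href;
    }
    if ($href[0] === '/') {                      // root-relative
        return $origin . $href;
    }
    if ($href[0] === '#' || $href[0] === '?') {  // same page
        return $origin . $path . $href;
    }

    // Document-relative: keep the directory part of the base path
    $dir = substr($path, 0, (int) strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

$base = 'http://www.example.com/docs/page.html';
echo resolveUrl($base, '/something'), "\n";
echo resolveUrl($base, 'other.html'), "\n";
```

In the loop you would pass the URL you handed to cURL as $base and the raw href attribute as $href.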
Further problem: what do you do with <a name=".."></a> anchors? Those will be mostly empty too.
So far the script only grabs 'a' elements containing a 'href' attribute, according to this line:
$url = $anchor->getAttribute('href');
a elements like this one: <a name=".."></a> are invisible to the script :)
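A small caveat, in case it helps: getElementsByTagName('a') still returns named anchors, and getAttribute('href') simply gives back an empty string for them, so they are not skipped automatically. If you want to leave them out explicitly, an XPath predicate is one option. A minimal sketch with made-up sample markup:

```php
<?php
// Sample markup: one named anchor without href, one ordinary link
$html = '<html><body>'
      . '<a name="top"></a>'
      . '<a href="/something">Something</a>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Every <a> element is returned; the named anchor yields an empty href
$all = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $all[] = $anchor->getAttribute('href');
}

// The [@href] predicate keeps only anchors that carry an href attribute
$xpath    = new DOMXPath($dom);
$withHref = [];
foreach ($xpath->query('//a[@href]') as $anchor) {
    $withHref[] = $anchor->getAttribute('href');
}

print_r($all);
print_r($withHref);
```

Calling $anchor->hasAttribute('href') inside the existing foreach loop and skipping on false would do the same job without XPath.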
[edited by: OutdoorMan at 8:53 pm (utc) on April 19, 2008]