
PHP Server Side Scripting Forum

    
How do I retrieve 'URL text' from URLs?
by use of cURL etc.
OutdoorMan




msg:3620284
 1:18 pm on Apr 6, 2008 (gmt 0)

Hi PHP experts,

I'm building a linkchecker but unfortunately I'm stuck.

I use the following code to retrieve URLs from a remote site:


//cURL-code here...

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');

    echo 'Link: <a href="' . $url . '" title="' . $urltext . '">' . $urltext . '</a><br>';
}

My next challenge is that I wish to extract the url text:

<a href="[URL]">[URL text]</a>

... but I don't know how to do this -- what to use and how to implement it into the code above.

Any help would be much appreciated, thanks.

[edited by: OutdoorMan at 1:37 pm (utc) on April 6, 2008]

 

coopster




msg:3620972
 3:46 pm on Apr 7, 2008 (gmt 0)

You could use the nodeValue property of the DOM object:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $anchor) {
    $url = $anchor->getAttribute('href');
    $urltext = $anchor->nodeValue;
    echo 'Link: <a href="' . $url . '" title="' . $urltext . '">' . $urltext . '</a><br>';
}
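
For anyone who wants to try this without a cURL fetch, here is a minimal self-contained sketch of the same idea. The sample markup and the helper name extractLinks() are assumptions made for illustration, not part of the original code. It also trims the link text and escapes it before echoing, since scraped text may contain quotes or angle brackets.

```php
<?php
// Sample markup standing in for the HTML fetched via cURL (assumption).
$html = '<html><body>'
      . '<a href="/about"> About us </a>'
      . '<a href="http://www.example.com/">Example &amp; Co</a>'
      . '</body></html>';

/**
 * Return url/text pairs for every <a> element in $html.
 * extractLinks() is a hypothetical helper, not part of the thread.
 */
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    // Suppress warnings caused by malformed real-world HTML.
    @$dom->loadHTML($html);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $links[] = [
            'url'  => $anchor->getAttribute('href'),
            // nodeValue is the concatenated text of all descendant nodes.
            'text' => trim($anchor->nodeValue),
        ];
    }
    return $links;
}

foreach (extractLinks($html) as $link) {
    // Escape before echoing so the scraped values cannot break the markup.
    $url  = htmlspecialchars($link['url'], ENT_QUOTES);
    $text = htmlspecialchars($link['text'], ENT_QUOTES);
    echo 'Link: <a href="' . $url . '" title="' . $text . '">' . $text . "</a><br>\n";
}
```

Returning an array instead of echoing inside the loop keeps the parsing step separate from the output step, which may help once filtering is added.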

OutdoorMan




msg:3625263
 11:37 am on Apr 12, 2008 (gmt 0)

Great, coopster -- thanks :)

Though it works, I don't know why I get results like this:

...
Link 1:
Link 2: på Østmøn
Link 3:
Link 4:
...

(Sometimes I get these empty lines, and sometimes there are strange characters in some of the link text, as shown above.)

Any suggestions?

coopster




msg:3628105
 12:40 pm on Apr 16, 2008 (gmt 0)

Have you viewed the page source itself to see what the values contain? Perhaps the attributes are blank? The other problem looks like an encoding issue.

OutdoorMan




msg:3629985
 2:22 pm on Apr 18, 2008 (gmt 0)

Thanks again, coopster :)

The page source contains much more information. The empty lines are (of course) caused by empty url text: for example whenever the script returns an 'a' element that contains an 'img' element or so (Doh! I should have thought of that...)

Do you by chance have any suggestions of how to filter the results like this?

if("a element contains an img element") {
// Write img name
echo 'Link: <a href="' . $url . '" title="' . $imgname . '">' . $imgname. '</a><br>';
}
else
{
// Write link text
echo 'Link: <a href="' . $url . '" title="' . $urltext . '">' . $urltext . '</a><br>';
}

And do you also know how to solve the encoding issue? Can this for example be solved by the use of a curl_setopt setting or something else?

I haven't been able to find a solution to either issue by searching on Google or php.net.

Thanks :)

OutdoorMan




msg:3630149
 6:34 pm on Apr 18, 2008 (gmt 0)

I've solved the encoding issue by using utf8_decode [php.net].
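
A note for later readers: utf8_decode() is deprecated as of PHP 8.2, and output such as "pÃ¥" is what UTF-8 bytes look like when they are read as ISO-8859-1. Below is a rough sketch of one alternative. It assumes the fetched page is UTF-8 and that the mbstring extension is available. The helper name loadUtf8Html() and the sample markup are made up for illustration. The idea is to turn every non-ASCII character into a numeric entity before parsing, so the parser never has to guess the charset.

```php
<?php
// Assumption: the fetched page is UTF-8 but declares no charset, so the
// parser may guess a single-byte encoding and garble non-ASCII characters.
$html = '<html><body><a href="/kort">på Østmøn</a></body></html>';

/**
 * Hypothetical helper: parse UTF-8 HTML without relying on charset guessing.
 * Non-ASCII characters are converted to numeric entities first, so the
 * markup handed to the parser is plain ASCII.
 */
function loadUtf8Html(string $html): DOMDocument
{
    // Conversion map covering every code point from U+0080 upward.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, 'UTF-8');

    $dom = new DOMDocument();
    // Suppress warnings caused by malformed real-world HTML.
    @$dom->loadHTML($ascii);
    return $dom;
}

$dom = loadUtf8Html($html);
foreach ($dom->getElementsByTagName('a') as $anchor) {
    // The DOM extension returns strings as UTF-8, entities already decoded.
    echo $anchor->nodeValue, "\n";
}
```

If the result is echoed into a page, that page should itself be served as UTF-8. Otherwise the text still needs converting to the output charset.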

But I still haven't found a solution for separating URLs as 'a href' and 'img'.

g1smd




msg:3630220
 10:43 pm on Apr 18, 2008 (gmt 0)

You'll need to check if there is an <img> tag nested within the <a>, but I have no idea of the code you would need for that. Make that check only when no text is found (just in case you find a link with both an image and some text).

Further problem: what do you do with <a name=".."></a> anchors? Those will be mostly empty too.
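
The image check described above can be sketched with the DOM methods already used in this thread, without a regular expression. This is only one possible approach. The helper name linkLabel(), the sample markup, and the choice to use the alt text (falling back to the image file name) as the "img name" are all assumptions.

```php
<?php
// Sample markup standing in for the fetched page (assumption).
$html = '<html><body>'
      . '<a href="/home">Home</a>'
      . '<a href="/logo"><img src="/img/logo.png" alt="Company logo"></a>'
      . '<a href="/bare"><img src="/img/spacer.gif"></a>'
      . '</body></html>';

/**
 * Hypothetical helper: pick a label for an <a> element.
 * Text wins; an image is consulted only when no text is found.
 */
function linkLabel(DOMElement $anchor): string
{
    $text = trim($anchor->nodeValue);
    if ($text !== '') {
        return $text;
    }
    // No text: look for an <img> nested anywhere inside the anchor.
    $images = $anchor->getElementsByTagName('img');
    if ($images->length > 0) {
        $img = $images->item(0);
        $alt = trim($img->getAttribute('alt'));
        // Fall back to the image file name when the alt text is empty.
        return $alt !== '' ? $alt : basename($img->getAttribute('src'));
    }
    return '';
}

$dom = new DOMDocument();
// Suppress warnings caused by malformed real-world HTML.
@$dom->loadHTML($html);

// Map each href to the label chosen for it.
$labels = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $labels[$anchor->getAttribute('href')] = linkLabel($anchor);
}
print_r($labels);
```

Because the text is checked first, a link that contains both an image and some text keeps its text.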

OutdoorMan




msg:3630677
 8:52 pm on Apr 19, 2008 (gmt 0)

g1smd > Thanks. Yeah, I'll probably need a regular expression to filter the results (I think?). But so far, hours and hours of searching and online reading haven't brought me any closer to a solution.

I also need to handle absolute and relative URLs ('http://www.example.com/something', '/something' etc.), otherwise relative links are resolved against my own site and show up as 'http://www.mysite.com/something' etc.
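
One way to handle this is to resolve every href against the URL of the page it was scraped from before printing it. The function below is a simplified sketch, not a full RFC 3986 implementation. The name resolveUrl() and the sample base URL are assumptions.

```php
<?php
/**
 * Hypothetical helper: resolve $rel against $base.
 * A simplified sketch, not a complete RFC 3986 resolver.
 */
function resolveUrl(string $base, string $rel): string
{
    // Already absolute: starts with a scheme such as http: or mailto:
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $rel)) {
        return $rel;
    }
    $b = parse_url($base);
    $scheme = $b['scheme'] ?? 'http';
    // Protocol-relative reference: //host/path
    if (strpos($rel, '//') === 0) {
        return $scheme . ':' . $rel;
    }
    $root = $scheme . '://' . ($b['host'] ?? '')
          . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Empty or fragment-only reference: keep the base path and query.
    if ($rel === '' || $rel[0] === '#') {
        $query = isset($b['query']) ? '?' . $b['query'] : '';
        return $root . $basePath . $query . $rel;
    }
    // Query-only reference: keep the base path, replace the query.
    if ($rel[0] === '?') {
        return $root . $basePath . $rel;
    }

    // Separate the path from any query string or fragment.
    $cut = strcspn($rel, '?#');
    $path = substr($rel, 0, $cut);
    $suffix = substr($rel, $cut);

    if ($path[0] !== '/') {
        // Path-relative: append to the directory part of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }
    // A trailing "." or ".." refers to a directory, so keep a final slash.
    if (preg_match('~/\.\.?$~', $path)) {
        $path .= '/';
    }
    // Collapse "." and ".." segments without climbing above the root.
    $out = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    return $root . implode('/', $out) . $suffix;
}

// The base is the URL of the scraped page (sample value, an assumption).
$base = 'http://www.example.com/dir/page.html';
foreach (['http://www.example.org/x', '/something', 'other.html', '../img/a.png'] as $href) {
    echo resolveUrl($base, $href), "\n";
}
```

In the link checker, $base would be the URL passed to cURL, and each scraped href would go through resolveUrl() before being echoed.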

Further problem: what do you do with <a name=".."></a> anchors? Those will be mostly empty too.

So far the script only grabs 'a' elements containing a 'href' attribute, according to this line:

$url = $anchor->getAttribute('href');

a elements like this one: <a name=".."></a> are invisible to the script :)
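
A caveat worth checking: getElementsByTagName('a') returns every 'a' element, and getAttribute('href') returns an empty string when the attribute is missing, so named anchors can still show up as entries with an empty URL. The sketch below, with made-up sample markup, uses an XPath predicate to select only anchors with a non-empty href.

```php
<?php
// Sample markup standing in for the fetched page (assumption).
$html = '<html><body>'
      . '<a name="top"></a>'
      . '<a href="/something">Something</a>'
      . '<a href="">Empty</a>'
      . '</body></html>';

$dom = new DOMDocument();
// Suppress warnings caused by malformed real-world HTML.
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select only <a> elements that carry a non-empty href attribute.
$urls = [];
foreach ($xpath->query('//a[@href != ""]') as $anchor) {
    $urls[] = $anchor->getAttribute('href');
}
print_r($urls);
```

Inside a plain foreach over all anchors, $anchor->hasAttribute('href') offers a similar check.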

[edited by: OutdoorMan at 8:53 pm (utc) on April 19, 2008]
