Forum Moderators: coopster


How do you spider a page and search for a specific anchor text

Anyone have some code lying around

         

Clark

8:51 pm on Aug 28, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've scoured the net for some simple code, and my regex experience is nil. I find the darn thing totally confusing. Does anyone have a snippet of code that will grab a URL, skip the headers, pull out all links in the body, and provide the anchor text for each link?

coopster

10:17 pm on Aug 28, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



You should scour this forum; it seems there is one of these a week ;)

Here is one ...

[webmasterworld.com...]

lazydog

9:03 pm on Aug 29, 2004 (gmt 0)

10+ Year Member



Hi!

With PHP 5 and DOM there is a much cleaner way of doing it. And no regex required ;-)

Here goes -


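<?php
// Requires PHP 5 with the DOM extension.
$dom = new DOMDocument();
// '@' suppresses the warnings libxml emits for malformed real-world HTML.
@$dom->loadHTMLFile('http://www.webmasterworld.com/');
$anchor = $dom->getElementsByTagName('a');

foreach ($anchor as $node) {
    $anchor_href = $node->getAttribute('href'); // link target
    $anchor_text = $node->textContent;          // visible anchor text
    // compare or store $anchor_href and $anchor_text here
}
?>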

Use the STRING functions to compare $anchor_href and $anchor_text.
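
To make the comparison step concrete, here is a minimal sketch that collects the hrefs of links whose anchor text contains a given phrase. The function name find_links_by_anchor_text and the sample HTML are made up for illustration, and the case-insensitive substring match via stripos() is just one way to compare; adjust to taste.

```php
<?php
// Hypothetical helper: return the hrefs of all links whose anchor text
// contains $needle (case-insensitive substring match).
function find_links_by_anchor_text($html, $needle)
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; '@' silences the parser warnings.
    @$dom->loadHTML($html);

    $matches = array();
    foreach ($dom->getElementsByTagName('a') as $node) {
        $anchor_text = trim($node->textContent);
        if ($needle !== '' && stripos($anchor_text, $needle) !== false) {
            $matches[] = $node->getAttribute('href');
        }
    }
    return $matches;
}

// Sample input, invented for this example.
$html = '<html><body><a href="/a">PHP Forum</a> <a href="/b">Other</a></body></html>';
print_r(find_links_by_anchor_text($html, 'php'));
```

For an exact match instead of a substring match, something like strcasecmp($anchor_text, $needle) === 0 should do.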

That's it. Hope it helped,

Saurabh.

Clark

10:14 pm on Aug 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OMG, that's amazing! Thank you I will try.

coopster

1:52 pm on Aug 30, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



That is nice. So it seems the DOM extension is no longer EXPERIMENTAL? Found this in the PHP5 changelog [php.net]...

Completely Overhauled XML support (Rob, Sterling, Chregu, Marcus)

  • Brand new Simplexml extension
  • New DOM extension
  • New XSL extension
  • Moved the old DOM-XML and XSLT extensions to PECL
  • ext/xml can now use both libxml2 and expat to parse XML
  • Removed bundled expat

Anyone have any more info?

Clark

7:36 pm on Aug 31, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I couldn't find any website with more info. Is there a way to define other stuff, like user agent? referer? Do you still need to use get_meta_tags or is that part of it? I found this [us4.php.net] but, uh, I always found even MAN pages confusing. This is even harder.

Still, amazing find!

lazydog

8:19 pm on Aug 31, 2004 (gmt 0)

10+ Year Member



There is a tutorial -
[zend.com...]

The user-agent can be set in php.ini. Look under the section "Fopen wrappers".
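
The php.ini directive in question is user_agent. As far as I know it can also be changed at runtime with ini_set(), which saves editing php.ini. A small sketch follows; 'MyLinkChecker/1.0' is a placeholder string I made up, and whether a given remote server honours it is a separate matter.

```php
<?php
// 'MyLinkChecker/1.0' is a placeholder user-agent string, not a real product.
ini_set('user_agent', 'MyLinkChecker/1.0');

// Requests made afterwards through PHP's HTTP stream wrapper
// (for example file_get_contents() on an http:// URL) should send this value.
echo ini_get('user_agent'), "\n";
```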

For more control over referer and other variables, I suggest you use DOM in combo with Curl.
Retrieve the HTML file with Curl and then use -

$dom->loadHTML($html_string) instead of
$dom->loadHTMLFile('http://www.webmasterworld.com/');
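
A rough sketch of the Curl + DOM combo, assuming the curl extension is installed. The function names fetch_html and extract_links, the user-agent string, and the referer URL are all invented for the example; the curl_setopt() options are standard ones.

```php
<?php
// Hypothetical helper: fetch a page with cURL, sending a custom user agent
// and referer. Returns the body as a string, or false on failure.
function fetch_html($url, $user_agent, $referer)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    return curl_exec($ch);
}

// Hypothetical helper: parse an HTML string into href => anchor text pairs.
// If the same href appears twice, the later anchor text wins.
function extract_links($html)
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // '@' hides warnings about invalid markup
    $links = array();
    foreach ($dom->getElementsByTagName('a') as $node) {
        $links[$node->getAttribute('href')] = trim($node->textContent);
    }
    return $links;
}

// Usage (needs network access and the curl extension):
// $html = fetch_html('http://www.webmasterworld.com/', 'MyLinkChecker/1.0', 'http://www.example.com/');
// if ($html !== false) { print_r(extract_links($html)); }
```

Keeping the fetch and the parse in separate functions means the parsing part can be tried on any HTML string without touching the network.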

coopster: AFAIK PHP5 is not recommended on production systems.

Saurabh.

Clark

2:12 pm on Sep 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is Curl built in? Or do I need to install the class?

lazydog

11:30 pm on Sep 2, 2004 (gmt 0)

10+ Year Member



Hi!

To use curl you will need to install the extension. IIRC, run configure with "--with-curl".
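
Before rebuilding anything, it may be worth checking whether the extension is already there. A quick sketch using extension_loaded(); the messages are just illustrative.

```php
<?php
// extension_loaded() returns true if the named extension is available.
$has_curl = extension_loaded('curl');

if ($has_curl) {
    echo "curl is available\n";
} else {
    echo "curl is missing; rebuild PHP with --with-curl or enable the extension\n";
}
```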

S.