Programming a Spider

any hints?

tkarade

12:01 am on Jan 13, 2003 (gmt 0)

I want to code a spider that will search my pages for dead links. I am having some difficulty resolving the local URLs that I find into absolute URLs, since there are so many different types of local links (htm, asp?, php?, /path, path/file name, just file name ...).

Does anybody have any hints/ideas/tutorials on this aspect of spider programming? Thanks.

andreasfriedrich

12:08 am on Jan 13, 2003 (gmt 0)

Welcome to WebmasterWorld [webmasterworld.com] tkarade.

Be sure to read Marcia's WebmasterWorld Welcome and Guide to the Basics [webmasterworld.com] post.

O'Reilly has put Clinton Wong's out-of-print Web Client Programming with Perl [oreilly.com] online. This is an excellent resource for learning about programming for the web.

Andreas

tkarade

12:45 am on Jan 13, 2003 (gmt 0)

Thanks andreasfriedrich,

I've looked at Clinton Wong's Web Client Programming with Perl. There is very little about obtaining absolute URLs from local URLs, and there are no specifics about implementation.

So far I've been working with "Programming Bots, Spiders, and Intelligent Agents in Microsoft Visual C++" (1999) by David Pallmann. This is a great book for beginners in this area, but it offers only one function for obtaining absolute URLs from local URLs, and that function is outdated.

Any other hints/ideas/tutorials/WebmasterWorld threads?
Is this the right forum for this discussion?

Thanks.

andreasfriedrich

1:17 am on Jan 13, 2003 (gmt 0)

Are you looking for C++ resources only? If so, I can't really help. Google will be a good place to start.

For Perl there is the URI [perldoc.com] module, which makes handling URIs very easy. There should be URI handling libraries available for every programming language.

RFC2396 - Uniform Resource Identifiers (URI) Generic Syntax [faqs.org] contains an example of how to resolve relative URIs.
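
To give a feel for what such a library does with the kind of examples in that RFC, here is a minimal sketch in Python (my choice of language for illustration, not something from the book or the RFC). It uses the standard library's `urllib.parse.urljoin`; the base URL `http://a/b/c/d;p?q` is the style of placeholder base used in the RFC's examples, and `resolve_all` is just a hypothetical helper name.

```python
from urllib.parse import urljoin

# Placeholder base URL in the style of the RFC's worked examples.
BASE = "http://a/b/c/d;p?q"


def resolve_all(base, references):
    """Resolve each relative reference against base; return {reference: absolute URL}."""
    return {ref: urljoin(base, ref) for ref in references}


if __name__ == "__main__":
    samples = ["g", "./g", "/g", "../g", "../../g", "//g"]
    for ref, absolute in resolve_all(BASE, samples).items():
        print(f"{ref:10} -> {absolute}")
```

With that base, "g" resolves to http://a/b/c/g, "../g" to http://a/b/g, and "/g" to http://a/g. Current Python versions follow a later revision of the URI spec than RFC 2396, so a few unusual edge cases may differ from the RFC's table, but the ordinary cases above agree.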

Andreas

aspdaddy

2:18 am on Jan 13, 2003 (gmt 0)

You might find this [groups.google.com] useful. You should be able to rewrite it with STL strings in C++. I don't know if it's up to date, though.

Brett_Tabke

8:06 am on Jan 13, 2003 (gmt 0)

The easiest approach is to start with the two "roots". The URL:

www.example.com/my/stuff/here

which corresponds to the local path:

/www/home/example/my/stuff/here

1)
The only tricky part is stripping the front end (scheme and host) off a link that uses an absolute URL:
<a href="http://www.example.com/my/stuff/here">stuff</a>

That's not too bad, since you look for the "http" and can tell from that that it is an absolute URL.

2)
The equivalent root-relative URL:

<a href="/my/stuff/here">stuff</a>

This is pretty easy too, since it starts with a "/".

3)
How about a relative url that doesn't start at root?

<a href="here.htm">stuff</a>

Say the page containing this link is stored in the "/my/stuff/here/" directory. That's not too bad either, since we have to know which directory the page is in in order to get the file. If you load the file locally, you probably read the directory into a variable somehow. That variable, which points at the directory, gets tacked onto the front of the filename.

4)
Now come the fun ones:
<a href="../here/here.htm">stuff</a>

Say this link is called from a page in an example "/my/stuff/overthere/" directory. Again, you know the directory that the file is in from loading it. Now just tack the two together again, keeping the "../" on the front end, and let the file system figure it out.
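
The four cases above can be sketched roughly as follows. This is only an illustration in Python, not code from any of the books mentioned. `SITE_HOST` and `LOCAL_ROOT` are taken from the example roots above; the function name `link_to_local_path` and its `current_dir` parameter are made up for the sketch. One deliberate difference: instead of leaving the "../" for the file system to sort out, the sketch collapses it with `posixpath.normpath`, so the result is easier to compare and log.

```python
import posixpath
from urllib.parse import urlsplit

SITE_HOST = "www.example.com"      # URL root from the example above
LOCAL_ROOT = "/www/home/example"   # local directory that serves the site root


def link_to_local_path(href, current_dir):
    """Map an href found on a page to a local file path.

    current_dir is the URL directory of the page holding the link,
    e.g. "/my/stuff/overthere/". Returns None for links that leave the site.
    """
    parts = urlsplit(href)
    if parts.scheme or parts.netloc:
        # Case 1: absolute URL. Keep it only if it points at our own host.
        if parts.netloc.lower() != SITE_HOST:
            return None
        url_path = parts.path or "/"
    elif parts.path.startswith("/"):
        # Case 2: starts at the site root.
        url_path = parts.path
    else:
        # Cases 3 and 4: relative to the directory of the current page.
        url_path = posixpath.join(current_dir, parts.path)
    # Collapse "./" and "../" segments, then hang the result off the local root.
    return LOCAL_ROOT + posixpath.normpath(url_path)


if __name__ == "__main__":
    print(link_to_local_path("../here/here.htm", "/my/stuff/overthere/"))
```

A dead-link check could then be as simple as testing the returned path with `os.path.exists`, assuming every URL on the site maps straight onto a file under the local root. Query strings and fragments are ignored here.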

tkarade

1:07 am on Jan 15, 2003 (gmt 0)

Thanks for all the help.

I have found several other pages that may be of interest to anybody who is working on the same problem.

Checklist for Search Robot Crawling and Indexing:
http://www.searchtools.com/robots/robot-checklist.html

A document that defines the syntax and semantics for relative URLs:
http://www.ietf.org/rfc/rfc1808.txt

A document that defines the syntax and semantics for absolute URLs:
http://www.ietf.org/rfc/rfc1738.txt

An HTML approach to links:
http://www.w3.org/TR/1999/REC-html401-19991224/struct/links.html

Source Code for Web Robot Spiders:
http://www.searchtools.com/robots/robot-code.html

Thanks again,
TK

andreasfriedrich

1:26 am on Jan 15, 2003 (gmt 0)

Both RFC 1808 and RFC 1738 were updated by RFC 2396 [faqs.org]. So you had better look there if you are after up-to-date information. See my msg #4 [webmasterworld.com].

Andreas

tkarade

4:35 pm on Jan 17, 2003 (gmt 0)

Thanks Andreas,

You're right, RFC 2396 is the document to look at for the most up-to-date information. Your suggestion of searching for a C++ URI handling library was also a good one. I will try to find one.
If anybody knows of a C++ URI library, please let me know.

Thanks,
TK

victor

7:14 pm on Jan 17, 2003 (gmt 0)

If you are a Windows user, think about just downloading Xenu Link Sleuth. It does what you want, and it's free.

[home.snafu.de...]

Less fun than programming it yourself, but a bit quicker.

webdevsf

7:25 pm on Jan 17, 2003 (gmt 0)

If you program in VB.NET, here's a sample spider (and article)

[msdn.microsoft.com...]

andreasfriedrich

7:29 pm on Jan 31, 2003 (gmt 0)

A PHP implementation of a function that resolves relative URIs [webmasterworld.com] can be found in the Bag-O-Tricks for PHP II [webmasterworld.com].