Forum Moderators: mack
Does anybody have any hints/ideas/tutorials on this aspect of spider programming? Thanks.
Be sure to read Marcia's WebmasterWorld Welcome and Guide to the Basics [webmasterworld.com] post.
O'Reilly has put Clinton Wong's out-of-print Web Client Programming with Perl [oreilly.com] online. It is an excellent resource for learning about programming for the web.
Andreas
I've looked at Clinton Wong's Web Client Programming with Perl. It says very little about obtaining absolute URLs from local (relative) URLs, and it gives no specifics about implementation.
So far I've been working with "Programming Bots, Spiders, and Intelligent Agents in Microsoft Visual C++" (1999) by David Pallmann. It is a great book for beginners in this area, but it offers only one function for obtaining absolute URLs from local URLs, and that function is outdated.
Any other hints/ideas/tutorials/WebmasterWorld threads?
Is this the right Forum for this discussion?
Thanks.
For Perl there is the URI [perldoc.com] module, which makes handling URIs very easy. There should be URI-handling libraries available for every programming language.
RFC2396 - Uniform Resource Identifiers (URI) Generic Syntax [faqs.org] contains an example of how to resolve relative URIs.
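As a rough illustration of the library route, here is a minimal Python sketch. It assumes Python's standard urllib.parse.urljoin is an acceptable stand-in for whatever URI library your own language offers; the function name absolutize_links and the example.com URLs are made up for the example.

```python
from urllib.parse import urljoin


def absolutize_links(page_url, hrefs):
    """Resolve each href found on a page against the URL of that page.

    page_url is the address the page was fetched from; hrefs are the raw
    values of the href attributes. Both names are illustrative only.
    """
    return [urljoin(page_url, href) for href in hrefs]


if __name__ == "__main__":
    # Hypothetical page and links, for illustration only.
    page = "http://www.example.com/my/stuff/overthere/index.htm"
    links = [
        "http://www.example.com/my/stuff/here",  # already absolute
        "/my/stuff/here",                        # relative to the site root
        "there.htm",                             # relative to the page's directory
        "../here/here.htm",                      # climbs one directory first
    ]
    for href, absolute in zip(links, absolutize_links(page, links)):
        print(href, "->", absolute)
```

The library takes care of the "../" handling and the other edge cases that the RFC examples walk through, so the spider only has to remember which URL each page came from.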
Andreas
www.example.com/my/stuff/here
Which is the local path:
/www/home/example/my/stuff/here
1)
The only tricky part is stripping the front end off a link that uses an absolute URL:
<a href="http://www.example.com/my/stuff/here">stuff</a>
That's not too bad since you look for the "http" and determine it to be an absolute url from that.
2)
The equiv relative url:
<a href="/my/stuff/here">stuff</a>
Which is pretty easy too, since it starts with a "/".
3)
How about a relative url that doesn't start at root?
<a href="here.htm">stuff</a>
Which is stored in the "/my/stuff/here/" directory. That's not too bad either, since we have to know what directory it is in in order to get the file. If you load the file locally, you have probably read the directory into a variable somehow. That variable pointing at the directory gets tacked onto the front of the filename.
4)
Now come the fun ones:
<a href="../here/here.htm">stuff</a>
Which is called from an example "/my/stuff/overthere/" directory. Again, you know the directory that the file is in based on your loading of it. Now just tack the two together again, keeping the "../" on the front end, and let the file system figure it out.
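The four cases above can be sketched by hand as follows. This is only a rough Python illustration, not a complete resolver: the names resolve and remove_dot_segments are made up for the example, and it ignores query strings, fragments, "//host/..." links and schemes other than http/https. When the result goes back out as a URL to fetch, or is compared against URLs already visited, there may be no file system to sort out the "../", so the sketch collapses it itself.

```python
from urllib.parse import urlsplit


def remove_dot_segments(path):
    """Collapse "." and ".." segments in an absolute path."""
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == "..":
            # Never pop the leading "" that stands for the root.
            if len(out) > 1:
                out.pop()
        elif seg != ".":
            out.append(seg)
    # A path ending in "." or ".." names a directory: keep the trailing "/".
    if segments[-1] in (".", ".."):
        out.append("")
    return "/".join(out)


def resolve(page_url, href):
    """Turn an href found on page_url into an absolute URL."""
    # Case 1: already absolute, recognised by its scheme.
    if href.lower().startswith(("http://", "https://")):
        return href
    base = urlsplit(page_url)
    root = base.scheme + "://" + base.netloc
    # Case 2: starts with "/", so it is relative to the site root.
    if href.startswith("/"):
        return root + remove_dot_segments(href)
    # Cases 3 and 4: relative to the directory the page lives in.
    directory = base.path[: base.path.rfind("/") + 1] or "/"
    return root + remove_dot_segments(directory + href)


if __name__ == "__main__":
    # Hypothetical page URL built from the example paths in this post.
    page = "http://www.example.com/my/stuff/overthere/index.htm"
    for href in ("http://www.example.com/my/stuff/here",
                 "/my/stuff/here", "there.htm", "../here/here.htm"):
        print(href, "->", resolve(page, href))
```

The directory is taken from the URL the page was fetched from, which plays the same role as the directory variable described in cases 3 and 4.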
I have found several other pages that may be of interest to anybody who is working on the same problem.
Checklist for Search Robot Crawling and Indexing:
http://www.searchtools.com/robots/robot-checklist.html
A document that defines the syntax and semantics for relative URLs:
http://www.ietf.org/rfc/rfc1808.txt
A document that defines the syntax and semantics for absolute URLs:
http://www.ietf.org/rfc/rfc1738.txt
An HTML approach to links:
http://www.w3.org/TR/1999/REC-html401-19991224/struct/links.html
Source Code for Web Robot Spiders:
http://www.searchtools.com/robots/robot-code.html
Thanks again,
TK
[home.snafu.de...]
Less fun than programming it yourself, but a bit quicker.
[msdn.microsoft.com...]