How to create a spider?

Where can I learn?

         

jtoddv

9:46 pm on Jan 9, 2004 (gmt 0)

10+ Year Member



I have an idea for a program and want to learn how to create a spider. One that scans just a list of a few pages, not one like GoogleBot, but maybe with potential down the road. Just a simple spider to parse HTML pages.

1. What language should I create this spider in?
2. Where can I learn to do this? (resource sites or examples)

thanks,
Justin

bakedjake

9:50 pm on Jan 9, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



There are many, many modules for many, many different languages for parsing HTML pages.

Google for "HTML parser". To extend that to spider functionality, simply read in the array of links, and keep following them. That's what a basic spider does.

I'd recommend Perl/Python, just for the availability of good parsing objects.
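
To illustrate the "read in the array of links, and keep following them" idea, here is a minimal sketch in Python using only the standard library. The names `LinkCollector`, `fetch`, and `crawl`, and the `max_pages` limit, are made up for this example. It ignores robots.txt, rate limiting, and non-HTML content, so treat it as a starting point rather than a finished spider.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Hypothetical helper: download a page and return its text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl: parse a page, queue its links, repeat."""
    queue = deque([start_url])
    seen = {start_url}   # URLs already queued, to avoid loops
    visited = []         # URLs actually fetched, in order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to load
        visited.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Calling `crawl("http://example.com/", max_pages=5)` would return the list of pages it fetched. Passing your own `fetch_page` function lets you try the logic against canned HTML without touching the network.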

rcjordan

9:53 pm on Jan 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Check out Tara's book, "Spidering Hacks".

jtoddv

10:06 pm on Jan 9, 2004 (gmt 0)

10+ Year Member



Thanks guys.

I don't want to keep following links right now, just want to pull out specific parts and that is it.

I am not a coding genius by any means, just needed some direction on which language is the best. I figured Perl would be the best, or at least that is what I have read, but wasn't sure.

I understand how spiders work, just need someone to show me how to do it. I need a teacher. Do you know any good tutorials on creating a spider in Perl?

bakedjake

10:08 pm on Jan 9, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



You don't want a spider then, you want an HTML Parser. :) Spiders crawl the web, and then parse individual pages.

Here's the reference on perldoc for the HTML::Parser module. It's a good starting point, with examples:

[perldoc.com...]
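
For comparison, the same event-driven style of parsing ("pull out specific parts and that is it") can be sketched with Python's standard-library `html.parser`. This is not the Perl HTML::Parser module itself, only an analogous approach. `TagTextExtractor` and `extract` are names invented for this example, and the sketch assumes the wanted tags are not nested inside each other.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collects the text found inside each occurrence of the chosen tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.results = []      # list of (tag, text) pairs
        self._current = None   # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start capturing only when we are not already inside a wanted tag.
        if self._current is None and tag in self.wanted:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.results.append((tag, "".join(self._buffer).strip()))
            self._current = None


def extract(html, wanted_tags):
    """Return (tag, text) pairs for every wanted tag in the HTML string."""
    parser = TagTextExtractor(wanted_tags)
    parser.feed(html)
    parser.close()
    return parser.results
```

For example, `extract(page_html, ["title", "h1"])` returns the page title and top-level headings as a list of `(tag, text)` pairs, where `page_html` is whatever HTML string you have already downloaded.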

bcolflesh

10:10 pm on Jan 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



spider.pl

swish-e.org/
swish-e.org/current/docs/spider.html

jtoddv

10:19 pm on Jan 9, 2004 (gmt 0)

10+ Year Member



Cool thanks guys.

If anyone else has any links to tutorials on how to create an HTML parser in Perl please post them. I want to learn! ;)

bcolflesh

10:24 pm on Jan 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



2 NL HTML Parser for Perl

nl-html-parser.sourceforge.net/

icewalkers.com/Perl/5.8.0/lib/HTML/Parser.html

libwww-perl also includes an HTML parser:

sourceforge.net/projects/libwww-perl/