Creating Spider and/or Search Engine

         

ukgimp

3:53 pm on Dec 6, 2001 (gmt 0)




Sorry if this is in the wrong section. I have done some lurking and could not find what I wanted.

Q. Is it possible, or even worth bothering, to write a spider that harvests only web sites/pages containing a certain phrase ('maths', say) and then puts them into a searchable database? If so, are there any decent resources to adapt? I would also want to avoid major pitfalls such as unsuitable or repeat requests.

Cheers

Richard

seriesint

4:28 pm on Dec 6, 2001 (gmt 0)



Hi ukgimp,
Welcome to Webmaster World :)

If there is a section for building bots here, I've yet to find it, though the topic keeps popping up in different places. I'll just assume you're going to use a server-side script to run the bot and answer here.

As for whether it's worth it, I can't say. It takes a lot of effort to parse pages if one is after specific content, such as links or images, and the spider has to extract those and keep track of where it found them. But something like your suggestion wouldn't be that difficult. The spider would request a web page and parse what it could (lots of things interfere with its "reading" of web pages). For your purpose, all the HTML tags would then be stripped out of the page and it would index the words, looking for the phrase. It could then toss the page (otherwise one would need some type of storage setup for the spider to use, and this would be a massive drain) and move on to the next URL in its list. This simplifies things and misses some points to watch out for, including ones you mention, like repeating URLs. Then it's a question of how long it takes to parse a decent chunk of the web to give the results you want, and what handles the data it feeds back.
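The loop described above (request a page, strip the tags, look for the phrase, toss the page, move on to the next URL without repeating any) might look roughly like the sketch below. It is in Python rather than Perl, and everything in it is an assumption for illustration: the names `PageParser` and `crawl`, the `max_pages` cap, and the idea of passing in a `fetch` function so the loop does not care how pages are actually retrieved.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class PageParser(HTMLParser):
    """Collects visible text and link targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def crawl(start_urls, phrase, fetch, max_pages=50):
    """Return the URLs whose visible text contains `phrase` (case-insensitive).

    `fetch` is a caller-supplied function (hypothetical here) that takes a URL
    and returns the page's HTML as a string, or None if the request failed.
    """
    queue = deque(start_urls)
    seen = set(start_urls)  # every URL ever queued, so none is requested twice
    matches = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = fetch(url)
        fetched += 1
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        # Join the text fragments and collapse runs of whitespace.
        text = " ".join(" ".join(parser.text_parts).split())
        if phrase.lower() in text.lower():
            matches.append(url)
        # The page itself is discarded here; only its links are kept.
        for href in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return matches
```

The `seen` set is what guards against repeat requests, and dropping the `#fragment` part stops the same page from being queued under two spellings. A real spider would also want to honour robots.txt and pause between requests to the same host, which this sketch leaves out.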

Assuming that you are new to spider/bot creation, one of the best places to start is Perl's LWP (libwww-perl) module. It handles basic HTTP requests and the like, and Perl is really good at parsing pages. That said, Perl is good as a starting point; any serious or major work would have to move to a more complete language.
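To give a feel for what a "basic HTTP request" involves, here is a minimal sketch. Note that it is not LWP: it uses Python's standard `urllib.request` as a stand-in for the same idea. The function name `fetch`, the timeout, and the user-agent string are all made up for illustration.

```python
import urllib.request

USER_AGENT = "example-maths-spider/0.1"  # hypothetical name; identify your own bot here


def fetch(url, timeout=10):
    """Fetch one URL and return its body as text, or None if the request fails."""
    try:
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (OSError, ValueError, LookupError):
        # OSError covers network failures, timeouts, and HTTP error responses
        # (urllib's URLError and HTTPError are subclasses of it); ValueError
        # covers malformed URLs; LookupError covers an unknown declared charset.
        return None
```

Calling `fetch("https://example.com/")` should hand back the page's HTML as a string, or `None` if anything went wrong, so a spider can simply skip failed pages and carry on.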

I'll stop here; I can go on about spiders/bots for some time before I run out of steam. If you have any questions, just ask and I'll see what I can come up with.

HTH
later