Forum Moderators: phranque


Creating Search Spiders

Make it behave


brotherhood of LAN

4:58 pm on Jan 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



After making a directory, I want to add a search feature to the directory and create a web-spider script to grab relevant pages from the web.

My ideal scenario would be to create this with PHP and use CURL for most of the HTTP stuff.

Before I go all ham-fisted trying to make it, I'm trying to map out the major parts of any search spider that is let loose on the web.

1. Make it robots.txt compliant
2. Try to make it bandwidth friendly (spidering frequency and cache headers)
3. Course of action for loops regarding dynamic pages and spider traps
4. Detection of 4## error pages and what to do with them
5. Deal with re-directs
6. Recognise [domain.com...] and www.domain.com as same or different pages

and no doubt many more! Goal #1 with the spider is to make sure it's not a rogue, and that it behaves as well as possible toward other web sites.
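Point 6 in the list above comes down to URL canonicalisation: mapping every URL to one normalised key before it goes in the queue. A minimal sketch (in Python for brevity, though the plan here is PHP/cURL; the policy of treating www and non-www as the same host is an assumption, and some sites do serve different content on each):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Canonicalise a URL so duplicate pages map to one key.

    Lowercases the scheme and host, strips a leading "www.",
    drops fragments, and gives empty paths a trailing slash.
    """
    scheme, netloc, path, query, _fragment = urlsplit(url)
    host = netloc.lower()
    if host.startswith("www."):
        host = host[4:]          # policy choice: www == non-www
    if path == "":
        path = "/"               # http://example.com -> http://example.com/
    return urlunsplit((scheme.lower(), host, path, query, ""))
```

With this, `http://WWW.Example.com` and `http://example.com/` collapse to the same key, so the spider fetches the page once instead of twice.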

Since the directory in question is topic-specific, I'll either have to manually enter listings or designate a page like a DMOZ category as some sort of spidering hole to get relevant sites.

Aside from that, the spider needs to be well behaved. Has anyone had experience making these sorts of scripts, and are there preferred ways of addressing the above issues?

I wouldn't want to be checking robots.txt for every page that I spider... or keep requesting pages from a long-dead domain... that sort of thing.

Hints and suggestions most welcome, I'll begin to CURL my way into this script sooner or later ;)

Brett_Tabke

11:06 am on Jan 14, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Very cool BOL.

I went the route of a big ol robots.txt crawl. Do that first and perfect your robots.txt parsing routines. You just can't imagine the kind of stuff out there in robots.txt files. I ran into one that was a love letter from a guy to a bot - go figure.
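A robots.txt parser that tolerates the malformed files Brett describes doesn't have to be written from scratch. As a sketch, Python's standard `urllib.robotparser` can be fed raw lines you fetched yourself (the thread plans PHP, but the parsing idea is the same):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() accepts the file's lines directly, so you can download
# robots.txt with your own HTTP code and hand it over; unknown or
# garbled lines (love letters included) are simply ignored.
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /private/",
])

print(rp.can_fetch("MyBot", "http://example.com/index.html"))  # True
print(rp.can_fetch("MyBot", "http://example.com/cgi-bin/x"))   # False
```

The point stands either way: get this routine solid in isolation before the crawler proper touches a single page.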

You are on the right track to compartmentalize the tasks. Do the same with the process. It is tempting to start combining routines when you see repetitive code - don't do it. Keep them stand alone.

I prefer to do it in stages - download robots.txt - then download the pages - then process the pages - then stuff them in a db - then work on the db extraction routines.

All of those are stand-alone pieces of code and can be run by themselves or in the background.

You eliminate 90% of the problems if each part of the process is kept by itself. Weird, bizarre, and difficult-to-diagnose bugs creep in when you try to combine routines into a mega program - k.i.s.s.

brotherhood of LAN

5:33 pm on Jan 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for replying Brett.

I went the route of a big ol robots.txt crawl. Do that first and perfect your robots.txt parsing routines. You just can't imagine the kind of stuff out there in robots.txt files. I ran into one that was a love letter from a guy to a bot...

I think you did this not long before I signed up ;) You mentioned at the time that about 10% of robots.txt files are invalid? Scary...

With the db I am using, I hope to 'compartmentalise' each domain and its subpages so that robots.txt can be grabbed depending on the "domainid".

When the robots.txt has been parsed, the spider is let loose on the site. I'm not sure how frequently I should revisit robots.txt though...

Perhaps revisiting robots.txt every 30 days is acceptable?

I also wonder about the speed at which a robot should grab pages from a site. No doubt my single shared server couldn't grab pages too fast, but it's probably something else I have to consider...
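Per-host rate limiting can be as simple as remembering the last request time for each host and sleeping until a minimum gap has passed. A sketch - the 10-second delay is an arbitrary conservative choice, not a standard:

```python
import time

CRAWL_DELAY = 10   # seconds between hits to the same host (assumed value)

_last_hit = {}     # host -> timestamp of the previous request

def polite_wait(host):
    """Block until at least CRAWL_DELAY seconds have passed since
    the previous request to `host`; other hosts are unaffected."""
    now = time.time()
    prev = _last_hit.get(host)
    if prev is not None:
        wait = CRAWL_DELAY - (now - prev)
        if wait > 0:
            time.sleep(wait)
    _last_hit[host] = time.time()
```

Calling `polite_wait(host)` before every fetch spaces out requests per site while letting the spider interleave work across many hosts, which also keeps the load light on a single shared server.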

And this is before I even look at the code of a page :)

Seems like I'm going to have to do a lot of reading and learning... if anyone has anything to chip in, it is most welcome!