My ideal scenario would be to create this with PHP and use cURL for most of the HTTP work.
Before I go all ham-fisted trying to build it, I'm trying to map out the major parts of any search spider that is let loose on the web:
1. Make it robots.txt compliant
2. Try to make it bandwidth friendly (spidering frequency and cache headers)
3. Have a course of action for loops caused by dynamic pages and spider traps
4. Detect 4xx error pages and decide what to do with them
5. Deal with redirects
6. Recognise [domain.com...] and www.domain.com as the same or different pages
and no doubt many more! Goal #1 with the spider is to make sure it's not a rogue, and that it behaves as politely as possible towards other websites.
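For what it's worth, here is a minimal sketch of the fetching side using PHP's cURL extension. fetch_page() and the bot's user-agent string are made-up names for illustration, nothing standard. It follows redirects up to a cap (point 5), surfaces the status code so 4xx pages can be dropped or backed off from (point 4), and accepts gzip to go easy on bandwidth (point 2):

```php
<?php
// Minimal fetch sketch: follow redirects (capped), time out, and report
// the final URL and status so the caller can decide what to do next.
function fetch_page(string $url, int $maxRedirects = 5): ?array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,           // return the body, don't print it
        CURLOPT_FOLLOWLOCATION => true,           // point 5: follow redirects...
        CURLOPT_MAXREDIRS      => $maxRedirects,  // ...but not into a loop
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_ENCODING       => '',             // point 2: accept gzip/deflate
        CURLOPT_USERAGENT      => 'MyDirectoryBot/0.1 (+http://example.com/bot.html)',
    ]);
    $body     = curl_exec($ch);
    $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);

    if ($body === false) {
        return null;  // network-level failure: log it, retry later, or mark the host dead
    }
    if ($status >= 400 && $status < 500) {
        // point 4: 404/410 -> drop the URL for good; 403/429 -> back off
        return ['status' => $status, 'url' => $finalUrl, 'body' => null];
    }
    return ['status' => $status, 'url' => $finalUrl, 'body' => $body];
}
```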
Since the directory in question is topic-specific, I'll either have to enter listings manually or designate a page like a DMOZ category as a sort of spidering seed to get relevant sites.
Aside from that, the spider needs to be well behaved. Has anyone had experience making this sort of script, and are there preferred ways of addressing the above issues?
I wouldn't want to be checking robots.txt for every page that I spider... or keep requesting pages from a long-dead domain... that sort of thing.
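A minimal sketch of that per-host caching idea; the in-memory arrays and the robots_for() name are purely illustrative, and a real spider would back this with the db:

```php
<?php
// Per-host robots.txt cache: fetch once per domain, not once per page,
// and stop retrying hosts that have gone dead.
$robotsCache = [];  // host => ['rules' => string, 'fetched' => timestamp]
$deadHosts   = [];  // host => time of first failure

function robots_for(string $url, array &$cache, array &$dead): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!$host || isset($dead[$host])) {
        return null;  // unparseable URL or a long-dead domain: skip it
    }
    $ttl = 30 * 24 * 3600;  // revisit robots.txt every 30 days
    if (isset($cache[$host]) && time() - $cache[$host]['fetched'] < $ttl) {
        return $cache[$host]['rules'];
    }
    $rules = @file_get_contents("http://$host/robots.txt");
    if ($rules === false) {
        // NB: a proper fetch should distinguish "no robots.txt" (404,
        // crawling allowed) from a dead host; this sketch lumps them together.
        $dead[$host] = time();
        return null;
    }
    $cache[$host] = ['rules' => $rules, 'fetched' => time()];
    return $rules;
}
```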
Hints and suggestions are most welcome; I'll begin to cURL my way into this script sooner or later ;)
I went the route of a big ol' robots.txt crawl. Do that first and perfect your robots.txt parsing routines. You just can't imagine the kind of stuff out there in robots.txt files. I ran into one that was a love letter from a guy to a bot - go figure.
You are on the right track to compartmentalize the tasks. Do the same with the process. It is tempting to start combining routines when you see repetitive code - don't do it. Keep them stand-alone.
I prefer to do it in stages - download robots.txt - then download the pages - then process the pages - then stuff them in a db - then work on the db extraction routines (roughly as sketched below).
All those are stand-alone pieces of code and can be run by themselves or in the background.
You eliminate 90% of the problems if each part of the process is kept by itself. Weird, bizarre, and difficult-to-diagnose bugs creep in when you try to combine routines into a mega program - k.i.s.s.
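A bare-bones sketch of that staged layout. Every name in it (the spool path, the stage functions) is made up; the point is just that each stage is its own entry point and can run alone or from cron:

```php
<?php
// One script, one stage per invocation: `php spider.php robots`,
// then `php spider.php pages`, and so on. Each stage only talks to
// a shared spool directory, so the stages stay stand-alone.
const SPOOL = '/var/spool/spider';  // assumed layout, not a convention

function stage_robots(): void
{
    // fetch robots.txt for every queued host into SPOOL/robots/
    foreach (glob(SPOOL . '/queue/*.host') ?: [] as $f) {
        $host = trim(file_get_contents($f));
        $txt  = @file_get_contents("http://$host/robots.txt");
        file_put_contents(SPOOL . "/robots/$host.txt", $txt === false ? '' : $txt);
    }
}

function stage_pages(): void   { /* download pages the cached rules allow */ }
function stage_process(): void { /* parse downloaded HTML into records */ }
function stage_store(): void   { /* stuff the records into the db */ }

switch ($argv[1] ?? '') {
    case 'robots':  stage_robots();  break;
    case 'pages':   stage_pages();   break;
    case 'process': stage_process(); break;
    case 'store':   stage_store();   break;
    default: fwrite(STDERR, "usage: php spider.php robots|pages|process|store\n");
}
```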
I went the route of a big ol robots.txt crawl. Do that first and perfect your robots.txt parsing routines. You just can't imagine the kind of stuff out there in robots.txt files. I ran into one that was a love letter from a guy to a bot...
I think you did this not long before I signed up ;) You mentioned at the time that about 10% of robots.txt files are invalid? Scary...
With the db I am using, I hope to 'compartmentalise' a domain and its subpages so that robots.txt can be grabbed depending on the "domainid".
Once robots.txt has been parsed, the spider is let loose on the site. I'm not sure how frequently I should revisit robots.txt, though...
Perhaps revisiting robots.txt every 30 days is acceptable?
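One way that schema could look, sketched against SQLite for brevity. All the table and column names are assumptions, not a prescription: robots.txt lives on the domain row, and a fetched-at timestamp drives the 30-day revisit.

```php
<?php
// "domainid" layout: one row per domain holds its robots.txt and when
// it was fetched; pages hang off the domain via domainid.
$db = new PDO('sqlite:spider.db');
$db->exec("
    CREATE TABLE IF NOT EXISTS domains (
        domainid       INTEGER PRIMARY KEY,
        host           TEXT UNIQUE NOT NULL,
        robots_txt     TEXT,
        robots_fetched INTEGER        -- unix timestamp of the last grab
    );
    CREATE TABLE IF NOT EXISTS pages (
        pageid   INTEGER PRIMARY KEY,
        domainid INTEGER NOT NULL REFERENCES domains(domainid),
        path     TEXT NOT NULL,
        UNIQUE (domainid, path)
    );
");

// Domains whose robots.txt is missing or over 30 days old, due a re-check:
$stale = $db->query(
    "SELECT domainid, host FROM domains
     WHERE robots_fetched IS NULL
        OR robots_fetched < strftime('%s','now') - 30*24*3600"
)->fetchAll(PDO::FETCH_ASSOC);
```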
I also wonder about the speed at which a robot should grab pages from a site. No doubt my single shared server couldn't grab pages very fast anyway, but it's probably something else I have to consider...
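The usual answer is a fixed politeness delay per host. Here is a tiny sketch; the 10-second figure is a pure assumption, and a Crawl-delay line in robots.txt, where one exists, should override it:

```php
<?php
// Per-host politeness delay: never hit the same host more often than
// once every $delay seconds. 10s is a guess, not a standard.
function polite_wait(string $host, array &$lastHit, int $delay = 10): void
{
    if (isset($lastHit[$host])) {
        $elapsed = time() - $lastHit[$host];
        if ($elapsed < $delay) {
            sleep($delay - $elapsed);  // single-threaded and slow on purpose
        }
    }
    $lastHit[$host] = time();
}

// Usage: $lastHit = []; polite_wait('example.com', $lastHit); then fetch the page.
```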
And this is before I even look at the code of a page :)
Seems like I'm going to have to do a lot of reading and learning... if anyone has anything to chip in, it's most welcome!