I'm quite new to these forums, so I hope I'm writing this in the right forum.
I am currently writing my master's thesis in computer science at the University of Oslo, Norway. Part of this thesis is writing a spider that, for all practical purposes, acts like a search engine spider.
My master's project is creating a software package that can perform scientific measurement of the internet. (Yeah, I know it sounds abstract, but bear with me.) The data that this software collects will not be publicly available.
The main idea is to create a spider that starts with just a few hand-picked pages and then just let it spread. The pages it visits will be cached locally. After a while the spider should start to revisit all the pages in the cache and compare the new copies to the cached ones. Then it can start measuring different things about the websites in the cache.
After reading a lot of posts in this forum, I've come to understand that webmasters are quite paranoid and protective of their content, which is perfectly understandable. But that is why I'm asking you: which rules should I follow when I create my spider?
I have picked up a few things:
But I still have quite a few questions that I don't have the answer to:
Phew, this was a long post. To sum things up, the main question is: Do you have any guidelines to follow when creating a good spider?
Best Regards
Vidar Johansen
How often does the page change? And how much?
Use an RSS feed to get new content, or use the Sitemaps XML file that all the major search engines support. Otherwise, wait a week or more before crawling the same page again, note which pages have substantially changed, and crawl those pages more often than all the others going forward.
Pages that never seem to change can just float to the bottom of the stack, and in theory the webmaster will use Sitemaps to notify the search engine when a page changes, if ever.
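To make that concrete, here is a rough, untested Python sketch of the idea - the interval bounds and function name are just made up for illustration:

```python
# Hypothetical sketch: shrink the revisit interval for pages that changed,
# grow it for pages that didn't, within sensible bounds.

MIN_INTERVAL = 24 * 3600        # never revisit more than once a day
MAX_INTERVAL = 30 * 24 * 3600   # pages that never change float to the bottom

def next_interval(previous_interval, page_changed):
    """Halve the interval when a page changed, double it when it didn't."""
    if page_changed:
        interval = previous_interval / 2
    else:
        interval = previous_interval * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))

# Example: a page that changed on the last visit gets revisited sooner.
print(next_interval(7 * 24 * 3600, page_changed=True))   # ~3.5 days
```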
Set up a website with information about the spider and what it does and why. (Include this in the UA?)
Yes, in the UA is preferred.
Respect the site's bandwidth and don't pull too much data too fast
Honor the crawl-delay directive of robots.txt if you find one.
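For what it's worth, recent versions of Python's standard robotparser handle both the allow/disallow rules and Crawl-delay; a minimal sketch (the user agent string and URLs are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

UA = "MyResearchSpider"
if rp.can_fetch(UA, "http://www.example.com/some/page.html"):
    # crawl_delay() returns None when no Crawl-delay applies to this UA,
    # so fall back to a conservative default.
    delay = rp.crawl_delay(UA) or 5
    print("allowed, waiting", delay, "seconds between requests")
else:
    print("disallowed by robots.txt")
```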
How long should I wait between each page pull from a site? And how long should I wait between each time I revisit a site to compare pages?
I would suggest from 5-10 seconds per page by default.
Don't read the robots.txt, 'cause if you're not Google, you're blocked
If you don't read robots.txt, you'll definitely get blocked.
How to deal with redirects and other HTTP error pages? How many pages can return a 404 before I realise I'm banned? How long until I could retry to see if I'm unbanned?
404s are "page not found"; blocked bots typically get a 403 Forbidden instead.
However, you'll learn real fast that not all error pages return error status codes - plenty of sites give "200 OK" for actual errors - and you'll find you need a list of hundreds of fingerprints of error pages.
One way to test a site's error handling is to request a couple of randomly named pages, such as http://www.example.com/gibberishtotestthe404error.html, and see what you get. Some sites redirect to the index page instead, so you have to remember that landing on the index page for that request really means a 404 error.
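A rough Python sketch of that probe (example.com, the user agent and the single-hash "fingerprint" are placeholders; a real crawler would collect many more fingerprints):

```python
import hashlib
import urllib.error
import urllib.request
import uuid

def error_fingerprint(base_url):
    # Request a page that cannot exist and remember what the site serves.
    probe = base_url + "/" + uuid.uuid4().hex + "-no-such-page.html"
    req = urllib.request.Request(probe, headers={"User-Agent": "MyResearchSpider"})
    try:
        with urllib.request.urlopen(req) as resp:
            # "200 OK" for a gibberish URL means this site serves soft errors;
            # keep a hash of the body (and the final URL, in case it redirects
            # to the index page) so real pages can be compared against it.
            return resp.geturl(), hashlib.md5(resp.read()).hexdigest()
    except urllib.error.HTTPError as e:
        return None, e.code     # a proper 404/410 - nothing special needed

print(error_fingerprint("http://www.example.com"))
```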
It's complicated, good luck!
I would suggest from 5-10 seconds per page by default.
A 1 second delay seems a reasonable default; I state this on the basis of 85 billion crawled pages. Those webmasters who are sensitive to this can and do use Crawl-delay, which would probably cover 2-3% of all URLs.
If-Modified-Since, unfortunately, is not supported well by dynamic pages, and these days the majority of pages are likely to be dynamic. It does not hurt to use it, however.
ETags are fairly rare - they only apply to static pages - but it is a good idea to support them if you do recrawls.
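A rough sketch of a conditional re-fetch in Python - the stored ETag and date are placeholders standing in for values saved from an earlier crawl:

```python
import urllib.error
import urllib.request

stored_etag = '"abc123"'                             # from the previous crawl
stored_last_modified = "Sat, 28 Jun 2008 00:00:00 GMT"

req = urllib.request.Request(
    "http://www.example.com/page.html",
    headers={
        "User-Agent": "MyResearchSpider",
        "If-None-Match": stored_etag,
        "If-Modified-Since": stored_last_modified,
    },
)
try:
    with urllib.request.urlopen(req) as resp:
        body = resp.read()          # page changed (or validators were ignored)
except urllib.error.HTTPError as e:
    if e.code == 304:
        body = None                 # Not Modified: reuse the cached copy
    else:
        raise
```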
Recrawling should ideally be based on how frequently a page changes, so that if it does not change you don't recrawl it too often. This is a bigger problem than it may sound, because you can't just calculate an MD5 hash of the page: a lot of pages change a little (say, by printing the current date/time) but are not substantially changed.
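A naive Python illustration of why a plain hash isn't enough - this one just strips markup, digits and whitespace before hashing, which is far cruder than what real crawlers do (shingling, simhash and the like), but it shows the idea:

```python
import hashlib
import re

def content_fingerprint(html):
    text = re.sub(r"<[^>]+>", " ", html)      # drop markup
    text = re.sub(r"\d+", "", text)           # drop digits (dates, counters)
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(text.encode("utf-8")).hexdigest()

old = "<p>Updated 2008-06-27 13:05</p><p>Same article text.</p>"
new = "<p>Updated 2008-06-28 09:41</p><p>Same article text.</p>"
# Only the timestamp changed, so the fingerprints match - no recrawl needed.
assert content_fingerprint(old) == content_fingerprint(new)
```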
Also support gzipped content - this saves you and webmasters valuable bandwidth; you can reasonably expect to crawl 30-35% more URLs (it can be 50%, depending on the URLs) for the same bandwidth.
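A minimal gzip-aware fetch might look something like this (the URL and user agent are placeholders):

```python
import gzip
import urllib.request

req = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": "MyResearchSpider", "Accept-Encoding": "gzip"},
)
with urllib.request.urlopen(req) as resp:
    raw = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        html = gzip.decompress(raw)   # decompress only if the server complied
    else:
        html = raw
print(len(raw), "bytes on the wire,", len(html), "bytes of HTML")
```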
And last but not least: never argue with wilderness here, you won't win...
A 1 second delay seems a reasonable default; I state this on the basis of 85 billion crawled pages.
That's why you aren't allowed to crawl my servers because it's not reasonable.
One size does not fit all.
My primary site is VERY dynamic, very popular, and even the big search engines honor my crawl-delay. However, every now and then, once or twice a month, everything converges with a high traffic day, too many bots at once, and the server backlogs, the server alarms go off, and all my phones start ringing to alert me the site isn't responding.
If it wasn't for my own homebrew traffic control software the problem would go critical and remain that way for hours, like it used to do. Now it usually takes 2-3 minutes for the backlog to flush after locking out overly aggressive spiders (2s per page or less), but it recovers on its own.
That's why I said 5-10s because I didn't want to see the poor guy get booted for abusing a server on his first attempt.
Don't forget, your spider isn't the only spider on the server as I'm hosting up to 20 crawlers of some sort at a time plus 60-100 visitors average so it's very easy to push a dual XEON box over the edge without some level of control.
Besides, anyone who knows anything about queues can queue up thousands of pages to crawl across a variety of domains on multiple threads, yet keep the impact per domain spread out enough in the queue that it doesn't cause problems, and 5-10s per domain out of thousands of pages (millions?) in a queue seems pretty simple.
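Something along these lines would do it - a simplified, untested sketch where fetch() stands in for whatever download code the spider uses, and the 10 second spacing is just the figure suggested above:

```python
import heapq
import time
from urllib.parse import urlsplit

PER_HOST_DELAY = 10  # seconds between requests to any single host

def crawl(urls, fetch):
    # One global priority queue keyed by the earliest time a URL may be
    # fetched; many hosts are crawled in parallel in wall-clock terms while
    # each individual host only sees one request every PER_HOST_DELAY seconds.
    next_allowed = {}                      # host -> earliest next request time
    queue = [(0, url) for url in urls]     # (ready_time, url)
    heapq.heapify(queue)
    while queue:
        ready, url = heapq.heappop(queue)
        host = urlsplit(url).netloc
        now = time.time()
        earliest = max(ready, next_allowed.get(host, 0))
        if earliest > now:
            # Not this host's turn yet: push it back and let other hosts run.
            heapq.heappush(queue, (earliest, url))
            time.sleep(min(0.1, earliest - now))
            continue
        fetch(url)
        next_allowed[host] = time.time() + PER_HOST_DELAY
```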
[edited by: incrediBILL at 2:32 am (utc) on June 28, 2008]
That's why you aren't allowed to crawl my servers because it's not reasonable.
Fair enough; we support robots.txt (including Crawl-delay) so that you can make exactly that choice. Because we support Crawl-delay (and have since the time when only MSN supported it), the webmaster has control over how slowly they want to be crawled (up to a limit that we consider reasonable).
Your view, Bill, is that of a dug-in webmaster who thinks he is under some kind of siege. You are in a very small minority as far as crawling is concerned; statistically, what you do is irrelevant when it comes to crawling the web. That does not mean your conditions for crawling should not be respected - obviously a crawler should respect them on the basis of your robots.txt - but it would be a big mistake for a bot builder to assume that your views somehow represent the majority of webmasters. They don't.
The main job of a bot builder is to ensure that people like you (and some others on this site) are satisfied that their robots.txt is honoured - that's why good support for robots.txt is essential, even while you are just debugging your bot.
If you have got a slow server then just use Crawl-delay, and if you think too much bandwidth is being used then support gzip to compress pages.
Very kind of you to tribute me with such a compliment.
;)
3.) Please don't read this robots.txt every minute:
User-agent: *
Disallow: /
Your view, Bill, is that of a dug-in webmaster who thinks he is under some kind of siege...
Well, you're statistically wrong this time.
The site I speak about gets enough traffic that it's one of the sites you can check in Google's website trends - not nearly as much as WebmasterWorld, but statistically enough to rank.
I would call having the 3 majors indexing and crawling my site to the tune of 40K-50K pages per day fairly significant, not to mention some of the other little SEs that I do allow to ramble through my site. Then we have a ton of spiders I boot from all over the world - I usually get about 50-100 requests per day from things that never returned traffic, and those get denied. Don't forget all the spybots, botnets, link checkers, data mining, scraping and everything else you can imagine that my bot traps automatically detect and kick to the curb.
So yes, my site is under siege and many others are as well which is why I told the OP to be very kind with a slow crawl limit so as not to burden sites like mine, and there are quite a few, that are being overrun with automated activity.
Any questions?
How long should I wait between each page pull from a site? And how long should I wait between each time I revisit a site to compare pages?
This will give webmasters time to see your visit in their log files and decide whether or not to include you in their robots.txt.
Newly arriving on a site and just pulling pages is not good etiquette; of course your bot was not disallowed in the robots.txt, because it was not previously known.
Well, you're statistically wrong this time.
No, I am not, Bill - unlike you, I can see the big picture of all sites on the web, and I know how many of them use robots.txt at all, how many use crawl-delay, and other things.
You are, like a fair few people in this forum, making a rather typical human mistake of judging the world from the view present in your own small (or big) place, but this view is very biased and does not show the big picture. You need to rise above your personal circumstances and think big, something that you understandably might not be interested in doing, since to you your site is the most important thing. There are a fair few sites that ought to be much bigger than yours that don't allow crawling by anyone but the big 3 SEs; those sites are a problem for any upstart, but even so, they are in a minority, and this should not adversely affect indexing.
Crawl-delay was created for just this case, and I endorse it fully. However, to state that a 10-15 second delay should be the default (for all domains, even those without robots.txt) is totally wrong from a crawling point of view. If YOU think your site should be crawled slowly, then use Crawl-delay; this is nowadays an acceptable practice, just like using robots.txt to disallow URLs you don't want crawled.
Anyway, I think this is turning into a discussion whose value is diminishing with each post; I think I've made my points pretty clear.
Here's a link for you, Vidar. Make sure your robot comprehends the meaning of each server response code and acts accordingly: Hypertext Transfer Protocol -- HTTP/1.1 [w3.org]
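For example, the reaction to each class of code might be sketched like this in Python - the action names are invented for illustration, but the redirect, 304 and Retry-After handling follow the spec linked above:

```python
# Illustrative dispatch only; a real crawler would do more per code.

def handle_response(url, status, headers):
    if status == 301:
        return ("update-url-and-follow", headers.get("Location"))
    if status in (302, 303, 307):
        return ("follow-but-keep-original-url", headers.get("Location"))
    if status == 304:
        return ("keep-cached-copy", None)
    if status in (404, 410):
        return ("drop-from-cache", None)
    if status == 403:
        return ("back-off-from-host", None)      # quite possibly blocked
    if status == 503:
        # Server overloaded or down for maintenance; honour Retry-After.
        return ("retry-later", headers.get("Retry-After"))
    if 200 <= status < 300:
        return ("store-and-parse", None)
    return ("log-and-skip", None)

print(handle_response("http://www.example.com/old.html", 301,
                      {"Location": "http://www.example.com/new.html"}))
```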
Jim
Just offering a simple fact about avoiding spider traps that stop crawlers that hit servers too fast. Not everyone runs bot traps, but if they do, it's the difference between crawling and eating 403s.
With the large number of crawlers these days, a new crawler that is less conspicuous and less aggressive simply improves its chances of not being blocked by webmasters or spider traps, because it is not perceived as a nuisance or a threat.
[edited by: incrediBILL at 5:59 pm (utc) on June 28, 2008]
It would greatly increase crawl time if the default delay for all sites was 10-15 seconds, especially for sites with a lot of content. Some sites have literally hundreds of millions of URLs; even with a 1 second delay it would take years to crawl them (100 million URLs at one request per second is over three years of continuous crawling), so what do you think will happen if the delay is 10 times higher? If you have got a lot of pages on your site and big traffic from visitors, then you should not be surprised if bots come to you en masse - it's the price of success, and some visitors will always be a cost to business. Use gzip to save on bandwidth (applies to bot and site owners equally) and optimise your backend to handle pages quicker.
Some people will never be pleased with anything, that's true for all walks of life, so in this case a good spider will obey robots.txt to avoid unnecessary confrontation with a small but very vocal minority who takes things way too personally. IMHO.
use gzip to save on bandwidth (applies to bot and site owners equally) and optimise your backend to handle pages quicker.
For many large dynamic sites the bandwidth of data isn't really the issue; it's the bandwidth of CPU that's more critical. The problem with gzip is that it adds more CPU usage to an already strained server and makes your site seem even slower to visitors, because their browsers don't get any content until the entire page is generated, gzipped, downloaded and decoded in the browser, as opposed to the normal interactive streaming download of the page.
IMO the best way to regulate the speed of the spider, short of a specific crawl-delay, would be to simply spread out the request time per page based on the response time of each page. For instance, if a request takes 3 seconds to return the page, then delay that amount of time before asking for the next page. That way, if the server speeds up or slows down, the crawler paces itself to the server's performance instead of driving the server further into the red by pounding on it.
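A small sketch of that pacing idea (the minimum delay, URL list and user agent are placeholders):

```python
import time
import urllib.request

def paced_crawl(urls, minimum_delay=5.0):
    for url in urls:
        start = time.time()
        req = urllib.request.Request(url, headers={"User-Agent": "MyResearchSpider"})
        with urllib.request.urlopen(req) as resp:
            resp.read()
        elapsed = time.time() - start
        # Wait at least `minimum_delay`, or the server's own response time
        # if that is longer - a struggling server gets extra breathing room.
        time.sleep(max(minimum_delay, elapsed))
```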
takes things way too personally
Well, we do tend to take it personally when our livelihood is threatened. I never paid much attention to spiders until a few years back, when some overly aggressive Asian mirroring sites would come and literally knock my site offline for up to 30 minutes, asking for up to a hundred pages a second. When visitors can't get to the site due to the actions of bots and you can't make any money, it does tend to get a little personal.
It's the entitlement mentality of all the spybots, scrapers and offline readers (and there are a lot of them), like the Asian mirroring sites mentioned above, that causes the real problems.
If those types of bots didn't exist this conversation would be pretty moot.
Besides, you run one of the legit bots that tries to do it right, so don't take it personally that we're tainted by the actions of others.
At your first visit, pull one page and leave. Then come back no sooner than 24 hrs later.
Beyond that, you have to find some way to guess what kind of bandwidth strain a website may be able to take, in order to figure out how deeply you should crawl it. One measure you can take is how quickly the pages are served, taking into account how large they are.
--
What Makes A Nice Spider?
1. It asks for permission to access my site.
2. It takes "No" for an answer if that is what I tell it.
3. It is properly configured and does not make idiotic requests.
4. It honestly states its identity and (in a reference URL) its purpose.
5. It explains in that reference URL how it benefits me or humanity in general.
Both Microsoft and Yahoo send robots that fail this test, and if I blocked both I would save a lot of bandwidth and lose very little human traffic. But I allow them because I welcome all human traffic, and because lack of competition for Google can only be unhealthy.
--
What Makes A Good Spider?
Good presumably meaning successful. The job is to access and index other people's stuff.
I have some sympathy for new entrants in the field, but the reality is that good spiders are far outnumbered by scrapers, scammers, spammers, thieves, probes, database hackers and sundry bandwidth-wasting robots operated by otherwise legitimate companies who somehow think that being dishonest and attempting to deceive webmasters is acceptable behaviour.
So a spider must convince me that allowing access would be a good idea. Most fail.
Commercial projects usually fail because they want to use my resources to make a profit and cannot convince me that I will ever get anything in return.
Academic projects usually fail because they cannot convince me that humanity in general will benefit from what they do or that they will not subsequently monetize the data collected.
Spiders want my stuff. What's in it for me?
--
Conclusion
If I were submitting a thesis to the University of Oslo it would say that a "good/nice spider" is one that can gain the trust, respect and goodwill of the owners of the stuff it needs and feeds on.
Only that road can lead to popular success.
...
For many large dynamic sites the bandwidth of data isn't really the issue, it's the bandwidth of CPU that's more critical.
Well, that's what multi-core CPUs are for - this task of gzipping on the fly is perfect for multi-threading. Even though I agree gzip itself is rather slow, many sites could do with more optimal HTML - say, Google's is a very lightweight one.
I agree about the crawler auto-increasing its delay if it detects that it is taking too long to fetch a page from the server, thus indicating that the server is possibly overloaded.
I do appreciate that you guys get hit by all sorts of nasty stuff; I was hit myself earlier this year by a fake bot, and that is something that is very bad, as there is not a lot that can be done about it :(
create a spider that starts with just a few hand picked pages, and then just let it spread.
Efficient Crawling Through URL Ordering [dbpubs.stanford.edu] (Junghoo Cho, Hector Garcia-Molina, Lawrence Page)
This is an old one but goes into detail about how important URL ordering can be on larger scales.
I read it a while back, and looking at the numbers (i.e. "the web containing around 1.5TB of data") I'd imagine you could replicate the test quite easily given 2008 resources.
[edited by: brotherhood_of_LAN at 8:18 pm (utc) on June 28, 2008]
perceived throughput is still slower with gzip, tried it, not good.
I think it depends on the size of the page: if it is fairly small, so that the gzipped part is pretty small too, then it should all be quick; but if it's a huge page then yes, maybe the perceived rendering won't be good. You can't have it both ways - either save on bandwidth or have fast perceived response. Maybe you could always use gzip for bots? Bots don't have perceived-response issues, so using gzip for them will reduce the impact of crawlers on your bandwidth. We started supporting gzip (again) recently, and this allowed us to crawl 50% more data for the same bandwidth while saving a lot of bytes for webmasters who support it. IMO all good bots should support gzip.
save on bandwidth
That's really not an issue as I have 2,000GB allocation per month per server and my busiest server is only using 200GB+ per month at the moment, so I have some room to grow.
It's really all CPU burn with the dynamic pages.
However, I can see where gzip would help a spider immensely, because compressed pages mean faster downloading, which means even more pages - so theoretically you're increasing your crawl capability if the webmaster is willing to gzip the pages.
Maybe someone else knows but I'm unclear on whether, or how, you can selectively gzip in Apache based on the user agent.
It is certainly good for bots that support gzip (like us) that sites support it too - it helps crawl a lot more for less bandwidth on both sides; a win-win, really, if there ever was one.
Jim