Forum Moderators: open
Assumption: I have a database of urls as a starting point.
1. I would start with the first domain, crawl its index page, add every link to the same domain to one list (call it List A) for immediate crawling, and add every link to an outside domain to List B for future crawling.
2. Working through List A, skip any link I already crawled, follow the link, and go to step 1.
3. Once I'm done with this domain, move on to the next domain in my list. Go to step 1.
4. Once I'm done with all the domains that were in my database to begin with, start crawling the domains from List B. Go to step 1.
Once the domains are done, compute PageRank etc., run the new index through the standard battery of tests (especially the top few hundred search terms), and go live.
5. Send a freshness bot to recheck all sites with a timestamp within the last couple of days, and decide how frequently to revisit each. Don't touch the PR until next month, though.
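The steps above can be sketched in code. This is only a toy illustration of the two-list idea, not how any real engine does it: `fetch` stands in for the actual HTTP fetch plus link extraction, and the function and variable names are my own invention.

```python
from collections import deque
from urllib.parse import urlparse

def crawl(seed_urls, fetch):
    """Crawl each seed domain breadth-first (List A), queueing
    links to outside domains (List B) for a later pass.

    fetch(url) is assumed to return the list of links on that page.
    """
    visited = set()                     # step 2: never crawl a link twice
    crawled_order = []
    seeds = deque(seed_urls)
    seen_domains = set(urlparse(u).netloc for u in seed_urls)
    list_b = deque()                    # cross-domain links for future crawling

    while seeds:
        start = seeds.popleft()
        domain = urlparse(start).netloc
        list_a = deque([start])         # same-domain links, crawled immediately
        while list_a:
            url = list_a.popleft()
            if url in visited:
                continue
            visited.add(url)
            crawled_order.append(url)
            for link in fetch(url):     # step 1: sort links into the two lists
                if urlparse(link).netloc == domain:
                    list_a.append(link)
                else:
                    list_b.append(link)
        # step 4: once the original seed list runs dry, promote List B
        if not seeds:
            while list_b:
                link = list_b.popleft()
                d = urlparse(link).netloc
                if d not in seen_domains:
                    seen_domains.add(d)
                    seeds.append(link)
    return crawled_order
```

The `visited` set is what keeps this from crawling into an infinite loop when pages link back to each other.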
How did I do?
I dabbled in a few languages like PHP/Perl and some databases, but I'm not looking for a job, sorry.
There is of course a lot more to a crawler than that. They need to check the status of pages, worry about whether they are crawling human-generated (or "based on human-generated") content, and make sure they aren't crawling into an infinite loop... not to mention a million spam-related issues. I'm sure it's really, really tough. I bet 10% of the job is dealing with the WWW's good citizens, who make up 99.9% of the people, and 90% is dealing with the 0.1% of spammers and the 0.01% of very, very clever spammers with lots of very, very clever techniques up their sleeves.
If I were them, I'd cut off crawling any content within a domain beyond a certain number of links and have a human check those to make sure it isn't dynamically generated.
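That cutoff could be as simple as a per-domain page budget. A minimal sketch, with an assumed default of 5,000 pages (the number, function name, and review queue are all hypothetical):

```python
from collections import defaultdict

pages_seen = defaultdict(int)   # pages crawled so far, per domain
review_queue = []               # domains flagged for a human check

def should_crawl(domain, max_pages=5000):
    """Return True while the domain is under its page budget.
    Once it goes over, stop crawling and queue it for a human to
    check whether the content is endless auto-generated filler."""
    pages_seen[domain] += 1
    if pages_seen[domain] > max_pages:
        if domain not in review_queue:
            review_queue.append(domain)
        return False
    return True
```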
<added>BTW, "dynamically generated" is a phrase like "search engine optimization" - why is it good? why is it bad? I wouldn't like to have my dynamically generated sites cut off! My content sits inside a well-organized database; that doesn't mean it's no good. But if you mean dynamically generated subdomains (wildcards) or generated doorway / simply cheating pages, I'd agree!</added>
I think we may have a little miscommunication...I wasn't offended at all. I was just saying that I know very little about seo, I'm just enjoying this for the fun of it...I like to understand the technology...
I'm also not building a robot or crawler. Google seems to do a great job :)
But I'm just curious how their crawler works...
Sounds like you have built your own crawler? Just for fun? You have a search engine?