ZydoSEO - 3:44 pm on Dec 11, 2012 (gmt 0)
I didn't mean to imply in any way that crawlers crawl sites exactly the way humans do. I may have oversimplified my explanation, but I didn't feel the need to explain how URLs discovered on a site might be queued up for crawling, get crawled, and get indexed.
And I too have heard Cutts say on numerous occasions that PR has a lot to do not only with how often sites/URLs are crawled, but also with how many and which pages from a site get indexed. And yes, internal linking structures (navigation links) DEFINITELY play a part in which pages are crawled and indexed, as they are strong signals as to which pages the webmaster deems most important, second in importance, etc., and they control the flow of most PR around the site.
But I think everyone will agree that there is some scheduled frequency at which a site's URLs are crawled. Over time a site's URLs likely get divided into sets with different crawl frequencies, but when a site is new with no inbound links, the PR of every URL on the site is infinitesimally small. So nearly all of its URLs fall into the same "set".
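To illustrate the idea, here's a toy sketch of that bucketing. To be clear, the tiers, thresholds, and intervals below are invented for illustration, not anything Google has published:

```python
# Toy sketch: bucketing URLs into crawl-frequency "sets" by an importance
# score (PR-like). All thresholds/intervals are made up for illustration.

def assign_crawl_interval(score: float) -> int:
    """Return a crawl interval in days for a URL's importance score."""
    if score > 0.5:      # high-importance pages (e.g., the home page)
        return 1         # crawl roughly daily
    if score > 0.05:     # mid-tier pages
        return 7         # crawl roughly weekly
    return 30            # low/unknown importance: the monthly deep-crawl set

# A brand-new site with no inbound links: every URL scores near zero,
# so everything lands in the same monthly set.
site_urls = {"/": 0.001, "/about": 0.0005, "/products/widget": 0.0003}
tiers = {url: assign_crawl_interval(score) for url, score in site_urls.items()}
print(tiers)  # {'/': 30, '/about': 30, '/products/widget': 30}
```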
This seems to be blatantly obvious for new sites or those with few or no links. For new sites this frequency might start out at something like once per month. Huge spikes in crawling activity show up in WMT and server logs periodically (like once every 3-5 weeks), though a few pages (maybe the home page and a few of the 1st-level pages) might get crawled in between. I think these spikes are the scheduled "carpet-crawling" events to which 1script was referring, and the scheduled deep-crawl events I referred to, where seemingly most of the URLs known for that site get crawled in a short burst.
This is likely because they were all queued up at the same time based on the crawling algo's schedule for that domain or "set" of pages for that domain. I think these queued-up URLs simply become "seed" URLs for the crawling process. But the crawler likely has some liberty, on the fly, not just to crawl those specific seed URLs from the queue, but also to follow links on those seed URLs to crawl other URLs not yet scheduled to be crawled... perhaps under certain circumstances even to crawl, for example, any URL within X hops of that seed URL.
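In code, that "seed queue plus X hops of liberty" might look something like this minimal sketch. MAX_HOPS, fetch(), and extract_links() are hypothetical stand-ins for whatever the real machinery does:

```python
from collections import deque

# Sketch of a crawler that takes scheduled seed URLs from a queue but is
# allowed to wander up to MAX_HOPS links away from each seed.
MAX_HOPS = 2  # hypothetical exploration limit

def crawl_from_seeds(seed_urls, fetch, extract_links):
    visited = set()
    queue = deque((url, 0) for url in seed_urls)  # (url, hops from seed)
    while queue:
        url, hops = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                 # fetch the scheduled/seed URL
        if hops < MAX_HOPS:               # discretionary exploration
            for link in extract_links(page):
                queue.append((link, hops + 1))
    return visited
```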
As you mentioned, freshness is definitely a factor in crawl frequency. If Googlebot returns for a deep crawl in a month and no new pages or updated content are discovered... see ya next month (for the most part)! But if hundreds of new pages are discovered, they'll likely return sooner the next time... maybe in two weeks. These adjustments in scheduled crawl frequency continue until the crawling algorithm finds a balance between the rate at which your site generates content and the rate at which they deep crawl your site.
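That feedback loop is easy to sketch too. Again, the constants are illustrative guesses, not Google's actual values:

```python
# Sketch of the adjustment described above: shorten the revisit interval
# when a deep crawl finds lots of new/changed pages, lengthen it when it
# finds nothing, until it balances with the site's publishing rate.

MIN_INTERVAL_DAYS, MAX_INTERVAL_DAYS = 3, 60  # invented bounds

def next_crawl_interval(current_days: int, new_or_changed_pages: int) -> int:
    if new_or_changed_pages > 100:        # site is producing content fast
        return max(MIN_INTERVAL_DAYS, current_days // 2)   # come back sooner
    if new_or_changed_pages == 0:         # nothing new: see ya next month
        return min(MAX_INTERVAL_DAYS, current_days * 2)    # back off
    return current_days                   # rate seems balanced; keep it

interval = 30                                  # start roughly monthly
interval = next_crawl_interval(interval, 250)  # burst of new pages -> 15 days
interval = next_crawl_interval(interval, 120)  # still lots of new pages -> 7
```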
However, I think external links also play a big role in crawl frequency. From what I've seen, they trigger incremental, partial crawls of small sections of your site in between those URLs' scheduled crawls. Not only do the known URLs on your site get queued up to be crawled periodically based on their scheduled crawls, but the URLs on other sites that link to your site also get queued up for their own scheduled crawling in between your URLs' scheduled crawl events. And when crawling those external URLs that link to your URL, I do believe the crawler often takes the liberty to crawl not only the page on your site being linked to by that external "seed" URL, but also one or more pages on your site in close proximity to the URL being linked to.
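Building on the hop-limited sketch above, an externally triggered partial crawl might amount to nothing more than this. is_our_site() is a hypothetical helper, and crawl_from_seeds() is the sketch from earlier:

```python
# Sketch of an externally triggered partial crawl (illustrative only):
# when an external page linking to your site gets its own scheduled crawl,
# each linked URL on your site is treated as a fresh seed and crawled with
# the same hop-limited exploration, pulling in "close proximity" pages.

def on_external_seed_crawled(external_page, fetch, extract_links):
    for target_url in extract_links(external_page):
        if is_our_site(target_url):        # is_our_site() is hypothetical
            crawl_from_seeds([target_url], fetch, extract_links)
```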
Honestly, why would they call it a crawler if it were not allowed to "explore" in an attempt to discover new content? It seems they could simply have labeled it a "fetcher" that reads a fixed set of URLs from a queue and fetches the documents at those addresses. I am pretty sure that prioritized list that gets queued up is simply a seed list to make sure the various URLs "at least" get crawled every so often, but links from other sites can trigger them to be crawled more frequently.
Perhaps I'm wrong, but that has been my experience. Think I'll do some testing with some brand new domains and track Googlebot activity over the first 6 months as they go from no links to having links to see if anything can be learned.