TypicalSurfer - 12:36 pm on Oct 9, 2012 (gmt 0)
Your initial crawl could just be a simple, brute-force pass by TLD; that would give you a good base, a decent document collection. Since large-scale search indexes are distributed across multiple machines, you would have to split the collection accordingly, at roughly a million pages per gig of RAM. So a machine with 24G of physical memory could hold about 24 million documents and serve queries without disk seeks (newer SSD disks can help here as well); that's how you maintain speed.
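Just to make the capacity math concrete, here is a minimal sketch in Python. The 1-million-docs-per-GB figure is the rule of thumb from above, and the shard_for() helper and the 500M-page crawl size are hypothetical, only there to show how you'd split documents across machines so each index fits in RAM.

import hashlib

DOCS_PER_GB = 1_000_000  # rule of thumb: ~1M pages per gig of RAM

def shards_needed(total_docs, ram_gb_per_machine):
    # how many machines it takes to keep the whole index in memory
    capacity = ram_gb_per_machine * DOCS_PER_GB
    return -(-total_docs // capacity)  # ceiling division

def shard_for(url, num_shards):
    # stable hash-partitioning of a URL onto one shard
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

machines = shards_needed(500_000_000, ram_gb_per_machine=24)  # hypothetical 500M-page crawl
print(machines, "machines of 24G each")   # -> 21
print(shard_for("http://example.com/", machines))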
Since you are really just storing separate indexes, you can run separate crawls and update individual collections independently. This is where you could do more "nook and cranny" type crawling, bringing new or relevant URLs into an individual index; a sketch of that follows below.
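A rough sketch of the per-collection idea, assuming each collection keeps its own small inverted index and can be refreshed without touching the others. The collection names and the fetch_and_tokenize() helper are hypothetical, just for illustration.

from collections import defaultdict

class Collection:
    def __init__(self, name):
        self.name = name
        self.inverted = defaultdict(set)   # term -> set of URLs

    def add(self, url, terms):
        for term in terms:
            self.inverted[term].add(url)

# separate indexes, e.g. one per TLD plus a niche collection
collections = {name: Collection(name) for name in ("com", "org", "niche-forums")}

def recrawl(collection, urls, fetch_and_tokenize):
    # bring new or refreshed URLs into one collection only
    for url in urls:
        collection.add(url, fetch_and_tokenize(url))

# e.g. recrawl(collections["niche-forums"], new_urls, fetch_and_tokenize)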
There are still some human-edited directories that would make a good starting point for seed URLs.