Hyperseek and Dataparksearch rely on relational databases to store and search the index. This type of setup typically doesn't scale to more than a few million documents, but it is easier to install and develop for.
Nutch, on the other hand, is trying to scale so that it can index the entire net.
If you are planning to have people review all the pages by hand, then you probably won't get to millions of pages, and can therefore use a simpler system than Nutch.
htdig is not for web indexing and says so in the FAQ. Both Swish-e and htdig are for fairly small databases. The next step up in scalability would be Mnogosearch blob mode and Dataparksearch cache mode, both of which are used for niche web indexing.
Dataparksearch was a branch of Mnogosearch and continued development of cache mode. Meanwhile Mnogosearch went in another direction with blob mode.
I believe ASPseek was once a branch of Mnogosearch, or vice versa. Its development and support died years ago, though many people still use it. Mnogosearch and Dataparksearch are both very actively supported.
We're working on a niche search engine built on Nutch. As noted above, it's not for the faint of heart. It is, however, very scalable and will easily handle tens of millions of docs. In addition, the user mailing lists are populated with some very friendly and helpful folks - including at least one other member from here.
So far we're still crawling and haven't launched yet. That means the bulk of our traffic is incoming (from the crawl) rather than outgoing, which is the opposite of most web servers. We've got a 20 Mbps feed to our server that we negotiated with our ISP for this purpose, and because most of this traffic runs opposite to the norm they were able to give us a heavily discounted rate. I'd expect many ISPs to work the same way. (Our current server setup with Nutch will use that entire 20 Mbps feed when crawling. We can scale that up or down just by changing the number of open threads used in the crawl.)
In terms of hardware, we were using a P4 2.8 with 2 gigs of RAM and a couple of SATA hard drives. I found the results to be a bit slow, so we're upgrading to dual Xeon 3.06 processors, 8 gigs of RAM, and SCSI hard drives in a RAID 0 configuration. We'll also be investigating some caching tools like OSCache to help speed things up. A very rough estimate I've found is about 10 gigs of disk space per million pages indexed.
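To put that rough figure in perspective, here's a quick back-of-the-envelope sketch in Python. The only input taken from above is the 10 gigs per million pages estimate. The function name and the sample page counts are made up for illustration, and real usage will vary with page size and index settings.

```python
# Very rough rule of thumb quoted above: ~10 gigs per million pages indexed.
GIGS_PER_MILLION_PAGES = 10


def estimated_gigs(pages: int, gigs_per_million: float = GIGS_PER_MILLION_PAGES) -> float:
    """Scale the per-million estimate linearly to an arbitrary page count."""
    return pages / 1_000_000 * gigs_per_million


if __name__ == "__main__":
    # Sample crawl sizes (hypothetical, chosen only to show the scaling).
    for pages in (1_000_000, 10_000_000, 50_000_000):
        print(f"{pages:>12,} pages -> roughly {estimated_gigs(pages):,.0f} gigs")
```

This assumes disk use grows linearly with page count, which is only a first approximation.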
The other issue we're still working on is defining which pages qualify to be indexed (the tough part for all niche engines, I think). If you're building a SE for the country indicated in your profile, you might be able to filter for just those domains that match your country's extension. Failing that, things get a bit trickier and require some programming.
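As a minimal sketch of the country-extension idea, something like the following Python check could decide whether a URL belongs in the crawl. This is not how Nutch does it internally; it's just the logic in standalone form. The function name is hypothetical and "de" is only a placeholder for whatever your country's extension is.

```python
from urllib.parse import urlsplit


def in_country_tld(url: str, tld: str = "de") -> bool:
    """Return True if the URL's host name ends in the given country extension.

    "de" is a placeholder; pass your own country's extension instead.
    """
    host = urlsplit(url).hostname  # lower-cased host without port, or None
    if host is None:
        return False
    # Drop a trailing dot (fully qualified form) and compare the last label,
    # so "example.de.other.com" or a "/de/" path does not match by accident.
    return host.rstrip(".").endswith("." + tld.lower())


if __name__ == "__main__":
    candidates = [
        "http://www.example.de/page.html",
        "http://www.example.com/de/index.html",
        "http://example.de.other.com/",
    ]
    for url in candidates:
        print(url, in_country_tld(url))
```

Checking the parsed host name rather than the raw URL string avoids false matches on paths or subdomains that merely contain the extension. It will still miss sites for your country hosted under .com and the like, which is where the extra programming comes in.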
I've PM'ed you some other info that includes specific sites.