The user can then search just as with a regular search engine. The results will be highly relevant, because every site in the index is guaranteed to be about this specific topic and has been reviewed and added by a human.
Has anybody tried Nutch, and what was your experience with it? What about installation, resources, bandwidth, behaviour of the Nutch bot, costs, etc.?
Any alternatives to Nutch?
It all depends on your budget, how many documents you intend to index, the features you need, and how much time you have to master the software.
Nutch, on the other hand, is trying to scale so that it can index the entire net.
If you are planning to have humans review all the pages, then you probably won't get to millions of pages, and can therefore use a simpler system than Nutch.
Other systems I can think of are [htdig.org...], [mnogosearch.org...], and [swish-e.org...]
htdig is not for web indexing and says so in the FAQ. Both Swish-e and htdig are for fairly small databases. The next step up in scalability would be Mnogosearch blob mode and Dataparksearch cache mode, both of which are used for niche web indexing.
Dataparksearch was a branch of Mnogosearch and continued development of cache mode. Meanwhile Mnogosearch went in another direction with blob mode.
I believe ASPseek was once a branch of Mnogosearch, or vice versa. Its development and support died years ago, but many people still use it. Mnogosearch and Dataparksearch are both very actively supported.
So far we're still crawling and haven't launched yet. That means the bulk of our traffic is incoming (from the crawl) rather than outgoing, which is the opposite of most web servers. We negotiated a 20 Mbps feed to our server with our ISP for this purpose, and because most of this traffic runs opposite to the norm they were able to give us a heavily discounted rate. I'd expect many ISPs to work the same way. (Our current server setup with Nutch will use that entire 20 Mbps feed when crawling. We can scale that up or down just by changing the number of open threads used in the crawl.)
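For anyone wanting to sanity-check what a feed like that buys you, here is a rough back-of-the-envelope sketch. It assumes the figure above means 20 megabits per second, and the 20 KB average page size is purely my own placeholder, not a measured number; the function name is made up for illustration.

```python
# Rough crawl-throughput estimate. Assumptions (not from the post above):
# the feed is measured in megabits per second, and the average fetched
# page is about 20 KB. Plug in your own measured values.

SECONDS_PER_DAY = 86_400


def pages_per_day(feed_mbps: float, avg_page_bytes: float) -> float:
    """Upper bound on pages fetched per day if the feed is fully saturated."""
    bytes_per_second = feed_mbps * 1_000_000 / 8  # megabits -> bytes
    return bytes_per_second * SECONDS_PER_DAY / avg_page_bytes


if __name__ == "__main__":
    estimate = pages_per_day(feed_mbps=20, avg_page_bytes=20_000)
    print(f"{estimate:,.0f} pages/day at best")
```

This is a ceiling only. Politeness delays, slow hosts, and DNS lookups will pull the real number down, which is where tuning the thread count comes in.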
In terms of hardware, we were using a P4 2.8 with 2 gigs of RAM and a couple of SATA hard drives. I found the results to be a bit slow, so we're upgrading to dual Xeon 3.06 processors, 8 gigs of RAM, and SCSI hard drives in a RAID 0 configuration. We'll also be investigating some caching tools like OSCache to help speed things up. A very rough estimate I've found is about 10 gigs per million pages indexed.
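To turn that rule of thumb into a planning number, a minimal sketch follows. The only input taken from the post is the "about 10 gigs per million pages" figure, which is itself very rough; the function name and the example page count are hypothetical.

```python
# Storage planning from the rough "10 GB per million pages" rule of thumb.
# Treat the constant as a starting guess; measure your own index and adjust.

GB_PER_MILLION_PAGES = 10.0


def estimate_index_size_gb(pages: int,
                           gb_per_million: float = GB_PER_MILLION_PAGES) -> float:
    """Estimated disk space in GB for an index of `pages` documents."""
    return pages / 1_000_000 * gb_per_million


if __name__ == "__main__":
    # Hypothetical niche index of 5 million pages.
    print(estimate_index_size_gb(5_000_000), "GB")
```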
The other issue we're still working on is defining which pages qualify to be indexed (the tough part for all niche engines, I think). If you're building an SE for the country indicated in your profile, you might be able to filter for just those domains that match your country's extension. Failing that, things get a bit trickier and require some programming.
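The country-extension filter could look something like the sketch below. This is not Nutch's own filtering mechanism, just a standalone illustration of the idea; the function names are made up, and "nl" is only an example extension since I don't know which country applies.

```python
# Keep only URLs whose hostname ends in a given country-code extension.
# Standalone illustration; "nl" below is just an example extension.

from urllib.parse import urlsplit


def host_matches_tld(url: str, tld: str) -> bool:
    """True if the URL's hostname ends with the given extension (e.g. 'nl')."""
    host = urlsplit(url).hostname  # None when the URL has no network location
    if host is None:
        return False
    suffix = "." + tld.lower().lstrip(".")
    return host.rstrip(".").endswith(suffix)


def filter_by_tld(urls, tld):
    """Return the URLs that belong to the given country extension."""
    return [u for u in urls if host_matches_tld(u, tld)]


if __name__ == "__main__":
    candidates = [
        "http://www.example.nl/page.html",
        "http://www.example.com/nl/page.html",  # 'nl' in the path only
        "http://shop.example.NL/",
    ]
    print(filter_by_tld(candidates, "nl"))
```

Matching on the hostname rather than the whole URL avoids false hits such as a ".com" site with "/nl/" in its path. Sites about the country that live on other extensions will still slip through, which is the trickier part.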
I've PM'ed you some other info that includes specific sites.