
Forum Moderators: bakedjake


Launch a specific-purpose search engine

using Nutch or any other open-source engine



5:11 pm on Jan 28, 2006 (gmt 0)

10+ Year Member

The idea is to manually build an index of good resources for a specific theme and have a spider crawl only the websites/webpages I specify.

Then the user can search as on a regular search engine. The user will get highly relevant results, because all the sites are guaranteed to be about this specific theme and are human-reviewed before being added to the index.

Has anybody tried Nutch, and what has your experience with it been? What about installation, resources, bandwidth, behaviour of the Nutch bot, costs, etc.?
Anything other than Nutch?
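As a rough illustration of the "human-reviewed index" idea (not tied to Nutch or any particular engine; the host names below are hypothetical), the core of such a spider is a scope check that only admits URLs from the approved list:

```python
from urllib.parse import urlparse

# Hand-reviewed sites that define the niche (hypothetical examples).
APPROVED_HOSTS = {"example-widgets.com", "widget-reviews.example.org"}

def in_index_scope(url):
    """Return True if the URL belongs to a human-approved host.

    Subdomains of an approved host (e.g. www.example-widgets.com)
    are also accepted, since the review covered the site as a whole.
    """
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in APPROVED_HOSTS)

print(in_index_scope("http://www.example-widgets.com/page.html"))  # True
print(in_index_scope("http://unrelated.example.net/"))             # False
```

A real crawler would apply this check to every discovered link before fetching it, so the crawl never wanders off the reviewed sites.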


9:05 pm on Jan 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

I've used Hyperseek to run a niche search engine for many years, and I really like it. No experience with Nutch.


11:23 pm on Jan 28, 2006 (gmt 0)

10+ Year Member

I looked at Nutch and was overwhelmed by the learning curve. I found Dataparksearch to be feature-rich; it scales well and is fairly easy to configure and support.

It all depends on your budget, how many documents you intend to index, the features you need, and how much time you have to master the software.


11:40 pm on Jan 28, 2006 (gmt 0)

10+ Year Member

Nutch is good, but you have to be a developer to really get the most out of it.

It's not a fully-fledged search engine; rather, it's a set of tools for building one, and you have to tweak and play around with settings and code.


4:36 am on Jan 29, 2006 (gmt 0)

10+ Year Member

Hyperseek and Dataparksearch rely on relational databases for searching and for storing the index. That type of setup typically doesn't scale to more than a few million documents, but it is easier to install and develop with.

Nutch, on the other hand, is designed to scale so it can index the entire net.

If you are planning to humanly review all the pages, then you probably won't get to millions of pages, and can therefore use a simpler system than Nutch.

Other systems I can think of are [htdig.org...] , [mnogosearch.org...] , [swish-e.org...]


10:18 pm on Jan 29, 2006 (gmt 0)

10+ Year Member

Other systems I can think of are [htdig.org...] , [mnogosearch.org...] , [swish-e.org...]

htdig is not for web indexing and says so in its FAQ. Both Swish-e and ht-dig are for fairly small databases. The next step up in scalability would be Mnogosearch's blob mode and Dataparksearch's cache mode, both of which are used for niche web indexing.

Dataparksearch was a branch of Mnogosearch that continued developing cache mode, while Mnogosearch went in another direction with blob mode.

I believe ASPseek was once a branch of Mnogosearch, or vice versa. Its development and support died years ago, but many people still use it. Mnogosearch and Dataparksearch are both very actively supported.


4:29 pm on Feb 20, 2006 (gmt 0)

WebmasterWorld Senior Member wheel is a WebmasterWorld Top Contributor of All Time 10+ Year Member

We're working on a niche search engine built on Nutch. As noted above, it's not for the faint of heart. It is, however, very scalable and will easily handle tens of millions of documents. In addition, the user mailing lists are populated with some very friendly and helpful folks, including at least one other member from here.

So far we're still crawling and haven't launched yet. That means the bulk of our traffic is incoming (from the crawl) rather than outgoing, which is the opposite of most web servers. We negotiated a 20 Mbps feed towards our server with our ISP for this purpose, and because most of this traffic flows in the opposite direction from the norm, they were able to give us a heavily discounted rate. I'd expect many ISPs to work the same way. (Our current Nutch setup will use that entire 20 Mbps feed when crawling. We can scale that up or down just by changing the number of open threads used in the crawl.)
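The thread count mentioned above is a Nutch configuration setting; in the 0.7/0.8-era releases it was the fetcher.threads.fetch property, overridden in conf/nutch-site.xml (treat the exact property name as version-dependent):

```xml
<!-- conf/nutch-site.xml: number of concurrent fetch threads.
     More threads means more inbound bandwidth consumed while crawling. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
</property>
```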

In terms of hardware, we were using a P4 2.8 with 2 GB of RAM and a couple of SATA hard drives. I found the results to be a bit slow, so we're upgrading to dual Xeon 3.06 processors, 8 GB of RAM, and SCSI hard drives in a RAID 0 configuration. We'll also be investigating some caching tools like OSCache to help speed things up. A very rough estimate I've found is about 10 GB of storage per million pages indexed.
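Taking that rough 10 GB per million pages figure at face value, the disk budget is easy to sketch (the numbers below are only that estimate scaled up, not measurements):

```python
GB_PER_MILLION_PAGES = 10  # rough rule of thumb quoted above

def index_size_gb(pages):
    """Estimated crawl + index size in GB for a given page count."""
    return pages / 1_000_000 * GB_PER_MILLION_PAGES

print(index_size_gb(10_000_000))  # 100.0 -> roughly 100 GB for 10M pages
```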

The other issue we're still working on is defining which pages qualify to be indexed (the tough part for all niche engines, I think). If you're building a search engine for the country indicated in your profile, you might be able to filter for just those domains that match your country's extension. Failing that, things get a bit trickier and require some programming.
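For the country-extension approach, the 0.7-era Nutch crawl tool read its URL rules from conf/crawl-urlfilter.txt (regex lines, first matching rule wins; the file name and format vary in later versions). A sketch, using .ca as a stand-in extension:

```
# conf/crawl-urlfilter.txt -- keep the crawl inside one country TLD.
# '+' accepts, '-' rejects; rules are tried top to bottom.
+^http://([a-z0-9\-]+\.)*[a-z0-9\-]+\.ca(/|$)
# reject everything else
-.
```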

I've PM'ed you some other info that includes specific sites.

