
Alternative Search Engines Forum

    
launch specific purpose search engine
using nutch open source engine or any other
vfilip · msg:463963 · 5:11 pm on Jan 28, 2006 (gmt 0)

The idea is to build the index manually from good resources on a specific theme, and have a spider crawl only those websites/webpages I specify.

Then users can search it like a regular search engine. They will get highly relevant content, because every site is guaranteed to be about this specific theme and was human-reviewed before being added to the index.

Has anybody tried Nutch, and what was the experience like? What about installation, resources, bandwidth, behaviour of the Nutch bot, costs, etc.?
Any alternatives to Nutch?
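For the "spider only the sites I specify" part, the Nutch releases of that era read seed URLs from a flat file and restricted the crawl with regex rules in conf/crawl-urlfilter.txt. A minimal sketch, with the two domain names as made-up placeholders:

```shell
# Hypothetical seed list and URL filter for a hand-curated Nutch crawl.
# The two domains below are placeholders for your human-reviewed sites.
mkdir -p urls conf

cat > urls/seeds.txt <<'EOF'
http://www.example-theme-site.com/
http://www.another-reviewed-site.org/
EOF

cat > conf/crawl-urlfilter.txt <<'EOF'
# skip common binary/asset file types
-\.(gif|jpg|png|css|js|zip|exe)$
# accept only the approved domains (and their subdomains)
+^http://([a-z0-9-]+\.)*example-theme-site\.com/
+^http://([a-z0-9-]+\.)*another-reviewed-site\.org/
# reject everything else
-.
EOF
```

The filter is applied top to bottom and the first matching rule wins, so the final `-.` line rejects any URL that is not explicitly accepted, which keeps the crawler from wandering off the reviewed sites.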

 

treeline · msg:463964 · 9:05 pm on Jan 28, 2006 (gmt 0)

I've used Hyperseek to run a niche search engine for many years, and I really like it. No experience with Nutch.

Kahless · msg:463965 · 11:23 pm on Jan 28, 2006 (gmt 0)

I looked at Nutch and was overwhelmed by the learning curve. I found Dataparksearch to be feature-rich; it scales well and is fairly easy to configure and support.

It all depends on your budget, how many documents you intend to index, the features you need, and how much time you have to master the software.

mooret · msg:463966 · 11:40 pm on Jan 28, 2006 (gmt 0)

Nutch is good, but you have to be a developer to really get the most out of it.

It's not a fully fledged search engine; rather, it's a set of tools for building one, and you have to tweak and play around with settings and code.

runarb · msg:463967 · 4:36 am on Jan 29, 2006 (gmt 0)

Hyperseek and Dataparksearch rely on relational databases to do the searching and to store the index. That type of setup typically doesn't scale to more than a few million documents, but it is easier to install and develop with.

Nutch, on the other hand, is trying to scale so it can index the entire net.

If you are planning to have humans review all the pages, then you probably won't get to millions of pages, and can therefore use a simpler system than Nutch.

Other systems I can think of are [htdig.org...], [mnogosearch.org...], [swish-e.org...]

Kahless · msg:463968 · 10:18 pm on Jan 29, 2006 (gmt 0)


Other systems I can think of are [htdig.org...], [mnogosearch.org...], [swish-e.org...]

htdig is not for web indexing and says so in its FAQ. Both Swish-e and ht-dig are intended for fairly small databases. The next step up in scalability would be Mnogosearch's blob mode and Dataparksearch's cache mode, both of which are used for niche web indexing.

Dataparksearch was a branch of Mnogosearch that continued development of cache mode, while Mnogosearch went in another direction with blob mode.

I believe ASPseek was once a branch of Mnogosearch, or vice versa. Its development and support died years ago, but many people still use it. Mnogosearch and Dataparksearch are both very actively supported.

wheel · msg:463969 · 4:29 pm on Feb 20, 2006 (gmt 0)

We're working on a niche search engine built on Nutch. As noted above, it's not for the faint of heart. It is, however, very scalable and will easily handle tens of millions of docs. In addition, the user mailing lists are populated with some very friendly and helpful folks, including at least one other member from here.

So far we're still crawling and haven't launched yet. That means the bulk of our traffic is incoming (from the crawl) rather than outgoing, which is the opposite of most webservers. We've got a 20 Mbps feed towards our server that we negotiated with our ISP for this purpose, and because most of this traffic runs opposite to the norm, they were able to give us a heavily discounted rate. I'd expect many ISPs to work the same way. (Our current server setup with Nutch will use that entire 20 Mbps feed when crawling. We can scale that up or down just by changing the number of open threads used in the crawl.)
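The thread count wheel mentions is the knob to turn here. A hedged sketch of the old one-shot Nutch crawl command (NUTCH_HOME, the paths, and the thread value are all illustrative; the command is printed rather than executed):

```shell
# Hypothetical invocation of the Nutch 0.7-era one-shot "crawl" command.
# Tune THREADS to fill (or spare) your inbound bandwidth, as described above.
NUTCH_HOME=${NUTCH_HOME:-/opt/nutch}
THREADS=50
CMD="$NUTCH_HOME/bin/nutch crawl urls -dir crawl -depth 3 -threads $THREADS"
echo "$CMD"   # printed only, since this is a sketch
```

Fewer threads means fewer simultaneous fetches and proportionally less bandwidth; raising the number has the opposite effect until the pipe is saturated.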

In terms of hardware, we were using a P4 2.8 GHz with 2 GB of RAM and a couple of SATA hard drives. I found the results to be a bit slow, so we're upgrading to dual 3.06 GHz Xeon processors, 8 GB of RAM, and SCSI hard drives in a RAID 0 configuration. We'll also be investigating some caching tools like OSCache to help speed things up. A very rough estimate I've found is about 10 GB per million pages indexed.
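That rule of thumb makes capacity planning a one-liner. A sketch, assuming the ~10 GB per million pages figure holds (the 20-million-page target is an arbitrary example):

```shell
# Back-of-envelope disk sizing from the ~10 GB per million pages estimate.
PAGES=20000000        # hypothetical target corpus size
GB_PER_MILLION=10     # rough figure quoted above
STORAGE_GB=$(( PAGES / 1000000 * GB_PER_MILLION ))
echo "Estimated on-disk size: ${STORAGE_GB} GB"   # -> 200 GB for 20M pages
```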

The other issue we're still working on is defining which pages qualify to be indexed (the tough part for all niche engines, I think). If you're building a search engine for the country indicated in your profile, you might be able to filter for just those domains that match your country's extension. Failing that, things get a bit trickier and require some programming.
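A country-extension filter like the one described above can be a single regex. A sketch, using .no purely as an example TLD; the pattern would live in Nutch's URL filter configuration, but here it is exercised with grep:

```shell
# Hypothetical country filter: keep only URLs under one TLD (.no as an example).
TLD_PATTERN='^https?://([a-z0-9-]+\.)+no(/|$)'

for url in 'http://www.uio.no/' 'http://example.com/page'; do
  if echo "$url" | grep -Eq "$TLD_PATTERN"; then
    echo "keep $url"
  else
    echo "skip $url"
  fi
done
```

In a crawl-urlfilter.txt the same idea would be an accept rule such as `+^http://([a-z0-9-]+\.)+no/` followed by a catch-all `-.` reject line.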

I've PM'ed you some other info that includes specific sites.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved