There is an ongoing development effort and a mailing list, but that list is very techie.
The claim is that CLucene is faster than Lucene because it's written in C++.
The package doesn't have a crawler, nor does it presently have clustering. Can anyone recommend an open-source crawler?
Is there a CNutch on the horizon?
Being a C++ port of Lucene, it won't have a web crawler - Lucene is a full-text indexer, which is just one component of a search engine. Nutch integrates the Lucene code into a search engine and bolts on its own web crawler. One point to note about CLucene and Lucene is how you would integrate each into a web server - Lucene is easy, using a Tomcat servlet engine (it comes with a web template example), but how would you do this with a C++ version?
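To make "full-text indexer" concrete, here is a toy inverted index in C++. It is only a sketch of the general idea, not Lucene's or CLucene's actual API; the class name TinyIndex and its methods are made up for illustration, and it does none of the scoring, stemming or on-disk storage a real indexer does.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy inverted index (not the Lucene/CLucene API):
// maps each lowercase term to the IDs of the documents that contain it.
class TinyIndex {
public:
    // Splits text on non-alphanumeric characters and records each term.
    void add(int docId, const std::string& text) {
        std::string term;
        for (char c : text) {
            unsigned char uc = static_cast<unsigned char>(c);
            if (std::isalnum(uc)) {
                term += static_cast<char>(std::tolower(uc));
            } else if (!term.empty()) {
                postings_[term].insert(docId);
                term.clear();
            }
        }
        if (!term.empty()) postings_[term].insert(docId);
    }

    // Returns the sorted IDs of documents containing the given lowercase term.
    std::vector<int> search(const std::string& term) const {
        auto it = postings_.find(term);
        if (it == postings_.end()) return {};
        return std::vector<int>(it->second.begin(), it->second.end());
    }

private:
    std::map<std::string, std::set<int>> postings_;
};
```

Everything else a search engine needs (fetching pages, parsing HTML, serving results) sits outside a component like this, which is the gap Nutch fills on the Java side.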
As far as I know, there are no announced plans for a C++ Nutch. There are other ports of Lucene, notably Lupy, a Python version of Lucene.
Simon
I believe you are using Nutch?
From what I have been able to find, CLucene is structured slightly differently, besides being written in C++.
For me, the big attraction to CLucene is being able to run it on a server without Java.
Although CLucene does not have a crawler and I'm not sure about topic clustering.
I have found it almost impossible to get a Java developer to give me a ballpark price on rewriting the crawler and indexer for the niche I'm interested in.
So I figure that there are many more C++ programmers out there than Java programmers.
I may be wrong, of course.
Although CLucene does not have a crawler and I'm not sure about topic clustering.
You're right that CLucene doesn't have a crawler, but you could try larbin, a GPL'ed C++ web crawler, to do this for you. It's a bit old, but reliable. It uses a hash table to keep track of which pages it has already seen, and the default size will allow up to 64 million pages - you should be able to increase this, though.
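A rough sketch of how a fixed-size "seen pages" table can work, assuming a one-bit-per-slot design. This is not larbin's actual code; the class name SeenUrls and the use of std::hash are illustrative choices. The table size is set once at construction, which is why the capacity is a tunable limit.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical fixed-capacity "seen URL" table: one bit per hash slot.
// Illustrative only; names and hashing are assumptions, not larbin's code.
class SeenUrls {
public:
    // At least one slot is allocated to avoid a modulo by zero.
    explicit SeenUrls(std::size_t slots) : bits_(slots ? slots : 1, false) {}

    // Marks the URL as seen. Returns true if its slot was previously unset
    // (the URL is treated as new), false if the slot was already set.
    bool insert(const std::string& url) {
        std::size_t slot = std::hash<std::string>{}(url) % bits_.size();
        if (bits_[slot]) return false;  // seen before, or a hash collision
        bits_[slot] = true;
        return true;
    }

    std::size_t capacity() const { return bits_.size(); }

private:
    std::vector<bool> bits_;  // packed bits: 64 million slots is about 8 MB
};
```

The trade-off with this kind of table is that two different URLs can hash to the same slot, so a few unseen pages get skipped as the table fills up. A bigger table makes that rarer at the cost of memory.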
With Heritrix, Nutch and the basic Java crawlers and the support for them, I don't see any reason to rewrite in C/C++.
Heritrix is free and gathers up the content for Archive.org, so it scales pretty well. Nutch is being rewritten on MapReduce, so even that can grow really well.
old_expat " .. to find help rewriting the crawler and indexer."
Byron "With Heritrix, Nutch and the basic Java crawlers and the support for them, I don't see any reason to rewrite in C/C++."
I'm speaking of rewriting the crawler and indexer (maybe my wrong choice of words) to crawl only the niche I want to crawl. I don't want a full web engine, but a niche engine.
I have spent some time on the Nutch mailing lists trying to get help and have been mainly ignored. It seems to be open source but a somewhat closed community.
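For what it's worth, restricting a crawl to a niche is often just a filter applied to each URL before it is queued, rather than a rewrite of the crawler. A minimal sketch in C++, assuming the niche can be described as a list of allowed domains; hostOf and inNiche are made-up helper names, not part of Nutch, larbin or any other crawler, and the parsing assumes lowercase absolute URLs with no user:password part.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Extracts the host from an absolute URL such as "http://host/path".
// Returns "" if the URL has no "://" separator. (Hypothetical helper.)
std::string hostOf(const std::string& url) {
    const std::string sep = "://";
    std::size_t start = url.find(sep);
    if (start == std::string::npos) return "";
    start += sep.size();
    std::size_t end = url.find_first_of("/:?#", start);
    if (end == std::string::npos) return url.substr(start);
    return url.substr(start, end - start);
}

// True if the URL's host equals an allowed domain or is a subdomain of one.
bool inNiche(const std::string& url,
             const std::vector<std::string>& allowedDomains) {
    const std::string host = hostOf(url);
    if (host.empty()) return false;
    for (const std::string& domain : allowedDomains) {
        if (host == domain) return true;
        if (host.size() > domain.size() &&
            host.compare(host.size() - domain.size(), domain.size(), domain) == 0 &&
            host[host.size() - domain.size() - 1] == '.') {
            return true;  // e.g. "www.example.org" matches "example.org"
        }
    }
    return false;
}
```

A crawler loop would call inNiche on every extracted link and drop anything that returns false. Filtering by page content or keywords instead of by domain would need a different check in the same place.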
It has pretty much always gone like this:
"Can you give me an order of magnitude for .. "
"I charge $150 per hour."
"Can you estimate the number of hours? .. hello .. hello?"
Feel free to PM me. I don't think switching to C/C++ will find you more developers, though :)
The list generally won't hand-hold you, because we get tons of people asking for that when the docs are good enough to get people going. Best to build a search, test it, implement it, and then get it working the way you want once you understand the process.
Indexing times were 13 min. for IXE, 6 min. for Zettair and 4 hours for Lucene.
- [cs.yorku.ca...]
Seems odd that Lucene is that much slower. Data size and search time aren't that different.