CLucene .

.. anyone using it?

         

old_expat

5:30 am on Jan 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I ran across this recently.

There is an ongoing development effort and a mailing list, but that list is very techie.

The claim is that CLucene is faster than Lucene because it's written in C++.

The package doesn't have a crawler, nor does it presently have clustering. Can anyone recommend an open-source crawler?

Is there a CNutch on the horizon?

simon2263

9:27 am on Jan 13, 2006 (gmt 0)

10+ Year Member



It also doesn't have threading, although I don't know how this affects performance. I'm a little suspicious about claims that program X written in C++ is faster than X in Java just because C++ is compiled and Java isn't - it also depends on how well the program is written.

Being a C++ port of Lucene, it won't have a web crawler. Lucene is a full-text indexer, which is just one component of a search engine. Nutch integrates the Lucene code into a search engine and bolts on its own web crawler. One point to note about CLucene and Lucene is how you would integrate each into a web server: Lucene is easy, using a Tomcat servlet engine (it comes with a web template example), but how would you do this with a C++ version?
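
To make the indexer-versus-crawler split concrete, here is a minimal Python sketch of what a full-text indexer does: it builds an inverted index from text it is handed and answers term queries. This is not Lucene's or CLucene's API; the class name `TinyIndex` and its methods are made up for illustration, and fetching pages is deliberately left out because that is the crawler's job.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each term to the set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Lowercase and split on non-alphanumerics; real analyzers do much more.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND query: return ids of documents containing every query term.
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

A crawler (or any other source of documents) would call `add()` for each page it fetches; the search front end would only ever call `search()`.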

As far as I know, there are no announced plans for a C++ Nutch. There are other ports of Lucene, notably Lupy, a Python version of Lucene.

Simon

ByronM

7:40 pm on Jan 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I use Java Lucene and am working towards indexing over a billion documents. I would say it's fairly fast and scalable.

old_expat

11:14 pm on Jan 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



" ... java lucene ..."

I believe you are using Nutch?

From what I have been able to find, CLucene is structured slightly differently, besides being written in C++.

For me, the big attraction of CLucene is being able to run it on a server without Java.

Although CLucene does not have a crawler and I'm not sure about topic clustering.

I have found it almost impossible to get a Java developer to give me a ballpark price on rewriting the crawler and indexer for the niche I'm interested in.

So I figure that there are many more C++ programmers out there than Java programmers.

I may be wrong, of course.

old_expat

11:19 pm on Jan 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



".. Lupy .."

"The Lupy project has been RETIRED! For full-text indexing and search we recommend Xapwrap or PyLucene instead."

On the CLucene mailing list, some comments suggest that one of the Python wrappers is considerably slower.

ByronM

3:51 am on Jan 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is there a reason you would use C over the java implementation?

simon2263

9:34 am on Jan 18, 2006 (gmt 0)

10+ Year Member




"Although CLucene does not have a crawler and I'm not sure about topic clustering."

You're right that CLucene doesn't have a crawler, but you could try larbin, a GPL'ed C++ web crawler, to do this for you. It's a bit old, but reliable. It uses a hash table to keep track of which pages it has already seen, and the default size will allow up to 64 million pages. You should be able to increase this, though.
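
The "hash table of seen pages" idea can be sketched in a few lines of Python. This only illustrates the general technique, a fixed-size bit table indexed by a hash of the URL; it is not larbin's code, and the class name, hash function, and default size are assumptions. Note the trade-off: with one bit per slot, two different URLs can collide, and the second would then wrongly be treated as already seen, so the table should be sized well above the number of pages you expect to crawl.

```python
import hashlib


class SeenUrls:
    """Fixed-size bit table recording which URLs a crawler has already visited."""

    def __init__(self, size_bits=64 * 1024 * 1024):
        # Assumed default: 64 Mi slots, which costs about 8 MB of memory.
        self.size_bits = size_bits
        self.bits = bytearray(size_bits // 8 + 1)

    def _slot(self, url):
        # Hash the URL to a stable slot number in [0, size_bits).
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.size_bits

    def check_and_mark(self, url):
        """Return True if the URL was already marked; mark it either way."""
        slot = self._slot(url)
        byte, mask = slot // 8, 1 << (slot % 8)
        seen = bool(self.bits[byte] & mask)
        self.bits[byte] |= mask
        return seen
```

A crawl loop would call `check_and_mark(url)` before fetching and skip the URL when it returns `True`. Raising `size_bits` is the equivalent of increasing the table size mentioned above.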

old_expat

3:53 pm on Jan 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



" Is there a reason you would use C over the java implementation?"

No Java requirement on the server and, I think, it's easier to find help rewriting the crawler and indexer.

ByronM

6:28 pm on Jan 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, a JVM is easy to install; in fact, it's easier to support than the potential glibc problems you face in the C/C++ world (unless you compile everything from source).

With Heritrix, Nutch and the basic Java crawlers and support thereof, I don't see any reason to rewrite in C/C++.

Heritrix is free and gathers up the content for Archive.org, so it scales pretty well. Nutch is being rewritten around MapReduce, so even that can grow really well.

old_expat

1:18 am on Jan 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Byron, I'll defer to your knowledge on Java server vs. non-Java, except to say that getting started costs more: you need a server with Tomcat and enough RAM to run it. Finding a shared Java server is much more difficult, in my experience.

old_expat " .. to find help rewriting the crawler and indexer."

Byron "With Heritrix, Nutch and the basic java crawlers and support thereof i don't see any reason to re-write in c/c++."

I'm speaking of rewriting the crawler and indexer (maybe my wrong choice of words) to crawl only the niche I want to crawl. I don't want a full web engine, but a niche engine.
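
As a rough illustration of what "crawl only the niche" could mean in practice, a crawler can run every discovered URL through a filter before fetching it. The Python sketch below is an assumption about one possible approach, not how Nutch, Heritrix, or larbin do it; the host list, keyword list, and the function name `in_niche` are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical niche definition: hosts to stay on and keywords to look for.
ALLOWED_HOSTS = {"example.org", "forum.example.net"}
NICHE_KEYWORDS = ("lucene", "search", "index")


def in_niche(url):
    """Return True if the crawler should fetch this URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    # Accept an allowed host or any subdomain of one.
    on_allowed_host = any(host == h or host.endswith("." + h) for h in ALLOWED_HOSTS)
    if not on_allowed_host:
        return False
    # Keep only URLs whose path mentions a niche keyword.
    path = parsed.path.lower()
    return any(k in path for k in NICHE_KEYWORDS)
```

The point is that narrowing a crawl is often a matter of filtering rules rather than a rewrite of the crawler itself.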

I have spent some time on the Nutch mailing lists trying to get help and have been mainly ignored. It seems to be open source but a somewhat closed community.

It has always gone something like this:

"Can you give me an order of magnitude for .. "

"I charge $150 per hour."

"Can you estimate the number of hours? .. hello .. hello?"

ByronM

2:34 am on Jan 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hmm, I usually respond pretty quickly to the list. I offer Nutch support much cheaper than $150/hr right now.

Feel free to PM me. I don't think switching to C/C++ will find you more developers, though :)

The list generally won't hand-hold you, because we get tons of people asking for that when the docs are good enough to get people going. Best to build a search, test it, implement it, and then tune it to work the way you want once you understand the process.

runarb

11:57 am on Jan 19, 2006 (gmt 0)

10+ Year Member



Do any Lucene people have a comment on the test results in this article?

Indexing times were 13 min. for IXE, 6 min. for Zettair and 4 hours for Lucene

- [cs.yorku.ca...]

It seems odd that Lucene is that much slower. Data size and search time aren't that different.

ByronM

1:37 pm on Jan 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm not sure why Lucene took so long. They don't mention which JVM they used, the code they ran to produce the index, or any of that. I also don't know enough about the competing products to compare.

Depending on how many values I index, I get anywhere from 60-70 rec/s to 800 rec/s indexing speed.
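
To put those rates in perspective against the billion-document target mentioned earlier in the thread, here is a back-of-the-envelope calculation in Python. It assumes a single process indexing continuously at a constant rate, and it uses 65 rec/s as the midpoint of the 60-70 range; both are simplifying assumptions, not measurements.

```python
def indexing_days(num_docs, docs_per_second):
    """Days of continuous single-process indexing at a constant rate."""
    seconds = num_docs / docs_per_second
    return seconds / 86_400  # seconds per day


for rate in (800, 65):
    print(f"{rate} rec/s -> {indexing_days(1_000_000_000, rate):.1f} days")
```

At 800 rec/s a billion documents take roughly two weeks; at 65 rec/s it is closer to half a year, which is why the number of indexed fields matters so much at that scale.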