Forum Moderators: bakedjake


What's the fastest off-the-shelf free natural language search engine?

swish-e, mnogosearch, mysql fulltext indices, etc...

         

phaze

11:17 pm on Nov 4, 2004 (gmt 0)

10+ Year Member



So I'm too lazy and too stupid to roll my own natural language indexer. Even though the specs are everywhere:
[google.com...]

So I want to use something off the shelf. I'm using mysql right now to index about 10 million records, but it needs 1 Gig of RAM per million records to keep the index in memory. Anyone know of something better, faster and cheaper in hardware terms?

mm.

Maxime

3:03 pm on Nov 6, 2004 (gmt 0)

10+ Year Member



Try DataparkSearch: [dataparksearch.org...]
It keeps the index in disk files (when cache storage mode is used). You may also preload some data into memory, about 20 bytes per URL indexed.

freeflight2

8:48 pm on Nov 9, 2004 (gmt 0)

10+ Year Member



even mysql.com does not use the mysql text search ;)
htdig is pretty OK

zootreeves

7:35 pm on Nov 11, 2004 (gmt 0)

10+ Year Member



What about Nutch? You can download it off SourceForge.

phaze

9:39 pm on Nov 11, 2004 (gmt 0)

10+ Year Member



Thanks for all the input. I've checked out all the suggestions, and I'm rolling my own language search using mysql as a fast file access daemon. So check this out:

I'm busy rewriting our search engine and borrowing heavily from Page and Brin's Stanford paper on the original version of google. So I hit the problem of phrase search and stopwords. When users perform an individual word search like:
cheap widgets
Then you just go look in your data barrels to see which documents contain 'cheap' and which documents contain 'widgets' and you serve up the intersection of the two data sets ordered by your ranking algo.
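
A minimal sketch of that intersection step, assuming each "barrel" is just an in-memory map from a word to the set of document IDs that contain it. The names `barrels`, `rank` and `and_search` and all the sample data are made up for illustration; a real engine would read the postings from disk or a database and use a real ranking algo.

```python
# Hypothetical barrels: word -> set of IDs of documents containing that word.
barrels = {
    "cheap": {1, 2, 4, 7},
    "widgets": {2, 3, 4, 9},
}

# Hypothetical precomputed ranking score per document (higher is better).
rank = {1: 0.2, 2: 0.9, 3: 0.4, 4: 0.5, 7: 0.1, 9: 0.3}


def and_search(query):
    """Return IDs of documents containing every query word, best ranked first."""
    words = query.lower().split()
    if not words:
        return []
    # Fetch one posting set per word; an unknown word matches nothing.
    postings = [barrels.get(word, set()) for word in words]
    # Intersect starting from the smallest set to keep intermediate results small.
    postings.sort(key=len)
    hits = set(postings[0])
    for other in postings[1:]:
        hits &= other
    # Order by score descending, breaking ties by document ID.
    return sorted(hits, key=lambda doc: (-rank.get(doc, 0.0), doc))


print(and_search("cheap widgets"))
```

Starting with the rarest word keeps the working set small, which is the same reason very common words are the expensive case.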
If users search for:
"cheap widgets"
it gets a bit more complex. 'cheap' must occur next to 'widgets' so you ask your first data barrel to find 'cheap' and the next data barrel to find 'widgets' at position (cheaps_position + 1).
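
The position check can be sketched like this, assuming the index stores, for every word, the word positions at which it occurs in each document. The names `positions` and `phrase_search` and the sample data are hypothetical; this only illustrates the "position + 1" idea, it is not the schema described in this post.

```python
# Hypothetical positional index: word -> {doc_id: [word positions in that doc]}.
positions = {
    "cheap": {1: [0, 5], 2: [3]},
    "widgets": {1: [6], 2: [9], 3: [0]},
}


def phrase_search(phrase):
    """Return sorted IDs of documents containing the words as an exact phrase."""
    words = phrase.lower().split()
    if not words:
        return []
    # Candidate phrase start positions per document, seeded from the first word.
    candidates = {
        doc: set(pos_list) for doc, pos_list in positions.get(words[0], {}).items()
    }
    # The word at index `offset` in the phrase must occur at start + offset.
    for offset, word in enumerate(words[1:], start=1):
        postings = positions.get(word, {})
        surviving = {}
        for doc, starts in candidates.items():
            word_positions = set(postings.get(doc, ()))
            kept = {start for start in starts if start + offset in word_positions}
            if kept:
                surviving[doc] = kept
        candidates = surviving
    return sorted(candidates)


print(phrase_search("cheap widgets"))
```

In the sample data both documents 1 and 2 contain both words, but only document 1 has 'widgets' directly after 'cheap', so only it survives the position check.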

So that's all fine and dandy. Problem is with stopwords. Search engines love stopwords because it means you can drop them from your index. 'the' is too common, so why index it? No one is going to be stupid enough to search on 'the'....

Unless they're doing a phrase search. 'the chronicles of riddick' is a possible search. And it's a real nasty one because so many documents contain 'the'. So you're forced to index stopwords because smart ass users want to do phrase searches. ;)

So what I'm doing is creating a completely separate data barrel for stopwords. Without boring you with my database schema, what's interesting is that the result is that phrase searches starting with a stopword ('the', 'and', 'or', etc.) will be slow, and searching for individual stopwords will be slow. So stupid users get a slow response. Specific users get a fast response. I like it.
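
One way the two-barrel split could look, as a rough sketch only. The actual schema isn't given here, so everything below is an assumption: the `STOPWORDS` list, the `main_barrel` / `stopword_barrel` dictionaries, and the helper names are all invented for illustration. The point is just that stopword postings live apart from the main index and only get touched when a query actually contains a stopword.

```python
# Assumed stopword list; a real one would be tuned to the collection.
STOPWORDS = {"the", "and", "or", "of", "a", "but", "if", "to"}

# Two hypothetical positional barrels: word -> {doc_id: [positions]}.
main_barrel = {}      # ordinary words: small posting lists, fast to scan
stopword_barrel = {}  # stopwords only: huge posting lists, slow to scan


def index_document(doc_id, text):
    """Route each word's position into the barrel that matches its type."""
    for pos, word in enumerate(text.lower().split()):
        barrel = stopword_barrel if word in STOPWORDS else main_barrel
        barrel.setdefault(word, {}).setdefault(doc_id, []).append(pos)


def lookup(word):
    """Return (postings, came_from_stopword_barrel) for one word."""
    if word in STOPWORDS:
        return stopword_barrel.get(word, {}), True
    return main_barrel.get(word, {}), False


def phrase_search(phrase):
    """Return (sorted matching doc IDs, whether the slow barrel was used)."""
    words = phrase.lower().split()
    if not words:
        return [], False
    postings, used_slow = lookup(words[0])
    candidates = {doc: set(pos_list) for doc, pos_list in postings.items()}
    for offset, word in enumerate(words[1:], start=1):
        postings, slow = lookup(word)
        used_slow = used_slow or slow
        surviving = {}
        for doc, starts in candidates.items():
            word_positions = set(postings.get(doc, ()))
            kept = {start for start in starts if start + offset in word_positions}
            if kept:
                surviving[doc] = kept
        candidates = surviving
    return sorted(candidates), used_slow


index_document(1, "the chronicles of riddick")
index_document(2, "chronicles of the widgets riddick")
index_document(3, "cheap widgets for sale")

print(phrase_search("the chronicles of riddick"))
print(phrase_search("cheap widgets"))
```

A query like "cheap widgets" never reads the stopword barrel, while a phrase that starts with 'the' has to seed its candidates from the largest posting list in the index, which is where the slowness comes from.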

And what's maybe more interesting is the slowest search you can do on google is a phrase of stopwords:
"the and or"
"the but if"
Notice the seconds taken to execute the query. I've had it up to over 3 seconds at times.

Which brings me to my question. What is the slowest search possible on google? What will make the beast work up a sweat?

zootreeves

9:11 pm on Nov 12, 2004 (gmt 0)

10+ Year Member



Results 1 - 10 of about 54,900 for "the but if" OR "the and or". (4.71 seconds)

zootreeves

9:20 pm on Nov 12, 2004 (gmt 0)

10+ Year Member



Actually I've done better...

Results 1 - 10 of about 24,300 for "the but if" OR "the of to". (9.19 seconds)

phaze

12:39 am on Nov 13, 2004 (gmt 0)

10+ Year Member



Nice. I got 9.29 seconds for "the but if" OR "the of to" OR "the a and of"