Open source search engine - does it exist?

         

kgormat

10:27 am on Oct 26, 2002 (gmt 0)

10+ Year Member



I have a small pet project that I would like to pursue which requires some search technology. I don't have a full set of requirements in mind at this time other than complete ownership and control of the database.

My question is, are there any open source search engines worth looking at that can spider external sites and scale well up to 100k queries per day?

As an alternative, a chunk of code that would allow me to operate my own metasearch engine may suffice.
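For the metasearch route, the core piece is merging ranked result lists from several upstream engines. Here is a minimal sketch, assuming each engine has already returned an ordered list of URLs (fetching and parsing are left out, and the engine names and URLs are made up). It uses reciprocal rank fusion, which is one common merging technique and not something specified in this thread:

```python
from collections import defaultdict


def merge_results(result_lists, k=60):
    """Merge ranked URL lists with reciprocal rank fusion (RRF).

    result_lists: dict mapping a (hypothetical) engine name to its ordered URLs.
    k: damping constant; 60 is a conventional default, not tuned here.
    """
    scores = defaultdict(float)
    for urls in result_lists.values():
        seen = set()
        for rank, url in enumerate(urls, start=1):
            if url in seen:  # count each URL once per engine
                continue
            seen.add(url)
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by URL for a stable order.
    return sorted(scores, key=lambda u: (-scores[u], u))


if __name__ == "__main__":
    # Made-up result lists standing in for two upstream engines.
    sample = {
        "engine_a": ["http://example.com/x", "http://example.com/y"],
        "engine_b": ["http://example.com/y", "http://example.com/z"],
    }
    for url in merge_results(sample):
        print(url)
```

A URL that several engines agree on rises to the top, which is usually the behavior you want from a metasearch front end.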

Brett_Tabke

1:59 pm on Oct 26, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Here's what I've found:

There are about 2-3 SEs that will work for up to 100k per day. They are very poor in relevance, and I wouldn't use them or even mention them.

As for open source SEs [searchtools.com], few work well in a commercial environment.

The only one I've not tried is AspSeek. It has gotten some good reviews from friends.

Htdig has relevance problems.
Swish in C++ is OK, and a Perl front end can be customized. However, the relevance is not great, and phrase searching was not supported when I tried it (I see the latest version does include phrase searching).

Some Perl scripts out there, like XAV's FDSE [xav.com], can suffice when combined with MySQL and a super fast box.

Other than that, the Open Source engines are either abandoned, lack spiders, lack major features, or are so poor in relevance that they are unusable.

There's a huge opportunity still remaining in commercial SE software here.

shady

2:14 pm on Oct 26, 2002 (gmt 0)

10+ Year Member



That looks interesting, Brett.
Do you think it would be worth writing a php/mysql SE or do you consider other languages to be better suited to the task?

Brett_Tabke

4:03 am on Oct 27, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



At 100k a day, you are probably in the ballpark of 4k queries an hour on average. That's pretty serious load. If it is a dedicated fast box and your index isn't that large, it would work OK, I think. However, the biggest factor will be the size of the database.
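The hourly figure is simple arithmetic on the daily total. A quick sketch of it follows; the peak factor is purely an assumption for illustration, since real traffic is not spread evenly over the day:

```python
def load_profile(queries_per_day, peak_factor=3.0):
    """Return (average per hour, average per second, assumed peak per second).

    peak_factor is a hypothetical multiplier for busy periods, not a measured value.
    """
    per_hour = queries_per_day / 24
    per_second = per_hour / 3600
    return per_hour, per_second, per_second * peak_factor


if __name__ == "__main__":
    hourly, per_sec, peak = load_profile(100_000)
    print(f"{hourly:.0f} per hour, {per_sec:.2f} per second, ~{peak:.2f} per second at peak")
```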

There are some challenges to working with any SQL-based DB when creating an SE. It can be very complex to get good relevance and the speed you need at the same time. Think very long and hard about how you set up your DBs, and don't be afraid to dump the format, back up, and start over. These things take on a life of their own once you lock yourself into a database structure.
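To make the relevance-versus-speed tension concrete, here is a minimal sketch of an inverted index kept in SQL. It uses SQLite from the Python standard library as a stand-in for MySQL, the `postings` table layout is hypothetical, and the score is just summed term frequency, which is a crude relevance measure:

```python
import re
import sqlite3
from collections import Counter

TOKEN = re.compile(r"[a-z0-9]+")


def build_index(docs):
    """docs: dict of url -> text. Returns an in-memory SQLite connection."""
    db = sqlite3.connect(":memory:")
    # One row per (term, url) pair; the primary key doubles as the lookup index.
    db.execute(
        "CREATE TABLE postings (term TEXT, url TEXT, tf INTEGER, "
        "PRIMARY KEY (term, url))"
    )
    for url, text in docs.items():
        counts = Counter(TOKEN.findall(text.lower()))
        db.executemany(
            "INSERT INTO postings (term, url, tf) VALUES (?, ?, ?)",
            [(term, url, tf) for term, tf in counts.items()],
        )
    db.commit()
    return db


def search(db, query, limit=10):
    """Return (url, score) rows for documents containing every query term."""
    terms = sorted(set(TOKEN.findall(query.lower())))
    if not terms:
        return []
    placeholders = ",".join("?" * len(terms))
    return db.execute(
        f"SELECT url, SUM(tf) AS score FROM postings "
        f"WHERE term IN ({placeholders}) "
        f"GROUP BY url HAVING COUNT(*) = ? "
        f"ORDER BY score DESC, url LIMIT ?",
        (*terms, len(terms), limit),
    ).fetchall()


if __name__ == "__main__":
    db = build_index({
        "http://example.com/a": "open source search engine search",
        "http://example.com/b": "open source spider",
    })
    print(search(db, "open source"))
```

Even in this toy, anything beyond summed term frequency (phrase matching, field weighting, link data) means extra columns or joins, which is where the schema starts to lock you in.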

kgormat

5:46 am on Oct 27, 2002 (gmt 0)

10+ Year Member



Brett,

Thank you for the informative reply. I'll check out the references you cited.

Regards,
Kent

littleman

6:18 am on Oct 27, 2002 (gmt 0)



ASPseek!
[aspseek.org...]
It is open source (GNU) and amazing. I am going from memory, but I think it is good for over a million documents on a single dedicated server.

Brett_Tabke

7:28 am on Oct 27, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



*nix only right? Or has someone ported it?

kgormat

7:43 am on Oct 27, 2002 (gmt 0)

10+ Year Member



From what I just read, it appears to be Unix only, although the source is available. I'm going to load it up this week; if anyone wants to see it in action, I'll be happy to provide a link.

My goal is to have a small scale version of something similar to Gigablast, with some spidering restrictions in place. On that note, I'm curious to learn a few best practices for managing spidering criteria.
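As a starting point for spidering criteria, here is a minimal sketch of a per-URL gate that combines a host allow list, a depth limit, and robots.txt rules. The `CrawlPolicy` class, the `example-bot` user agent, and the hosts are all hypothetical; robots.txt text is passed in directly so the sketch runs without network access:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class CrawlPolicy:
    """Decide whether a URL may be fetched (illustrative, not a full crawler)."""

    def __init__(self, allowed_hosts, max_depth, user_agent="example-bot"):
        self.allowed_hosts = set(allowed_hosts)
        self.max_depth = max_depth
        self.user_agent = user_agent
        self.robots = {}  # host -> parsed robots.txt rules

    def set_robots(self, host, robots_txt):
        """Register the robots.txt body already downloaded for a host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allows(self, url, depth):
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            return False
        if parts.hostname not in self.allowed_hosts:
            return False
        if depth > self.max_depth:
            return False
        parser = self.robots.get(parts.hostname)
        if parser is None:
            return False  # conservative: no robots.txt loaded for this host yet
        return parser.can_fetch(self.user_agent, url)


if __name__ == "__main__":
    policy = CrawlPolicy({"example.com"}, max_depth=2)
    policy.set_robots("example.com", "User-agent: *\nDisallow: /private/")
    for candidate in ("http://example.com/index.html", "http://example.com/private/x"):
        print(candidate, policy.allows(candidate, depth=1))
```

Refusing to crawl until robots.txt has been loaded is a deliberately cautious choice; a real spider would also want per-host rate limiting, which is left out here.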