My question is, are there any open source search engines worth looking at that can spider external sites and scale well up to 100k queries per day?
As an alternative, a chunk of code that would allow me to operate my own metasearch engine may suffice.
There are about 2-3 SEs that will work for up to 100k queries per day. They are very poor in relevance, and I wouldn't use them or even mention them.
As for open source SEs [searchtools.com], there are few that work well in a commercial environment.
The only one I've not tried is AspSeek. It has gotten some good reviews from friends.
Htdig has relevance problems.
Swish in C++ is OK, and a Perl front end can be customized. However, the relevance is not great, and phrase searching was not supported until recently (I see the latest version does include it).
Some Perl scripts out there, like XAV's FDSE [xav.com], can suffice when combined with MySQL and a very fast box.
Other than that, the Open Source engines are either abandoned, lack spiders, lack major features, or are so poor in relevance that they are unusable.
There's a huge opportunity still remaining in commercial SE software here.
There are some challenges to working with any SQL-based database when creating a search engine. It can be very complex to get good relevance and the speed you need at the same time. Think very long and hard about how you set up your databases, and don't be afraid to dump the format, back up, and start over. These things take on a life of their own once you lock yourself into a database structure.
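To make the database structure point concrete, here is a rough sketch of one common layout: an inverted index with one row per (term, document) pair. It uses Python's built-in sqlite3 module purely for illustration. The table names, the whitespace tokenizer, and the summed term-frequency score are all assumptions of mine, not how any of the engines mentioned above work, and real relevance ranking needs a good deal more than this.

```python
import sqlite3
from collections import Counter


def build_index(docs):
    """Build an in-memory inverted index from a {url: text} mapping."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE documents (
            doc_id INTEGER PRIMARY KEY,
            url    TEXT NOT NULL
        );
        -- One row per (term, document); the primary key doubles as the
        -- lookup index, so term lookups do not scan the whole table.
        CREATE TABLE postings (
            term   TEXT    NOT NULL,
            doc_id INTEGER NOT NULL REFERENCES documents(doc_id),
            tf     INTEGER NOT NULL,
            PRIMARY KEY (term, doc_id)
        );
    """)
    for url, text in docs.items():
        cur = conn.execute("INSERT INTO documents (url) VALUES (?)", (url,))
        doc_id = cur.lastrowid
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        counts = Counter(text.lower().split())
        conn.executemany(
            "INSERT INTO postings (term, doc_id, tf) VALUES (?, ?, ?)",
            [(term, doc_id, tf) for term, tf in counts.items()],
        )
    conn.commit()
    return conn


def search(conn, query):
    """Return (url, score) for documents containing every query term."""
    terms = sorted(set(query.lower().split()))
    if not terms:
        return []
    placeholders = ",".join("?" * len(terms))
    sql = f"""
        SELECT d.url, SUM(p.tf) AS score
        FROM postings p
        JOIN documents d ON d.doc_id = p.doc_id
        WHERE p.term IN ({placeholders})
        GROUP BY p.doc_id
        HAVING COUNT(*) = ?          -- require all terms (AND search)
        ORDER BY score DESC, d.url ASC
    """
    return conn.execute(sql, (*terms, len(terms))).fetchall()
```

Even in this toy version you can see the lock-in problem: the scoring formula is baked into the shape of the postings table, so adding something like phrase searching would mean storing word positions and rebuilding the whole index.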
My goal is to have a small scale version of something similar to Gigablast, with some spidering restrictions in place. On that note, I'm curious to learn a few best practices for managing spidering criteria.
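On managing spidering criteria, one approach that may help is to keep all the restrictions in a single policy object that the spider consults before every fetch, so the rules live in one place instead of being scattered through the crawl loop. The sketch below is only an illustration under my own assumptions: the class name CrawlPolicy, its fields, and the default limits are all hypothetical, not taken from Gigablast or any other engine.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class CrawlPolicy:
    """Hypothetical container for spidering restrictions."""
    allowed_hosts: set
    max_depth: int = 3                 # link hops from a seed URL
    max_pages_per_host: int = 1000     # cap so one site cannot dominate
    blocked_extensions: tuple = (".jpg", ".png", ".gif", ".zip", ".pdf", ".exe")
    pages_seen: dict = field(default_factory=dict)

    def should_fetch(self, url, depth):
        """Return True only if the URL passes every restriction."""
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if parts.scheme not in ("http", "https"):
            return False
        if host not in self.allowed_hosts:
            return False
        if depth > self.max_depth:
            return False
        if parts.path.lower().endswith(self.blocked_extensions):
            return False
        if self.pages_seen.get(host, 0) >= self.max_pages_per_host:
            return False
        return True

    def record_fetch(self, url):
        """Count a completed fetch against its host's page budget."""
        host = (urlsplit(url).hostname or "").lower()
        self.pages_seen[host] = self.pages_seen.get(host, 0) + 1
```

Usage would be along the lines of policy = CrawlPolicy(allowed_hosts={"example.com"}), then calling policy.should_fetch(url, depth) before each request and policy.record_fetch(url) after it. This does not cover robots.txt; Python's standard urllib.robotparser module can handle that check alongside the policy.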