Anybody know of any open source search engine project(s)?
Anybody wanna start one...?
What I believe I see happening is the de facto privatization of the Net. If all access is effectively owned and controlled, strangled, squeezed and extorted by Google, Yahoo!, MSN or all of the above, it really doesn't matter which master you're beholden to; you're still *-ed (insert expletive here, starts with F). The only thing which could halt or challenge this trend is some kind of creative commons/copyLeft Wiki SE.
Somebody mentioned ht://dig and I have no idea what he's talking about. Someone else mentioned Nutch.
So, apparently it's not really happening out there...yet. My question is:
Would some of you brighter bulbs than I am care to speculate on the implications of a completely transparent, ever-evolving search algorithm with publicly available source code? What would happen if everything were laid on the table, so that spammers could be attacked by all, and spammers could attack all? Theoretically, could it force a rapid evolution into some kind of brilliant, organically unbeatable engine so focused on true relevancy that it could not be defeated even if you knew its guts? It's the idea that, to paraphrase a famous Supreme Court justice, "a little sunlight is the best disinfectant."
Or, alternatively, do I sound like I'm on crack here?
Actually, on second thoughts, let me retract that statement. Linux is open source and by the above logic would be open to abuse from all sorts of hackers. In reality, it's a lot more secure than the software that is closed-source and commercial. Seems that many heads are better than one. Obviously in the initial stages, it would be open to lots of abuse, but would eventually become tougher and tougher to beat, as it gets refined more and more.
Mmm, you could be on to something here. I'd love to offer my services, but unfortunately (when it comes to search engines) I couldn't code my way out of a brown paper bag.
-- Aspseek: written in C++ with STL, last official release of 2002-07. [aspseek.org...]
-- Nutch: written in Java. [lucene.apache.org...]
These links point to the homes of the projects/sources, not to search engines.
May the SOURCE be with you.
Have fun and regards,
R.
These days computers are powerful and databases are freely available -- anybody can build just a search engine, and making it open source is easy too.
A search engine is easy; making a World Wide Web search engine is not -- most code will choke and die indexing 100 mln pages, even if it actually manages to obtain that many pages, which is harder than it may sound.
So, if we are talking about a WWW search engine, then the number of community-driven projects is rather low due to the high complexity of the task: ranking is a particular problem because for any word combination you get millions and millions of matches, of which you can show 10 or 20 at best. You can't avoid it because you can't just index 0.1% of the Web and hope it will be enough for general-purpose queries -- it won't be, so you have to do the whole lot or not do it at all.
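To make the ranking point concrete: with millions of matches per query you never sort the whole result set, you only keep the best handful. Here is a minimal Python sketch, assuming every match already carries a numeric relevance score (computing that score is the hard part and is not shown). The function name `top_matches` and the sample data are made up for illustration.

```python
import heapq
from typing import Iterable, List, Tuple

def top_matches(scored: Iterable[Tuple[str, float]], k: int = 10) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (url, score) pairs, best first.

    heapq.nlargest keeps only k candidates in memory at a time, so the
    input can be a stream of millions of matches.
    """
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

if __name__ == "__main__":
    # Hypothetical scores; a real engine would compute these per query.
    matches = [("a.example", 0.1), ("b.example", 0.9), ("c.example", 0.5)]
    for url, score in top_matches(matches, k=2):
        print(url, score)
```

Selecting the top 10 is the cheap part; deciding which 10 of the millions deserve to be there is where the real work goes.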
It's tough, which is why almost everyone capable in this field is working for commercial WWW search engines like Google. Bandwidth and disk space requirements are rather unholy, so this is not just a free-software-only project, and bandwidth limits are a real issue.
However the task is noble and necessary -- I use search engines so much every day that only my usage of the browser itself overshadows it. The two most important apps are the browser and the WWW search engine.
As a quick note, "ht://dig" was not designed to be a WWW search engine -- it won't scale to that level.
-------
Nutch
-------
I was not impressed with Nutch when I reviewed the situation last year - for starters they never released a demo system with 100 mln pages because (as was posted on their site) they did not have the hardware necessary for a public test. I looked at their cost projections and have to agree that if they really needed that much hardware (well, it's written in Java, innit?) then it's not surprising they needed some serious money. Perhaps now they will get the money as they are part of Apache.
In my view they still have issues with organising the whole system of getting pages, indexing them and making them available for searching: having the source is great, but you can't just download something, click a few times and join the effort of building their search engine.
I'd say Nutch is more of a research platform that _can_ be used to create a WWW search engine, but I don't think anybody would do that apart from the developers of Nutch itself.
Pro: Open Sourced and led by experienced people
Con: You can't just "join in" easily, no public search engine, dev seemed slow (faster now?)
-------
Grub
-------
Grub is a concept based on distributed crawling -- you download client software that uses your bandwidth to crawl for pages, much like SETI@Home and distributed.net operate.
Pro: Easy to join
Con: Not open source, no search engine, no active dev
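For anyone wondering what "distributed crawling" involves on the coordinating side, here is a rough Python sketch of one way a central server could split a crawl queue among volunteer clients. This is an assumption for illustration only, not how Grub (or any project in this thread) actually does it: URLs are grouped by host with a stable hash so that each site is fetched by a single client. The name `assign_urls` and the client IDs are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlsplit

def assign_urls(urls: Iterable[str], clients: List[str]) -> Dict[str, List[str]]:
    """Split a crawl queue among clients by hashing each URL's host.

    All URLs from one host land on the same client, which makes it
    easier to rate-limit requests to that host.
    """
    if not clients:
        raise ValueError("need at least one client")
    batches = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        # Stable hash (unlike built-in hash(), which varies between runs)
        digest = hashlib.sha1(host.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(clients)
        batches[clients[index]].append(url)
    return dict(batches)

if __name__ == "__main__":
    queue = ["http://a.example/1", "http://a.example/2", "http://b.example/"]
    print(assign_urls(queue, ["client-1", "client-2", "client-3"]))
```

A real system would also have to cope with clients that vanish mid-batch or send back bogus pages, which this sketch ignores.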
-------
My project
-------
My project is similar to Grub in concept (download a client that crawls), with the client technically more advanced, but most importantly we actually have a search engine that is actively being developed: an alpha version is available. Naturally it's not Google's quality, but it has only existed for 10 weeks.
Pro: Easy to join, fast dev, alpha of search available
Con: Not open source
--
Now you may notice that my project is not open source. The reasons for this are explained on my site's forum; however, I would merely like to say that IF a project is community driven then that gives better control over the project than having the source. Most people simply don't have multi-terabyte systems at their disposal, and going through 90k lines of code is not exactly a walk in the park.
When the community actually contributes by running the software, it holds the keys to the project's future and its opinion is pretty much decisive, because if the project starts acting in some kind of dishonest fashion then the community can pull the plug by not running the software.
I've already written way too much for most people to bother reading; for those few who managed it, I hope it was worth it.
regards
alexc
I guess the real limitation is about who's going to finance a hangarful of machines to digest the enormity that is the Net. A SETI-esque distributed solution's not going to work...people don't share all too much in this world, plus there's a lot of well-justified fear out there about opening anything on your personal computer to be used by a vast anonymous outside collective. Maybe there's hope in nanoscale, self-replicating computer technology. In the decades to come, more bandwidth than anyone could ever use may be as cheap and abundant as grains of sand on a beach...it could happen.
Somehow it would need a revenue-generating angle (certainly proven models already exist, in both PPC and other "Linux"-ish OS-derived commercial solutions). And you'd need an angel investor or two who'll finance it either from a vision of the possibilities of financial gain and enlightened self-interest, or even, just maybe...a simple desire to do some good in the world.
Personally, I wish I had the resources. The consequences of offering a public and open source alternative to the access monopolies rapidly developing online would have world-shifting, historical effects upon Life, Liberty, and the Pursuit of Happiness for all humankind.
Really really really
Somehow it would need a revenue-generating angle
Indeed -- that's why I plan to license bits of the software I develop to gain revenue and additionally do some consulting. This should provide enough revenue to get by and to have the hardware necessary to take the project high up. This is one of the reasons why the project is not open sourced.
As for the people who crawl, their costs are minimal -- the software does not require much storage, a broadband connection is pretty much the only requirement, and that is paid for with a fixed fee, so in effect participation in the project is free to all participants apart from whoever runs the central server (i.e. me).
Consider Wikipedia. An open source encyclopedia. 600,000+ articles. The quality control is not as good as the commercial 'competition' but it is generally quite good. And Wikipedia is often more up to date, often more detailed, and free. As the Wikipedia community develops, the product gets better.
I think that an open search engine will have to be designed rather differently than Google, Yahoo, or MSN. Not sure how, although the idea of a network of independent search engines with meta search is one possibility. I think that once there is free software that makes establishing a specialty search engine as easy as tossing up a bulletin board, things will catch on.
Of course, lots of people have tossed up BBSs, with the result that there are many schlock sites. But lots of good forums, too. And the good ones develop reputation, people hang out there, refer their friends, and those sites grow.
So let's say that 20 widget-making specialty search engines are built by individuals, orgs and companies. Some of them are really good, some are mediocre, some are controlled by evil spam wizards to promote their Acme products. Widget lovers will find the good ones, and they will catch on.
And suddenly, there is open source meta-search engine software that is freely available. Anyone can set one up and configure it to query any of the 20 widget search engines that are set to share their data. The OS meta-search engines can exclude the ACME search engine when they discover that it is a doodoo maker. They can weight the results from the different engines according to their respect for each one. They can look for agreement between the different search engines and try to find consensus - if 9 of 10 independent engines rank the widget.com page as #1, #2, or #3, then one might conclude that it is the #1 page if no other page has such a consistently high rank.
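The exclude / weight / consensus idea above can be sketched in a few lines of Python. This is only one possible reading of it: the scoring rule (weight divided by rank, with top-3 agreement counted first) is my own choice for illustration, and `merge_results`, the engine names and the weights are all hypothetical.

```python
from collections import defaultdict
from typing import AbstractSet, Dict, List

def merge_results(
    results: Dict[str, List[str]],
    weights: Dict[str, float],
    blocked: AbstractSet[str] = frozenset(),
    top_n: int = 3,
) -> List[str]:
    """Merge ranked URL lists from several engines into one ranking.

    results: engine name -> URLs, best first
    weights: engine name -> how much we respect that engine (default 1.0)
    blocked: engines to ignore entirely (e.g. a known spam engine)
    top_n:   an engine "agrees" on a URL when it ranks it this high
    """
    score = defaultdict(float)
    agreement = defaultdict(int)
    for engine, urls in results.items():
        if engine in blocked:
            continue
        weight = weights.get(engine, 1.0)
        for rank, url in enumerate(urls, start=1):
            score[url] += weight / rank      # higher ranks contribute more
            if rank <= top_n:
                agreement[url] += 1          # count engines that agree
    # Consensus first, then weighted score; the URL itself breaks ties.
    return sorted(score, key=lambda u: (-agreement[u], -score[u], u))

if __name__ == "__main__":
    results = {
        "good1": ["widget.com", "a.com", "b.com"],
        "good2": ["a.com", "widget.com"],
        "acme": ["acme.com"],
    }
    print(merge_results(results, {"good1": 2.0}, blocked={"acme"}))
```

Each meta-search operator would pick their own weights and block list, which is the whole point: no single party decides the ranking for everyone.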
As the OS meta-search engines develop, they will allow more topics to be queried, more independent search engines to be linked together. Reputation algorithms will allow a degree of trust between engines that don't know much about each other (I don't know Alice, but Bob says he trusts Alice, and Bob is a level headed guy, so Alice's opinions on widgets are probably reasonable).
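The Alice-and-Bob reasoning can be written down as a small trust-propagation routine. What follows is a sketch under assumptions of my own, not an established reputation algorithm: direct trust is a number between 0 and 1, trust along a chain is the product of the links (so it fades with each hop), and the strongest chain wins. `derived_trust` and the sample graph are hypothetical.

```python
from typing import Dict

TrustGraph = Dict[str, Dict[str, float]]

def derived_trust(graph: TrustGraph, source: str, max_hops: int = 3) -> Dict[str, float]:
    """Estimate how much `source` can trust each engine within max_hops.

    graph[a][b] is a's direct trust in b, between 0 and 1.
    """
    best = {source: 1.0}
    frontier = {source: 1.0}
    for _ in range(max_hops):
        next_frontier: Dict[str, float] = {}
        for node, trust_so_far in frontier.items():
            for neighbour, direct in graph.get(node, {}).items():
                candidate = trust_so_far * direct
                # Keep only the strongest chain found so far
                if candidate > best.get(neighbour, 0.0):
                    best[neighbour] = candidate
                    next_frontier[neighbour] = candidate
        frontier = next_frontier
    best.pop(source)  # trusting yourself is not interesting
    return best

if __name__ == "__main__":
    graph = {"me": {"bob": 0.9}, "bob": {"alice": 0.8}}
    print(derived_trust(graph, "me"))
```

The hop limit matters: without it, a long enough chain of acquaintances would let a spammer earn at least a sliver of trust from everybody.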
Some people will feel overwhelmed by the choices and stick with the big commercial engines. Others will experiment and maybe find that the OS engines are more powerful because they are customized and driven by users who are passionate about their widgets. Goblins will stay awake late at night, writing new spells to trick the OS engines into displaying their crappy landing pages. And Google Adsense will be displayed on many of the independent search engines, because if you can't beat them ...
Any idea how effective NUTCH is in indexing internal websites?
You don't need Nutch for this; you might be better off using Lucene directly (Nutch is more WWW-specific stuff on top of Lucene - or so I understand).
If you want to index a small number (<1 mln) of documents, then using a search engine designed for a WWW corpus is overkill.
Then again WebmasterWorld uses Google for searches...