Forum Moderators: bakedjake
The questions I'm trying to post in this thread are:
1. Have any of you come across any current university search engine projects, or search engines that have come out of university-related research? If so, what do you think of them? Post a link if possible.
2. Also, on a side note, I've been watching Nutch (the open-source search project) for a few years. I haven't seen much happening with it, so I'm just wondering whether anyone can update us on anything they have heard or seen from or about Nutch.
All I got was a Page Not Displayed message. It's either being redesigned or is dead in the water. I say it's dead.
According to SE Watch, search is a 100 million dollar market, and that's just to enter it. To survive, you either need the cash (100 million) or such a great idea that people flock to it.
They released a new updated version on the 15th of August.
I've not heard much else from them. They originally promised a lot but have yet to deliver.
To enter the search market you don't need that much. All you need is a good technology that improves on what is already out there, as well as a bit of clever marketing (like Google did). It doesn't matter what the arena is already like. Back in 1998 we all thought Yahoo, AltaVista and Excite were the bomb and were here to stay. Then came Google.
Google had only 1 million dollars in funding when they took on AltaVista, Yahoo etc. in the internet bubble.
The mailing list and developers have been active, and it's fairly straightforward to extend.
I've been working on tweaking a 50 million to 75 million page index and am just now expanding.
100 million bucks? Nah... just a 100 million page corpus with good results :)
Sticky me for info. I'm always looking for help!
"To enter the search market you don't need that much. All you need is a good technology that improves on what is already out there, as well as a bit of clever marketing (like Google did). It doesn't matter what the arena is already like. Back in 1998 we all thought Yahoo, AltaVista and Excite were the bomb and were here to stay. Then came Google."
Or a technology that is more finely focused on a niche market. Yahoo is still here. Excite is gone and AltaVista is more hasta la vista these days.
"Google had only 1 million dollars in funding when they took on AltaVista, Yahoo etc. in the internet bubble."
The web was significantly smaller then. At the moment, I'm loading the data from a website-to-IP mapping project that has mapped all websites in com/net/org/biz/info to their respective countries, and applying some algorithms to improve the geolocation by non-IP methods. In 1998, it would have required quite a few computers and a lot of HDs to do this. Now I can play with the dataset on a near desktop-spec computer and one HD.
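For anyone wondering what the IP side of that kind of mapping looks like, here is a minimal sketch, not the actual project's code. It assumes the hostnames have already been resolved to addresses. The table `RANGES`, the function names and the country assignments are all hypothetical: the networks are the reserved documentation ranges, and a real run would load a proper IP allocation dataset instead.

```python
import ipaddress

# Hypothetical range-to-country table. The networks are reserved
# documentation ranges and the country codes are made up for illustration.
RANGES = [
    (ipaddress.ip_network("192.0.2.0/24"), "IE"),
    (ipaddress.ip_network("198.51.100.0/24"), "US"),
    (ipaddress.ip_network("203.0.113.0/24"), "DE"),
]


def country_for_ip(ip):
    """Return the country code for an IP string, or None if no range matches."""
    addr = ipaddress.ip_address(ip)
    for network, country in RANGES:
        if addr in network:
            return country
    return None


def map_sites(site_ips):
    """Turn {hostname: ip} into {hostname: country code or None}."""
    return {host: country_for_ip(ip) for host, ip in site_ips.items()}


# Hypothetical hostnames with already-resolved addresses.
sites = {
    "example.com": "192.0.2.10",
    "example.net": "203.0.113.7",
    "example.org": "10.0.0.1",  # not in the table, so it stays unmapped
}
print(map_sites(sites))
```

The linear scan is fine for a toy table. With millions of ranges you would want a sorted list with binary search or a prefix tree, and the unmapped hosts are where the non-IP methods would have to take over.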
Starting a country level search engine does not require a million dollar investment and it is still the final frontier for Google and the big players. It is also the only frontier on which they can be repeatedly defeated by smaller players.
A large index means nothing if it is not relevant to the end user. Any new developments may come from people who identify and exploit new niches. The me-too operations that seek to fight Google and Yahoo on their own terms may find it difficult to get funding because they simply do not have any unique selling points.
Regards...jmcc
How much disk space does your index take? And is it 50 or 75 million pages? Just curious as to how many bytes are used per page in your index.
We are almost back up to 100 million pages. I'll see what we are using. It's nearly 1 TB for everything. The 100 million pages is what we have indexed, but the actual db containing links is pushing 200 million, so the sizing/skew isn't exactly right.
Thanks Byron. One more question though: does this include just the pure index, or also the data structures necessary to show text snippets around matches?
2k per page is index data; segments really depend on what you are fetching and have parsed (rss/xml/html/pdf/msword and such).
I'll try and run some stats and publish them on the Nutch wiki and possibly here as well.
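Until real stats are posted, the rough figures quoted above can be turned into a back-of-the-envelope per-page estimate. This is only a sketch under stated assumptions: "nearly 1T" is taken as a decimal terabyte, "2k" is read as 2,000 bytes of index data per page, and the page count is rounded to 100 million. The variable and function names are made up for the example.

```python
# Round figures taken from the posts above; all are approximations.
TOTAL_BYTES = 1 * 10**12            # "nearly 1T for everything" (decimal TB assumed)
PAGES_INDEXED = 100 * 10**6         # "almost back up to 100 million pages"
INDEX_BYTES_PER_PAGE = 2 * 10**3    # "2k is index data" (assumed to be per page)


def bytes_per_page(total_bytes, pages):
    """Average storage used per indexed page."""
    return total_bytes / pages


total_per_page = bytes_per_page(TOTAL_BYTES, PAGES_INDEXED)
index_total_bytes = INDEX_BYTES_PER_PAGE * PAGES_INDEXED
other_per_page = total_per_page - INDEX_BYTES_PER_PAGE

print(f"total per page:       {total_per_page:,.0f} bytes")
print(f"index data overall:   {index_total_bytes / 10**9:,.0f} GB")
print(f"non-index per page:   {other_per_page:,.0f} bytes")
```

Bear in mind that the link db covers roughly 200 million URLs while only 100 million pages are indexed, so the per-page total is skewed upward, as noted in the post.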