Forum Moderators: bakedjake

Message Too Old, No Replies

New Search Engines as well as Nutch

Discussion of Nutch and New Search Engines

         

mooret

4:53 pm on Aug 19, 2005 (gmt 0)

10+ Year Member



As we all know, all the major search engines (Yahoo, Google, Teoma, etc.) started out as small university projects.

The questions I'm trying to pose in this thread are:

1. Have any of you come across current university search engine projects, or search engines that have come out of university-related research? If so, what do you think of them? Post a link if possible.

2. Also, on a side note, I've been watching Nutch (the open source search project) for a few years, but I haven't seen much happening with it. Just wondering whether anyone can update us on anything they have heard or seen from/about Nutch.

Event_King

10:21 pm on Aug 19, 2005 (gmt 0)



[nutch.org...]

All I got was a Page Not Displayed message. It's either being redesigned or dead in the water. I say it's dead.

According to SE Watch, search is a $100 million market, and that's just to enter it. To survive, you either need the cash ($100 million) or an idea so good that people flock to it.

mooret

12:30 am on Aug 20, 2005 (gmt 0)

10+ Year Member



When it comes to Nutch, they moved from SourceForge into the Apache Incubator. [lucene.apache.org...]

They released an updated version on the 15th of August.

Not heard much else from them. They originally promised a lot but have yet to deliver.

To enter the search market you don't need that much. All you need is technology that improves on what is already out there, as well as a bit of clever marketing (like Google did); it doesn't matter what the arena already looks like. Back in 1998 we all thought Yahoo, AltaVista and Excite were the bomb and here to stay, and then came Google.

Google had only a million dollars in funding when they took on AltaVista, Yahoo etc. during the internet bubble.

ByronM

10:36 am on Aug 21, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Nutch is great; I use it for mozdex. I'm currently migrating to the MapReduce instance so I can finally break the 100m page mark in a manageable system.

The mailing list & developers have been active and it's fairly straightforward to extend.

I've been working on tweaking a 50-75 million page index and am just now expanding.

100 million bucks? Nah.. just a 100 million page corpus with good results :)

Sticky me for info.. I'm always looking for help!
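[Editor's note: for anyone wondering why the MapReduce migration matters for scale, here is a minimal Python sketch of the map/reduce pattern itself, not Nutch's actual Java code; the segment data and function names are invented for illustration.]

    from collections import defaultdict

    # Toy crawl segments of (source_url, target_url) link records. In a real
    # crawl these would be millions of records spread across many machines.
    segments = [
        [("a.com/1", "b.com/1"), ("a.com/1", "c.com/1")],
        [("b.com/1", "c.com/1"), ("c.com/1", "a.com/1")],
    ]

    def map_links(record):
        # Map step: emit (target_url, 1) for every link seen.
        source, target = record
        yield target, 1

    def reduce_counts(url, partial_counts):
        # Reduce step: sum the partial counts for one URL.
        return url, sum(partial_counts)

    # "Shuffle": group mapped output by key. A framework does this for you and
    # can run each map/reduce task on a different node.
    grouped = defaultdict(list)
    for segment in segments:
        for record in segment:
            for url, count in map_links(record):
                grouped[url].append(count)

    inlink_counts = dict(reduce_counts(url, counts) for url, counts in grouped.items())
    print(inlink_counts)  # {'b.com/1': 1, 'c.com/1': 2, 'a.com/1': 1}

Because each map or reduce call only ever sees one record or one key, the same job can be spread across as many machines as the corpus needs, which is the point of moving to that architecture.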

jmccormac

11:17 am on Aug 21, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I find that the problem with a lot of university research is that it is solving last year's problems. The commercial search business moves at a far faster pace than university research.

To enter the search market you don't need that much. All you need is technology that improves on what is already out there, as well as a bit of clever marketing (like Google did); it doesn't matter what the arena already looks like. Back in 1998 we all thought Yahoo, AltaVista and Excite were the bomb and here to stay, and then came Google.
Or a technology that is more finely focussed on a niche market. Yahoo is still here. Excite is gone and AltaVista is more hasta la vista these days.

Google had only a million dollars in funding when they took on AltaVista, Yahoo etc. during the internet bubble.
The web was significantly smaller then. At the moment, I'm loading the data from a website-to-IP mapping project that has mapped all websites in com/net/org/biz/info to their respective countries, and applying some algorithms to improve the geolocation by non-IP methods. In 1998 it would have required quite a few computers and a lot of hard disks to do this. Now I can play with the dataset on a near desktop-spec computer and one hard disk.
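[Editor's note: for illustration only, a rough Python sketch of the kind of website-to-country mapping described above. The IP ranges, country codes and helper names are invented; a real pipeline would work from registry allocation data rather than a two-entry table.]

    import socket
    from ipaddress import ip_address

    # Invented two-entry IP-range-to-country table; a real one would be built
    # from regional registry allocations and hold hundreds of thousands of ranges.
    IP_RANGES = [
        (ip_address("192.0.2.0"), ip_address("192.0.2.255"), "IE"),
        (ip_address("198.51.100.0"), ip_address("198.51.100.255"), "US"),
    ]

    def country_for_ip(ip):
        # Linear scan for clarity; a real lookup would binary-search sorted ranges.
        addr = ip_address(ip)
        for start, end, country in IP_RANGES:
            if start <= addr <= end:
                return country
        return None

    def country_for_site(hostname):
        # Resolve the hostname, then map its IP to a country. Non-IP hints
        # (ccTLD, postal address on the page, language) could override this.
        try:
            ip = socket.gethostbyname(hostname)
        except socket.gaierror:
            return None
        return country_for_ip(ip)

    print(country_for_ip("192.0.2.42"))  # IE (from the invented table above)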

Starting a country-level search engine does not require a million-dollar investment, and it is still the final frontier for Google and the big players. It is also the only frontier on which they can be repeatedly defeated by smaller players.
A large index means nothing if it is not relevant to the end user. Any new developments may come from people who identify and exploit new niches. The me-too operations that seek to fight Google and Yahoo on their own terms may find it difficult to get funding because they simply do not have any unique selling points.

Regards...jmcc

Event_King

5:15 pm on Aug 21, 2005 (gmt 0)



There are a couple of engines that have gone bust, Mozilla being one. I never got to know what Mozilla's USP was. Oh well....

Lord Majestic

12:06 pm on Aug 28, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've been working on tweaking a 50-75 million page index and am just now expanding.

How much disk space does your index take? And is it 50 or 75 million? Just curious as to how many bytes are used per page in your index.

Dave_A

2:17 am on Sep 30, 2005 (gmt 0)

10+ Year Member



Hi, it doesn't look like you're getting much help finding a university-based search engine project.
The last one I heard of was Google, which I think started out as a university project.
I have developed a new search engine, but it is only a country-specific engine that has indexed around thirty-eight million web pages so far.
The problem that has come up so far is page ranking; even this small number of pages needs ranking, so I have almost finished a content-related page ranking method that measures relationships of words/meanings and forms a picture of the most relevant pages within the index.
I think that any page ranking must come from content and not from how many links a website has.
It also means that you can't buy content, which makes the website design more important.
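[Editor's note: Dave_A hasn't described his method in detail, so the following is only a generic sketch of content-based scoring (plain TF-IDF weighting in Python), not his algorithm; the sample pages and function names are made up.]

    import math
    from collections import Counter

    # Toy corpus; in practice these would be the parsed text of indexed pages.
    docs = {
        "page1": "open source search engine built on lucene",
        "page2": "country specific search engine for local websites",
        "page3": "recipe for brown soda bread",
    }

    def tf_idf_vectors(docs):
        # One TF-IDF weight vector per document, built purely from page content.
        tokenized = {name: text.split() for name, text in docs.items()}
        df = Counter(term for terms in tokenized.values() for term in set(terms))
        n = len(docs)
        vectors = {}
        for name, terms in tokenized.items():
            tf = Counter(terms)
            vectors[name] = {t: (tf[t] / len(terms)) * math.log(n / df[t]) for t in tf}
        return vectors

    def rank(query, docs):
        # Score every page against the query using content weights only,
        # with no link counting anywhere.
        vectors = tf_idf_vectors(docs)
        q_terms = Counter(query.split())
        scores = {name: sum(vec.get(t, 0.0) * q_terms[t] for t in q_terms)
                  for name, vec in vectors.items()}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    print(rank("search engine", docs))  # page1 and page2 outrank page3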

ByronM

1:45 pm on Oct 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




How much disk space does your index take? And is it 50 or 75 million? Just curious as to how many bytes are used per page in your index.

We are almost back up to 100 million pages. I'll see what we are using.. it's nearly 1T for everything. The 100 million pages is what we have indexed; the actual db containing links is pushing 200 million, so the sizing/skew isn't exactly right.

ByronM

2:48 pm on Oct 28, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's about 2k per document to index, not including storing the entire document for caching/refreshing or re-analyzing without refetching.

Hope that helps :)
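[Editor's note: putting the figures from the last two posts together, a rough back-of-the-envelope check in Python, assuming "2k" means roughly 2 KB per document and "1T" roughly one terabyte.]

    # Rough check of the figures quoted in the posts above.
    pages_indexed = 100_000_000      # ~100 million pages in the index
    bytes_per_doc = 2 * 1024         # ~2k of index data per document
    index_bytes = pages_indexed * bytes_per_doc
    total_bytes = 1 * 1024**4        # ~1T "for everything"

    print(f"index alone: {index_bytes / 1024**3:.0f} GB")   # ~191 GB
    print(f"everything:  {total_bytes / 1024**3:.0f} GB")   # ~1024 GB
    print(f"index share: {index_bytes / total_bytes:.0%}")  # ~19%

On those assumptions the index proper is only about a fifth of the terabyte, the rest being the stored fetched content kept for caching, snippets and re-analysis.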

Lord Majestic

3:22 pm on Oct 28, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



it's about 2k per document to index

Thanks Byron - one more question though: does this include the pure index only, or also the data structures needed to show text snippets around matches?

ByronM

1:48 pm on Nov 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks Byron - one more question though: does this include the pure index only, or also the data structures needed to show text snippets around matches?

2k is index data; segments really depend on what you are fetching and have parsed (RSS/XML/HTML/PDF/MS Word and such..).

I'll try and run some stats and publish them on the Nutch wiki and possibly here as well.

Lord Majestic

1:49 pm on Nov 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'll try and run some stats and publish them on the Nutch wiki and possibly here as well.

Thanks, that would be very helpful indeed.