Forum Moderators: bakedjake
My background includes real-time processes. I'm also a sort of technical visionary, sometimes peering so far over the horizon that no one will listen.
For instance, my first vision of peer-to-peer parallel processing was in 1984. People thought I was joking. Google is playing with that now, and while I'm not up to date on this, I'm sure many others are too. Then, of course, the peer-to-peer file sharing networks like Napster, Kazaa, etc. are testimony to the potential of P2P wide area networks.
I saw the "information highway" forming while most people were still trying to decide whether a home PC was worth the investment, and for the most part, deciding against it.
I'll start by saying that the only place the resources exist for a next-generation search engine is on peer-to-peer networks. Server farms of 10,000+ processors just won't cut it. Think of a small application running behind a search toolbar on a few hundred thousand computers. Think of a hierarchical real-time resource manager managing the data acquisition, storage and processing as computers randomly enter and leave the network.
The concepts here aren't rocket science. In fact, they're old stuff, as anyone who has cooked up their own real-time multi-tasking operating system and designed distributed real-time processes can attest.
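To make the resource-manager idea a bit more concrete, here's a toy sketch in Python. All the names and structure here are mine, not anything that exists yet: a coordinator hands crawl shards to whichever peers happen to be online and reclaims them when a peer drops off.

class ResourceManager:
    """Toy coordinator: hands crawl shards to whichever peers are currently online."""

    def __init__(self, shards):
        self.unassigned = list(shards)   # work not yet handed out
        self.assignments = {}            # peer_id -> list of shards held by that peer

    def peer_joined(self, peer_id):
        self.assignments[peer_id] = []
        self._rebalance()

    def peer_left(self, peer_id):
        # Work held by the departing peer goes back into the pool.
        self.unassigned.extend(self.assignments.pop(peer_id, []))
        self._rebalance()

    def _rebalance(self):
        peers = list(self.assignments)
        while self.unassigned and peers:
            # Hand the next shard to the least-loaded peer.
            peer = min(peers, key=lambda p: len(self.assignments[p]))
            self.assignments[peer].append(self.unassigned.pop())

# Example: peers come and go, but the work stays covered.
mgr = ResourceManager(shards=["shard-%d" % i for i in range(8)])
mgr.peer_joined("node-a")                # node-a picks up all eight shards
mgr.peer_joined("node-b")
mgr.peer_left("node-a")                  # node-a's shards are re-handed to node-b

A real version would obviously need time-outs, redundancy and a hierarchy of managers, but the join/leave/rebalance loop is the core of it.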
One of the most challenging problems today for a search engine is coping with all the dynamic pages. There are plenty of sites that use dynamic pages to manage their very valuable content, but there is also a huge amount of crap.
A group of volunteers could be really helpful in this situation and provide address schemes that tell the engines where the good dynamic pages are.
This should result in some kind of a "white list", where the "good" dynamic URLs are listed as regular expression patterns:
www.site1.com/article.php?id=[0-9]{1,4}
www.site2.com/board/thread.cgi?forum=[0-9]+&id=[0-9]{1,5}
With such a list, the engines would be able to crawl a notable part of the hidden web, but avoid wasting their resources on garbage pages.
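Purely as a sketch of how an engine might apply such a whitelist at crawl time (the function names and file handling below are made up; note that the literal "." and "?" in the listed patterns would need escaping before being fed to a real regex engine):

import re

def load_whitelist(path):
    """Compile one pattern per line; anchor them so partial matches don't slip through."""
    patterns = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            patterns.append(re.compile(r"^(?:https?://)?" + line + r"$"))
    return patterns

def is_whitelisted(url, patterns):
    """True if a dynamic URL matches any whitelisted pattern."""
    return any(p.match(url) for p in patterns)

# Example whitelist line (with the literal '.' and '?' escaped):
#   www\.site1\.com/article\.php\?id=[0-9]{1,4}
# is_whitelisted("www.site1.com/article.php?id=123", patterns)         -> True
# is_whitelisted("www.site1.com/article.php?id=123&sid=xyz", patterns) -> False

Anchoring the patterns matters: otherwise a garbage URL that merely contains a whitelisted prefix (say, with a session ID tacked on) would slip through.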
Some of our theories and observations are:
- The web is 4 billion pages, but only 45 million domains exist.
- Utilize and leverage existing technologies.
- DMOZ is great; it should be used as a core (1.8 million domains)
- Yahoo is great (700K domains)
- Advanced relevancy technology is needed.
- Database approach is too slow.
- Distributed Toolbar approach is too slow.
- Need for SEO minded people as the core group to better understand the SE.
- C is needed for core processes.
- Perl or Python should be used for crunching.
Give me something to talk to.. Summarize everything in this thread.. We are looking to redo this search engine..
Wayne
wbienek@lvcm.com
[edited by: jeremy_goodrich at 9:12 pm (utc) on June 16, 2003]
[edit reason] snipped domain name - please see TOS thanks [/edit]
After all, I'm not about to go dropping the name of MY search tool for everybody to peruse - as much as I might like to.
Thanks much - great discussion, and I can see that there are potentially LOTS of great things that could come out of collaboration among the various parties here.
In any case, do we decide to use a cache to look up information, like Yahoo?
Are we going to use link popularity or page ranking?
Are we going to have any directory at all in it (even Google now has a directory on its web site)?
I do believe a serious discussion needs to involve people who are taking this seriously and not in it for the money.
Anyways, since it seems that most of the people I have stickied with are more in the middle to top end and not on the low level of SE work, I remembered a while ago that Google had some sort of programming contest where they gave out a bit of code and a dump of data, so that is probably the first route I will try. Like I have already said, spidering is probably the area where I need the most help. I can either put up something whereby people take the rawest of the raw spider output (which is mostly an XML header with an HTML document) *or* the parsed output (which I think would be more useful, especially for people who are into the whole indexing thing). Or better yet - both. :)
Anyway, I have provided mostly data files and descriptions of how I go from one place to another, with the hope that you guys can get creative out there. We'll see! So you don't have to download the whole thing, the 'readme'-like file is provided:
Jun 16 2003
You're free to do whatever you want with all of this stuff. As I've previously stated, it is pretty clear (to me) that most of the work needs to go into the spider and parser. Anyone who does any work that is used at xxx based on these files will be given appropriate credit.

xxx basic file format info:
First, the spider runs and generates:
fluffy.out:
This is a pseudo-xml doc which contains the url as I see it. Each "FluffyURL" section acts as a record separator as well. SEO's: don't try to put this into your page because the spider will toss it.
-----
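The actual markup isn't shown in this readme, so purely as a guess at what reading fluffy.out could look like (the tag form and file layout below are assumptions on my part):

import re

def iter_fluffy_records(path):
    """Yield one pseudo-XML record per "FluffyURL" section.

    The tag name comes from the description above; the exact file layout
    is an assumption.
    """
    with open(path, encoding="utf-8", errors="replace") as fh:
        data = fh.read()
    for match in re.finditer(r"<FluffyURL>(.*?)</FluffyURL>", data, re.DOTALL):
        yield match.group(1)

# for record in iter_fluffy_records("fluffy.out"):
#     ...  # hand the record to the metadata generator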
This file is then processed by a metadata generator which creates two files, fluffy.out.db and fluffy.out.log. The log is used to find other urls to crawl and also combined with other data (toolbar usage, links to xxx, etc) to generate the overall hipranking for a site.
-----
fluffy.out.db:
This is exactly the same format as fluffy.hrd (below), except that it doesn't have the hiprank. Also, the ID ends up getting reassigned.

fluffy.out.log:
1) Link type
2) If "1", this is a non-local link
3) Link destination
-----
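As a rough sketch of pulling crawl candidates out of that log (the tab separator is my assumption; the three columns follow the list above):

def outbound_links(log_path, sep="\t"):
    """Yield (link_type, destination) for non-local links found in fluffy.out.log."""
    with open(log_path) as fh:
        for line in fh:
            parts = line.rstrip("\n").split(sep)
            if len(parts) < 3:
                continue
            link_type, non_local, destination = parts[0], parts[1], parts[2]
            if non_local == "1":          # "1" marks a non-local link
                yield link_type, destination

# Non-local destinations could then be queued for the next crawl pass:
# to_crawl = [dest for _type, dest in outbound_links("fluffy.out.log")]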
Last of the metadata parsing/munging, the fluffy.out.db file is ranked and produces:
-----
fluffy.hrd:
1) RecID (assigned)
2) HipRank (int, absolute value, not geometrically scaled)
3) URL (fully normalised url)
4) DomRev (just the domain in reverse)
5) AdultFlag (int)
6) TimeStamp
7) Size (int)
8) PageTitle
9) PageText
10) PagePhrase (text, comma separated phrases, user submitted listing stuff)
11) OwnerID (text, user submitted listing)
12) Modifiers (for customisation, should always be blank here)
13) Meta Description
14) Meta Keywords
15) ImgAlt text, (backtick separated per image in order found on page)
16) Record Type (int)
17) Link texts (backtick separated, per link in order found on page)
18) EMBText (any text found with <EM> or <B>'s. backtick separated, per link in order found on page)
19) HText (any text found within <Hx>'s backtick separated, per link in order found on page)
20) CTURL (specialized clickthrough url)
-----
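A quick sketch of loading those records in Python (the tab separator is an assumption; the field names just mirror the list above):

import csv
from collections import namedtuple

# Field names follow the fluffy.hrd description above; the separator is an assumption.
FIELDS = [
    "RecID", "HipRank", "URL", "DomRev", "AdultFlag", "TimeStamp", "Size",
    "PageTitle", "PageText", "PagePhrase", "OwnerID", "Modifiers",
    "MetaDescription", "MetaKeywords", "ImgAlt", "RecordType",
    "LinkTexts", "EMBText", "HText", "CTURL",
]
HrdRecord = namedtuple("HrdRecord", FIELDS)

def read_hrd(path, sep="\t"):
    """Yield one HrdRecord per line of a fluffy.hrd-style file."""
    with open(path, newline="") as fh:
        for row in csv.reader(fh, delimiter=sep):
            if len(row) != len(FIELDS):
                continue  # skip malformed lines
            yield HrdRecord(*row)

# Backtick-separated fields (ImgAlt, LinkTexts, EMBText, HText) can be split further:
# record.LinkTexts.split("`")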
This file (fluffy.hrd in this example) serves as the input to yet another filter which produces a series of "tagmaps" for different parts of the data, sorted based on the "word rank". The file format here is:
1) Tag
2) WordRank (this is a geometrically scaled value, higher=better)
3) RecID (from fluffy.hrd)
4) AdultFlag (really from fluffy.hrd)
5) DomRev (again, from fluffy.hrd)

Ultimately, this is the inverted index. Note the coarseness -- no word positioning or flagging here; it's all been incorporated into the WordRank. How should it be searched?

If you have questions or comments, email to xxxxxxxxxxx
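One possible answer to the "how should it be searched?" question above, under the same tab-separator assumption: since each tagmap already carries a single WordRank per (tag, RecID), a naive multi-word query can merge the per-word lists and sum the ranks.

from collections import defaultdict

def load_tagmap(path, sep="\t"):
    """tag -> list of (wordrank, recid), preserving the pre-sorted order."""
    index = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            parts = line.rstrip("\n").split(sep)
            if len(parts) < 5:
                continue
            tag, wordrank, recid = parts[0], parts[1], parts[2]
            index[tag].append((int(wordrank), recid))
    return index

def search(index, query, limit=10):
    """Naive multi-word search: sum WordRank over matching tags per RecID."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for wordrank, recid in index.get(word, []):
            scores[recid] += wordrank
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]

# index = load_tagmap("tagmap.dat")
# for recid, score in search(index, "search engine"):
#     print(recid, score)   # RecID points back into fluffy.hrd

Summing geometrically scaled ranks is only one choice; a real query layer would probably want phrase handling and the AdultFlag/DomRev columns for filtering, but this shows where the data would plug in.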
To get the raw data files, go to "my site" /hiptest-20030616.tgz -- this file is about 2M.
If you can't figure out that magic URL, sticky me and I will email it to you.
I'll start by saying the only place the resources exist for a next-generation search engine is on peer-to-peer networks
Agreed, although if grub manages to attract the DC community to support them, then I'd see them as a formidable power in the future.
Maybe if Shippo and Gigablast teamed up with some sort of shared human index, then I would find that very exciting.
Rob
I read somewhere that you do not use a regular database like Oracle or Informix to store your data because it is too slow. Why? Is a flat data file processed faster?
A typical RDBMS provides a lot of functionality that isn't needed by a search engine. But this functionality makes it slower compared to a tailored solution. I wouldn't call it a "flat data file" though; think of it as a specialized version of a RDBMS that just does what the engine needs - in a very efficient way.
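As a toy illustration of what a "specialized version of a RDBMS" can mean in practice (everything below is just my own sketch): if the index is built offline and only ever read at query time, something as simple as a sorted array plus binary search stands in for the whole query planner, locking and transaction machinery.

import bisect

class StaticIndex:
    """Read-only key lookup: all the engine needs at query time, and nothing else.

    Keys are sorted once at build time; lookups are O(log n) with no
    query parsing, locking, or transaction overhead.
    """

    def __init__(self, pairs):
        pairs = sorted(pairs)                 # the build step happens offline
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

# index = StaticIndex([("engine", [12, 87]), ("search", [12, 40, 87])])
# index.get("search")  -> [12, 40, 87]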