If you were to ask these questions:
Q1: Show me 100 webmasters who would have liked to share ownership of "yahoo or similar"
Q2: Show me 100 webmasters who would be interested in owning the "next yahoo or similar"
That's where the point of interest should be.
It seems like some people have good contributions to make on a project like this and are willing to put something into it - so has anybody got some web space and another free domain for a first step :-)? How about a subcommunity here called the "Webmaster World Wide Web.com"? (The "W4" :-) )
Having designed search engine middleware, as well as fused a few open source apps to create my own SE, I know something of the difficulties in such a project - which is why I suggest starting out small.
However, as Gigablast [gigablast.com] or SearchHippo [searchhippo.com] demonstrate, "one man shows" can still make a run at it & achieve a level of success.
It depends on your goals, I guess, how far you want to take it. There is always Stickymail here at WebmasterWorld [webmasterworld.com] to communicate with other like-minded folks & get something going.
I wish luck to anybody who embarks down the path of SE creation - after all, they just might create something I'll grow to love as much as Google one day :)
Directories:
+ technologically simple
+ get visitors from the big engines
- costs a great deal of time (or manpower) to grow to a size that can compete with the likes of Yahoo or ODP
Search Engines:
+ can be operated by a single person
- needs high-end technology
- needs an internet connection with high bandwidth
So you have to decide whether you can get several people to help you fill your directory, or whether your programming skills are good enough to create a search engine. In other words: a directory needs social skills, an engine needs technical ones.
But whatever decision you make, be prepared to compete with the big ones: Yahoo/ODP/Looksmart on the directory side, Google/FAST/Inktomi on the search engine side.
One challenge that both directories and search engines face is marketing: establishing a brand and brand awareness.
This is where I believe the "joint project" approach has an advantage that could sway a decision towards a directory, with a joint project team working together on a united promotion and marketing strategy.
Directories:
"+ get visitors from the big engines"
ADVANTAGE for marketing and establishing a brand / brand awareness
"- costs very much time (or man power) to grow to a size that can compete with the likes of Yahoo or ODP"
I believe the manpower input could be minimal, with the main emphasis on promotion - for the webmasters here, that is one of the key skills that could be brought to the team effort of a joint project.
What would be the result of a team of skilled webmasters promoting a directory as a joint project?
With a good foundation / starting base for a directory, the build could become self-generating, with businesses / organisations listing their sites through an "add URL" feature - paid submissions. The admin side of a directory should then be only a fraction of the work involved.
One person could process about 200 submissions per day - at a submission fee of $50, that is revenue of $10,000 per day per admin / editor.
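To make the arithmetic explicit - a quick sketch, where the throughput and fee are simply the assumptions stated above, not measured numbers:

```cpp
#include <iostream>

// Assumed figures from the post above: one admin/editor reviewing
// ~200 paid submissions per day at a $50 submission fee.
int main() {
    const int submissionsPerDay = 200;  // assumed editor throughput
    const double feeDollars = 50.0;     // assumed submission fee

    std::cout << "Revenue per editor per day: $"
              << submissionsPerDay * feeDollars << "\n";  // $10,000
}
```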
Could a directory be promoted so well that it could charge a submission fee of $100, $200 or $300 (Yahoo charges that much)?
How many sites would want to be listed?
What value would be put on such a directory?
How many webmasters would want to have equal shares in a joint directory?
Overall, in my opinion, generating the traffic to this collection is the major issue.
ACK. As colintho pointed out, marketing is crucial to achieve this. But before even thinking about marketing we'll need a product - a directory that can compete with the big ones. If we start with a nearly empty directory, there will be no traffic to the site and no (paid) submissions.
It's the classic chicken-and-egg problem.
We are slowly populating our directory to get past the "chicken & egg" situation - just the two of us, a husband-and-wife team, adding sites whenever we can.
At some point we hope to achieve a "critical mass" that would attract higher numbers of paid submissions.
Where is this point? Unknown.
How long will it take? Unknown.
How can this be accelerated? A "joint project".
But a joint project also needs a "critical mass" of interested participants before it is worth transferring a sole enterprise over to a "joint project".
With just a few participants, it may not work.
Then the situation remains almost as it is - the same as the past couple of years.
In September 2002 there was little interest shown in doing or trying this, so I started by myself.
It's taken quite some time to even reach the point we are at.
To date, a small number of people have shown some interest and replied by sticky - much appreciated.
If there are enough interested people, it may be worth really doing.
Years later, Stanford university students invent a search engine - these aren't companies like Microsoft, IBM or Compaq.
Not massive corporations, just anybody.
Basically it's just building a brand and getting people to trust in it.
Isn't it a surprise - no major corporation has invented anything at all that is dominant.
But the fact is that most people probably don't understand the amount of data and work/hardware/bandwidth it takes to do it in a reasonable fashion. So, much like with dmoz, instead of downloading a bunch of RDFs they simply grab some HTML parser that runs the search off dmoz, which is effectively backfill for them.
And that was exactly the idea I had in mind way, way back in 2001, when I decided to open up the free XML "feed" of the spidered results. I generally would welcome help with index building/optimizations/techniques, but with the current load that I have, slight programming errors can have hugely drastic results. In particular, everything will look fine until you hit that magic ~300 disk seeks/second number and then the entire thing comes to a screeching halt. So you need to be pretty involved in the process.
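To see why that seek ceiling is so brutal, here is a back-of-the-envelope sketch; the seeks-per-query values are invented purely for illustration:

```cpp
#include <iostream>

int main() {
    // The ~300 random seeks/second ceiling mentioned above.
    const double maxSeeksPerSecond = 300.0;

    // Hypothetical: each query touches several inverted-list chunks
    // that miss the cache, and each miss costs one random disk seek.
    for (int seeksPerQuery : {5, 10, 20, 50}) {
        std::cout << seeksPerQuery << " seeks/query -> at most "
                  << maxSeeksPerSecond / seeksPerQuery
                  << " queries/second\n";
    }
}
```

Once the seek budget is spent, every extra query just queues behind the disk arm - hence the "screeching halt".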
On the other hand, you could simply put together a package of "here is the software required, spider, etc.", which is exactly what the mnogosearch people are doing (although I question the mysqlness - I have been experimenting with mysqlitizing some pieces of shippo, but none of them are working well at scale). And even then there are a bunch of other issues -- is there a "master index" everyone starts from and uses? Or does everyone end up doing crazier things, so you have ten billion little search engines using the same code with different indexes of varying sizes and quality, but the same basic algorithms? Again, I come back to the mnogosearch thing and I just ponder what the use of that would be.
The first issue is money. There is clearly a lot of money to be made in the search engine market, so this makes for a situation where money can quickly get in the way of passion and fun. I try to keep things down somewhat -- enough to pay for my bandwidth and hardware, with a bit left over that basically goes into promo stuff.
The second is traffic. Obviously letting people get involved implicitly drives the whole notion of traffic, but again, in a search engine oriented fashion, I wonder how that would work. In other words, a large part of what makes any search engine interesting is that it has to be able to drive traffic to other sites. And obviously if there is no traffic to a semi-centralized repository for this, then there is no point in bothering.
"But the fact is that most people probably don't understand the amount of data and work/hardware/bandwidth it takes to do it in a reasonable fashion."
FULL ACK. Working on a little (full text) search engine in my spare time, I can fully back your postings. A search engine consists of more than just a few programs that, like an operating system, anyone is able to install on his box.
A search engine is a combination of several servers, programs, databases, storage units, network gear and a person who knows how to make all these parts interact.
So I can't see what an open source search engine would bring.
The spidering and searching need to take place on someone's hardware and use someone's bandwidth. With dmoz, it's AOL.
Dmoz does require a series of links on any pages that use dmoz data, regardless of how they get the data. This is just an idea, but what if you also insisted on a series of links that, when clicked, not only brought up the requested page but also some affiliate advertising that somehow relates to the search that was just run on the other site?
"I generally would welcome help with index building/optimizations/techniques, but with the current load that I have, slight programming errors can have hugely drastic results. In particular, everything will look fine until you hit that magic ~300 disk seeks/second number and then the entire thing comes to a screeching halt. So you need to be pretty involved in the process."
What if suggestions had to go through a process before being implemented, as is done at w3c.org? Do you think there would be enough interest in competing with Google to find qualified programmers to sift through and filter recommendations?
"On the other hand, you could simply put together a package of "here is the software required, spider, etc.", which is exactly what the mnogosearch people are doing (although I question the mysqlness - I have been experimenting with mysqlitizing some pieces of shippo, but none of them are working well at scale). And even then there are a bunch of other issues -- is there a "master index" everyone starts from and uses? Or does everyone end up doing crazier things, so you have ten billion little search engines using the same code with different indexes of varying sizes and quality, but the same basic algorithms? Again, I come back to the mnogosearch thing and I just ponder what the use of that would be."
You couldn't have everyone running their own spider.
--------
Actually, I think mine was also the first engine to give "bonus points" to sites which link back to shippo. Nonetheless, I do try to encourage people who list with me to link back - and obviously anyone who uses the xml feed should - but most people don't.
I have many times asked people on this board for ideas and suggestions, but I haven't seen/heard anything that jumped out as the "wow, I need to do that" sort of thing. Sure, you could run things through some processes and see what happens, but who knows.
If there are people out there who know freebsd, c/c++ and php and general database theory, by all means stickymail me. I am in the middle of a bunch of updates and rejiggering of data (from the crawl that I think the people in forum11 noticed!), but once I get these next pieces laid down it probably would be at a good "let someone in" stage. Maybe. ;)
For a directory of decent size it would require a lot of manhours. Nobody wants to use a directory (or search engine) with no listings. Expect to do nothing but add sites for the first year. And you would need a directory script with category editors, because when you have thousands of listings somebody has to patrol for deadwood, redirects etc.
Few people will pay for submission until you do start getting some traffic. None of those factors are insurmountable but they have to be considered. :)
See, the thing is that you need a good idea at a time when the competition has not even noticed the need or realised its relevance. Like when G got off the ground: it had a brilliant relevancy idea and its timing was impeccable. No explanation needed, hey guys.
Anyway, I think we all know what that next idea is going to be. Maybe more relevancy of content-based search results. How to execute this is the key!
Even with an off-the-shelf SE program you could then beat the big general engines for relevancy without having to hire a gaggle of expensive PhDs and cooks. :)
I'm also curious if there's any attempt to convert a document into a mathematical equation.
Lots of interesting theory, but lots of hard work and a long way to go.
Cheers
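On the "document as a mathematical equation" question: the classic answer is the vector space model - a document becomes a vector of term weights, and two documents (or a query and a document) are compared by the cosine of the angle between their vectors. A minimal sketch, using raw term counts with no idf weighting or stemming:

```cpp
#include <cmath>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Turn a text into a bag-of-words vector: term -> raw count.
std::map<std::string, double> toVector(const std::string& text) {
    std::map<std::string, double> v;
    std::istringstream in(text);
    std::string word;
    while (in >> word) v[word] += 1.0;
    return v;
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
double cosine(const std::map<std::string, double>& a,
              const std::map<std::string, double>& b) {
    double dot = 0, na = 0, nb = 0;
    for (const auto& [term, w] : a) {
        na += w * w;
        auto it = b.find(term);
        if (it != b.end()) dot += w * it->second;
    }
    for (const auto& [term, w] : b) nb += w * w;
    return (na && nb) ? dot / (std::sqrt(na) * std::sqrt(nb)) : 0.0;
}

int main() {
    auto doc   = toVector("the spider builds an index of the crawled pages");
    auto query = toVector("spider index");
    std::cout << "similarity: " << cosine(doc, query) << "\n";
}
```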
Monkey is an Animal (True)
Animal is a Monkey (False)
Animal is the parent of Monkey.
Anyway, just looking for ideas.
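A tiny sketch of that asymmetry: if you model "is-a" as a pointer to a hypernym (parent) and walk the chain upward, the relation holds in one direction only. The mini-taxonomy here is made up for illustration:

```cpp
#include <iostream>
#include <map>
#include <string>

// Made-up mini taxonomy: each word maps to its hypernym (parent).
std::map<std::string, std::string> parent = {
    {"monkey", "animal"},
    {"animal", "organism"},
};

// True if `a` is-a `b`: walk the hypernym chain upward from `a`.
bool isA(std::string a, const std::string& b) {
    while (parent.count(a)) {
        a = parent[a];
        if (a == b) return true;
    }
    return false;
}

int main() {
    std::cout << std::boolalpha
              << "monkey is-a animal: " << isA("monkey", "animal") << "\n"   // true
              << "animal is-a monkey: " << isA("animal", "monkey") << "\n";  // false
}
```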
Imho, that database is a must for anybody thinking of building out a lexicon for their own search engine - it only makes good sense to build on the work other people have done, instead of reinventing the wheel :)
However, afaik, the rest of the stuff is pretty hard to come up with - a good crawl application that respects robots.txt, works fast, and also follows http 1.0 well.
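The robots.txt part, at least, is easy to prototype - a minimal sketch of the core prefix check (fetching the file, per-agent sections, Allow lines and wildcards are all left out):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Minimal robots.txt check: a path is blocked if it starts with any
// Disallow prefix. Real parsers also handle per-agent sections,
// Allow lines and wildcards; this is just the core idea.
bool allowed(const std::vector<std::string>& disallow, const std::string& path) {
    for (const auto& prefix : disallow)
        if (!prefix.empty() && path.compare(0, prefix.size(), prefix) == 0)
            return false;
    return true;
}

int main() {
    // Pretend these came out of a fetched robots.txt for our user-agent.
    std::vector<std::string> disallow = {"/cgi-bin/", "/private/"};
    for (std::string path : {"/index.html", "/cgi-bin/search?q=x"})
        std::cout << path << " -> "
                  << (allowed(disallow, path) ? "crawl" : "skip") << "\n";
}
```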
Then there is the indexer, and the algo - and the "algo" portion is the most difficult of them all.
Great thread - lots of interest. Search engines are, as systems, probably my favorite "AI-type application" in existence at this point.
A few related PDFs I've read use it for "word disambiguation", i.e. is a "flyer" something that flies or a piece of paper? The idea is to use the surrounding text to find out which one is true, by comparing it to the WordNet description, or to hypernyms/hyponyms.
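A rough sketch of that idea, in the spirit of the Lesk-style gloss overlap those papers describe: score each candidate sense by how many of its gloss words appear in the surrounding text. The two "flyer" glosses below are paraphrased for illustration, not real WordNet entries:

```cpp
#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Split a string into its set of words.
std::set<std::string> words(const std::string& s) {
    std::set<std::string> out;
    std::istringstream in(s);
    std::string w;
    while (in >> w) out.insert(w);
    return out;
}

int main() {
    // Paraphrased glosses for two senses of "flyer" (illustration only).
    std::vector<std::pair<std::string, std::string>> senses = {
        {"aviator", "someone who flies an aircraft pilot"},
        {"leaflet", "a small printed sheet of paper handed out"},
    };
    // The context words surrounding the ambiguous token.
    auto context = words("she handed me a flyer printed on pink paper");

    // Pick the sense whose gloss shares the most words with the context.
    std::string best; int bestScore = -1;
    for (const auto& [sense, gloss] : senses) {
        int score = 0;
        for (const auto& w : words(gloss))
            if (context.count(w)) ++score;
        std::cout << sense << " overlap: " << score << "\n";
        if (score > bestScore) { bestScore = score; best = sense; }
    }
    std::cout << "chosen sense: " << best << "\n";  // -> leaflet
}
```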
The thing I'm trying to learn more about is phrase building: when a page is tokenised, you move from beginning to end with a "window" (perhaps 7/8 words long) looking for phrases. If anyone knows of a good tutorial (and if it's OK with Jeremy), a URL or two would be great.
Matching a single-word query against matches may be easier, but something that recognises phrases would help no end.
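Not a tutorial, but a bare-bones sketch of that windowed pass: slide a fixed window over the token stream and count word pairs that co-occur inside it - pairs that recur far more often than chance are phrase candidates. A real system would add frequency or mutual-information thresholds:

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main() {
    // Naive whitespace tokenisation of the page text.
    std::string page = "open source search engine projects favour open source tools";
    std::vector<std::string> tokens;
    std::istringstream in(page);
    for (std::string t; in >> t; ) tokens.push_back(t);

    // Slide a 7-word window (the length mentioned above) over the
    // stream and count ordered word pairs that co-occur inside it.
    const size_t window = 7;
    std::map<std::string, int> pairCount;
    for (size_t i = 0; i < tokens.size(); ++i)
        for (size_t j = i + 1; j < std::min(tokens.size(), i + window); ++j)
            ++pairCount[tokens[i] + " " + tokens[j]];

    // Pairs seen more than once are candidate phrases.
    for (const auto& [pair, n] : pairCount)
        if (n > 1) std::cout << pair << " x" << n << "\n";  // "open source x2"
}
```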
Resources - good ones, with unique & incredible algorithms for engine building - are few & far between.
Once you get past the "basic steps", I've found it's very, very hard to find anything public domain / readily available that will help with the more common tasks: tokenizing, fast lookups, compression, etc.
Though there could be stuff out there I've not seen - it's been a while since I was in on an engine building project of my own.
One of the users actually did stickymail me, as I previously asked, and I replied with the following - so again, if there is any interest given this additional information about shippo, let me know:
So right now the core backend stuff is all pretty much in c/c++. This includes the spider, parser, indexer and search engine. I use fastcgi across a few backend servers running the search engine which spits out an xml doc to the front end web server and things go from there into php land.
I did spend some time rewriting the spider to use mysql as a datastore and do a few other clever things to help down the line. The big issue there became processor power: php is very slow (or at least my code is) at parsing the html documents. On top of that, I was having problems getting curl to work (which I have since figured out), although of course I still prefer my own fsockopen's ;)
I spent about 2 months experimenting with the mysql fulltext search in a variety of ways, and even started modifying mysql slightly to fix irritating things like stopwords, size limitations, etc., but they are all performance-crappy. Or, to say it slightly differently, you can't mix indexes with MATCH very well, and while using LIMIT seems to help, you can easily enter stupid queries that take forever, even with only a million indexed documents.
I am now experimenting with a slightly different use for mysql that is basically the exact same "algorithm", but applied to mysql with a few extra parameters to help things along. This is where I am at right now, and I can't yet say whether or not it is promising. At this point, the spider and at least a part of the parser (meta-data generator) are all C and more or less act as filters. I wrote yet another one a few days ago in php to parse the metadata and do a few other nice things. It is *dog slow*, but of course I could always bring that to C if the technique actually ends up working in the long run.
The big issue I have faced with searchhippo has always been around list lengths. That is, when I get my inverted indexes, sometimes the lists are so large that performing the union of those sets causes issues, so there are a bunch of heuristics applied to decide which things to really pay attention to, and for how long. This implies that short lists (i.e. infrequent words) are important, which is a good thing as well. There currently are some sort-order issues that arise because sometimes the deduper (which runs afterwards) ends up cleaning up too much stuff -- more than it retrieved. And lastly, there are some really hackish things to support glorified phrase matching and position weighting that were, more or less, implemented nicely two years ago but are now a big hack.
So the current path I am on is to try and modify the way my inverted indexes work, and see if I can put them into mysql. If I can get that part to work, then I can also dump the fastcgi piece and do some of that processing either with mysql or with some php code, either of which would probably work fine.
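For anyone curious what such list-length heuristics can look like in miniature: process the shortest (most informative) lists first, and cap how many postings you read from any one list so a huge common-word list can't dominate the cost. This is a generic sketch of that style of heuristic, not shippo's actual code; the budget value is invented:

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <vector>

int main() {
    // Toy inverted lists: one sorted doc-id list per query term. Real
    // lists for common words run to millions of entries.
    std::vector<std::vector<int>> lists = {
        {4, 9},                          // rare term: short list
        {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, // common term: long list
    };

    // Handle the shortest lists first and cap the postings read from
    // any one list, so long lists get truncated rather than dominating.
    std::sort(lists.begin(), lists.end(),
              [](const auto& a, const auto& b) { return a.size() < b.size(); });
    const size_t budget = 5; // invented cutoff, for illustration only

    std::map<int, int> hits; // doc id -> number of query terms matched
    for (const auto& list : lists) {
        size_t n = std::min(list.size(), budget);
        for (size_t i = 0; i < n; ++i) ++hits[list[i]];
    }
    for (const auto& [doc, count] : hits)
        std::cout << "doc " << doc << " matched " << count << " term(s)\n";
}
```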
What kinds of things would you want to work on the most? Which areas do you think you'd be able to work on best? I can email you some of the code and let you muck with it and see how things go. It is somewhat modularized, so it's easy to say "this is the input, this needs to be the output" sort of thing.
And, like I have already said, most of what I am doing now is experimental, to see if I can get it to work while solving a bunch of other problems that I have.
Lastly, Jeremy -- there is a book named "Managing Gigabytes" that addresses all of the things you are commenting on: compression, doc weights, fast lookups, etc. The funny thing, though, is that I think of those as the easy parts of the problem, the parts you learn how to solve in school or from a book. The harder part is implementing them, but worse, for me at least, is the spider + meta-data generator. Maybe that is what you mean by tokenizer, but I spend more time tweaking that thing than anything else.
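Since the book came up: one classic posting-list trick from that literature is gap encoding plus a variable-byte code - store the differences between successive doc ids, each in as few bytes as it needs. A compact sketch of one common convention (the high bit marks the final byte of each number):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Variable-byte encode: 7 data bits per byte, most-significant chunk
// first, high bit set on the final byte of each number.
void vbyteEncode(uint32_t n, std::vector<uint8_t>& out) {
    std::vector<uint8_t> tmp;
    do { tmp.push_back(n & 0x7F); n >>= 7; } while (n);
    for (size_t i = tmp.size(); i-- > 0; )
        out.push_back(i == 0 ? (tmp[i] | 0x80) : tmp[i]);
}

int main() {
    // A posting list of doc ids: store the gaps, which stay small.
    std::vector<uint32_t> docs = {33, 40, 1000, 1003};
    std::vector<uint8_t> bytes;
    uint32_t prev = 0;
    for (uint32_t d : docs) { vbyteEncode(d - prev, bytes); prev = d; }

    std::cout << docs.size() * 4 << " raw bytes -> "
              << bytes.size() << " compressed bytes\n";

    // Decode: accumulate 7-bit chunks until a high bit ends the number.
    uint32_t n = 0; prev = 0;
    for (uint8_t b : bytes) {
        n = (n << 7) | (b & 0x7F);
        if (b & 0x80) { prev += n; std::cout << prev << " "; n = 0; }
    }
    std::cout << "\n";  // prints: 33 40 1000 1003
}
```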