If you were to ask these questions:
Q1: Show me 100 webmasters who would have liked to share ownership of "yahoo or similar"
Q2: Show me 100 webmasters who would be interested in owning the "next yahoo or similar"
That's where the point of interest should be.
It seems like some people have good contributions to make on a project like this and are willing to put something into it - so has anybody got some web space and another free domain for a first step :-)? How about a subcommunity here called the "Webmaster World Wide Web.com"? (The "W4" :-) )
Having designed search engine middleware, as well as fused a few open source apps to create my own SE, I know something of the difficulties in such a project - which is why I suggest starting out small.
However, as Gigablast [gigablast.com] or SearchHippo [searchhippo.com] demonstrate, "one man shows" can still make a run at it & achieve a level of success.
It depends on your goals, I guess, how far you want to take it. There is always Stickymail here at WebmasterWorld [webmasterworld.com] to communicate with other like-minded folks & get something going.
I wish luck to anybody who embarks down the path of SE creation - after all, they just might create something I'll grow to love as much as Google one day :)
Directories:
+ technologically simple
+ get visitors from the big engines
- costs a great deal of time (or manpower) to grow to a size that can compete with the likes of Yahoo or ODP
Search Engines:
+ can be operated by a single person
- needs high-end technology
- needs an internet connection with high bandwidth
So you have to decide whether you can get several people to help you fill your directory, or whether your programming skills are good enough to create a search engine. In other words: a directory needs social skills, an engine needs technical ones.
But whatever decision you make, be prepared to compete with the big ones: Yahoo/ODP/Looksmart on the directory side, Google/FAST/Inktomi on the search engine side.
One challenge that both directories and search engines face is marketing: establishing a brand and brand awareness.
This is where I believe the "joint project" approach has an advantage that could sway a decision towards a directory, with a joint project team working together on a united promotion and marketing strategy.
Directories:
"+ get visitors from the big engines"
ADVANTAGE for marketing and establishing a brand / brand awareness
"- costs very much time (or man power) to grow to a size that can compete with the likes of Yahoo or ODP"
I believe the manpower input could be minimal, with the main emphasis on promotion - for the webmasters here, that is one of the key skills that could be brought to the team effort of a joint project.
What would be the result of a team of skilled webmasters promoting a directory as a joint project?
With a good foundation / starting base for a directory, the build could become self-generating, with businesses / organisations listing their sites through an "add URL" feature - paid submissions. The admin side of a directory should then be only a fraction of the work involved.
One person could process about 200 submissions per day - at a submission fee of $50, that is revenue of $10,000 per day per admin / editor.
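To make the arithmetic explicit - a quick sketch, where the throughput and fee are simply the assumptions stated above, not measured numbers:

```cpp
#include <iostream>

// Assumed figures from the post above: one admin/editor reviewing
// ~200 paid submissions per day at a $50 submission fee.
int main() {
    const int submissionsPerDay = 200;  // assumed editor throughput
    const double feeDollars = 50.0;     // assumed submission fee

    std::cout << "Revenue per editor per day: $"
              << submissionsPerDay * feeDollars << "\n";  // $10,000
}
```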
Could a directory be promoted so well that it could charge a submission fee of $100, $200 or $300 (Yahoo charges that much)?
How many sites would want to be listed?
What value would be put on such a directory?
How many webmasters would want to have equal shares in a joint directory?
Overall, in my opinion, generating the traffic to this collection is the major issue.
ACK. As colintho pointed out, marketing is crucial to achieve this. But before even thinking about marketing we'll need a product - a directory that can compete with the big ones. If we start with a nearly empty directory, there will be no traffic to the site and no (paid) submissions.
It's the classic chicken-and-egg problem.
We are slowly populating our directory to get past the "chicken & egg" situation - just the two of us, a husband-and-wife team, adding sites whenever we can.
At some point we hope to achieve a "critical mass" that would attract higher numbers of paid submissions.
Where is this point? Unknown.
How long will it take? Unknown.
How can this be accelerated? A "joint project".
But a joint project also needs a "critical mass" of interested participants before it is worth transferring a sole enterprise over to a "joint project".
With just a few participants, it may not work.
Then the situation remains almost as it is - the same as the past couple of years.
In September 2002 there was little interest shown in doing or trying this, so I started by myself.
It's taken quite some time to even reach the point we are at.
To date, a small number of people have shown some interest and replied by sticky - much appreciated.
If there are enough interested people, it may be worth really doing.
Years later, Stanford university students invent a search engine - these aren't companies like Microsoft, IBM or Compaq.
Not massive corporations, just anybody.
Basically it's just building a brand and getting people to trust in it.
Isn't it a surprise - no major corporation has invented anything at all that is dominant.
But the fact is that most people probably don't understand the amount of data and work/hardware/bandwidth it takes to do it in a reasonable fashion. So, much like with dmoz, instead of downloading a bunch of RDFs they simply grab some HTML parser that runs the search off dmoz, which is effectively backfill for them.
And that was exactly the idea I had in mind way, way back in 2001, when I decided to open up the free XML "feed" of the spidered results. I generally would welcome help with index building/optimizations/techniques, but with the current load that I have, slight programming errors can have hugely drastic results. In particular, everything will look fine until you hit that magic ~300 disk seeks/second number and then the entire thing comes to a screeching halt. So you need to be pretty involved in the process.
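To see why that seek ceiling is so brutal, here is a back-of-the-envelope sketch; the seeks-per-query values are invented purely for illustration:

```cpp
#include <iostream>

int main() {
    // The ~300 random seeks/second ceiling mentioned above.
    const double maxSeeksPerSecond = 300.0;

    // Hypothetical: each query touches several inverted-list chunks
    // that miss the cache, and each miss costs one random disk seek.
    for (int seeksPerQuery : {5, 10, 20, 50}) {
        std::cout << seeksPerQuery << " seeks/query -> at most "
                  << maxSeeksPerSecond / seeksPerQuery
                  << " queries/second\n";
    }
}
```

Once the seek budget is spent, every extra query just queues behind the disk arm - hence the "screeching halt".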
On the other hand, you could simply put together a package of "here is the software required, spider, etc.", which is exactly what the mnogosearch people are doing (although I question the mysqlness - I have been experimenting with mysqlitizing some pieces of shippo, but none of them are working well at scale). And even then there are a bunch of other issues -- is there a "master index" everyone starts from and uses? Or does everyone end up doing crazier things, so you have ten billion little search engines using the same code with different indexes of varying sizes and quality, but the same basic algorithms? Again, I come back to the mnogosearch thing and I just ponder what the use of that would be.
The first issue is money. There is clearly a lot of money to be made in the search engine market, so this makes for a situation where money can quickly get in the way of passion and fun. I try to keep things down somewhat -- enough to pay for my bandwidth and hardware, with a bit left over that basically goes into promo stuff.
The second is traffic. Obviously letting people get involved implicitly drives the whole notion of traffic, but again, in a search engine oriented fashion, I wonder how that would work. In other words, a large part of what makes any search engine interesting is that it has to be able to drive traffic to other sites. And obviously if there is no traffic to a semi-centralized repository for this, then there is no point in bothering.
"But the fact is that most people probably don't understand the amount of data and work/hardware/bandwidth it takes to do it in a reasonable fashion."
FULL ACK. Working on a little (full text) search engine in my spare time, I can fully back your postings. A search engine consists of more than just a few programs that, like an operating system, anyone is able to install on his box.
A search engine is a combination of several servers, programs, databases, storage units, network gear and a person who knows how to make all these parts interact.
So I can't see what an open source search engine would bring.
The spidering and searching need to take place on someone's hardware and use someone's bandwidth. With dmoz, it's AOL.
Dmoz does require a series of links on any pages that use dmoz data, regardless of how they get the data. This is just an idea, but what if you also insisted on a series of links that, when clicked, not only brought up the requested page but also some affiliate advertising that somehow relates to the search that was just run on the other site?
"I generally would welcome help with index building/optimizations/techniques, but with the current load that I have, slight programming errors can have hugely drastic results. In particular, everything will look fine until you hit that magic ~300 disk seeks/second number and then the entire thing comes to a screeching halt. So you need to be pretty involved in the process."
What if suggestions had to go through a process before being implemented, as is done at w3c.org? Do you think there would be enough interest in competing with Google to find qualified programmers to sift through and filter recommendations?
"On the other hand, you could simply put together a package of "here is the software required, spider, etc.", which is exactly what the mnogosearch people are doing (although I question the mysqlness - I have been experimenting with mysqlitizing some pieces of shippo, but none of them are working well at scale). And even then there are a bunch of other issues -- is there a "master index" everyone starts from and uses? Or does everyone end up doing crazier things, so you have ten billion little search engines using the same code with different indexes of varying sizes and quality, but the same basic algorithms? Again, I come back to the mnogosearch thing and I just ponder what the use of that would be."
You couldn't have everyone running their own spider.
--------
Actually, I think mine was also the first engine to give "bonus points" to sites which link back to shippo. Nonetheless, I do try to encourage people who list with me to link back - and obviously anyone who uses the xml feed should - but most people don't.
I have many times asked people on this board for ideas and suggestions, but I haven't seen/heard anything that jumped out as the "wow, I need to do that" sort of thing. Sure, you could run things through some processes and see what happens, but who knows.
If there are people out there who know freebsd, c/c++ and php and general database theory, by all means stickymail me. I am in the middle of a bunch of updates and rejiggering of data (from the crawl that I think the people in forum11 noticed!), but once I get these next pieces laid down it probably would be at a good "let someone in" stage. Maybe. ;)
For a directory of decent size it would require a lot of manhours. Nobody wants to use a directory (or search engine) with no listings. Expect to do nothing but add sites for the first year. And you would need a directory script with category editors, because when you have thousands of listings somebody has to patrol for deadwood, redirects etc.
Few people will pay for submission until you do start getting some traffic. None of those factors are insurmountable but they have to be considered. :)
See, the thing is that you need a good idea at a time when the competition has not even noticed the need or realised its relevance. Like when G got off the ground: it had a brilliant relevancy idea and its timing was impeccable. No explanation needed, hey guys.
Anyway, I think we all know what that next idea is going to be. Maybe more relevancy of content-based search results. How to execute this is the key!
Even with an off-the-shelf SE program you could then beat the big general engines for relevancy without having to hire a gaggle of expensive PhDs and cooks. :)
I'm also curious if there's any attempt to convert a document into a mathematical equation.
Lots of interesting theory, but lots of hard work and a long way to go.
Cheers
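On the "document as a mathematical equation" question: the classic answer is the vector space model - a document becomes a vector of term weights, and two documents (or a query and a document) are compared by the cosine of the angle between their vectors. A minimal sketch, using raw term counts with no idf weighting or stemming:

```cpp
#include <cmath>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Turn a text into a bag-of-words vector: term -> raw count.
std::map<std::string, double> toVector(const std::string& text) {
    std::map<std::string, double> v;
    std::istringstream in(text);
    std::string word;
    while (in >> word) v[word] += 1.0;
    return v;
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
double cosine(const std::map<std::string, double>& a,
              const std::map<std::string, double>& b) {
    double dot = 0, na = 0, nb = 0;
    for (const auto& [term, w] : a) {
        na += w * w;
        auto it = b.find(term);
        if (it != b.end()) dot += w * it->second;
    }
    for (const auto& [term, w] : b) nb += w * w;
    return (na && nb) ? dot / (std::sqrt(na) * std::sqrt(nb)) : 0.0;
}

int main() {
    auto doc   = toVector("the spider builds an index of the crawled pages");
    auto query = toVector("spider index");
    std::cout << "similarity: " << cosine(doc, query) << "\n";
}
```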
Monkey is an Animal (True)
Animal is a Monkey (False)
Animal is the parent of Monkey.
Anyway, just looking for ideas.
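A tiny sketch of that asymmetry: if you model "is-a" as a pointer to a hypernym (parent) and walk the chain upward, the relation holds in one direction only. The mini-taxonomy here is made up for illustration:

```cpp
#include <iostream>
#include <map>
#include <string>

// Made-up mini taxonomy: each word maps to its hypernym (parent).
std::map<std::string, std::string> parent = {
    {"monkey", "animal"},
    {"animal", "organism"},
};

// True if `a` is-a `b`: walk the hypernym chain upward from `a`.
bool isA(std::string a, const std::string& b) {
    while (parent.count(a)) {
        a = parent[a];
        if (a == b) return true;
    }
    return false;
}

int main() {
    std::cout << std::boolalpha
              << "monkey is-a animal: " << isA("monkey", "animal") << "\n"   // true
              << "animal is-a monkey: " << isA("animal", "monkey") << "\n";  // false
}
```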
Imho, that database is a must for anybody thinking of building out a lexicon for their own search engine - it only makes good sense to build on the work other people have done, instead of reinventing the wheel :)
However, afaik, the rest of the stuff is pretty hard to come up with - a good crawl application that respects robots.txt, works fast, and also follows http 1.0 well.
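The robots.txt part, at least, is easy to prototype - a minimal sketch of the core prefix check (fetching the file, per-agent sections, Allow lines and wildcards are all left out):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Minimal robots.txt check: a path is blocked if it starts with any
// Disallow prefix. Real parsers also handle per-agent sections,
// Allow lines and wildcards; this is just the core idea.
bool allowed(const std::vector<std::string>& disallow, const std::string& path) {
    for (const auto& prefix : disallow)
        if (!prefix.empty() && path.compare(0, prefix.size(), prefix) == 0)
            return false;
    return true;
}

int main() {
    // Pretend these came out of a fetched robots.txt for our user-agent.
    std::vector<std::string> disallow = {"/cgi-bin/", "/private/"};
    for (std::string path : {"/index.html", "/cgi-bin/search?q=x"})
        std::cout << path << " -> "
                  << (allowed(disallow, path) ? "crawl" : "skip") << "\n";
}
```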
Then there is the indexer, and the algo - and the "algo" portion is the most difficult of them all.
Great thread - lots of interest. Search engines are, as systems, probably my favorite "AI-type application" in existence at this point.
A few related PDFs I've read use it for "word disambiguation", i.e. is a "flyer" something that flies or a piece of paper? The idea is to use the surrounding text to find out which one is true, by comparing it to the WordNet description, or to hypernyms/hyponyms.
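A rough sketch of that idea, in the spirit of the Lesk-style gloss overlap those papers describe: score each candidate sense by how many of its gloss words appear in the surrounding text. The two "flyer" glosses below are paraphrased for illustration, not real WordNet entries:

```cpp
#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Split a string into its set of words.
std::set<std::string> words(const std::string& s) {
    std::set<std::string> out;
    std::istringstream in(s);
    std::string w;
    while (in >> w) out.insert(w);
    return out;
}

int main() {
    // Paraphrased glosses for two senses of "flyer" (illustration only).
    std::vector<std::pair<std::string, std::string>> senses = {
        {"aviator", "someone who flies an aircraft pilot"},
        {"leaflet", "a small printed sheet of paper handed out"},
    };
    // The context words surrounding the ambiguous token.
    auto context = words("she handed me a flyer printed on pink paper");

    // Pick the sense whose gloss shares the most words with the context.
    std::string best; int bestScore = -1;
    for (const auto& [sense, gloss] : senses) {
        int score = 0;
        for (const auto& w : words(gloss))
            if (context.count(w)) ++score;
        std::cout << sense << " overlap: " << score << "\n";
        if (score > bestScore) { bestScore = score; best = sense; }
    }
    std::cout << "chosen sense: " << best << "\n";  // -> leaflet
}
```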
The thing I'm trying to learn more about is phrase building: when a page is tokenised, you move from beginning to end with a "window" (perhaps 7/8 words long) looking for phrases. If anyone knows of a good tutorial (and if it's OK with Jeremy), a URL or two would be great.
Matching a single-word query against matches may be easier, but something that recognises phrases would help no end.
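Not a tutorial, but a bare-bones sketch of that windowed pass: slide a fixed window over the token stream and count word pairs that co-occur inside it - pairs that recur far more often than chance are phrase candidates. A real system would add frequency or mutual-information thresholds:

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main() {
    // Naive whitespace tokenisation of the page text.
    std::string page = "open source search engine projects favour open source tools";
    std::vector<std::string> tokens;
    std::istringstream in(page);
    for (std::string t; in >> t; ) tokens.push_back(t);

    // Slide a 7-word window (the length mentioned above) over the
    // stream and count ordered word pairs that co-occur inside it.
    const size_t window = 7;
    std::map<std::string, int> pairCount;
    for (size_t i = 0; i < tokens.size(); ++i)
        for (size_t j = i + 1; j < std::min(tokens.size(), i + window); ++j)
            ++pairCount[tokens[i] + " " + tokens[j]];

    // Pairs seen more than once are candidate phrases.
    for (const auto& [pair, n] : pairCount)
        if (n > 1) std::cout << pair << " x" << n << "\n";  // "open source x2"
}
```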
Resources - good ones, with unique & incredible algorithms for engine building - are few & far between.
Once you get past the "basic steps", I've found it's very, very hard to find anything public domain / readily available that will help with the more common tasks: tokenizing, fast lookups, compression, etc.
Though there could be stuff out there I've not seen - it's been a while since I was in on an engine building project of my own.
One of the users actually did stickymail me, as I previously asked, and I replied with the following - so again, if there is any interest given this additional information about shippo, let me know:
So right now the core backend stuff is all pretty much in c/c++. This includes the spider, parser, indexer and search engine. I use fastcgi across a few backend servers running the search engine which spits out an xml doc to the front end web server and things go from there into php land.
I did spend some time rewriting the spider to use mysql as a datastore and do a few other clever things to help down the line. The big issue there became processor power: php is very slow (or at least my code is) at parsing the html documents. On top of that, I was having problems getting curl to work (which I have since figured out), although of course I still prefer my own fsockopen's ;)
I spent about 2 months experimenting with the mysql fulltext search in a variety of ways, and even started modifying mysql slightly to fix irritating things like stopwords, size limitations, etc., but they are all performance-crappy. Or, to say it slightly differently, you can't mix indexes with MATCH very well, and while using LIMIT seems to help, you can easily enter stupid queries that take forever, even with only a million indexed documents.
I am now experimenting with a slightly different use for mysql that is basically the exact same "algorithm", but applied to mysql with a few extra parameters to help things along. This is where I am at right now, and I can't yet say whether or not it is promising. At this point, the spider and at least a part of the parser (meta-data generator) are all C and more or less act as filters. I wrote yet another one a few days ago in php to parse the metadata and do a few other nice things. It is *dog slow*, but of course I could always bring that to C if the technique actually ends up working in the long run.
The big issue I have faced with searchhippo has always been around list lengths. That is, when I get my inverted indexes, sometimes the lists are so large that performing the union of those sets causes issues, so there are a bunch of heuristics applied to decide which things to really pay attention to, and for how long. This implies that short lists (i.e. infrequent words) are important, which is a good thing as well. There currently are some sort-order issues that arise because sometimes the deduper (which runs afterwards) ends up cleaning up too much stuff -- more than it retrieved. And lastly, there are some really hackish things to support glorified phrase matching and position weighting that were, more or less, implemented nicely two years ago but are now a big hack.
So the current path I am on is to try and modify the way my inverted indexes work, and see if I can put them into mysql. If I can get that part to work, then I can also dump the fastcgi piece and do some of that processing either with mysql or with some php code, either of which would probably work fine.
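For anyone curious what such list-length heuristics can look like in miniature: process the shortest (most informative) lists first, and cap how many postings you read from any one list so a huge common-word list can't dominate the cost. This is a generic sketch of that style of heuristic, not shippo's actual code; the budget value is invented:

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <vector>

int main() {
    // Toy inverted lists: one sorted doc-id list per query term. Real
    // lists for common words run to millions of entries.
    std::vector<std::vector<int>> lists = {
        {4, 9},                          // rare term: short list
        {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, // common term: long list
    };

    // Handle the shortest lists first and cap the postings read from
    // any one list, so long lists get truncated rather than dominating.
    std::sort(lists.begin(), lists.end(),
              [](const auto& a, const auto& b) { return a.size() < b.size(); });
    const size_t budget = 5; // invented cutoff, for illustration only

    std::map<int, int> hits; // doc id -> number of query terms matched
    for (const auto& list : lists) {
        size_t n = std::min(list.size(), budget);
        for (size_t i = 0; i < n; ++i) ++hits[list[i]];
    }
    for (const auto& [doc, count] : hits)
        std::cout << "doc " << doc << " matched " << count << " term(s)\n";
}
```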
What kinds of things would you want to work on the most? Which areas do you think you'd be able to work on best? I can email you some of the code and let you muck with it and see how things go. It is somewhat modularized, so it's easy to say "this is the input, this needs to be the output" sort of thing.
And, like I have already said, most of what I am doing now is experimental, to see if I can get it to work while solving a bunch of other problems that I have.
Lastly, Jeremy -- there is a book named "Managing Gigabytes" that addresses all of the things you are commenting on: compression, doc weights, fast lookups, etc. The funny thing, though, is that I think of those as the easy parts of the problem, the parts you learn how to solve in school or from a book. The harder part is implementing them, but worse, for me at least, is the spider + meta-data generator. Maybe that is what you mean by tokenizer, but I spend more time tweaking that thing than anything else.
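Since the book came up: one classic posting-list trick from that literature is gap encoding plus a variable-byte code - store the differences between successive doc ids, each in as few bytes as it needs. A compact sketch of one common convention (the high bit marks the final byte of each number):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Variable-byte encode: 7 data bits per byte, most-significant chunk
// first, high bit set on the final byte of each number.
void vbyteEncode(uint32_t n, std::vector<uint8_t>& out) {
    std::vector<uint8_t> tmp;
    do { tmp.push_back(n & 0x7F); n >>= 7; } while (n);
    for (size_t i = tmp.size(); i-- > 0; )
        out.push_back(i == 0 ? (tmp[i] | 0x80) : tmp[i]);
}

int main() {
    // A posting list of doc ids: store the gaps, which stay small.
    std::vector<uint32_t> docs = {33, 40, 1000, 1003};
    std::vector<uint8_t> bytes;
    uint32_t prev = 0;
    for (uint32_t d : docs) { vbyteEncode(d - prev, bytes); prev = d; }

    std::cout << docs.size() * 4 << " raw bytes -> "
              << bytes.size() << " compressed bytes\n";

    // Decode: accumulate 7-bit chunks until a high bit ends the number.
    uint32_t n = 0; prev = 0;
    for (uint8_t b : bytes) {
        n = (n << 7) | (b & 0x7F);
        if (b & 0x80) { prev += n; std::cout << prev << " "; n = 0; }
    }
    std::cout << "\n";  // prints: 33 40 1000 1003
}
```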