Forum Moderators: bakedjake
What does it take to start a new search engine, spider, database or whatever-you-want-to-call-it? Obviously there are differences, but let's take the simplest one, say a spidering search engine.
What do you need?
* Hardware
- for the server(s) that serve the engine to users
- for the server(s) that spider
* Software
- for searching / categorizing
- for the spiders
* Bandwidth
- the more the better, and the cheaper the better
* People to manage/edit (could be volunteer-based, like ODP or Zeal)
* Advertising the server to others
- could be big $$ or could be done just as editing
I am sure I missed a lot of obvious ones, but what other roadblocks are there?
Originality unfortunately is not my forte. As an engineer by trade, I tend to take several ideas, smash them together and create something new, but definitely not original - just an original use or implementation.
Obviously technology is not the roadblock.
Aren't most SEs started as FFA lists?
Value proposition would be the reason that people would use the thing - not just you, because obviously, if you make it, you'll use it all the time.
What market would use it? Which would adopt it, and why would they switch from their current favorite (MSN, Yahoo, or AOL, or perhaps Google).
Read about the traffic Wisenut has received to date. Their results are outstanding for such a young company, their db is huge, but nobody uses them (significantly) because it takes a lot of incredible PR to gain traffic.
The trick also is to come up with relevant, unique results. Alltheweb, Wisenut, and Google have incredible results, but they are all different. If they weren't, there would be no competitive differentiator, which is another thing very important for the aspiring search engine.
Hope these ideas help. Funding for a huge db is also a major issue, with CPU being crucial, along with tons of ram, and bandwidth for spidering. And good customer service, to handle the "why did you spider my site?" questions.
Good luck.
Right on the money -- also called USP (Unique Selling Proposition). And you want to make sure that your VP/USP fills a real need, not a manufactured need.
Search returns that are not influenced by advertising $$$ might be a VP, but only if enough people recognize and dislike the influence of money in their search engine. Even then, the results better be pristine and highly relevant, or the biggest server farm in the world won't matter.
The next "must-have" would be a viable business plan. Not so easy without advertising. As we are seeing, not so easy even WITH advertising.
I would say the business aspects are at least as important as the technical, and need to be addressed right up front.
>specialize in a specific subject
There's been thought expressed that specialized topical directories will become more important as the free/pay landscape continues to change, the net becomes more $ driven and search results degrade.
I'd think that a specialized directory in a given area would be a whole lot easier to get rolling and maintain, and would at least give the benefit of on-topic relevant links, as well as providing a resource. It could possibly elevate the status of the site hosting it if there were enough interest generated. It could be legitimately done, not a FFA.
>if you can deliver results in that topic: better...
If the simpler to implement possibility of a directory reached a certain point of success, search could then be added.
The logic being that, if my 2nd-tier SE/directory is already listed and regularly spidered by the major players, then provided the content is tightly focused and the site well designed, it stands a good chance of popping up in the top pages of the major SEs' listings. Consequently, submitted sites can piggyback on the success of the 2nd-tier engine.
Equally, a lot of major search engines are fighting constant battles against information saturation. With search engine optimization techniques and pay-per-click links becoming prevalent, surfers are increasingly concerned that top-ranked SE content isn't necessarily what they are actually looking for.
I recently created a search engine/directory focused exclusively on Africa, and these are the factors I took into consideration; they are what has driven the success of the site. As you can imagine, for a lot of small sites in Africa, $199 is a lot of money when successful submission is not guaranteed.
I've been equally fortunate with small portals in niche markets. Whenever possible I develop a portal and not just a site. It's a terrific way to gather traffic and exposure to promote the products and/or services I have to offer.
I hope you've looked at the discussion on hubs, Tapolyai.
A question of hubs [webmasterworld.com]
Not that it answers your question but I believe there are parts of answers there and maybe one option to consider.
Good luck. I'll flag this post and see if there is more I can offer later.
I cannot afford to pay for submissions.
As the price of SE submission and pay-per-click goes up, the number of web sites that cannot afford it increases.
This of course also affects the SEO industry. Individuals, small companies and mid-size firms will believe that if they cannot afford the SE pricing, SEO is a waste of money (be it true or not). Large firms that can afford it can potentially do their SEO in-house (setting aside the quality, of course).
Therefore, there is a huge opening here for an SE that, like DMOZ/ODP, would work for "free". If you think about it: for every mega-site there are 10 medium-size sites; for every medium-size site, 100 small sites; and for every small site, 1,000 individual sites. (The numbers are not actual, just estimates from some SE research I have been reading.)
The critical part - and the most important after quality of listings - is minimal human capital investment, I believe. I wrote a test spider that efficiently looks around and catalogs web sites using industry-standard methods (titles, meta tags, etc.).
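The poster's test spider isn't shown, but the cataloging step it describes - pulling the title and meta tags out of a page - can be sketched with just the standard library. The class and field names here are illustrative, not the poster's actual code:

```python
# Minimal sketch of a cataloging step: extract <title> and the
# description/keywords meta tags from a page's HTML.
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.record = {"title": "", "description": "", "keywords": ""}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            a = dict(attrs)
            name = (a.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.record[name] = a.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.record["title"] += data


def catalog(html):
    """Return a catalog record (title, description, keywords) for one page."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.record
```

A real spider would fetch pages over HTTP and respect robots.txt before feeding each page to something like `catalog()`; the parsing above is the "industry-standard methods" part.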
Another idea that DMOZ/ODP should have capitalized on is distributed spidering. If you are familiar with SETI@Home or the distributed.net project, you get the idea. This would be a major savings on bandwidth requirements. Most SEs have to worry about inbound AND outbound traffic. With distributed spidering, the hardware and bandwidth requirements for outbound traffic (other than serving up results and sending spider seeds to the search network) are minimal.
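The distributed-spidering idea above can be sketched as a simple protocol: a coordinator hands out small URL "seeds," and volunteer clients crawl on their own bandwidth and send back only compact catalog records. This is a hypothetical single-process simulation of that flow; all names are illustrative:

```python
# Sketch of distributed spidering: the coordinator's outbound traffic is
# only seeds and result collection; the fetching happens on volunteers.
from queue import Queue


def coordinator(seed_urls):
    """Load seed URLs into a work queue to be handed out to volunteers."""
    work = Queue()
    for url in seed_urls:
        work.put(url)
    return work


def volunteer(work, fetch):
    """Simulate one volunteer client: pull seeds, crawl them locally,
    and return compact catalog records instead of full pages."""
    results = []
    while not work.empty():
        url = work.get()
        page = fetch(url)  # uses the volunteer's own bandwidth
        results.append({"url": url, "title": page.get("title", "")})
    return results
```

In a real deployment the queue would sit behind a network API, clients would be authenticated, and the coordinator would have to sanity-check returned records (a hostile volunteer could submit spam), which is one of the hard parts of the SETI@Home model applied to crawling.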
Of course, I do not want to fall into the pit of "if I build it, they will come". I believe that with a distributed network, such a system could catalog sites faster and more thoroughly than the current top SEs do.
This is where the money could come in (and partially the Value Proposition) - other SEs and commercial businesses would be willing to pay for commercial use of and access to the DB, which would cover the upkeep of the system.
I have done crazier things in my life for altruistic reasons. I truly believe there is a need for an "everyone's SE" that is true for those who search and those who are searched!
Some of you might say "but A, and B, and C are much easier to search". Yes, they are easier because they are limited in scope and potentially "censored". (I am fed up with companies excusing garbage work with "this is what our customers asked for".)
Everything you say makes sense but I see several problems:
1. Getting the searchers to use your search engine without spending huge amounts of money on publicity and advertising.
2. All general search engines have the same problem - the sites that rise to the top of the SERPS are generally those that have the money for SEO or some sort of paid placement. Yes, good SEO can be done for very little if one has the time and inclination to learn but that can freeze out hobby webmasters, most of whom, have neither. Note: here is an underserved market for listing the little websites.
3. Covering your costs.
4. Turning a profit.
I think economy of scale is working against starting large general search engines because the Web has gotten so large. I'm still thinking the answer is to break it up into more manageable chunks. That is just my opinion.
With all that said, the attempt to start a new and better general search engine is a noble one and I don't want to dissuade you from it. Somebody, someday, will come up with an SE that eclipses even mighty Google.
This could be a problem indeed. I have some ideas as to how to do it, but nothing concrete. Any ideas?
>2. All general search engines have the same problem - the sites that rise to the top of the SERPS are generally those that have the money for SEO or some sort of paid placement. Yes, good SEO can be done for very little if one has the time and inclination to learn but that can freeze out hobby webmasters, most of whom, have neither.
If the methodology to spider sites is basic, such as meta tags, the scope for SEO is limited. Picking the right keywords indicates that the web site truly understands its client base, which should be rewarded. I believe SEs "lost control" once they moved from meta tags to other signals. I would also advocate publishing all the positioning algorithms.
>3. Covering your costs.
The initial cost of course would be a burden. The resell of the DB would be the income in the future.
>4. Turning a profit.
This would not be a requirement. I never intended this site to make money other than to cover the initial and recurring costs.
I didn't do justice to the clustering issue, but it deserves its own entire thread. Matter of fact, you could start a thread for each of these topics to give them the attention they deserve. I'll just say for the record, setting up a cluster that works reliably is more of a task than most realize. It requires good sysadmin skills if Linux is used. Furthermore, the programming requires a slightly different mindset, and the best available resources boil down to hard-to-decipher man pages or incomplete, outdated notes taken by someone four years ago.
oh and if you find any libs etc. that help in parsing JavaScript-based nav menus, let me know :)
HTH
later
URL to The Anatomy of a Large-Scale Hypertextual Web Search Engine
[www7.scu.edu.au...]
other paper was by Page, Cho and Garcia-Molina
Efficient Crawling Through URL Ordering
[www-db.stanford.edu...]
Web Client Programming with Perl
[oreilly.com...]
Well, getting surfers to use your engine is about changing habits. That can be a tricky thing to do. Getting to people right when they first get connected to the internet could help, so people imprint, like a hatchling, on your engine.
I know, in my early internet explorations, I learned a lot about different search engines from metasearch. I paid attention to which databases in the meta-engine consistently brought back better results and eventually started checking them individually.
Searchers also seem to like clean, fast search results pages - that might keep them coming back.
I agree about the metatags, but as one who runs a directory with a little spider that only fetches meta-tags, I can tell you that a lot of sites no longer use them. I don't know how to reverse that trend.
I wish I had more answers for you, but some of the best SEO minds frequent these forums so maybe they will have more ideas.
There are a lot of search engines and directories around, many of them well known, but the majority are basically nonexistent to the net masses. Knowing this raises the question "why make another search engine?" (as someone already pointed out). Well, in my case, I found a gap in search services around daily headlines, so I built an index/search service that has fresh headlines every morning from around 10,000 sources, and growing. (Yes, Google and Fast do too, but technically, mine has been publicly available longer.) No one really knows about my little niche in the search world yet, but I know it works, and it has been up for several months. My point is that if you can find a niche, you may have a great idea, but if you can't, wouldn't it be better to help OTHER engines improve?
Examples of how to help:
Advocate the use of meta tags, even some that you may have never seen before. Put them on your site, and help others use them also.
Use headline syndication to share your updates with the net. Many premade portal systems like PHP Nuke, Zope, and Slash have this feature built in, but it is easy to write a script to generate XML from a database, or do it by hand. There are several places you can get your XML feeds listed and put into search engines (like mine).
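Generating that XML from a database is indeed a small script. Here is one hedged sketch that turns a list of headline records into a minimal RSS 2.0 document using only the standard library; the field names (`title`, `link`) are illustrative, not tied to any particular portal system:

```python
# Build a minimal RSS 2.0 feed from headline records pulled from a
# database (represented here as a list of dicts).
import xml.etree.ElementTree as ET


def build_rss(site_title, site_link, items):
    """Return an RSS 2.0 document as a string for the given headlines."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_link
    for it in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = it["title"]
        ET.SubElement(item, "link").text = it["link"]
    return ET.tostring(rss, encoding="unicode")
```

Write the output to a static file on a schedule (cron, for instance) and the feed can be picked up by any aggregator or feed-aware search engine.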
Report broken links - most search / directory sites have a way for you to do this easily.