Forum Moderators: bakedjake

What does it take to start a new search engine/database?

Or how to really get in trouble at home for spending too much time on net

         

Tapolyai

2:03 am on Nov 21, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I keep reading these posts and one thought keeps popping into my head.

What does it take to start a new search engine, spider, database or whatever-you-want-to-call-it? Obviously there are differences, but let's take the simplest one, say a spidering search engine.

What do you need?
* Hardware
- for the server(s) that shows the engine
- for the server(s) that spider
* Software
- for searching / categorizing
- for the spiders
* Bandwidth
- the more you can get, and the cheaper, the better
* People to manage/edit (could be volunteers, like ODP or Zeal)
* Advertising the server to others
- could be big $$ or could be done just as editing

I am sure I missed a lot of obvious ones, but...

What other roadblocks are there?

Eric_Jarvis

11:51 am on Nov 21, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



IMO the obvious one is the question "why?"

unless you have a reason why your search engine will get used there really is no point

Key_Master

12:23 pm on Nov 21, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You could write a book on the subject. Nothing simple about starting up a spidering search engine, especially if you want to be original. Just my 2 cents.

Tapolyai

1:26 am on Nov 22, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmmm... Why? Because even DMOZ/ODP seems to be "influenced" by other organizations. From what I understand, some backers are heavily weighing in and distracting from the base idea.
Because I believe all "worthwhile" SEs will become for-pay, and the top positions will be taken by for-profit or official non-profit sites, pushing out the "home brew" pages, which I believe really kick-started the web.

Originality unfortunately is not my forte. As an engineer by trade, I tend to take several ideas, smash them together and create something new, but definitely not original - just an original use or implementation.

Obviously technology is not the roadblock.

Didn't most SEs start as FFA lists?

jeremy goodrich

10:55 pm on Nov 22, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



At one point, I was working on a search engine. What I would recommend to anybody else doing the same is to first decide what your 'value proposition' is: a term from my college marketing days, and very important for a search engine.

Value proposition would be the reason that people would use the thing - not just you, because obviously, if you make it, you'll use it all the time.

What market would use it? Who would adopt it, and why would they switch from their current favorite (MSN, Yahoo, AOL, or perhaps Google)?

Read about the traffic Wisenut has received to date. Their results are outstanding for such a young company and their db is huge, but nobody uses them (significantly), because it takes a lot of incredible PR to gain traffic.

The trick also is to come up with relevant, unique results. Alltheweb, Wisenut, and Google have incredible results, but they are all different. If they weren't, there would be no competitive differentiator, which is another thing very important for the aspiring search engine.

Hope these ideas help. Funding for a huge db is also a major issue, with CPU being crucial, along with tons of ram, and bandwidth for spidering. And good customer service, to handle the "why did you spider my site?" questions.

Good luck.

tedster

1:55 am on Nov 23, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> first decide what your 'value proposition' is

Right on the money -- also called USP (Unique Selling Proposition). And you want to make sure that your VP/USP fills a real need, not a manufactured need.

Search returns that are not influenced by advertising $$$ might be a VP, but only if enough people recognize and dislike the influence of money in their search engine. Even then, the results better be pristine and highly relevant, or the biggest server farm in the world won't matter.

The next "must-have" would be a viable business plan. Not so easy without advertising. As we are seeing, not so easy even WITH advertising.

I would say the business aspects are at least as important as the technical ones, and need to be addressed right up front.

Brad

10:26 pm on Nov 26, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One way of creating a VP/USP for a search engine might be to specialize in a specific subject if you can deliver results in that topic: better, faster, deeper, and more expertly than the general engines can.

Marcia

10:54 pm on Nov 26, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just a thought:

>specialize in a specific subject

The thought has been expressed that specialized topical directories will become more important as the free/pay landscape continues to change, the net becomes more $ driven, and search results degrade.

I'd think that a specialized directory in a given area would be a whole lot easier to get rolling and maintain, and would at least give the benefit of on-topic relevant links, as well as providing a resource. It could possibly elevate the status of the site hosting it if there were enough interest generated. It could be legitimately done, not a FFA.

>if you can deliver results in that topic: better...

If the simpler to implement possibility of a directory reached a certain point of success, search could then be added.

Uhuru

10:33 am on Dec 1, 2001 (gmt 0)

10+ Year Member



With pay-per-submission gaining ground, a lot of people are looking to take advantage of 2nd-tier SEs/directories/portals to give their sites visibility.

The logic to this being that, if my 2nd-tier SE/directory is already listed and regularly spidered by the major players, then, provided the content is tightly focused and the site well designed, it stands a good chance of popping up in the top pages of the major SEs' listings. Consequently, submitted sites can piggyback on the success of the 2nd-tier engine.

Equally, a lot of major search engines are fighting constant battles against information saturation; with search engine optimization techniques and pay-per-links becoming prevalent, surfers are increasingly concerned that top-rated SE content isn't necessarily what they are actually looking for.

I recently created a search engine/directory focused exclusively on Africa, and these are the factors I took into consideration; they are what has driven the success of the site. As you can imagine, for a lot of small sites in Africa $199 is a lot of money when successful submission is not guaranteed.

paynt

11:48 am on Dec 1, 2001 (gmt 0)



Hello Uhuru and welcome to the board here at Webmaster World.

I've been equally fortunate with small portals in niche markets. Whenever possible I develop a portal and not just a site. It's a terrific way to gather traffic and exposure to promote the products and/or services I have to offer.

I hope you've looked at the discussion on hubs, Tapolyai.

A question of hubs [webmasterworld.com]

Not that it answers your question but I believe there are parts of answers there and maybe one option to consider.

Good luck. I'll flag this post and see if there is more I can offer later.

Tapolyai

7:23 pm on Dec 1, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My question came about very much because of an issue Uhuru brought up.

I cannot afford to pay for submissions.

As the prices of SE submission and pay-per-click go up, the number of web sites that cannot afford them increases.

This of course also affects the SEO industry. Individuals, small companies, and mid-size firms will come to believe that if they cannot afford the SE pricing, then SEO is a waste of money (be that true or not). Large firms that can afford it can potentially afford their own in-house SEO (not discussing the quality, of course).

Therefore, there is a huge opening here for an SE that, like DMOZ/ODP, would work for "free". If you think about it: for every 1 mega web site there are 10 medium-size sites; for every medium-size site there are 100 small sites; and for every small site there are 1,000 individual sites. (The numbers are not actual, just estimates from some SE research I have been reading.)

The critical part - and the most important after quality of listings - is minimal human capital investment, I believe. I wrote a test spider that efficiently looks around and catalogs web sites using industry-standard methods (titles, meta tags, etc.).
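A spider like the one described - cataloging a page by nothing more than its title and meta tags - is small enough to sketch here. This is my own illustrative sketch (not Tapolyai's actual code), using only the Python standard library; the sample page and field names are invented:

```python
# Sketch of a cataloging spider's parsing step: record only the
# "industry standard" signals - the <title> and the description/keywords
# meta tags. (In real use the HTML would come from an HTTP fetch.)
from html.parser import HTMLParser

class CatalogParser(HTMLParser):
    """Collects the page title and the description/keywords meta tags."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            a = dict(attrs)
            name = (a.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = a.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def catalog(html):
    """Return the record a spider would store for one page."""
    p = CatalogParser()
    p.feed(html)
    return {"title": p.title.strip(), **p.meta}

# Invented sample page, just to exercise the parser.
page = """<html><head><title>Widget World</title>
<meta name="keywords" content="widgets, gadgets">
<meta name="description" content="All about widgets.">
</head><body>...</body></html>"""
print(catalog(page))
```

The whole catalog record is a handful of declared fields, which is what keeps the human-capital cost near zero: no editors, just whatever the site itself declares.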

Another idea that DMOZ/ODP should have capitalized on is distributed spidering. If you are familiar with SETI@Home or the distributed.net project, you get the idea. This would be a major savings on bandwidth requirements. Most SEs have to worry about inbound AND outbound traffic. With distributed spidering, the hardware and bandwidth requirements for outbound traffic (other than serving up results and sending spider seeds to the search network) are minimal.
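The SETI@Home-style division of labor can be sketched as a toy, single-machine simulation - queues stand in for the network, and the fetch function is a stub, so this runs offline. The point it illustrates is the traffic shape: the coordinator sends out only tiny seed URLs, the heavy fetching happens on volunteers' machines, and only compact catalog records flow back:

```python
# Toy simulation of distributed spidering (illustrative, not ODP code).
from queue import Queue

seeds = Queue()     # coordinator -> volunteers: tiny outbound (just URLs)
results = Queue()   # volunteers -> coordinator: compact records only

for url in ["http://a.example/", "http://b.example/"]:
    seeds.put(url)

def volunteer_worker(fetch):
    """Runs on a volunteer's machine; `fetch` stands in for the HTTP GET."""
    while not seeds.empty():
        url = seeds.get()
        html = fetch(url)                # bandwidth is spent HERE, not centrally
        record = {"url": url, "title": html[:20]}  # summarize before sending back
        results.put(record)

# Stub fetch so the sketch runs without a network.
volunteer_worker(lambda u: "Example page for " + u)

central_db = []
while not results.empty():
    central_db.append(results.get())
print(len(central_db))
```

Each full page download and parse costs the coordinator nothing; it only ever handles the seed URLs going out and the small summaries coming back, which is exactly the bandwidth argument made above.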

Of course, I do not want to fall into the pit of "if I build it, they will come". I believe that with a distributed network, such a system could catalog sites faster, and catalog more of them, than the current top SEs do.
This is where the money could come in (and partially the Value Proposition): other SEs and commercial businesses would be willing to pay for commercial use of and access to the DB, which would cover the upkeep of the system.

I have done crazier things in my life for altruistic reasons. I truly believe there is a need for an "everyone's SE" that is true for those who search and those who are searched!

Some of you might say "but A, and B, and C are much easier to search". Yes, they are easier, because they are limited in scope and potentially "censored". (I am fed up with companies excusing garbage work with "this is what our customers asked for".)

Brad

8:26 pm on Dec 1, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Tapolyai -

Everything you say makes sense but I see several problems:

1. Getting the searchers to use your search engine without spending huge amounts of money on publicity and advertising.

2. All general search engines have the same problem - the sites that rise to the top of the SERPs are generally those that have the money for SEO or some sort of paid placement. Yes, good SEO can be done for very little if one has the time and inclination to learn, but that can freeze out hobby webmasters, most of whom have neither. Note: here is an underserved market for listing the little websites.

3. Covering your costs.

4. Turning a profit.

I think economy of scale is working against starting large general search engines because the Web has gotten so large. I'm still thinking the answer is to break it up into more manageable chunks. That is just my opinion.

With all that said, the attempt to start a new and better general search engine is a noble one and I don't want to dissuade you from it. Somebody, someday, will come up with an SE that eclipses even mighty Google.

Tapolyai

12:03 am on Dec 4, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>1. Getting the searchers to use your search engine without spending huge amounts of money on publicity and advertising.

This could be a problem indeed. I have some ideas as to how to do it, but nothing concrete. Any ideas?

>2. All general search engines have the same problem - the sites that rise to the top of the SERPs are generally those that have the money for SEO or some sort of paid placement. Yes, good SEO can be done for very little if one has the time and inclination to learn, but that can freeze out hobby webmasters, most of whom have neither.

If the methodology used to spider sites is basic, such as meta tags, then the scope for SEO is limited. Picking the right keywords indicates that the web site truly understands its client base, which should be rewarded. I believe SEs "lost control" once they moved from meta tags to other signals. I would also advocate publishing all the positioning algorithms.
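One thing "publish all the positioning algorithms" implies: if ranking uses only declared signals like the title and meta keywords, the complete scoring rule fits in a few lines anyone can read and verify. Here is a sketch of such a fully-published rule - the weights and sample pages are my own invented illustration, not anything any engine actually used:

```python
# A fully transparent scoring rule over declared signals only:
# a query term matching a title word counts double a keyword match.
def score(query, page):
    terms = query.lower().split()
    title_words = page["title"].lower().split()
    keywords = [k.strip().lower() for k in page["keywords"].split(",")]
    hits_title = sum(t in title_words for t in terms)
    hits_keys = sum(t in keywords for t in terms)
    return 2 * hits_title + hits_keys

# Invented sample records, as a meta-tag-only spider might store them.
pages = [
    {"title": "Widget Reviews", "keywords": "widgets, reviews"},
    {"title": "Gadget News", "keywords": "widgets, gadgets, news"},
]
ranked = sorted(pages, key=lambda p: score("widget reviews", p), reverse=True)
print([p["title"] for p in ranked])
```

Because every input to `score` is something the site itself declared, "optimizing" for this engine reduces to describing your site accurately - which is the reward-the-right-keywords argument above.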

>3. Covering your costs.

The initial cost would of course be a burden. Resale of the DB would be the income in the future.

>4. Turning a profit.

This would not be a requirement. I never intended this site to make money other than to cover the initial and recurring costs.

seriesint

2:20 am on Dec 4, 2001 (gmt 0)



Hi
  Thought I would chime in on this for a sec. In the past few days I looked over the idea/concept of a specialized search engine. So the first thing I did was go to Google and start a search ;)
  The guys that created Google (Brin and Page) wrote several papers about Google in its pre-mass-public-use Stanford stage. One of those goes into some detail about the setup they used and some obstacles, and better yet, some hard numbers for you to mull over: stuff like 24 million pages and 150GB of raw data; compressed, it came to around 55GB. They had to create a virtual filesystem to handle the excessive size (64-bit addressing). Of course their PageRank idea is expanded upon, and it lightly touches on some of the issues of parsing and indexing pages. Last, it gives some idea of the time frame required; from the paper I gathered it took about two weeks or so to harvest and index the pages - and that's given some serious hardware. So where does that leave anyone wanting to make a search engine?
  This is just my take on it, of course. Given the sheer size of the web and the amount of data, it's close to silly to even consider starting out with the goal of being a major SE. First there are the requirements (hardware and software) and the amount of data, and then you can start to worry about the business pieces others mentioned. The requirements are for hard-core programming know-how and system administration. Ever stop to think that any SE has to run its own name server? I never gave it that much thought till I read the Google white paper, and that's just one part of the equation. Know about clustering? Programming distributed programs? Then start in on indexing techniques, and I'll leave the rest to what's in the white paper.
    With all that out of the way: go ahead and do it! Nothing says it has to be the next Google or even AltaVista. Specialization is where it's at for those starting out. Get experience in what is going on and build it up till it's a solid idea. While any serious SE will take customized programming, with a cheap spare system and a large HDD one could probably get a basic crawler started within a few weeks. The key there is to narrow down what it is the SE is after and to just bounce URLs it deems out of bounds. Though Perl isn't suited for large-scale SEs, nothing says it's the wrong tool to start the ball rolling. Ruby or Python, whatever you know or happen to have a book for; then start reading and refining the idea.
  Oh yeah, be real careful about clearing duplicate URLs, lol. My first attempt ended up with something like 30 revisits to the same site. Luckily, I started with a server I run, so I knew I wouldn't get screamed at by the webmaster.
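That duplicate-URL trap usually comes down to missing normalization: without it, obvious aliases of the same page (different host case, missing trailing slash, a `#fragment`) all look like new URLs and the crawler keeps coming back. A minimal sketch of the fix, using a canonical form as the "have we seen this?" key - the normalization rules shown here are a common baseline, not a complete solution:

```python
# Collapse trivially-equivalent URL variants to one key before crawling.
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Canonical form: lowercase scheme/host, ensure a path, drop #fragment."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

seen = set()
frontier = ["http://Example.com", "http://example.com/", "http://example.com/#top"]
to_crawl = []
for url in frontier:
    key = normalize(url)
    if key not in seen:        # skip anything whose canonical form we've seen
        seen.add(key)
        to_crawl.append(url)
print(to_crawl)
```

All three frontier entries collapse to one key, so the site gets visited once instead of three times. Aliases like `/index.html` vs `/` need server knowledge and aren't handled by rules this simple.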
   One more bit that could help: if you know Perl, the book from O'Reilly, "Web Client Programming with Perl", is available for free at ORA's Open Books Project. It's a bit outdated but does cover some basic ground.

  I didn't do justice to the clustering issue, but it deserves its own entire thread. Matter of fact, you could start a thread for each of these topics to give them the attention they deserve. I'll just say for the record: setting up a cluster that works reliably is more of a task than most realize. It requires some good sysadmin skills if Linux is used. Furthermore, the programming requires a slightly different mindframe, and the best resources for any information break down to hard-to-decipher man pages or incomplete, outdated notes taken by someone four years ago.

  Oh, and if you find any libs etc. that help in parsing JavaScript-based nav menus, let me know :)

HTH
later

url to Anatomy of a Large Scale Hypertextual Web Search Engine
[www7.scu.edu.au...]

other paper was by Page, Cho and Garcia-Molina
Efficient Crawling Through URL Ordering
[www-db.stanford.edu...]

Web Client Programming with Perl
[oreilly.com...]

Brad

2:06 pm on Dec 4, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>Any ideas?

Well, getting surfers to use your engine is about changing habits. That can be a tricky thing to do. Getting to people right when they first get connected to the internet could help, so that they imprint on your engine like a hatchling.

I know, in my early internet explorations, I learned a lot about different search engines from metasearch. I paid attention to which databases in the meta-engine consistently brought back better results and eventually started checking them individually.

Searchers also seem to like clean, fast search results pages - that might keep them coming back.

I agree about the metatags, but as one who runs a directory with a little spider that only fetches meta-tags, I can tell you that a lot of sites no longer use them. I don't know how to reverse that trend.

I wish I had more answers for you, but some of the best SEO minds frequent these forums so maybe they will have more ideas.

KodeKrash

10:52 am on Dec 18, 2001 (gmt 0)

10+ Year Member



Just a few (late) thoughts on the subject...

There are a lot of search engines and directories around, many of them well known, but the majority are basically nonexistent to the net masses. Knowing this begs the question (as someone already pointed out): why make another search engine? Well, in my case, I found a gap in search services around daily headlines, so I built an index/search service that has fresh headlines every morning from around 10,000 sources, and growing. (Yes, Google and Fast do too, but technically, mine has been publicly available longer.) No one really knows about my little niche in the search world yet, but I know it works, and it has been up for several months. My point is that if you can find a niche, you may have a great idea, but if you can't, wouldn't it be better to help OTHER engines improve?

Examples of how to help:

Advocate the use of meta tags, even some that you may have never seen before. Put them on your site, and help others use them also.

Use headline syndication to share your updates with the net. Many premade portal systems like PHP-Nuke, Zope, and Slash have this feature built in, but it is easy to write a script to generate XML from a database, or to do it by hand. There are several places you can get your XML feeds listed and put into search engines (like mine).
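The "generate XML from a database" script really can be tiny. Here is a sketch emitting a minimal RSS-style feed from a couple of rows - the field names and rows are invented for illustration, and a real feed would carry more elements (dates, descriptions) than shown:

```python
# Turn headline rows (as they might come from a database query)
# into a minimal RSS-style XML feed, standard library only.
from xml.sax.saxutils import escape

headlines = [
    {"title": "Widgets & more", "link": "http://example.com/1"},
    {"title": "Second story", "link": "http://example.com/2"},
]

def to_rss(channel_title, rows):
    items = "".join(
        "<item><title>%s</title><link>%s</link></item>"
        % (escape(r["title"]), escape(r["link"]))
        for r in rows
    )
    return ('<?xml version="1.0"?><rss version="2.0"><channel>'
            "<title>%s</title>%s</channel></rss>"
            % (escape(channel_title), items))

feed = to_rss("Daily Headlines", headlines)
print(feed)
```

The `escape` call matters: a headline like "Widgets & more" must become `Widgets &amp; more` or the feed is invalid XML and aggregators will reject it.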

Report broken links - most search / directory sites have a way for you to do this easily.