|Could Google be duplicated?|
what would be required
Please pardon a bit of idle speculation, but recent talk of distributed computing got me thinking about this topic. Specifically, could the crawl, crunch, query, and distribution functions be performed by an open source, distributed computing effort? Think of the Linux development model combined with SETI@home/Folding@home DC technology. Is such an initiative possible, and could it ever be a realistic threat to Google? Why haven't we seen it to date?
I doubt it.
The biggest problem with SETI is that some people try to mess with the results just to create some hype.
Now, imagine what people would do if they were part of a PageRank-calculating net.
|brotherhood of LAN|
Definitely a possibility, I'd like to think, but as bcc said, there are too many of us webmaster types, or less palatable webmasters, who might "screw around" with the results and see what sort of information was going in and out of their DC'ing computer.
They would probably have to double up on security and encryption, which most likely means bandwidth, processing time and everything_else would take a little longer, just to make sure those who are processing the algo are not deprocessing it in the process ;) ;)
I'll tell you what CAN'T be duplicated: the environment in which Google raised $25 MM in venture capital from two top-tier VC firms, plus the additional $10 MM investment from Yahoo. Without that kind of initial monetary commitment and backing, I don't see a serious threat to the likes of Google for a long time. And by the time that could potentially happen again, Google would most likely have outpaced Yahoo in terms of valuation. (It will happen... post-IPO... whenever that is.)
Bcc's argument is not valid. It is easy to throw redundancy and simple "is this within reason" checks into the mix. Besides, if you were a WebmasterWorld person and wanted to influence the results, what are the odds that your chunk of data would include any sites that you are responsible for?
ggrot, look around the net for comments from the guys who run SETI.
You'll see it's the biggest problem they have to deal with.
Besides, the Canadian government just launched a project for distributed computing and they DID NOT use private computers as a part of the net for this exact reason.
It's not simple to check the results. To do something like that, you would have to run the same batch at least twice, on different, randomly selected computers.
And if the results are not the same, you have to run it once again to see which of the original two is correct.
That's up to three times the processing for a single unit.
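The double-run-plus-tiebreak scheme described above can be sketched in a few lines. This is a minimal illustration, not any real SETI@home code; `compute` and the worker IDs are hypothetical placeholders for whatever the grid actually dispatches.

```python
import random

def verify_unit(unit, workers, compute):
    """Run the same work unit on two randomly chosen workers.
    If they disagree, a third worker breaks the tie (majority vote).
    Returns the accepted result, or None if all three differ."""
    first, second = random.sample(workers, 2)  # two distinct workers
    r1 = compute(first, unit)
    r2 = compute(second, unit)
    if r1 == r2:
        return r1  # normal case: 2x the work per unit
    # Disagreement: a third, previously unused worker breaks the tie (3x the work)
    remaining = [w for w in workers if w not in (first, second)]
    r3 = compute(random.choice(remaining), unit)
    if r3 == r1:
        return r1
    if r3 == r2:
        return r2
    return None  # no majority; the unit must be reissued
```

The point the posters are debating follows directly from this: every unit costs at least double the processing, and a single dishonest worker is outvoted only because two honest machines re-do its work.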
Actually, we are considering it. All it would take is to improve the algo of the open source Aspseek and cluster-enable the program under grid or constellation conditions.
We have already contacted the original Aspseek developer team and an independent Indian team to try to find out how much it would cost and how long it could take.
Basically, we are thinking of a well-regulated grid/constellation of dedicated computers run by independent companies, ISPs, registrars, or the like, with no SETI-style volunteers.
The revenue, if any, may come from advertising, but the truth is that freeing our customers from Google, plus the self-promotion, would be more than enough for many of us. Not to mention no longer paying the costly licences some of us pay to Fast/Google to enable portal user searches.
For an ISP like ourselves, hosting a few boxes would not be a significant cost at all, and there are thousands of companies like ours. ISPs usually host hundreds of boxes, some of them thousands, and the general infrastructure and bandwidth cost of such a thing is not really an issue. Many ISPs already commit resources to Linux distro downloads, MySQL downloads, and Tucows-like networks. This would be no different.
Does anybody want to play?
>but as said by bcc, there are too many of us webmaster
>types, or less palatable webmasters that may "screw
>around" with the results
That should not be an issue, unless you let anonymous volunteers and webmasters run the show, of course.
If you use well-established companies, say ISPs, that is not likely to happen. ISPs can easily be held accountable if they falsify search records or queries.
Besides, if they wanted to divert traffic, a rogue ISP could just tamper with the DNS boxes they control, just as a rogue registrar could tamper with the domains they register. But they don't do it, because they would immediately become a legal target.
An Aspseek Grid Network, managed like the DNS hierarchy, would be more or less the same scenario.
Would those people you mentioned really give you a representative sample of the different subject areas of the web?
There were some other distributed computing search projects going on about a year ago, one by a university, but I do not know if they ever went anywhere.
>Would those people you mentioned really give you a
>representative sample of the different subject areas of
Why not? It all depends on the crawler's algo. If the foundations of the project and the software used match that goal, they will. If they don't want to, they will just not join the project.