and provides an interesting read
re the task at hand every second for G
Over four billion Web pages, each an average of 10KB, all fully indexed.
Up to 2,000 PCs in a cluster.
Over 30 clusters.
104 interface languages including Klingon and Tagalog.
One petabyte of data in a cluster -- so much that hard disk error rates of 10^-15 begin to be a real issue.
Sustained transfer rates of 2Gbps in a cluster.
An expectation that two machines will fail every day in each of the larger clusters.
No complete system failure since February 2000.
It is one of the largest computing projects on the planet, arguably employing more computers than any other single, fully managed system (we're not counting distributed computing projects here), some 200 computer science PhDs, and 600 other computer scientists.
[edited by: Brett_Tabke at 4:51 pm (utc) on Dec. 2, 2004]
[edit reason] fixed link [/edit]
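For anyone wondering why an error rate that small matters at a petabyte, here is a rough back-of-the-envelope sketch. It assumes the quoted figure means one unrecoverable error per 10^15 bits read, and that a petabyte is 10^15 bytes; the article does not spell out either, so treat both as my assumptions.

```python
# Rough estimate only. Assumptions (not stated in the article):
#   - the quoted error rate is one unrecoverable error per 10^15 bits read
#   - one petabyte is 10^15 bytes (decimal)
BIT_ERROR_RATE = 1e-15
PETABYTE_BYTES = 10**15


def expected_bit_errors(bytes_read, bit_error_rate=BIT_ERROR_RATE):
    """Expected number of bit errors when reading `bytes_read` bytes once."""
    return bytes_read * 8 * bit_error_rate


errors_per_full_read = expected_bit_errors(PETABYTE_BYTES)
print(f"Expected errors per full read of 1 PB: about {errors_per_full_read:.0f}")
```

Under those assumptions, reading a cluster's full petabyte once would be expected to hit a handful of bad bits, which is presumably why the data has to be checked and replicated in software rather than trusted to the disks.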
I must say, I do wonder about the cheap hardware philosophy, though. It just strikes me as something that would add complexity and maintenance costs that would outweigh the cost benefit of getting cheap hardware in the first place.
The Farming analogy to cheap equipment:
I know a manager for a large agribusiness outfit, with a couple dozen farms of over 10,000 acres each. They have a simple way of measuring the cost of competing equipment. Because they have so many farms, they can set four of them aside to run equipment from four major vendors. They've been doing this on a running basis since the mid 70s, and they've discovered that, over time, cheap equipment is too expensive to run. The highest priced equipment on the market (you've all seen it, with that green paint job) is actually much, MUCH cheaper to run in the long term, because it breaks down less and requires less ongoing maintenance. This not only reduces basic maintenance costs, but also cuts costs by preventing lost man-hours and giving better "on-time" delivery of results (in farming, you have to do certain things within very specific and narrow time frames - if you're a little bit off, sometimes by as little as a day, with seeding, harvesting, spraying, etc., your yield drops, and that costs you money).
Google is easily big enough to run real-time comparisons on this sort of thing. From the sounds of that article, a lot of time is needlessly wasted just dealing with issues relating to cheap equipment.
Mind you, I'm sure greater minds than mine have pondered the issue. But from my own experience in a few different fields, I've learned the hard way that cheap equipment just doesn't pay.
Now, with its strong income stream and cash hoard, it might behoove Google to revisit its philosophy.
* Over four billion Web pages, each an average of 10KB, all fully indexed
* Up to 2,000 PCs in a cluster
* Over 30 clusters
* 104 interface languages including Klingon and Tagalog
* One petabyte of data in a cluster -- so much that hard disk error rates of 10^-15 begin to be a real issue
* Sustained transfer rates of 2Gbps in a cluster
* An expectation that two machines will fail every day in each of the larger clusters
* No complete system failure since February 2000
[zdnet.co.uk...]
If we take the raw figures, that's "around or more than" 60,000 PCs (30 clusters at up to 2,000 PCs each), so I assume that the old debate about the figure being "around" 10,000 will not continue much further.
It's funny to contrast the inherent lack of precision of these statements with the precision on google.com - I mean, the company states publicly that it has indexed exactly one (1) page more than 8.058.044.650.
That page must have been a very important one.
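The 60,000 figure above is just the two quoted numbers multiplied together, and it is worth noting how soft it is: "over 30" is a lower bound and "up to 2,000" is an upper bound. A quick sketch, where the 1,000-PC average is purely a hypothetical value of mine to show the sensitivity:

```python
# Figures quoted in the article: "over 30 clusters", "up to 2,000 PCs in a cluster".
CLUSTERS = 30
MAX_PCS_PER_CLUSTER = 2000


def fleet_estimate(clusters, pcs_per_cluster):
    """Total PCs if every cluster holds `pcs_per_cluster` machines."""
    return clusters * pcs_per_cluster


# Every cluster at the quoted maximum size.
all_full = fleet_estimate(CLUSTERS, MAX_PCS_PER_CLUSTER)

# Hypothetical: clusters average only 1,000 PCs (my assumption, not from the article).
half_full = fleet_estimate(CLUSTERS, 1000)

print(f"30 clusters at 2,000 PCs each: {all_full:,}")
print(f"30 clusters at 1,000 PCs each: {half_full:,}")
```

Either way the total lands well above the old 10,000 figure.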
I was surprised to read Google runs on Intel CPUs. I would have thought if their goal is cheap hardware, AMD would be the better choice since it has a better price / performance ratio than Intel.
This is why what they are doing is smart: they use a huge number of servers and replicated clusters.
Consolidate that server power or storage and you get what they said happened when somebody unplugged an 80 server rack... a slower, longer failover which could cause additional problems.
Also, with more expensive and/or more exotic or proprietary components you can have other problems related to supply, servicing etc. Keeping it vanilla ensures that the hardware is always available and there are not any unusual servicing needs.
Sure, buy high quality components, but it sounds like their distributed file-system is working just fine.
Our company used to be a 100% agricultural company. It is approaching only 10% of our business because very few farmers put into practice the analogy that you described. They want everything on the cheap, very short-sighted. For every one Ag customer we have that is progressive, we have 50 that are the opposite.
However, this does not seem to be hindering Google at all. In fact, the low cost hardware combined with the low cost OS is what allowed them to grow so quickly and remain a private company. They spent the start-up cash on brainpower and not hardware and software.
But what Google does is run redundant cheap equipment tied together by very smart software that expects the equipment to go down. The end result is a system with the power of multiple supercomputers but made up of cheap machines.
The best analogy would be an ant colony. Each individual ant is weak and can fail, but as a group they are powerful and efficient.
You may be right. But given Google's current cash situation, it might be worthwhile for them to set up a "sweet" cluster of high quality components to test against the others for long term costing/reliability.
Even the professional Beowulf community is moving towards higher quality components, for the simple reason that they're seeing better stability with more expensive equipment. The architecture remains the same, and they still get massive cost benefits over custom racks. But the individual PCs in the cluster, and their drives, are far superior to run-of-the-mill COTS equipment.
Mind, maybe Google has set up a sweet cluster, and just isn't talking about it. Google is, after all, the master of selective dissemination of information, when it comes to what they're doing in-house.
in a cluster of 1,000 PCs you would expect, on average, one to fail every day.
If one fails every day, a 1,000-PC cluster would take about three years to turn over completely. Hardware prices on server racks and drives drop so quickly that, within six months to a year, you could easily double your storage and get an improved processor for the same price you are paying now.
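The three-year figure is easy to check. This is only a sketch, and it assumes every failure takes out a different machine that then gets swapped for a new one (no machine fails twice, nothing is repaired in place), which is my simplification and not something the article says:

```python
# Simplifying assumption (mine): each failure hits a distinct machine,
# and that machine is replaced rather than repaired.
def days_to_turn_over(cluster_size, failures_per_day):
    """Days until every machine in the cluster has been replaced once."""
    return cluster_size / failures_per_day


# Article figure: a 1,000-PC cluster loses about one machine per day.
days = days_to_turn_over(cluster_size=1000, failures_per_day=1)
years = days / 365

# The larger clusters (up to 2,000 PCs, two failures a day) work out the same.
days_large = days_to_turn_over(cluster_size=2000, failures_per_day=2)

print(f"1,000-PC cluster: {days:.0f} days, roughly {years:.1f} years")
print(f"2,000-PC cluster: {days_large:.0f} days")
```

So the failure rate alone refreshes a cluster in a bit under three years, regardless of cluster size, as long as the per-machine failure rate stays at about one in a thousand per day.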