Ensuring reliability was another concern. With so many commodity hardware servers, "expect to lose one a day," he said. Google decided to "try to deal with that in an automated way. Otherwise, you will have lots of people running around trying to restart servers."
cnet [news.com.com]
According to Hoelzle, Google has inexpensively built out its computing infrastructure by using thousands of "commodity" servers, instead of fewer high-end, and high-priced, machines. The trick is to make these racks of hardware work together and to ensure that the failure of one machine doesn't derail an operation.
Hoelzle then flashed a picture on the screen of six fire trucks at a Google data center. "I can't tell you what happened, but it's not about one machine going down," he said. He didn't disclose when the incident occurred. "No users were harmed in this picture," he added.
Is it kind of weird they won't tell us what happened? Or is that just me?
If you owned a large public company and one of your "super-smart PhDs" forgot to extinguish a cigarette properly, left a candle burning, or went crazy from coding too much and developed paranoid tendencies, would you want to tell people about it?
Another 10 pages explaining their technology.
One major chunk -- the file system -- is explained here:
[labs.google.com...]
Also possibly explains (see other current thread) why they need good operating system developers.
I spent a long time trying to find a hosting provider which actually delivered on the 99.99% uptime claims (I never found one).
I finally figured out I was better off going with two cheap providers that were reasonably good, say 99.5% each, with some monitoring software that automates fail-over via dynamic DNS. A server can crash when I'm out hiking and things keep running...
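For anyone curious about the arithmetic behind the two-cheap-providers approach, here is a rough sketch in Python. It assumes the two providers fail independently and that fail-over is instant, neither of which is quite true in practice (DNS TTLs alone add some delay). The function names, provider names, and the health-check callback are made up for illustration and are not from any particular monitoring tool.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(availabilities):
    """Chance that at least one provider is up, assuming independent failures."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


def expected_downtime_hours(availability):
    """Expected hours of downtime per (non-leap) year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def choose_active(providers, is_healthy):
    """Return the first provider whose health check passes, or None if all fail.

    `is_healthy` is a hypothetical callback; a real monitor would probe the
    server and then update the dynamic DNS record to point at the result.
    """
    for name in providers:
        if is_healthy(name):
            return name
    return None


if __name__ == "__main__":
    single = 0.995
    pair = combined_availability([single, single])
    print(f"one provider:  {single:.4%} up, ~{expected_downtime_hours(single):.1f} h/year down")
    print(f"two providers: {pair:.4%} up, ~{expected_downtime_hours(pair) * 60:.0f} min/year down")
    # Simulate the primary being down: the monitor falls back to the backup.
    print(choose_active(["primary", "backup"], lambda name: name == "backup"))
```

On paper, two 99.5% providers come out better than a single 99.99% one, but only as long as their outages really are unrelated (different data centers, different upstream networks).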
Side note: Google actually seems to be down at the moment...