Here's a fuller version of the quote:
|As Google grows, so does its need to store and handle more Web site information, video and e-mail content on its servers. "Those machines are full," Mr. Schmidt, the chief executive, said in an interview last month. "We have a huge machine crisis." |
This is exactly what Big Daddy was designed to fix, I thought. And we are now seeing the effect of "filling" the new machines on an expanded infrastructure. This is data handling on a mega scale that scares the stuffing out of me, and I'm not surprised to see some bits and bytes falling off the new baskets from time to time. That doesn't mean I like it, but I do feel it's aimed at a future improvement and is a necessity for Google.
Another relevant part of the article:
|Last month, when reporting its quarterly earnings, Google reported a doubling in its rate of capital investment, mainly in computer servers, network equipment and space for data centers, and said it would spend at least $1.5 billion over the next year. |
$1.5 billion in one year for infrastructure alone. Wow!
Especially when you consider that the web and Google's ambitions are not getting any smaller. $1.5 billion this year, more next? And after that?
The numbers are truly impressive and eye-opening, even more so when you realise the simplicity of it all from the user end.
Which is why I think they are doing what they are doing now. Makes perfect sense seeing what is happening. I bet all duplicate content and all 404's will finally disappear forever.
Wasn't this last upgrade (Big Daddy) a migration from 32-bit computing to 64-bit computing? ... and also ... Google is preparing for the onslaught of IPv6 (expected around 2008). Is that 35 trillion IP-enabled devices, potentially? Or something far larger, given IPv6's 128-bit address structure?
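For reference, the actual IPv6 address space dwarfs those figures; the arithmetic is simple (a quick sketch, nothing Google-specific):

```python
# IPv6 addresses are 128 bits (written as 32 hexadecimal digits),
# versus 32 bits for IPv4.
ipv4_space = 2 ** 32    # ~4.3 billion addresses
ipv6_space = 2 ** 128   # ~3.4 x 10^38 addresses

print(f"IPv4: {ipv4_space:,}")
print(f"IPv6: {ipv6_space:,}")
```

So even "35 thousand trillion" (3.5 x 10^16) barely scratches the surface of the IPv6 space.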
This goes to show that, if the "next Google" is born in a garage, the garage will have to be the size of a stadium parking ramp--and the founders won't be able to fund their venture by hocking their Playstations. :-)
|The numbers are truly impressive and eye opening, even more when you realise the simplicity of it all from the user end. |
Well said. And even from the site owner's perspective, Google appears to be MUCH simpler than it really is. We just want to know "how many of my pages do you have" and so on. Questions like that seem like no-brainers, until the issues of mega-scale come into play.
Hmm, so was all that discussion on this forum about DOCIDs, just about a year ago, spot on but just a tad too early?
Actually, once a problem like that had been identified at Google, it must have spawned an update project of enormous proportions.
Something like that could not be planned and implemented in just a few months, so I guess the DOCID people were probably right...
Except the DOCID discussion was about running out of unique keys to label the data...
These quotes suggest that they are running out of disk space to store the data itself.
Is Google the next Enron? Maybe they are getting too big for their britches.
P.S. I love google.
|This goes to show that, if the "next Google" is born in a garage, the garage will have to be the size of a stadium parking ramp--and the founders won't be able to fund their venture by hocking their Playstations. :-) |
Unless of course the new start-up has no infrastructure whatsoever. How about, hmmm, maybe... distributed?
The G CEO referring to full machines may not have specifically been talking about the DC's for searching.
G has had many resource problems (e.g. remember the Analytics fiasco, where G underestimated user demand?). I would be highly surprised if they'd let their search DCs be under-resourced.
.. although anything is possible.
Love it! Several years ago, I think around or just after the Florida fiasco, someone pointed out on here that Google may well be having problems addressing all of its content.
Hence the dropped sites.
All those that agreed were seriously slated and laughed at.
I wonder if those pious souls will now admit to being wrong?
Get in line.
Technology is never a barrier. Particularly when you have billions of dollars to spend. It's cute of Google to say they are having problems and they will be fixed.
OK Google, here's how to do it:
1. DROP the freaking pages deleted 2 years ago, and index the current ones
2. Stop storing everything users do while at Google.com (or at a site that uses Adsense)
3. You should be fine by now.
I have to say I'm amazed that it will take $1.5 billion a year, in hardware alone, to run Google; I would never have thought it was so much.
|1. DROP the freaking pages deleted 2 years ago, and index the current ones |
Agreed. As a user I've found supplemental cached results useful, but as a site owner, I'm not sure that I want Google retaining them.
And what about Gmail? I don't know how many users they have, but it seems to have turned into a free file storage service.
Well, if you don't want your pages to end up in Google cache, simply set the <META NAME="ROBOTS" CONTENT="NOARCHIVE"> tag, and Google will delete the page from the cache the next time it crawls the page. (or you can use the Automatic url remover tool [services.google.com] from Google to have it done sooner.)
Ha ha - priceless!
I don't think anyone could sum up Google's problem better than Webmasterworld - just look at a few of the main forum topics at the moment:
1. "Google CEO admits 'We have a huge machine crisis'"
2. "Pages Dropping Out of Big Daddy Index"
3. "Somethings Up Right Now! -- 30 domains just went "home page only"
At the end of the day, it can explain in part some of the things that are going on (including the new caching proxy, where multiple crawlers share one cached view of the data).
To me, if you have a huge space problem you solve it in two ways: more space, or less data. Now, if you come up with an amazing technique for storing less data (e.g. duplicate content removal, canonical improvements), you have saved a huge amount of cash. But if you underestimate the implications of your assumptions, you remove millions of web pages by accident. And worse, if you do it "live" you have a bigger problem, yet you have no choice but to do it live, because you are running out of space.
But you have loads of other services that need data storage, so which ones do you sacrifice: the free ones, or the ones that generate your whole income? That's right, it is a better business decision to compromise your free search product (nobody may notice, after all, if you do it right) than your paid one (AdSense and AdWords need the space, and no compromises can be made there).
And there is the difference between Google of 1997 and the Google of today.
I was under the impression that Big Daddy was done to save bandwidth, not disk space. As for a machine crisis, it isn't hard to purchase datacenters; there are quite a few for sale at basement prices from companies going under.
I bet this problem could be solved by more efficient compression? In any case, I did read somewhere that the Google File System stores seven copies of any one document for backup and load-balancing purposes. Maybe they can reduce that to four or five as a very temporary solution...
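If that seven-copy figure is accurate (I can't confirm it; Google's published GFS paper describes three replicas as the default, so treat seven as the post's assumption), the savings are easy to sketch with invented numbers:

```python
# Hypothetical raw-disk footprint under different replication factors.
# The 7-copy figure comes from the post above; the corpus size is invented.
logical_data_pb = 5.0  # assumed size of the stored corpus, in petabytes

for replicas in (7, 5, 4, 3):
    print(f"{replicas} copies -> {logical_data_pb * replicas:.0f} PB of raw disk")
```

Dropping from seven copies to four would cut raw disk needs by over 40%, which is why replication factor is such a tempting lever when space is tight.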
google stock up 18 cents amid problems.
rohitj, yes, but if it was as simple as increasing compression then surely the answer would be just to do it, and not state in an interview that you are having a machine crisis. Especially as a publicly traded company.
I wonder if they are dropping quotes like this to the press to scare off potential rivals. Also, I know gigablast claims they can index the web at a much lower cost than Google.
Google want to “organise the world's information”.
Does that mean just indexing, or do they want to STORE the world's info? What would it take to electronically store all the books in the British Library, Library of Congress, etc., plus all the newly released books?
With the increase in storage density it might be possible to plan for it.
I think they are doing it to deflate the current problems and deflect any criticisms that may arise if and when a public story occurs about what is being found by webmasters.
By "bigging up" the scale and cost, they justify any problems that may be faced in that "gigantic mission" to index the web, and thereby find sympathy and understanding from the community by presenting the challenge they face. Which is, there is no doubt, a big challenge, but if you have hit this sort of crisis in 2006, don't run a search engine.
Would it help if the very largest sites used <META NAME="robots" CONTENT="noarchive">, one wonders?
Even large sites doing it would probably make little difference, but some of the megasites doing it might have an effect.
I've been thinking of doing it by default myself, but mainly to prevent people accessing old information, now that Google's update cycle seems to have slowed so much.
|if you don't want your pages to end up in Google cache |
Yes, I meant the stale entries, and I was speaking philosophically. I personally don't have any problem.
Phil_Payne, what you said there is another signal of the problems they are facing "update cycle slowing".
When faced with these sort of storage problems you need to keep a check on the refresh rate of your data - especially if your algo has a "time" factor i.e. link rate, domain history etc. because then your data storage requirement explodes.
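A back-of-the-envelope sketch of that explosion, with made-up numbers (the domain count and record size below are pure assumptions for illustration):

```python
# Storage grows linearly with the amount of history retained when an
# algorithm tracks "time" factors (link rate, domain history, etc.).
# Both figures below are invented for illustration only.
domains = 100_000_000        # hypothetical number of domains tracked
bytes_per_snapshot = 2_000   # hypothetical per-domain record size

for snapshots in (1, 12, 52):  # latest only, monthly history, weekly history
    total_tb = domains * bytes_per_snapshot * snapshots / 1e12
    print(f"{snapshots:>2} snapshots/domain -> {total_tb:,.1f} TB")
```

Keeping a year of weekly history multiplies the footprint 52-fold over storing only the current state, before you even get to the page content itself.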
And to think... one day the entire current Google database will be able to fit on a thumb drive.
> When faced with these sort of storage problems you need to keep a check on the refresh rate of your data - especially if your algo has a "time" factor i.e. link rate, domain history etc. because then your data storage requirement explodes.
True. I was for quite a time involved in large systems performance (Vice-Chairman of UK CMG) and nasty things happen in queueing theory when resources get tight - the "knee of the curve" effect. If Google is tight on storage, it will be spending ever more time and resources finding places to stick things. Rules of thumb (crude, I know) suggest 70% is about the best utilisation of a storage resource you'll ever get if you want the system to perform.
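The "knee of the curve" is easy to see with the textbook M/M/1 queueing formula, mean response time R = S / (1 - U), where S is the service time and U the utilisation (numbers illustrative only):

```python
# Mean response time under the M/M/1 model explodes as utilisation
# approaches 100% -- the "knee of the curve" mentioned above.
service_time = 1.0  # one unit of work

for utilisation in (0.50, 0.70, 0.90, 0.95, 0.99):
    response = service_time / (1 - utilisation)
    print(f"U = {utilisation:.0%} -> response = {response:.1f}x service time")
```

At 70% utilisation you are already at roughly 3.3x the unloaded response time, and past that it climbs steeply, which matches the 70% rule of thumb above.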
I've commented in other places about a site of mine where Google's response to my sitemap updates has been odd. I've now realised exactly what's happening: Googlebot is faithfully downloading changes I notified via new sitemaps around five weeks ago. It's just worked its way through March 28 and started on March 29, in order.
The corollary is that Google is delivering results to users based on old data. At least five weeks old, on this one site. It won't take long for Joe Public to start to realise that Yahoo, Ask Jeeves, etc., deliver more pertinent results than Google. In some consumer markets, five weeks is the lifetime of a product.