Liane: Hmm, I didn't read the original NYTimes article. It requires a log in.
In my post above, I was thinking of the current problems as bugs in the final rollout of an upgrade happening right now. I now see that what you are talking about is getting new machines in the next year; there is no hardware upgrade at this time. In fact, Matt Cutts said as much on his blog just a few weeks ago.
In that case, it puts a whole new spin on things. This new "infrastructure" that Matt Cutts has talked about on his blog for the last few months is then merely lots of band-aids on the existing kit, and more clever ways of utilising the existing kit until the new stuff can come online many, many months from now...
Now, that would explain why they have added a crawl cache. They haven't got enough kit to spider the entire web any more, and the bandwidth needed was becoming too great. It might also explain why pages from some sites are disappearing en masse: they are deleting "unimportant" stuff to make way for new stuff, but have got some of it wrong. It doesn't explain why a very large pile of very obvious junk does remain indexed, though.
I do have a theory that supplemental results for 404 pages and for expired domains are kept cached so they can be compared for duplicate content when a spammer sets up a new copy of a banned site on a new domain, hoping to "start again" without being noticed; but that is for another time.
Let's take just one of Google's many products [google.com] and analyze the needed storage space... Google Desktop.
In order for Google Desktop to store a mirror copy of a user's files, Google needs to have an equal amount of storage space as each user. "But they'll compress it!" Yes, but Google will also save multiple backups of each compressed chunk of bytes. So, I bet that in the end, 100 MB of user stuff equals close to 100 MB on Google's servers (even though it may be 3 compressed versions at ~30MB each).
So how much storage does Google need to plan for just for Google Desktop?
The answer is simple: For each byte of user data, they need a byte. So if there are a half billion hard drives worth of data out there, Google will need a half billion hard drives to hold it.
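The byte-for-byte claim above can be sketched out. The compression ratio and replica count below are illustrative guesses of my own, not anything Google has published:

```python
# Back-of-envelope model of per-user storage cost for a mirrored-desktop
# service. Compression ratio and replica count are assumptions for
# illustration, not Google's actual figures.

def server_side_bytes(user_bytes, compression_ratio=0.3, replicas=3):
    """Server-side storage needed for one user's data.

    compression_ratio: compressed size / original size (assumed ~0.3)
    replicas: independent copies kept for durability (assumed 3)
    """
    return user_bytes * compression_ratio * replicas

user_mb = 100
print(server_side_bytes(user_mb))  # ~90 MB -- close to the original 100 MB
```

The point being: the replication needed for durability roughly cancels out whatever compression saves you, which is why a byte of user data still costs about a byte of Google disk.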
Creating more DCs wouldn't help, because each DC has a complete copy of the index. The bottleneck may be in the algorithm itself. Every algorithm in parallel computing reaches network saturation at some point, and then it doesn't help to scale out, i.e. buy more nodes, because each node adds more traffic to an already congested network. Maybe they have reached that point already.
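The saturation argument can be illustrated with a toy model. The all-to-all traffic pattern and the per-pair figure here are assumptions purely for illustration:

```python
# Toy model of the scale-out limit described above: if each of n nodes
# must exchange a fixed amount of data with every other node (an
# all-to-all pattern), total traffic grows roughly with n squared,
# while a shared network's capacity stays fixed.

def total_traffic_mb(nodes, per_pair_mb=1):
    # every ordered pair of distinct nodes exchanges per_pair_mb
    return nodes * (nodes - 1) * per_pair_mb

for n in (10, 100, 1000):
    print(n, total_traffic_mb(n))
# 10x more nodes -> roughly 100x more traffic: past the saturation
# point, adding machines makes congestion worse, not better.
```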
Also, considering the "average person's data storage growth of 250 megabytes per year" (according to GE [ge.com]) and that worldwide storage needs have an "expected annual growth rate of 55%" (according to Seagate [seagate.com]), Google's mission to organize it all is going to add fuel to the fire, since at some point in the process they will need to hold every piece of it, even if they don't cache a full copy.
I suspect someone simply couldn't handle large numbers.
Google say (somewhere) that their data centers have around 10,000 standard PCs. Let's assume that means they have on average a 100 gigabyte hard drive each.
That sounds like a lot of storage -- 10,000 * 100gig = a cool petabyte.
Most of which was probably empty, or used for hot backup (see the various papers on the Google File System).
So they thought: how can we make use of this infinite amount of spare disk space?
And they came up with things like gmail.
But just a million users with an average of a gig each is a petabyte. That's all the storage (not just the spare) gone in a trice.
(My numbers are order-of-magnitude: maybe it's 5 million gmail users each using only 200 meg, and that compresses to a third of the size... But you get the idea. Lots of seemingly clever free applications to leverage that infinite amount of disk space; and it's gone in under two years).
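For what it's worth, the order-of-magnitude arithmetic in the last few posts checks out:

```python
# Order-of-magnitude check of the figures in the posts above.
GB = 1
PB = 1_000_000 * GB  # a petabyte is a million gigabytes

machines = 10_000
disk_per_machine = 100 * GB
print(machines * disk_per_machine / PB)  # 1.0 -- a cool petabyte of raw capacity

gmail_users = 1_000_000
per_user = 1 * GB
print(gmail_users * per_user / PB)  # 1.0 -- one free product could eat it all
```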
<theBear's rules rule=3>
Well, if they invent a way to mathematically prove that in general there exists a compression function that always decreases the number of bits needed to encode an item, then they will need zero hard drives for Google Desktop.
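That joke has a real counting argument behind it: a lossless compressor that shrinks every input is impossible, because there are strictly fewer short bit strings than long ones. A quick check:

```python
# Pigeonhole check: there are 2**n bit strings of length n, but only
# 2**n - 1 strings of length strictly less than n (lengths 0..n-1).
# So no lossless compressor can shrink every n-bit input: at least two
# inputs would have to share an output.

def shorter_strings(n):
    # number of bit strings with length < n: 2**0 + 2**1 + ... + 2**(n-1)
    return sum(2**k for k in range(n))  # equals 2**n - 1

for n in range(1, 10):
    assert shorter_strings(n) < 2**n  # always one output short

print(shorter_strings(8), 2**8)  # 255 possible outputs for 256 inputs
```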
I think the most important point made in this thread was that the time to process the data on additional hard disk space is not linear in the amount of new space, and quite likely grows far faster than that.
Then you want a backup of all this data too...
g1smd, that's what I was trying to say: Big Daddy is the attempt at removing data that allows Google to continue, at least in the short term. And that is why they can't correct the problems that GoogleGuy said, a few months ago, would take a "few weeks to correct by speaking to the crawl team". Let's be honest: we all know these sorts of problems occurred a long time ago (years), and I believe this is the academic fix rather than the "throw more cash at the hardware" fix. And it could have worked, if the fix didn't have bugs in it. I say that because when I worked in that type of industry, we found storage requirements started to explode when we added new data requirements on top of the original techniques. The solution was to rewrite the existing method of storage, as it became uneconomical once the type of data being stored changed. But that was a false economy: in the long term you still hit the storage barrier, and it arrives faster than you think.
It seems to me they are discarding data; that is why people are seeing odd cache dates, removal of pages, and increased duplicate-content filtering. But this is amplified by problems with canonical pages etc. The end result: if you take a harsh line on the data you want to retain, totally remove "duplicates" from your index, and get it wrong which pages you remove, you can't get them back. And you can't recrawl them, because you haven't got enough space, or you aren't sure which ones to recrawl because you aren't sure which ones are real any more (when you find there is a bug in your handling of duplicate content and canonicalisation).
End result, chaos.....
And people have noted things like "I am sure the PhDs and smart guys should have predicted this". I am sure they did. I am also sure that the money guys at Google said: we ain't spending $1 billion on new hardware, find another solution.
And they did.
And now they need $1.5 billion.
Sometimes I'm a bit slow on the uptake.
This time I'll blame it on being 2 a.m. here now. Bedtime!
and there it was Google's new data center, formerly known as the Republic of Cuba. :)
It's quite ironic: Google made a big point of indexing every page they could find for a long time, so the total number they always trumpeted would constantly increase. In the course of it, they vacuumed up billions of URL versions for what were actually just millions of pages (to hell with determining which is the right URL - we'll list them all!), billions of "directory" pages, billions of scrapers, billions of blogs, and now both their search results and machines have choked on all that useless dross.
Imho, it's a good thing. If this is what it has taken to hose out all the crap that pollutes the internet, great stuff.
...Until it's your site that disappears with the crap for no reason. That is what people are on about.
A search engine without spam is truly a great thing - however a search engine without websites is a pile of cow dung.
Ha ha, g1smd.
Night night, 3 am for me - not that it's a competition!
Google isn’t “two guys working in a garage” anymore. They are now a publicly traded corporation with responsibilities to their investors.
In my opinion, Schmidt made a huge blunder, stating publicly that their machines are full. It just makes him and all their top management look like fools who failed to allow sufficient provisions for the company’s ongoing expansion requirements.
This is very disturbing to say the least and for the first time ... my faith in Google has been seriously shaken. It may be time to put "plan B" into action!
Truly scary stuff this!
Very simple fix: if a page returns a 404 when Googlebot comes by, then attempt to come back on two more consecutive days. If the page is still 404, it's strike three and out of the index. If it comes back at a later date, it can be reindexed then. How many websites are down three days in a row? Not many at all.
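The three-strikes policy being proposed could be sketched like this (the threshold and reset behaviour are the poster's suggestion, not anything Google actually does):

```python
# Sketch of the proposed "three strikes" recrawl policy: retry a 404 on
# consecutive days and only drop the page after three failures in a row.

MAX_STRIKES = 3

def update_index(status_code, strikes):
    """Return (keep_in_index, new_strike_count) after one crawl attempt."""
    if status_code == 404:
        strikes += 1
        return strikes < MAX_STRIKES, strikes
    return True, 0  # page is back: reset the counter, keep it indexed

keep, strikes = update_index(404, 0)        # strike one -> still indexed
keep, strikes = update_index(404, strikes)  # strike two -> still indexed
keep, strikes = update_index(404, strikes)  # strike three -> dropped
print(keep)  # False
```

A site that recovers on day two would reset to zero strikes and stay indexed, which is the property the poster is after.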
It shows how serious this problem is if a comment like this can be made - you can imagine what it is like internally.
All through this time people have made an inaccurate assumption that there is some sort of master plan. But if you think about all the huge business failures out there, I am sure people said the same thing about them (i.e. what were the management doing, etc.).
The truth is when a company gets to this size it is probably only the "phds" on the shop floor that know the implication of what is happening and as that message gets passed up the tree it gets watered down until it gets to bursting point!
So, everything people were talking about in relation to Google for the past few years about capacity may be true - they are after all human!
By the way, I have had my plan B in action since October last year - and it is just as good as plan A.
If anyone cares, give me a shout I don't mind sharing.
We could also assume that the folks at Google are, in fact, smart. I personally believe this... Something else that might be going on here is that we are taking what was said literally:
|As Google grows, so does its need to store and handle more Web site information, video and e-mail content on its servers. "Those machines are full," Mr. Schmidt, the chief executive, said in an interview last month. "We have a huge machine crisis." |
This statement could have been meant figuratively. Perhaps what he is stating is that all of the capacity they currently have is accounted for. Meaning it's already earmarked for a product / project. He may simply be pointing out the future need for even more infrastructure.
WebmasterWorld was down for quite some time; was it 4 days until they moved to Rackspace? In any case, by the logic of one of the posters above, all of WebmasterWorld's information would have been removed, which would have harmed both Google (their results would no longer have some of the great posts on this site) and WebmasterWorld. Definitely not a win-win situation in my opinion. Perhaps they could apply a longer sandbox period to such sites and index them in far less depth than others; but that too seems like a simple solution which wouldn't necessarily work.
Yes of course they are smart. That is not in dispute.
But having lots of smart people doesn't guarantee things like communication, risk management, business forecasting, etc.
What I am saying is that Google is made up of amazingly smart people; but, given a risk scenario, how good is the "organisation" at dealing with the wider issue?
That is exactly what happened to companies like Microsoft and Oracle in the late eighties - they had to deal with the problems of moving from a "small" development house to co-ordinating large product releases - and they didn't do it smoothly.
Just having smart people doesn't create a "smart" business.
CEOs have a knack for understatement when they have problems. I'm sure this is no different. As so many other threads have illustrated, they've got big problems. I think Schmidt's statement was pre-emptive damage control, for the sake of their stock price.
rohitj, webmasterworld is one of those sites that get special treatment - just like BMW when they were banned for a few days.
It is not good press for a major search engine to not index the major sites on the web - e.g. CNN or the biggest webmaster websites like webmasterworld or searchenginewatch.
These guys are bulletproof, just; and Brett knows that, so do Matt, GoogleGuy and Danny Sullivan and friends. When you become important to webmasters, you become important, full stop. In my opinion that's the way it should be.
However, with all the extreme problems webmasters have faced during this several month Google debacle, not once has Brett or Danny faced Google up with this. And that is just the way life is.
guru5571, couldn't have said it better myself.
I have always had the greatest respect and admiration for Google. And, I have always believed that the fact that they rarely stand still is a good thing.
|"Those machines are full," Mr. Schmidt, the chief executive, said in an interview last month. "We have a huge machine crisis." |
... is hard to take any other way but literally. He didn't say, "Our machines are going to be full to capacity around this time next year."
He used the word "crisis" for Pete's sake and clearly stated that they need to make a 1.5 billion dollar hardware investment ... presumably to keep up with their current growth rate!
Well that's a fine kettle of fish don't cha think?
|"Those machines are full!" |
That kind of statement would not make me run right out and buy Google stock, but you can be certain I would be on the phone to my broker in the shake of a lamb's tail to tell him/her to sell my stocks and cut my losses PDQ!
Am I the only one who feels a statement like that is more than a little bit scary? The implications of Mr. Schmidt's comments are immense in my opinion.
[edited by: Liane at 3:12 am (utc) on May 4, 2006]
Liane, no you are not alone; I hope I have made my feelings clear too in my past threads. This is a huge admission from a publicly quoted company.
This is the sort of thing that happens in a large company that is immature as a corporate.
>> Perhaps what he is stating is that all of the capacity they currently have is accounted for. Meaning it's already earmarked for a product / project. He may simply be pointing out the future need for even more infrastructure.
which still is a problem: it means that Google faces a serious problem in keeping up with the data. The data increases, whether Google indexes it or not.
|The truth is when a company gets to this size it is probably only the "phds" on the shop floor that know the implication of what is happening and as that message gets passed up the tree it gets watered down until it gets to bursting point |
The problem with this is that Eric Schmidt is Dr Eric Schmidt. Ph.D. in computer science from the University of California-Berkeley.
Then you have
Dr Shona Brown
Dr Alan Eustace
Dr W. M. Coughran, Jr.
Dr Urs Hölzle
Dr Vinton G. Cerf
Dr Douglas Merrill
But I can follow your notion: in any crisis, blame the MBAs ;)
Do you really have to take it so literally?
What I am saying is that out of potentially hundreds of people (whatever their qualifications, or their bosses' qualifications), a communication culture is the issue.
If it is a communication culture (wow, do PhDs and MBAs follow the same rules as human nature?) then at some point there was a breakdown between R&D, implementation and/or management.
I was talking about a company: being smart does not guarantee the environment, culture and infrastructure needed to build a large organisation.
Otherwise anyone could have a great idea, employ a bunch of smart people and form a company. Companies don't fail because of lack of intelligence or qualifications; they fail because of the lack of internal procedures, communication and management, and a PhD is a waste of time there unless it is in one of those disciplines.
Oh, I forgot:
And that's me. Oh, by the way, I am a crap multi-tasker and I never like talking to my boss about what I am doing, because I like living in my own world. Oh no, what happens if I don't update my boss straight away that we are running out of space? I know: keep it a secret until I work out a way of getting round it. Done it, told my boss, he went with it, I gave it a go... users reporting problems. Who cares, I successfully compressed x megabits into x kilobits and used x algo to deviate the data from memory to the web farm... Sound crap? Welcome to my mates, the programmers. They don't care, but they are PhDs too.
Oh, and by the way the gloves are off and I am a real Dr - maybe I got lucky becoming a manager away from the dark side of "the farce".
Just a little note if you weren't serious - sorry about that small chip on shoulder!
Story now in The Register thanks to Andrew Orlowski.
5 pages, and no one has worked the numbers backwards yet?
If they use this to buy $10,000 servers (which would be a very bulky, name-brand server: dual proc, a couple of gigs of RAM, 6x300 GB SCSI drives, give or take), they can buy 150,000 servers. That's 225,000,000 gigabytes of uncompressed information storage on servers that can handle thousands of requests per second EACH.
If they went with no-name dual-proc servers (SATA, a couple of gigs of RAM), you're talking $2,000 apiece with the same storage as the SCSI server, which gives you 1,125,000,000 gigabytes of storage on 750,000 servers. You're beyond petabytes there.
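Running the same arithmetic (prices are the poster's estimates; usable capacity assumes roughly 1.5 TB per server, e.g. 6x300 GB drives with one drive's worth lost to redundancy, which is an assumption on my part):

```python
# Working the $1.5 billion backwards into server counts and storage.
# Server prices and usable capacity are the thread's estimates, not specs.

BUDGET = 1_500_000_000  # $1.5 billion

def fleet(price_per_server, usable_gb_per_server=1500):
    servers = BUDGET // price_per_server
    return servers, servers * usable_gb_per_server

print(fleet(10_000))  # (150000, 225000000)   -- 225 PB on brand-name SCSI
print(fleet(2_000))   # (750000, 1125000000)  -- over an exabyte on no-name SATA
```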
Something else is going on here.