Forum Moderators: Robert Charlton & goodroi
I was reading the New York Times article Microsoft and Google Set to Wage Arms Race [nytimes.com] and a paragraph on page 2 caught my eye. It quoted Eric Schmidt (Google CEO) admitting that they have problems storing more web site information because their "machines are full".
I am a webmaster who has had problems with getting / keeping my webpages indexed by Google. I follow Google's guidelines to the letter and I have not practiced any blackhat seo techniques.
Here are some problems I have been having:
1. Established websites having 95%+ pages dropped from Google's index for no reason.
2. New webpages being published on established websites not being indexed (pages that were launched as long as 6-8 weeks ago).
3. New websites being launched and not showing up in serps (as long as 12 months).
We're all well aware that Google has algo problems handling simple directives such as 301 and 302 redirects, duplicate indexing of www and non-www webpages, canonical issues, etc.
Does anybody think that Google's "huge machine crisis" has anything to do with any of the problems I mentioned above?
[edited by: tedster at 5:03 pm (utc) on May 3, 2006]
[edit reason] fix side scroll potential [/edit]
In my post above, I was maybe thinking about the current problems being bugs in the final rollout of an upgrade happening right now. I now see that what you are talking about is getting new machines in the next year. There is no upgrade to hardware at this time. In fact Matt Cutts said that in his blog just a few weeks ago.
In that case, it puts a whole new spin on things. This new "infrastructure" that Matt Cutts talks about in his blog for the last few months is then merely lots of band-aids on the existing kit, and more clever ways of utilising the existing kit until such time as the new stuff can come online many many months from now...
Now, that would explain why they have added a crawl-cache. They haven't got enough kit to spider the entire web any more, and the bandwidth needed was becoming too great. It might also explain why pages from some sites are disappearing en-masse. They are deleting "unimportant" stuff to make way for new stuff; but have got some of it wrong. It doesn't explain why a very large pile of very obvious junk does remain indexed though.
I do have a theory that supplemental results for 404 pages and for expired domains are kept cached to use them to compare for duplicate content when a spammer sets up a new copy of a banned site on a new domain hoping to "start again" without being noticed, but that is for another time.
In order for Google Desktop to store a mirror copy of a user's files, Google needs about as much storage space as each user has. "But they'll compress it!" Yes, but Google will also save multiple backups of each compressed chunk of bytes. So, I bet that in the end, 100 MB of user stuff equals close to 100 MB on Google's servers (stored as 3 compressed versions at ~30 MB each).
So how much storage does Google need to plan for just for Google Desktop?
The answer is simple: For each byte of user data, they need a byte. So if there are a half billion hard drives worth of data out there, Google will need a half billion hard drives to hold it.
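A rough sketch of that estimate in Python, for anyone who wants to play with the numbers. The 0.3 compression ratio and the three stored copies are just the guesses from the post above, not figures published by Google, and the function name is made up for illustration.

```python
MB = 10**6  # decimal megabyte, assumed for simplicity


def server_bytes_needed(user_bytes, compression_ratio=0.3, copies=3):
    """Estimate server-side storage for a mirrored copy of user data.

    compression_ratio and copies are guesses, not known Google values.
    """
    return user_bytes * compression_ratio * copies


stored = server_bytes_needed(100 * MB)
print(f"{stored / MB:.0f} MB stored per 100 MB of user data")
```

With those guesses, compression is almost entirely cancelled out by the extra copies, which is the "a byte for a byte" point.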
Google say (somewhere) that their data centers have around 10,000 standard PCs. Let's assume that means they have on average a 100gigabyte hard drive each.
That sounds like a lot of storage -- 10,000 * 100gig = a cool petabyte.
Most of which was probably empty, or used for hot backup (see the various papers on the Google File System).
So they thought: how can we make use of this infinite amount of spare disk space?
And they came up with things like gmail.
But just a million users with an average of a gig each is a petabyte. That's all the storage (not just the spare) gone in a trice.
(My numbers are order-of-magnitude: maybe it's 5 million gmail users each using only 200 meg, and that compresses to a third of the size... But you get the idea. Lots of seemingly clever free applications to leverage that infinite amount of disk space; and it's gone in under two years).
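The back-of-envelope numbers above can be checked with a few lines of Python. This is only a sketch: the 10,000 PCs, the 100 GB drives, and the gmail user counts are the poster's order-of-magnitude guesses, decimal units are assumed, and the helper name is hypothetical.

```python
GB = 10**9   # decimal gigabyte (assumption)
PB = 10**15  # decimal petabyte (assumption)


def total_bytes(count, bytes_each):
    """Total storage for `count` items of `bytes_each` bytes."""
    return count * bytes_each


# Guessed capacity: 10,000 PCs with a 100 GB drive each.
capacity = total_bytes(10_000, 100 * GB)

# Scenario 1: a million gmail users averaging a gig each.
gmail_full = total_bytes(1_000_000, 1 * GB)

# Scenario 2: 5 million users at 200 MB each, compressed to a third.
gmail_alt = total_bytes(5_000_000, 200 * 10**6) // 3

for label, value in [("capacity", capacity),
                     ("gmail, 1M x 1 GB", gmail_full),
                     ("gmail, 5M x 200 MB / 3", gmail_alt)]:
    print(f"{label}: {value / PB:.2f} PB")
```

Under the first scenario gmail alone matches the whole guessed capacity; under the second it still takes about a third of it, before any backups are counted.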
It seems to me they are discarding data; that is why people are seeing odd cache dates, removal of pages, and stricter duplicate content filters. But this is amplified by problems with canonical pages etc. If you take a harsh line on the data you want to retain, then totally remove "duplicates" from your index and get it wrong about which pages you remove, you can't get them back. And you can't recrawl them, because you haven't got enough space, or because you aren't sure which ones are real any more (once you find there is a bug in your handling of duplicate content and canonicalisation).
End result, chaos.....
Imho, it's a good thing. If this is what it has taken to hose out all the crap that pollutes the internet, great stuff.
In my opinion, Schmidt made a huge blunder, stating publicly that their machines are full. It just makes him and all their top management look like fools who failed to allow sufficient provisions for the company’s ongoing expansion requirements.
This is very disturbing to say the least and for the first time ... my faith in Google has been seriously shaken. It may be time to put "plan B" into action!
Truly scary stuff this!
It shows how serious this problem is if a comment like this can be made - you can imagine what it is like internally.
All through this time people have made an inaccurate assumption that there is some sort of masterplan. But if you think about all the huge business failures out there - I am sure they said the same thing (i.e. what were the management doing etc.)
The truth is when a company gets to this size it is probably only the "phds" on the shop floor that know the implication of what is happening and as that message gets passed up the tree it gets watered down until it gets to bursting point!
So, everything people were talking about in relation to Google for the past few years about capacity may be true - they are after all human!
As Google grows, so does its need to store and handle more Web site information, video and e-mail content on its servers. "Those machines are full," Mr. Schmidt, the chief executive, said in an interview last month. "We have a huge machine crisis."
This statement could have been meant figuratively. Perhaps what he is stating is that all of the capacity they currently have is accounted for. Meaning it's already earmarked for a product / project. He may simply be pointing out the future need for even more infrastructure.
But having lots of smart people does not equal things like communication, risk management, business forecasting etc.
What I am saying is that Google is made up of amazingly smart people - but, given a risk scenario, how good is the "organisation" at dealing with the wider issue?
That is exactly what happened to companies like Microsoft and Oracle in the late eighties - they had to deal with the problems of moving from a "small" development house to co-ordinating large product releases - and they didn't do it smoothly.
Just having smart people doesn't create a "smart" business.
It is not good press for a major search engine to not index the major sites on the web - e.g. CNN or the biggest webmaster websites like webmasterworld or searchenginewatch.
These guys are bulletproof, just - and Brett knows that, so do Matt, Googleguy and Danny Sullivan and friends. When you become important to Webmasters you become important full stop. In my opinion that's the way it should be.
However, with all the extreme problems webmasters have faced during this several-month Google debacle, not once have Brett or Danny confronted Google with this. And that is just the way life is.
But ...
"Those machines are full," Mr. Schmidt, the chief executive, said in an interview last month. "We have a huge machine crisis."
... is hard to take any other way but literally. He didn't say, "Our machines are going to be full to capacity around this time next year."
He used the word "crisis" for Pete's sake and clearly stated that they need to make a 1.5 billion dollar hardware investment ... presumably to keep up with their current growth rate!
Well that's a fine kettle of fish don't cha think?
"Those machines are full!"
That kind of statement would not make me run right out and buy Google stock, but you can be certain I would be on the phone to my broker in the shake of a lamb's tail to tell him/her to sell my stocks and cut my losses PDQ!
Am I the only one who feels a statement like that is more than a little bit scary? The implications of Mr. Schmidt's comments are immense in my opinion.
[edited by: Liane at 3:12 am (utc) on May 4, 2006]
which still is a problem: it means that Google faces a serious problem in keeping up with the data. The data increases, whether Google indexes it or not.
The truth is when a company gets to this size it is probably only the "phds" on the shop floor that know the implication of what is happening and as that message gets passed up the tree it gets watered down until it gets to bursting point
The problem with this is that Eric Schmidt is Dr Eric Schmidt. Ph.D. in computer science from the University of California-Berkeley.
Then you have
Dr Shona Brown
Dr Alan Eustace
Dr W. M. Coughran, Jr.
Dr Urs Hölzle
Dr Vinton G. Cerf
Dr Douglas Merrill
But I can follow your notion, in any crisis blame the MBAs ;)
Dr Matt
What I am saying is that out of potentially hundreds of people (whatever their qualifications, or their bosses' qualifications) a communication culture is the issue.
If it is a communication culture (wow, do PhDs or MBAs follow the same rules as human nature) then at some point there was a breakdown between R & D, Implementation and/or management.
I was talking about a company - being smart does not pre-qualify the environment, culture and infrastructure to build a large organisation.
Otherwise anyone could have a great idea, employ a bunch of smart people and form a company. Companies don't fail because of lack of intelligence or qualifications - they fail because of the lack of internal procedures, communication and management - something for which a PhD is a waste of time unless it is in one of those disciplines.
Oh, I forget:
Dr Reid.
And that's me - oh by the way, I am a crap multi-tasker and I never like talking to my boss about what I am doing because I like living in my own world. Oh no, what happens if I don't update my boss straight away that we are running out of space - I know, keep it a secret until I work out a way of getting round it. Done it, told my boss, he went with it, I gave it a go - users reporting problems. Who cares, I successfully compressed x megabits into x kilobits and used x algo to deviate the data from memory to the web farm..... Sound crap? Welcome to my mates, the programmers; they don't care but they are PhDs too.
Oh, and by the way the gloves are off and I am a real Dr - maybe I got lucky becoming a manager away from the dark side of "the farce".
Just a little note if you weren't serious - sorry about that small chip on shoulder!