I was reading the New York Times article Microsoft and Google Set to Wage Arms Race [nytimes.com], and a paragraph on page 2 caught my eye: it quotes Eric Schmidt (Google's CEO) admitting that Google has trouble storing more web site information because its "machines are full".
I am a webmaster who has had problems getting and keeping my webpages indexed by Google. I follow Google's guidelines to the letter and have not practiced any black-hat SEO techniques.
Here are some of the problems I have been having:
1. Established websites having 95%+ of their pages dropped from Google's index for no apparent reason.
2. New webpages published on established websites not being indexed (pages launched as long as 6-8 weeks ago).
3. New websites launched and not showing up in the serps (for as long as 12 months).
We're all well aware that Google's algo has problems handling simple directives such as 301 and 302 redirects, along with duplicate indexing of www and non-www webpages, canonical issues, etc.
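As a sketch of the kind of normalisation involved (my own illustration, not Google's actual pipeline), collapsing trivial URL variants like www vs non-www and missing trailing slashes into one canonical form might look like:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Reduce trivial URL variants (www vs non-www, default port,
    empty path) to one canonical form, so duplicates index as one page."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower()            # hostnames are case-insensitive
    if netloc.startswith("www."):
        netloc = netloc[4:]            # treat www and non-www as one host
    if netloc.endswith(":80"):
        netloc = netloc[:-3]           # drop the default HTTP port
    if path == "":
        path = "/"                     # example.com and example.com/ are the same page
    return urlunsplit((scheme, netloc, path, query, ""))
```

With this, `canonicalize("http://www.example.com")` and `canonicalize("http://example.com/")` both come out as `http://example.com/`, so an indexer keyed on the canonical form stores the page once.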
Does anybody think that Google's "huge machine crisis" has anything to do with any of the problems I mentioned above?
[edited by: tedster at 5:03 pm (utc) on May 3, 2006]
[edit reason] fix side scroll potential [/edit]
As Google grows, so does its need to store and handle more Web site information, video and e-mail content on its servers. "Those machines are full," Mr. Schmidt, the chief executive, said in an interview last month. "We have a huge machine crisis."
This is exactly what Big Daddy is supposed to fix, I thought. And we are now seeing the effect of "filling" the new machines on an expanded infrastructure. This is data handling on a mega-scale that scares the stuffing out of me, and I'm not surprised to see some bits and bytes falling off the new baskets from time to time. That doesn't mean I like it, but I do feel it's aimed at a future improvement and is a necessity for Google.
Last month, when reporting its quarterly earnings, Google reported a doubling in its rate of capital investment, mainly in computer servers, network equipment and space for data centers, and said it would spend at least $1.5 billion over the next year.
$1.5 billion in one year for infrastructure alone. Wow!
The numbers are truly impressive and eye-opening, even more so when you realise the simplicity of it all from the user end.
Well said. And even from the site owner's perspective, Google appears to be MUCH simpler than it really is. We just want to know "how many of my pages do you have" and so on. Questions like that seem like no-brainers, until the issues of mega-scale come into play.
Actually, once a problem like that had been identified at Google, it must have spawned an update project of enormous proportions.
Something like that could not be planned and implemented in just a few months, so I guess the DOCID people were probably right...
This goes to show that, if the "next Google" is born in a garage, the garage will have to be the size of a stadium parking ramp--and the founders won't be able to fund their venture by hocking their Playstations. :-)
Unless of course the new start-up has no infrastructure what-so-ever. How about, hummm, maybe... Distributed?
G has had many resource problems where it underestimated user demand (e.g. remember the Analytics fiasco?). I would be highly surprised if they'd let their search DCs be under-resourced.
.. although anything is possible.
Hence the dropped sites.
All those that agreed were seriously slated and laughed at.
I wonder if those pious souls will now admit to being wrong?
Get in line.
I have to say I'm amazed that it will take $1.5 billion a year, in hardware alone, to run Google; I would never have thought it was so much.
[edited by: walkman at 7:36 pm (utc) on May 3, 2006]
1. DROP the freaking pages deleted 2 years ago, and index the current ones
And what about Gmail? I don't know how many users they have, but it seems to have turned into a free file storage service.
I don't think anyone could sum up Google's problem better than Webmasterworld - just look at a few of the main forum topics at the moment:
1. "Google CEO admits, 'We have a huge machine crisis'"
2. "Pages Dropping Out of Big Daddy Index"
3. "Somethings Up Right Now! -- 30 domains just went "home page only"
At the end of the day, this can in part explain some of the things that are going on (including the new caching proxy, where multiple crawlers cache one view of the data).
To me, if you have a huge space problem you solve it in one of two ways: more space, or less data. Now then, if you come up with an amazing technique to store less data (i.e. duplicate content removal, canonical improvements), you have saved a huge amount of cash. But if you underestimate the implications of your assumptions, you have removed millions of web pages by accident. Better yet, if you do it "live" you have an even bigger problem, yet you have no choice but to do it live because you are running out of space.
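The "less data" path can be sketched in its most naive form (my own toy version, not what Google actually runs): hash every document body and keep only the first URL seen per hash. The comment in the code shows exactly where the risky assumption lives.

```python
import hashlib

def dedupe(pages: dict[str, str]) -> dict[str, str]:
    """Keep one URL per distinct document body (exact-duplicate removal).

    Naive version: one byte of difference and two pages count as
    'distinct', and conversely any near-duplicate heuristic layered on
    top is exactly the kind of assumption that, if wrong, silently
    removes pages you wanted to keep."""
    seen: dict[str, str] = {}   # body digest -> first URL that had it
    kept: dict[str, str] = {}
    for url, body in pages.items():
        digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen[digest] = url
            kept[url] = body
    return kept
```

Run over three pages where two share a body, only the first URL of the pair survives; scale that decision to billions of pages and the "oops" case in this thread writes itself.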
But you have loads of other services that need data storage, so which ones do you sacrifice: the free ones, or the ones that generate your whole income? That's right, it is a better business decision to compromise your free search product (nobody may notice, after all, if you do it right) than your paid one (AdSense and AdWords need the space, and no compromises can be made there).
And there is the difference between Google of 1997 and the Google of today.
I bet this problem could be mitigated by more efficient compression. In any case, I did read somewhere that the Google file system stores 7 copies of any one document for backup and load-balancing's sake. Maybe they can reduce that to 4 or 5 as a very temporary solution...
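The arithmetic on that idea is simple (taking the poster's "7 copies" figure at face value; for what it's worth, the published GFS paper describes a default of three replicas). Using a hypothetical 100 TB of unique index data:

```python
def cluster_storage(unique_tb: float, replicas: int) -> float:
    """Raw storage consumed when every document is stored `replicas` times."""
    return unique_tb * replicas

# Hypothetical figures: 100 TB of unique index data.
full = cluster_storage(100, 7)      # 700 TB raw at 7 copies
trimmed = cluster_storage(100, 5)   # 500 TB raw at 5 copies
saved = 1 - trimmed / full          # fraction of raw capacity freed: 2/7
```

So dropping from 7 replicas to 5 frees about 29% of raw capacity without deleting a single unique document, which is why it looks attractive as a stopgap, at the cost of redundancy.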
By "bigging up" the scale and cost, they justify any problems faced in that "gigantic mission" to index the web, and so find sympathy and understanding from the community by presenting the challenge they face. There is no doubt it is a big challenge, but if you have hit this sort of crisis in 2006 - don't run a search engine.
Even large sites doing it would probably make little difference, but some of the megasites doing it might have an effect.
I've been thinking of doing it by default myself, but mainly to prevent people accessing old information, now that Google's update cycle seems to have slowed so much.
When faced with this sort of storage problem you need to keep a check on the refresh rate of your data, especially if your algo has a "time" factor (i.e. link rate, domain history, etc.), because then your data storage requirement explodes.
True. I was for quite a time involved in large systems performance (Vice-Chairman of UK CMG) and nasty things happen in queueing theory when resources get tight - the "knee of the curve" effect. If Google is tight on storage, it will be spending ever more time and resources finding places to stick things. Rules of thumb (crude, I know) suggest 70% is about the best utilisation of a storage resource you'll ever get if you want the system to perform.
I've commented in other places about a site of mine where Google's response to my sitemap updates has been odd. I've now realised exactly what's happening: Googlebot is faithfully downloading changes I notified via new sitemaps around five weeks ago. It has just worked its way through March 28 and started on March 29 - in order.
The corollary is that Google is delivering results to users based on old data. At least five weeks old, on this one site. It won't take long for Joe Public to start to realise that Yahoo, Ask Jeeves, etc., deliver more pertinent results than Google. In some consumer markets, five weeks is the lifetime of a product.
[edited by: Phil_Payne at 8:47 pm (utc) on May 3, 2006]