I was reading the New York Times article Microsoft and Google Set to Wage Arms Race [nytimes.com], and a paragraph on page 2 caught my eye: it quoted Eric Schmidt (Google's CEO) admitting that they have problems storing more web site information because their "machines are full".
I am a webmaster who has had problems getting and keeping my webpages indexed by Google. I follow Google's guidelines to the letter and have not practiced any blackhat SEO techniques.
Here are some of the problems I have been having:
1. Established websites having 95%+ of their pages dropped from Google's index for no apparent reason.
2. New webpages being published on established websites not being indexed (pages that were launched as long as 6-8 weeks ago).
3. New websites being launched and not showing up in serps (as long as 12 months).
We're all well aware that Google has algo problems handling simple directives such as 301 and 302 redirects, duplicate indexing of www and non-www webpages, canonical issues, etc.
Does anybody think that Google's "huge machine crisis" has anything to do with any of the problems I mentioned above?
This is exactly my area, and I can totally understand it. Not to blow my own trumpet, but I was head of research at a large PLC, building data warehousing, retrieval and analysis systems that indexed terabytes of data per day (just felt I should back up my comments with real-world experience!).
In my opinion the biggest saving would be to drop supplementals, and then delete historic cache data (another post alluded to the fact that old caches were reappearing).
All of these are relatively recent introductions, and they are all resource hogs compared to a search index. As they become increasingly popular, the demands on Google's infrastructure increase.
They're also looking for ways to push their ads out to devices such as cell phones. As innovative as Google is, they are still a one-trick pony when it comes to revenue streams. If they are going to grow revenues, they've got to push their ads everywhere too.
It won't take long for Joe Public to start to realise that Yahoo, Ask Jeeves, etc., deliver more pertinent results than Google.
... until Y, Ask and every other engine catches up with G's massive volume of data and experience the same problem as G is experiencing. It's always much easier to refine a smaller dataset.
Is it better to have a smaller dataset? I don't think so.
Will Y, Ask and the rest have the money to rapidly expand capacity, as G will do? I guess we'll see who's got the better machine when the other engines finally hit that speed bump, if they ever get that big.
But something like MSN is built with modern hardware, technology and software. If you look at what Gigablast is doing, indexing billions of pages on literally a handful of machines (plus a cache), you start to wonder what on earth Google is doing.
Why would you spend billions on machines for all the other things that don't earn revenue? As a PLC, that would never get past the shareholders: "Let's spend a billion upgrading storage and capacity for all these great new innovations that make us really cool, but see limited usage and take us away from spending on the core ad products that MS and Yahoo are trying to compete with us on."
Just work out how much time is spent merely wiring up 100,000 boxes, plugging them in, and copying the OS onto them. That's enough work to keep 100 people employed for at least several months, and it is just a small part of the overall job.
And the Maps are especially critical for pushing local sales. I've already seen google sales people walk into local stores and push their adwords program--they use examples like google maps to show how critical google is to local businesses.
What I am wondering is why Google hasn't commented on bandwidth. Getting tier-1 bandwidth, even in volume, can be costly.
What Google is doing apart from search will not help them gain market share and revenue from their core product - and their core product, search, is their only differentiation from their competitors. Take that away and it's nothing but a bunch of experiments supplemented by an ad company that gets to distribute its ads on the biggest search property in the world.
ogletree - we agree, then, that Google needs to concentrate on its core product. We disagree on how much space it takes to store the index and cache. What I am saying is that Google is storing each and every crawl of every page forever - yes, compressed - but think about 10 crawls in 10 days of a page that is 100k uncompressed. Multiply that out by a few billion pages every few days and it massively outweighs the other content.
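To put a rough number on that cache argument, here's a back-of-the-envelope sketch. All the inputs are assumptions for illustration (100 KB per uncompressed crawl from the post above; the page count, retained-crawl count, and compression ratio are hypothetical), not Google's real figures:

```python
# Back-of-the-envelope estimate of cache storage for retained crawl copies.
# All input figures are illustrative assumptions, not real Google numbers.
def cache_storage_tb(pages, crawls_per_page, kb_per_crawl, compression_ratio):
    """Total storage in terabytes for all retained crawl copies."""
    total_kb = pages * crawls_per_page * kb_per_crawl / compression_ratio
    return total_kb / 1024 ** 3  # KB -> TB

# 4 billion pages, 10 retained crawls of 100 KB each, 5:1 compression
print(round(cache_storage_tb(4_000_000_000, 10, 100, 5), 1))  # hundreds of TB
```

Even with generous compression, keeping every historical crawl multiplies the footprint of the raw index many times over, which is the poster's point.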
What I am saying is that Google had this problem a few years ago and everything they have been doing since has been with an eye on this scenario - big daddy being the implementation of a year or two's work.
And yes, storage does not explain everything - but implementing such aggressive dupe filters affects a lot of things, especially the calculation of link patterns and PageRank, as links suddenly disappear.
What I think is that when the debate about storage of DocIds came up a year or so ago, many took it as a ludicrous suggestion that Google could have technical issues like that.
All I know is that if such a large company mentions storage as an issue when operating a "large scale search engine" (which, let's be honest, is a ridiculous admission - we all take it as read that you need lots of storage), then behind the scenes there have certainly been very serious problems for a long time.
I just think it speaks volumes when the company effectively says "DOH! We forgot that we need loads of machines - we were too busy releasing new products that were really cool." As a business they should have communicated between departments: "No, we won't launch Google Page Creator yet, until we have scaled up the infrastructure - because launching it now would be silly."
I got into a spiral like this twenty or more years ago.
"We need ten people to support five mainframes."
"We need to sell ten mainframes to pay ten people."
"We need twenty people to support ten mainframes."
Short and sweet - someone took their eye off the ball. Surprising datum - this is their core product.
If I were an investor in this publicly traded, mega company, I would not be happy at all with numbers like that. This situation hasn't just snuck up on them ... Sounds like a bit of a smoke screen to me.
"We have a huge machine crisis - those machines are full"
I'm not buying what he's selling. If his statement is true, then if I were him, I would be looking for another job! Any CEO who would allow a company the size of Google to reach "critical mass" (so to speak) has no business being in that position.
Those machines are full indeed! Just wait - there are going to be more disclosures on this particular subject very soon, and suddenly that statement will make a lot more sense. My guess is that investors (hoping for dividends) will not be very happy in the near future.
Let's say that they realized at least two years ago (pre-IPO) that their storage needs were growing exponentially. As a business owner, how much money would you allocate for storage needs if you were Google, two years ago? 50x your current budget? 100x? More than likely you would allocate a number that reasonable accountants would agree to, like say 10x.
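The trouble with picking any one-off multiplier is that exponential demand eventually outruns it. A small sketch of that arithmetic, assuming (purely hypothetically) that storage demand doubles every 12 months:

```python
import math

# If storage demand doubles every `doubling_months` months (an assumed rate),
# how long does a one-off capacity multiplier last before the machines are
# "full" again? Capacity of m-times-today is exhausted after log2(m) doublings.
def months_until_full(budget_multiplier, doubling_months=12):
    return doubling_months * math.log2(budget_multiplier)

for m in (10, 50, 100):
    print(m, round(months_until_full(m), 1))
```

Under that assumption, even a 10x buildout only buys about three years, and 100x buys under seven - so "growing exponentially" makes the accountants' comfortable 10x look short-lived.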
As soon as the money is allocated (and even before), you begin drawing up plans for the expansion that include any sort of known technology and even tech that was rumored to be on the horizon but not yet available. Has that resource pool changed at all over the past two years? Oh, yes. You bet it has, both in terms of what is available and in terms of how much it costs.
Then you grab all of the IT folks you can find to get the plans implemented. How about that metric? Up or down since 2004?
Then you've got to find property to hold all of those servers (10,000 before they stopped reporting it), in some of the most expensive parts of the planet to buy real estate ... unless y'all figger those IT folks are going to move to the puckies. What do you think? Have real estate prices gone up or down in the past 2 years?
Someone noted that G could probably purchase several existing data centers at rock-bottom prices, but are they of the quantity, quality and in the locations that G absolutely must have for their setup? Dunno. Maybe.
Frankly, I think the likely scenario is not one where G was caught flat-footed, but one in which the financial system, and its increasing complexity since the IPO, stood in the way of performing tasks they knew very well were necessary, and hampered G's ability to provision the required storage.
It's always fun to giggle at the boss when their fly is open, but if anyone actually believes that G's business plan is the problem here, well, you're not paying very close attention to the company you're giggling at in this case. G has been anything but in disarray, or forgetful, or floundering.
With regard to the other services (Earth, et al.) ... your suggestions are to forget about all of that shinola and go back to being a simple search engine? Have you learned nothing from the past?
Stagnation = Death.
Do you start on ordering the new machines when the current ones are half full, and get the new stuff running long before the end of the useful life of the present kit? If you do, then you might have over-spent.
Or, do you wait a bit longer and then buy a better spec'd machine at a far lower price next year (spending less, and spending it later), but then run the risk of the new stuff coming online slightly too late, and actually running out of steam on the old kit a month or two beforehand?
This isn't a new problem that has "just" cropped up. It was noticed a long time ago and probably planned for a year ago, and now there are problems in the final stages of the roll-out.
Just do the maths on how long it would take to build 100,000 or 200,000 PCs, copy the OS and software onto them, plug them into the mains, and wire them all up to the network. The installation phase alone would occupy hundreds of people for many months.
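That maths can be sketched quickly. The throughput figure here (one technician racking, cabling and imaging 4 machines per day) is an assumption for illustration, not an industry standard:

```python
# Rough installation-effort estimate; the machines-per-tech-per-day rate
# is an assumed figure for illustration only.
def person_months(machines, machines_per_tech_day, workdays_per_month=21):
    tech_days = machines / machines_per_tech_day
    return tech_days / workdays_per_month

total = person_months(200_000, 4)  # 200,000 machines, 4 per technician-day
print(round(total))        # total person-months of installation work
print(round(total / 200))  # calendar months for a 200-person crew
```

At those rates a 200,000-box build-out works out to well over 2,000 person-months - a 200-person crew working for about a year - which is consistent with "hundreds of people for very many months".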
I guess this is a project running a few weeks or months late, with a few unexpected bugs popping up in the final stages. Just another typical day in a typical IT company really.
"I guess this is a project running a few months late, with a few unexpected bugs popping up in the final stages."
This is not something which is a few months late ... try a few years late! They haven't even bought the machines yet! He is saying they need another 1.5 billion to buy more machines because those they have are full. I can't even imagine the time it will take to buy, build, deliver and then get the new machines hooked up and on line.
No, there is nothing typical about this at all. And that kind of investment is not what I would call "a few unexpected bugs"! There is much more to this than meets the eye and my guess is that it all comes down to boardroom politics and investor manipulation.
So if poor interim relevance, supplementals, page drops etc. are upsetting webmasters and users alike, this is being weighed against a product that has to be bigger, better and more profitable than before - and they reckon the pain is worth it.
Matt, Virginia and their crew will do what they can to keep us smiling, but I think they are going "full bore" on the overall upgrade mission while balancing the relationship between webmasters/site owners and users.
I'm sure this acceleration sharpens Yahoo and MSN who are about to apply themselves as well .... I think Bill Gates earmarked an initial couple of billion last week to compete with Google didn't he?
They can't afford the gap to be too wide because it strategically weakens them.
I guess this is all obvious... but it does make holding onto the ship a bit difficult through this lengthy storm... just hold on, folks!
Not that easy. As any [mainframe] capacity planner will tell you until terminal boredom sets in (even if the bar is free) scalability is never linear. If you buy as many machines again as you have, you will never achieve twice the throughput.
Designing an application for multi-processor scalability is not trivial. Make it multi-system within a datacenter and then multi-datacenter and you get into the realms of inspiration and prayer.
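The non-linear scaling point above is the textbook Amdahl's law effect (a general model, not anything Google has published about its own systems): if any fraction of the workload is serialized, adding machines hits rapidly diminishing returns.

```python
# Amdahl's law: speedup from n machines when a fraction `serial_fraction`
# of the work cannot be parallelized. Doubling the machine count never
# doubles throughput unless serial_fraction is exactly zero.
def speedup(n_machines, serial_fraction):
    return 1 / (serial_fraction + (1 - serial_fraction) / n_machines)

# With just 5% serial work, going from 1,000 to 2,000 machines gains
# about 1% of throughput, not 2x.
print(round(speedup(1000, 0.05), 2))
print(round(speedup(2000, 0.05), 2))
```

This is why "buy as many machines again as you have" never yields twice the throughput: with even 5% unavoidable serialization, the speedup ceiling is 20x no matter how much kit is delivered.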
The "Google dance" was an obvious manifestation of these effects, but that was essentially a SERPs-only effect. Now it's spread to the other functionality - crawling, indexing, vetting for transgressions, etc.
Spending, as has been said, $1.5 billion is one issue. But when the kit is delivered, will Google's infrastructure exploit it?
These estimates err on the side of caution so we can assume that most of this investment (99% or so) is for the storage of material other than web pages. I wonder how Google plans to sell that fact to its shareholders.
"it isn't hard to purchase datacenters--there are quite a few for sale at basement prices from companies going under."
It is my understanding that the Google phenomenon, for lack of a better word, was born almost equally from unique, proprietary software and unique, proprietary hardware configurations.
I doubt that someone else's liquidated equipment could fill the bill.
And if this is truly the case, they have been managing this through varying periods of growth since Day 1.