Phil_Payne, very good observations (with very relevant experience) - and this is exactly the sort of thing that Google does not want getting to joe public without a reason.
Speaking for myself, this is exactly my field and I can totally understand it. To blow my own trumpet: I was head of research at a large PLC, creating data warehousing, retrieval and analysis systems that indexed terabytes of data per day (just felt I should back up my comments with real-world experience!)
[edited by: Swanson at 8:43 pm (utc) on May 3, 2006]
What effect would it have if Google reversed its "robots" metatag cache default?
By default not retaining the cache version would make a huge difference. But I am not sure Google actually discards the cache even when there is a "nocache" directive - much as a robots.txt ban on URLs still means that Google indexes or "lists" the URL without crawling it (URL-only results). If you keep the cache of everything, you can make all sorts of historical analyses.
In my opinion the biggest saving would be to drop supplementals - and then delete historic cache data (another post alluded to the fact that old caches were appearing).
I would think that the need for more machines is being driven by other products - not search.
All of these are somewhat recent introductions, and they are all resource hogs compared to a search index. As they become increasingly popular, the demands on their infrastructure increase.
They're also looking for ways to push their ads out to devices such as cell phones. As innovative as Google is, they are still a one-trick pony when it comes to a revenue stream. If they are going to grow revenues, they've got to push their ads everywhere too.
|It won't take long for Joe Public to start to realise that Yahoo, Ask Jeeves, etc., deliver more pertinent results than Google. |
... until Y, Ask and every other engine catch up with G's massive volume of data and experience the same problem G is experiencing. It's always much easier to refine a smaller dataset.
Is it better to have a smaller dataset? I don't think so.
Will Y, Ask and the rest have the money to rapidly expand capacity, as G will do? I guess we'll see who's got the better machine when the other engines finally hit that speed bump, if they ever get that big.
No, but the way Google indexes and scales is totally different to the others. At the beginning, that was meant to be the selling point - lots of little machines, etc.
But something like MSN is built with modern hardware, technology and software. If you see what Gigablast is doing - indexing billions of pages on literally a few machines (and a cache) - then you start to wonder what the hell Google is doing.
All take space - and earn a big fat zero revenue in the scale of things.
I don't think their storage problems have anything to do with how they index things. Their storage problem is that they are storing videos, emails, and web-page caches. I have heard G employees say many times that they don't throw anything away. I'm sure that is one of their problems as well. I bet the new holographic storage will help with their need to save things.
I think it is totally related to web-site indexing.
Why would you say you need to spend billions on machines for all the other things that don't earn revenue? As a PLC, that would not get past the shareholders: "let's spend a billion on upgrading storage and capacity for all these great new innovations that make us really cool but create limited usage, and that take us away from spending on the core ad products that MS and Yahoo are trying to compete with us on."
My google analytics has been crippled for weeks. The support email back to me said that people who have a lot of traffic can't see some of the reports currently because they don't have the resources.
I'm afraid the fullness issue is no excuse for the problems 1-3 you mentioned. These problems started at least five years ago. In my humble opinion they are due to quality control and algorithmic difficulties.
I don't understand how Google couldn't predict this in time. They have 100% control over how much data they index and can tell on a daily basis how fast their index is growing and how fast their storage is filling up. This seems like pretty basic stuff to me, so why the huge catastrophe for a billion dollar company?
Well, they could also say: why do we need to spend 1.5 billion on new machines when we could just cut back on the stuff that does not make money? Those things do take up space that could be used for index stuff. The index is not that big - it is just text, and you don't run out of room storing text. It does not take up that much room and you can compress the heck out of it. Video and pictures are what take up space. Google has lost its way. They have so much money now that they feel they have to spend it. They don't need to cache web pages, or store videos, or email. They could get rid of all of those and it would not stop the flow of income at all. As a matter of fact, it would make them a fortune, because they would stop paying for the bandwidth. They should focus on making money, not making cool toys.
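The claim that plain text compresses well is easy to sanity-check. Below is a minimal sketch using Python's zlib on some invented, repetitive HTML-like text; the page content is made up for illustration, and real pages compress less dramatically (often somewhere in the 3-5:1 range).

```python
# Back-of-envelope check of the "compress the heck out of text" claim,
# using zlib on repetitive HTML-like text. The page is invented; sizes
# are illustrative only, not representative of real crawled pages.
import zlib

# Simulate a crawled page: markup plus heavily repeated boilerplate.
page = ("<html><head><title>Widget Shop</title></head><body>"
        + "<div class='item'>blue widget, $9.99</div>" * 200
        + "</body></html>").encode("utf-8")

compressed = zlib.compress(page, 9)   # level 9 = best compression
ratio = len(page) / len(compressed)
print(f"raw: {len(page)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: {ratio:.0f}:1")
```

Highly repetitive text like this compresses far better than typical pages, but the direction of the argument holds: text indexes shrink well, video does not.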
I think they did realise what was going on, some 9 to 12 months ago, and have been working on fixing it since then, with most of the "visible" work happening in the last 4 or 5 months.
Just work out how much time is spent merely wiring up 100,000 boxes, plugging them in, and copying the OS onto them. That's enough work to keep 100 people employed for at least several months, and it is just a small part of the overall job.
|My google analytics has been crippled for weeks. The support email back to me said that people who have a lot of traffic can't see some of the reports currently because they don't have the resources. |
then again markus007, you are WAY up there when it comes to traffic ;-)
Excuse me for my ignorance, but one of the thread posters said that Maps, Video, Earth, etc. all show a big fat zero in revenue for Google. If that were the case, they wouldn't be doing it. These things instill brand loyalty, which brings more revenue - and last time I checked, Maps did show ads sometimes?
And the Maps are especially critical for pushing local sales. I've already seen google sales people walk into local stores and push their adwords program--they use examples like google maps to show how critical google is to local businesses.
What I am wondering is why Google hasn't commented on bandwidth. Getting tier-1 bandwidth, even in volume, can be costly.
Matt Cutts commented on their new crawl-cache designed to conserve their bandwidth, in his blog, just a few days ago.
Citing brand loyalty from Maps, Video, Earth, etc. is like saying MSN brand loyalty from the likes of Hotmail equals a better position in the search market.
What Google is doing apart from search will not help them achieve market share and revenue from their core product - and their core product is their only differentiation from their competitors: search. Take that away and it's nothing but a bunch of experiments supplemented by an ad company that gets to distribute its ads on the biggest search property in the world.
ogletree - we agree then on the fact that Google needs to concentrate on the core product. We disagree on how much space it takes to store the index and cache. What I am saying is that Google is storing each and every crawl of every page forever - yes, compressed - but think about 10 crawls of 1 page in 10 days: that's 100 KB uncompressed. Multiply that out by a few billion pages every few days and it outweighs the other content massively.
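The arithmetic behind that claim can be sketched out. Every figure below is an assumption for illustration (page size, crawl frequency, page count, compression ratio) - none of them are real Google numbers.

```python
# Rough sketch of "keep every crawl of every page forever".
# All constants are assumptions for illustration only.
PAGE_SIZE_KB = 10          # uncompressed size of one crawled page (~10 KB)
CRAWLS_PER_PAGE = 10       # ten crawls of each page in a ten-day window
PAGES = 8_000_000_000      # "a few billion pages"
COMPRESSION_RATIO = 4      # assume ~4:1 compression on HTML text

raw_tb = PAGE_SIZE_KB * CRAWLS_PER_PAGE * PAGES / 1024**3   # KB -> TB
stored_tb = raw_tb / COMPRESSION_RATIO
print(f"raw: {raw_tb:,.0f} TB per 10-day window, "
      f"stored: {stored_tb:,.0f} TB after compression")
```

Even compressed, retaining every historical crawl adds hundreds of terabytes per window at 2006 scale - and it never stops growing if nothing is ever deleted.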
What I am saying is that Google had this problem a few years ago and everything they have been doing since has been with an eye on this scenario - big daddy being the implementation of a year or two's work.
And yes, storage does not explain everything - but implementing such aggressive dupe filters affects a lot of things, especially the calculation of link patterns and PageRank, as suddenly links disappear.
g1smd, I think the crawl cache is intended to solve bandwidth and storage - because by doing both it reduces costs.
What I think is that when the storage of DocIDs was debated a year or so ago, many took it as a ludicrous suggestion that Google could have technical issues such as that.
All I know is that if such a large company mentions storage as an issue when operating a "large scale search engine" (which, let's be honest, is a ridiculous admission, as we all take it as read that you need lots of storage), then behind the scenes there certainly have been very serious problems for a long time.
I just think it speaks volumes when the company says "DOH! Forgot about the fact we needs loads of machines - we were too bothered in releasing new products that were really cool". As a business they should have communicated between departments saying something like "no we won't launch Google Page Creator yet until we have scaled up the infrastructure because that would be silly"
> I just think it speaks volumes when the company says "DOH! Forgot about the fact we needs loads of machines - we were too bothered in releasing new products that were really cool". As a business they should have communicated between departments saying something like "no we won't launch Google Page Creator yet until we have scaled up the infrastructure because that would be silly"
I got into a spiral like this twenty or more years ago.
"We need ten people to support five mainframes."
"We need to sell ten mainframes to pay ten people."
"We need twenty people to support ten mainframes."
Short and sweet - someone took their eye off the ball. Surprising datum - this is their core product.
Ha ha, that's a great point - everyone assumes that these guys have their eye on the ball, but at the end of the day it is just a bunch of people being paid to do stuff!
If 1.5 billion is what they need to handle the increases in data storage then, in my opinion - and with absolutely no knowledge of what the requirements are - they have been letting this problem get out of hand for far too long, and their top management should be held accountable for this "crisis"!
If I were an investor in this publicly traded, mega company, I would not be happy at all with numbers like that. This situation hasn't just snuck up on them ... Sounds like a bit of a smoke screen to me.
|We have a huge machine crisis - those machines are full |
I'm not buying what he's selling. If his statement is true, then if I were him, I would be looking for another job! Any CEO who would allow a company the size of Google to reach "critical mass" (so to speak) has no business being in that position.
Those machines are full indeed! Just wait, there is going to be more disclosures about this particular subject very soon and suddenly, that statement will make a lot more sense. My guess is that investors (hoping for dividends) will not be very happy in the near future.
buy more machines, and fast.
You guys really think that? Hmmm ...
Let's say that they realized at least two years ago (pre-IPO) that their storage needs were growing exponentially. As a business owner, how much money would you allocate for storage needs if you were Google, two years ago? 50x your current budget? 100x? More than likely you would allocate a number that reasonable accountants would agree to, like say 10x.
As soon as the money is allocated (and even before), you begin drawing up plans for the expansion that include any sort of known technology and even tech that was rumored to be on the horizon but not yet available. Has that resource pool changed at all over the past two years? Oh, yes. You bet it has, both in terms of what is available and in terms of how much it costs.
Then you grab all of the IT folks you can find to get the plans implemented. How about that metric? Up or down since 2004?
Then you've got to find property to hold all of those servers (10,000 before they stopped reporting it), in some of the most expensive parts of the planet to buy real estate ... unless y'all figger those IT folks are going to move to the puckies. What do you think? Have real estate prices gone up or down in the past 2 years?
Someone noted that G could probably purchase several existing data centers at rock-bottom prices, but are they of the quantity, quality and in the locations that G absolutely must have for their setup? Dunno. Maybe.
Frankly, I think a likely scenario is not one where G was caught flat-footed, but rather one in which the financial system and its increasing complexity since the IPO have stood in the way of performing the tasks they very well knew were necessary to complete, and have hampered G's ability to provision the required storage.
It's always fun to giggle at the boss when their fly is open, but if anyone actually believes that G's business plan is the problem here, well, you're not paying very close attention to the company you're giggling at in this case. G has been anything but in disarray, or forgetful, or floundering.
With regard to the other services (Earth, et al.) ... your suggestions are to forget about all of that shinola and go back to being a simple search engine? Have you learned nothing from the past?
Stagnation = Death.
There is a very important cost and performance trade off to be analysed here.
Do you start ordering the new machines when the current ones are half full, and get the new stuff running long before the end of the useful life of the present kit? If you do, then you might have over-spent.
Or, do you wait a bit longer and then buy a better spec'd machine at a far lower price next year (spending less, and spending it later), but then run the risk of the new stuff coming online slightly too late, and actually running out of steam on the old kit a month or two beforehand?
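That timing trade-off can be caricatured with a toy calculation. The price-decline rate below is an invented assumption, purely to show the shape of the incentive to wait.

```python
# Toy version of the buy-now-vs-buy-later trade-off described above.
# The 30%/year price decline is an invented assumption, not a real figure.
PRICE_NOW = 1.00                # relative cost per unit of capacity today
ANNUAL_PRICE_DECLINE = 0.30    # assume hardware prices fall ~30% per year

def cost_if_bought(months_from_now: int) -> float:
    """Relative unit cost if the purchase is delayed by this many months."""
    return PRICE_NOW * (1 - ANNUAL_PRICE_DECLINE) ** (months_from_now / 12)

print(f"buy now: {cost_if_bought(0):.2f}, "
      f"in 12 months: {cost_if_bought(12):.2f}")
```

The saving from waiting is real, which is exactly why capacity planners are tempted to cut it fine - and why getting the timing slightly wrong means running out of steam on the old kit first.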
This isn't a new problem that has "just" cropped up. This is something that had been noticed a long time ago, and has been planned probably a year ago, and now in the final stages there are problems with the roll-out.
Just do the maths on how long it would take to build 100 000 or 200 000 PCs, copy the OS and software on to them, plug them into the mains, and wire them all up to the network, etc. The installation phase alone would involve hundreds of people for very many months.
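A rough version of that maths, with every figure an assumption chosen just to get an order of magnitude:

```python
# Back-of-envelope for the installation effort described above.
# All constants are assumptions; the point is the order of magnitude.
MACHINES = 100_000
MINUTES_PER_MACHINE = 90   # rack, cable, image the OS, quick burn-in check
CREW = 100                 # people working in parallel
HOURS_PER_DAY = 8

total_hours = MACHINES * MINUTES_PER_MACHINE / 60
days = total_hours / (CREW * HOURS_PER_DAY)
print(f"{total_hours:,.0f} person-hours, or about {days:.0f} working days "
      f"for a crew of {CREW}")
```

Even with generous parallelism, a crew of 100 is tied up for the better part of a year on the physical installation alone, before any software or rollout problems.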
I guess this is a project running a few weeks or months late, with a few unexpected bugs popping up in the final stages. Just another typical day in a typical IT company really.
|I guess this is a project running a few months late, with a few unexpected bugs popping up in the final stages. |
This is not something which is a few months late ... try a few years late! They haven't even bought the machines yet! He is saying they need another 1.5 billion to buy more machines because those they have are full. I can't even imagine the time it will take to buy, build, deliver and then get the new machines hooked up and on line.
No, there is nothing typical about this at all. And that kind of investment is not what I would call "a few unexpected bugs"! There is much more to this than meets the eye and my guess is that it all comes down to boardroom politics and investor manipulation.
I don't know much about the technical side, but strategically what I do see is some massive commercial/operational juggling going on ... done on the run, with operational fallout along the way.
So if poor interim results relevance, supplementals, page drops, etc. are upsetting webmasters and users alike, this is being factored against a product that has to be bigger, better and more profitable than before - and they reckon the pain is worth it.
Matt, Virginia and their crew will do what they can to keep us smiling, but I think they are going "full bore" on the overall upgrade mission, while balancing the relationship between webmasters/site owners and users.
I'm sure this acceleration sharpens Yahoo and MSN, who are about to apply themselves as well ... I think Bill Gates earmarked an initial couple of billion last week to compete with Google, didn't he?
They can't afford the gap to be too wide because it strategically weakens them.
I guess this is all obvious ... but it does make holding onto the ship a bit difficult through this lengthy storm ... just hold on, folks!
> buy more machines, and fast.
Not that easy. As any [mainframe] capacity planner will tell you, until terminal boredom sets in (even if the bar is free), scalability is never linear. If you buy as many machines again as you already have, you will never achieve twice the throughput.
Designing an application for multi-processor scalability is not trivial. Make it multi-system within a datacenter and then multi-datacenter and you get into the realms of inspiration and prayer.
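One standard way to put numbers on "scalability is never linear" is Amdahl's law: if even a small fraction of the work is serial (coordination, replication, merging results), adding machines gives rapidly diminishing returns. The 5% serial fraction below is an arbitrary assumption for illustration, not a measurement of any real system.

```python
# Amdahl's law: maximum speedup when some fraction of the work is serial.
# The 5% serial fraction is an arbitrary illustrative assumption.
def amdahl_speedup(n_machines: int, serial_fraction: float) -> float:
    """Best-case speedup on n machines when serial_fraction cannot be parallelised."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_machines)

for n in (1, 2, 10, 100, 1000):
    print(f"{n:5d} machines -> {amdahl_speedup(n, 0.05):6.1f}x speedup")
```

With a 5% serial fraction, even a thousand machines cap out below a 20x speedup - which is why simply buying twice the kit never buys twice the throughput.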
The "Google dance" was an obvious manifestation of these effects, but it was essentially confined to the SERPs. Now the effects have spread to the other functionality - crawling, indexing, vetting for transgressions, etc.
Spending, as has been said, $1.5 billion is one issue. But when the kit is delivered, will Google's infrastructure exploit it?
Let's say storage costs $1.00 per gigabyte.
1 billion dollars then buys 10^18 bytes.
Let's say the average page size is 10^4 bytes (about 10 KB).
So 1 billion dollars buys storage for 10^14 web pages.
These estimates err on the side of caution so we can assume that most of this investment (99% or so) is for the storage of material other than web pages. I wonder how Google plans to sell that fact to its shareholders.
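For anyone who wants to check it, the arithmetic above (using the post's own assumptions of $1.00 per gigabyte and roughly 10^4 bytes per average page) works out as:

```python
# Verifying the storage arithmetic above, using the post's own assumptions.
dollars = 1e9
bytes_per_dollar = 1e9    # assumption: $1.00 buys one gigabyte
page_bytes = 1e4          # assumption: ~10^4 bytes per average page

total_bytes = dollars * bytes_per_dollar   # 10^18 bytes, i.e. one exabyte
pages = total_bytes / page_bytes           # 10^14 pages' worth of storage
print(f"{total_bytes:.0e} bytes buys room for {pages:.0e} pages")
```

Since estimates of the indexable web at the time ran to tens of billions of pages, 10^14 pages of capacity is orders of magnitude more than web pages alone would need - which is the post's point about where the rest of the investment must be going.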
"it isn't hard to purchase datacenters--there are quite a few for sale at basement prices from companies going under."
It is my understanding that the Google phenomenon, for lack of a better word, was born almost equally from unique, proprietary software and unique, proprietary hardware configurations.
I doubt that someone else's liquidated equipment could fill the bill.
And if this is truly the case, they have been managing this through varying periods of growth since Day 1.
Liane: Hmm, I didn't read the original NYTimes article. It requires a log in.
In my post above, I was maybe thinking about the current problems being bugs in the final rollout of an upgrade happening right now. I now see that what you are talking about is getting new machines in the next year. There is no upgrade to hardware at this time. In fact Matt Cutts said that in his blog just a few weeks ago.
In that case, it puts a whole new spin on things. This new "infrastructure" that Matt Cutts talks about in his blog for the last few months is then merely lots of band-aids on the existing kit, and more clever ways of utilising the existing kit until such time as the new stuff can come online many many months from now...
Now, that would explain why they have added a crawl-cache. They haven't got enough kit to spider the entire web any more, and the bandwidth needed was becoming too great. It might also explain why pages from some sites are disappearing en-masse. They are deleting "unimportant" stuff to make way for new stuff; but have got some of it wrong. It doesn't explain why a very large pile of very obvious junk does remain indexed though.
I do have a theory that supplemental results for 404 pages and for expired domains are kept cached to use them to compare for duplicate content when a spammer sets up a new copy of a banned site on a new domain hoping to "start again" without being noticed, but that is for another time.