This 183 message thread spans 7 pages.
|Google CEO admits, "We have a huge machine crisis"|
Google admits that they have a problem with storing the world's information
Google CEO admits - "We have a huge machine crisis - those machines are full".
I was reading the New York Times article Microsoft and Google Set to Wage Arms Race [nytimes.com], and a paragraph on page 2 caught my eye: it quotes Eric Schmidt (Google's CEO) admitting that they have problems storing more web site information because their "machines are full".
I am a webmaster who has had problems getting and keeping my webpages indexed by Google. I follow Google's guidelines to the letter and have not practiced any blackhat SEO techniques.
Here are some of the problems I have been having:
1. Established websites having 95%+ of their pages dropped from Google's index for no apparent reason.
2. New webpages published on established websites not being indexed (including pages launched as long as 6-8 weeks ago).
3. New websites being launched and not showing up in the SERPs (for as long as 12 months).
We're all well aware that Google's algo has problems handling simple cases such as 301 and 302 redirects, duplicate indexing of www and non-www webpages, canonical issues, etc.
Does anybody think that Google's "huge machine crisis" has anything to do with any of the problems I mentioned above?
[edited by: tedster at 5:03 pm (utc) on May 3, 2006]
[edit reason] fix side scroll potential [/edit]
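The www/non-www duplication mentioned in the opening post can be sketched in code. This is a minimal illustration of host canonicalization on the webmaster's side (example.com and the prefer_www choice are placeholders, not anything Google publishes):

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url, prefer_www=True):
    """Collapse www/non-www variants of the same page onto one canonical form."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    host = netloc.lower()
    if prefer_www and not host.startswith("www."):
        host = "www." + host
    elif not prefer_www and host.startswith("www."):
        host = host[len("www."):]
    # Treat "" and "/" as the same root path
    if path == "":
        path = "/"
    return urlunsplit((scheme, host, path, query, ""))

print(canonical_url("http://example.com/page?id=1"))
```

In practice a site would 301-redirect every non-canonical variant to the form this function returns, so only one copy exists to index.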
5 pages, and no one has worked the numbers backwards yet?
If they use that money to buy $10,000 servers (which would be a very bulky, name-brand server: dual proc, a couple gigs of RAM, 6x300GB SCSI drives, give or take), they can buy 150,000 servers. That's 270,000,000 gigabytes of uncompressed information storage on servers that can each handle thousands of requests per second.
If they went with no-name dual-proc servers, SATA, a couple gigs of RAM, you're talking $2,000 a piece with the same storage per box, which gives you 1,350,000,000 gigabytes of storage across 750,000 servers. You're beyond petabytes there.
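For what it's worth, that arithmetic is easy to check. The $1.5 billion budget here is only what the quoted server counts imply, not a confirmed figure, and the drive configuration is the poster's guess:

```python
# Back-of-envelope check of the server math above (all figures are the
# thread's assumptions, not Google's actual numbers).
budget = 1_500_000_000           # implied budget in dollars
per_server_gb = 6 * 300          # 6 x 300 GB drives per box

for price in (10_000, 2_000):    # brand-name SCSI vs no-name SATA
    servers = budget // price
    total_gb = servers * per_server_gb
    print(f"${price:,}/server: {servers:,} servers, {total_gb:,} GB total")
```

At those assumptions the cheap-server route lands around 1.35 exabytes of raw disk, which is indeed "beyond petabytes".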
Something else is going on here.
|CEOs have a knack for understatement when they have problems. |
That observation sounds like an overstatement to me. :-)
Don't read too literally into everything the boss says. Remember he's playing big time: investors, customers, profits, strategic plays.
The knives are out - Bill G's talking of a fighting fund of billions to compete, and he's prepared to wear a 10% share price drop to say it. So folks - this is serious.
This is a big time battle in play, and guys, yes, you and me - we just live on the ground watching the troops run left and right, fixing our fences, dealing with the collateral damage.
I kinda wonder if Matt and the team are doing a high wire act, balancing relations with webmasters and customers whilst getting shot daily by the corporate executive team for not fixing the issues fast enough. Welcome to "big time" IT.
Whatever Google's greater objective, they've gotta be confident that they'll pull the product way in front of their competitors - so stick with the ride - if you can hold on - perhaps months!
[edited by: Whitey at 6:27 am (utc) on May 4, 2006]
|...they can buy 150,000 servers. |
I have seen exactly that theory in print -- can't find it right now (I may be reading too much material at the moment!). The guess was approximately 100,000 servers, plus networking hardware and the physical room to hold it all.
By the end of 2004 or so, the previous infrastructure was already guesstimated at 60,000 to 100,000 servers. Of course, Google is neither confirming nor denying those numbers, but we can be sure that the "10,000 servers" of 4 years ago is not even close to today's count.
Let's think about 18 billion urls and growing, with stored cache information for several versions of most of them. Plus related data tables, with the raw crawl information sharded into useful pieces, tagged and with some basic analysis already performed. And a lot of this now replicated across how many data centers?
Plus the dedicated crawler machines and so on. And all of that just for search. Now throw in the rest of the products and data types they offer. We are getting very big, very fast.
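To get a rough feel for the scale described above, here is a deliberately crude back-of-envelope. Every constant is a guess for illustration, not a Google-published number:

```python
# Rough scale of the search cache alone -- all inputs are assumptions.
urls = 18e9                 # indexed URLs (figure from the post above)
cached_versions = 3         # stored versions per URL (assumption)
avg_page_kb = 10            # compressed page size in KB (assumption)
replicas = 5                # copies across data centers (assumption)

raw_tb = urls * cached_versions * avg_page_kb / 1e9   # KB -> TB
print(f"cache alone: ~{raw_tb:,.0f} TB before replication, "
      f"~{raw_tb * replicas:,.0f} TB replicated")
</```

And that is just the page cache: the sharded crawl tables, link graph, and the non-search products all come on top of it.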
I once worked in a data center for a global financial corporation. All we did, all day long, was collect transaction data from around the world, cobbling together inputs from about 70 heterogeneous technologies. We would clean the data on AS/400 machines and then upload the (hopefully) proven data to a mainframe.
We were a staff of maybe 80 or so, just trying to preserve data integrity on a much smaller (and much more stable) pile of critical data than Google wrestles with. Making and storing backup tapes alone was nasty, boring and demanding work. That experience gave me at least a small sense of the issues involved with large data sets.
So yes, I can appreciate that they have a challenge right now at Google. Given the magnitude of their current task, it's amazing that the "regular user" hasn't noticed all that much disruption! And then there's the issue of coordinating their rapidly mushrooming work force. I don't envy the job one bit.
I am also pulling for Google to get on top of their game again. It won't happen tomorrow, but maybe, soon, we can hope.
[edited by: tedster at 6:22 am (utc) on May 4, 2006]
So, the next question is: what timeline are we talking about here? Cos I for one want to see the old Gbot out gathering data and ranking sites properly again.
tigger - who knows - it's been a long 5 months, how about another 5 :)
|Something else is going on here. |
It is not just about storage. It is also about synchronising all this data across hundreds of thousands of servers while making sure the integrity of all the other services stays intact.
It is about good, experienced management and a responsible decision-making process. It is about experimenting with one set of algos and only porting it to all the other servers once it is 100% tested. Same goes for new crawlers.
It is about trying to integrate too many services too quickly (e.g. AdSense, AdWords, Search, plus Analytics) into one huge system without first making sure there is enough infrastructure to support it. Kind of "let's test on the go and see, we can always add more servers" (yeah, right).
Storage space is the least of their problems if you ask me. Good management is.
[edited by: Web_speed at 6:32 am (utc) on May 4, 2006]
I lost 165,000 listings for one domain in the last 24 hours. No page shows up for this domain in Google.
Other sites with hundreds or thousands of pages listed have dropped down to 1 page or a handful.
What's going on? I posted on MC's blog but no one answers.
Anyone have any answers? I'm very worried right now.
I just can't help thinking how happy the boss is.
"we told 'em the results are crap -
- now they can see it -
- now they know we won't fix it for a while -
- once the word gets around the big ad spenders we'll strap up that revenue depletion we've been worrying about
-look it's already working ad revenues are jumping again "
- the bottom line is, there's always a silver lining, my friends - but I think it's Google's: winning new business, forcing revenue in and building a better product whilst we wear the collateral damage. No offence intended, but that's business, I suppose.
There's no way they pay anywhere near $2k per server. I pay $2k per server for brand names fully loaded, with only a little volume (ten or so servers a month from a decent sales rep). Google has got to pay a small fraction of that. In fact, there was a PowerPoint that described their infrastructure, and I'm pretty sure they believe in very cheap servers. Things like RAM, RAID cards, and hard drives get really cheap when you buy them in bulk.
It's probably tougher to find good DCs. There are a ton of them out there -- as I said in a previous post -- but how many will have the square footage, connectivity, and redundancy required for a site like Google? I know they have a big one in Ann Arbor, which seems to be a pretty recent thing, and I'm sure they'll continue to build more.
First post here. Hi to everybody.
"lost 165,000 listings"
In another thread you mention that your site is 6 weeks old, but you didn't tell us what kind of site it is: another MFA scraper, a rich content site, what?
I have a few-months-old, heavy content (and I mean heavy content) travel site with less than a hundred pages and a handful of backlinks that has not lost any pages. Instead, it already ranks at the top of many 2- or 3-keyword searches returning a few million results. What does that mean? First, Google is clever enough to index rich, quality content sites with lots of information. Second, there is no sandbox unless the site is crap with thousands of backlinks.
Now, a bit off topic, but still relevant: can you all imagine the web without Google, with only an MS monopoly? That would be a nightmare. OK, maybe Google at the moment faces some kind of problems, and maybe not. There is nothing wrong with the SERPs for the end user at the moment; on almost every kind of search, the first page results are more accurate than both Yahoo's and MSN's. Old established pages with content are there, and they have not lost positions for at least the last 3-4 years.
I just smell that all this anti-Googlism in many threads comes mostly from pages that have been dropped due to scraping, MFA, link farming, spam techniques and so on. I don't say that all site owners who lost rankings belong to those categories; maybe some good sites have been dropped too. But I strongly believe that for useful sites it is temporary. Google is reindexing at the moment, and I am sure Google will fix it all at the final step of Big Daddy with a huge new PR/BL update.
And finally, instead of moaning about Google, remember just this: no Google, no money, no honey.
It's a directory, sort of like DMOZ, and the funny thing is, there are only 55,000 pages, all unique and content-rich, including RSS feeds.
I've heard others say they have 5,000-page sites showing 19,000+ listings recently, with no explanation for that.
I also have old sites that have had hundreds or thousands of listings that have dropped to one page or just a handful. Some are blogs and some are like rate-me sites, but every page is unique and not all of them carry AdSense code or any advertiser for that matter other than a single site sponsor.
Something is definitely wrong. Some of the caches date back to Jan '04.
I watched one site go from 260 listings down to a single page in just a few hours.
I never have done anything outside of the guidelines. Something IS wrong.
Some people say "Yahoo and MSN provide more relevant search results than Google". Personally, I won't switch to those search engines. Too many times my queries showed no results on Yahoo/MSN while Google gave me what I was looking for. Google Images is also, in my opinion, far more powerful than Yahoo or MSN image search. Frankly, I don't think Google would stay #1 if their search results were so irrelevant :-) Spam is everywhere, and taking the example of MSN, I have noticed that it's very easy to rank #1 if the search phrase is in the domain name (then you just need a few backlinks). Also, MSN has problems indexing pages which have many hyphens in the URL. Yahoo has relevant results, but it can be very slow to index new pages; moreover, the number of pages indexed by Yahoo is weak compared to Google's, and I often get no results when I use it.
This is my personal opinion.
If this thread gets much longer google will have to spend yet more $ to index it :)
Could Google expand their capacity by moving to shared hosting?
Maybe we should all chip in and help these guys out; after all, they don't have any money to fix the problem (sorry about the sarcasm, but I could not resist).
I also think their space troubles are why we see the omitted results so early. I once saw the omitted-results link appear after just 6 pages, out of 55 million results. That's so NOT professional, but let's hope they get more space soon.
IMHO - when a company goes public, it is because they need the capital for expansion purposes. The need for money to buy the equipment was probably known before the IPO. The statement was probably made to prepare shareholders and investors for cash being put into the company rather than paid out as higher dividends. To an outsider (shareholder or investor), a reduction in the normal cashflow would cause panic to set in - stock prices drop and then a big sell-off of stock.
Someone mentioned that Billy G said MSN was looking to spend some big cash to compete with Google and expected a 10% reduction in share value - is Google preparing the public for the same thing?
There is a word battle taking place between these 2 giants and sometimes we need to read between the lines and look beyond the technical aspects. Who knows, maybe this is the beginning of the implementation plan for all the new hardware?
|Oh, and by the way the gloves are off and I am a real Dr - maybe I got lucky becoming a manager away from the dark side of "the farce". |
Just a little note if you weren't serious - sorry about that small chip on shoulder!
I was serious about my PhD, and there are surely often enough "bosses" who might not have the relevant theoretical or practical education. BUT it doesn't seem to be like that with Google, since I doubt that so many PhDs in computer science and electrical engineering have no idea what the potential problems of their products are.
Of course a PhD doesn't make you perfect, but if one had to assess the likelihood that they understand their company, I'd guess there is a high enough chance. Higher than if their CEO had some Harvard MBA or law degree, imo. On the other hand, Bill Gates, with mostly practical experience, made it quite big too.
|On the other hand Bill Gates, with mostly practical experience, made it quite big too. |
err... He's made it bigger than anyone else on the planet... sans PhD.
So, Google has a storage problem eh? Could it be because they are out indexing every single thing they can get their hands on? Could it be that 30-40% of their index is duplicate content in one form or another? How about cleaning up the index first and then worry about increased storage. Forget about the BIG NUMBERS game for a while and focus on quality. That is what we've come to know Google for, the quality of the SERPs.
There is so much going on right now with the Google index and it appears like it all ties in together. "We have a huge machine crisis." Sites are dropping from the index, pages are not getting crawled like they should be, people complaining left and right. Something has to come to a head here very shortly.
If we look at the activity of Googlebot over the past year, we can clearly see that the bot is ravenous and will index just about anything. I wonder if Googlebot hit a few (thousand) sites and got caught in a loop and couldn't get out. ;)
|err... He's made it bigger than anyone else on the planet... sans PhD. |
Which was kinda the point.. doh.. ;)
Nevertheless, comparing a random sample of 100 people with PhDs against a random sample of people without them, one would still expect more knowledge of a subject in the PhD group, given they did their PhDs in that subject.
On the other hand should you compare the PhD group vs a purely practically experienced group things might be different.
Really depends on which shoulder you wear your chip on. :)
[edited by: mattg3 at 1:14 pm (utc) on May 4, 2006]
|Forget about the BIG NUMBERS game for a while and focus on quality. |
Exactly. Bigger is not better - it's just stupid. Now they have a royal mess on their hands and no short term solution in sight.
I expect this most recent SNAFU (as detailed in the missing pages thread) to last throughout the summer.
Christ how could they not have seen this coming and been more prepared for it? Their engineers are writing checks their skills can't cash.
They can drop my old homepage, which has been offline and 301-redirected since November 2004 - that would save them approx. 20,000 pages.
not much - but a start ;-)))))
They definitely need to drop old pages. I have pages indexed and cached that have not been around for 6 months; all 404s.
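The cleanup being asked for here can be sketched from the webmaster's side: given crawl results as (URL, status) pairs, list the pages that arguably should no longer be indexed. The URLs and statuses below are made up for illustration:

```python
def stale_urls(crawl_results):
    """Return URLs that answered 404 (gone) or 301 (moved permanently)."""
    return [url for url, status in crawl_results if status in (404, 301)]

# Hypothetical crawl data -- example.com paths invented for the sketch.
results = [
    ("http://example.com/old-home", 301),   # redirected since 2004
    ("http://example.com/deleted", 404),    # gone for 6 months
    ("http://example.com/live", 200),
]
print(stale_urls(results))
```

A real crawler would of course fetch the statuses itself and might keep a 301 target around until the redirect has been seen consistently for a while.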
You know, if they really want to save some room, they should not cache the following words:
the, and, or
Those three words probably account for 5% of the data kept in the cache. That's a quick fix.
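For anyone curious, that 5% figure is easy to measure on real text. A tiny sketch (the three stopwords are from the post above; the sample sentence is invented):

```python
import re
from collections import Counter

def stopword_fraction(text, stopwords=("the", "and", "or")):
    """Fraction of word occurrences that are one of the given stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    hits = sum(counts[w] for w in stopwords)
    return hits / max(len(words), 1)

sample = "The cat and the dog sat, and the bird flew."
print(f"{stopword_fraction(sample):.0%}")
```

On typical English prose the fraction for just these three words tends to land in the mid single digits, so the poster's guess is not far off.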
For "your" information:
New York Times (2006-04-21): Google Posts 60% Gain in Earnings [nytimes.com]
This is the "interview last month" with the original "huge machine crisis" quote...
|Google continued to make substantial capital investments, mainly in computer servers, networking equipment and space for its data centers. It spent $345 million on these items in the first quarter, more than double the level of last year. Yahoo, its closest rival, spent $142 million on capital expenses in the first quarter. |
Google has an enormous volume of Web site information, video and e-mail on its servers, Mr. Schmidt said. "Those machines are full. We have a huge machine crisis."
Need to register to login & read the article? Visit BugMeNot [bugmenot.com].
or pure, pure google!
someone should open "Save Google!" paypal account and i'll make a $5 deposit.
So will new sites get more sandboxed?
I quickly went through this whole thread and found a number of postings with wrong figures about the number of machines used at Google, although the link given in the original post contains interesting information on just that:
|Google does not disclose technical details, but estimates of the number of computer servers in its data centers range up to a million. |
In figures: 1,000,000 PCs!
Sorry if someone else clarified this before and I missed it. BTW, the NY Times article did not require any log-in yesterday; today it does. Mysterious. I recall a thread at WW on the statistics of error probability in large computer clusters. So my theory is: HAL has taken over and forced Mr Schmidt to buy even more ;)
|So, Google has a storage problem eh? Could it be because they are out indexing every single thing they can get their hands on? Could it be that 30-40% of their index is duplicate content in one form or another? How about cleaning up the index first and then worry about increased storage |
But don't they need to index the duplicate content to know it's duplicate content? They get rid of duplicate content by identifying and filtering the duplicated pages from search results--not by removing the data from their hard drives.
|But don't they need to index the duplicate content to know it's duplicate content? |
Not really. In this instance, I'm referring to query-string duplicate content. As g1smd pointed out in another topic, vBulletin presents 12 different URI scenarios to the spider that all lead to the same content. Googlebot has been digging deeper and deeper into queries over the past 18 months. Personally, I think it goes too deep into query strings, and that's where part of the problem lies.
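The vBulletin-style duplication described here can be illustrated with a small sketch. Which query parameters actually select content (here `t` and `p`) is an assumption for the example, as are the forum URLs:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Parameters assumed to actually select content; everything else
# (session ids, highlight terms, ...) is treated as noise.
SIGNIFICANT = {"t", "p"}

def collapse(url):
    """Reduce a URL to its content-selecting parts so variants compare equal."""
    scheme, netloc, path, query, _ = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(query) if k in SIGNIFICANT)
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))

# Hypothetical variants that all serve the same thread.
variants = [
    "http://forum.example.com/showthread.php?t=42",
    "http://forum.example.com/showthread.php?t=42&highlight=foo",
    "http://forum.example.com/showthread.php?s=abc123&t=42",
]
print({collapse(u) for u in variants})   # all three collapse to one URL
```

A crawler that canonicalizes this way only has to store one copy per thread; one that doesn't can easily index the same content a dozen times, which is the point being made above.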