I was reading the New York Times article Microsoft and Google Set to Wage Arms Race [nytimes.com] and a paragraph on page 2 caught my eye. It quotes Eric Schmidt (Google CEO) admitting that they have problems storing more web site information because their "machines are full".
I am a webmaster who has had problems getting and keeping my webpages indexed by Google. I follow Google's guidelines to the letter and I have not practiced any black hat SEO techniques.
Here are some problems I have been having:
1. Established websites having 95%+ pages dropped from Google's index for no reason.
2. New webpages being published on established websites not being indexed (pages that were launched as long as 6-8 weeks ago).
3. New websites being launched and not showing up in the SERPs (for as long as 12 months).
We're all well aware that Google has algo problems handling simple directives such as 301 and 302 redirects, duplicate indexing of www and non-www webpages, canonical issues, etc.
Does anybody think that Google's "huge machine crisis" has anything to do with any of the problems I mentioned above?
[edited by: tedster at 5:03 pm (utc) on May 3, 2006]
If they use this to buy $10,000 servers (which would be a very bulky, name-brand server: dual proc, a couple of gigs of RAM, 6x300GB SCSI drives, give or take), they can buy 150,000 servers. That's 225,000,000 gigabytes of uncompressed information storage on servers that can handle thousands of requests per second EACH.
If they went with no-name dual-proc servers (SATA, a couple of gigs of RAM), you're talking $2,000 apiece with the same storage amount as the SCSI server, which will give you 1,125,000,000 gigabytes of storage on 750,000 servers. You're beyond petabytes there.
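For anyone who wants to check the arithmetic in the two paragraphs above, here is a quick sketch in Python. The $1.5 billion budget and the 1,500 GB per server are not stated outright in the post. They are simply what the poster's own totals imply (150,000 servers at $10,000 each, and 225,000,000 GB spread over 150,000 servers), so treat both as assumptions.

```python
# Back-of-envelope check of the server figures quoted above.
# ASSUMPTIONS (implied by the poster's totals, not stated directly):
BUDGET_USD = 150_000 * 10_000   # 150,000 servers at $10,000 each
GB_PER_SERVER = 1_500           # 225,000,000 GB / 150,000 servers


def fleet(price_per_server_usd):
    """Return (server_count, total_storage_gb) for a given unit price."""
    servers = BUDGET_USD // price_per_server_usd
    return servers, servers * GB_PER_SERVER


brand_name = fleet(10_000)   # name-brand SCSI box
no_name = fleet(2_000)       # no-name SATA box

for label, (servers, gb) in (("name brand", brand_name), ("no-name", no_name)):
    # Decimal units: 1 PB = 1,000,000 GB
    print(f"{label}: {servers:,} servers, {gb:,} GB (~{gb / 1_000_000:,.0f} PB)")
```

Both totals come out the same as the figures quoted: about 225 petabytes in the first case and a little over an exabyte in the second. Six 300 GB drives would be 1,800 GB raw, so the 1,500 GB per server presumably allows for some overhead, but that reading is a guess.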
Something else is going on here.
CEOs have a knack for understatement when they have problems.
That observation sounds like an overstatement to me. :-)
The knives are out - Bill G's talking of a fighting fund of billions to compete and is prepared to wear a share price drop of 10% to say it. So folks - this is serious.
This is a big-time battle in play and guys, yes you and me - we just live on the ground watching the troops run left and right, fixing our fences, playing with collateral damage.
I kinda wonder if Matt and the team are doing a high wire act, balancing relations with webmasters and customers whilst getting shot daily by the corporate executive team for not fixing the issues fast enough. Welcome to "big time" IT.
Whatever Google's greater objective, they've gotta be confident that they'll pull the product way in front of their competitors - so stick with the ride - if you can hold on - perhaps months!
[edited by: Whitey at 6:27 am (utc) on May 4, 2006]
...they can buy 150,000 servers.
I have seen exactly that theory in print -- can't find it right now (I may be reading too much material at the moment!). The guess was approximately 100,000 servers, plus networking hardware and the physical room to hold it all.
By the end of 2004 or so, the previous infrastructure was already guesstimated at 60,000 to 100,000 servers. Of course, Google is not confirming or denying those numbers, but we can be sure that "10,000 servers" of 4 years ago is not even close to today's count.
Let's think about 18 billion urls and growing, with stored cache information for several versions of most of them. Plus related data tables, with the raw crawl information sharded into useful pieces, tagged and with some basic analysis already performed. And a lot of this now replicated across how many data centers?
Plus the dedicated crawler machines and so on. And all of that just for search. Now throw in the rest of the products and data types they offer. We are getting very big, very fast.
I once worked in a data center for a global financial corporation. All we did, all day long, was collect transaction data from around the world, cobbling together inputs from about 70 heterogeneous technologies. We would clean the data on AS/400 machines and then upload the (hopefully) proven data to a mainframe.
We were a staff of maybe 80 or so, just trying to preserve data integrity on a much smaller (and much more stable) pile of critical data than Google wrestles with. Making and storing backup tapes alone was nasty, boring and demanding work. This experience gave me at least a small sense of the issues involved with large data sets.
So yes, I can appreciate that they have a challenge right now at Google. Given the magnitude of their current task, it's amazing that the "regular user" hasn't noticed all that much disruption! And then there's the issue of coordinating their rapidly mushrooming work force. I don't envy the job one bit.
I am also pulling for Google to get on top of their game again. It won't happen tomorrow, but maybe, soon, we can hope.
[edited by: tedster at 6:22 am (utc) on May 4, 2006]
Something else is going on here.
It is not just about storage. It is also about synchronising all this data across hundreds of thousands of servers while making sure the integrity of all other services is intact.
It is about good, experienced management and a responsible decision-making process. It is about experimenting with one algo and only porting it to all the other servers once it is 100% tested. Same goes for new crawlers.
It is about trying to integrate too many services too quickly (e.g. AdSense, AdWords, Search, plus Analytics) into one huge system without first making sure there is enough infrastructure to support it. Kind of a "let's test on the go and see, we can always add more servers" approach (yeah, right).
Storage space is the least of their problems if you ask me. Good management is.
[edited by: Web_speed at 6:32 am (utc) on May 4, 2006]
Other sites with hundreds or thousands of pages listed have dropped down to 1 page or a handful.
What's going on? I posted on MC's blog but no one answers.
Anyone have any answers? I'm very worried right now.
"we told 'em the results are crap -
- now they can see it -
- now they know we won't fix it for a while -
- once the word gets around the big ad spenders we'll strap up that revenue depletion we've been worrying about -
- look, it's already working, ad revenues are jumping again "
- the bottom line is, there's always a silver lining my friends - but I think it's Google's: winning new business, forcing revenue in and building a better product whilst we wear the collateral damage. No offence intended, but that's business, I suppose.
it's probably tougher to find good DCs. there are a ton of them out there--as i said in a previous post--but how many will have the square footage, connectivity, and redundancy required for a site like Google? i know they have a big one in Ann Arbor which seems to be a pretty recent thing and i'm sure they'll continue to use some of their build more.
I've heard others say they have 5,000 page sites with 19,000+ listings recently, no explanation for that.
I also have old sites that have had hundreds or thousands of listings that have dropped to one page or just a handful. Some are blogs and some are like rate-me sites, but every page is unique and not all of them carry AdSense code or any advertiser for that matter other than a single site sponsor.
Something is definitely wrong. Some of the caches are dating back to Jan '04.
I watched one site go from 260 listings down to a single page in just a few hours.
I never have done anything outside of the guidelines. Something IS wrong.
This is my personal opinion.
Someone mentioned that Billy G mentioned that MSN was looking to spend some big cash to compete with Google and expected a 10% reduction in share value - is Google preparing the public for the same thing?
There is a word battle taking place between these 2 giants and sometimes we need to read between the lines and look beyond the technical aspects. Who knows, maybe this is the beginning of the implementation plan for all the new hardware?
Oh, and by the way the gloves are off and I am a real Dr - maybe I got lucky becoming a manager away from the dark side of "the farce".
Just a little note if you weren't serious - sorry about that small chip on shoulder!
I was serious about my PhD, and there are surely often "bosses" who might not have the relevant theoretical or practical education, BUT it doesn't seem to be like that with Google, since I doubt that so many PhDs in computer science and electrical engineering have no idea what the potential problems of their products are.
Of course a PhD doesn't make you perfect, but if one had to assess the likelihood that they understand their company, I'd guess that there is a high enough chance. Higher than if their CEO had some Harvard MBA or law degree, imo. On the other hand Bill Gates, with mostly practical experience, made it quite big too.
There is so much going on right now with the Google index and it appears like it all ties in together. "We have a huge machine crisis." Sites are dropping from the index, pages are not getting crawled like they should be, people complaining left and right. Something has to come to a head here very shortly.
If we look at the activity of Googlebot over the past year, we can clearly see that the bot is ravenous and will index just about anything. I wonder if Googlebot hit a few (thousand) sites and got caught in a loop and couldn't get out. ;)
err... He's made it bigger than anyone else on the planet... sans PhD.
Which was kinda the point.. doh.. ;)
Nevertheless, comparing a random sample of 100 people with PhDs against a random sample of people without, one would still expect higher knowledge of a subject in the PhD group, given they did their PhDs in that subject.
On the other hand should you compare the PhD group vs a purely practically experienced group things might be different.
Really depends on which shoulder you wear your chip on. :)
[edited by: mattg3 at 1:14 pm (utc) on May 4, 2006]
Forget about the BIG NUMBERS game for a while and focus on quality.
Exactly. Bigger is not better - it's just stupid. Now they have a royal mess on their hands and no short term solution in sight.
I expect this most recent SNAFU (as detailed in the missing pages thread) to last throughout the summer.
Christ how could they not have seen this coming and been more prepared for it? Their engineers are writing checks their skills can't cash.
You know, if they really want to save some room they should not cache the following words:
The, and, or
Those three words probably account for 5% of the data kept in the cache. That's a quick fix.
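The 5% figure is the poster's guess, and it is easy to measure on any sample text. Here is a minimal sketch in Python that works out what share of the word characters in a piece of text is taken up by "the", "and" and "or". The function name and the sample sentence are made up for illustration, and none of this says anything about how Google actually stores its cache.

```python
import re

STOP_WORDS = {"the", "and", "or"}  # the three words named in the post


def stop_word_share(text):
    """Fraction of word characters in `text` taken up by STOP_WORDS."""
    words = re.findall(r"[a-z']+", text.lower())
    total = sum(len(w) for w in words)
    if total == 0:
        return 0.0  # empty or non-alphabetic input
    stopped = sum(len(w) for w in words if w in STOP_WORDS)
    return stopped / total


# Made-up sample, deliberately heavy on stop words; not real cache data.
sample = "The cat and the dog or the bird"
share = stop_word_share(sample)
print(f"{share:.1%} of the word characters are 'the', 'and' or 'or'")
```

Run it over a few of your own pages to see how close the real share comes to 5%.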
New York Times (2006-04-21): Google Posts 60% Gain in Earnings [nytimes.com]
This is the "interview last month" with the original "huge machine crisis" quote...
Google continued to make substantial capital investments, mainly in computer servers, networking equipment and space for its data centers. It spent $345 million on these items in the first quarter, more than double the level of last year. Yahoo, its closest rival, spent $142 million on capital expenses in the first quarter.
Google has an enormous volume of Web site information, video and e-mail on its servers, Mr. Schmidt said. "Those machines are full. We have a huge machine crisis."
Need to register to login & read the article? Visit BugMeNot [bugmenot.com].
Google does not disclose technical details, but estimates of the number of computer servers in its data centers range up to a million.
In figures: 1,000,000 PCs!
Sorry if someone else has clarified this before and I overlooked it. BTW the NY Times article did not require any log-in yesterday. Today it does. Mysterious. I recall a thread at WW on the statistics of error probability in large computer clusters. So my theory is: Hal has taken over and forced Mr Schmidt to buy even more ;)
So, Google has a storage problem, eh? Could it be because they are out indexing every single thing they can get their hands on? Could it be that 30-40% of their index is duplicate content in one form or another? How about cleaning up the index first and then worrying about increased storage?
But don't they need to index the duplicate content to know it's duplicate content? They get rid of duplicate content by identifying and filtering the duplicated pages from search results--not by removing the data from their hard drives.
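To make the point above concrete, here is a rough sketch in Python of filtering duplicates at results time. It assumes the simplest possible approach, an exact hash of the normalised page text. That is an assumption made for illustration and not Google's actual method, which is not public. Real search engines are generally understood to use near-duplicate detection, which is more involved. The URLs and function names are hypothetical.

```python
import hashlib


def fingerprint(page_text):
    """Hash of case- and whitespace-normalised text; identical pages match."""
    normalised = " ".join(page_text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def filter_duplicates(pages):
    """pages: list of (url, text). Return the URLs to show; first copy wins."""
    seen = set()
    shown = []
    for url, text in pages:
        fp = fingerprint(text)  # needs the full page text, so it must be crawled
        if fp not in seen:
            seen.add(fp)
            shown.append(url)
    return shown


# Hypothetical crawl: the www and non-www copies carry the same content.
crawled = [
    ("http://example.com/", "Hello   World"),
    ("http://www.example.com/", "hello world"),
    ("http://example.com/about", "About us"),
]
shown = filter_duplicates(crawled)
print(shown)
```

All three pages still had to be fetched and stored in `crawled` before the duplicate could be spotted. Only the list of results shown gets shorter.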