Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

Google Reaches New Records Sorting 1PB Data In 6 Hours

 5:13 pm on Nov 27, 2008 (gmt 0)

Google Reaches New Records Sorting 1PB Data [googleblog.blogspot.com] In 6 Hours
We are excited to announce we were able to sort 1TB (stored on the Google File System as 10 billion 100-byte records in uncompressed text files) on 1,000 computers in 68 seconds. By comparison, the previous 1TB sorting record is 209 seconds on 910 computers.

Sometimes you need to sort more than a terabyte, so we were curious to find out what happens when you sort more and gave one petabyte (PB) a try. One petabyte is a thousand terabytes, or, to put this amount in perspective, it is 12 times the amount of archived web data in the U.S. Library of Congress as of May 2008. In comparison, consider that the aggregate size of data processed by all instances of MapReduce at Google was on average 20PB per day in January 2008.

It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers. We're not aware of any other sorting experiment at this scale and are obviously very excited to be able to process so much data so quickly.
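For anyone who wants to check the figures quoted above, here is a rough back-of-the-envelope sketch. It assumes decimal (SI) units and simply averages over the whole run; the helper names are made up for illustration and are not from Google's post.

```python
# Figures quoted in the post; decimal (SI) units are an assumption.
TB = 10**12
PB = 10**15
RECORD_BYTES = 100

def record_count(total_bytes, record_bytes=RECORD_BYTES):
    """Number of fixed-size records that fit in total_bytes."""
    return total_bytes // record_bytes

def throughput_mb_per_s(total_bytes, seconds, machines):
    """Average megabytes sorted per second, per machine, over the whole run."""
    return total_bytes / seconds / machines / 10**6

# 1TB in 68 seconds on 1,000 computers.
tb_run = throughput_mb_per_s(TB, 68, 1000)
# 1PB in six hours and two minutes on 4,000 computers.
pb_run = throughput_mb_per_s(PB, 6 * 3600 + 2 * 60, 4000)

print(record_count(TB))  # records in the 1TB run
print(record_count(PB))  # records in the 1PB run
print(round(tb_run, 1), round(pb_run, 1))
```

The record counts match the "10 billion" and "10 trillion" in the announcement, and the per-machine averages for the two runs come out in the same ballpark.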

Just sorting that amount of data is quite something. As they indicate, once you've sorted it, where do you put that information? Apparently, on 48,000 hard drives.



 6:12 pm on Nov 27, 2008 (gmt 0)

Seems like a good part of the speed came from innovative handling of what they call "stragglers" - parts of their computations that were performing slower than expected. I was also interested to learn that every time they ran a test, at least one of the 48,000 hard drives "broke" - hence the need for triple redundancy.
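To illustrate why stragglers matter, here is a toy simulation. It assumes one common approach, launching a backup copy of any task that runs past a threshold and keeping whichever copy finishes first. The announcement does not say this is what Google did, and the workload numbers below are invented.

```python
import random

def job_time(task_times):
    """A parallel phase finishes only when its slowest task finishes."""
    return max(task_times)

def with_backups(primary, backup, threshold):
    """Re-run any task slower than threshold; keep whichever copy finishes first.

    The backup copy is assumed to start at time `threshold`.
    """
    return [
        min(p, threshold + b) if p > threshold else p
        for p, b in zip(primary, backup)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical workload: 4,000 tasks of about 10 time units, 1% run 5x slower.
primary = [10 * (5 if rng.random() < 0.01 else 1) + rng.random() for _ in range(4000)]
backup = [10 + rng.random() for _ in range(4000)]

print(job_time(primary))
print(job_time(with_backups(primary, backup, threshold=12)))
```

Because the whole phase waits on its slowest task, a handful of slow machines sets the finish time unless something cuts them short.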


 7:06 pm on Nov 27, 2008 (gmt 0)

6 hours? I'd go out, eat dinner, take in a movie, do some shopping, come back and it still wouldn't be done. tsk tsk

Now, I can appreciate this is not the sort of thing you or I could do with an off-the-shelf PC (even if I turn off Vista's CPU-hogging sidebar and close ALL my programs!). What if I happen to acquire 1PB of data, and need it sorted. Can I hire Google to do it?

And can I please get the sorted results back as an Excel file?

This story is like nerd erotica


 9:05 pm on Nov 27, 2008 (gmt 0)

Hey! Abuse of power! lol


 9:27 pm on Nov 27, 2008 (gmt 0)

Sorry your early scoop was overlooked, Jake. I've cross-linked it now.

In the thread you posted, you asked: "Any ideas on when it will apply to the data pushes? Do you think the data pushes will happen more often, more updates?"

Google's article says:

By pushing the boundaries of these types of programs, we learn about the limitations of current technologies as well as the lessons useful in designing next generation computing platforms. This, in turn, should help everyone have faster access to higher-quality information.

So I'm guessing that they are planning on using their learning in-house, to the degree that it is practical. This time trial is apparently an enhancement of their MapReduce [labs.google.com] program, which they've been using in some form or another since 2004.
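For readers who haven't seen the idea, here is a minimal single-machine sketch of how a sort can be phrased in map/reduce terms: route records to key ranges, sort each range on its own, then concatenate. The function names and the sampling scheme are assumptions for illustration, not Google's implementation.

```python
from bisect import bisect_right

def pick_splitters(records, num_partitions, sample_step=3):
    """Choose partition boundaries from a sorted sample of the keys."""
    sample = sorted(records[::sample_step])
    step = max(1, len(sample) // num_partitions)
    return sample[step::step][: num_partitions - 1]

def map_phase(records, splitters):
    """Route each record to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(splitters) + 1)]
    for rec in records:
        partitions[bisect_right(splitters, rec)].append(rec)
    return partitions

def reduce_phase(partitions):
    """Sort each partition independently, then concatenate in partition order."""
    out = []
    for part in partitions:
        out.extend(sorted(part))
    return out

records = ["pear", "apple", "fig", "kiwi", "plum", "lime", "date", "melon"]
splitters = pick_splitters(records, num_partitions=3)
result = reduce_phase(map_phase(records, splitters))
print(result)
```

Since every key in one partition sorts before every key in the next, each partition could be sorted on a different machine and the outputs simply laid end to end.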


 9:37 pm on Nov 27, 2008 (gmt 0)

What amazes me is that they wrote the data 3 times.
To make sure we kept our sorted petabyte safe, we asked the Google File System to write three copies of each file to three different disks.

Could one assume that if the data had been written only once, the time would have been cut in half to 3 hours, not to 2 hours as one might think?
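One way to frame that question is a crude model where only the write share of the run shrinks with the number of copies. The write share is not given in Google's post, so `write_fraction` below is a purely hypothetical parameter, and the model assumes writing does not overlap with the rest of the job.

```python
def estimated_hours(total_hours, write_fraction, replicas_before=3, replicas_after=1):
    """Scale only the write share of the run by the change in replica count.

    Assumes write time is proportional to the number of copies and does not
    overlap with the rest of the job; both are simplifications.
    """
    write_hours = total_hours * write_fraction
    other_hours = total_hours - write_hours
    return other_hours + write_hours * replicas_after / replicas_before

# Try a few guesses for how much of the 6 hours went to writing output.
for fraction in (0.25, 0.5, 0.75, 1.0):
    print(fraction, estimated_hours(6, fraction))
```

Under this model, 3 hours corresponds to writes taking three quarters of the run, and 2 hours would require the entire run to be spent writing.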


 9:52 pm on Nov 27, 2008 (gmt 0)

I thought that 6 hours was about how long it takes the average human to find one decent Google search results page these days, so that ties up quite nicely.


 9:54 pm on Nov 27, 2008 (gmt 0)

I thought that 6 hours was about how long it takes the average human to find one decent Google search results page, so that ties up quite nicely.

:) Cheers!


 10:02 pm on Nov 27, 2008 (gmt 0)

So six times more computers and they could do it in an hour?
Were those computers doing other things at the time like serving web pages?

I want to see them try the same sort in 5 years on the newest computers ;-)


 10:09 pm on Nov 27, 2008 (gmt 0)

In 5 years, they'll be processing an Exabyte in an hour, while serving web pages...


 10:11 pm on Nov 27, 2008 (gmt 0)

I don't think it will leap that far that fast, but quad core should be super cheap by then, and they are probably running only single or dual core at best now on those 4,000 PCs. So maybe double the clock rate and double the cores per machine.


 10:21 pm on Nov 27, 2008 (gmt 0)

Well, you figure that a terabyte can hold roughly 1000 hours of standard video, and if 1 million people kept a video diary for 6 months, boom, there is an exabyte. Now if only Google could index video, then I could see the need. I think the amount of data that gets created keeps expanding, and there's a need for storage space to hold it.
Rough estimates put the size of the internet at a yottabyte, and Google is only processing about a billionth of it in 6 hours, wow.
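For scale, the SI prefixes step up by a factor of 1,000 each time, so the gaps between these units are easy to check. A tiny sketch (the helper name is just for illustration):

```python
# Power-of-ten exponent for each SI prefix used in this thread.
PREFIX_EXPONENT = {"tera": 12, "peta": 15, "exa": 18, "zetta": 21, "yotta": 24}

def ratio(smaller, larger):
    """How many times `smaller` fits into `larger`, both given as SI prefixes."""
    return 10 ** (PREFIX_EXPONENT[larger] - PREFIX_EXPONENT[smaller])

print(ratio("peta", "exa"))    # petabytes per exabyte
print(ratio("peta", "yotta"))  # petabytes per yottabyte
```

So a petabyte is a thousandth of an exabyte and a billionth of a yottabyte.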


 12:20 am on Nov 28, 2008 (gmt 0)

Hey, c'mon, the 6-hour slag about finding search results is just snarky: I have been digging for programming tips for the last 2 days and never had to go past page 2 on G.

Ok "<specific programming language error code here>" is not exactly a good measuring stick but it seems like it's been worse before.

I just wish they could put this computing power to use to stop the bloodshed and suffering in the world, rather than just running down their search results for the sake of it. :(


 6:15 am on Nov 28, 2008 (gmt 0)

Very cool but not very practical. Even if Google manages to index all of the knowledge known to man, there can still be only one top spot for any given term... which makes for a lot of not-number-ones that may never be seen anyway.


 7:11 am on Nov 28, 2008 (gmt 0)

However, storing lots of data is not the same as being able to sort that data very fast. It's an important difference.

I think there's plenty of practical value involved in speeding up computing processes - such as the ability to fold more complex intelligence into the algo. More complexity requires more computing, and a lot of Google's innovation in recent times has come from just such advances. You want better duplicate handling? It takes faster computing cycles to make that happen. Same thing with catching and dumping spam.


 9:59 am on Nov 28, 2008 (gmt 0)

When nerds get into a "mine is bigger than yours" mood...


 4:19 pm on Nov 28, 2008 (gmt 0)

bubble sort is faster


 12:28 am on Nov 29, 2008 (gmt 0)

That's fast. Can you imagine 1 petabyte of data? It could be the size of all webpages in the world.


 12:46 am on Nov 29, 2008 (gmt 0)

I'm more or less a layman but wouldn't the web be way more efficient if it was binary? What is the deal? <my tag> is infinitely less borg-ish than 0's and 1's.

Considering I'm still getting 'Dreamweaver Extension' marketing emails weekly, the usability argument for designers can't be much of a stretch: mom and pop can use software that outputs pages as binary. Voila, huge amounts of bytes saved in crunching.

Someone, who is more geeked out, please explain why this isn't a pressing issue like electric cars. So much electricity, hardware, etc. to be spared, why is WWW not in binary?


 1:00 am on Nov 29, 2008 (gmt 0)

To get a handle on that, you'd need to study the whole development of, first, the Internet, and then the world-wide web. A really quick answer is that the WWW was conceived as a way to give everyone a way to share documents, rather than just geeks sharing data. But in the final analysis, the web is still just 0's and 1's today. It just gets represented to us as something more comprehensible.

Happy research: Internet History [freesoft.org]; The WWW Project [w3.org]

Note that this sorting record that Google set did use "100-byte records" - and that's getting pretty small, but yes, it's still more than just one bit.


 3:31 am on Nov 29, 2008 (gmt 0)

Tedster, quote: "more complex intelligence"... I love that line because I was just thinking it IS still just ones and zeros, on and off, today. I was also thinking the net was designed by the military, not some search company. Military intelligence, more complex intelligence, all the same!

I feel that the search market, as it is now, could easily be turned on its head in a heartbeat and Google, Yahoo and MSN could be relegated to being good conversational topics but nothing more.


Some enthusiastic college student will develop a new browser that incorporates what PEOPLE really want, not what pleases investors and advertisers, and people will flock to that. (sorry Google, Chrome isn't it). This same enthusiastic kid will also realize that since everyone wants HIS/HER browser more than any other that the major search companies should pay HIM/HER handsomely for the privilege of being the default search engine.

That possibility has to give top executives nightmares, I'm sure each company has a vault of money ready to throw at such a kid (assuming they can't influence their way into the project in advance)... and I can't wait to see it happen. It will take timing, perhaps a significant advance in computer technology timed perfectly with the new browser but it's very possible and perhaps even probable.


 2:44 pm on Dec 2, 2008 (gmt 0)

they should make sorting data an olympic event; let all the pothead basement hackers get a crack @ it.


 3:24 am on Dec 3, 2008 (gmt 0)

pothead basement haxors are too busy working on netflix challenge for a million bux..


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved