

How big is Google?

How many terabytes?


xcandyman

1:09 pm on May 8, 2003 (gmt 0)

10+ Year Member



I have heard that they handle hundreds of terabytes of data. Has anyone got a more exact-ish kind of figure?

GG?

Thanks

Steve

takagi

1:53 pm on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Can't find the number of terabytes, but in the Google Revenue to skyrocket to $750 million this year [webmasterworld.com] thread you can find the following information about Google's hardware collection:

...said it consists of more than 54,000 servers designed by Google engineers from basic components. It contains about 100,000 processors and 261,000 disks...

creative craig

1:55 pm on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Isn't it the biggest Linux cluster in the world?

Craig

creative craig

1:58 pm on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I found this on the Google site:

www.google.com/press/highlights.html

It gives a good rundown of their technical highlights :)

Craig

Critter

2:04 pm on May 8, 2003 (gmt 0)

10+ Year Member



Let's do the math :)

We're working with 3,083,324,652 pages.

The cache of pages, assuming each page averages 4K and is compressed with gzip (50% compression is typical for text), comes to about 5.74 terabytes.

The forward index is going to be about 60% of this size, and so is the backward index--so we have another 3.44 terabytes for each index = 6.88 terabytes.

Custom indexes like title tag and heading indexes, as well as domain/url indexes (for link/allinurl) are going to be substantially smaller...let's say 10% of the compressed cache = 0.57 terabytes.

And, of course, to be conservative we'll double our total, because you never know what Google's up to :)

Grand total estimate: 2 x (5.74 + 6.88 + 0.57) = 26.38 terabytes. Which, by the way, can be comfortably housed in *one* NetApp NAS cabinet.
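If you want to check the arithmetic yourself, here's a quick back-of-envelope script. The 4K page size, 50% gzip ratio, and the 60%/10% index fractions are just the assumptions above, not anything Google has confirmed:

# Rough back-of-envelope estimate of index size, using the assumptions
# from this post; none of these figures are confirmed by Google.
PAGES = 3_083_324_652            # pages claimed on the Google homepage
AVG_PAGE_BYTES = 4 * 1024        # assume ~4K of text per page
GZIP_RATIO = 0.5                 # assume ~50% compression for text
TB = 2 ** 40                     # bytes per terabyte (binary)

cache_tb = PAGES * AVG_PAGE_BYTES * GZIP_RATIO / TB
index_tb = 2 * 0.60 * cache_tb   # forward + backward index, ~60% of cache each
custom_tb = 0.10 * cache_tb      # title/heading/url indexes, ~10% of cache

total_tb = 2 * (cache_tb + index_tb + custom_tb)   # doubled to be conservative
print(f"cache {cache_tb:.2f} TB, indexes {index_tb:.2f} TB, "
      f"custom {custom_tb:.2f} TB, total {total_tb:.2f} TB")
# prints roughly: cache 5.74 TB, indexes 6.89 TB, custom 0.57 TB, total 26.42 TB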

Peter

creative craig

2:10 pm on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This page from Stanford may help, explaining how they store data and how it is used and retrieved for a search:

The anatomy of a search engine [www-db.stanford.edu]

Craig

takagi

2:11 pm on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What about the 700 million Usenet messages and the 425 million images?

xcandyman

2:21 pm on May 8, 2003 (gmt 0)

10+ Year Member



On their jobs page, one listing states:

Building large-scale distributed file systems and other infrastructure that makes it possible to reliably and efficiently manage and process hundreds of terabytes of information.

That's where I got the hundreds from.

Steve

creative craig

2:23 pm on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You only gave them 26 terabytes of data though, not hundreds :)

Craig

Critter

2:29 pm on May 8, 2003 (gmt 0)

10+ Year Member



Add in the Usenet stuff and the images (which are resampled to be smaller) and I'd still only add another 10 terabytes or so.

The article didn't seem to definitely say that Google *had* hundreds of terabytes, just that they were building the infrastructure to handle it.

Remember as well that if this is duplicated across 8 datacenters or so, we'll have to multiply (30 terabytes x 8 = 240 terabytes).

But as far as non-duplicated data goes, probably only 20-30 terabytes at this time.

Guys... you've got to remember that a terabyte is a LOT of information.

Peter

brotherhood of LAN

2:35 pm on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What if they were to check pages for duplicate content... Is the web really only a couple of megs big? ;)

If they ordered the pages by similarity and just encoded the differences between them (rough sketch of the idea below), I could see those terabytes becoming much more manageable. Hats off to them anyway; they do a good job of sorting a lot of information they have never read or heard of!
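Something along these lines, assuming you had already grouped near-duplicate pages together. The two sample pages here are made up purely for illustration, and this is not how Google actually stores its cache:

# Toy sketch of the "store only the difference between similar pages" idea.
import difflib
import zlib

page_a = """<html>
<head><title>Widget Catalogue</title></head>
<body>
<h1>Widgets</h1>
<p>Our full range of widgets, updated spring 2003.</p>
<p>Contact us for a quote.</p>
</body>
</html>
"""
page_b = page_a.replace("spring 2003", "summer 2003")  # near-duplicate page

# Store page_b as a diff against page_a instead of storing it in full.
delta = "".join(difflib.unified_diff(page_a.splitlines(keepends=True),
                                     page_b.splitlines(keepends=True),
                                     n=0))

full_size = len(zlib.compress(page_b.encode()))
delta_size = len(zlib.compress(delta.encode()))
print(f"compressed full page: {full_size} bytes, compressed delta: {delta_size} bytes")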

xcandyman

2:36 pm on May 8, 2003 (gmt 0)

10+ Year Member



You have also got to remember all the other data they store which is not available via the search, like all the data on banned sites. I bet they have information on nearly every static page on the internet.

The mind boggles. How about some stats GG?

Steve

creative craig

2:40 pm on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have just been reading the page I quoted, The Anatomy of a Search Engine. We need a fresher version, as that paper is based on a 24 million page index, and the index has grown somewhat since then. ;)

The amount of info in that article is enough to keep any interested SEO busy for a month.

Craig

takagi

2:50 pm on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In the thread started by GoogleGuy to inform webmasters about filtering expired domains [webmasterworld.com] two months ago, he wrote in message 19:

.. we're using multiple sources of data stretching back to 2000 in order to cross-check. No one should get caught accidently.

So xcandyman is right about Google having a lot more information about (static) pages.

xcandyman

2:53 pm on May 8, 2003 (gmt 0)

10+ Year Member



Just been looking into size and speed, and I came across what I want as a new connection instead of my DSL:

Dense Wavelength Division Multiplexing (DWDM)
Multiple data signals carried on different wavelengths of light.

10.9 Tbps, or 10,900,000 Mbps

It would take about 2 seconds to transfer 2.5 terabytes on this baby.
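The arithmetic behind that, for anyone checking (ignoring protocol overhead, and treating terabytes and terabits loosely as decimal units):

# Rough sanity check on the DWDM transfer-time claim above.
link_tbps = 10.9                    # line rate: 10.9 terabits per second
data_tb = 2.5                       # data to move: 2.5 terabytes
seconds = data_tb * 8 / link_tbps   # terabytes -> terabits, then divide by rate
print(f"{seconds:.2f} seconds")     # ~1.83 s, so "about 2 seconds" holds up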

Drool :)