Why does the 'Google Lag' exist?


Scarecrow - 11:46 pm on Oct 2, 2004 (gmt 0)


I'm a convert to the idea that Google is migrating to 64-bit Linux with a 64-bit file system. Presently they have a "virtual" 64-bit file system that relies on lots of Ethernet links and distributed networking behind the scenes to go out and fetch the data that make up each chunk. With the addressing power of a real 64-bit system, Google would improve performance quite dramatically all across the system. If the cost is as low as isitreal says, then migrating to 64-bit computing is a complete no-brainer. Look what they have to go through with their present 32-bit system:

"Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunk servers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range."
"The Google File System," by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Critter, a single inverted index consists of one docID per word per web page. Look at the space required with a 4-byte docID, assuming 4 billion pages with an average of 300 words each. I've multiplied by two here because Google uses an average of two docIDs per word per page: they have both a "fancy" and a "plain" inverted index, and the docID is also used elsewhere in the system:

4 bytes: 300 words * 4 billion pages * 4 bytes * 2 docIDs = 9.6 terabytes
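
For anyone who wants to rerun the arithmetic, here is a minimal sketch of the same estimate. The function name and the 5-byte comparison row are my own additions for illustration; the 300 words, 4 billion pages, and factor of two are the assumptions stated above, not measured figures.

```python
# Back-of-the-envelope size of the docID portion of an inverted index.
# All inputs are the rough assumptions from the post, not measured values.

TB = 10**12  # decimal terabyte, as used in the estimate above


def index_size_bytes(words_per_page, pages, docid_bytes, docids_per_word=2):
    """Bytes needed to store docIDs alone, ignoring all other index data."""
    return words_per_page * pages * docid_bytes * docids_per_word


size_4 = index_size_bytes(300, 4_000_000_000, 4)
size_5 = index_size_bytes(300, 4_000_000_000, 5)  # hypothetical 5-byte docID

print(size_4 / TB)  # 9.6
print(size_5 / TB)  # 12.0
```

So going from a 4-byte to a 5-byte docID would add roughly a quarter to this part of the index, before multiplying by the number of copies.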

The first thing that happens when a search is requested is a lookup in the inverted index. To distribute this load, multiple copies of this index probably exist in each data center. Multiply the above by some unknown number.

This is a lot of space, and that's just the space taken by the docID, whether it's all in memory or all on hard disk. Everything I've ever read about inverted indexes mentions the importance of compression. But no amount of compression gets you more than 4.29 billion unique values out of 4 bytes (32 bits).
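
The ceiling is just a power of two. A quick sketch (the helper name is mine, purely for illustration) shows the limit for 4 bytes and what one extra byte would buy:

```python
def max_ids(num_bytes):
    """Number of distinct values representable in num_bytes bytes."""
    return 2 ** (8 * num_bytes)


print(max_ids(4))  # 4294967296
print(max_ids(5))  # 1099511627776
```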

Now add the performance issue: a 32-bit CPU needs extra cycles to fetch a docID that has been expanded beyond 32 bits. (We're not even talking about calculating PageRank, because I think Google realized this was dead 18 months ago.)

Moving to 64-bit computing makes a lot of sense. They can define a new 5-byte integer type in the math library if they want to save space. But the point is, a 64-bit CPU could fetch this new type in one pass instead of two, and you don't take a performance hit.
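
To make the 5-byte idea concrete, here is a rough sketch of packing docIDs into 5 bytes each. This is only an illustration of the storage layout under my own assumptions (little-endian order, made-up function names); it says nothing about how Google actually stores docIDs, and Python won't show the one-pass versus two-pass fetch difference on real hardware.

```python
DOCID_BYTES = 5  # hypothetical width: 40 bits, about 1.1 trillion distinct IDs


def pack_docids(docids):
    """Concatenate docIDs as fixed-width 5-byte little-endian integers."""
    # int.to_bytes raises OverflowError if an ID does not fit in 5 bytes.
    return b"".join(d.to_bytes(DOCID_BYTES, "little") for d in docids)


def unpack_docids(buf):
    """Inverse of pack_docids: split the buffer into 5-byte integers."""
    return [
        int.from_bytes(buf[i:i + DOCID_BYTES], "little")
        for i in range(0, len(buf), DOCID_BYTES)
    ]


ids = [0, 4_294_967_296, 2**40 - 1]  # the middle value no longer fits in 4 bytes
packed = pack_docids(ids)
print(len(packed))                   # 15
print(unpack_docids(packed) == ids)  # True
```

Each ID costs 5 bytes instead of the 8 a full 64-bit integer would take, and a 64-bit register can still hold the whole value at once.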


Thread source: http://www.webmasterworld.com/google_archive/25989.htm
Brought to you by WebmasterWorld: http://www.webmasterworld.com