Home / Forums Index / Google / Google SEO News and Discussion
Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

Google's "Caffeine" includes a rewrite of the Google File System

 10:13 pm on Aug 14, 2009 (gmt 0)

As it invites the world to play in a mysterious sandbox it likes to call "Caffeine," Google is testing more than just a "next-generation" search infrastructure. It's testing at least a portion of a revamped software architecture that will likely underpin all of its online applications for years to come.

Speaking with The Reg, über-Googler Matt Cutts confirms that the company's new Caffeine search infrastructure is built atop a complete overhaul of the company's custom-built Google File System, a project two years in the making. At least informally, Google refers to this file system redux as GFS2.

"There are a lot of technologies that are under the hood within Caffeine, and one of the things that Caffeine relies on is next-generation storage," he says. "Caffeine certainly does make use of the so-called GFS2."

Reported at The Register




 10:30 pm on Aug 14, 2009 (gmt 0)

This is really good reading... and good technical reporting, too. Thanks tangor and thanks to The Register.

There's an earlier Register piece from Wednesday [theregister.co.uk] that goes more in-depth about GFS2 (Google Filesystem 2). The following quote is from Page 2:

[The original GFS approach of a] single master can handle only a limited number of files. The master node stores the metadata describing the files spread across the chunkservers, and that metadata can't be any larger than the master's memory. In other words, there's a finite number of files a master can accommodate.

With its new file system - GFS II? - Google is working to solve both problems. Quinlan and crew are moving to a system that uses not only distributed slaves but distributed masters. And the slaves will store much smaller files. The chunks will go from 64MB down to 1MB.
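To put rough numbers on the quoted point that a master's metadata must fit in memory, here is a minimal back-of-the-envelope sketch. The figure of 64 bytes of metadata per chunk is an assumption for illustration (the published GFS paper cites less than 64 bytes per 64MB chunk; nothing in the article says what GFS2 uses), and the function names are hypothetical. File namespace metadata is ignored.

```python
MB = 1024 ** 2
PB = 1024 ** 5

def chunk_count(total_bytes, chunk_bytes):
    """Number of chunks needed to hold total_bytes (rounded up)."""
    return -(-total_bytes // chunk_bytes)

def master_metadata_bytes(total_bytes, chunk_bytes, bytes_per_chunk=64):
    """Memory a single master would need for chunk metadata alone.

    bytes_per_chunk=64 is an assumed figure, not something the article states.
    """
    return chunk_count(total_bytes, chunk_bytes) * bytes_per_chunk

# Compare the old 64MB chunk size with the 1MB size quoted above, for 1 PB of data.
for chunk_mb in (64, 1):
    chunks = chunk_count(1 * PB, chunk_mb * MB)
    meta_gb = master_metadata_bytes(1 * PB, chunk_mb * MB) / 1024 ** 3
    print(f"{chunk_mb:>2}MB chunks: {chunks:,} chunks, ~{meta_gb:.0f} GB of chunk metadata")
```

Under these assumptions, cutting the chunk size by a factor of 64 multiplies the chunk metadata by 64 for the same amount of data, which would fit with the article's point that the smaller chunks arrive together with distributed masters.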


 10:50 pm on Aug 14, 2009 (gmt 0)

What I found fascinating was this quote from Matt Cutts in the article:

Matt Cutts is the man who oversees the destruction of spam on the world's most popular search - the PageRank guru who typically opines about the ups and downs of Google's search algorithms. So, on Monday afternoon, when Cutts posted a blog post revealing a "secret project" to build a "next-generation architecture for Google's web search," many seemed to think this was some sort of change in search-ranking philosophy. But Cutts made it perfectly clear that this is merely an effort to upgrade the software sitting behind its search engine.

"The new infrastructure sits 'under the hood' of Google's search engine," read his blog post, "which means that most users won't notice a difference in search results."

On another thread [webmasterworld.com...] there was some discussion about the similarities of "old g" and "new g". I think this answers that question.


 10:58 pm on Aug 14, 2009 (gmt 0)

Interesting reading that.

Caffeine is about the search index. But GFS2 is designed specifically for applications like Gmail and YouTube, applications that - unlike an indexing system - are served up directly to the end user. Such apps require ultra-low latency, and that's not something the original GFS was designed for.

Wave being one of those ultra-low latency applications too presumably.


 3:58 am on Aug 15, 2009 (gmt 0)

Wave being one of those ultra-low latency applications too presumably.

Absolutely. With voice data, anything beyond a fractionally small amount of latency and people won't use it.


 4:12 am on Aug 15, 2009 (gmt 0)


I'd bet that a lot of PCIe connected Solid State Drives are being used to do a lot of the things described (and inferred) - knowing Google it's probably designed specifically for them.

If memory is limiting the number of chunks that masters index then you either need to massively increase memory (at a huge cost) or you need to introduce near-memory speed directly connected storage with ultra low latency - which is the definition of Enterprise PCIe Solid State Drives (which are huge and cost efficient for the tasks they are used for).

Having tested such devices I'm of the opinion that they are game changers, in that they will completely change the way people code, and if I'm right that Google master servers are using them then it may already be coming to pass.

It's scary how fast storage systems will be in 3 years (Enterprise now will be commodity by then) - it's time to prepare for that transition now - other people already are.

Google have the cash (and talented people) to do all sorts of amazing things - and the storage hardware is catching up with their ambitions. Data processing on a massive scale relies on overcoming many potential weak points, the largest issue used to be the limitations of mechanical hard drives (not a problem when spooling large files but a big issue when you try to get latency down); that issue has been solved and the processing power, network infrastructure, distributed computing algorithms and massive datacentre build outs are all in place... game on.


 6:44 am on Aug 15, 2009 (gmt 0)

Great article, lots of interesting tidbits. Google continually excels at what they do, no argument from me on that front.

just as Yahoo!, Facebook, and others are working to improve the speed of Hadoop - the open source platform based on MapReduce - Google is eternally tweaking the original

I'd love to spend a few hours at one of the engineers' workstations to get an idea of just how efficient the system is.


 9:39 am on Aug 15, 2009 (gmt 0)

I'd bet that a lot of PCIe connected Solid State Drives are being used to do a lot of the things described (and inferred) - knowing Google it's probably designed specifically for them.

Good thinking. I'd say that you were pretty much bang on the money with that, as it all ties up quite nicely:


 1:00 pm on Aug 15, 2009 (gmt 0)

>>I'd bet that a lot of PCIe connected Solid State Drives are being used to do a lot of the things described (and inferred) - knowing Google it's probably designed specifically for them.

I was thinking the same when I read this again. Tweaked SSDs should be far superior to anything that's spinning. In fact, I would think this is an economical solution for Google despite the cost of SSDs.


 12:17 am on Aug 16, 2009 (gmt 0)

Solid State Drives

I agree that SSDs are going to change computing, especially web servers, in a huge way.

However, I would be willing to bet that Google's "large quantities of inexpensive machines" philosophy means they have not yet made the switch to SSD.

No doubt they will start plugging them in when prices come down some more, but for the moment I suspect their HDs are still spinning.


 1:01 am on Aug 16, 2009 (gmt 0)

With 1 million servers or more, they'll probably have a mix for quite a while. But here's an article from May 2008:

Intel gains SSD orders from Google [digitimes.com]

[edited by: tedster at 7:02 pm (utc) on Aug. 16, 2009]


 6:24 pm on Aug 16, 2009 (gmt 0)

The speculation in that article was never officially commented on as far as I can remember, but it's almost certainly true - so it's very apt to bring it up.

It would be easy to assume that Google are using SATA Intel SSDs like the X25 M or E - but (as the article speculates) it's almost certainly a bespoke Marvell PCIe controller with Intel sourced flash chips (given Marvell's June 2008 press release on their PCIe SSD controller). It'll be interesting to see if they stick with them or consider other PCIe vendors.

It's interesting that the well known aversion to expensive hardware has been sidestepped by Google here - the benefits of Enterprise SSDs are too big to ignore.

brotherhood of LAN

 7:26 pm on Aug 16, 2009 (gmt 0)

With that scale and such an obviously well thought out system, I wonder where the bottleneck is in their crawling/indexing/ranking/serving.


 6:47 pm on Aug 19, 2009 (gmt 0)

I'm noticing some striking changes in the actual search results.
For example:
Google - Now
Search Phrase                SERP Pos.  URL            # of Results
3 word phrase 1 (singular)   10         Home Page      2,770,000
3 word phrase 1 (plural)     179        Category Page  3,340,000
4 word phrase 1 (singular)   5          Home Page      1,960,000
4 word phrase 1 (plural)     7          Home Page      2,260,000

Google - Caffeine
Search Phrase                SERP Pos.  URL            # of Results
3 word phrase 1 (singular)   84         Product Page   2,690,000
3 word phrase 1 (plural)     3          Home Page      3,760,000
4 word phrase 1 (singular)   1          Home Page      2,420,000
4 word phrase 1 (plural)     3          Home Page      3,140,000

All searches were done within 5 minutes in total.
The same 3 word and 4 word phrases were used on both the current Google search and the Google Caffeine search.

Note the varying # of Results, SERP Positioning and Landing Page.

The disparity between the # of Results and SERP Positioning becomes even greater when using some of our "seasonal" search phrases.

Anyone else seeing this, even though they claim that it doesn't affect SERPs?


 7:06 pm on Aug 19, 2009 (gmt 0)

The way I hear Google on this is that they don't intend Caffeine to affect results - but they certainly know that it might. That's why Matt Cutts asked for feedback about the discrepancies webmasters might see in the search results.

Thanks for the data on this. The changing total number of results is particularly interesting, I think, because it is more reflective of an infrastructure change. We do need to remember, however, that the number reported is only a rough estimate. Still, there may be some clues in there.


 8:54 pm on Aug 19, 2009 (gmt 0)

Right, I know about the notification, but I just have a hard time clicking on the "Dissatisfied? Help us improve" link, because I'm not necessarily dissatisfied.
I did some searches to see where a specific company's URL came up, and these are my results. Wow, look at the variance in the numbers on some of the search phrases. The number of results seems to be averaging higher, but the benchmark domain I was looking at seems to have better SERP Positioning.
Phrase 1 on Google Now and Phrase 1 on Google Caffeine are the same, etc.

Google - Now
Search Phrase   SERP Pos.   # of Results
Phrase 1        >100        35,300,000
Phrase 2        >200         1,310,000
Phrase 3        86             990,000
Phrase 4        5              598,000
Phrase 5        84           1,130,000
Phrase 6        161          1,060,000
Phrase 7        5              566,000
Phrase 8        159         34,800,000

Google - Caffeine
Search Phrase   SERP Pos.   # of Results
Phrase 1        19           8,080,000
Phrase 2        11           9,970,000
Phrase 3        11           1,350,000
Phrase 4        3            1,130,000
Phrase 5        22           1,340,000
Phrase 6        21          20,100,000
Phrase 7        2            1,030,000
Phrase 8        15           8,060,000

10 minutes differential on these searches.
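
For anyone who wants to line the two tables up, here is a minimal sketch of one way to do it. The numbers are copied from the tables in this post; the function names and the treatment of ">100" as an open-ended lower bound are illustrative choices, not anything Google provides. Keep in mind that the result counts are only rough estimates.

```python
# Data copied from the post: phrase -> (SERP position as written, reported result count).
NOW = {
    "Phrase 1": (">100", 35_300_000),
    "Phrase 2": (">200", 1_310_000),
    "Phrase 3": ("86", 990_000),
    "Phrase 4": ("5", 598_000),
    "Phrase 5": ("84", 1_130_000),
    "Phrase 6": ("161", 1_060_000),
    "Phrase 7": ("5", 566_000),
    "Phrase 8": ("159", 34_800_000),
}
CAFFEINE = {
    "Phrase 1": ("19", 8_080_000),
    "Phrase 2": ("11", 9_970_000),
    "Phrase 3": ("11", 1_350_000),
    "Phrase 4": ("3", 1_130_000),
    "Phrase 5": ("22", 1_340_000),
    "Phrase 6": ("21", 20_100_000),
    "Phrase 7": ("2", 1_030_000),
    "Phrase 8": ("15", 8_060_000),
}

def parse_position(text):
    """Return (position, open_ended); '>100' means 'somewhere beyond 100'."""
    text = text.strip()
    if text.startswith(">"):
        return int(text[1:]), True
    return int(text), False

def compare(now, caffeine):
    """Per phrase: places gained on Caffeine and the ratio of reported result counts."""
    rows = {}
    for phrase, (pos_text, count_now) in now.items():
        pos_now, open_ended = parse_position(pos_text)
        pos_caf, _ = parse_position(caffeine[phrase][0])
        count_caf = caffeine[phrase][1]
        rows[phrase] = {
            "places_gained": pos_now - pos_caf,  # a lower bound when open_ended
            "open_ended": open_ended,
            "count_ratio": count_caf / count_now,
        }
    return rows

rows = compare(NOW, CAFFEINE)
for phrase, r in rows.items():
    prefix = "more than " if r["open_ended"] else ""
    print(f"{phrase}: moved up {prefix}{r['places_gained']} places, "
          f"result count x{r['count_ratio']:.2f}")

higher_counts = sum(1 for r in rows.values() if r["count_ratio"] > 1)
print(f"{higher_counts} of {len(rows)} phrases report more results on Caffeine")
```

With these figures, every phrase ranks higher on Caffeine and six of the eight report a larger result count.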


 9:20 pm on Aug 21, 2009 (gmt 0)

Having watched the vid and previous mentions from Matt about "turning the dials" on the algo... I think that G has reformulated how their control panel functions.

Think of the old algo like a scale with one fulcrum... now with the new architecture they have many more fulcrums for fine tuning of results based on their analysis of user motive and all of the new integrated search verticals - video, news (results based on time relevancy), profiles etc...

I also think that G is very concerned about market share and competition. They have been sitting on the Caffeine update for a bit - waiting for a bump like Bing... As soon as the market share started changing - out came the announcement.

I wonder if they have really been working on Caffeine vs. the regular algo - thus all of the crap that is G's SERPs right now... less relevant results than ever IMO.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved