Forum Moderators: Robert Charlton & aakk9999 & andy langton & goodroi

Google's "Caffeine" includes a rewrite of the Google File System

     
10:13 pm on Aug 14, 2009 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:6897
votes: 377


As it invites the world to play in a mysterious sandbox it likes to call "Caffeine," Google is testing more than just a "next-generation" search infrastructure. It's testing at least a portion of a revamped software architecture that will likely underpin all of its online applications for years to come.

Speaking with The Reg, über-Googler Matt Cutts confirms that the company's new Caffeine search infrastructure is built atop a complete overhaul of the company's custom-built Google File System, a project two years in the making. At least informally, Google refers to this file system redux as GFS2.

"There are a lot of technologies that are under the hood within Caffeine, and one of the things that Caffeine relies on is next-generation storage," he says. "Caffeine certainly does make use of the so-called GFS2."

Reported at The Register

[theregister.co.uk...]

10:30 pm on Aug 14, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


This is really good reading... and good technical reporting, too. Thanks tangor and thanks to The Register.

There's an earlier Register piece from Wednesday [theregister.co.uk] that goes more in-depth about GFS2 (Google Filesystem 2). The following quote is from Page 2:

[The original GFS approach of a] single master can handle only a limited number of files. The master node stores the metadata describing the files spread across the chunkservers, and that metadata can't be any larger than the master's memory. In other words, there's a finite number of files a master can accommodate.

With its new file system - GFS II? - Google is working to solve both problems. Quinlan and crew are moving to a system that uses not only distributed slaves but distributed masters. And the slaves will store much smaller files. The chunks will go from 64MB down to 1MB.
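To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. It is not from the article. The 1 PB data set is a made-up example, and the 64 bytes of master metadata per chunk is an assumed upper bound borrowed from the original GFS paper; nothing public says GFS2 uses the same figure.

```python
# Rough estimate of master metadata as chunk size shrinks.
# Assumptions (not from the article): 1 PiB of stored data and
# 64 bytes of in-memory metadata per chunk.
MB = 1024 ** 2
GB = 1024 ** 3
PB = 1024 ** 5

def chunk_count(total_bytes, chunk_bytes):
    """Number of chunks needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // chunk_bytes)

def metadata_bytes(total_bytes, chunk_bytes, bytes_per_chunk=64):
    """Memory the master needs to track every chunk."""
    return chunk_count(total_bytes, chunk_bytes) * bytes_per_chunk

for chunk_size in (64 * MB, 1 * MB):
    chunks = chunk_count(PB, chunk_size)
    meta_gb = metadata_bytes(PB, chunk_size) / GB
    print(f"{chunk_size // MB:>2} MB chunks: {chunks:,} chunks, "
          f"{meta_gb:.0f} GB of metadata")
```

Under those assumptions the metadata for the same data grows from about 1 GB to about 64 GB when chunks shrink from 64MB to 1MB, which suggests why a single master's memory would stop being enough and the work would have to be spread across several masters.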

10:50 pm on Aug 14, 2009 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:6897
votes: 377


What I found fascinating was this quote from Matt Cutts in the article:

Matt Cutts is the man who oversees the destruction of spam on the world's most popular search - the PageRank guru who typically opines about the ups and downs of Google's search algorithms. So, on Monday afternoon, when Cutts posted a blog post revealing a "secret project" to build a "next-generation architecture for Google's web search," many seemed to think this was some sort of change in search-ranking philosophy. But Cutts made it perfectly clear that this is merely an effort to upgrade the software sitting behind its search engine.

"The new infrastructure sits 'under the hood' of Google's search engine," read his blog post, "which means that most users won't notice a difference in search results."

On another thread [webmasterworld.com...] there was some discussion about the similarities of "old g" and "new g". I think this answers that question.

10:58 pm on Aug 14, 2009 (gmt 0)

Full Member

10+ Year Member

joined:Aug 12, 2003
posts:204
votes: 0


Interesting reading that.

Caffeine is about the search index. But GFS2 is designed specifically for applications like Gmail and YouTube, applications that - unlike an indexing system - are served up directly to the end user. Such apps require ultra-low latency, and that's not something the original GFS was designed for.

Wave being one of those ultra-low latency applications too presumably.

3:58 am on Aug 15, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 13, 2005
posts:1077
votes: 0


Wave being one of those ultra-low latency applications too presumably.

Absolutely, anything beyond a fractionally small amount of latency with voice data and people won't use it.
4:12 am on Aug 15, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 9, 2004
posts:1435
votes: 0


Hmm...

I'd bet that a lot of PCIe connected Solid State Drives are being used to do a lot of the things described (and inferred) - knowing Google it's probably designed specifically for them.

If memory is limiting the number of chunks that masters index then you either need to massively increase memory (at a huge cost) or you need to introduce near-memory speed directly connected storage with ultra low latency - which is the definition of Enterprise PCIe Solid State Drives (which are huge and cost efficient for the tasks they are used for).

Having tested such devices I'm of the opinion that they are game changers, in that they will completely change the way people code, and if I'm right that Google master servers are using them then it may already be coming to pass.

It's scary how fast storage systems will be in 3 years (Enterprise now will be commodity by then) - it's time to prepare for that transition now - other people already are.

Google have the cash (and talented people) to do all sorts of amazing things - and the storage hardware is catching up with their ambitions. Data processing on a massive scale relies on overcoming many potential weak points, the largest issue used to be the limitations of mechanical hard drives (not a problem when spooling large files but a big issue when you try to get latency down); that issue has been solved and the processing power, network infrastructure, distributed computing algorithms and massive datacentre build outs are all in place... game on.

6:44 am on Aug 15, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:July 29, 2007
posts:1724
votes: 76


Great article, lots of interesting tidbits. Google continually excels at what they do, no argument from me on that front.

just as Yahoo!, Facebook, and others are working to improve the speed of Hadoop - the open source platform based on MapReduce - Google is eternally tweaking the original

I'd love to spend a few hours at one of the engineers' workstations to get an idea of just how efficient the system is.
9:39 am on Aug 15, 2009 (gmt 0)

Full Member

10+ Year Member

joined:Aug 12, 2003
posts:204
votes: 0


I'd bet that a lot of PCIe connected Solid State Drives are being used to do a lot of the things described (and inferred) - knowing Google it's probably designed specifically for them.

Good thinking. I'd say that you were pretty much bang on the money with that, as it all ties up quite nicely:
[informationweek.com...]

1:00 pm on Aug 15, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member billys is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 1, 2004
posts:3181
votes: 0


>>I'd bet that a lot of PCIe connected Solid State Drives are being used to do a lot of the things described (and inferred) - knowing Google it's probably designed specifically for them.

I was thinking the same when I read this again. Tweaked SSDs should be far superior to anything that's spinning. In fact, I would think this is an economical solution for Google despite the cost of SSDs.

12:17 am on Aug 16, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 4, 2001
posts: 1264
votes: 12


Solid State Drives

I agree that SSDs are going to change computing, especially web servers, in a huge way.

However I would be willing to bet that Google's "large quantities of inexpensive machines" philosophy means they have not yet made the switch to SSD.

No doubt they will start plugging them in when prices come down some more, but for the moment I suspect their HDs are still spinning.

1:01 am on Aug 16, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


With 1 million servers or more, they'll probably have a mix for quite a while. But here's an article from May 2008:

Intel gains SSD orders from Google [digitimes.com]

[edited by: tedster at 7:02 pm (utc) on Aug. 16, 2009]

6:24 pm on Aug 16, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 9, 2004
posts:1435
votes: 0


The speculation in that article was never officially commented on as far as I can remember, but it's almost certainly true - so it's very apt to bring it up.

It would be easy to assume that Google are using SATA Intel SSDs like the X25 M or E - but (as the article speculates) it's almost certainly a bespoke Marvell PCIe controller with Intel-sourced flash chips (given Marvell's June 2008 press release on their PCIe SSD controller). It'll be interesting to see if they stick with them or consider other PCIe vendors.

It's interesting that the well known aversion to expensive hardware has been sidestepped by Google here - the benefits of Enterprise SSDs are too big to ignore.

7:26 pm on Aug 16, 2009 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4843
votes: 2


With that scale and such an obviously well thought out system, I wonder where the bottleneck is in their crawling/indexing/ranking/serving.
6:47 pm on Aug 19, 2009 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 8, 2004
posts: 563
votes: 1


I'm noticing some striking changes in the actual search results.
For example:
Google - Now
Search Phrase                 SERP Pos.   Landing Page     # of Results
3 word phrase 1 (singular)    10          Home Page        2,770,000
3 word phrase 1 (plural)      179         Category Page    3,340,000
4 word phrase 1 (singular)    5           Home Page        1,960,000
4 word phrase 1 (plural)      7           Home Page        2,260,000

Google - Caffeine
Search Phrase                 SERP Pos.   Landing Page     # of Results
3 word phrase 1 (singular)    84          Product Page     2,690,000
3 word phrase 1 (plural)      3           Home Page        3,760,000
4 word phrase 1 (singular)    1           Home Page        2,420,000
4 word phrase 1 (plural)      3           Home Page        3,140,000

All searches were done within 5 minutes.
The same 3 word and 4 word phrases were used on both the current Google search and the Google Caffeine search.

Note the varying # of Results, SERP Positioning and Landing Page

The disparity between the # of Results and SERP Positioning becomes even greater when using some of our "seasonal" search phrases.

Anyone else seeing this, even though they claim that it doesn't affect SERPs?

7:06 pm on Aug 19, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


The way I hear Google on this is that they don't intend Caffeine to affect results - but they certainly know that it might. That's why Matt Cutts asked for feedback about the discrepancies webmasters might see in the search results.

Thanks for the data on this. The changing total number of results is particularly interesting, I think, because it is more reflective of an infrastructure change. We do need to remember, however, that the number reported is only a rough estimate. Still, there may be some clues in there.

8:54 pm on Aug 19, 2009 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 8, 2004
posts: 563
votes: 1


Right, I know about the notification, but I just have a hard time clicking on the "Dissatisfied? Help us improve" link, because I'm not necessarily dissatisfied.
I did some searches to see where a specific company's URL came up, and these are my results. WOW, look at the variance in the numbers on some of the search phrases. The number of results seems to be averaging higher, but the bench domain I was looking at seems to have better SERP positioning.
Phrase 1 on Google Now and Phrase 1 on Google Caffeine are the same, etc.

Google - Now
Search Phrase   SERP Pos.   # of Results
Phrase 1        >100        35,300,000
Phrase 2        >200        1,310,000
Phrase 3        86          990,000
Phrase 4        5           598,000
Phrase 5        84          1,130,000
Phrase 6        161         1,060,000
Phrase 7        5           566,000
Phrase 8        159         34,800,000

Google - Caffeine
Search Phrase   SERP Pos.   # of Results
Phrase 1        19          8,080,000
Phrase 2        11          9,970,000
Phrase 3        11          1,350,000
Phrase 4        3           1,130,000
Phrase 5        22          1,340,000
Phrase 6        21          20,100,000
Phrase 7        2           1,030,000
Phrase 8        15          8,060,000

10 minutes differential on these searches.
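
For anyone who wants to line the two sets of figures up side by side, a small Python sketch like the one below would do it. The numbers are copied from the tables above. The ">100" and ">200" entries are read here as positions of at least 100 and 200, which is an assumption, so the gains for Phrase 1 and Phrase 2 are minimums. The names DATA and compare are just illustrative.

```python
# (position now, results now, position on Caffeine, results on Caffeine)
# Assumption: ">100" and ">200" are treated as 100 and 200 (lower bounds).
DATA = {
    "Phrase 1": (100, 35_300_000, 19, 8_080_000),
    "Phrase 2": (200, 1_310_000, 11, 9_970_000),
    "Phrase 3": (86, 990_000, 11, 1_350_000),
    "Phrase 4": (5, 598_000, 3, 1_130_000),
    "Phrase 5": (84, 1_130_000, 22, 1_340_000),
    "Phrase 6": (161, 1_060_000, 21, 20_100_000),
    "Phrase 7": (5, 566_000, 2, 1_030_000),
    "Phrase 8": (159, 34_800_000, 15, 8_060_000),
}

def compare(data):
    """Return places gained and the Caffeine/current result-count ratio."""
    rows = []
    for phrase, (pos_now, res_now, pos_caf, res_caf) in data.items():
        rows.append({
            "phrase": phrase,
            "places_gained": pos_now - pos_caf,   # positive = ranks better
            "results_ratio": res_caf / res_now,   # above 1 = more results
        })
    return rows

rows = compare(DATA)
for row in rows:
    print(f"{row['phrase']}: {row['places_gained']:+d} places, "
          f"results x{row['results_ratio']:.2f}")
```

Read that way, the bench domain ranks higher on Caffeine for all eight phrases, and the reported result count goes up for six of the eight (Phrase 1 and Phrase 8 drop). Since the result counts are only estimates, the ratios are a rough indicator at best.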

9:20 pm on Aug 21, 2009 (gmt 0)

New User

10+ Year Member

joined:June 27, 2006
posts: 3
votes: 0


Having watched the vid and previous mentions from Matt about "turning the dials" on the algo... I think that G has reformulated how their control panel functions.

Think of the old algo like a scale with one fulcrum... now with the new architecture they have many more fulcrums for fine tuning of results based on their analysis of user motive and all of the new integrated search verticals - video, news (results based on time relevancy), profiles etc...

I also think that G is very concerned about market share and competition. They have been sitting on the Caffeine update for a bit - waiting for a bump like Bing... As soon as the market share started changing - out came the announcement.

I wonder if they have really been working on Caffeine vs. the regular algo - thus all of the crap that is G's SERPs right now... less relevant results than ever IMO.