Forum Moderators: Robert Charlton & goodroi
I have a search feature on my site: when someone types in a search phrase, I massage the keywords, query my database for matching data, then display a link with the title of the page my query thinks is best. How does Google's programming logic look through billions of rows of data and return results in seconds? Does each search query scan the entire index, or is the index split up somehow?
I understand that for some this may be an elementary question, but I am simply trying to wrap my head around the concept from a programming-logic side of things. "Here are some words, Mr. DB, now look through your 18 billion rows, and you only have 1 second to do it!"
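To make the "is it split?" idea concrete, here's a toy sketch (my own illustration, not Google's actual code or terminology). It shows the two tricks that avoid scanning all the rows: an inverted index, which maps each word straight to the set of documents containing it, and sharding, where that index is partitioned across many machines that are all queried in parallel and whose partial answers are merged.

```python
# Toy sketch of a sharded inverted index. Assumptions of mine: integer doc
# IDs, hash-based partitioning, and simple whitespace tokenization.
from collections import defaultdict

NUM_SHARDS = 4  # real clusters spread the index over thousands of machines


def build_shards(docs):
    """docs: {doc_id: text}. Each document's words are indexed on one shard,
    chosen by hashing the doc ID -- no shard ever holds the whole index."""
    shards = [defaultdict(set) for _ in range(NUM_SHARDS)]
    for doc_id, text in docs.items():
        shard = shards[hash(doc_id) % NUM_SHARDS]
        for word in text.lower().split():
            shard[word].add(doc_id)  # posting list: word -> doc IDs
    return shards


def search(shards, query):
    """Fan the query out to every shard, intersect posting lists per shard
    (all query words must match), then merge the partial result sets."""
    words = query.lower().split()
    results = set()
    for shard in shards:  # in a real cluster these lookups run in parallel
        partial = None
        for word in words:
            postings = shard.get(word, set())
            partial = postings if partial is None else partial & postings
        results |= partial or set()
    return results
```

The key point: answering a query costs a few dictionary lookups per shard plus a merge, not a scan of 18 billion rows, and adding machines adds capacity almost linearly.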
Thanks everyone!
Google published a 15-page PDF paper called "The Google File System" (GFS) that describes their architecture and data handling, complete with nice pretty diagrams of a basic GFS cluster. I don't pretend I fully get it. Also, the paper is dated 2003, and I understand that today's GFS is not the same but a more sophisticated evolution. In fact, we can be pretty sure that Big Daddy recently changed things even further in this area.
You'll see in the paper that each of their multitude of GFS clusters uses thousands of servers, with many client machines contacting a top-level master server that then directs them to one of many "chunkservers" -- and even at this point in the flow, no file data is cached. You'll also read about "snapshots" and the fact that data is not overwritten; instead, changes are appended and become read-only.
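That master-then-chunkserver flow can be sketched roughly like this (a simplified illustration with my own class and variable names, not the paper's actual interfaces): the client asks the master only for metadata, i.e. which chunkserver holds which chunk, and then fetches the actual bytes directly from that chunkserver, so the master never serves file data itself.

```python
# Simplified sketch of the GFS read path. Names (Master, Chunkserver, read)
# are my own; the real system adds replication, leases, caching of chunk
# locations, and much more.
CHUNK_SIZE = 4  # GFS uses 64 MB chunks; tiny here for illustration


class Master:
    """Holds only metadata: file name -> ordered list of (chunk_id, server)."""
    def __init__(self):
        self.chunk_map = {}

    def locate(self, filename, offset):
        chunk_index = offset // CHUNK_SIZE
        return self.chunk_map[filename][chunk_index]  # (chunk_id, server name)


class Chunkserver:
    """Stores the actual chunk bytes (append-only in the real system)."""
    def __init__(self):
        self.chunks = {}


def read(master, servers, filename, offset):
    chunk_id, server_name = master.locate(filename, offset)  # metadata hop
    data = servers[server_name].chunks[chunk_id]              # data hop
    return data[offset % CHUNK_SIZE]
```

Keeping the master out of the data path is what lets one master coordinate thousands of chunkservers without becoming the bottleneck.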
The paper says: "The largest cluster to date [in 2003] provides hundreds of terabytes of storage across thousands of disks on over a thousand machines."
It may be "too much information," but it's authoritative and better than me blabbing on about what I think it says. Here's the HTML page where you can access the PDF download.
Page at Google Labs:
The Google File System [labs.google.com]