Forum Moderators: Robert Charlton & goodroi
I have a search feature on my site: when someone types in a search phrase, I massage the keywords, query my database for matching data, then display a link with the title of the page my query thinks is best. How does Google's programming logic look through billions of rows of data and return results in seconds? Does each search query scan the entire index, or is the index split up somehow?
I understand that for some this may be an elementary question, but I am simply trying to wrap my head around the concept from a programming-logic side of things. "Here are some words, Mr. DB, now look through your 18 billion rows, and you only have 1 second to do it!"
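To make the "is it split?" idea concrete, here's a toy sketch (my own illustration, not Google's actual code or terminology). It shows the two tricks that avoid scanning all the rows: an inverted index, which maps each word straight to the set of documents containing it, and sharding, where that index is partitioned across many machines that are all queried in parallel and whose partial answers are merged.

```python
# Toy sketch of a sharded inverted index. Assumptions of mine: integer doc
# IDs, hash-based partitioning, and simple whitespace tokenization.
from collections import defaultdict

NUM_SHARDS = 4  # real clusters spread the index over thousands of machines


def build_shards(docs):
    """docs: {doc_id: text}. Each document's words are indexed on one shard,
    chosen by hashing the doc ID -- no shard ever holds the whole index."""
    shards = [defaultdict(set) for _ in range(NUM_SHARDS)]
    for doc_id, text in docs.items():
        shard = shards[hash(doc_id) % NUM_SHARDS]
        for word in text.lower().split():
            shard[word].add(doc_id)  # posting list: word -> doc IDs
    return shards


def search(shards, query):
    """Fan the query out to every shard, intersect posting lists per shard
    (all query words must match), then merge the partial result sets."""
    words = query.lower().split()
    results = set()
    for shard in shards:  # in a real cluster these lookups run in parallel
        partial = None
        for word in words:
            postings = shard.get(word, set())
            partial = postings if partial is None else partial & postings
        results |= partial or set()
    return results
```

The key point: answering a query costs a few dictionary lookups per shard plus a merge, not a scan of 18 billion rows, and adding machines adds capacity almost linearly.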
Thanks everyone!
Google published a 15-page PDF paper called "The Google File System" (GFS) that describes their architecture and data handling, complete with nice pretty diagrams of a basic GFS cluster. I don't pretend I fully get it. Also, the paper is dated 2003, and I understand that today's GFS is not the same but a more sophisticated evolution. In fact, we can be pretty sure that Big Daddy recently changed things even further in this area.
You'll see in the paper that each of their multitude of GFS clusters uses thousands of servers, with many client machines contacting a top-level master server that then directs them to one of many "chunkservers" -- and even at this point in the flow, no file data is cached. You'll also read about "snapshots" and the fact that data is not overwritten; instead, changes are appended and become read-only.
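That master-then-chunkserver flow can be sketched roughly like this (a simplified illustration with my own class and variable names, not the paper's actual interfaces): the client asks the master only for metadata, i.e. which chunkserver holds which chunk, and then fetches the actual bytes directly from that chunkserver, so the master never serves file data itself.

```python
# Simplified sketch of the GFS read path. Names (Master, Chunkserver, read)
# are my own; the real system adds replication, leases, caching of chunk
# locations, and much more.
CHUNK_SIZE = 4  # GFS uses 64 MB chunks; tiny here for illustration


class Master:
    """Holds only metadata: file name -> ordered list of (chunk_id, server)."""
    def __init__(self):
        self.chunk_map = {}

    def locate(self, filename, offset):
        chunk_index = offset // CHUNK_SIZE
        return self.chunk_map[filename][chunk_index]  # (chunk_id, server name)


class Chunkserver:
    """Stores the actual chunk bytes (append-only in the real system)."""
    def __init__(self):
        self.chunks = {}


def read(master, servers, filename, offset):
    chunk_id, server_name = master.locate(filename, offset)  # metadata hop
    data = servers[server_name].chunks[chunk_id]              # data hop
    return data[offset % CHUNK_SIZE]
```

Keeping the master out of the data path is what lets one master coordinate thousands of chunkservers without becoming the bottleneck.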
The paper says: "The largest cluster to date [in 2003] provides hundreds of terabytes of storage across thousands of disks on over a thousand machines."
It may be "too much information," but it's authoritative and better than me blabbing on about what I think it says. Here's the HTML page where you can access the PDF download.
Page at Google Labs:
The Google File System [labs.google.com]