|The Google Search Query - a technical look|
We rarely link out to unofficial information or opinion about Google. But this post on Google Blogoscoped fills in a much needed area of information, so I'm making an exception. It is well researched and compiles some great information into one compact article.
It's very easy for us, as ordinary or even not-so-ordinary search users, to have a limited view of what happens when we make a query on Google. The results are fast, often faster than searches over our own website data, so we tend not to appreciate all that's going on. This article gives a nice breakdown of the process, ending with a rather startling statement:
|Today it's estimated that [a single Google query] travels across 700-1000 machines, a figure that has nearly doubled since 2006 perhaps due in part to the introduction of Google Universal. |
The article ends with three reference sources - an mp3 file, a video and a technical PDF document. When you take in even a bit of the complexity - and we're not even talking about the actual ranking algorithm yet - it's a wonder that we don't see more technical problems than we occasionally do.
[edited by: tedster at 1:40 pm (utc) on July 9, 2008]
The 7-page PDF reference is a document not to be missed. Here's a small taste:
|The final result of this first phase of query execution is an ordered list of document identifiers (docids) |
...the second phase involves taking this list of docids and computing the actual title and uniform
resource locator of these documents, along with a query-specific document summary. Document servers
(docservers) handle this job, fetching each document from disk to extract the title and the keyword-in-context
...As with the index lookup phase, the strategy is to partition the processing of all documents by
• randomly distributing documents into smaller shards
• having multiple server replicas responsible for handling each shard, and
• routing requests through a load balancer.
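For anyone who wants to see the shape of that second phase, here is a rough Python sketch of the three bullet points above. It is only an illustration and not Google's code. The shard count, the replica count, the class and function names, and the use of a CRC32 hash as a stand-in for "randomly distributing" documents are all my own assumptions, and the round-robin balancer is just the simplest policy that fits.

```python
import itertools
import zlib

NUM_SHARDS = 4          # hypothetical value, not from the paper
REPLICAS_PER_SHARD = 3  # hypothetical value, not from the paper


def shard_for(docid):
    """Map a docid to a shard. A CRC32 hash stands in for random distribution."""
    return zlib.crc32(str(docid).encode()) % NUM_SHARDS


def kwic(text, query, width=30):
    """Return a keyword-in-context window around the first match of the query."""
    pos = text.lower().find(query.lower())
    if pos < 0:
        return text[: 2 * width]  # no match: fall back to the start of the text
    start = max(0, pos - width)
    end = min(len(text), pos + len(query) + width)
    return text[start:end]


class DocServer:
    """One replica holding the documents of a single shard (hypothetical class)."""

    def __init__(self, shard, replica, docs):
        self.name = f"doc-{shard}-{replica}"
        self.docs = docs  # docid -> (title, url, text)
        self.served = 0

    def fetch(self, docid, query):
        title, url, text = self.docs[docid]
        self.served += 1
        return {"docid": docid, "title": title, "url": url,
                "snippet": kwic(text, query)}


class LoadBalancer:
    """Round-robin over the replicas of one shard (simplest possible policy)."""

    def __init__(self, replicas):
        self.replicas = replicas
        self._cycle = itertools.cycle(replicas)

    def pick(self):
        return next(self._cycle)


def build_cluster(corpus):
    """Split the corpus into shards and put a balancer in front of each one."""
    shards = [{} for _ in range(NUM_SHARDS)]
    for docid, doc in corpus.items():
        shards[shard_for(docid)][docid] = doc
    return [
        LoadBalancer([DocServer(s, r, shards[s]) for r in range(REPLICAS_PER_SHARD)])
        for s in range(NUM_SHARDS)
    ]


def second_phase(cluster, ranked_docids, query):
    """Turn the ordered docid list from phase one into title/URL/snippet results."""
    return [cluster[shard_for(d)].pick().fetch(d, query) for d in ranked_docids]
```

Calling `build_cluster` on a small dictionary of `docid -> (title, url, text)` entries and then passing a ranked docid list to `second_phase` returns the results in the same order as the input list, with each lookup going to whichever replica of the right shard is next in rotation. Replicas of a shard share the same read-only data, which is why any of them can answer.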
Google has started using the 74.125... IP blocks for the first time ever this year, and massively expanded the number of 209.85... IPs in use. I am given to wonder if this new hardware is different to that which has gone before, running different software and working in a different way. Is this more Big Daddy, or is an Uncle now running the show?
A lot of that hardware has only come online in the last few months, and most of it has NOT been seen in rotation via google.com until the last few weeks. Large chunks were online without any GFE name access (not even assigned a name) for quite some time before that, presumably running in test mode before being properly commissioned.
At the same time, older hardware in the 216.239... and other such blocks has been taken offline and presumably retired.
[edited by: g1smd at 1:01 am (utc) on July 12, 2008]
My own guesswork and suspicions run parallel, g1smd. Something big is shifting, and it may well include a hardware expansion. Not Big Daddy this time, but Dutch Uncle or something. Guess we'll learn about it over time.
A few weeks ago, I found some new Google IPs that I haven't listed anywhere online yet.
It's a completely new system, and a different pattern to anything that I have seen before. They are methodical guys and girls at the 'plex and so it means something... but what, I have no idea.
When I get more data I'll post something.