Google happens in the space between browser and search engine and destination content server, as an enabler or middleman between the user and his or her online experience. source [oreilly.com]
I type a query from any part of the world and within the blink of an eye the results appear. What is the mystery behind this? I am not planning to build one more Google, but I am a bit curious to get some insight.
* Clean interface: no graphic ads all over the place, no bloated code (take a look at the page source some time).
* Google bought an alien technology company several years ago, giving them access to trans-warp hyper-time multidimensional data processing, allowing them to offload the search processing into non-real time then feed the results back into real-time so that users do not notice any time passage in their own perceived space/time continuum. (Note: Matt Cutts and all other Googlers have to deny the existence of this technology because of the NDA they all signed, so they will never admit to using it.)
Google also keeps its entire index in RAM, so disk access times aren't a factor in responding to a query.
This is a myth: logically they don't have to, and I have come across a number of queries that evidently were served from disk, taking 1 second or more. These are not popular queries, but not very heavy ones either, which is a good indication that they come from disk rather than RAM.
More than 99.9% of the time someone else had already searched that term, so they only serve html data.
This is a very high estimate and I consider it unlikely - the problem is that search queries are very fragmented and new ones appear all the time, so caching is not that straightforward: I think Google was saying sometime ago that in a few months about 20-25% of queries are new to them, in other words never been used before.
The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines
Over a thousand machines? And that's just 1 datacentre.
Money, loads of money
Big machines with lots of blinking lights
The LEDs alone probably take the same power as South America, or some other daft statistic.
They are periodically caching their results based on previous searches
It would be stupid not to.
mad wizards from the Disc World
Let's just hope the Google File System wasn't designed by Bloody Stupid Johnson (0.34 seconds, and I can't believe that's a common search), with particular reference to the Post Office Mail Sorter.
The answer is, they don't. The content has been indexed, and indexing algorithms are able to locate a keyword in an index of millions of entries with a very small number of disk accesses (a handful).
The content has been indexed, and indexing algorithms are able to locate a keyword in an index of millions of entries with a very small number of disk accesses (a handful).
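To put rough numbers on "a handful" of disk accesses, here is a back-of-envelope sketch. It assumes the index is laid out as a tree of disk pages (B-tree style) holding 200 keys per page; both the layout and the figures are illustrative assumptions, not anything Google has published.

```python
import math

def binary_search_probes(entries):
    """Worst-case probes for a plain binary search over a flat sorted file."""
    return math.ceil(math.log2(entries))

def page_tree_levels(entries, keys_per_page):
    """Pages read from root to leaf when every page holds keys_per_page keys."""
    levels, reach = 1, keys_per_page
    while reach < entries:
        reach *= keys_per_page  # each extra level multiplies the reachable keys
        levels += 1
    return levels

entries = 10_000_000  # assumed number of keywords in the index
print(binary_search_probes(entries))                 # 24 probes
print(page_tree_levels(entries, keys_per_page=200))  # 4 page reads
```

If the top levels of such a tree are kept in memory, only the last page or two would actually have to come from disk.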
Let's say actual searching - locating the numeric IDs of the top 10 documents that should be shown - takes 0 seconds, i.e. it can't be faster. OK, then you have got 10 documents for which you need to pull the title and make relevant text snippets: this will cost you at least 1 disk seek per document unless they come from the same domain, so we are talking about 10 disk seeks. Disks are very slow for seeking - especially in concurrent environments, these 10 seeks will easily cost you 0.20 sec - this is already higher than many of Google's searches take, and actual pulling of documents is not the slowest part - so they do keep some searches cached, and definitely some frequently hit documents are kept in memory so that they can avoid a disk seek per document.
However, keeping the whole database in memory is very expensive even for Google - they definitely do not keep everything in RAM; most likely everything in the supplemental index is on disk and the main index is in memory.
I might be totally off on this, but I don't think it's millions. It is a spin of "why", "where" and "WHO" at this point. You are here, I am there and everybody else is elsewhere.
Pinned pointed? Houston, we have an index!
added, which is exactly what Majestic extends.
Relative to how many searches are not "repeats", back in January of this year, Udi Manber, who is Google's VP of Engineering, said that "20 to 25% of the queries we see today, we have never seen before" (see this thread [webmasterworld.com]). So caching of frequent search results can only take them just so far.
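A minimal sketch of a result cache, to show where that limit comes from. The class name, the capacity and the sample queries are made up for illustration; the point is only that a query seen for the first time is always a miss, so if 20 to 25% of queries are new, no cache of this kind can answer more than 75 to 80% of them.

```python
from collections import OrderedDict

class ResultCache:
    """Tiny LRU cache of finished result pages, keyed by the query string."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query, compute):
        if query in self.pages:
            self.pages.move_to_end(query)   # mark as most recently used
            self.hits += 1
            return self.pages[query]
        self.misses += 1                    # a never-seen query always lands here
        page = compute(query)
        self.pages[query] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        return page

def run_search(query):
    # Stand-in for the expensive index lookup (hypothetical).
    return f"<html>results for {query}</html>"

cache = ResultCache(capacity=100)
stream = ["red widgets", "blue widgets", "red widgets",
          "shiny blue widgets", "red widgets"]
for q in stream:
    cache.get(q, run_search)
print(cache.hits, cache.misses)  # 2 3
```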
OK, then you have got 10 documents for which you need to pull the title and make relevant text snippets: this will cost you at least 1 disk seek per document unless they come from the same domain, so we are talking about 10 disk seeks. Disks are very slow for seeking - especially in concurrent environments, these 10 seeks will easily cost you 0.20 sec
I think it's ironic that you used the answer to the mystery in your post, and probably didn't realize it. :)
Who says the seeks are done serially?
They've got to look-up 10 documents. But they don't have to look them up one-at-a-time.
They can be looked-up concurrently, on 10 different disks or disk arrays.
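A rough simulation of the difference, assuming each document sits on its own disk and each lookup costs one seek. The 20 ms seek cost and the function names are placeholders for illustration; the sleep stands in for the real disk read.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SEEK_SECONDS = 0.02  # assumed cost of one seek, for illustration only

def fetch_snippet(doc_id):
    """Pretend to seek on a disk and read the title/snippet for one document."""
    time.sleep(SEEK_SECONDS)
    return f"snippet for document {doc_id}"

def fetch_serial(doc_ids):
    # One seek after another: total time is roughly len(doc_ids) seeks.
    return [fetch_snippet(d) for d in doc_ids]

def fetch_parallel(doc_ids):
    # One worker per document, as if each lived on a different disk.
    with ThreadPoolExecutor(max_workers=len(doc_ids)) as pool:
        return list(pool.map(fetch_snippet, doc_ids))

def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

doc_ids = list(range(10))
serial_result, serial_time = timed(fetch_serial, doc_ids)
parallel_result, parallel_time = timed(fetch_parallel, doc_ids)
print(f"serial {serial_time:.3f}s, parallel {parallel_time:.3f}s")
```

The serial version pays for ten seeks back to back; the parallel one pays for roughly one, as long as the ten documents really are on ten different spindles.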
I doubt that anybody outside of Google knows what their search time figures really mean, in any case. Those numbers are pure marketing, and possibly fantasy.
especially in concurrent environments
Actually, the more concurrent the environment, the less seek time matters. The more concurrent the better. Disk servers typically do "elevator seeks". Queued seeks are rearranged in the queue in an optimal order.
Let's say there are queued seeks for tracks 1, 1000, 4, 2000, 7, 1600.
Using elevator seeks, these will be rearranged to 1, 4, 7, 1000, 1600, 2000.
When the head gets to the maximum track, it will work its way back the other way. So, the head basically continually sweeps back and forth across the disk, rather than going randomly hither and yon in the exact order in which the requests came in.
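Here is the same example as a small sketch of elevator ordering. It is a simplified model (it ignores requests that arrive mid-sweep and measures cost only as tracks travelled), and the function names are mine.

```python
def elevator_order(pending, head=0, direction=1):
    """Order queued track requests as one sweep in `direction`, then the sweep back."""
    if direction > 0:
        ahead = sorted(t for t in pending if t >= head)
        behind = sorted((t for t in pending if t < head), reverse=True)
    else:
        ahead = sorted((t for t in pending if t <= head), reverse=True)
        behind = sorted(t for t in pending if t > head)
    return ahead + behind

def head_travel(order, head=0):
    """Total number of tracks the head crosses when serving `order` in sequence."""
    total = 0
    for track in order:
        total += abs(track - head)
        head = track
    return total

queue = [1, 1000, 4, 2000, 7, 1600]
swept = elevator_order(queue)
print(swept)                                   # [1, 4, 7, 1000, 1600, 2000]
print(head_travel(queue), head_travel(swept))  # 7578 2000
```

Served in arrival order the head crosses 7578 tracks; served as one sweep it crosses 2000.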
[edited by: jtara at 1:07 am (utc) on Nov. 2, 2007]
The storage requirements for a full web search are mind-boggling (let's ignore the fact that even G can't index the whole web). Putting it all in RAM could only realistically be done by distributing the task.
When you look at pulling results you will see, as Majestic points out, that waiting for a 7ms disk access is not favourable for tasks that require many accesses (especially when requesting 100 results). It's beyond the scope of this to explain all of the data steps, but you can rest assured that even on well-indexed data there are still many separate data accesses for searches (especially multi-word searches).
This suggests that each machine will have popular index data stored locally but will also house data that is different to other machines (with some data that's not quite popular enough to be cached on every machine residing on more machines than infrequently used information). It would also make sense that the machine that is chosen to perform a search should have at least some of the less common data contained from the search query held locally.
There are 4 machines in a distributed setting with the following index data held locally (with the size of the index data being larger for terms that appear earlier in the list):
[Machine 1 - widgets, red, furry]
[Machine 2 - widgets, green, wet]
[Machine 3 - widgets, blue, cold]
[Machine 4 - widgets, red, shiny]
A search is done for 'shiny blue widgets'
The correct machine to send this to would be Machine 3 as all machines hold 'widgets' in their cache but Machine 3 has the second largest term 'blue' in memory, hence it only needs to request a small amount of information across the network (possibly in small chunks) from Machine 4 (which holds 'shiny' index data). This is very simplistic but it's possible that only 1/1000th of the documents contain 'shiny' so although it's not held locally it is the data that you would use to step through the others for ID matches (assuming that IDs are sorted in a consistent manner).
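The example above can be sketched in code. The index sizes and the document IDs below are invented for illustration, and this is only my guess at how such routing could work, not a description of what Google does.

```python
from bisect import bisect_left

# Index data held locally on each machine, as in the example above.
MACHINES = {
    1: {"widgets", "red", "furry"},
    2: {"widgets", "green", "wet"},
    3: {"widgets", "blue", "cold"},
    4: {"widgets", "red", "shiny"},
}

# Hypothetical index sizes (number of document IDs per term).
INDEX_SIZE = {"widgets": 1_000_000, "blue": 50_000, "shiny": 1_000}

def remote_cost(local_terms, query_terms):
    """Amount of index data that would have to come over the network."""
    return sum(INDEX_SIZE[t] for t in query_terms if t not in local_terms)

def pick_machine(query_terms):
    """Send the query to the machine that needs the least remote data."""
    return min(MACHINES, key=lambda m: remote_cost(MACHINES[m], query_terms))

def contains(sorted_ids, doc_id):
    i = bisect_left(sorted_ids, doc_id)
    return i < len(sorted_ids) and sorted_ids[i] == doc_id

def intersect(postings):
    """Step through the shortest sorted ID list and look each ID up in the others."""
    shortest, *others = sorted(postings, key=len)
    return [d for d in shortest if all(contains(p, d) for p in others)]

query = ["shiny", "blue", "widgets"]
print(pick_machine(query))  # 3

# Toy sorted ID lists: 'shiny' is the rare term, so it drives the stepping.
widgets = list(range(100))
blue = list(range(0, 100, 5))
shiny = [10, 33, 45, 90]
print(intersect([widgets, blue, shiny]))  # [10, 45, 90]
```

Machine 3 wins because the only data it lacks is the small 'shiny' list, and that small list is also the one you step through when matching IDs.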
It's very likely that the power behind Google is not so much smart matching algorithms, but the very smart data handling that allows everything to scale so well. I'm sure if we knew how data is manipulated at Google we'd have a better idea of how some run-time filters work.
They've got to look-up 10 documents. But they don't have to look them up one-at-a-time.
Of course they look them up concurrently, but you simply can't make many seeks on one disk - it is a serious bottleneck.
Having concurrent accesses is never good - you can order them, but this is still very difficult and you have to be very lucky to have one seek close enough to another to avoid actually making it. If you deal with terabytes of data and random 10KB reads from it, then you can be sure you won't be able to avoid disk seeks.
Beowulf Cluster [en.wikipedia.org]
If you read the (rather brief, and not terribly accurate) Wikipedia entry, you will get a sense that such clusters are almost tailor made for rapid indexing/searching. If properly coded, they are also highly scalable, and offer redundancy of process, so that if one machine fails in a cluster (or one hard drive), it is irrelevant, the other machines in the cluster pick up the slack.
While the first official "HowTo" was published in 1998, the first cluster was made at NASA years before this (can't remember the exact date), and the techniques and methodology had been floating around the open source community almost from day one.
Given what Google Search is doing, and how fast they are able to do it, I have little doubt that this is the backend architecture they are using. It would allow for much of the "index" to be held in RAM. Keep in mind, the "index" is just that, an index. Cue cards that point to larger stores of Data held on hard drives. When you do a search, the index looks at the words you have entered, and points to specific hard stored data to be retrieved.
I've never seen anyone at G directly reference Beowulf Clusters, but they have let slip a few bits to indicate this is a key backend technology. For one, they have repeatedly referred to their use of cheap, commodity PCs for their backbone. They have also referred in the past to their operating system as a customized version of Linux.
Making such large clusters work, and making the indexing process lean and efficient is still one heckuva feat.
I just wish they'd obey the spirit of the Open Source community and release some of their code optimizations back into the community.
That never used to be the case; and that tells a lot about the changes in their search algorithms.