One major factor in Google's speed is their extremely savvy data management - including their very own Google File System [labs.google.com], specifically tailored to the job that they set out to do.
Google also keeps its entire index in RAM, so disk access times aren't a factor in responding to a query.
* Geographically distributed data centers with lots of processing power and data throughput capacity.
* Clean interface- no graphic ads all over the place, no bloated code (take a look at the source page some time).
* Google bought an alien technology company several years ago, giving them access to trans-warp hyper-time multidimensional data processing, allowing them to offload the search processing into non-real time then feed the results back into real-time so that users do not notice any time passage in their own perceived space/time continuum. (Note: Matt Cutts and all other Googlers have to deny the existence of this technology because of the NDA they all signed, so they will never admit to using it.)
Wow LifeinAsia, your last comment really opened my eyes! I always thought Google was operated by mad wizards from the Disc World and had something to do with creatures from the Dungeon Dimensions and the phenomenal speed of Death's horse Binky, but now I know better. Thanks for clearing that up!
Money loads of money
Big machines with lots of blinking lights.
|I always thought Google was operated by mad wizards ... |
That was Google 1.0. :)
|Google also keeps its entire index in RAM, so disk access times aren't a factor in responding to a query. |
This is a myth: logically they don't have to, and I have come across a number of queries that evidently came from disk, taking 1 second or more. These are not popular queries, but not very heavy either, which is a good indication that they come from disk rather than RAM.
Judging from some recent patent applications, Google may even keep some information in firmware - or at least be planning to. This would particularly apply to semantic analysis and other data that does not change as rapidly.
|Google may even keep some information in firmware |
Firmware is basically "BIOS" and it has got little to do with the speed of search apart from maybe tuning up some things.
Masters of mind who will be emperors of earth tomorrow
When you talk about RAM - how much RAM do they own?
Imagine how many users search at any given second and how many machines serve these requests...!
It is interesting to note the processing time for various queries and types of query.
It is either less than a certain figure, or greater than a much higher figure.
There's no in-between.
It's not all in the data management.
It's also where you sit on the net, how many hops your point of presence is from your customer, and how fast the route is, which was the claim to fame for AltaVista back in the day.
They are periodically caching their results based on previous searches. More than 99.9% of the time someone else had already searched that term, so they only serve html data.
|More than 99.9% of the time someone else had already searched that term, so they only serve html data. |
This is a very high estimate and I consider it unlikely. The problem is that search queries are very fragmented and new ones appear all the time, so caching is not that straightforward. I think Google said some time ago that about 20-25% of the queries they see over a few months are new to them, in other words have never been used before.
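For anyone wondering what "caching results based on previous searches" might look like in its simplest form, here is a rough Python sketch of a least-recently-used result cache. Everything in it (the ResultCache class, the slow_search stand-in, the capacity of 2) is made up for illustration and says nothing about how Google really does it. It does show why never-before-seen queries limit the hit rate: every new query is a miss by definition.

```python
from collections import OrderedDict

class ResultCache:
    """Tiny LRU cache mapping a normalised query string to rendered results."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query, compute):
        # Normalise case and whitespace so trivially different queries share an entry.
        key = " ".join(query.lower().split())
        if key in self._store:
            self._store.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute(key)                  # fall back to the expensive lookup
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict the least recently used entry
        return value

def slow_search(query):
    # Hypothetical stand-in for the real index lookup.
    return f"<html>results for {query}</html>"

cache = ResultCache(capacity=2)
for q in ["red widgets", "Red  Widgets", "blue widgets", "green widgets", "red widgets"]:
    cache.get(q, slow_search)
print(cache.hits, cache.misses)  # 1 4
```

The last "red widgets" is a miss because the entry was evicted when the cache filled up, which is the other practical limit on caching besides brand-new queries.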
|The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines |
Over a thousand machines? And that's just 1 datacentre.
|Big machines with lots of blinking lights |
The LEDs alone probably take the same power as South America, or some other daft statistic.
|They are periodically caching their results based on previous searches |
It would be stupid not to.
|mad wizards from the Disc World |
Let's just hope the Google File System wasn't designed by Bloody Stupid Johnson (0.34 seconds, and I can't believe that's a common search), with particular reference to the Post Office Mail Sorter.
Guys...you're all missing the point..
Google is not real-time: they have an index that they get the results from.
In other words, the machines know exactly where to look EVERY time, because everything is in the index.
If you don't understand indexing and fast search algorithms, it can all look rather mysterious. How on earth can they search millions of documents when you do a search?
The answer is, they don't. The content has been indexed, and indexing algorithms are able to locate a keyword in an index of millions of entries with a very small number of disk accesses (a handful).
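As a rough illustration of why a search doesn't have to scan the documents themselves, here is a minimal Python sketch of an inverted index. The function names (build_index, search) and the sample documents are invented for this example, and a real engine would keep the index on disk or spread across machines rather than in a dict.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for word in set(docs[doc_id].lower().split()):
            index[word].append(doc_id)
    return index

def search(index, query):
    """Return IDs of documents containing every word in the query."""
    postings = [set(index.get(word, [])) for word in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical sample corpus.
docs = {
    1: "shiny red widgets",
    2: "blue widgets",
    3: "shiny blue widgets",
}
index = build_index(docs)
print(search(index, "shiny widgets"))  # [1, 3]
print(search(index, "blue widgets"))   # [2, 3]
```

The indexing work is done once, ahead of time. At query time the cost depends on the number of words in the query and the length of their ID lists, not on the total number of documents.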
|The content has been indexed, and indexing algorithms are able to locate a keyword in an index of millions of entries with a very small number of disk accesses (a handful). |
Let's say the actual searching, i.e. locating the numeric IDs of the top 10 documents that should be shown, takes 0 seconds, so it can't be any faster. OK, then you have got 10 documents for which you need to pull the title and make relevant text snippets: this will cost you at least 1 disk seek per document unless they come from the same domain, so we are talking about 10 disk seeks. Disks are very slow at seeking, especially in concurrent environments, and these 10 seeks will easily cost you 0.20 sec. This is already higher than many of Google's searches take, and the actual pulling of documents is not the slowest part, so they do keep some searches cached, and definitely some frequently hit documents are kept in memory so that they can avoid a disk seek per document.
However, keeping the whole database in memory is very expensive even for Google. They definitely do not keep everything in RAM; most likely everything in the supplemental index is on disk and the main index is in memory.
-- How on earth can they search millions of documents --
I might be totally off on this, but I don't think it's millions. It is a spin of "why", "where" and "WHO" at this point. You are here, I am there, and everybody else is elsewhere.
Pinned pointed? Houston, we have an index!
added, which is exactly what Majestic extends.
Google indexes billions of documents, not millions - it is a massive undertaking, no doubt about it.
Relative to how many searches are not "repeats", back in January of this year, Udi Manber, who is Google's VP of Engineering, said that "20 to 25% of the queries we see today, we have never seen before" (see this thread [webmasterworld.com]). So caching of frequent search results can only take them just so far.
|OK, then you have got 10 documents for which you need to pull the title and make relevant text snippets: this will cost you at least 1 disk seek per document unless they come from the same domain, so we are talking about 10 disk seeks. Disks are very slow at seeking, especially in concurrent environments, and these 10 seeks will easily cost you 0.20 sec. |
I think it's ironic that you used the answer to the mystery in your post, and probably didn't realize it. :)
Who says the seeks are done serially?
They've got to look-up 10 documents. But they don't have to look them up one-at-a-time.
They can be looked-up concurrently, on 10 different disks or disk arrays.
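To put rough numbers on serial versus concurrent lookups, here is a back-of-the-envelope Python sketch. The 7 ms seek time is an assumption (a typical figure for a mechanical disk), the function names are made up, and the model ignores queueing, transfer time and network overhead.

```python
import math

def serial_latency_ms(num_docs, seek_ms):
    """All seeks happen one after another on a single disk."""
    return num_docs * seek_ms

def parallel_latency_ms(num_docs, seek_ms, num_disks):
    """Seeks are spread evenly over num_disks; each disk works through its share in turn."""
    rounds = math.ceil(num_docs / num_disks)
    return rounds * seek_ms

SEEK_MS = 7  # assumed average seek time, not a measured Google figure

print(serial_latency_ms(10, SEEK_MS))        # 70
print(parallel_latency_ms(10, SEEK_MS, 4))   # 21
print(parallel_latency_ms(10, SEEK_MS, 10))  # 7
```

Under these assumptions, ten documents on ten different disks cost about one seek of wall-clock time instead of ten.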
I doubt that anybody outside of Google knows what their search time figures really mean, in any case. Those numbers are pure marketing, and possibly fantasy.
|especially in concurrent environments |
Actually, the more concurrent the environment, the less seek time matters. The more concurrent the better. Disk servers typically do "elevator seeks". Queued seeks are rearranged in the queue in an optimal order.
Let's say there are queued seeks for tracks 1, 1000, 4, 2000, 7, 1600.
Using elevator seeks, these will be rearranged to 1, 4, 7, 1000, 1600, 2000.
When the head gets to the maximum track, it will work its way back the other way. So, the head basically continually sweeps back and forth across the disk, rather than going randomly hither and yon in the exact order in which the requests came in.
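The reordering described above can be sketched in a few lines of Python. This is a simplified model of an elevator (SCAN-style) scheduler, with function names invented here. It assumes all requests are already queued and the head starts at track 0 moving upward, whereas real disk firmware deals with requests arriving while the sweep is in progress.

```python
def elevator_order(requests, head=0, moving_up=True):
    """Order queued track requests as one sweep in the current direction, then the reverse sweep."""
    if moving_up:
        ahead = sorted(t for t in requests if t >= head)
        behind = sorted((t for t in requests if t < head), reverse=True)
    else:
        ahead = sorted((t for t in requests if t <= head), reverse=True)
        behind = sorted(t for t in requests if t > head)
    return ahead + behind

def head_travel(order, head=0):
    """Total number of tracks the head crosses when servicing requests in the given order."""
    total = 0
    for track in order:
        total += abs(track - head)
        head = track
    return total

queued = [1, 1000, 4, 2000, 7, 1600]
swept = elevator_order(queued)
print(swept)                # [1, 4, 7, 1000, 1600, 2000]
print(head_travel(queued))  # 7578 tracks in arrival order
print(head_travel(swept))   # 2000 tracks in elevator order
```

With the example queue, the sweep covers roughly a quarter of the distance the head would travel in arrival order.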
[edited by: jtara at 1:07 am (utc) on Nov. 2, 2007]
It's interesting to note the size of Google data centres - they are getting bigger and bigger. This suggests that having many thousands of machines on the same local network is vital to the operation of new features and keeping speed up. It would not surprise me if these commodity machines were connected to VERY fast networks (optical links with a highly specialised topology) so they can share data at lightning speed. If you have thousands of machines with, say, 4GB RAM drives each, then you have terabytes of index data on hand with latency restricted by network speed rather than mechanical disks.
The storage requirements for a full web search are mind-boggling (let's ignore the fact that even G can't index the whole web). Putting it all in RAM could only realistically be done by distributing the task.
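The arithmetic behind "thousands of machines with 4GB each gives you terabytes" is simple enough to check. A quick Python sketch, using decimal units (1 TB = 1000 GB) and made-up function names; the machine counts are illustrative, not known Google figures:

```python
import math

def aggregate_ram_tb(num_machines, ram_gb_per_machine):
    """Total RAM across the cluster, in terabytes (decimal units)."""
    return num_machines * ram_gb_per_machine / 1000

def machines_needed(index_tb, ram_gb_per_machine):
    """Minimum number of machines needed to hold an index of index_tb terabytes in RAM."""
    return math.ceil(index_tb * 1000 / ram_gb_per_machine)

print(aggregate_ram_tb(1000, 4))  # 4.0
print(machines_needed(100, 4))    # 25000
```

This ignores replication and the memory the machines need for anything other than index data, so real numbers would be higher.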
When you look at pulling results you will see, as Majestic points out, that waiting for a 7ms disk access is not favourable for tasks that require many accesses (especially when requesting 100 results). It's beyond the scope of this to explain all of the data steps, but you can rest assured that even on well-indexed data there are still many separate data accesses for searches (especially multi-word searches).
This suggests that each machine will have popular index data stored locally but will also house data that is different to other machines (with some data that's not quite popular enough to be cached on every machine residing on more machines than infrequently used information). It would also make sense that the machine that is chosen to perform a search should have at least some of the less common data contained from the search query held locally.
There are 4 machines in a distributed setting with the following index data held locally (with the size of the index data being larger for terms that appear earlier in the list):
[Machine 1 - widgets, red, furry]
[Machine 2 - widgets, green, wet]
[Machine 3 - widgets, blue, cold]
[Machine 4 - widgets, red, shiny]
A search is done for 'shiny blue widgets'
The correct machine to send this to would be Machine 3 as all machines hold 'widgets' in their cache but Machine 3 has the second largest term 'blue' in memory, hence it only needs to request a small amount of information across the network (possibly in small chunks) from Machine 4 (which holds 'shiny' index data). This is very simplistic but it's possible that only 1/1000th of the documents contain 'shiny' so although it's not held locally it is the data that you would use to step through the others for ID matches (assuming that IDs are sorted in a consistent manner).
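The routing idea in this example can be sketched in Python. To be clear, this is only a toy model of the speculation above: the posting-list sizes in TERM_SIZE are hypothetical numbers chosen so that earlier terms are larger, and choose_machine and intersect_sorted are names invented here. The machine picked is the one that would have to pull the least index data across the network, and the intersection steps through the smallest ID list, checking each ID against the larger sorted lists.

```python
from bisect import bisect_left

# Terms held locally by each machine, as listed in the example.
MACHINES = {
    "Machine 1": {"widgets", "red", "furry"},
    "Machine 2": {"widgets", "green", "wet"},
    "Machine 3": {"widgets", "blue", "cold"},
    "Machine 4": {"widgets", "red", "shiny"},
}

# Hypothetical posting-list sizes (number of document IDs per term).
TERM_SIZE = {"widgets": 1000, "red": 300, "green": 250, "blue": 200,
             "furry": 5, "wet": 5, "cold": 5, "shiny": 1}

def remote_cost(local_terms, query_terms):
    """Amount of index data that would have to be fetched from other machines."""
    return sum(TERM_SIZE[t] for t in query_terms if t not in local_terms)

def choose_machine(query_terms):
    """Pick the machine with the smallest remote fetch cost (ties go to the lowest name)."""
    return min(sorted(MACHINES), key=lambda name: remote_cost(MACHINES[name], query_terms))

def contains(sorted_ids, doc_id):
    """Binary search for doc_id in a sorted list of IDs."""
    i = bisect_left(sorted_ids, doc_id)
    return i < len(sorted_ids) and sorted_ids[i] == doc_id

def intersect_sorted(id_lists):
    """Step through the shortest list and keep IDs present in all the others."""
    shortest, *others = sorted(id_lists, key=len)
    return [d for d in shortest if all(contains(o, d) for o in others)]

print(choose_machine(["shiny", "blue", "widgets"]))  # Machine 3
print(intersect_sorted([[1, 2, 3, 5, 8, 13], [2, 5, 13, 21], [5, 13]]))  # [5, 13]
```

Machine 3 wins because the only term it lacks, 'shiny', is the smallest list to fetch, whereas Machine 4 would have to fetch the much larger 'blue' list.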
It's very likely that the power behind Google is not so much smart matching algorithms, but the very smart data handling that allows everything to scale so well. I'm sure if we knew how data is manipulated at Google we'd have a better idea of how some run-time filters work.
|They've got to look-up 10 documents. But they don't have to look them up one-at-a-time. |
Of course they look them up concurrently, but you simply can't make many seeks on one disk; it is a serious bottleneck.
Having concurrent accesses is never good: you can order them, but this is still very difficult, and you have to be very lucky to have one seek close enough to another to avoid actually making it. If you deal with terabytes of data with random 10KB reads from it, then you can be sure you won't be able to avoid disk seeks.
Who gets better football players? Money!
Beowulf Cluster [en.wikipedia.org]
If you read the (rather brief, and not terribly accurate) Wikipedia entry, you will get a sense that such clusters are almost tailor made for rapid indexing/searching. If properly coded, they are also highly scalable, and offer redundancy of process, so that if one machine fails in a cluster (or one hard drive), it is irrelevant, the other machines in the cluster pick up the slack.
While the first official "HowTo" was published in 1998, the first cluster was made at NASA years before this (can't remember the exact date), and the techniques and methodology had been floating around the open source community almost from day one.
Given what Google Search is doing, and how fast they are able to do it, I have little doubt that this is the backend architecture they are using. It would allow for much of the "index" to be held in RAM. Keep in mind, the "index" is just that, an index. Cue cards that point to larger stores of Data held on hard drives. When you do a search, the index looks at the words you have entered, and points to specific hard stored data to be retrieved.
I've never seen anyone at G directly reference Beowulf Clusters, but they have let slip a few bits to indicate this is a key backend technology. For one, they have repeatedly referred to their use of cheap, commodity PCs for their backbone. They have also referred in the past to their operating system as a customized version of Linux.
Making such large clusters work, and making the indexing process lean and efficient is still one heckuva feat.
I just wish they'd obey the spirit of the Open Source community and release some of their code optimizations back into the community.
Google has gone through a transition between true searching and just doing a good job searching. Sometime today, whilst reading an article, pick out one not-particularly-keyword-relevant phrase of about 12 words length; paste it into google. Odds are it won't come up despite the article being indexed.
That never used to be the case; and that tells a lot about the changes in their search algorithms.
Google also uses a Column oriented database [en.wikipedia.org] as opposed to a row oriented database that puny webmasters like us use.
Google has eliminated middle tier operators and plugs its data centers directly into the Internet just like top ISPs.