inbound - 1:06 am on Nov 2, 2007 (gmt 0)
The storage requirements for a full web search are mind-boggling (let's ignore the fact that even G can't index the whole web). Putting it all in RAM could only realistically be done by distributing the task. When you look at pulling results you will see, as Majestic points out, that waiting on a 7ms disk access is not favourable for tasks that require many accesses (especially when requesting 100 results). It's beyond the scope of this post to explain all of the data steps, but you can rest assured that even on well-indexed data there are still many separate data accesses per search (especially for multi-word searches).

This suggests that each machine will have popular index data stored locally but will also house data that differs from what other machines hold. Data that's not quite popular enough to be cached on every machine would still reside on more machines than infrequently used information. It would also make sense that the machine chosen to perform a search should hold locally at least some of the less common data needed for the query.

e.g. There are 4 machines in a distributed setting with the following index data held locally (with the index data being larger for terms that appear earlier in the list):

[Machine 1 - widgets, red, furry]
[Machine 2 - widgets, green, wet]
[Machine 3 - widgets, blue, cold]
[Machine 4 - widgets, red, shiny]

A search is done for 'shiny blue widgets'. The correct machine to send this to would be Machine 3: all machines hold 'widgets' in their cache, but Machine 3 also has the second largest term, 'blue', in memory, hence it only needs to request a small amount of information across the network (possibly in small chunks) from Machine 4 (which holds the 'shiny' index data). This is very simplistic, but it's possible that only 1/1000th of the documents contain 'shiny', so although it's not held locally it is the data that you would use to step through the others for ID matches (assuming that IDs are sorted in a consistent manner).

It's very likely that the power behind Google is not so much smart matching algorithms as the very smart data handling that allows everything to scale so well. I'm sure that if we knew how data is manipulated at Google we'd have a better idea of how some run-time filters work.

It's interesting to note the size of Google data centres - they are getting bigger and bigger. This suggests that having many thousands of machines on the same local network is vital to the operation of new features and to keeping speed up. It would not surprise me if these commodity machines were connected by VERY fast networks (optical links with a highly specialised topology) so they can share data at lightning speed. If you have thousands of machines with, say, 4GB RAM drives each then you have terabytes of index data on hand, with latency restricted by network speed rather than mechanical disks.
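For anyone who wants to play with the four-machine 'shiny blue widgets' example from this post, here is a rough Python sketch of the two ideas in it: sending the query to the machine that has to pull the least index data over the network, and then stepping through the smallest ID list to find matches. The posting-list sizes in TERM_SIZE and the function names (remote_cost, choose_machine, intersect) are made up for illustration; this is only a guess at the general shape, not how Google actually does it.

```python
from bisect import bisect_left

# Terms each machine holds locally, copied from the example in the post.
MACHINES = {
    "Machine 1": ["widgets", "red", "furry"],
    "Machine 2": ["widgets", "green", "wet"],
    "Machine 3": ["widgets", "blue", "cold"],
    "Machine 4": ["widgets", "red", "shiny"],
}

# Hypothetical posting-list sizes (number of document IDs per term).
# These figures are assumptions, chosen so that 'shiny' is about 1/1000th
# the size of 'widgets'.
TERM_SIZE = {"widgets": 1_000_000, "blue": 200_000, "shiny": 1_000}


def remote_cost(machine, query_terms):
    """Count the IDs this machine would have to fetch from other machines."""
    local = set(MACHINES[machine])
    return sum(TERM_SIZE[term] for term in query_terms if term not in local)


def choose_machine(query_terms):
    """Pick the machine that needs the least index data over the network."""
    return min(MACHINES, key=lambda name: remote_cost(name, query_terms))


def contains(sorted_ids, doc_id):
    """Binary search for doc_id in a sorted list of IDs."""
    i = bisect_left(sorted_ids, doc_id)
    return i < len(sorted_ids) and sorted_ids[i] == doc_id


def intersect(postings):
    """Intersect sorted ID lists by stepping through the shortest one."""
    shortest, *others = sorted(postings, key=len)
    return [d for d in shortest if all(contains(o, d) for o in others)]


query = ["shiny", "blue", "widgets"]
print(choose_machine(query))

# Toy posting lists (sorted document IDs), again purely illustrative.
widgets = list(range(0, 100))
blue = list(range(0, 100, 2))
shiny = [4, 7, 50, 98, 120]
print(intersect([widgets, blue, shiny]))
```

With these made-up sizes the query goes to Machine 3, because the only thing it is missing is the small 'shiny' list. The intersection loop only ever walks the five 'shiny' IDs and probes the two big lists, which is the point about using the rarest term to step through the others.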