It sounds like they are "digesting" various aspects of the document into numeric values that can then be compared.
Let's take the simplest case - where you are only interested in detecting an EXACT duplicate. Of course, this wouldn't work in a practical sense, since EXACT duplicates are unlikely. But let's just do it as an exercise.
This is really easy - simply run an MD5 or similar hashing algorithm on the page. This is quick to do, and it's done only once, when the page is spidered.
The probability of two different pages generating the exact same MD5 is vanishingly low. One can pretty much assume that two documents with the same MD5 are identical. In any case, a rote comparison can be done among pages that share an MD5 to make the final determination.
Simply create a database index of the MD5s, and now you can tell just about instantly whether duplicates exist, and also very quickly access all of the duplicates.
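To make that concrete, here's a rough sketch in Python of the exact-duplicate case - the table layout and column names are just something I made up for illustration:

# A minimal sketch of exact-duplicate detection via MD5, assuming pages
# are already fetched as plain text. Table and column names are made up.
import hashlib
import sqlite3

def fingerprint(page_text):
    # Identical text always produces an identical digest.
    return hashlib.md5(page_text.encode("utf-8")).hexdigest()

conn = sqlite3.connect("index.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, md5 TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_md5 ON pages (md5)")

def store(url, page_text):
    conn.execute("INSERT INTO pages VALUES (?, ?)", (url, fingerprint(page_text)))
    conn.commit()

def duplicates_of(page_text):
    # Instant lookup: every stored URL whose content hashed to the same value.
    rows = conn.execute("SELECT url FROM pages WHERE md5 = ?",
                        (fingerprint(page_text),)).fetchall()
    return [r[0] for r in rows]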
The problem with using an MD5 for this is that if you change just one letter in the document, it generates a radically different MD5 value - MD5 was designed to do exactly that. But imagine a different hashing algorithm that would generate a nearby number for a document that differs by one letter, and a slightly more distant number for a document that differs by one or two words.
What Google is apparently doing is a bit more complex than this - they mention a vector, so they are digesting a page into a series of values, not just one, and then summing those values (each multiplied by a weight).
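The best-known published technique along these lines is Charikar's "simhash", and something in that spirit is my guess at what such a weighted-vector digest could look like - this is just an illustration, not Google's actual code:

# A rough sketch of a "nearby numbers for nearby documents" digest, in the
# spirit of simhash. The tokenization and the flat weight of 1 per word are
# placeholder choices.
import hashlib

def simhash(text, bits=64):
    totals = [0] * bits
    for word in text.lower().split():
        weight = 1  # could be a term-frequency or importance weight instead
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    # Each bit of the fingerprint is the sign of one weighted sum.
    return sum(1 << i for i in range(bits) if totals[i] > 0)

def hamming_distance(a, b):
    # Few differing bits = mostly the same words = probably near-duplicates.
    return bin(a ^ b).count("1")

Change one word in a long page and only a handful of those sums flip sign, so the two fingerprints end up a few bits apart instead of completely different.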
I would bet that they have a number of algorithms that each boils some aspect of the document down to a value that can be compared linearly, such that the closer the values, the more likely the documents being compared are identical or similar.
The most obvious document parameters that could be measured this way have trivial hashing algorithms - in fact, the "hash" is just the parameter itself: word count and letter count. If document A is 1,000 words long and document B is 10,000 words long, they aren't likely to be identical. If they are 1,000 and 1,001 words long, it's much more likely.
Now add letter frequencies, most-occurring words and their frequencies, etc.
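None of this needs anything fancy - here's what those trivial "hashes" might look like; the 5% tolerance is an arbitrary number I picked for illustration:

# Sketch of the trivial "digests" described above: each one boils the page
# down to a number (or a small set of numbers) that can be compared directly.
from collections import Counter

def simple_stats(text):
    words = text.lower().split()
    letters = [c for c in text.lower() if c.isalpha()]
    return {
        "word_count": len(words),
        "letter_count": len(letters),
        "letter_freq": Counter(letters),
        "top_words": Counter(words).most_common(10),
    }

def clearly_different(a, b, tolerance=0.05):
    # Cheap rule-out: word counts more than ~5% apart means "not a duplicate".
    return abs(a["word_count"] - b["word_count"]) > tolerance * max(a["word_count"], 1)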
As another example, there are statistical tests for the "grade level" of writing. (Most word-processing packages have this buried somewhere...) A document written at a 10th-grade level isn't likely to be identical to a document written at a college-graduate level.
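One widely published formula for this is the Flesch-Kincaid grade level. Here's a rough version - the syllable counter is a crude vowel-group heuristic, which is fine for comparing documents against each other even if the absolute number is off:

# Approximate Flesch-Kincaid grade level. The syllable count is a rough
# vowel-group heuristic, not a dictionary lookup.
import re

def syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def grade_level(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[a-zA-Z']+", text)
    n = max(1, len(words))
    syl = sum(syllables(w) for w in words)
    return 0.39 * (n / sentences) + 11.8 * (syl / n) - 15.59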
Measure spelling errors. Grammar errors. Regionalisms. "Voice". (Oh, webwork will like that one! :) ) etc, etc, etc.
It's a game of divide and conquer. How can you quickly rule out vast numbers of documents as "not duplicates" using simple, easily compared and indexed values?
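Strung together, it might look something like this - cheap, indexable comparisons first, and the expensive word-by-word comparison only on whatever survives. The data layout is assumed, just to show the shape of the idea:

# Divide and conquer: each stage is cheaper than the next one is expensive.
# Assumes each page is a dict holding the pre-computed values sketched above.
def candidate_duplicates(page, pages):
    # 1. Cheap scalar filter: rule out anything whose length is way off.
    survivors = [p for p in pages
                 if abs(p["word_count"] - page["word_count"]) < 0.05 * page["word_count"]]
    # 2. Medium-cost filter: fingerprints within a few bits of each other.
    survivors = [p for p in survivors
                 if bin(p["simhash"] ^ page["simhash"]).count("1") <= 3]
    # 3. Only the handful that remain are worth a full word-by-word comparison.
    return survivors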
Now, start out by doing a search, but disable the parts of Google's search algorithm that deal with PR, links, authority, trust, etc. In other words, just a search for the terms the user asked for, with no other fiddling. You've already narrowed the universe of possible duplicates, and I'd imagine that's where the rest of the magic takes off from.