---- "Phrase Based Indexing and Retrieval" - part of the Google picture?
tedster - 11:33 pm on Feb 17, 2007 (gmt 0)
One factor that can help minimize false positives, as I understand it at least, is the fact that the "expected number E" will be measured relative to each target phrase, and across quite a wide sample of documents -- so it won't be nearly the same number in the case of diverse search phrases.
 For each of these most significant related phrases, the number of related phrases present in the document is determined, again from their related phrase bit vectors. If the actual number of related phrases significantly exceeds the expected number (using any of the above described tests), then document is deemed a spam document with respect to that most significant phrase...
I would imagine that "thin" pages would scoot right by this test, whereas article pages written with certain target searches in mind might trip the spam test. Do others see it this way?
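To make the quoted test concrete, here is a rough sketch of how I read it, not anything taken from the patent itself. The patent talks about related phrase bit vectors; plain Python sets stand in for them here. The function names, the sample phrases, and the `threshold_ratio` cutoff are all my own assumptions, since the quote only says "significantly exceeds" and refers back to tests that are not reproduced in this thread.

```python
def count_related_present(doc_phrases, related_phrases):
    """Count how many of a phrase's related phrases occur in the document."""
    return len(set(doc_phrases) & set(related_phrases))


def is_spam_for_phrase(doc_phrases, related_phrases, expected_count,
                       threshold_ratio=2.0):
    """Flag the document if the actual count of related phrases exceeds the
    expected count E by more than threshold_ratio (an assumed cutoff)."""
    actual = count_related_present(doc_phrases, related_phrases)
    return actual > expected_count * threshold_ratio


# Hypothetical related phrases for one "most significant" phrase.
related = ["blue widgets", "cheap widgets", "widget store",
           "buy widgets", "widget reviews", "widget prices"]

thin_page = ["blue widgets"]      # barely any related phrases
article_page = list(related)      # written around every related phrase

thin_flag = is_spam_for_phrase(thin_page, related, expected_count=2)
article_flag = is_spam_for_phrase(article_page, related, expected_count=2)
```

With these made-up numbers the thin page is not flagged and the heavily targeted article page is, which matches the intuition above. Because E is measured per phrase, the same page could pass for one phrase and fail for another.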
While it is a new method, it still is layered on and inspired by current algorithmic attempts at providing 'relevant' results at the Big G....
That's how I see it, too... the following quote from the patent seems to say the same thing, although it appears to contain a typo.
 The foregoing approaches to identifying a spam document are preferably implemented as part of the indexing process, and may be conducted in parallel with other indexing operations, are afterwards.
Say what? It only makes sense to me if I read the last phrase as "
At any rate, I don't assume that the 950 phenomenon can be wholly explained by phrase based techniques. The kind of impact on ranking to be expected is highlighted by two examples in the spam patent - and to my understanding, neither of these two steps would send every tagged url to the end of results. Of course, nothing in the patent requires that only these two steps are possible.
 If the document is included in the SPAM_TABLE, then the document's relevance score is down weighted by predetermined factor. For example, the relevance score can be divided by factor (e.g., 5). Alternatively, the document can simply be removed from the result set entirely.
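The two options in that passage can be sketched as follows. This is only an illustration under my own assumptions: the patent names a SPAM_TABLE but the quote does not say how it is laid out, so the dict-of-sets structure, the function name, and the sample scores are hypothetical. Only the divide-by-5 example and the remove-entirely alternative come from the quote.

```python
def apply_spam_penalty(results, spam_table, phrase, factor=5.0, remove=False):
    """Down-weight (or drop) documents flagged as spam for this phrase.

    results    -- list of (doc_id, relevance_score) pairs
    spam_table -- assumed layout: {phrase: set of flagged doc_ids}
    """
    flagged_docs = spam_table.get(phrase, set())
    adjusted = []
    for doc_id, score in results:
        if doc_id in flagged_docs:
            if remove:
                continue              # alternative: drop from the result set
            score = score / factor    # example from the quote: divide by 5
        adjusted.append((doc_id, score))
    # Re-rank by the adjusted scores, highest first.
    adjusted.sort(key=lambda pair: pair[1], reverse=True)
    return adjusted


# Hypothetical scores: doc "a" would rank first without the penalty.
results = [("a", 10.0), ("b", 8.0), ("c", 1.0)]
spam_table = {"widgets": {"a"}}

down_weighted = apply_spam_penalty(results, spam_table, "widgets")
removed = apply_spam_penalty(results, spam_table, "widgets", remove=True)
```

In this toy case the flagged document drops from first to second rather than to the very end, which is the point: dividing a score by a fixed factor does not by itself push every flagged url to the bottom of the results.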
However, the frequently mentioned "over optimization penalty" or OOP does seem like something that could be accounted for with these approaches.
 ...The document is also added as a spam document for each the related phrases of that good phrase, since a document is considered a spam document with respect to all phrases that are related to each other.
Note that this does not seem to be what 950 sufferers are describing. For at least some of these cases, related phrases still can rank well.
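For reference, the propagation step in the quoted passage could look something like the sketch below. Again this is my own reading, with a hypothetical function name, an assumed `{phrase: set of doc_ids}` table layout, and made-up phrases.

```python
def mark_spam(spam_table, doc_id, phrase, related_phrases):
    """Record doc_id as spam for a phrase and for every phrase related to it.

    related_phrases -- assumed layout: {phrase: list of related phrases}
    """
    for p in [phrase, *related_phrases.get(phrase, [])]:
        spam_table.setdefault(p, set()).add(doc_id)
    return spam_table


# Hypothetical phrase relationships.
related_phrases = {"blue widgets": ["cheap widgets", "widget store"]}

table = mark_spam({}, "doc1", "blue widgets", related_phrases)
```

Under this reading a flagged page would be penalized for the whole cluster of related phrases at once, which is why pages that still rank well for related phrases do not fit this particular mechanism.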