tedster - 10:26 pm on Jul 16, 2010 (gmt 0)
There's no direct connection between LSI and phrase-based indexing, which was patented in 2006, although some of the same textual relationships certainly might be surfaced by both - since those relationships are inherent in the web documents themselves, and not dependent on the technology used for analysis. However, LSI is just too computationally intensive, even for Google's massive power.
But notice the word "phrase-based" and how it indicates a major step away from simple text matching. It identifies meaningful word groups across all the web content. These meaningful phrases or word groups are often called n-grams, as in 2-gram, 3-gram etc. 5-gram seems to be the current cutoff. If you have an appetite for data crunching, you might be interested in the raw n-gram data that Google made available to the public in 2006 as Google's 1 terabyte n-gram corpus [googlesystem.blogspot.com].
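To make the n-gram idea concrete, here is a minimal Python sketch that counts every 1-gram through 5-gram in a piece of text. The function names (ngrams, count_ngrams) and the whitespace tokenizer are my own assumptions for illustration; this is not how Google builds its corpus, which also needs proper tokenization and frequency cutoffs.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(text, max_n=5):
    """Count all n-grams from 1 up to max_n words long.

    Tokenizing by lowercasing and splitting on whitespace is a
    simplifying assumption, not a real-world tokenizer.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

counts = count_ngrams("the stars and stripes and the stars and stripes forever")
print(counts[("stars", "and", "stripes")])
```

Run over a large enough collection of pages, the counts for a real phrase like "stars and stripes" stand out from accidental word runs, which is the raw material for deciding which word groups are meaningful.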
One immediate take-away might be that the old concept of "stop words" is oversimplified and not as applicable anymore. If a phrase is identified as meaningful - such as the 3-gram "stars and stripes" - then the word "and" may no longer be thrown away as a stop word for related queries that include that phrase.
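Here is a rough Python sketch of what that could look like at query time: known phrases are matched first and kept whole, and only the leftover words go through a stop word filter. The STOP_WORDS and KNOWN_PHRASES sets and the query_terms function are hypothetical, made up for this example, and not anything taken from the patents.

```python
# Hypothetical data, for illustration only.
STOP_WORDS = {"a", "and", "of", "the"}
KNOWN_PHRASES = {("stars", "and", "stripes")}

def query_terms(query, phrases=KNOWN_PHRASES, stop_words=STOP_WORDS):
    """Split a query into terms, keeping known phrases intact."""
    tokens = query.lower().split()
    terms = []
    i = 0
    while i < len(tokens):
        # Try the longest known phrase first at this position.
        match = None
        for phrase in sorted(phrases, key=len, reverse=True):
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                match = phrase
                break
        if match:
            terms.append(" ".join(match))  # the "and" survives inside the phrase
            i += len(match)
        elif tokens[i] in stop_words:
            i += 1  # an ordinary stop word is still dropped
        else:
            terms.append(tokens[i])
            i += 1
    return terms

print(query_terms("history of the stars and stripes"))
print(query_terms("salt and pepper"))
```

In the first query the "and" is kept because it sits inside a recognized phrase; in the second it is discarded as before, since "salt and pepper" is not in the hypothetical phrase list.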
There's a whole lot to digest here, but it is technology that Google has been chewing on for over four years. If you are familiar with the concept of co-occurring semantics, note that there is a similar idea in these phrase-based indexing patents - what phrases tend to occur together in the same document, and to what degree of statistical significance. So if you hope to rank well for a given phrase, then the presence of a few related phrases on the page might help.
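One simple way to put a number on "tend to occur together" is to compare how often two phrases actually share a document with how often they would if they were unrelated. The sketch below does that in Python. The cooccurrence_lift function, the toy documents, and the choice of this particular ratio are my assumptions; the patents may well use a different measure and threshold.

```python
def cooccurrence_lift(docs, phrase_a, phrase_b):
    """Ratio of observed to expected co-occurrence of two phrases.

    docs is a list of sets, each holding the phrases found in one document.
    A value well above 1.0 suggests the phrases are related; a value
    near 1.0 suggests they appear together only by chance.
    """
    n = len(docs)
    has_a = sum(1 for d in docs if phrase_a in d)
    has_b = sum(1 for d in docs if phrase_b in d)
    both = sum(1 for d in docs if phrase_a in d and phrase_b in d)
    if has_a == 0 or has_b == 0:
        return 0.0
    # Documents expected to contain both if the phrases were independent.
    expected = has_a * has_b / n
    return both / expected

# Toy corpus, invented for the example.
docs = [
    {"doctor's appointment", "writing a prescription"},
    {"doctor's appointment", "writing a prescription"},
    {"stars and stripes"},
    {"stars and stripes"},
]
print(cooccurrence_lift(docs, "doctor's appointment", "writing a prescription"))
print(cooccurrence_lift(docs, "doctor's appointment", "stars and stripes"))
```

On a corpus this tiny the numbers mean little; the point is only that relatedness is judged statistically across many documents rather than from any single page.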
These would be phrases that are not merely stemmed versions of the original, but made up of completely different words. A page about "making a doctor's appointment" might well include "the nurse secretary" or "writing a prescription" - and if that trait is shared across a significant number of pages, it can become a kind of relevance predictor. And if too many of the related phrases are all on the same page, then that fact may be a scraped content predictor.
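As a toy illustration of that two-sided idea, this Python sketch counts how many known related phrases appear on a page and labels the result. Everything in it is hypothetical: the related_phrase_signal function, the label strings, and the min_hits and max_hits thresholds are arbitrary values picked for the example, not figures from Google or the patents.

```python
def related_phrase_signal(page_phrases, related_phrases, min_hits=2, max_hits=5):
    """Label a page by how many related phrases it contains.

    min_hits and max_hits are arbitrary illustrative thresholds.
    """
    hits = len(set(page_phrases) & set(related_phrases))
    if hits > max_hits:
        return "possible scraped content"  # suspiciously many related phrases
    if hits >= min_hits:
        return "relevance supported"       # a natural amount of overlap
    return "no signal"

# Hypothetical related phrases for "making a doctor's appointment".
related = {
    "the nurse secretary", "writing a prescription", "waiting room",
    "office hours", "medical history", "insurance card", "follow-up visit",
}

print(related_phrase_signal({"writing a prescription", "waiting room"}, related))
print(related_phrase_signal(related, related))
print(related_phrase_signal({"stars and stripes"}, related))
```

A real system would presumably scale the upper limit to page length and topic rather than use one fixed number, but the shape of the logic is the same: a few related phrases help, too many look unnatural.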