callivert - 2:01 am on May 15, 2007 (gmt 0)
Some questions that have been raised:
One concern that I have is that some very niche topics will be so unique that there won't be sufficient data to find the validating co-occurring phrases.
LSA is designed to solve exactly this problem; in fact, it was invented specifically to deal with the "sparse data" problem. It doesn't need much data about any particular word or phrase in order to place it in the semantic space.
As long as the dataset as a whole is really big, rare words and phrases can still be located easily.
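To make that concrete, here is a minimal sketch of how an LSA space is built (a toy illustration in numpy, nothing to do with any actual Google system; the corpus, counts, and variable names are invented). The term-document matrix is factored with a truncated SVD, and every term, even one that appears in only a single document, ends up with coordinates in the reduced space because the SVD positions it relative to the whole dataset.

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
# In a real system this would be enormous and sparse.
terms = ["cat", "dog", "pet", "quantum", "physics"]
counts = np.array([
    [2, 0, 1, 0],   # cat
    [0, 2, 1, 0],   # dog
    [1, 1, 2, 0],   # pet
    [0, 0, 0, 3],   # quantum  -- occurs in only one document
    [0, 0, 1, 2],   # physics
], dtype=float)

# LSA = truncated SVD of the term-document matrix.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2  # number of semantic dimensions; real systems use roughly 100-300

term_vectors = U[:, :k] * s[:k]     # one k-dim vector per term
doc_vectors = Vt[:k, :].T * s[:k]   # one k-dim vector per document

# Even "quantum", which shows up in just one document, gets a position
# in the space, because the factorization uses the global structure.
for t, v in zip(terms, term_vectors):
    print(t, v.round(2))
```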
validity of the data
As for using web pages versus books and the validity of the data, this is a relatively minor problem. Google can tell the difference between a kick-ass LSA space and a bad one, and building a good one is not very difficult; it's been done many times by many different groups.
As for the technology being patented, well, Google has lots of money to license it.
words versus phrases.
With semantic spaces, words and phrases are handled the same way: everything becomes a vector. It is just as easy to find a document that's similar to a five-word (or ten-word) phrase as to a single word, and that holds for any arbitrary phrase, even if the phrase itself never occurs in any of the documents.
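Continuing the toy sketch above (it reuses the hypothetical terms, U, s, and k), this is roughly how an arbitrary phrase gets a vector: the standard LSI "fold-in" treats the phrase as a tiny pseudo-document and projects it with the same SVD factors, so it gets coordinates even though the exact phrase never appears in the corpus. Again, just an illustrative sketch, not anything Google has published.

```python
import numpy as np

def phrase_vector(phrase_terms):
    """Fold a word or phrase into the semantic space by treating it as a
    pseudo-document and projecting with the SVD factors: q_hat = q . U_k / s_k.
    Uses terms, U, s, k from the sketch above."""
    q = np.zeros(len(terms))
    for t in phrase_terms:
        if t in terms:
            q[terms.index(t)] += 1.0
    return q @ U[:, :k] / s[:k]

# One word or a five-word phrase -- the mechanics are identical,
# even if the exact phrase never occurs in any document.
print(phrase_vector(["cat"]))
print(phrase_vector(["quantum", "physics", "cat"]))
```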
Yes, it's computationally costly. However, most of the cost can be shifted to the back end, i.e. the indexing of pages, rather than the retrieval of pages. If anything, retrieval should be faster when every document is represented by a vector of around 200 numbers.
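Building once more on the sketch above (the doc_vectors and phrase_vector names are my own invention), the point about cost splits cleanly in code: the expensive SVD runs once at indexing time, while a query is just a similarity sweep over short precomputed vectors.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two low-dimensional vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def search(phrase_terms, top_n=3):
    # Query time: fold the phrase in, then rank documents by similarity.
    qv = phrase_vector(phrase_terms)
    scores = [(cosine(qv, dv), doc_id) for doc_id, dv in enumerate(doc_vectors)]
    return sorted(scores, reverse=True)[:top_n]

print(search(["pet", "cat"]))
```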