callivert - 10:03 am on Dec 15, 2007 (gmt 0)
Tags are a separate issue.
Here are the basics of LSI (latent semantic indexing). If you have a really, really big collection of documents (and Google does), you can create a pretty good vector representation of every word that appears in it. Vectors can be added together, so you can represent whole documents too.
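To make the "vectors can be added together" point concrete, here is a rough sketch. The word vectors are toy values I made up for illustration; a real LSI system would derive them from a decomposition of a huge term-document matrix. The names WORD_VECTORS and document_vector are hypothetical.

```python
# Toy 3-dimensional word vectors. These numbers are invented for illustration;
# real LSI vectors would come from decomposing a large term-document matrix.
WORD_VECTORS = {
    "car":    [0.9, 0.1, 0.0],
    "engine": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

def document_vector(words, word_vectors):
    """Build a document vector by summing the vectors of its known words."""
    dims = len(next(iter(word_vectors.values())))
    total = [0.0] * dims
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:
            continue  # words without a vector are simply skipped
        total = [t + v for t, v in zip(total, vec)]
    return total

# A "document" made of two known words and one unknown word.
doc = document_vector(["car", "engine", "unknown"], WORD_VECTORS)
```

The resulting list lives in the same space as the word vectors, which is what lets words and documents be compared with each other.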
This means you have an estimate of...
* how similar two words are to each other;
* how similar any document is to any word;
* how similar two documents are to each other.
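Similarity scores are the cosine of the angle between two vectors. In practice they mostly fall between 0 and 1: 1 means "identical" (the vectors point the same way), 0 means "not similar at all". Strictly speaking a cosine can go as low as -1, and LSI vectors can produce negative scores.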
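The cosine calculation itself is only a few lines. This is a minimal sketch in plain Python; the function name cosine_similarity is my own, and returning 0.0 for an all-zero vector is an assumption made to avoid dividing by zero.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an empty/zero vector as "not similar"
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Note that only the direction of the vectors matters, not their length, so a long document and a short one on the same theme can still score close to 1.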
The end result is that you have a better system for matching queries to documents. The search engine can just retrieve the vectors and run a simple calculation. You don't have to do "keyword matches". You don't have to use stemming. LSI takes care of all that. It takes into account every word in the document and whether those words share a theme with the query.
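Here is a rough sketch of that retrieval step, again with made-up toy vectors rather than real LSI output. All the names (WORD_VECTORS, text_vector, cosine, rank) are hypothetical. The query word "car" appears in neither document, yet the document about automobiles still ranks first because its vector points in a similar direction.

```python
import math

# Toy 2-dimensional word vectors, invented for illustration only.
WORD_VECTORS = {
    "car":        [0.90, 0.10],
    "automobile": [0.85, 0.15],
    "engine":     [0.80, 0.20],
    "banana":     [0.10, 0.90],
    "fruit":      [0.20, 0.80],
}

def text_vector(text):
    """Sum the vectors of every known word in a whitespace-separated text."""
    total = [0.0, 0.0]
    for word in text.lower().split():
        vec = WORD_VECTORS.get(word)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank(query, documents):
    """Return (score, document) pairs, best match first."""
    query_vec = text_vector(query)
    scored = [(cosine(query_vec, text_vector(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

documents = ["automobile engine", "banana fruit"]
ranked = rank("car", documents)
best_match = ranked[0][1]
```

In a real engine the document vectors would be computed once and stored, so answering a query costs one vector lookup per query word plus the cosine calculations.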
Simple. Cheap. And (moderately) effective.