I appreciated the writer's ability to explain Google's semantic advances in a way the average person can follow. Anyone who has wrestled with site search for, say, a million or more pages has got to be a bit awestruck at the huge job Google has taken on, and can see why their successes in this area help them dominate the current market.
Google's synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein's theories about how words are defined by context.
Sometime in 2001, Singhal learned of poor results when people typed the name "audrey fino" into the search box. Google kept returning Italian sites praising Audrey Hepburn... "We realized that this is actually a person's name," Singhal says. "But we didn't have the smarts in the system."
...he had to master the black art of "bi-gram breakage" — that is, separating multiple words into discrete units. For instance, "new york" represents two words that go together (a bi-gram). But so would the three words in "new york times," which clearly indicate a different kind of search. And everything changes when the query is "new york times square."
That talk about bi-grams (n-grams in general) reminds me: has anyone here played with Google's publicly released n-gram data set [googleresearch.blogspot.com], built from roughly a trillion words of web text and often called the "1T corpus"?
That would take some serious computing power to deal with, but I'd love to have a go at it.
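For anyone who wants to get a feel for the "new york times square" problem before touching a data set that size, here is a minimal sketch in Python. It is only my guess at one simple way n-gram counts could be used to split a query. It is not Google's method, and the counts in PHRASE_COUNTS and the TOTAL normaliser are invented for illustration, not taken from the real corpus. The idea is to try every way of cutting the query into phrases of up to three words and keep the split whose phrases are, taken together, the most probable.

```python
import math
from functools import lru_cache

# Hypothetical toy counts, made up for this example (NOT real corpus figures).
PHRASE_COUNTS = {
    "new": 1000, "york": 400, "times": 600, "square": 300,
    "new york": 900, "york times": 5, "times square": 450,
    "new york times": 500,
}
TOTAL = 10_000   # made-up normaliser that turns counts into rough probabilities
MAX_WORDS = 3    # longest phrase we are willing to treat as one unit


def log_prob(phrase):
    """Log-probability of a phrase, or None if it is an unseen multi-word phrase."""
    count = PHRASE_COUNTS.get(phrase)
    if count is None:
        if " " in phrase:
            return None   # never invent a multi-word unit we have not seen
        count = 1         # small floor so unknown single words still work
    return math.log(count / TOTAL)


def segment(query):
    """Split a query into the most probable sequence of known phrases."""
    words = tuple(query.lower().split())

    @lru_cache(maxsize=None)
    def best(start):
        # Best (score, phrases) for the words from position `start` onward.
        if start == len(words):
            return 0.0, ()
        candidates = []
        last = min(start + MAX_WORDS, len(words))
        for end in range(start + 1, last + 1):
            phrase = " ".join(words[start:end])
            lp = log_prob(phrase)
            if lp is None:
                continue
            rest_score, rest_phrases = best(end)
            candidates.append((lp + rest_score, (phrase,) + rest_phrases))
        return max(candidates)

    return list(best(0)[1])


for q in ("new york", "new york times", "new york times square"):
    print(q, "->", segment(q))
```

With these toy numbers the three-word query stays together as one unit, while adding "square" flips the split to "new york" plus "times square", which is the behaviour the article describes. Real counts would obviously change the details, and a production system would need far more than this.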
[edited by: tedster at 11:58 pm (utc) on Feb 23, 2010]