|Google synonyms and the tilde operator|
Are the synonyms that Google shows me when I search for ~command the exact list of synonyms?
No, it's not the exact list at all. It's just the top few synonyms that have the strongest statistical correlation. In fact, Google has a huge pile of data about semantic relationships and sometimes the SERPs themselves can be a better tip-off than the tilde operator!
The key discussion and patent to understand is Phrase-based indexing [webmasterworld.com]. Next to that (and possibly BEFORE wrapping your brain around that) is Phrase Based Multiple Indexing and Keyword Co-Occurrence [webmasterworld.com].
Note, this all started back in 2006, and by now it has become quite mature and advanced.
>> huge pile of data about semantic relationships
Have you played much with Google Correlate? Some interesting things shake out from there if you experiment - some surprising connections showing the promise and limitations of using statistical correlation.
Maybe I should give an example of one experiment I ran. If you grab a table of US obesity rates by state and plug them into Correlate, you'll notice that the search terms that match that pattern most closely, according to Google Correlate, mostly relate to rap music.
A friend who is a professional statistician and I were discussing this, and his guess at what's happening is:
1. You're only starting with 51 data points, so there's a lot of noise in the signal.
2. To save cycles, Google does a first pass approximation and then a second pass (I think - check this). If it finds something promising on the first pass, it explores that avenue some more.
3. So if you get some semi-random connection because of limited data or the random occurrence of two curves that match for no good reason, Google will look at curves for related terms to see which of those match, which means it has a self-reinforcing aspect.
So the end result is that when Google tries to build statistical correlations between a data set and a search, it can get pretty wacky. On the other hand, it can also be used to accurately predict flu outbreaks in the US in advance of official CDC warnings.
In short, it can be useful, but one must exercise caution. As the old saw goes, "correlation is not causation".
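To make point 1 above concrete, here is a minimal sketch of how a short series picks up strong matches by pure chance. Everything in it is made up for illustration: the data is random noise, the function names are hypothetical, and it says nothing about how Google Correlate actually works. It only shows that with 51 points, the best of a few hundred unrelated curves tends to correlate far more strongly than it would with a longer series.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def best_chance_match(n_points, n_candidates, seed=0):
    """Strongest |correlation| between one random series and many unrelated ones.

    Every series is independent noise, so any match found here is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best

# 51 points, like one value per US state plus DC (assumed to mirror the example).
few = best_chance_match(n_points=51, n_candidates=500)
# Same number of unrelated candidates, but a much longer series.
many = best_chance_match(n_points=1000, n_candidates=500)

print(f"best chance match with 51 points:   {few:.2f}")
print(f"best chance match with 1000 points: {many:.2f}")
```

The more candidate curves you compare against, the stronger the best accidental match gets, and a search engine has a lot more than 500 query curves to choose from.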
|limitations of using statistical correlation|
In the early days of the tilde operator I saw the [~bread] search give results about the Rolls Royce luxury car. Yes, I suppose it does take a lot of "bread" to own and operate one, but it was still a pretty humorous result.
I expect that's a "real" relationship, Ted, unlike the "rap music" relationship I found on Correlate, which was related only in terms of frequency. Granted, Correlate is not looking at co-occurrence on pages, which is a much simpler problem. It's mapping far more complex problems with far less data.
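For contrast, here is a toy sketch of what "co-occurrence on pages" can look like. The pages, the function name, and the scoring (Jaccard overlap of the sets of pages each word appears on) are all my own assumptions for illustration, not Google's method or the approach in the phrase-based indexing patent.

```python
from collections import defaultdict

def related_terms(documents, term, top_n=3):
    """Rank words by how much their pages overlap with the pages containing `term`."""
    docs_with = defaultdict(set)
    for doc_id, text in enumerate(documents):
        # Count each word once per page.
        for word in set(text.lower().split()):
            docs_with[word].add(doc_id)

    target_docs = docs_with.get(term, set())
    scores = {}
    for word, docs in docs_with.items():
        if word == term:
            continue
        overlap = len(target_docs & docs)
        if overlap:
            # Jaccard score: shared pages divided by all pages with either word.
            scores[word] = overlap / len(target_docs | docs)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical mini "web" of five pages.
pages = [
    "bread flour yeast oven",
    "bread yeast dough",
    "bread money slang",
    "car engine luxury",
    "luxury car money",
]

for word, score in related_terms(pages, "bread"):
    print(f"{word}: {score:.2f}")
```

Even in this tiny set, "money" shows up as weakly related to "bread" through the slang page, which is the same kind of path that could lead a real system from bread to a luxury car.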