Google synonyms and the tilde operator

   
10:49 pm on Dec 7, 2012 (gmt 0)

5+ Year Member



Are the synonyms that Google shows me when I do ~command the exact list of synonyms?
11:14 pm on Dec 8, 2012 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



No, it's not the exact list at all. It's just the top few synonyms that have the strongest statistical correlation. In fact, Google has a huge pile of data about semantic relationships and sometimes the SERPs themselves can be a better tip-off than the tilde operator!

The key discussion and patent to understand is Phrase-based indexing [webmasterworld.com]. Next to that (and possibly BEFORE wrapping your brain around that) is Phrase Based Multiple Indexing and Keyword Co-Occurrence [webmasterworld.com]

Note, this all started back in 2006, and by now it has become quite mature and advanced.
2:33 am on Dec 9, 2012 (gmt 0)

WebmasterWorld Administrator ergophobe is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



>> huge pile of data about semantic relationships

Have you played much with Google Correlate? Some interesting things shake out from there if you experiment - some surprising connections showing the promise and limitations of using statistical correlation.
2:39 am on Dec 9, 2012 (gmt 0)

WebmasterWorld Administrator ergophobe is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Maybe I should give an example of one experiment I ran. If you grab a table of US obesity rates by state and plug it into Correlate, you'll notice that the terms that match that pattern most closely, according to Google Correlate, mostly relate to rap music.

A friend who is a professional statistician and I were discussing this and his guess at what's happening is

1. You're only starting with 51 data points, so there's a lot of noise in the signal.

2. To save cycles, Google does a first pass approximation and then a second pass (I think - check this). If it finds something promising on the first pass, it explores that avenue some more.

3. So if you get some semi-random connection because of limited data or the random occurrence of two curves that match for no good reason, Google will look at curves for related terms to see which of those match, which means it has a self-reinforcing aspect.
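To put a rough number on point 1, here is a minimal sketch in Python. It is not how Google Correlate works internally; the function names, the use of Gaussian noise, and the candidate count of 2000 are all assumptions chosen for illustration. It only shows that when you compare one short series against many unrelated series, the best match found purely by chance looks much stronger with 51 points than with 500.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den


def best_chance_match(n_points, n_candidates, rng):
    """Strongest |r| between one random target and many unrelated random series.

    Every series here is pure noise (an assumption for illustration), so any
    correlation that turns up is spurious by construction.
    """
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


rng = random.Random(0)  # fixed seed so the run is repeatable

# 51 points stands in for "one value per state plus DC";
# 500 points stands in for a much longer series.
few = best_chance_match(51, 2000, rng)
many = best_chance_match(500, 2000, rng)

print(f"best spurious |r| with  51 points: {few:.2f}")
print(f"best spurious |r| with 500 points: {many:.2f}")
```

The 51-point figure should come out well above the 500-point one. A real search compares against far more than 2000 candidate terms, which would only make the small-sample problem worse.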

So the end result is that when Google tries to build statistical correlations between a data set and a search, it can get pretty wacky. Yet it can also be used to accurately predict flu outbreaks in the US in advance of official CDC warnings.

In short, it can be useful, but one must exercise caution. As the old saw goes, "correlation is not causation".
3:51 am on Dec 9, 2012 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



>> limitations of using statistical correlation

In the early days of the tilde operator I saw the [~bread] search give results about the Rolls-Royce luxury car. Yes, I suppose it does take a lot of "bread" to own and operate one, but it was still a pretty humorous result.
10:11 pm on Dec 9, 2012 (gmt 0)

WebmasterWorld Administrator ergophobe is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I expect that's a "real" relationship, Ted, unlike the "rap music" relationship I found on Correlate, which was related only in terms of frequency. Granted, Correlate is not looking at co-occurrence on pages, which is a much simpler problem. It's mapping far more complex problems with far less data.