tedster - 7:01 pm on May 1, 2010 (gmt 0) [edited by: tedster at 7:05 pm (utc) on May 1, 2010]
Another short survey question for long tail problems - how many words long, on average, are the phrases that lost their traffic? Are they more than 5 words long?
I'm wondering if Google has made a change in their phrase-based indexing approach - something that the new Caffeine infrastructure makes feasible. Recently there has been more patent activity in that area.
Indexing of phrases is typically avoided because of the perceived computational and memory requirements to identify all possible phrases of say three, four, or five or more words.
For example, on the assumption that any five words could constitute a phrase, and that a large corpus would have at least 200,000 unique terms, there would be approximately 3.2 × 10^26 possible phrases, clearly more than any existing system could store or otherwise programmatically manipulate.
Index server architecture using tiered and sharded phrase posting lists [patft.uspto.gov]
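For anyone who wants to check the patent's arithmetic, here is a quick sketch in Python. The function name possible_phrases is just my own label, not anything from the patent, and it assumes the same simplification the patent makes: any ordered sequence of five terms counts as a candidate phrase.

```python
def possible_phrases(vocabulary_size: int, phrase_length: int) -> int:
    """Count ordered word sequences of a given length (repeats allowed).

    Hypothetical helper for illustration only; it mirrors the patent's
    assumption that any combination of words could be a phrase.
    """
    return vocabulary_size ** phrase_length


# 200,000 unique terms, five-word phrases, as in the quoted passage.
count = possible_phrases(200_000, 5)
print(f"{count:.1e}")  # 3.2e+26
```

That matches the figure in the quote, and it shows why brute-force indexing of every possible five-word sequence was never realistic: a phrase index has to be selective about which phrases it stores.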
In other words, until recently, queries for long phrases may have returned something like "best guess" results based on secondary signals -- but now Google may have the infrastructure to index longer phrases much more directly.
It's a brainstorm idea at present, and not a solid "statement of fact". But hey, we have to start somewhere.
As a side note, thanks to Google I now remember how to spell caffeine!