Forum Moderators: Robert Charlton & goodroi
Example: search using the words:
kwd1 kwd2 kwd3
gives different results than
kwd3 kwd1 kwd2
Does Google filter by the first word, then the second, then the third?
I can't find any online documentation that explains why this happens. Does anyone know of any? What causes this?
-thanks.
<Sorry, no specific keywords.
See Forum Charter [webmasterworld.com]>
[edited by: tedster at 2:58 pm (utc) on Jan. 20, 2007]
Matching word groups (or n-grams) is quite complicated in practice (when you are working with billions of documents), but the idea is simple enough to grasp. Take this simplified example:
Search terms:
A bakers dozen
A dozen bakers
Two documents to search:
The number thirteen is said to be a bakers dozen.
The worlds largest cake took a dozen bakers to make.
Both documents have the same number of matches if you count the words separately, but they talk about very different things (context). If you look for pairs of words (in the first search, "a bakers" and "bakers dozen") or for all 3 words in order, you can give a higher score to these matches. In practice you would also want to weight a phrase by its "informational value": words that appear very often usually carry less "value" than ones that are less common (hence "bakers dozen" has more informational value than "a bakers").
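To make the example above concrete, here is a minimal Python sketch of order-aware scoring. It is only an illustration of the idea, not how Google actually scores anything: the function names (tokens, ngrams, order_score) and the weighting (an n-word match is worth n points) are my own assumptions, and it ignores the "informational value" weighting entirely.

```python
import re

def tokens(text):
    # Lowercase and split into simple alphanumeric words.
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(words, n):
    # All runs of n consecutive words, as tuples.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def order_score(query, document, max_n=3):
    # Hypothetical scoring rule: each query n-gram found in the
    # document adds n points, so longer in-order matches count more.
    q = tokens(query)
    d = tokens(document)
    score = 0
    for n in range(1, max_n + 1):
        doc_grams = set(ngrams(d, n))
        for gram in ngrams(q, n):
            if gram in doc_grams:
                score += n
    return score

doc1 = "The number thirteen is said to be a bakers dozen."
doc2 = "The worlds largest cake took a dozen bakers to make."

for query in ("A bakers dozen", "A dozen bakers"):
    print(query, order_score(query, doc1), order_score(query, doc2))
```

Counting single words alone, both documents tie for either search. Once the 2-word and 3-word matches are added in, each search clearly prefers the document that has the words in the same order.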
Google has made huge amounts of n-gram data publicly available (for the linguistics community). There is no doubt that a mind-boggling amount of processing goes on at Google in this area.
Lots of characteristics can be inferred from the order of words. Decisions on the subject of a document can be made without having to "understand" the document, by comparing the frequency of n-grams in the document to their frequency in a much larger dataset (the web). It's a great way to decide which phrases are "important" on a page that has AdSense on it.
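A rough sketch of that comparison, in Python. Everything here is an assumption for illustration: the background counts are made-up toy numbers (a real system would use web-scale statistics), the function names are hypothetical, and the log-ratio with crude add-one smoothing is just one simple way to rank phrases, not a known Google method.

```python
import math
import re
from collections import Counter

def bigrams(text):
    # Consecutive word pairs from lowercased text.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return list(zip(words, words[1:]))

def distinctive_bigrams(page, background_counts, background_total, top=3):
    # Rank the page's bigrams by how much more frequent they are on
    # the page than in the background data (log of the ratio).
    counts = Counter(bigrams(page))
    total = sum(counts.values())
    scored = []
    for gram, count in counts.items():
        p_page = count / total
        # Add-one smoothing so unseen bigrams do not divide by zero.
        p_background = (background_counts.get(gram, 0) + 1) / (background_total + 1)
        scored.append((math.log(p_page / p_background), gram))
    scored.sort(reverse=True)
    return [gram for _, gram in scored[:top]]

# Toy background counts, invented purely for this example.
toy_background = {
    ("the", "number"): 400, ("number", "thirteen"): 10,
    ("thirteen", "is"): 15, ("is", "said"): 300,
    ("said", "to"): 500, ("to", "be"): 900,
    ("be", "a"): 800, ("a", "bakers"): 20,
    ("bakers", "dozen"): 5,
}
page = "The number thirteen is said to be a bakers dozen."
print(distinctive_bigrams(page, toy_background, 1_000_000, top=2))
```

With these toy numbers, the pairs that are rare in the background ("bakers dozen", "number thirteen") rise to the top, while everyday pairs like "to be" sink, without the code "understanding" anything about the page.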
Google makes excellent use of the data it collects from many areas; AdWords is another fine example. Millions of adverts and phrases are grouped together by hand, by advertisers. The reliability of this data is very high, given that people are paying for these adverts, so Google can look at all of the adverts that are supposed to show for one phrase and statistically predict which other phrases should be similar. Such data is great when looking at how to do 'broad match' etc. Google is better at collecting and manipulating textual data than Yahoo or MSN; hence they have a massive lead when it comes to textual ad serving.
Regardless of how Google's algorithms change (and hence favour you or not), you can be sure that a great deal of importance is put on word order. Pre-calculating n-gram statistics on a huge scale is also such a big task that you probably only need to worry about 3-word combinations at the moment (remembering that 4-word combinations can be fairly well replicated by two 3-word combinations - e.g. "word1 word2 word3" AND "word2 word3 word4" is quite likely to give documents that have "word1 word2 word3 word4").
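The 3-word trick is easy to try out. This is a small Python sketch under my own assumptions (hypothetical function names, a simple word tokenizer); it compares the "two overlapping 3-word matches" test against a true 4-word phrase match.

```python
import re

def words_of(text):
    # Lowercase and split into simple alphanumeric words.
    return re.findall(r"[a-z0-9]+", text.lower())

def ngram_set(text, n):
    # Every run of n consecutive words in the text.
    w = words_of(text)
    return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}

def exact_match(text, phrase):
    # True only if the whole phrase appears in order.
    p = tuple(words_of(phrase))
    return p in ngram_set(text, len(p))

def approx_four_word_match(text, phrase):
    # Approximate a 4-word phrase with its two overlapping 3-word parts.
    p = words_of(phrase)
    if len(p) != 4:
        raise ValueError("phrase must have exactly 4 words")
    trigrams = ngram_set(text, 3)
    return tuple(p[:3]) in trigrams and tuple(p[1:]) in trigrams

phrase = "word1 word2 word3 word4"
doc_a = "Here word1 word2 word3 word4 appears as one phrase."
doc_b = "First word1 word2 word3, and much later word2 word3 word4."

print(approx_four_word_match(doc_a, phrase), exact_match(doc_a, phrase))
print(approx_four_word_match(doc_b, phrase), exact_match(doc_b, phrase))
```

doc_a passes both tests. doc_b passes the approximation but not the exact match, which is why the approximation is only "quite likely" to be right: the two 3-word pieces can turn up in different places on the page.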
Have fun.