joined:Jan 27, 2003
I would consider myself a fairly precise searcher - I know exactly what I'm typing. Nonetheless, Google often searches for what they think I really meant, as opposed to what I actually entered – a process sometimes known as query expansion.
Common examples of query expansion:
Stemming
Google introduced word stemming at least five years ago – [widgets] matches [widgets], [widgeting] and [widgeteering]. The keyword entered is reduced to a root or 'stem' ('widget' in the examples above) and words starting from the same stem can be matched.
Acronyms
An [FAQ] is a set of [frequently asked questions]. Google has impressive mappings of acronyms and initialisms to the full phrase. I think this would be an interesting database if it were ever made available.
Mis-spellings and typos
If you make an obvious typo, Google can include sites that only use the correctly-spelled word (in addition to the "did you mean:" prompt). This is much more obvious with certain queries.
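To make the three common cases concrete, here is a rough Python sketch of how a toy engine might expand a single keyword. It is only an illustration: the vocabulary, the acronym table, the suffix list and the function names are all made up for this example, and nothing here is meant to describe how Google actually does it.

```python
import difflib

# Hypothetical data: a tiny vocabulary standing in for the index,
# and a hand-made acronym table.
VOCABULARY = ["widget", "widgets", "widgeting", "widgeteering",
              "search", "searching"]
ACRONYMS = {"faq": "frequently asked questions"}
SUFFIXES = ("eering", "ing", "s")  # checked longest first


def stem(word):
    """Crude suffix stripping; real stemmers are far more careful."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand(term):
    """Return the set of terms a toy engine might match for one keyword."""
    term = term.lower()
    expanded = {term}

    # Acronyms: add the full phrase if we know one.
    if term in ACRONYMS:
        expanded.add(ACRONYMS[term])

    # Typos: if the word is unknown, fall back to the closest known word.
    if term not in VOCABULARY:
        close = difflib.get_close_matches(term, VOCABULARY, n=1, cutoff=0.8)
        if close:
            term = close[0]
            expanded.add(term)

    # Stemming: add every known word that shares the same stem.
    root = stem(term)
    expanded.update(word for word in VOCABULARY if stem(word) == root)
    return expanded
```

With this toy data, sorted(expand("widgets")) gives ['widget', 'widgeteering', 'widgeting', 'widgets'], expand("FAQ") adds the full phrase, and the typo "widgts" is matched to "widgets" before stemming.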
Less common examples of query expansion:
Synonyms may be too narrow a definition, since the search operator for synonyms (~) reveals words that seem to have been derived from co-occurrence data and that have very distinct meanings. Nonetheless, it seems to be possible for Google to expand your query to include related words.
I see reflections of this in the interesting search result translation [translate.google.com] service. In some instances, Google seems to translate search keywords into other languages and return results from that language. I haven't really pinned down the pattern as to which queries (and pages) get this treatment, but I've seen quite a few examples where non-English keywords match English pages.
I occasionally see searches where words appear to have been dropped completely from the query. It's possible that certain keywords might be deemed to lack significance, and can return results with those words omitted from the search. I've only seen a few examples that point directly to this behaviour.
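One way a keyword could be "deemed to lack significance" is by how common it is across the index. The sketch below uses inverse document frequency to split a query into required and droppable words. This is purely a guess at a mechanism: the document counts, the threshold and the function names are invented for the example.

```python
import math

# Hypothetical counts: how many of N_DOCS documents contain each word.
N_DOCS = 1000
DOC_FREQ = {"widget": 40, "error": 120, "problem": 900, "bug": 850}


def idf(word):
    """Inverse document frequency; unseen words are treated as very rare."""
    return math.log(N_DOCS / DOC_FREQ.get(word, 1))


def split_query(query, threshold=0.5):
    """Split a query into words that must match and words that may be dropped."""
    required, optional = [], []
    for word in query.lower().split():
        if idf(word) >= threshold:
            required.append(word)
        else:
            optional.append(word)
    return required, optional
```

With these made-up counts, split_query("bug widget error") returns (['widget', 'error'], ['bug']), so a very common word ends up optional no matter where it appears in the query.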
Interestingly, not all content in the index gets the query-expansion treatment. I've seen results that suggest a more wide-reaching characteristic of URLs likely to get fuzzy-matching, but that's probably for another day ;)
In most of the common cases, it seems clear that rewriting of the search query can occur, even if the expanded words are not:
Of course, in many cases one of the above conditions is true, which can make finding true examples of query expansion much more difficult. In addition, many of the processes involved seem to be based on aggregated data from content within the index, or based on user search behaviour - which means that there is more or less useful data available depending on the popularity and frequency of occurrence of the search keyword.
Whether query expansion occurs also seems to be related to the entire search query - certain formulations are much more likely to trigger expansion than others. Possibly this has both linguistic (e.g. not expanding a word that is used as part of a common phrase) and statistical (e.g. based on user behaviour) aspects.
Does anyone know any other examples of Google's query expansion capabilities, or have any other observations?
Note to other power searchers - prefix each search keyword with a plus symbol to bypass most query expansion processes.
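If you do this a lot, the plus-prefixing is easy to script. Here is a small Python sketch, assuming the + operator behaves as described above; the function name force_exact is made up, and quoted phrases and words that already carry an operator (+, - or ~) are left untouched.

```python
import re


def force_exact(query):
    """Prefix every bare keyword with '+', leaving operators and phrases alone."""
    # Split into quoted phrases and whitespace-separated words.
    tokens = re.findall(r'"[^"]*"|\S+', query)
    result = []
    for token in tokens:
        if token[0] in '+-~"':
            result.append(token)        # already has an operator, or is a phrase
        else:
            result.append("+" + token)  # plain keyword: force an exact match
    return " ".join(result)
```

For example, force_exact('bug widget -blue') returns '+bug +widget -blue', and force_exact('"widgety widget" problem') returns '"widgety widget" +problem'.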
That happens to me particularly in longer technical searches. I often need to remember the + sign to get the word "bug" or "problem" included in the result set. Without the +, that key term can sometimes be ignored even when I make it the first word.
joined:Jan 27, 2003
The highlighting function Google uses (understandably) doesn't support expanded queries; the obvious question is: what would it highlight?
Google seems to dilute results for search phrases between quotation marks with matches that clearly should not be there. This happens not only when there is NO instance of the phrase in their index (e.g. they didn't find a single "widgety widgeteering widget") but also when they decide to ignore the phrase, or just one word from it.
It gets on my nerves every time, especially because these are, as in tedster's example, usually tech (support) related queries.