Oliver_Henniges - 7:01 pm on Feb 18, 2007 (gmt 0)
I believe that the "good phrase list" is NOT computed in the running applications of this patent as described; it was compiled somewhere else beforehand. This may have interesting consequences for SEO: Under what circumstances will new phrases make it onto the list? Can you regain some limited control over the algo by artificially helping certain phrases over this threshold? The list is not compiled the way the patent describes (though the figures given there might be helpful), but it cannot be static either, because otherwise new topics would never be recognized, and I'd really be surprised to hear that spammers don't target keywords currently in the news.

Another keyhole for regaining influence might be the asymmetry of the co-occurrence matrix, which the algo produces after zeroing out those pairs of related phrases that stay below the information-gain threshold. This asymmetry flows directly into the data of the phrase clusters compiled later: I could imagine that areas exist where only ten or twenty pages containing "Bill Clinton" AND "purse designer" WITHOUT "Monica Lewinsky" might force Google to reevaluate this cluster matrix, though of course not for this particular example. The patent does not say that violations of this asymmetry would trigger a filter; co-occurrences below the threshold are simply deleted. I'd speculate that the AdWords keyword suggestion tool provides interesting data for an analysis of this asymmetry.
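To make the asymmetry point concrete, here is a rough Python sketch of the zeroing-out step. Everything in it is an assumption for illustration: the `GAIN` numbers are invented, the threshold of 100 is just a placeholder, and the function names are mine, not the patent's. It only shows how dropping sub-threshold pairs can leave one-directional relations behind.

```python
# Hypothetical directed information-gain scores: GAIN[a][b] says how strongly
# phrase a predicts phrase b. All numbers are invented for illustration.
GAIN = {
    "Bill Clinton": {"President": 140.0, "Monica Lewinsky": 120.0, "purse designer": 3.0},
    "President": {"Bill Clinton": 60.0, "Monica Lewinsky": 10.0, "purse designer": 0.2},
    "Monica Lewinsky": {"Bill Clinton": 150.0, "President": 40.0, "purse designer": 110.0},
    "purse designer": {"Bill Clinton": 1.0, "President": 0.5, "Monica Lewinsky": 20.0},
}


def zero_below_threshold(gain, threshold):
    """Return a copy of the matrix with every entry not above the threshold set to 0."""
    return {
        a: {b: (g if g > threshold else 0.0) for b, g in row.items()}
        for a, row in gain.items()
    }


def one_way_pairs(pruned):
    """List (a, b) pairs where a still predicts b but b no longer predicts a."""
    pairs = []
    for a, row in pruned.items():
        for b, g in row.items():
            if g > 0.0 and pruned.get(b, {}).get(a, 0.0) == 0.0:
                pairs.append((a, b))
    return sorted(pairs)


if __name__ == "__main__":
    pruned = zero_below_threshold(GAIN, threshold=100.0)  # placeholder threshold
    for a, b in one_way_pairs(pruned):
        print(f"{a} -> {b} survives, the reverse direction was zeroed out")
```

With these made-up figures, "Bill Clinton" and "Monica Lewinsky" keep predicting each other, while "Monica Lewinsky" -> "purse designer" survives in one direction only. That one-way residue is the kind of asymmetry I mean.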
Today I tried to understand "Phrase identification in an information retrieval system," which seems fairly fundamental to the spam patent. Paragraph [0104] reads:

[0104] For example, assume the good phrase "Bill Clinton" is related to the phrases "President", "Monica Lewinsky", because the information gain of each of these phrases with respect to "Bill Clinton" exceeds the Related Phrase threshold. Further assume that the phrase "Monica Lewinsky" is related to the phrase "purse designer". These phrases then form the set R. To determine the clusters, the indexing system 110 evaluates the information gain of each of these phrases to the others by determining their corresponding information gains. Thus, the indexing system 110 determines the information gain I("President", "Monica Lewinsky"), I("President", "purse designer"), and so forth, for all pairs in R. In this example, "Bill Clinton," "President", and "Monica Lewinsky" form a one cluster, "Bill Clinton," and "President" form a second cluster, and "Monica Lewinsky" and "purse designer" form a third cluster, and "Monica Lewinsky", "Bill Clinton," and "purse designer" form a fourth cluster. This is because while "Bill Clinton" does not predict "purse designer" with sufficient information gain, "Monica Lewinsky" does predict both of these phrases.
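The patent paragraph does not spell out the exact clustering rule, so the following is only one possible reading, sketched in Python: a set of phrases counts as a cluster if at least one member (a "hub") predicts every other member above the threshold. The `GAIN` values, the threshold, and the names `predicts` and `is_cluster` are all hypothetical.

```python
# Hypothetical directed information-gain scores, invented for illustration.
GAIN = {
    "Bill Clinton": {"President": 140.0, "Monica Lewinsky": 120.0, "purse designer": 3.0},
    "President": {"Bill Clinton": 60.0, "Monica Lewinsky": 10.0, "purse designer": 0.2},
    "Monica Lewinsky": {"Bill Clinton": 150.0, "President": 40.0, "purse designer": 110.0},
    "purse designer": {"Bill Clinton": 1.0, "President": 0.5, "Monica Lewinsky": 20.0},
}
THRESHOLD = 100.0  # placeholder value, not taken from the patent


def predicts(a, b, gain=GAIN, threshold=THRESHOLD):
    """True if phrase a predicts phrase b with gain above the threshold."""
    return gain.get(a, {}).get(b, 0.0) > threshold


def is_cluster(phrases, gain=GAIN, threshold=THRESHOLD):
    """Assumed rule: some hub phrase in the set predicts all other members."""
    members = list(phrases)
    if len(members) < 2:
        return False
    return any(
        all(predicts(hub, other, gain, threshold) for other in members if other != hub)
        for hub in members
    )


if __name__ == "__main__":
    candidates = [
        ("Bill Clinton", "President", "Monica Lewinsky"),
        ("Bill Clinton", "President"),
        ("Monica Lewinsky", "purse designer"),
        ("Monica Lewinsky", "Bill Clinton", "purse designer"),
        ("Bill Clinton", "purse designer"),
    ]
    for candidate in candidates:
        print(candidate, is_cluster(candidate))
```

Under this reading the four clusters from the quoted example come out as clusters, and "Bill Clinton" plus "purse designer" alone does not, because neither predicts the other. Note that the same rule would also accept the pair "Bill Clinton" and "Monica Lewinsky", which the patent text does not list, so the real rule may well differ.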