tedster - 7:53 pm on Jul 20, 2011 (gmt 0)
The purpose of the Panda process is to build an accurate predictive model that separates "shallow" content from "high quality" content. To do that, Google first asked their quality raters to create two seed sets of web pages - those that are clearly top quality and those that are obviously not.
Now it comes time to find a predictive algorithm, and the process begins by ranging over all the data signals Google stores for those two seed sets (whether those signals are currently used for ranking or not).
The machine learning program needs to find a combination of factors, mixed and matched in various ways, that reproduces those two seed sets automatically - essentially duplicating the ratings that the humans had already assigned. Once a combination of factors is found that predicts good/bad quality with a high degree of accuracy, that combination is run over the signals data for other pages that were not previously scored by human raters. The accuracy of the model can then be tested further by looking at those results.
After more and more tweaking of the predictive algorithm, the ideal is to arrive at a highly accurate model that can be used live.
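To make the workflow concrete, here's a toy sketch of that seed-set process: raters supply labels, a learner fits a cutoff on one signal to reproduce those labels, and the fitted cutoff is then applied to a page the raters never saw. The signal name "original_text" and all the numbers are invented for illustration - the real system obviously uses far more signals and a far richer model.

```python
# Toy sketch of the seed-set workflow: human raters label pages,
# a learner fits those labels, and the fitted model is then applied
# to unrated pages. Signal name and values are invented.

def fit_threshold(pages, labels, signal):
    """Pick the cutoff on one signal that best reproduces the labels."""
    best = (0.0, None)  # (accuracy, threshold)
    for t in sorted({p[signal] for p in pages}):
        preds = [p[signal] >= t for p in pages]
        acc = sum(p == l for p, l in zip(preds, labels)) / len(labels)
        if acc > best[0]:
            best = (acc, t)
    return best

# Seed sets: quality raters say True = high quality, False = shallow.
seed_pages  = [{"original_text": x} for x in (0.9, 0.8, 0.7, 0.2, 0.1, 0.3)]
seed_labels = [True, True, True, False, False, False]

acc, cutoff = fit_threshold(seed_pages, seed_labels, "original_text")

# The fitted cutoff is then applied to a page the raters never saw.
unrated = {"original_text": 0.85}
print(acc, cutoff, unrated["original_text"] >= cutoff)  # → 1.0 0.7 True
```

A single-signal cutoff like this is the simplest possible learner; the point is only the shape of the pipeline, not the model.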
The "decisions" are whether any given signal actually belongs in the predictive model, along with how it gets combined with other signals. Maybe signal A predicts quality content but only when signals B and C are also present. Or maybe the prediction is more accurate when signal C is absent but D is present. Maybe one set of signals can be grouped, and if the group reaches a certain threshold, the algorithm loops back to an earlier position in the tree and wipes out or modifies some earlier score.
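That kind of conditional logic is easy to illustrate. Here's a hand-coded version of the two hypothetical rules above, with the signal names A through D standing in for whatever real signals Google might use:

```python
# Toy illustration (invented signals) of conditional signal logic:
# signal A only counts as a quality signal when B and C are present,
# and a second rule fires when C is absent but D is present.

def looks_high_quality(page):
    a, b, c, d = (page.get(k, False) for k in "abcd")
    if a and b and c:   # A predicts quality, but only alongside B and C
        return True
    if d and not c:     # or: D is present while C is absent
        return True
    return False

print(looks_high_quality({"a": True, "b": True, "c": True}))  # → True
print(looks_high_quality({"a": True}))                        # → False (A alone is not enough)
print(looks_high_quality({"d": True}))                        # → True
```

The difference in the real system is that nobody writes these rules by hand - the machine learning process discovers them.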
The number of signals or factors involved is massive, and the possible ways of combining them are also immense. The machine learning program would generate all kinds of possible "trees", seeing which one does the best job of predicting. The final decision tree is the outline of how those factors all inter-relate in the final predictive model.
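At toy scale, that search can be sketched as a brute-force loop: enumerate candidate combinations of signals, score each one against the raters' labels, and keep the winner. This is a stand-in, not what Google actually runs - real tree learners search vastly richer structures than simple AND-combinations, and the signals and labels here are invented.

```python
# Hypothetical brute-force stand-in for the machine-learning search:
# enumerate tiny "trees" (here, just AND-combinations of boolean
# signals) and keep whichever best reproduces the raters' labels.

from itertools import combinations

signals = ["a", "b", "c", "d"]

# Seed pages with rater labels (True = high quality).
seed = [
    ({"a": 1, "b": 1, "c": 1, "d": 0}, True),
    ({"a": 1, "b": 1, "c": 0, "d": 0}, False),
    ({"a": 0, "b": 1, "c": 1, "d": 1}, False),
    ({"a": 1, "b": 1, "c": 1, "d": 1}, True),
]

def accuracy(combo):
    """How well 'all signals in combo present' reproduces the labels."""
    preds = [all(page[s] for s in combo) for page, _ in seed]
    return sum(p == label for p, (_, label) in zip(preds, seed)) / len(seed)

best_combo = max(
    (combo for r in range(1, len(signals) + 1)
           for combo in combinations(signals, r)),
    key=accuracy,
)
print(best_combo, accuracy(best_combo))
```

With four signals there are only 15 combinations to try; with the signal counts Google works with, exhaustive search is impossible, which is exactly why heuristic tree-building algorithms are needed.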
A friend of mine is involved in a contest building a similar model for predicting movie box office hits based on known pre-release factors. That's a lot smaller data set, and a lot less dynamic than Google's, too. But still, he tells me that it's immensely challenging.