What is the "Decision" in Decision Tree Processing?

Lapizuli

6:59 pm on Jul 20, 2011 (gmt 0)

The Google engineer who supposedly worked on the "Panda" update is said to be an expert in "decision tree processing."

I've been trying and trying to figure out what that is, looking here:

[en.wikipedia.org...]

and here

[research.google.com...]

And here:

[seobythesea.com...]

And other places written purely in technospeak.

What I haven't been able to get a grasp on is what decisions are actually the decisions in question.

Are they the decisions the algorithm makes in interpreting the data?

Are they the decisions of the user, which are better understood by the algorithm?

Something else?

Can anyone (Tedster?) explain?

tristanperry

7:38 pm on Jul 20, 2011 (gmt 0)

I haven't read those links, but I was always taught (I have a computer science degree) that the decisions are the ones the algorithm makes when it comes to understanding or acting on a set of data.

So yep, from the perspective of the algorithm.

tedster

7:53 pm on Jul 20, 2011 (gmt 0)

The purpose of the Panda process is to have an accurate predictive model of "shallow" content as opposed to "high quality" content. To do that, Google first asked their quality raters to create two seed sets of web pages - those that are clearly top quality and those that are obviously not.

Next comes the search for a predictive algorithm. The process begins by ranging over all the data signals Google stores for those two seed sets (whether or not those signals are currently used for ranking).

The machine learning program needs to generate a combination of factors, mixed and matched in various ways, that generates those two seed sets automatically - essentially duplicating the ratings that the humans had already assigned. Once the combination of factors is found that predicts the good/bad quality with a high degree of accuracy, then that combination of factors is run over the signals data for other pages that were not previously scored by human raters. The accuracy of the data model can then be tested further by looking at these results.

After more and more tweaking of the predictive algorithm, the ideal is to arrive at a highly accurate model that can be used live.

The "decisions" are whether any given signal actually belongs in the predictive model, along with how it gets combined with other signals. Maybe signal A predicts quality content but only when signals B and C are also present. Or maybe the prediction is more accurate when signal C is absent but D is present. Maybe one set of signals can be grouped, and if they reach a certain threshold, then loop back to an earlier position in the tree and wipe out or modify some earlier score.

The number of signals or factors involved is massive, and the number of possible ways of combining them is also immense. The machine learning program would automatically generate all kinds of possible "trees" and see which one does the best job of predicting. The final decision tree is the outline of how those factors all inter-relate in the final predictive model.
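As a rough illustration of the process described above, here is a minimal sketch of learning a decision tree from a human-rated seed set. Everything in it is an assumption made for illustration: the signals A, B, and C are placeholders rather than real ranking signals, the seed set is invented, and Gini impurity is just one common way to choose splits. Nothing here is known to reflect how Panda actually works.

```python
from collections import Counter

# Hypothetical seed set: (signals, human rating). The names A, B, C are
# placeholders, not real ranking signals. Here a page was rated "high"
# only when all three signals were present.
SEED_SET = [
    ({"A": True,  "B": True,  "C": True},  "high"),
    ({"A": True,  "B": True,  "C": False}, "low"),
    ({"A": True,  "B": False, "C": True},  "low"),
    ({"A": False, "B": True,  "C": True},  "low"),
    ({"A": False, "B": False, "C": False}, "low"),
    ({"A": True,  "B": True,  "C": True},  "high"),
]

def gini(rows):
    """Impurity of a group of rated pages: 0.0 means all ratings agree."""
    counts = Counter(label for _, label in rows)
    return 1.0 - sum((n / len(rows)) ** 2 for n in counts.values())

def best_split(rows, signals):
    """Pick the signal whose yes/no split leaves the purest two groups."""
    best = None
    for signal in signals:
        yes = [row for row in rows if row[0][signal]]
        no = [row for row in rows if not row[0][signal]]
        if not yes or not no:
            continue  # this signal separates nothing in this group
        impurity = (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
        if best is None or impurity < best[0]:
            best = (impurity, signal, yes, no)
    return best

def build_tree(rows, signals):
    """Return a rating (leaf) or a tuple (signal, if_present, if_absent)."""
    labels = Counter(label for _, label in rows)
    majority = labels.most_common(1)[0][0]
    if len(labels) == 1 or not signals:
        return majority
    split = best_split(rows, signals)
    if split is None:
        return majority
    _, signal, yes, no = split
    remaining = [s for s in signals if s != signal]
    return (signal, build_tree(yes, remaining), build_tree(no, remaining))

def predict(tree, page):
    """Walk the tree, testing one signal at each node, until a leaf."""
    while isinstance(tree, tuple):
        signal, if_present, if_absent = tree
        tree = if_present if page[signal] else if_absent
    return tree

tree = build_tree(SEED_SET, ["A", "B", "C"])
# Does the learned tree duplicate the human ratings on the seed set?
reproduced = all(predict(tree, page) == rating for page, rating in SEED_SET)
# Apply it to a page the raters never scored.
unseen = predict(tree, {"A": True, "B": False, "C": False})
print(tree, reproduced, unseen)
```

Each tuple in the learned tree is one "decision" in the sense discussed here: which signal to test at that point, and which branch to follow depending on the answer.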

A friend of mine is involved in a contest to build a similar model for predicting movie box office hits based on known pre-release factors. That's a much smaller data set, and a lot less dynamic than Google's, too. But still, he tells me that it's immensely challenging.

Lapizuli

8:56 pm on Jul 20, 2011 (gmt 0)

Hmmm. Okay, so suppose I were an algorithm and wanted to predict whether a user would find a site low quality. And suppose I just got better at deciding how to use the data I have for doing that, enough to spur my creators to tell webmasters, essentially, "Minimize your quality-signaling SEO - we don't need your backend signals as much as we used to - and start really focusing on your users." (That is the common theme in Google insiders' advice since Panda. Yes, it's the advice given before Panda, too, but not quite so adamantly, it seems.) Then the data I just got better at using probably wasn't any specific on-page factor, but the data surrounding user behavior. Right? I mean, what else would it be, but matching the algorithm's decisions to users' decisive actions?

I don't do it very well yet, though. My decisions might be more sweeping on variable-quality UGC sites, where I assume all that glitters is pyrite.

And my confidence in my decisions might be overrated - for example, if I associate erroneous signals with quality, such as assuming that all valuable sites are ones people would entrust with their credit card numbers. (After all, people will trust their neighbors' advice on how to deal with menopause, but wouldn't let their neighbors have their credit card numbers, and would trust drugstores to take their credit cards, but not to give honest, disinterested advice on hormone therapy.)

But generally, my focus would be on better interpreting the actions of users. And there is no SEO for that - only end results.

If memory serves, Google has actually warned that they can't share the details about Panda because it involves something that could be gamed. Yet it's not supposed to be something easily tweaked by webmasters. That leaves user behavior, which can be gamed, if with difficulty.

Is that reasonable to suppose?

slawski

11:32 am on Jul 21, 2011 (gmt 0)

Hi Lapizuli

I found a tutorial on Decision Trees from Google's Pittsburgh office director Andrew Moore that does a nice job of showing how decision trees work:

[autonlab.org...]

It's still somewhat technical, but one of the simpler explanations of how decision trees can be used, with some very good examples.

aristotle

2:11 pm on Jul 21, 2011 (gmt 0)

Tedster said:
The purpose of the Panda process is to have an accurate predictive model of "shallow" content as opposed to "high quality" content. To do that, Google first asked their quality raters to create two seed sets of web pages - those that are clearly top quality and those that are obviously not.

Since one of Panda's main targets is so-called "content farms", I wonder if these quality raters may have also been asked to create a "seed set" of these content farms.

Most spam and very low quality sites should be fairly easy for Panda to identify. But many big content farms have "medium quality" content which could be harder to evaluate. So a seed set of these sites would reveal their special characteristics, and this would help make the decision easier.

I also don't think that Panda enforces a strict cutoff point somewhere between low and high quality that determines whether a site is demoted or not. Instead, I think Panda gives some kind of "quality score" to each site that determines the extent of its demotion. Thus, the lower the score, the greater the demotion.
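To make the difference between those two possibilities concrete, here is a small sketch contrasting a strict cutoff with a graded demotion. It is purely hypothetical: the 0-to-1 quality scale, the threshold, the penalty sizes, and both function names are invented for illustration, and neither function is known to match what Panda does.

```python
def cutoff_demotion(quality, threshold=0.5, penalty=0.4):
    """Strict cutoff: every site below the threshold gets the same demotion."""
    return penalty if quality < threshold else 0.0

def graded_demotion(quality, max_penalty=0.4):
    """Graded: the lower the quality score, the greater the demotion."""
    quality = min(max(quality, 0.0), 1.0)  # clamp to the assumed 0..1 scale
    return max_penalty * (1.0 - quality)

# Compare the two schemes across a range of hypothetical quality scores.
for q in (0.1, 0.45, 0.55, 0.9):
    print(q, cutoff_demotion(q), round(graded_demotion(q), 3))
```

Under the cutoff scheme, sites scoring 0.45 and 0.55 are treated very differently, while under the graded scheme they receive nearly the same demotion.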

Lapizuli

8:51 pm on Jul 21, 2011 (gmt 0)

Slawski, Tedster, tristanperry, et al,

Thanks for the explanations and references. I'm getting a better sense of things, especially with tristanperry's thread linking to a recent Google patent filing. [webmasterworld.com...]

Definitely food for thought...

Rasputin

3:57 am on Jul 22, 2011 (gmt 0)

An exceptionally insightful explanation Tedster, thanks.

Google first asked their quality raters to create two seed sets of web pages - those that are clearly top quality and those that are obviously not

In the model described you refer to the seed sets consisting of web pages - but Panda is said to be a site wide assessment.

Do you think they would assess individual pages on a site using the method you describe and then somehow sum those results to decide an overall 'panda score' for a site...

...or could their 'seed sets' be complete websites rather than individual pages, which would allow factors such as 'number of pages viewed by an average visitor' or 'how much traffic this site gets from social networks' to also be factors?

tedster

4:33 am on Jul 22, 2011 (gmt 0)

There is definitely some fuzziness in the sitewide versus page specific rankings.

micklearn

6:04 am on Jul 22, 2011 (gmt 0)

Great links and thoughts in this thread. The page vs. sitewide distinction seemed very fuzzy in the most recent patent filing linked above. There were a few 'or's in there that stood out. Also, I can't help but wonder if the somewhat recent investment by Google Ventures (and a few other big names) in Hubspot should be taken into consideration when thinking about the decision tree. Hubspot has a scoring system for pages. Couldn't that be easily turned into a score for an entire site? Score all of the pages = site score.
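The "score all of the pages = site score" idea could be sketched in more than one way, and the choice of aggregation matters. The example below is a guess at two simple options, assuming page scores on a 0-to-1 scale. The function names, the scale, and the 0.3 "low quality" threshold are all hypothetical; this is not Hubspot's scoring system or anything Google has described.

```python
from statistics import mean

def site_score_mean(page_scores):
    """Site score as the plain average of its page scores."""
    return mean(page_scores)

def low_quality_share(page_scores, low=0.3):
    """Fraction of a site's pages that fall below a 'low quality' threshold."""
    return sum(1 for score in page_scores if score < low) / len(page_scores)

# A hypothetical site with two strong pages and one weak one.
pages = [0.9, 0.9, 0.1]
print(round(site_score_mean(pages), 3), round(low_quality_share(pages), 3))
```

An average lets a few strong pages mask a weak one, whereas the share of low-scoring pages would flag a site where thin content makes up a large part of the whole.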