What is the "Decision" in Decision Tree Processing?

     
6:59 pm on Jul 20, 2011 (gmt 0)

Full Member from US 

5+ Year Member Top Contributors Of The Month

joined:Oct 9, 2009
posts:301
votes: 6


The Google engineer who supposedly worked on the "Panda" update is said to be an expert in "decision tree processing."

I've been trying and trying to figure out what that is, looking here:

[en.wikipedia.org...]

and here

[research.google.com...]

And here:

[seobythesea.com...]

And other places written purely in technospeak.

What I haven't been able to get a grasp on is what decisions are actually the decisions in question.

Are they the decisions the algorithm makes in interpreting the data?

Are they the decisions of the user, which are better understood by the algorithm?

Something else?

Can anyone (Tedster?) explain?
7:38 pm on July 20, 2011 (gmt 0)

Full Member

5+ Year Member

joined:Sept 14, 2010
posts: 205
votes: 0


I haven't read those links, but as I learnt it (I'm a computer science graduate), the decisions are the ones the algorithm makes when interpreting or acting on a set of data.

So yep, from the perspective of the algorithm.
7:53 pm on July 20, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


The purpose of the Panda process is to have an accurate predictive model of "shallow" content as opposed to "high quality" content. To do that, Google first asked their quality raters to create two seed sets of web pages - those that are clearly top quality and those that are obviously not.

Next comes the search for a predictive algorithm. The process begins by ranging over all the data signals Google stores for those two seed sets (whether or not those signals are currently used for ranking).

The machine learning program needs to generate a combination of factors, mixed and matched in various ways, that generates those two seed sets automatically - essentially duplicating the ratings that the humans had already assigned. Once the combination of factors is found that predicts the good/bad quality with a high degree of accuracy, then that combination of factors is run over the signals data for other pages that were not previously scored by human raters. The accuracy of the data model can then be tested further by looking at these results.

After more and more tweaking of the predictive algorithm, the ideal is to arrive at a highly accurate model that can be used live.

The "decisions" are whether any given signal actually belongs in the predictive model, along with how it gets combined with other signals. Maybe signal A predicts quality content but only when signals B and C are also present. Or maybe the prediction is more accurate when signal C is absent but D is present. Maybe one set of signals can be grouped, and if the group reaches a certain threshold, the process loops back to an earlier position in the tree and wipes out or modifies some earlier score.

The number of signals or factors involved is massive, and the number of possible ways of combining them is also immense. The machine learning program would automatically generate all kinds of possible "trees", seeing which one does the best job of predicting. The final decision tree is the outline of how those factors all inter-relate in the final predictive model.
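To make the idea concrete, here is a minimal sketch of how a program might grow such a tree from two seed sets. It is an illustration under stated assumptions, not a description of what Google runs: the signal names A to D, the toy seed data, and the use of Gini impurity to choose each split are all hypothetical choices made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels (0.0 means every label agrees)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_signal(rows, labels, signals):
    """Pick the signal whose present/absent split leaves the least mixed groups."""
    def split_impurity(signal):
        yes = [lab for row, lab in zip(rows, labels) if row[signal]]
        no = [lab for row, lab in zip(rows, labels) if not row[signal]]
        return (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
    return min(signals, key=split_impurity)

def build_tree(rows, labels, signals):
    """Recursively grow a tree; each internal node is one 'decision' on a signal."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not signals:
        return majority  # leaf: the predicted rating
    signal = best_signal(rows, labels, signals)
    remaining = [s for s in signals if s != signal]
    branches = {}
    for value in (True, False):
        subset = [(r, l) for r, l in zip(rows, labels) if r[signal] == value]
        if subset:
            sub_rows, sub_labels = zip(*subset)
            branches[value] = build_tree(list(sub_rows), list(sub_labels), remaining)
        else:
            branches[value] = majority  # no seed pages here: fall back to majority
    return (signal, branches)

def predict(tree, row):
    """Walk from the root to a leaf, following the branch for each signal value."""
    while isinstance(tree, tuple):
        signal, branches = tree
        tree = branches[row[signal]]
    return tree

# Hypothetical seed sets: signal values per page plus the human rating.
seed_pages = [
    ({"A": True,  "B": True,  "C": True,  "D": False}, "high"),
    ({"A": True,  "B": True,  "C": True,  "D": True},  "high"),
    ({"A": True,  "B": False, "C": True,  "D": False}, "shallow"),
    ({"A": False, "B": True,  "C": True,  "D": True},  "shallow"),
    ({"A": False, "B": False, "C": False, "D": True},  "high"),
    ({"A": True,  "B": False, "C": False, "D": True},  "high"),
    ({"A": False, "B": False, "C": False, "D": False}, "shallow"),
    ({"A": True,  "B": True,  "C": False, "D": False}, "shallow"),
]
rows = [row for row, _ in seed_pages]
labels = [label for _, label in seed_pages]
tree = build_tree(rows, labels, ["A", "B", "C", "D"])

# Apply the learned tree to a page the human raters never scored.
unrated_page = {"A": True, "B": False, "C": False, "D": False}
print(predict(tree, unrated_page))
```

The tree reproduces the human ratings on the seed pages, which is the "duplicating the ratings" step; whether it generalizes to unrated pages is what further testing would have to show. A real system would deal with far more signals, non-boolean values, and noisy labels.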

A friend of mine is involved in a contest building a similar model for predicting movie box office hits based on known pre-release factors. That's a much smaller data set, and a lot less dynamic than Google's, too. But still, he tells me that it's immensely challenging.
8:56 pm on July 20, 2011 (gmt 0)

Full Member from US 

5+ Year Member Top Contributors Of The Month

joined:Oct 9, 2009
posts:301
votes: 6


Hmmm. Okay, so suppose I were an algorithm and wanted to predict whether a user would find a site low quality, and I just got better at deciding how to use the data I have for doing that. Suppose I got better enough to spur my creators to tell webmasters, essentially, "Minimize your quality-signaling SEO - we don't need your backend signals as much as we used to - and start really focusing on your users" (which is the common theme in Google insiders' advice since Panda, and yes, it's the advice given before Panda, too, but not quite so adamantly, it seems). Then the data I just got better at probably wasn't any specific on-page factor, but the data surrounding user behavior. Right? I mean, what else would it be but matching the algorithm's decisions to users' decisive actions?

I don't do it very well yet, though. My decisions might be more sweeping on variable-quality UGC sites, where I assume all that glitters is pyrite.

And my confidence in my decisions might be overrated - for example, if I associate erroneous signals with quality, such as assuming all valuable sites are those to which one would entrust one's credit card number. After all, people will trust their neighbors' advice on how to deal with menopause, but wouldn't let their neighbors have their credit card numbers, and would trust drugstores to take their credit cards, but not to give honest, disinterested advice on hormone therapy.

But generally, my focus would be on better interpreting the actions of users. And there is no SEO for that - only end results.

If memory serves, Google has actually warned that they can't share the details about Panda because it involves something that could be gamed. Yet it's not supposed to be something easily tweaked by webmasters. That leaves user behavior, which can be gamed, if with difficulty.

Is that reasonable to suppose?
11:32 am on July 21, 2011 (gmt 0)

New User

10+ Year Member

joined:Dec 22, 2004
posts: 37
votes: 0


Hi Lapizuli

I found a tutorial on Decision Trees from Google's Pittsburgh office director Andrew Moore that does a nice job of showing how decision trees work:

[autonlab.org...]

It's still somewhat technical, but one of the simpler explanations of how decision trees can be used, with some very good examples.
2:11 pm on July 21, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member aristotle is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Aug 4, 2008
posts:2814
votes: 132


Tedster said:
The purpose of the Panda process is to have an accurate predictive model of "shallow" content as opposed to "high quality" content. To do that, Google first asked their quality raters to create two seed sets of web pages - those that are clearly top quality and those that are obviously not.



Since one of Panda's main targets is so-called "content farms", I wonder if these quality raters may have also been asked to create a "seed set" of these content farms.

Most spam and very low quality sites should be fairly easy for Panda to identify. But many big content farms have "medium quality" content which could be harder to evaluate. So a seed set of these sites would reveal their special characteristics, and this would help make the decision easier.

I also don't think that Panda enforces a strict cutoff point somewhere between low and high quality that determines whether a site is demoted or not. Instead, I think Panda gives some kind of "quality score" to each site that determines the extent of its demotion. Thus, the lower the score, the greater the demotion.
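The difference between the two ideas can be sketched in a few lines. This is purely illustrative: the 0-to-1 score scale, the 0.5 cutoff, and both function names are assumptions made for the example, not anything Google has described.

```python
def strict_cutoff_demotion(quality_score, cutoff=0.5, penalty=1.0):
    """All-or-nothing: full penalty below the cutoff, none at or above it."""
    return penalty if quality_score < cutoff else 0.0

def graded_demotion(quality_score, max_penalty=1.0):
    """Sliding scale: the lower the score, the greater the demotion."""
    clamped = min(max(quality_score, 0.0), 1.0)  # keep the score within 0..1
    return max_penalty * (1.0 - clamped)

# Two hypothetical sites that sit just either side of the cutoff.
for score in (0.49, 0.51):
    print(score, strict_cutoff_demotion(score), graded_demotion(score))
```

Under the strict cutoff, two nearly identical sites get opposite treatment; under the graded version, they are demoted by nearly the same amount.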
8:51 pm on July 21, 2011 (gmt 0)

Full Member from US 

5+ Year Member Top Contributors Of The Month

joined:Oct 9, 2009
posts:301
votes: 6


Slawski, Tedster, tristanperry, et al,

Thanks for the explanations and references. I'm getting a better sense of things, especially with tristanperry's thread linking to a recent Google patent filing. [webmasterworld.com...]

Definitely food for thought...
3:57 am on July 22, 2011 (gmt 0)

Full Member

10+ Year Member

joined:May 25, 2006
posts: 237
votes: 12


An exceptionally insightful explanation, Tedster, thanks.

Google first asked their quality raters to create two seed sets of web pages - those that are clearly top quality and those that are obviously not


In the model you describe, the seed sets consist of web pages, but Panda is said to be a sitewide assessment.

Do you think they would assess individual pages on a site using the method you describe and then somehow sum those results to decide an overall 'panda score' for a site...

...or could their 'seed sets' be complete websites rather than individual pages, which would allow factors such as 'number of pages viewed by an average visitor' or 'how much traffic this site gets from social networks' to also be factors?
4:33 am on July 22, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


There is definitely some fuzziness in the sitewide versus page-specific rankings.
6:04 am on July 22, 2011 (gmt 0)

Full Member

5+ Year Member

joined:May 30, 2009
posts:233
votes: 6


Great links and thoughts in this thread. The page vs. sitewide distinction seemed very fuzzy in the most recent patent filing linked above. There were a few 'or's in there that stood out. Also, I can't help but wonder if the somewhat recent investment by Google Ventures (and a few other big names) in Hubspot should be taken into consideration when thinking about the decision tree. Hubspot has a scoring system for pages. Couldn't that be easily turned into a score for an entire site? Score all of the pages = site score.
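As a rough sketch of "score all of the pages = site score", assuming each page already has a score between 0 and 1 and that a plain average is the way to combine them (both are guesses for illustration, as are the URLs and the function name):

```python
from statistics import mean

def site_score(page_scores):
    """Average the per-page scores into a single site-level score."""
    if not page_scores:
        raise ValueError("need at least one page score")
    return mean(page_scores.values())

# Hypothetical per-page scores for one site.
example_pages = {"/guide": 0.9, "/thin-page-1": 0.3, "/thin-page-2": 0.3}
print(site_score(example_pages))
```

A plain average lets a handful of weak pages pull down an otherwise strong site, which would fit the sitewide effects people report, but other ways of combining page scores would behave differently.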