| This 65 message thread spans 3 pages: < < 65 ( 1 2  ) || |
|Panda Metric : Google Usage of User Engagement Metrics|
User action is everything
| 8:09 pm on Apr 21, 2011 (gmt 0)|
I am surprised and amused at some in the community that point to a couple of individual metrics as the ones that caused xyz to happen in the most recent update. I think it is time to look at the greater picture of data that Google has available for analysis and interpretation.
Eric Enge gave a presentation at PubCon Austin in which he said he felt Panda pumped up to 20% user engagement metrics into the algo. It really got me thinking about the user engagement aspects in depth. In this socialized world, it just makes sense that Google would start using more engagement metrics such as demographic, psychographic, and behavioral metrics. I started to put together a list of possible data sources Google could use as signals, and the list quickly grew large.
Most of the engagement metrics Google can use will fall into the realm of user behavior. Those data sets can be combined with a successful search result into a powerful metric for your website. I believe that metric is now replacing PageRank as the number one Google indicator of a quality site. I have been calling this mythical metric the User Search Success Rate (USSR) or the Panda Metric (PM). This is the rate at which any search results in a happy searcher.
The metric starts before the user ever types in a query at Google:
1:Referral? How did the user come to Google? Was it from:
- a toolbar (Google's own toolbar, or a branded toolbar from a partner?)
- a partner site (AOL etc),
- a specific browser, Mobile, Desktop, Tablet or something else?
- a link on another site?
- social association metrics? Did the user come from a social site, and does Google already know who they are? (Orkut, Twitter, a private control panel such as WordPress?)
2: Location data
- IP address
- GPS Data available? Depends on device.
- Toolbar location data and history.
- WiFi network, Cell phone network or other ISP like location data.
3: Browser request headers
- Browser agent, platform and device data
- http accept: gzip, java, flash, etc.
- Screen size
- Toolbar metrics tell all (query string often included agent identifiers)
- Toolbar installation history and other history you may have already shared with Google. (such as version of toolbar)
4: Site Tracking and Advertising Tracking:
- What site did you come from and what did you do on that site? (if they were running Google Analytics or other Google Trackable metric)
- Both via Google Analytics and via Google site-based advertising like AdSense (remember, you leak a referral every time you visit a page with Google code on it)
- Coming soon: +1 data from Google's +1 service.
- My Google or Google Properties, Gmail, Youtube, etc.
- Sites you were logged into while viewing Google advertising from DoubleClick or AdSense. If you click through a login page on WordPress at "foofoo.com" and then view an AdSense ad on that site, it is a good signal to track you by.
At this point, Google knows who 70-75% (my guess) of the users doing any given query are, and can guess accurately at another 15-25% based on browser/software/system profiles (even if your IP changes and you are not logged in, Google can match all the above metrics to a profile on you). That leaves less than 10% of the users in the world that Google does NOT know. Of that 10%, they can later retro-analyze your profile again when you meet some criteria, such as logging into a Google service such as Gmail. (I'm not saying they care WHO you are specifically, just that you are User xyz that they can track over time.)
Finally, after all that data, the user probably types in a query: (if the search didn't come from an off site query box or toolbar to start with).
- various psychographic data. How the user types queries into the auto-fill box is indicative of the user's education level, often sex, and other psycho/demographic details.
- spelling, language, syntax, format, etc. All the variables of a query can give clues as to who the user is and what the user's intent is.
- Mouse over preview data. What did the user do?
- Mouse tracking via the js mouse over data. (when you mouse over descriptions, that can be sent back to Google) Intent data.
- Multivariate testing metrics. As we have seen the last year, Google is SERP testing constantly.
Offsite Metrics: Finally the user clicks on a result and is taken to a site:
- AdSense or DoubleClick serve ads?
- Google Analytics running?
- What does the user do while he is there? Click path.
- +1 buttons to come.
- How long does he stay on the page
- Does user visit other pages?
- Does user hit the back button and return to Google, or does he wander off to parts unknown?
- Toolbar data. Tracking and site blocking.
Unknown Crowd Sourced Metrics
- Google has been known to come up with unusual metrics and crowd sourced data. Take for example Google's Captcha system that is actually used to validate corrupted words from Google's book scanning project. I think it is a good assumption that there is other data Google could mine that we are oblivious to. If you have played the game werewolf with Matt Cutts - you know he is a crafty one ;-)
After all that, we can quantify a metric (which I call the Panda Metric). It is an amalgamation of the above data inputs. This set of inputs would be relative to this query. They could also be weighted against related queries (siblings, brothers/sisters, parents of the query tree from the root query).
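To make the idea of an amalgamated metric concrete, here is a minimal sketch of how a few of the post-click signals listed above (time on page, pages visited, back button to Google) could be rolled into a per-site "search success rate". Every name, weight, and threshold in it is a made-up assumption for illustration; nothing here is known to reflect what Google actually computes.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    """One click from a search result to a site (hypothetical record)."""
    dwell_seconds: float     # how long the user stayed on the page
    pages_viewed: int        # how many pages the user saw on the site
    returned_to_serp: bool   # did the user hit back and return to Google?


# Assumed weights for illustration only; the real weighting is unknown.
WEIGHTS = {"dwell": 0.5, "depth": 0.2, "no_return": 0.3}


def visit_score(visit: Visit) -> float:
    """Blend the signals into a 0..1 score for a single visit."""
    dwell = min(max(visit.dwell_seconds, 0.0) / 60.0, 1.0)      # caps at one minute
    depth = min(max(visit.pages_viewed - 1, 0) / 3.0, 1.0)      # caps at four pages
    no_return = 0.0 if visit.returned_to_serp else 1.0
    return (WEIGHTS["dwell"] * dwell
            + WEIGHTS["depth"] * depth
            + WEIGHTS["no_return"] * no_return)


def search_success_rate(visits: list[Visit], threshold: float = 0.5) -> float:
    """Share of visits that count as a 'happy searcher' under the assumed threshold."""
    if not visits:
        return 0.0
    happy = sum(1 for v in visits if visit_score(v) >= threshold)
    return happy / len(visits)


if __name__ == "__main__":
    sample = [Visit(120, 4, False), Visit(5, 1, True)]
    print(search_success_rate(sample))
```

The sample data has one engaged visit and one quick bounce back to the results page, so half of the visits clear the assumed threshold.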
How the Panda Metric would actually be applied only leads to more questions:
- would Google want people to leave a successful query and come back to Google.com? Or where would they want the user to go?
- Does a happy Google user keep using Google?
- What should Google do to retarget followup queries?
- Is personalization all it is cracked up to be?
- Does the Panda Metric result in higher Panda Metric scores?
- Is it a self-fulfilling or self-defeating metric that leads into a feedback loop - almost a race condition?
Any way you look at it, that data when analyzed and applied within the algo could lead to a higher happy searcher result. I think the above data is partially what drove the Panda update. Why? Highly successful, high referral, low bounce, quality content, high engagement, and historical pages have seen a solid boost with Panda.
| 8:00 am on May 16, 2011 (gmt 0)|
@ Sundaridevi --
That means we can make all flash websites (as Google does not exist) and they would reward you with traffic?
| 10:17 am on May 16, 2011 (gmt 0)|
|That means we can make all flash websites (as Google does not exist) and they would reward you with traffic? |
NO it doesn't. What I said about Matt Cutts was a very general statement taken out of context. The larger statement he makes is usually something like, Google has a page based view of the web (meaning a page as a textual document). So to the extent that you deviate from the page model, you will have less rewarding results.
| 10:39 am on May 16, 2011 (gmt 0)|
Here's a post I made in a 2005 discussion on another forum about Google's penalization of sites that sell links. I think it still applies. It also provides meaningful guidelines on what Google does in different algorithm updates as well as showing that even the original Google concept took into account the notion of site quality:
Basically, the size and influence of google contributes to 'increasing returns to scale' on the web. What I mean by that is that without what is referred to here as 'spamming', in the natural progression of things, websites that are bigger have better search results. That's because they have had more time to gather links and gather "citations". If the site is well designed with SEO in mind they also have more internal links which boost their PR. Such sites also have the unique opportunity to create other big sites (more cheaply and more quickly) by linking to them from their first big site. So once you have one, it is *much* easier to get a second.
But the size of Google also means that new sites that enter into a crowded field will have a big task ahead of them to get noticed in google.
Google's PageRank is based on citation theory and in the original google prototype founders Page and Brin discuss this theory:
"Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo's homepage would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the web."
In section "2.1.2 Intuitive Justification", the paper also describes how a damping factor can be applied to single sites or groups of sites "and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking"
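For readers who have not seen the recursion spelled out, here is a minimal sketch of "recursively propagating weights through the link structure" with a damping factor. It uses the commonly taught normalized form with one uniform damping value of 0.85 for every page, and it spreads the rank of pages without outbound links evenly; both are assumptions of this sketch, not a claim about how Google runs it in production. The three-page link graph is invented.

```python
def pagerank(links: dict[str, list[str]],
             damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Power-iteration PageRank over a {page: [pages it links to]} graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the baseline (1 - d) / N.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank equally among its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its damped rank over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Invented toy graph: "a" is cited by two pages, "b" by none.
    toy = {"yahoo": ["a"], "a": ["yahoo"], "b": ["a"]}
    for page, value in sorted(pagerank(toy).items()):
        print(page, round(value, 3))
```

In the toy graph, the page with two citations ends up with far more rank than the page nobody links to, which is the intuition in the quoted passage.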
...easier said than done. But I guess google has a lot of people working on it and that is probably the gist of the post made by Mr. Cutts from google...
So it's no surprise that new sites try to find some shortcuts to get noticed. Those shortcuts include purchasing text links. I don't think it's so bad. The site who 'hoards their PR', as was so appropriately described above, is going to have higher PR than the site that dilutes it by selling offsite links for monetary gain. Similarly, in the physical world, huge media giants sometimes dedicate valuable print space to articles and stories that help out small companies with a positive piece of press, even when that small company isn't big enough to justify it. But nobody questions that. In fact I think it is considered ethical and of "human interest". The difference is that on the web we would like to believe that every site should have an equal chance in the eyes of google. In practice that seems optimistic. In any event, the websites who don't have a PR7+ to dole out may feel that it is unfair that the owners of the ones that do can.
| 4:04 pm on May 16, 2011 (gmt 0)|
|Google has been known to come up with unusual metrics and crowd sourced data |
This has been swimming around in my head lately. One of my first inclinations was to think perhaps Google was using some sort of WOT API (Web of Trust) or similar, to see what kind of trust ranking ordinary people give a site, and if not so good, pandalize it. This wouldn't have to come from WOT itself; Google may have its own that we don't know about, or perhaps Google's herd of quality raters have this kind of toolbar system as well. Whatever/however, this kind of crowdsourced rating system might be in play. In addition, they could be using something similar to how CrowdFlower connects with MTurk to crowdsource small tasks. For instance, someone has hired them to crowdsource a question "How relevant are these 33 search results (UK only)?". If Google is crowdsourcing both trust rankings like WOT does, and various other crowdsourced signals such as relevance, etc., then I can see how Panda could be so hard to figure out, and also why it can't be run all the time. Just a thought for a Monday morning, y'all. :)
| 7:59 pm on May 16, 2011 (gmt 0)|
It's a great list, Brett. Nonetheless, I tend to agree with some of the folks who have said that 1) this type of list is going to be used by Google more actively in the future, but that 2) Panda is not using this type of input extensively now.
If these factors were being used by Panda, we wouldn't see such bad results in Google now, as somebody mentioned before in this thread. There are lots of sites that intuitively warrant high bounce rates for their bad content that have benefited from Panda. So, it's hard to believe that engagement metrics are at the heart of the new algo.
To me, more than anything else, it seems like they just dialed up their reliance on PageRank, relying on it more so than they had in the past. To develop Panda, they got their sample sets of good and bad sites; then they ran their indices through a wide array of new algo factor weightings (tweaking weightings on old factors and trying some new ones as well); and, after all of this testing and simulation, whichever weighting set gave the best, tightest correlation on their good versus bad sample set, they anointed that set of weightings to be king.
They were probably very happy to see that a greater emphasis on PageRank was in the winning set. Unfortunately, that change has messed up many of us. A PageRank 5 site with a million pages could do very well when on-page and other non-PageRank factors were weighted heavily; once Google upped its emphasis on PageRank, those sites suddenly didn't have enough PageRank fuel in the tank to continue to rank well. In the new era, the relevance of your page doesn't matter nearly as much. This theory would also account for Google's comments that one bad section of your site can bring down your rankings for pages anywhere on your site. If you have many weak pages without any links, your strongest pages are going to perform poorly regardless of how many links they get. Anyway, that's my theory du jour. Thanks again for the great post.
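The weighting search described in this post can be sketched roughly as follows: score every site in a "good" and a "bad" sample under each candidate set of factor weights, and keep the weights that separate the two samples best. The factor names, the grid of candidate weights, the separation measure, and the sample numbers are all invented for illustration; this is a guess at the shape of the process, not a description of how Panda was actually tuned.

```python
from itertools import product


def score(site: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of a site's factor values."""
    return sum(weight * site[factor] for factor, weight in weights.items())


def separation(good: list[dict], bad: list[dict], weights: dict[str, float]) -> float:
    """Fraction of (good, bad) pairs where the good site outscores the bad one."""
    wins = sum(1 for g in good for b in bad
               if score(g, weights) > score(b, weights))
    return wins / (len(good) * len(bad))


def best_weighting(good: list[dict], bad: list[dict], factors: list[str],
                   steps: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0)):
    """Try every combination of candidate weights and keep the best separator."""
    best_weights, best_sep = None, -1.0
    for combo in product(steps, repeat=len(factors)):
        weights = dict(zip(factors, combo))
        sep = separation(good, bad, weights)
        if sep > best_sep:   # ties keep the first combination found
            best_weights, best_sep = weights, sep
    return best_weights, best_sep


if __name__ == "__main__":
    # Invented sample sets: the "good" sites have more PageRank, the "bad" ones more on-page optimization.
    good_sites = [{"pagerank": 0.9, "on_page": 0.4}, {"pagerank": 0.8, "on_page": 0.6}]
    bad_sites = [{"pagerank": 0.2, "on_page": 0.9}, {"pagerank": 0.3, "on_page": 0.7}]
    print(best_weighting(good_sites, bad_sites, ["pagerank", "on_page"]))
```

With these invented samples the search settles on a weighting that leans on PageRank, which is the kind of outcome the theory above supposes.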