This 65 message thread spans 3 pages.
|Panda Metric : Google Usage of User Engagement Metrics|
User action is everything
I am surprised and amused at some in the community that point to a couple of individual metrics as the ones that caused xyz to happen in the most recent update. I think it is time to look at the greater picture of data that Google has available for analysis and interpretation.
Eric Enge gave a presentation at PubCon Austin in which he said he felt Panda pumped up to 20% user engagement metrics into the algo. It really got me thinking about the user engagement aspects in depth. In this socialized world, it just makes sense that Google would start using more engagement metrics such as demographic, psychographic, and behavioral metrics. I started to put together a list of possible data sources Google could use as signals, and the list quickly grew large.
Most of the engagement metrics Google can use will fall into the realm of user behavior. Those data sets can be combined with a successful search result into a powerful metric for your website. I believe that metric is now replacing PageRank as the number one Google indicator of a quality site. I have been calling this mythical metric the User Search Success Rate (USSR) or the Panda Metric (PM). This is the rate at which any search results in a happy searcher.
The metric starts before the user ever types in a query at Google:
1: Referral? How did the user come to Google? Was it from:
- a toolbar (Google's own toolbar, or a branded toolbar from a partner?)
- a partner site (AOL etc),
- a specific browser, Mobile, Desktop, Tablet or something else?
- a link on another site?
- a social association metric? Did it come from a social site, and do we know who you are already? (Orkut, Twitter, private control panel such as WordPress?)
2: Location data
- IP address
- GPS Data available? Depends on device.
- Toolbar location data and history.
- WiFi network, Cell phone network or other ISP like location data.
3: Browser request headers
- Browser agent, platform and device data
- http accept: gzip, java, flash, etc.
- Screen size
- Toolbar metrics tell all (query string often included agent identifiers)
- Toolbar installation history and other history you may have already shared with Google. (such as version of toolbar)
4: Site Tracking and Advertising Tracking:
- What site did you come from and what did you do on that site? (if they were running Google Analytics or other Google Trackable metric)
- Both via Google Analytics and via Google site based advertising like AdSense or analytics (remember, you leak a referral every time you visit a page with Google code on it)
- Coming soon: +1 data from Google's +1 service.
- My Google or Google Properties, Gmail, Youtube, etc.
- Sites you were logged into while viewing Google advertising from DoubleClick or AdSense. If you click through a login page on WordPress at "foofoo.com" and then view an AdSense ad on that site, it is a good signal to track you by.
At this point, Google knows who 70-75% (my guess) of the users doing any given query are, and can guess accurately at another 15-25% based on browser/software/system profiles (even if your IP changes and you are not logged in, Google can match all the above metrics to a profile on you). That leaves less than 10% of the users in the world that Google does NOT know. Of that 10%, they can later retro-analyze your profile again when you meet some criteria, such as logging into a Google service such as Gmail. (I'm not saying they care WHO you are specifically, just that you are User xyz that they can track over time.)
Finally, after all that data, the user probably types in a query: (if the search didn't come from an off site query box or toolbar to start with).
- various psycho-graphical data. How the user types in queries into the auto fill box is indicative of users education level, often sex, and other psycho/demographic details.
- spelling, language, syntax, format, etc. All the variables of a query can give clues as to who the user is and what the user's intent is all about.
- Mouse over preview data. What did the user do?
- Mouse tracking via the js mouse over data. (when you mouse over descriptions, that can be sent back to Google) Intent data.
- Multivariate testing metrics. As we have seen the last year, Google is SERP testing constantly.
Offsite Metrics: Finally the user clicks on a result and is taken to a site:
- AdSense or DoubleClick serve ads?
- Google Analytics running?
- What does the user do while he is there? Click path.
- +1 buttons to come.
- How long does he stay on the page
- Does user visit other pages?
- Does user hit the back button and return to Google, or does he wander off to parts unknown?
- Toolbar data. Tracking and site blocking.
Unknown Crowd Sourced Metrics
- Google has been known to come up with unusual metrics and crowd sourced data. Take for example Google's Captcha system that is actually used to validate corrupted words from Google's book scanning project. I think it is a good assumption that there is other data Google could mine that we are oblivious to. If you have played the game werewolf with Matt Cutts, you know he is a crafty one ;-)
After all that, we can quantify a metric (which I call the Panda Metric). It is an amalgamation of the above data inputs. This set of inputs would be relative to this query. They could also be weighted against related queries (siblings, brothers/sisters, parents of the query tree from the root query).
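To make the idea concrete, here is a minimal sketch of how such a success rate could be computed from click logs. Everything in it is an assumption for illustration: the `SearchSession` fields, the 30 second dwell threshold, the `is_happy` rule, and the `related_weight` blend are all hypothetical, and nothing here reflects what Google actually does.

```python
from dataclasses import dataclass


@dataclass
class SearchSession:
    """One click on a search result (hypothetical log record)."""
    query: str
    clicked_url: str
    dwell_seconds: float
    returned_to_serp: bool


def is_happy(session, min_dwell=30.0):
    # Assumed rule: the searcher stayed a while and did not bounce back.
    return session.dwell_seconds >= min_dwell and not session.returned_to_serp


def success_rate(sessions, url, query):
    """Share of clicks on `url` for `query` that ended in a happy searcher."""
    relevant = [s for s in sessions
                if s.clicked_url == url and s.query == query]
    if not relevant:
        return None  # no data for this url/query pair
    return sum(is_happy(s) for s in relevant) / len(relevant)


def panda_metric(sessions, url, query, related_queries, related_weight=0.25):
    """Blend the rate for the root query with the average for related queries."""
    direct = success_rate(sessions, url, query)
    related = [success_rate(sessions, url, q) for q in related_queries]
    related = [r for r in related if r is not None]
    if not related:
        return direct
    related_avg = sum(related) / len(related)
    if direct is None:
        return related_avg
    return (1 - related_weight) * direct + related_weight * related_avg
```

A page with no logged sessions gets `None` rather than zero, so "no data" is kept apart from "unhappy searchers".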
How the Panda Metric would actually be applied only leads to more questions:
- would Google want people to leave a successful query and come back to Google.com? Or where would they want the user to go?
- Does a happy Google user keep using Google?
- What should Google do to retarget followup queries?
- Is personalization all it is cracked up to be?
- Does the Panda Metric result in higher Panda Metric scores?
- Is it a self-fulfilling or self-defeating metric that leads into a feedback loop - almost a race condition?
Any way you look at it, that data, when analyzed and applied within the algo, could lead to a higher happy searcher result. I think the above data is partially what drove the Panda update. Why? Highly successful, high referral, low bounce, quality content, high engagement, and historical pages have seen a solid boost with Panda.
I agree that Google is probably shifting to user metrics as a major indicator.
But I would say that Google's best source of data on user behavior is the Chrome browser. I think one of the main reasons they created it was to be able to collect better data. Tens of millions of people now use it as their main browser. That's enough to be able to collect statistically meaningful results. Some of the signals that Chrome can provide include:
1. A user bookmarks a page as a favorite.
2. A user saves a copy of a page on their hard drive.
3. A user returns to the same page later.
4. A user visits other pages on the same site.
It's really something to see all those factors laid out in one place. And the list doesn't even include buying clickstream data from an ISP or two. That is another real possibility.
This could explain any delay in recalculating.
But then, eHow for example has/had about 2 min at a time and some 75% bounce rate (Alexa) and Wisegeek has/had more or less the same. Broadly speaking, a few seconds more or less.
I think some of the delay in recalculating is that Panda works at a very basic level - it's what Google calls the "document classifier". I have a feeling that particular type of routine does not run as often as the rest of the scoring that is built on top of it. My current research - looking through patents, papers, and posts that mention "document classifiers".
Yeah, this is a great thread.
I think this is probably a good time to clarify that 'document' can refer to a page or a collection of pages, and could easily be both, imo. E.g. a page is an individual document and can be evaluated individually, but a site (or imo, even a 'section' of a site) is also a document and can be evaluated as a collective whole.
I would guess you're right about classifications not happening as often tedster and, of course, if only a portion of pages (sub-documents?) are changed you could end up with the same overall evaluation of the document (site) as a whole, even though there have been changes to a portion of it.
Plus, if it's 'happy user' based as Brett has outlined, imo, there would have to be time to build a 'new history of happy users' before an override of the previous pattern would kick in for a site's or page's rankings.
I could see even a 20% factor in this, but there are many popular and successful sites that got screwed by Panda, very sticky ones. So I am guessing if your pages are "spam" you are toast. If they aren't, then these metrics may help you.
Then there's Matt Cutts not ranking for his own article
Brett, it sounds like you are saying they do use Analytics to calculate the SERPS. I agree that Panda is a replacement for Page rank - it kind of reminds me of that on a sitewide basis.
If it's indeed 20% of user interaction injected into the search engine's algo it would explain the abnormal results.
|So I am guessing if your pages are "spam" you are toast. |
I believe that right now spammy pages seem to thrive over original content.
The thing is, the lack of precaution by end users while browsing could significantly affect search results, as you will have to add to the "user engagement" the compromised/zombie machines available worldwide. And that is a big percentage. At least from the server logs I have access to, 80%-90% of visits are artificial. There are lots of new developments in ways of hijacking browsers.
The theory ties in with the "infrastructure" update a few months back where it was speculated that the changes were to facilitate exactly what this thread is talking about.
I don't think they quite have it right yet but if it's the first steps towards it then it's a sensible and astute move. Anything that takes away factors that can be gamed (ie: link exchanges/buying etc) and moves towards ranking based on how people behave and react to what they find makes perfect sense for the SERPS and should help improve website quality over time.
Fingers crossed you are on the money Brett.
I hope this isn't off topic, but I think it is a sign of user engagement/user behavior:
If you search a topic, click on a serp and follow it to the page listed, and then hit BACK to return to the G serps page, this message appears under the listing you had previously clicked: "Block all results from sample.com"
Is this possibly a "NO" vote that will go into the site's profile? G is assuming, I believe, that since we followed a link and bailed back, there is something we don't like about that whole domain.
I leave it to the smarter heads here to comment.
I don't know how much the algorithm is influenced by user interaction. From what was said earlier, it seems to be the case. And if it is, the guy with the biggest botnet wins of course. Time will tell.
New video from Matt Cutts today on another potential source of data - Google's Public DNS service.
Matt states categorically that the search team does NOT have access to that data.
I think it's only a matter of time before most site data collected by Google is centralised and used in one way or another whether it be for ranking, comparisons or analysis of algo change effect.
If it's information that can be used to improve SERP quality/relevance they'd be silly not to.
> Matt states categorically that the search
> team does NOT have access to that data.
Seriously? That could be some very interesting data to look at. I wonder what it would tell us?
Yeah, it's actually funny, because he said when he read the TOS the only thing he wasn't totally happy with was NOT having access to the data! lol
|I agree that Panda is a replacement for Page rank - it kind of reminds me of that on a sitewide basis. |
Not in the two verticals that I watch closely. The top sites all rank well because of "unnatural" linking, namely:
"SEO Directories" (directories that specifically state that people should submit to them to increase their Page Rank, and charge a nominal fee ~US $20 to get listed.)
Site network interlinking (same owner has several sites that link to one another)
Stayed up until 3:00 looking at about 8,000 total links to my competitors and less than 1% would be considered "freely given" links - I would say it is close to about .25%.
While that is disheartening to someone like me who has a site with low page rank, I realize that I can play their game better than they can, so it won't be long...
>> Is this possibly a "NO" vote that will go into the site's profile? G is assuming, I believe, that since we followed a link and bailed back, there is something we don't like about that whole domain.
I'm an experienced 'searcher' and sometimes I will merely browse through the results, check each one out and eventually find my way to the most irrelevant result and waste much time there.. just because it looks like it may have something I may be looking for. More often than not, I find that I've wasted my time reading rubbish just because the first few sentences made at least some sense and the web site design looked good.
I surely hope that if I wasted my time on such rubbish, went back (again) to the search results and finally did find what I wanted somewhere else in a result I originally dismissed, it doesn't have a negative effect on a web site (or page), just because a rubbish article gained some points because I wasted my time with it!
Admittedly, most people I know will "Google" something, click the first or second result, usually the spammiest looking thing and then rarely read any of the actual content and come up with a magical 'answer' to what they're looking for.. or click an ad. Often the latter.
As someone familiar even with boolean queries, I hope Google doesn't start using user metrics, because users seem to eventually choose the worst, spammiest results.
Honestly, I miss what Google had to offer with its flexibility. I wish FAST Search still offered their tools on AllTheWeb because they really did have some cool technology and something to compete with G.
A few years back I suggested Google was using all the above data within their algo, and people snickered at me like I was a conspiracy theorist [nut].
Perhaps I was a bit too forward thinking at the time, however it seems to be today’s reality.
Today I believe this new algo has a signal to noise ratio built in, and a close study of existing SNR documentation might be helpful in understanding it.
Gosh you guys never listen to me! :-D
I think google have already implemented any ideas we come up with. I know that some of these metrics were put to use years ago.
Looking too much at the tree, and ignoring the forest at large. To me, ignoring basics spells disaster : Panda, anyone?
SE basics dictate:
- spider & store as much as you can
- work on determining content RELEVANCE as much as you can
- monetize in novel, unobtrusive & user-friendly ways
- don't be overly obsessed with PhD bs
- Keep It Simple Stupid!
There is a lot yet to be done on basics, that has not yet been done, before going into truly questionable & minute details about user behaviour.
6: Offsite user metrics ?
- If your site has an active following on Facebook or Twitter and most of the conversations take place on those services, it's possible to associate/credit that engagement with your site. The opposite is also true: do you have stagnant, little used social profiles for your site? As evidenced by Google realtime, they choose the accounts they display content from carefully, which means social profiles are "rated" somehow too.
- Do you "auto-post" your content to social circles? If so, are those links getting visited by those you shared with?
I think it's important to remember that everything is relative. It's not about having the absolute best scores online, it's about having the best experience for your users measured against similar sites. Different site types have a different bar to reach (raise?). What works for one site type may not work for another.
"does the site have a social network account on Twitter? yes=+1" is being replaced with "does the site have a social network account on Twitter? if so, is said account popular? if (a+b=yes) +1".
A reliance on Google based software (Analytics, Chrome) is inherently flawed. It's like trying to calculate the quality of a site based on its Alexa rank.
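The shift described in that last line can be sketched as two tiny scoring rules. This is only an illustration of the poster's pseudo-formula: the function names, the follower count as a stand-in for "popular", and the `min_followers` threshold are all made up for the example.

```python
def social_signal_old(has_account: bool) -> int:
    """Old rule: any Twitter account at all earns the point."""
    return 1 if has_account else 0


def social_signal_new(has_account: bool, followers: int,
                      min_followers: int = 1000) -> int:
    """New rule: the account must exist AND be popular to earn the point.

    `followers` and `min_followers` are hypothetical stand-ins for
    whatever "popular" would really mean.
    """
    popular = followers >= min_followers
    return 1 if (has_account and popular) else 0
```

Under the old rule a stagnant profile scores the same as a busy one; under the new rule it scores nothing.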
Also, for international and geographic searches it simply won't work because certain locales may be more prone to using one browser over another.
Finally, what if you don't use any google software? No analytics, no adsense, nothing on your site. And a non-chrome user visits your site. There is no data to be collected. I don't think this is a 10% thing - more like a 60% thing. And aside from the data you collect from what a visitor does once they are on the site, which is 90% of what is relevant to determining whether they found what they were looking for, Google will be in the dark with a lot of sites.
That being said, I'm not saying it isn't going to happen or happening, but Google can't expect results to be holistic if it relies on webmasters and users to use their software. Especially as MS and Yahoo are now encroaching on their search market share, their software usage will reduce likewise.
The big question is what do they do if there is no user data? Assume no one visits the site? Of course not. It's a coin flip: if you allow Google to track your site, they can see if it sucks or not. If you don't, they have to rely on other factors.
Otherwise, clearly, they have to penalize you for not using analytics or adsense, or if your website is not chrome friendly. I don't know legalese, but wouldn't that be an anti-trust issue?
Great thread thank you.
The missing piece above is explicit user feedback on the SERP using "Block result from xyz site". G will test the results from that and make the user provide more feedback.
Problem with that is it's a vote for or against a certain site. This information can be gamed.
Hell, you could hire people to use proxy servers and kill cookies to do it 400 times a day from different IPs.
Unless they restrict it to users who have Google accounts, and are logged in at that exact moment, which moves the feedback rate to the low single digits.
Even logging which result people click on promotes "catchy" titles and meta descriptions which is not indicative, in any way, of whether they found what they are looking for.
The only statistic I see, which is about 90% of what is of interest, is whether a person clicks on a link, goes to that site, and does not perform that search again within the next minute or so.
Which I suppose they could track with cookies... but doesn't that cause privacy issues? Are they allowed to store searches by IP or computer?
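For what it's worth, the "clicked and did not repeat the search within a minute" statistic is easy to sketch from a log of search and click events. The `Event` record, its field names, and the 60 second default window below are assumptions for illustration only, not anything Google has published.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One logged action (hypothetical record layout)."""
    user: str
    time: float          # seconds since an arbitrary start
    kind: str            # "search" or "click"
    query: str
    url: str = ""        # only set for clicks


def satisfied_click_rate(events, url, window=60.0):
    """Share of clicks on `url` NOT followed by the same search within `window`."""
    clicks = [e for e in events if e.kind == "click" and e.url == url]
    if not clicks:
        return None  # nobody clicked this result
    satisfied = 0
    for click in clicks:
        repeated = any(
            e.kind == "search"
            and e.user == click.user
            and e.query == click.query
            and click.time < e.time <= click.time + window
            for e in events
        )
        if not repeated:
            satisfied += 1
    return satisfied / len(clicks)
```

Note that the `user` field is exactly the part that raises the cookie and privacy question: without some stable identifier there is no way to tie the repeat search back to the earlier click.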
|...hire people to use proxy servers and kill cookies to do it 400 times a day from different IPs. |
I doubt it would go unnoticed.
|Unless they restrict it to users who have Google accounts... |
|...and are logged in at that exact moment... |
You have to be.
|The only statistic I see, which is about 90% of what is of interest, is whether a person clicks on a link, goes to that site, and does not perform that search again within the next minute or so. |
And is compared relatively to clicks and behaviors on other closely related results?
|Which I suppose they could track with cookies... but doesn't that cause privacy issues? |
I guess you haven't heard if you want privacy you shouldn't be online and definitely shouldn't use Google? lol
|Are they allowed to store searches by IP or computer? |
They personalize results by default; I think they're storing searches by anything they can identify a user by.
Nice theory Brett but this is just wishful thinking. In my competitive area, many of the new entrants are the ugliest, lamest, least useful, out of date, socially inept websites you've ever come across. The idea that these pages are engaging customers better than the entrenched real businesses they ousted would take some believing.
On the other hand, I can see that this is the future of Google's algo once this initial weeding out (or is that dose of pesticide) has taken effect. It's definitely the way to go, since it makes every web user a Google quality rater (instead of just webmasters, as under PageRank).