| This 54 message thread spans 2 pages |
|Chrome & Analytics Data Use... and Google conspiracy theories|
| 3:46 am on Jan 13, 2013 (gmt 0)|
Let's see if I can end the 'they're lying to us conspiracy theory arguments' or at least refute them wholly to those who can 'get' what I'm saying...
If you run a search engine the size of Google, with the goals of 'determining the one right answer' for people and 'organizing the world's information', you Would Not use Chrome and Analytics data directly in the algos you wrote.
Why? Because the web is representative of 'the whole', not Chrome (even though it's widely used) and not Analytics (even though it's widely used), so what would happen if you used that data directly is you would not have large enough data samples from Every Site to actually rank Every Site and organize Every Site.
The way you could and likely would use the data is to compare the sites, time on them, visitor habits, etc. to the algos you wrote and applied to the web as a whole, because if your goals were really to provide people with 'one right answer' and 'organize the world's information' for them, everything you do algorithmically has to apply to the Whole of the web, not the (even though widely used) limited data sets available via Chrome and Analytics.
IOW: You would use them to 'check your work' against reality and see if you're 'hitting the right sites' when people block them, or 'promoting the right sites' according to the ones people visit the most frequently, but you would Not base your rankings directly on their data, because even though there's a ton of it, neither 'scales' to Every Page on Every Site and Every Visitor, but your algorithms have to.
| 5:42 am on Jan 13, 2013 (gmt 0)|
Let me add 'the short version'.
You would Not use Chrome or Analytics data directly.
You could (likely would) use Chrome and/or Analytics data as 'comparison' or 'verification' measures against your algos.
There's a big difference...
And to re-explain: Neither Chrome nor Analytics provides data for Every Visitor and Every Page of Every Site, but Every Visitor is who you're trying to 'organize information and provide the one right answer' for ... When you do it on that scale and you're trying to do it for Every Visitor based on Every Page of Every Site you would not try to apply a 'limited' (although immense) dataset, because anything you do has to apply 'system' (which means Internet - every page you can find on every site) wide. But, you could (likely would) use the 'limited data' from those sources for verification purposes when those are your goals.
BTW: 'and Google conspiracy theories' was not part of my original title, the 'mystery mod' added it, and I wish they would have taken the '...' out and made it '& Google Conspiracy Theories' so it matched! Grrr ... LOL ... Thanks for letting it through though :)
| 6:56 am on Jan 13, 2013 (gmt 0)|
LOL TMS. Plagued by "ghost writers" eh?
I follow you on the Analytics part of your theory since not all websites run GA, however with Chrome having about 1/3 of the market share of browsers I am not sure I follow you there since all they have to do is take that into consideration.
| 7:03 am on Jan 13, 2013 (gmt 0)|
Take that into consideration how, though? That's the question. (Taking into consideration is still way different than using directly.)
What they've openly stated (relating to Panda or Penguin), and I believe them, is they checked the sites hit by the algo against the Chrome block list and got 84% or 86% of the most blocked sites right from the start ... They didn't 'use the blocks' in the algo, they verified the algo against the 'blocks'.
As far as Chrome goes, 33% market share leaves 67% 'unknown' and that's a big % to 'not know about' and a Huge number of people when you're talking about 500,000,000 regular Internet users (based on Facebook usage only) and if you figure Chrome users visit a 'relative to the whole' number of sites and pages that's a big number of 'data missing' holes from 125,000,000+ dot com sites alone (never mind how many pages those sites have or any other tlds and their pages).
When you think page/visitor information, there's actually more available via Analytics than Chrome, just based on sheer usage per page numbers ... I don't remember them off the top of my head though, but I know, last I heard, Analytics had way more coverage than Chrome overall, and AdSense likely has the most, but still, it's not 'the whole Internet', which is what they're trying to sort, organize and rank for all their visitors, not only 'certain segments' of it represented by the data they have access to from different avenues.
I still stick with: The algos have to cover the whole thing.
Not trying to rant at you, just explain.
[edited by: TheMadScientist at 7:10 am (utc) on Jan 13, 2013]
| 7:10 am on Jan 13, 2013 (gmt 0)|
I think Google has publicly stated they do not use either Chrome or Analytics data for search engine rankings. I believe that to be accurate.
The concept of them using Chrome or Analytics data for comparison data is plausible, but not necessarily the shortest route to obtain comparison data.
They have SERPs. A click on the first result with no return or first site "bounce" could indicate the searcher found what they were seeking. I see SERP churn all the time, two things I pick up from it are that titles and descriptions are essential to getting an original click, then the presented product has to match user expectations.
I do believe the "one right answer" is the goal, it's just not quite ready for public consumption yet.
| 7:16 am on Jan 13, 2013 (gmt 0)|
Here you go:
|When Panda launched initially, Google said that they didn’t use data about what sites searchers were blocking as a signal in the algorithm, but they did use the data as validation that the algorithm change was on target. They found an 84% overlap in sites that were negatively impacted by Panda and sites that users had blocked with the Chrome extension. |
Now, they are using data about what searchers have blocked in “high confidence situations”. Google tells me this is a secondary, rather than primary factor. If the site fits the overall pattern that this algorithm targets, searcher blocking behavior may be used as confirmation.
So, it's Still (as of Panda 2.0) used as a 'confirmation', but now it's automated into the system, so if a site fits the pattern AND it's blocked frequently there's no 'independent verification' needed ... It's not actually the 'blocks' in control, it's the other way around ... A site 'fits the pattern' and blocks are automatically taken into account as 'verification' the site shouldn't be there.
IOW: It's the algo that 'starts the ball rolling' and is 'written for the whole', then if there's 'independent confirmation' via blocks, the algo's 'confirmed' and a site can tank, but the blocks aren't 'the driving force' of the algo, they're the 'confirmation' of it.
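For anyone who wants the 'confirmation, not driver' idea spelled out, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Site` fields, the `block_threshold` value and the way `overlap` is computed are hypothetical and not anything Google has published. The only point is that a block count on its own never demotes a site, and that the validation step is just a set overlap.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    fits_pattern: bool  # hypothetical verdict of the web-wide algorithm
    block_rate: float   # hypothetical share of users who blocked the site


def should_demote(site: Site, block_threshold: float = 0.05) -> bool:
    """Blocks only confirm the algorithm; they never act on their own."""
    return site.fits_pattern and site.block_rate >= block_threshold


def overlap(flagged: set[str], blocked: set[str]) -> float:
    """Share of the most-blocked sites that the algorithm also flagged."""
    if not blocked:
        return 0.0
    return len(flagged & blocked) / len(blocked)
```

With this shape, a heavily blocked site that does not fit the pattern is left alone, which is the distinction the quoted article draws between a primary and a secondary factor.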
| 7:20 am on Jan 13, 2013 (gmt 0)|
I think you may have misunderstood my question or I am not understanding your assertion.
With Chrome being as widely used as it is, there seems to me to be no problem in gathering data from Chrome users from the whole internet. Unless Chrome specifically does not load certain sites that FF and Exploder will load, or there is a certain type of site that Chrome users are too dumb to go to, etc., the data would be gathered from the entire internet. Just not with different browsers.
| 7:31 am on Jan 13, 2013 (gmt 0)|
They don't have the sample size necessary to make an overall determination and apply it to the Internet as a whole ... They have 67% (roughly) of searches, which is much more reliable than 33% of site/page visitors using Chrome ... If the average Chrome user has average diversity in their browsing habits, there's not enough data spread over 125,000,000+ .com alone sites to make a definitive determination on anything reliably.
To give you an idea about 'sample size', think about an ad on TV where all Chrome users agree (which is highly unlikely, because it's 100% of Chrome users agreeing on something, but we'll pretend)...
1 in 3 people definitively agree [blah] is the best soft drink.
Uh, they'd never run that, because only being able to say one out of three people agree is saying you don't really know what 67% of the people really want (or that you aren't it) ... The sample size isn't big enough.
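To put rough numbers on the scale being argued about, here is a back-of-the-envelope sketch that uses only the figures quoted in this thread (500,000,000 regular users, a 33% Chrome share, 125,000,000 .com sites). It is a ratio of head counts, not of visits, so it says nothing about how many sites each person actually browses, and it does not settle the argument either way.

```python
internet_users = 500_000_000  # figure quoted in the thread
chrome_share = 0.33           # rough share quoted in the thread
dot_com_sites = 125_000_000   # figure quoted in the thread

chrome_users = internet_users * chrome_share
users_per_site = chrome_users / dot_com_sites

print(f"Chrome users: {chrome_users:,.0f}")
print(f"Chrome users per .com site: {users_per_site:.2f}")
```

On those figures there is a little more than one Chrome user per .com site, before counting individual pages or any other TLD.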
| 7:37 am on Jan 13, 2013 (gmt 0)|
Blocking is a user interaction stating that the user no longer wants to see results from a given site. That requires the search engine to acknowledge information from the user.
I'm not aware of significant other circumstances where Chrome is sending user information to Google. If you have verification that's being done, it would be very interesting to see.
There's a distinction between an algorithm segmenting sites meeting certain parameters and individuals using blocking data to verify algorithm accuracy.
| 7:40 am on Jan 13, 2013 (gmt 0)|
Let me give an idea of why they can use toolbar data to assess site speed ... They Can algorithmically determine if the average load time, template, graphic size, etc. relating to a site is consistent for GoogleBot and relate that to the time a site takes to load relative to the time other sites take to load in the same type niche for visitors, via One Page per site, but they can't do that with Everything relating to a site.
EG A user with the tool bar installed clicks on one page from one site in the results, then clicks on another page from a different site in the results from the same query ... They can make a determination on speed from that.
Because, they can tell:
The template's the same site-wide, the average graphic size is the same site-wide, the server is the same site-wide, they sent the visitor to an 'average page' on each site, etc. ... So, if it takes user N452 on connection X 3.2 seconds to download a page from Site A and 9.4 seconds to download a page from Site B, it's Very Safe to assume that, overall, Site A is roughly 3 times faster than Site B, but that's not a 'conclusion' you can draw for everything 'site wide' from a couple of single page visits by a single user.
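The arithmetic behind 'roughly 3 times faster' can be sketched as below. The function name `relative_speed` is made up for the illustration; the two load times are the ones given in the example above, measured for the same user on the same connection.

```python
def relative_speed(load_a: float, load_b: float) -> float:
    """How many times faster site A loads than site B for one user."""
    if load_a <= 0 or load_b <= 0:
        raise ValueError("load times must be positive")
    return load_b / load_a


ratio = relative_speed(3.2, 9.4)
print(f"Site A is about {ratio:.1f}x faster than Site B")
```

Holding the user and connection constant is what makes a single pair of measurements comparable at all, which is the point being made about speed versus other site-wide judgments.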
| 10:03 am on Jan 13, 2013 (gmt 0)|
No, no, the conspiracy theories aren't tied to rankings, they're about Google gathering personal information. None of the above even remotely tries to defend Google on that front and they DO use every product they own to that end.
| 10:47 am on Jan 13, 2013 (gmt 0)|
Do big brands get a pass on site speed? In my industry the top two results for the main keyword phrase show their sites exceeding 1.5 MB. Download speed on a 56K modem (which probably represents less than 2% now) is over six minutes.
According to websiteoptimization.com everything shows red in the analysis and recommendations area except
Is there something here I'm missing? Is websiteoptimization.com outdated? Are there other sources to rely on?
| 11:22 am on Jan 13, 2013 (gmt 0)|
TMS, I think it is possible to get reliable data from a third of the world's Internet users.
Surveys are routinely conducted on a representative sample and this has been found to be a reliable indication of what everyone thinks. Apply this to Chrome and in my view it's highly likely with a sample of a third of all Internet users in the world, you could come up with very representative metrics about every site out there.
Add to that the fact that search engineers know all about how to turn small samples of data into something meaningful by combining it with other types of similar data. There was a post here with a link to an article by an ex-google engineer who explained how they do it.
My view is that Chrome data is used in Panda which, as we know, is an 'add on' to the main algo. I think the main algo does all the usual relevance calculations and user metrics then come into play with Panda. I think it's more likely that google use data collected from other means to verify user metric data collected from Chrome.
I really don't see how anything man can manufacture can be more accurate at determining what people like than actual human feedback in the form of user metrics.
Everything I see in google right now tells me rankings are heavily influenced by the way humans react to websites. There are so many factors that humans assess every day in a second when using the web which take into account past experience, changing moods/preferences, trust, quality, etc. and there is no way a computerised process could keep up with that. However, user metrics reflect it every day. Understand what those numbers mean and you can tell a lot about what people think about a site overall.
Compare that to other similar sites and you've got a pretty good quality based scoring system you can overlay onto your main algo.
| 1:42 pm on Jan 13, 2013 (gmt 0)|
|If you run a search engine the size of Google, with the goals of 'determining the one right answer' for people and 'organizing the world's information' |
|I do believe the "one right answer" is the goal, it's just not quite ready for public consumption yet. |
I think they (google) are referring to facts and are already doing this to several things like say weather, dictionary meaning of words, etc. The knowledge graph is another tool used for this. So all the scraping of facts is done towards this objective and it is already live in some form.
| 2:18 pm on Jan 13, 2013 (gmt 0)|
|My view is that Chrome data is used in Panda which, as we know, is an 'add on' to the main algo. I think the main algo does all the usual relevance calculations and user metrics then come into play with Panda. I think it's more likely that google use data collected from other means to verify user metric data collected from Chrome. |
So they lie to us and Search Engine Land and Chrome is really the 'driving force' of the algo, not the other way around? ... I'm way more inclined to think they're telling the truth about it.
I was giving reasons for why they wouldn't use browser data directly and you're probably not going to convince me they have enough data from 1/3 of Internet users to score every page on every site, because there's not enough visits to all of them (likely even by all Internet users combined*) and like I pointed out earlier, it's totally different for a metric like speed that's true site wide than it is to 'score a site' based on a visit to a couple of pages.
There's pages on sites I've built I've never even seen (and I'm not sure anyone has), but that doesn't mean they or the sites they're on aren't useful to anyone either ... But a single 'satisfied Chrome visit' wouldn't be enough to 'score' them or the entire site on by any stretch either.
I would venture to say there are way more sites (or pages on those sites) only a person or two with the browser installed visits than there are that actually give you enough to be 'actionable data', because if they had 'short visits' you don't know what was 'wrong' ... Maybe the people didn't like the colors but the text was great. Maybe they're picky and didn't like a certain font. Who knows why a very small relative number 'gave an indication they didn't like the page(s)', but nothing 'strong enough' was wrong to block the site or even if they did block it, there's not a big enough 'pattern' for that info to be reliable and use it for a large number of people.
The converse is also true ... A couple people with a specific browser installed liking a page or two does not mean the whole site should be 'promoted and shown to everyone' in the search results.
Here's another reason to not use browser data:
As soon as they start using data from a specific browser (or site visitor data EG Analytics) as the 'driving force' of the algo, rather than some type of verification, and webmasters find out they will find a way to 'manipulate' the algo by 'cloaking' a site for the browser and Google loses any type of 'reliable sample' of what a site is really like for 67% of people or with Analytics, it's simple to remove it and use some other visitor stats instead.
I'm fairly sure I can think of more reasons to not 'drive the algo' with a browser or anything other than the algo rather than using the 'other data' as validation or verification of the algo, but I don't feel like it right now.
Using incomplete data as the driving force of the algo doesn't really make sense (to me anyway) ... It leaves too many 'holes' where you don't have that data or enough to 'make a decision' and opens you up to be gamed in a number of new ways, in my opinion.
Browser v Toolbar ... You can detect a browser as a site owner but can't detect the toolbar, and as soon as you base, even speed, on a specific browser you have access to data from and webmasters find out, you could easily indirectly 'shut out' your users from what could be good content, photos, etc. because it would be simple to detect someone's using Chrome and not show all the related photos for a story (for example) to 'give the impression' of speed and if that happens you run the chance of creating a 'less colorful and real' Internet for the users of your browser which means you stand the chance of having people just stop using it, because 'visitors see so much more when they use FireFox or Explorer'.
| 3:18 pm on Jan 13, 2013 (gmt 0)|
They can't even index the whole thing, so I think assuming Chrome visits to sites/pages that have 'surfaced' or even Analytics data are 'telling and definitive' grossly underestimates the size of the Internet and the relative % of pages and sites people actually visit ... In 2005 based on size estimates Schmidt said it would take 300 years to index if all growth stopped. (Of course that was before Caffeine, so maybe they got it down to something reasonable, like 30 years - assuming there's not 700,000 pages a minute still being added.)
|The Internet is comprised of approximately 78 million servers that span the globe (That number is quite possibly very low.) Information on the Internet is being measured in Terabytes, and a Terabyte is 1,000 Gigabytes. One estimate in 2005 by Eric Schmidt, CEO of Google, puts the estimate at near five million Terabytes of information on the Web, four years ago. |
Google's search engines managed to index about 200 Terabytes in seven years (as of 2005) as a comparison of how large that really is; 200 Terabytes is only .004% of five million Terabytes!
700,000 new pages of information per minute are added to that tally. If the internet stopped all forward progress it would take another 300 years for Google to index it all.
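The percentage in the quote can be checked directly. This sketch uses only the two figures given there (200 Terabytes indexed, five million Terabytes estimated) and makes no claim about how accurate either estimate was.

```python
indexed_tb = 200      # quoted: indexed in seven years, as of 2005
total_tb = 5_000_000  # quoted: the 2005 estimate of information on the Web

share = indexed_tb / total_tb
print(f"Indexed share: {share:.4%}")  # prints "Indexed share: 0.0040%"
```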
| 3:58 pm on Jan 13, 2013 (gmt 0)|
|Surveys are routinely conducted on a representative sample and this has been found to be a reliable indication of what everyone thinks. Apply this to Chrome and in my view it's highly likely with a sample of a third of all Internet users in the world, you could come up with very representative metrics about every site out there. |
That was what I was trying to get at. Good Job Claarky!
|There's pages on sites I've built I've never even seen (and I'm not sure anyone has), but that doesn't mean they or the sites they're on aren't useful to anyone either ... But a single 'satisfied Chrome visit' wouldn't be enough to 'score' them or the entire site on by any stretch either. |
I would lay money on the fact that they are not ranked in the top 10 for any quantifiable search term either.
I am inclined to believe that Google engages in subterfuge. Is it a conspiracy? No. It is the culture of corporate America. If they divulged all their sources, then not only can webmasters skew their data set but competitors can replicate it.
| 4:27 pm on Jan 13, 2013 (gmt 0)|
I don't think google have 'lied' to us about Chrome data, I think they were economical with their words. They were asked if Chrome data is used in the algo and their response was along the lines of 'not in the main algo'. Panda is not part of the main algo so it's feasible that chrome data is used to calculate quality score for a site based on overall user behaviour (relative to that of similar sites).
I agree google would not want us to know if this was the case but I think by now they have sufficient data from other sources they could easily spot most tricks people might try and nullify those attempts or penalise those sites.
Just to clarify, I don't think they're judging individual pages based on chrome data, I think they're judging the site as a whole. Rankings of the whole site are then promoted or demoted by your 'score'. So it doesn't matter if individual pages don't get enough traffic to make a judgement about their quality, all pages are trusted (or not) based on overall site quality.
That does however mean poor pages from well known sites can rank very well (which is something being reported regularly). The more browser data google obtains the less this will occur (in other words, yes they do need the other two thirds of browser data to do a better job).
I think this is the best they can do right now, but using browser data gets them closer than not using it.
For the record, I don't think GA data is used at all, mainly because it's too inaccurate and not collected at a detailed enough level for their purposes. It is however useful to webmasters as a guide to the kinds of metrics they are using.
| 5:19 pm on Jan 13, 2013 (gmt 0)|
I don't think any lies are being told about Chrome/GA data directly affecting SERPs either, but I also agree that 1/3 market share of Chrome would be too small, but for a different reason than cited above (unless I missed it somewhere).
Chrome users will be more likely to fit into certain demographics, just like IE, Firefox, Opera etc users might be more likely to fit into different demographics. You'd be skewing the data based on that fact for sure. If you could collect data from a completely random third of Internet users, the aggregate data would be far more representative in my opinion.
| 5:24 pm on Jan 13, 2013 (gmt 0)|
You're not understanding the whole 'size of the web' issue claarky and taberstruths ... It's too big to get enough information to rank the whole thing from people's browsers.
Maybe they're up to .008% of it now ... It's just plain too big.
How much faster do you think they can spider than a statistically meaningful sample of people in general (never mind people using Chrome) can even visit sites or pages (assuming people know they're there, never mind finding them)? Very generously, even with all the computing power Google has, they might have .01% of the web indexed (that leaves 99.99% they don't have) and they can spider in a hurry (way faster than people can find and visit) ... Waiting for a statistically meaningful sample of people (using any browser) to find and visit a site or page to determine a position for ranking it would likely be futile.
If the data from browsers dictates the rankings of sites, where's 'discovery' come in? You don't have data from new sites and pages, so they wouldn't ever be part of the equation if you promoted pages/sites based on 'browser positives', because the 'no browser info' pages/sites would always be ranked too low for anyone to find, so they might as well stop spidering anything not 'already positive', because it won't ever show anyway if they use browser info to directly influence rankings of pages and sites.
It's really not very bright, in my opinion, to use browser info to control the algo, because like I said previously, it leaves too many holes, but you can think they do if you like.
|I would lay money on the fact that they are not ranked in the top 10 for any quantifiable search term either. |
Uh, I would guess it, like most of the rest of the pages on the site, ranks at the top for some specific search terms, which is why it was built the way it was. But if you choose to 'put blinders on' and think there's no way it could be helpful to anyone or that Google doesn't rank the site well, that's fine by me ... You have no clue what the site is about anyway, but I can tell you it's linked to by some major colleges and universities (over 50 last I know of) because it does some things quite a bit better than any other site like it, including the .gov...
|Chrome users will be more likely to fit into certain demographics, just like IE, Firefox, Opera etc users might be more likely to fit into different demographics. You'd be skewing the data based on that fact for sure. |
Oh, so the sample isn't representative of the whole? If they had a bigger sample size % would it likely be more representative? I'd think so.
We're both saying the sample isn't accurate, and I put it too simply and used a bad example for some to 'get' earlier, sorry, my bad.
Even setting all the other arguments aside, it still doesn't make sense to me (especially when they Don't Need to) to 'promote or demote' pages or sites based on:
1 out of 3 people who visit this site seem to like it...
What about the other 2 out of 3? No clue...
| 6:57 pm on Jan 13, 2013 (gmt 0)|
TMS, just to address your size of the web and discovery question, as you know, google tests the pages of new sites with new traffic all the time. It knows about these through links etc. and they just throw a little bit of traffic at a new site, see how people respond and determine relative quality from that. If the test reveals the new site is good, more traffic is tested and so on.
I'm open minded to other ideas such as user metrics coming from another source but I don't buy the argument that it's possible to programmatically replicate the million decisions a human makes in a second about a site. Flawed as a third of the stats may possibly be, I think it's unlikely man could generate anything even close by crawling a site and making decisions that way.
In my mind it's not a case of whether it's perfect, it's a case of what better options are there realistically. Google is very keen to have everyone using chrome (on mobile as well) - I think that's significant.
| 7:09 pm on Jan 13, 2013 (gmt 0)|
|I'm open minded to other ideas such as user metrics coming from another source but I don't buy the argument that it's possible to programmatically replicate the million decisions a human makes in a second about a site. |
They don't need to, really ... What they need to be able to do algorithmically to get to their goal is be able to 'match' people to pages, so if they do something along the lines of 'algorithmically assigning each page a wave' based on the variables they have access to for every page on every site and also assign each searcher a 'wave' with the same variables they can use the behavior of the searcher in their results to refine the searcher's 'wave' and can 'dial in' to what page(s) 'wave' matches what searcher's 'wave' best and show specific visitors the 'best wave matches' based on the unique 'wave' each searcher would create from their behavior in the results provided.
I'm probably not explaining what I'm thinking very well, because I've tried to figure out how for a couple weeks and haven't been able to do it, but the point is by using the variables they have access to across the board and assigning those same variables to individual visitors and refining the 'pattern' assigned to their visitors based on behavior then showing the results closest matching the visitor's 'wave' over time they can refine to 'better answers' for each visitor individually without needing anyone to use a specific browser or a specific stat program on their site or anything other than their search results.
What I'm trying to explain is a way more 'broad and inclusive' way of refining results to visitors on a per-searcher-per-query-per-page level rather than 'clouding the picture' with external noise from 'limited data sources'.
I can see them using 'external sources' as verification of direction during refinement, but actually controlling the results (or influencing them) from limited datasets isn't something I think they need to do or would want to, because it's limiting possibilities for results rather than 'narrowing to the right one from the whole' on a personal level for each searcher like they're trying to.
| 8:00 pm on Jan 13, 2013 (gmt 0)|
Let me add...
What I'm thinking and talking about (trying to figure out how to explain) is them being able to 'grab one page' out of the whole they have and show it to an individual searcher based on a 'wave match' (for lack of a better way of explaining) and it wouldn't matter if only 1 person out of 1,000,000 liked the page or not in that type of system, because the specific result shown to a specific visitor (searcher) would not be influenced by other visitors' (searchers') behavior relating to the page being shown as the result ... The result shown to a specific visitor wouldn't be based on anyone else's likes/dislikes, or anything other than the 'wave of a page' being the closest match to the 'wave of a visitor' ... That's true personalization.
Basing results on 'overall behavior relating to a site' is 'stereotyping' (so are link based systems), not 'personalization', but to 'get to personalization' from where they started and keep visitors happy and results generally relevant, there would have to be a Ton of grouping, segmenting, narrowing and behavior analysis from 'the starting point' first, so 'safe answers' would have to be 'kept in the mix', which is where I think things are...
And, a crazy little 'side-bar' to the 'wave matching' I'm trying to explain is inbound links wouldn't matter a bit, because results would become about what an individual likes ('responds well to') rather than 'overall vote' driven like the results are when things are based on links ... IOW: The more personalized things get, the less it matters which site 'votes' for which other site and the less 'gameable' and 'manipulatable' things become, because either 'the wave matches' or 'it doesn't' ... Links are dying as a ranking signal, long live links!
| 8:53 pm on Jan 13, 2013 (gmt 0)|
@TMS I was not casting aspersions on your site or its quality. My comment was that if a page had no visitors, no traffic and had never been visited by Chrome users, then it probably is either not indexed or not relevant to a quantifiable search term. It had nothing to do with site metrics, but page metrics.
Now let me try to explain what I am trying to get at about the lack of need for Chrome to be universal to be used as a data set for ranking.
Every 4 years we go through an obnoxious election cycle. A whole industry of analysts called pollsters is employed to project who is leading, who will be elected, etc. They base these projections on a small set of data points taken from a small group of people, and they are usually pretty accurate.
Thus Chrome can be a small set of samples that Google uses to project what people will want within a margin of error that most people would find acceptable.
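The pollster comparison can be made concrete with the standard margin-of-error formula for a proportion, z * sqrt(p * (1 - p) / n). This is a sketch under a big assumption that the thread itself disputes: that the sample is a simple random sample of the whole population. The worst-case p = 0.5 and z = 1.96 (roughly 95% confidence) are conventional defaults chosen for the illustration, not figures from any post here.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Margin of error for a proportion estimated from n random responses."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return z * math.sqrt(p * (1 - p) / n)


for n in (1, 10, 100, 1_000, 10_000):
    print(f"n={n:>6}: +/- {margin_of_error(n):.3f}")
```

The formula depends on the number of observations rather than on the fraction of the population sampled, so a thousand random responses about one question give about +/- 3 points, while a page seen by only a handful of visitors gives almost nothing usable. Both sides of the argument above are arguing about which of those two situations applies.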
| 9:04 pm on Jan 13, 2013 (gmt 0)|
|My comment was that if a page had no visitors, no traffic and had never been visited by Chrome users, then it probably is either not indexed or not relevant to a quantifiable search term. It had nothing to do with site metrics, but page metrics. |
I said I'm not sure if anyone has or not ... It's a big site and I don't look to see if every single page on it has ever been visited in the history of it being online, but I know I haven't visited all of them.
Anyway, enough of that, because it's neither here nor there WRT this thread...
My previous two posts explain the best I can why I don't think they would want to or need to use Chrome to generate or influence the results directly ... The 'short version' is: It's a limiting (and stereotyping) factor.
| 9:09 pm on Jan 13, 2013 (gmt 0)|
Okay, let's break this down.
Wave of user = ?
Wave of page = ?
I think user interaction with a site has to be in there somewhere and then comes the question of how you collect that data.
| 9:17 pm on Jan 13, 2013 (gmt 0)|
Wave of page = ?
Everything they can get algorithmically and compare to all other pages they find web-wide ... Basically, every variable they can apply to every page, but also limited to only those they can apply to every page ... 200+ variables at last count and way too many to list, especially when you take into account the image search could (easily in my opinion) be applied to a site's pages, so colors, white space, design, layout, etc. can also be factored in.
Wave of user = ?
Same variables as the wave of a page, likely multiple waves based on determined query intent typing and other 'relationships to queries' (which is likely where the 'knowledge graph' would come into the picture) ... Those waves would then be refined by an Individual's Specific Behavior in the results, not their behavior on the pages of external sites, because you really don't need it and the amount of visitor behavior data you can get from external sites is limited, so why use it when they will tell you consistently in the results what they like and don't like over time?
In my opinion, the most reliable way to personalize results and 'grab the right answer out of the whole of the index' for a specific visitor is to base the personalization of the wave(s) assigned to a visitor on the results you show and how each visitor individually interacts with them, because it's by far the most consistently available information you have access to.
The results are what they're trying to refine and they're trying to do it on a per-user, per-query, per-page basis to get the 'right answer' for an individual out of the whole of their index and the results are what they would absolutely have to 'fall back on' when the visitor uses Explorer and visits a page without Analytics (EG Yahoo! or Apple or Bing or Microsoft, etc.), so why would they ever 'leave the results' for info in the first place rather than figuring out how to 'gauge behavior' (or 'make determinations') within them more accurately by comparing their refinements or method of refining the results to 'other data sources' (such as Chrome or Analytics) for verification of direction?
ADDED: That's Exactly what they did by running Panda and automatically incorporating the blocks from Chrome as 'outside verification of direction' and removing sites that 'fit the pattern' the algo found when they were 'Chrome Block Verified' ... Looking to Chrome users for independent verification of the refinements/method of refining the results is totally different than 'driving the results with Chrome' ... They just 'automated the outside verification of direction' they were going for in refining the results, rather than having people physically look at a list to make sure the algo was 'hitting the right sites' when there was a 'high degree of certainty' the algo was correct in the pattern match.
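A toy sketch of that 'automated verification' idea, purely to illustrate the logic described in this post and not a claim about how Panda actually works. The function name, the thresholds and the data are all hypothetical.

```python
def block_verified_matches(pattern_scores, block_counts,
                           score_threshold=0.9, block_threshold=50):
    """Return sites the algo is highly confident about AND that users blocked.

    pattern_scores: site -> algo confidence that the site fits the pattern
                    (available for every site in the index).
    block_counts:   site -> number of user blocks seen in the external sample
                    (available only for some sites).
    Both thresholds are made-up illustration values.
    """
    verified = []
    for site, score in pattern_scores.items():
        high_confidence = score >= score_threshold
        # Sites with no external data simply count as zero blocks here.
        externally_confirmed = block_counts.get(site, 0) >= block_threshold
        if high_confidence and externally_confirmed:
            verified.append(site)
    return sorted(verified)

pattern_scores = {"site-a": 0.95, "site-b": 0.97, "site-c": 0.40, "site-d": 0.92}
block_counts = {"site-a": 120, "site-c": 300, "site-d": 3}
print(block_verified_matches(pattern_scores, block_counts))
```

Note the block data never produces a match on its own: site-c has plenty of blocks but does not fit the pattern, so it is left alone. The external data only confirms what the algo already found.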
| 10:01 pm on Jan 13, 2013 (gmt 0)|
But if you ignore user behaviour on a page you have a huge hole which you can only guess about from data obtained from behaviour within the search results. That's a big open door to webspam.
To me, guessing about user behaviour on a page is worse than making assumptions based on a representative sample of actual user behaviour metrics.
The wave idea may well be the way they are headed in terms of personalisation but I just can't see any way it can exclude on-page metrics.
| 10:08 pm on Jan 13, 2013 (gmt 0)|
By the way, I don't think browser data is driving the results. I think relevance drives things, and user metrics (possibly collected from the Chrome browser) are like the final quality check that prevents rubbish getting through by demoting low quality and promoting high quality.
Without that quality control the one right answer could all too easily be webspam.
| 10:14 pm on Jan 13, 2013 (gmt 0)|
Well, it does include quite a few 'on-page metrics' when you take into account they could easily use color, template, design, contrast, etc., and the 'more reliable way' to do it (like I tried to say earlier but probably didn't very well, because this is tough to explain) is to look at behavior in the results and determine how to 'gauge that behavior' within them more accurately based on external sources, so, spelled out...
You write an algo to refine the results shown for a person based on their behavior in the results. Then you look at the user behavior within the results where you have access to the external data for visitors and use the external sources + in-result behavior to see if pages/sites exhibiting positive signals via those external sources move up or down. (IOW: Did you 'get it right' based on 'external confirmation' from data you have access to?)
You could probably even 'do it the other way around' and look at the external sources for 'positive behavior signals', then look at the behavior of those visitors in your results and figure out how to write an algo that applies 'Internet wide on a per-visitor behavior basis' to further refine the results for a person by making sure you move the sites exhibiting positive signals up for people exhibiting similar behavior in the results to those you can 'see after they left' via external sources.
What you would not do, because it doesn't apply Internet wide, is actually write the data from the external sources into the algo ... You would use them for 'verification' or 'direction', but not directly, because if you use them directly you 'miss' too much, but if you use them for 'direction' or 'verification' and find a better way to refine your results, then it's an overall win that scales to the entire index easily.
'You' is, of course, generic in this post.
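The 'verification, not direct use' idea above can be sketched roughly like this. It is only an illustration under made-up assumptions: the function name, the scores and the engagement numbers are all hypothetical. The algo scores every page; the external data covers only some pages and is used just to check how often the two agree on ordering.

```python
from itertools import combinations

def pairwise_agreement(algo_scores, external_signal):
    """Fraction of page pairs ordered the same way by both sources.

    algo_scores:     page -> score from the algo (covers every page).
    external_signal: page -> observed engagement (covers only some pages).
    Only pages present in both are compared; ties are skipped.
    Returns None if there is nothing to compare.
    """
    covered = [page for page in algo_scores if page in external_signal]
    agree = total = 0
    for a, b in combinations(covered, 2):
        algo_diff = algo_scores[a] - algo_scores[b]
        ext_diff = external_signal[a] - external_signal[b]
        if algo_diff == 0 or ext_diff == 0:
            continue
        total += 1
        if (algo_diff > 0) == (ext_diff > 0):
            agree += 1
    return agree / total if total else None

# Page "b" has no external data but is still scored by the algo.
algo_scores = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.2}
external_signal = {"a": 120, "c": 45, "d": 60}
print(pairwise_agreement(algo_scores, external_signal))
```

The external numbers never feed into the scores themselves. A low agreement figure would just be a hint that the algo needs another look, and whatever change gets made still applies to every page, including the ones with no external data.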