Simon_H, as I review my last two posts above, I see they are very rough, with some obvious key points omitted... so it's not surprising you may be taking those comments further than I intended. I apologize for the confusion.
It's late enough that I'm not going to straighten these out tonight... but let me mention for now some thoughts I think are important, and take it more slowly from here. Let's start with just an overview of how seed sets might have been used organically...
Essentially, Panda is a trial-and-error algorithm ("heuristic" is the word used). Gut-level quality assessments, formalized into a rigorous, scorable set of questions, were increasingly refined to produce an algorithm that filtered out the initial set of bad sites, and which could then be refined further against the smaller group that remained.
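To make the "scorable set of questions" idea concrete, here's a minimal sketch, entirely my own invention rather than anything Google has published, of how gut-level rater questions might be reduced to a single comparable number:

```python
# Hypothetical rubric: a few of the kinds of quality questions Panda's
# raters reportedly answered, reduced to yes/no scores. The question
# names and weights here are invented for illustration.
QUALITY_QUESTIONS = {
    "would_you_trust_this_site_with_a_credit_card": 3.0,
    "is_content_written_by_experts": 2.0,
    "does_site_have_duplicate_or_overlapping_articles": -2.0,
    "would_you_expect_to_see_this_in_a_print_magazine": 1.0,
}

def rubric_score(answers: dict) -> float:
    """Turn a rater's yes/no answers into one comparable number."""
    return sum(weight
               for question, weight in QUALITY_QUESTIONS.items()
               if answers.get(question))

# Example: a thin content-farm page might score low or negative.
print(rubric_score({
    "would_you_trust_this_site_with_a_credit_card": False,
    "does_site_have_duplicate_or_overlapping_articles": True,
}))  # -> -2.0
```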
Initially, separating out a proliferation of shallow, pointless machine-generated content was the big problem Google needed to deal with. Content farms, which were annoying everybody (except perhaps Jason Calacanis), were also targeted early in the algo's evolution.
Google initially evaluated those seed sites, I'm guessing, on an offline test bed, refining the algorithm until good-quality sites (at least by Google's definitions) remained and bad-quality sites were removed... and when results were satisfactory enough, Google tested the algorithm, as opposed to the list of sites, on a limited area of the web (e.g., on a single data center). I'm assuming the initial refinements were query-neutral. In all these tests, the idea was to retain the good sites and get rid of the bad, with no false positives. The algorithm is what evolved... the seed sets were used for calibration, and for guiding initial algorithm choices, using an area of AI called decision tree learning.
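Since I mentioned decision tree learning, here's a tiny sketch of the technique itself: fit a tree to human-rated seed sites so it can reproduce the raters' good/bad calls from measurable signals. The features, numbers, and use of scikit-learn are all my own assumptions, not Google's implementation:

```python
# Fit a decision tree to hand-labeled seed sites so it can reproduce
# the raters' good/bad calls from measurable signals. All data invented.
from sklearn.tree import DecisionTreeClassifier

# Per-site features: [ad_density, avg_words_per_page, duplicate_ratio]
seed_features = [
    [0.60,  250, 0.80],   # thin, ad-heavy, duplicated -> rated bad
    [0.55,  180, 0.70],   # similar profile            -> rated bad
    [0.10, 1400, 0.05],   # deep original content      -> rated good
    [0.15, 1100, 0.10],   # deep original content      -> rated good
]
seed_labels = ["bad", "bad", "good", "good"]  # the human seed-set ratings

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(seed_features, seed_labels)

# The calibrated tree can now score unrated sites with similar signals.
print(tree.predict([[0.50, 300, 0.65]]))  # -> ['bad']
```

The appeal of a tree here is that its branches ("if ad density above X and word count below Y...") stay human-readable, so each refinement cycle can be checked against the raters' intuitions.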
I'm guessing that some of the zombie results appear at this stage, for sites that sit in a limited but public test area of Google's index. Because I'm not a statistician, I can only conjecture that traffic may have been throttled or added for these sites, to allow comparison among test sites, perhaps with filters on and filters off. With each algo revision, there might be another set of tests. Hence my guess about the seeming correlation between updates and zombie effects, but with only a small percentage of sites affected. I'm not sure why the same sites get hit repeatedly, except that they fall in the same grey area, and Google might choose to follow them over time.
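For the filters-on vs. filters-off comparison I'm imagining, the arithmetic could be as simple as a standard two-proportion test on click-through rates between buckets. All numbers below are invented; this just shows the kind of calculation a statistician would reach for:

```python
# Did the treated bucket's click-through rate move more than chance
# would explain? Standard two-proportion z-test; all numbers invented.
import math

def two_proportion_z(clicks_a, impressions_a, clicks_b, impressions_b):
    p_a = clicks_a / impressions_a
    p_b = clicks_b / impressions_b
    pooled = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se = math.sqrt(pooled * (1 - pooled)
                   * (1 / impressions_a + 1 / impressions_b))
    return (p_a - p_b) / se

# Filters off (control) vs. filters on (treatment) for one test site:
z = two_proportion_z(clicks_a=480, impressions_a=10_000,   # CTR 4.8%
                     clicks_b=390, impressions_b=10_000)   # CTR 3.9%
print(f"z = {z:.2f}")  # |z| > 1.96 suggests a real shift at the 95% level
```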
Beyond separating sites with one attribute from sites with an opposite attribute, there must have been evaluation of edge cases that might have kept some sites in the group.
Eventually, though, the algorithm would go wide, to the entire web, then perhaps be further refined, or split into numerous branches. In general, though, I think the emphasis would have been on making sure the algorithmic refinements would scale over a wide range of sites... and allow the next stage of refinements.
Over time, the algo has come to consider factors such as above-the-fold ad density, user intent, user engagement, and personalization. There is a huge amount of "recursion" in the algorithm... more on that to come.
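As one example of how a factor like above-the-fold ad density might be quantified (my own guess at the mechanics; Google's actual page-layout signal isn't public):

```python
# One plausible reading of "above-the-fold ad density": the fraction of
# the first-screen area occupied by ad blocks. Layout model and numbers
# are invented for illustration.
FOLD_HEIGHT = 600       # assumed first-screen height in px
VIEWPORT_WIDTH = 1000   # assumed viewport width in px

def above_fold_ad_density(ad_rects):
    """ad_rects: (x, y, width, height) boxes for each ad on the page."""
    fold_area = FOLD_HEIGHT * VIEWPORT_WIDTH
    ad_area = 0
    for x, y, w, h in ad_rects:
        # Clip each ad box to the visible first screen.
        visible_h = max(0, min(y + h, FOLD_HEIGHT) - max(y, 0))
        visible_w = max(0, min(x + w, VIEWPORT_WIDTH) - max(x, 0))
        ad_area += visible_h * visible_w
    return ad_area / fold_area

# A 1000x250 leaderboard plus a 300x250 box, both above the fold:
print(above_fold_ad_density([(0, 0, 1000, 250), (700, 300, 300, 250)]))
# -> ~0.54, i.e. over half the first screen is ads
```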
Also, while there's much that I'd suggest as required reading, the following was, for me, an important interview, written by Steven Levy, whose book "In the Plex" I would also call required reading for anyone seriously interested in Google. Here's the interview Levy conducted with Amit Singhal and Matt Cutts:
TED 2011: The 'Panda' That Hates Farms: A Q&A With Google's Top Search Engineers, by Steven Levy, 03/03/2011 http://www.wired.com/2011/03/the-panda-that-hates-farms/ [wired.com]
This interview just scratches the surface, but it's beautifully written, and it provides a more orderly introduction than my trying to cram everything into one post, using too many big words because there's no time for explanation.
Simon_H wrote... Regarding the cyclic AI correction factor, yes, data arrays would presumably be shared across organic and, say, shopping results, so each would end up learning from the other and there would be an indirect dependency.
I think we may be looking at the "dependency" between organic and paid differently, in that I wouldn't have said there's a dependency between the two at all.
I very much doubt there's any shared post-click correction factor, and I have no idea how close the correspondence is that everybody is seeing between organic and PPC. That correlation, I think, belongs in the AdWords forum, and I haven't seen anyone post on it yet.
I think both organic and paid may be receiving the same RankBrain feed of extremely odd queries, but that is a big guess. I'd love to hear what kinds of queries members think aren't working any more. My guess is that just as keyword-stuffed titles are being rewritten, long, keyword-heavy queries targeting, say, exact phrases might not work any more. Feedback would be appreciated.
The other possibility, it seems to me, would be random clicks by human spammers, trying to hide obvious click patterns by introducing lots of distraction. There's a video about something similar being observed in Facebook ads a while back; I'll try to dig it up.
Beyond that, there are too many unknowns about the affected queries, and about RankBrain, let alone about Zombies, for me to try to connect the two...
More to come, though slowly. Again, I'd appreciate input from those who've seen and measured the results first hand.
Note: Edited above to add attribution to Simon_H's quote for clarity, and to correct a typo.