Robert_Charlton - 10:15 am on Feb 2, 2011 (gmt 0)
pontifex (and several others here) are definitely on to something. I should add, with regard to those "shingles" which pontifex cited, that we're now also into n-grams and vectors. I've noticed some collateral damage with the update that I think may shed some light on these thoughts....
An in-depth consumer information article I wrote about 6 years ago, which has had consistent #2 clustered rankings for a competitive single-word query, has been scraped to death, so much so it's been impossible to follow up with DMCAs. Over the years, it's occasionally dropped out of Google and come back. I could generally check whether it was in the index by quoting a sentence, and/or make it come up by disabling the dupe content filter. Until now, Google has always brought it back.
With this update, the article has basically vanished for its search terms, replaced by a newer and fluffier article on another domain with more social-friendly packaging. No hard feelings that it's been outranked, and I've learned a few things about the packaging. The new article doesn't quote my original article at all. It parallels it quite a bit, but they all do... the story is essentially the same.
What I'm seeing, though, is that the original article also now disappears not only competitively, but also on searches for some quoted segments, though not all of them. Google now is apparently not treating the article as a whole. It's likely... for reasons described in this thread... that looking at any article as a whole is becoming impossible. As pontifex suggests it might be, Google appears to be looking at the article in pieces.
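For anyone who wants to try the "article in pieces" idea on their own pages, here's a rough sketch of word shingles and a per-sentence overlap score. To be clear, this is textbook w-shingling with Jaccard similarity, not anything Google has confirmed it uses. The function names and the 4-word window are my own assumptions for illustration.

```python
import re

def shingles(text, w=4):
    """Return the set of w-word shingles (overlapping word n-grams) in text.

    The window size w=4 is an arbitrary illustrative choice.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < w:
        # Text shorter than one window: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def chunk_overlap(sentences, other_text, w=4):
    """For each sentence, the fraction of its shingles also found in other_text."""
    other = shingles(other_text, w)
    scores = []
    for sentence in sentences:
        own = shingles(sentence, w)
        fraction = len(own & other) / len(own) if own else 0.0
        scores.append((sentence, fraction))
    return scores
```

Running `chunk_overlap` with your own sentences against a scraper's page would show which chunks were copied most heavily, which you could then compare with the quoted segments that no longer return your article.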
If I search for exact strings, say, sentence by sentence, it appears that Google is also no longer treating these queries as searches for exact word matches, but may rather be looking at them conceptually.
This is something we've discussed in the page title discussions, and it has been mentioned in various update threads... I'd have to do some checking to find the references... but each chunk appears to have a different level of competition that's fairly pronounced, which was not previously the case with a 12-15 word quoted search.
Perhaps this relates to how often a phrase has been scraped... perhaps to how competitive the vocabulary or the "concept" is that's described by the quoted string... or there may be a quirk in Google's phrase-based indexing. I see that the core sentences are those that disappear most often.
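If you want to repeat the sentence-by-sentence check on your own content, something like the following would build the quoted strings to paste into the search box. It's only a sketch: the sentence splitting is naive, the function name is made up, and the 12-15 word bounds simply mirror the quoted-search length I mentioned above. It doesn't query Google; it just prepares the strings.

```python
import re

def quoted_queries(article, min_words=12, max_words=15):
    """Build exact-match (quoted) search strings, one per sufficiently long sentence.

    Sentences are split naively on ., ! or ? followed by whitespace.
    Sentences shorter than min_words are skipped; longer ones are
    truncated to their first max_words words.
    """
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    queries = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) < min_words:
            continue
        phrase = " ".join(words[:max_words]).rstrip(".!?,;:")
        queries.append('"' + phrase + '"')
    return queries
```

Logging which of these strings still return the original page, and which don't, would give a crude map of the chunks that have dropped out.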
I've been seeing parallels to this for a while now on sites that had a lot of internal duplication, used a lot of repetitious boilerplate in their content, had a lot of affiliate duplication or were scraped a lot, etc. This example I'm citing now is sobering enough that I'm thinking that assumed "evergreen" content may be more vulnerable than thought... and, very simply, if material gets duplicated and shifted around long enough, Google may be giving up on identifying the source and dumping it into the dustbin of history.
I have other thoughts involving testing that I believe the article has gone through, judging from externals I've observed (multiple pages from the site being returned, I think, to test which of several pages should stay). I wasn't associated with the site last year, though, so I haven't had a chance to monitor that particular data, if it was collected.
For now, I can say from SERP watching that a lot of search refinements I've been seeing on various searches, like multiple pages returned for a domain, etc., seem to have shifted with this update... and not just for this set of queries, but for many others... perhaps suggesting that the evaluations which the refinements were a part of have been incorporated into the new algo and the testing has shifted somewhere else.