It almost seems like Google has expanded the number of indices that it uses to generate results by an order of magnitude or two. Remember when there was a Fresh and a Non-Fresh index? It's like that again, only the indices are now Similar Content Index #0000000001 to Similar Content Index #10000000000, or whatever -- millions of much smaller indices, each of which contains pages that are similar in some material respect.
In essence, my guess is they've spent a ton of time working to identify very similar pages so that they can avoid presenting multiple pages in the top results that cover the exact same ground. The top results can be more diversified in this respect, leading to better results and crushing sites that copy content or don't add much new to a given topic. That's been a stated aim for some time.
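To make the "diversified top results" idea concrete, here's a rough Python sketch of how a ranked list could be thinned so that near-identical pages don't both make the cut. This is purely illustrative: the `diversify` function, the 0.8 threshold and the example URLs are all made up, and Python's `difflib.SequenceMatcher` is only a stand-in similarity measure, not anything Google is known to use.

```python
from difflib import SequenceMatcher


def diversify(ranked_pages, threshold=0.8, limit=10):
    """Walk a ranked list of (url, text) pairs and greedily keep pages,
    skipping any page too similar to one that has already been kept.

    threshold and limit are arbitrary illustrative values.
    """
    kept = []
    for url, text in ranked_pages:
        too_similar = any(
            SequenceMatcher(None, text, kept_text).ratio() >= threshold
            for _, kept_text in kept
        )
        if not too_similar:
            kept.append((url, text))
        if len(kept) == limit:
            break
    return [url for url, _ in kept]


# Hypothetical ranked results: the second page is a near copy of the first.
ranked = [
    ("a.example/widgets", "blue widgets are durable and cheap to ship"),
    ("b.example/widgets", "blue widgets are durable and cheap to ship!"),
    ("c.example/gears", "a buying guide for industrial gears and bearings"),
]

print(diversify(ranked))  # ['a.example/widgets', 'c.example/gears']
```

Note what happens to the second page: it isn't penalized for being bad, it just loses out because something equivalent ranked above it. That's the "good site but still got whacked" pattern.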
If this were the case, many sites with good content that was not materially different from other pages on the same site or on peer sites would drop in the rankings -- this would account for the "my site is very good but I still got whacked by the new algo" comments in this thread. It would also account for some of your pages getting whacked while others on the same good site are left alone.
This is admittedly just a wild guess on what's happened, but I base it on a couple of things.
One, I spent some time getting into the heads of a few senior Google engineers, reading their posts on a bunch of sites, looking at books they were recommending, etc. There's a lot to suggest that they've been spending a ton of time on similarity algorithms and content clustering, as well as inference and learning algorithms.
Two, I see that my "site:www.domain.com" page count has shrunk drastically post-algo-change (by almost 75%), but if I do "site:www.domain.com widget" I can see many pages in the results that I believe are no longer in the site:www.domain.com page count. In fact, the count for "site:www.domain.com widget" is now greater than the "site:www.domain.com" count.
In this case, the widget pages are similar in many respects (still each one uniquely valuable to the world, mind you!), so it almost seems like they've been relegated into their own index or some other second-class index that spans other sites. If you've got pages that are very similar, regardless of whether you think they are valuable, this algo change would knock you down quite a bit -- because it uses a different definition of similarity from the ones we've seen in the past. It's looking for signs of similarity using sophisticated statistical algorithms, rather than direct verbatim plagiarism.
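For anyone wondering what "similar but not verbatim" could look like in practice, here's a minimal sketch using word shingles and Jaccard similarity, a textbook near-duplicate measure. I'm assuming nothing about what Google actually runs; the function names, the shingle size of 3 and the sample pages are all my own inventions for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    # Texts shorter than k words yield a single shingle of the whole text.
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two texts: shared shingles / all shingles."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


# Hypothetical pages: A and B are templated widget pages, C is unrelated.
page_a = "our red widget ships in two days and carries a one year warranty"
page_b = "our blue widget ships in two days and carries a one year warranty"
page_c = "how to repair a leaking kitchen tap without calling a plumber"

print(round(jaccard(page_a, page_b), 2))  # 0.69
print(jaccard(page_a, page_c))            # 0.0
```

Neither widget page is a copy of the other, yet they score high because only one word differs. A templated product catalogue would light up the same way.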
I'm pretty sure the new algo has a heavy weighting for identifying similar content, which the algo takes as the antithesis of original, unique content. That gets the scrapers but it also can get quite a few well-intentioned innocents.
The expansion of the index count is something I'm less sure of. If this were the case, the results would now be generated from some hierarchy of indices, and if, for a given search, you didn't make it into one of the top indices relevant to that search, you wouldn't show up in the top results. You might have great content on a certain topic but if some other site has it covered better than you -- or has equivalent content -- you could be low in the results.
I think the takeaway, if any of this is true, is to vary up your content fingerprints so each piece of content is as unique as possible relative to other content on your site and relative to competitor sites. If it looks formulaic or if it's not saying something new, this algo isn't going to like it.
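If you want to check how alike your own pages look, one option is to compute a compact fingerprint per page and compare the fingerprints. Below is a small sketch of a SimHash-style fingerprint, a well-known technique for near-duplicate detection. To be clear, I'm swapping it in as an example; I have no idea whether this algo uses anything like it. The `simhash`, `hamming` and `close_pairs` names, the 64-bit size, the `max_distance` cutoff and the sample pages are all hypothetical.

```python
import hashlib
from itertools import combinations


def simhash(text, bits=64):
    """SimHash-style fingerprint: each word votes on every bit position.

    This simple version treats the text as a bag of words, so word order
    does not affect the result.
    """
    counts = [0] * bits
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def close_pairs(pages, max_distance=10):
    """Return (url1, url2, distance) for page pairs with close fingerprints."""
    prints = {url: simhash(text) for url, text in pages.items()}
    pairs = []
    for u1, u2 in combinations(sorted(prints), 2):
        d = hamming(prints[u1], prints[u2])
        if d <= max_distance:
            pairs.append((u1, u2, d))
    return pairs


# Hypothetical pages from one site.
pages = {
    "/widgets/red": "our red widget ships in two days with a one year warranty",
    "/widgets/blue": "our blue widget ships in two days with a one year warranty",
    "/guides/taps": "how to repair a leaking kitchen tap without calling a plumber",
}

for u1, u2, d in close_pairs(pages):
    print(u1, u2, d)
```

A small distance means the two pages share most of their words. The cutoff is a judgment call you'd have to tune against pages you already know are too alike.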
Admittedly, all conjecture above...but what else have we got to go on than hunches? I definitely don't think this is as simple as devaluing internal links or any such thing...the Google engineers are trying to reinvent the game so that you really have to have differentiated and useful content to do well. So, er, bravo, I guess.