Whoa - 6:34 pm on Feb 28, 2011 (gmt 0)
Long ago, I did some coding that ran an index of dissimilarity calculation on census data to determine which cities were the most segregated in the country. It boiled a city down to a single number, where 0 was perfectly integrated and 100 was perfectly segregated. I thought it was so cool that you could boil a complex issue like racial segregation down to a single number.
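For anyone curious, the index of dissimilarity is just a sum over census tracts. Here's a rough Python sketch of the standard formula (the tract counts in the example are made up):

```python
def dissimilarity_index(tracts):
    """Index of dissimilarity, scaled 0-100.
    0 = perfectly integrated, 100 = perfectly segregated.
    `tracts` is a list of (group_a_count, group_b_count) pairs,
    one pair per census tract."""
    total_a = sum(a for a, _ in tracts)
    total_b = sum(b for _, b in tracts)
    # D = 1/2 * sum over tracts of |a_i/A - b_i/B|
    d = 0.5 * sum(abs(a / total_a - b / total_b) for a, b in tracts)
    return 100 * d

# Made-up example: two tracts, each group clustered in one of them
print(dissimilarity_index([(900, 100), (100, 900)]))  # -> 80.0
```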
Google's doing something similar in this new algo, I believe. It simhashes pages on your site to determine whether they are similar to each other. If they are very similar, those pages don't all need to be in the main index. So let's drop them into some secondary index, and only use them if we really, really need to.
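For the curious, here's a bare-bones sketch of the general simhash technique (Charikar's method, not Google's actual implementation, and the tokenizing here is deliberately crude):

```python
import hashlib

def simhash(text, bits=64):
    # Charikar-style simhash: near-duplicate texts end up with
    # fingerprints that differ in only a few bit positions.
    v = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # Final fingerprint: the sign of the vote in each bit position.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(x, y):
    # Number of bit positions where two fingerprints disagree.
    return bin(x ^ y).count("1")

a = simhash("cheap widgets best prices free shipping on all orders")
b = simhash("cheap widgets best prices fast shipping on all orders")
print(hamming_distance(a, b))  # small distance = near-duplicates
```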
More importantly, the new algo does a simhash on your entire site to see if it is similar to the scraper sites that Google (and we all) hate.
If your simhashed number is similar to a scraper site, well, then, by golly, you are probably a scraper site and should be punished. (Actually, it's probably not a single simhash, it's probably many different ones concatenated together.)
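If that theory holds, the check itself would be dirt cheap. Something like the sketch below, where the fingerprints and the threshold are pure invention on my part:

```python
# Hypothetical: compare a site-level simhash against a list of
# known scraper fingerprints. All values here are invented.
SCRAPER_FINGERPRINTS = [0x3F8A2C, 0x1B07E5]
THRESHOLD = 3  # max differing bits to count as "similar"; made up

def hamming_distance(x, y):
    return bin(x ^ y).count("1")

def looks_like_scraper(site_hash):
    return any(hamming_distance(site_hash, bad) <= THRESHOLD
               for bad in SCRAPER_FINGERPRINTS)

print(looks_like_scraper(0x3F8A2E))  # one bit off 0x3F8A2C -> True
```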
So, if you got whacked, you are similar to a scraper site in some respects, even if you are a good site. That's my theory. Think about the attributes of a scraper site, then be the opposite of those bad boys. What are the signs of quality that a scraper site, being robo-created with crap content, could never fake? Figure out those tough signs of quality and do more of that. My two cents, anyway.