Shaddows - 5:08 pm on Jun 14, 2011 (gmt 0)
It's not sites, it's SERPs. As in, search engine result PAGES. For the avoidance of confusion, here's some methodology:
Every SERP is a classified dataset. Each SERP is shown to classified traffic sets, user satisfaction is gauged*, and the SERP is rated. The most satisfying SERP for each 'demographic' is noted.
Now here's the important bit.
The best sites might not make up the best SERP. Like any team sport, it's the blend that matters, not how good each individual is on its own. Only user testing can make this determination, not predictive algos. A manager can select a squad, but sometimes the expected star flops in one team, only to shine in another.
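If it helps, here's a rough sketch of that idea in Python - the variant names, demographics and random 'satisfaction' numbers are entirely made up, it's just to show the shape of "rate whole blends per bucket, keep the winner":

from collections import defaultdict
import random

# The "teams" being tested: same query, different blends of results.
SERP_VARIANTS = {
    "A": ["brand-site", "forum-thread", "news-article"],
    "B": ["forum-thread", "video", "brand-site"],
}

def observed_satisfaction(demographic, variant):
    """Stand-in for the real signal (click-backs, refinements, dwell)."""
    return random.random()  # in reality: aggregated user behaviour

scores = defaultdict(list)
for demo in ("uk-mobile", "us-desktop"):        # classified traffic sets
    for variant in SERP_VARIANTS:
        for _ in range(1000):                   # users sampled into each bucket
            scores[demo, variant].append(observed_satisfaction(demo, variant))

for demo in ("uk-mobile", "us-desktop"):
    winner = max(SERP_VARIANTS, key=lambda v: sum(scores[demo, v]) / len(scores[demo, v]))
    print(f"{demo}: serve variant {winner}")

Note that the thing being scored is the variant as a whole, not any individual site within it - which is the whole point.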
Historically, split testing was done on small sets, usually with well-understood habits (like the US military - as per my convo with Whitenight AGES ago). Now that personalisation is so ingrained and user metrics are so advanced, it makes no sense to test these things in small bubbles. Hence the high visibility recently.
*Click-backs plus ad-based tracking cookies work wonders - and both have further refinements. Did you click back and click another result? Click back and refine your phrase? Click back and try a related phrase? Click back and try a totally new search?
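To make the footnote concrete, a toy scoring function for those four cases - the weights and the word-overlap threshold are mine, purely for illustration of how each behaviour could map to a different satisfaction level:

def score_clickback(next_action, original_query, next_query=None):
    """Invented mapping of post-click behaviour to a satisfaction score."""
    if next_action == "no_return":           # never came back: likely satisfied
        return 1.0
    if next_action == "clicked_another":     # came back, tried another result
        return 0.5
    if next_action == "searched_again" and next_query:
        original = set(original_query.split())
        follow = set(next_query.split())
        overlap = len(original & follow) / len(original | follow)
        if overlap > 0.5:                    # refined the same phrase
            return 0.3
        if overlap > 0:                      # related phrase
            return 0.2
        return 0.0                           # totally new search: SERP failed
    return 0.5

print(score_clickback("searched_again", "best running shoes", "best trail running shoes"))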