Thanks for your feedback.
You mentioned millions of pages: I guess that site size/authority reduces negative impacts.
Additional feedback would be highly appreciated since A/B tests are as essential for revenue as getting traffic.
I did my homework and searched for A/B tests on Webmasterworld:
I have heard of people running into ranking problems when using some kind of home brew split testing. But as you say, not with Google's very own solution. I also assumed that they set up the Website Optimizer technology so it would not cause problems. That could be a very wrong assumption - we know that one area of Google can cause problems for other areas - for example, the first AJAX SERPs broke Google Analytics.
Keep test pages out of the Google index.
If the ranking problems did come from Website Optimizer, the clue is probably in the phrase "radically different".
IMO, the best option is to split test away from organic search of any kind. A reliable option is to use pay per click traffic, and robots exclude any pages you use for testing.
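For anyone who wants to double-check the robots exclusion before sending traffic to test pages, here is a rough sketch using Python's standard `urllib.robotparser`. The `/split-test/` directory, the example.com URLs, and the `is_crawlable` helper are made-up placeholders, not anything from a real setup. Note that robots.txt only stops crawling; it says nothing about the rest of your site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers out of a test directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /split-test/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Placeholder URLs: one test page, one regular page.
    for url in (
        "https://example.com/split-test/variant-b.html",
        "https://example.com/products/widget.html",
    ):
        print(url, "crawlable:", is_crawlable(ROBOTS_TXT, url))
```

Running the same check against your real robots.txt text and a list of test URLs is a cheap sanity test before the experiment starts.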
My current strategy: We can't use Google Website Optimizer but we will feed it with some dummy JS just to let Google know about the test.
We will use a different (additional) GA code for each variation.
We will start with one variation(?) sitewide and watch the SERPs for some weeks.
We want to test sitewide to get authentic results.
User behavior is very different depending on the season. So we want to test all variations at once.
We can't afford paid traffic. So we can't get traffic to noindexed pages without destroying the information architecture ("A/B testing" on all internal links? No way).
We want to assign variations by visitor IP address, so the same IP always sees the same variation. This should reduce complications with Google.
We want to use our whole traffic for this so we can end the test sooner.
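To make the IP-based split concrete, here is a minimal sketch of one way it could work: hash the visitor IP and map it to a variation, so the same IP always gets the same version and all traffic is spread over all variations at once. The variation names, the salt, and the `assign_variation` function are my own placeholders, and hashing is just one possible approach, not a description of how any particular tool does it.

```python
import hashlib

# Hypothetical variation names; replace with whatever the test actually uses.
VARIATIONS = ["control", "variant_a", "variant_b"]


def assign_variation(ip: str, variations=VARIATIONS, salt: str = "ab-test-1") -> str:
    """Map an IP address to a variation, always the same one for the same IP."""
    # The salt is an assumed per-test label so a new test reshuffles visitors.
    digest = hashlib.sha256(f"{salt}:{ip}".encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and take it modulo the number of variations.
    bucket = int.from_bytes(digest[:8], "big") % len(variations)
    return variations[bucket]


if __name__ == "__main__":
    # Documentation-range example IPs, not real visitors.
    for ip in ("203.0.113.7", "198.51.100.23", "192.0.2.41"):
        print(ip, "->", assign_variation(ip))
```

Because the assignment depends only on the IP and the salt, no cookie or server-side session is needed, and a crawler coming from a fixed IP would keep seeing one consistent version of each page.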
In my case these 100k pages make up most of the site and 50% of the traffic, and we want to triple the amount of content per page.