I think I do see a logical problem there, Oliver, but not an infinite loop. The following is what looks like a contradiction to me. (Note that 'bad' here means 'lacking in predictive power'.)
In the first passage it sounds like no new 'good' phrases can ever be added. Then the second passage seems to contradict that. But this must come down to the poorly written "plain English" of the patent language. If the 'good' and 'possible' phrase lists really stayed empty, someone would notice.
But this is all in the preliminary stage of identifying 'good' and 'bad' phrases, so I just let it pass and assumed poor editing and/or proofreading. I'm very willing to grant that a solid list of related phrases gets built. What interests me more is how that list of 'good' phrases, and the documents where they occur, is then put to use.
This patented process for spam detection is looking for an excessive number of related phrases on a page (scraping a top-30 list to create a patchwork page could create that condition). It's also looking for excessive occurrences of any one of the 'good' phrases -- stuffing, in other words.
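Just to make the two signals concrete, here's a minimal sketch in Python of how I read it. The function name, the phrase list, and both thresholds are my own illustrative guesses, not anything from the patent text itself:

# Hypothetical sketch of the two spam signals described above.
# Thresholds and names are invented for illustration only.
def spam_signals(doc_text, good_phrases,
                 max_distinct_related=50, max_single_phrase=25):
    """Flag a document that trips either of the two signals."""
    text = doc_text.lower()
    # How often each 'good' (related) phrase appears in the document
    counts = {p: text.count(p.lower()) for p in good_phrases}

    # Signal 1: too many *different* related phrases (patchwork page)
    distinct_related = sum(1 for c in counts.values() if c > 0)

    # Signal 2: any single 'good' phrase repeated too often (stuffing)
    heaviest_phrase = max(counts.values(), default=0)

    return (distinct_related > max_distinct_related
            or heaviest_phrase > max_single_phrase)

An honest page might match a handful of related phrases a few times each; a scraped patchwork page or a stuffed page would blow past one of those two limits.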
The thing is that phrase-based processing can also be used simply to rank honest documents for relevance to the search phrase. The way I understand it, spam documents identified by this process should be way over the top -- not just a little bit more intense than an honest document.
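One way a cutoff like that could work (again, my own sketch, not the patent's method) is to measure the related-phrase counts across presumed-honest documents and set the spam threshold several standard deviations above the norm, so only true outliers get flagged:

# Sketch: choose a cutoff so only 'way over the top' pages are flagged.
# The 5-sigma margin here is an illustrative assumption.
import statistics

def spam_cutoff(honest_counts, sigmas=5):
    mean = statistics.mean(honest_counts)
    stdev = statistics.pstdev(honest_counts)
    return mean + sigmas * stdev

So if honest pages average around 8 related phrases with a small spread, a patchwork page carrying 60+ lands far beyond the cutoff, while a merely keyword-heavy honest page does not.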
[edited by: tedster at 3:15 am (utc) on Feb. 18, 2007]