Where would one get a list of bad phrases from, as the possibilities are practically infinite? So I guess they grabbed their favorite source, WP, and maybe some other dictionaries. Whatever isn't in them might be a bad phrase.
Besides some specific words like SEO, WAREZ, affiliate and so on, I wouldn't know how else you would get a list of words and phrases that are unrelated.
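To make the guess concrete, here is a rough sketch of how such a filter could work. It is pure speculation on my part, not anything Google has confirmed: build a vocabulary of words and two-word phrases from a trusted reference text, then flag whatever a page contains that the reference does not. The names `build_reference`, `suspicious` and `BLOCKLIST` are made up for illustration.

```python
import re

# Hypothetical explicit blocklist, taken from the examples in this post.
BLOCKLIST = {"seo", "warez", "affiliate"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bigrams(tokens):
    """Return the set of adjacent word pairs (two-word phrases)."""
    return set(zip(tokens, tokens[1:]))


def build_reference(texts):
    """Collect every word and two-word phrase seen in the trusted texts."""
    words, phrases = set(), set()
    for text in texts:
        tokens = tokenize(text)
        words.update(tokens)
        phrases.update(bigrams(tokens))
    return words, phrases


def suspicious(text, ref_words, ref_phrases):
    """Flag words and phrases that are blocklisted or absent from the reference."""
    tokens = tokenize(text)
    flagged_words = sorted(
        {t for t in tokens if t in BLOCKLIST or t not in ref_words}
    )
    flagged_phrases = sorted(bigrams(tokens) - ref_phrases)
    return flagged_words, flagged_phrases


# Assumed toy data: one "reference" sentence and one page to check.
ref_words, ref_phrases = build_reference(["The cell is the basic unit of life."])
flagged_words, flagged_phrases = suspicious(
    "The cell is the basic unit of warez.", ref_words, ref_phrases
)
```

With a reference that small almost anything gets flagged, which is the point: the more a page says beyond what the reference covers, the more it sticks out.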
I think these filters are definitely at work. How else could a Wikipedia snippet rank on page 1 (having passed the external duplicate content filter because it's diluted with boilerplate content), while a completely valid school teacher's script with 20 times more info is 950ed? The more detail, the better the text, and the more background info it has that the simplistic WP doesn't, the more likely the author is to hit a bad phrase. An extensive professional script is more likely to trip a filter than all the encyclopedia entries that I now see in the German SERPs.
Given the tabloid formula of life, any extensive information would also increase your bounce rate, as users want short answers yesterday, and Google does what users want rather than looking for quality.
That doesn't mean a long article is 100% certain to sink, but your chances definitely increase.
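A back-of-the-envelope way to see why length matters, under the simplifying assumption (mine, not a known fact about the filter) that every phrase in an article has the same small, independent chance `p` of being on the bad list. The function name `hit_probability` and the numbers are made up for illustration.

```python
def hit_probability(n_phrases, p):
    """Chance that at least one of n_phrases independent phrases is 'bad'.

    Assumes each phrase is bad with the same probability p, independently
    of the others. Real text is not independent, so treat this as a rough guide.
    """
    return 1.0 - (1.0 - p) ** n_phrases


# Assumed example values: a short snippet versus a long, detailed script.
short_snippet = hit_probability(500, 0.001)
long_script = hit_probability(10_000, 0.001)
```

The long script comes out with a much higher chance of containing at least one flagged phrase than the short snippet, but neither value reaches 1, so a long article is never guaranteed to sink.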