For what I was doing, I was seeding my searches off whatever web page I was analyzing at the time. For example, I'd spider a web page and break it down into its words, keeping them in the original order.
I'd then group the words into sets as large as I wanted, again keeping them in order. Those sets of words then became my lexicon for the moment. Roughly, the grouping step looked something like the sketch below.
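Here's a simplified Python sketch of that part. The function names, the crude HTML stripping, and the max phrase length of 4 are just for illustration, not my exact code:

    import re
    from urllib.request import urlopen

    def page_words(url):
        # Fetch a page and reduce it to its words, in original order.
        # (Real spidering strips HTML properly; this tag strip is crude.)
        html = urlopen(url).read().decode("utf-8", errors="ignore")
        text = re.sub(r"<[^>]+>", " ", html)
        return re.findall(r"[A-Za-z']+", text.lower())

    def ordered_ngrams(words, max_n):
        # Group the words into every in-order run of 1..max_n words.
        grams = []
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                grams.append(tuple(words[i:i + n]))
        return grams

    words = page_words("https://example.com")
    lexicon = ordered_ngrams(words, max_n=4)  # phrases up to 4 words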
Going in order naturally expanded my phrases, and since most pages are written by a human, it worked very well.
On pages not written by a human, or on poorly written pages, the co-occurrence falls off drastically as the phrases get longer, so I'd score them much lower than the others. I'm guessing those are the pages Google considers spam? I guess I did as well, which is why they'd get thrown out of my algo.
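The falloff scoring was along these lines (again a simplified sketch, not my exact formula; here I just measure, for each phrase length, what fraction of the page's phrases repeat):

    from collections import Counter

    def cooccurrence_score(words, max_n=4):
        # For each phrase length n, take the fraction of n-grams that
        # occur more than once on the page. Human-written pages keep
        # reusing longer phrases; generated or poorly written pages
        # drop toward zero quickly as n grows.
        score = 0.0
        for n in range(2, max_n + 1):
            grams = Counter(tuple(words[i:i + n])
                            for i in range(len(words) - n + 1))
            if not grams:
                break
            score += sum(1 for c in grams.values() if c > 1) / len(grams)
        return score / (max_n - 1)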