jmccormac - 5:34 pm on Jan 23, 2011 (gmt 0)
@wheel Google's bright ideas always worry me because they are often half-baked imitations of something done elsewhere. Content farms might be wide rather than deep, with URLs rewritten so that the apparent depth comes from / used as a delimiter rather than from the traditional click depth away from the front page.
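To make the distinction concrete, here is a rough sketch of how the two kinds of depth could be compared. It is only my assumption of how one might measure this, not anything Google is known to do. The `links` dict, the function names and the example.com URLs are all made up for illustration.

```python
from collections import deque
from urllib.parse import urlparse

def path_depth(url):
    """Depth implied by the URL itself: count of non-empty '/' separated segments."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])

def click_depths(links, home):
    """Breadth-first walk giving the minimum number of clicks from home to each page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical site: every article is linked straight from the front page,
# but the rewritten URLs look several directories deep.
links = {
    "http://example.com/": [
        "http://example.com/how-to/fix/a/leaky/tap",
        "http://example.com/how-to/boil/an/egg",
    ],
}

depths = click_depths(links, "http://example.com/")
for url, clicks in depths.items():
    print(url, "path depth:", path_depth(url), "click depth:", clicks)
```

A site where path depth is consistently much larger than click depth would be wide rather than deep in the sense described above, though that alone proves nothing about content quality.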
There are probably few sites of truly global interest. Once you get down to the national or local level, the audience and the link patterns become clearer. If what is supposed to be a targeted (local/national/niche) site starts getting links from sites that are very far outside its link environment, then it might be a hint that the site is a content farm. However, that kind of thing is very difficult to automate.
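The easy part of that check can be sketched as below. It uses the domain suffix as a very crude stand-in for the "link environment", which is purely my assumption; deciding what really belongs to a site's environment is the hard part. The function name, the sample URLs and the `.ie` suffix are hypothetical.

```python
from urllib.parse import urlparse

def outside_share(inbound_urls, expected_suffixes):
    """Fraction of distinct linking hosts that fall outside the expected suffixes."""
    hosts = {urlparse(u).hostname for u in inbound_urls}
    hosts.discard(None)  # drop anything that did not parse to a hostname
    if not hosts:
        return 0.0
    outside = [h for h in hosts if not h.endswith(tuple(expected_suffixes))]
    return len(outside) / len(hosts)

# Hypothetical inbound links to a site that claims to be aimed at Ireland.
inbound = [
    "http://news.example.ie/story",
    "http://shop.example.ie/links",
    "http://blog.example.ru/post",
    "http://forum.example.cn/thread",
]

share = outside_share(inbound, [".ie"])
print("share of linking hosts outside the expected environment:", share)
```

A high share is only a hint to look closer. Plenty of legitimate local sites pick up foreign links, so any cut-off value would have to be tuned by hand.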
I built a development-level search engine for Irish websites last October and most sites were quite shallow in terms of page count/depth. There were some large sites but most sites tend to be brochureware. I think Verisign found something similar in one of its larger surveys (the Verisign Domain Brief often carries a simple graph of the results). There are also certain HTML and CSS signatures that could be used to determine whether a site is a content farm.