Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

This 41 message thread spans 2 pages.
Content Farm Detection - What Might Google Do?

 5:51 pm on Jan 22, 2011 (gmt 0)

Can we speculate on what attributes Google might look for when seeking out content farms?

1) large volume of pages, maybe 10K+?
2) regular addition of pages?
3) directory-like nav structure? i.e. to distinguish this from busy blogs that I would think would have a more linear structure.
4) in-content links on most pages.
5) some measure of inbound links? (this one I'm concerned about personally) Though I don't have any idea what attribute of inbound links might point to a content farm.
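
As a rough illustration only, the checklist above could be turned into a simple count of matched signals. This is a minimal sketch: the `SiteStats` fields, the function name, and every threshold except the 10K page count floated in item 1 are hypothetical, and nothing here reflects what Google actually measures. Item 5 is left out because no concrete inbound-link attribute is proposed.

```python
from dataclasses import dataclass


@dataclass
class SiteStats:
    """Hypothetical per-site measurements; all field names are assumptions."""
    page_count: int
    pages_added_per_week: float
    nav_depth: int               # levels of category/directory navigation
    incontent_link_share: float  # fraction of pages carrying in-content links


def matched_signals(site):
    """Return the names of the checklist items this site matches."""
    signals = []
    if site.page_count >= 10_000:          # item 1: "maybe 10K+?"
        signals.append("large volume of pages")
    if site.pages_added_per_week >= 50:    # item 2: assumed threshold
        signals.append("regular addition of pages")
    if site.nav_depth >= 3:                # item 3: assumed threshold
        signals.append("directory-like nav structure")
    if site.incontent_link_share >= 0.5:   # item 4: "most pages"
        signals.append("in-content links on most pages")
    return signals


big_reference_site = SiteStats(3_000_000, 5_000, 4, 0.95)
small_blog = SiteStats(400, 3, 1, 0.2)
print(matched_signals(big_reference_site))
print(matched_signals(small_blog))
```

A very large, well-regarded reference site trips every check here, which shows how easily a checklist like this produces false positives.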



 7:27 pm on Jan 22, 2011 (gmt 0)

These are good questions to muse over, IMHO.

But using those criteria, wouldn't a site like wikipedia qualify as a content farm?


 7:27 pm on Jan 22, 2011 (gmt 0)

A lack of natural links. (Seems to be Google's new phrase.) The rate of new link acquisition and the geographical and/or niche type of link might be indicators.



 11:28 pm on Jan 22, 2011 (gmt 0)

bad news jmccormac. I agree, but I think that really means if you have a lot of content, then one thing we are going to have to do is build links. Don't build links, you run the risk of getting penalized. I've already noted that in my niche, links don't come naturally, at all.

Planet13, I wouldn't personally argue too hard if someone felt that wikipedia was a content farm.

I just thought of another one possibly:
- broad topic base, i.e. not in a tight niche. Perhaps one thing to consider, if you've got a content farm, is to have tightly themed silos, or maybe different sites. (I dunno, maybe there's sites out there that are tightly themed).

norton j radstock

 7:14 am on Jan 23, 2011 (gmt 0)

Whilst outwardly looking normal, the obvious factor (to human readers) is bland low value text that really adds nothing new - its sole purpose is to catch search engines, and as such it is likely to have an excessively high number of related keywords.

I would have thought that a strong indicator is a low proportion of outbound links other than adverts.


 7:21 am on Jan 23, 2011 (gmt 0)

I would have thought that a strong indicator is a low proportion of outbound links other than adverts.

Eh, that's tough though, because if they drop the nofollow links from the link graph as they say they do, then the low quality outbound linking probably looks much like a high quality gov or edu site...

As I said in the other thread, IMO The thing they will have to do to really 'get it right' in this area is find a reliable 'correctness signal' rather than 'quality signals' ... If they can do this they'll nail it, but it's not as easy to determine the 'correctness' of a page as it is to associate 'quality signals' with a page.

It's a real challenge they have if they are going to get it right on this one, because links, publishing frequency, grammatical correctness, etc. don't tell the story of whether a page is a 'correct result' or 'quality result' and IMO there's a BIG difference.


 11:06 am on Jan 23, 2011 (gmt 0)

I think some of what Matt Cutts is saying is simply trying to scare people. I think they will start to remove some really really low quality sites especially autoblogs and content scrapers, since they should be relatively easy to detect. However as the OP shows, detecting content farms and other low-quality sites is quite a subjective matter.

I say this because loads of the bloggers seem to be complaining about sites with small amounts of content, in some cases just a few pages of content in total. So Google could mean that they are going after large sites with thousands of pages of low-quality content, however they could just as easily mean the really small sites with only a few pages of low-quality content.

As someone who puts out quality content, I'm not exactly worried about this blog post; however, it will be interesting to see what Google does consider content farms to be. The cynic in me unfortunately can't help but think that they will go after the small guys but leave the really large spammy sites such as biz rate and wise geek alone.


 11:19 am on Jan 23, 2011 (gmt 0)

Another thought on this that's more directly on the topic of the thread...

They might work it backward (I think I might anyway) ... So:
What factors could be used to determine what is NOT a content farm?

1.) Relatively Fewer Pages
2.) Non-constant Additions
3.) Original Information
4.) Singular (or consistent) Writing Style(s)
5.) Updates to Content Rather than Constant Addition of Content


 5:34 pm on Jan 23, 2011 (gmt 0)

@wheel Google's bright ideas always worry me because they are often half-baked imitations of something done elsewhere. Content farms might be wider rather than deeper with URLs rewritten so that they use / as a delimiter rather than the traditional click depth away from the front page.

There are possibly few really global interest sites. Once you get down to the national or local level, the audience and the link patterns would become clearer. If what is supposed to be a targeted (local/national/niche) site starts getting links from sites that would be very far outside their link environment, then it might be a hint that the site is a content farm. However that kind of thing is very difficult to automate.

I built a development level search engine for Irish websites in October last and most sites were quite shallow in terms of page count/depth. There were some large sites but most sites tend to be brochureware. I think Verisign found something similar in one of its larger surveys (the Verisign Domain Brief often carries a simple graph of the results). There are also certain HTML and CSS signatures that could be used to determine whether a site is a content farm.



 12:16 am on Jan 24, 2011 (gmt 0)

What factors could be used to determine what is NOT a content farm?
3.) Original Information

So many of the problems that legit webmasters have with Google has to do with the difficulty they are having in determining who originated content. If they could do that with a fairly high degree of certainty, they could more easily identify scrapers; if they could identify scrapers, they could more easily identify content farms.

Making life more difficult for scrapers ~ now there's a worthy focus.



 12:18 am on Jan 24, 2011 (gmt 0)

I don't think a bot can tell a content farm from a "normal" website.

The only common factor among content farms that a bot could find is:
large differences among the topics covered

So a website that covers too many topics would be considered a content farm, but this problem could be easily avoided by buying more domains and placing one topic per domain.

So I think this is one of the cases where Google will use humans to filter it :D
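
One way to put a number on "large differences among the topics covered" is the Shannon entropy of a site's per-page topic labels. This is only a sketch: it assumes each page has already been assigned a single topic label by some classifier (not shown), and the function name is hypothetical.

```python
import math
from collections import Counter


def topic_entropy(page_topics):
    """Shannon entropy (in bits) of the topic labels across a site's pages.

    0.0 means every page shares one topic; higher values mean the pages
    are spread more evenly over more topics.
    """
    counts = Counter(page_topics)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


niche_site = ["gardening"] * 8
broad_site = ["gardening", "taxes", "dating", "cars"] * 2
print(topic_entropy(niche_site))
print(topic_entropy(broad_site))
```

As the post points out, a measure like this is computed per domain, so splitting the same content across one-topic domains would bring each domain's score back down to zero.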


 12:38 am on Jan 24, 2011 (gmt 0)

Yes, it's a hard job to use algorithms to assess quality - but I don't rule it out. For example, there are linguistic algorithms today that do a pretty good job at determining the gender of the author!

Now that Matt Cutts has given us the phrase "document classifier" in this context, I think that's a clue at what direction they'll be testing. So a forum thread would be held to a different measure than a blog post or a product description.


 12:47 am on Jan 24, 2011 (gmt 0)

So many of the problems that legit webmasters have with Google has to do with the difficulty they are having in determining who originated content.

I agree, and to throw another wrench into the 'solving it' works... They've had the problem for sooooo long, how do they take the ugliness of some of the worst scrapers / content farmers such as eHow and even About / Wikipedia out of the results and not disappoint visitors?

They've gotten their visitors so used to seeing the big content farmers how can they remove those sites and show the site that's actually producing the content or produced by an actual authority on the subject without disappointing their visitors?


 12:56 am on Jan 24, 2011 (gmt 0)

funny thing is eHow and its network is valued at about 2 billion... that's a pretty damn lot of money...

I would not want to be in the shoes of eHow's creator after reading the Google blog lol.

Plus I have to say: I heard writers on eHow got paid something like $0.0025 for each ad click on their articles... but how can they log every AdSense click for each page? Either Google is helping the site with a custom solution (and that would make no sense) or they take the data from Google Analytics... but analyzing a CSV file with more than 20 million records every month can't be much fun lol


 1:17 am on Jan 24, 2011 (gmt 0)

So many of the problems that legit webmasters have with Google has to do with the difficulty they are having in determining who originated content.

Yes - and solving the scraper site problem has got to be a big piece of this. The part of it that has baffled me for a long time is how a page can rank well for years and then suddenly be displaced from the SERP as a "duplicate" of a new page on a scraper site.


 2:13 am on Jan 24, 2011 (gmt 0)

how a page can rank well for years and then suddenly be displaced from the SERP as a "duplicate" of a new page on a scraper site.

I was wondering too. On one of my personal websites I wrote a blog post comparing prices of some general items between countries. It was an original post: we decided on a basket of products, actually went to the shops with a list and pencil, and researched prices. The article ranked #1 for two years until someone scraped it to Yahoo Answers. They have not changed the article; it is a word-for-word copy. And this scraped page replaced our page as #1.

I can understand Google's problem where a site publishes an article and it gets scraped almost immediately. Where the timeframe between the original and the scraped copy is short, it is difficult to tell who is the originator, as it may depend on how often each site is crawled. E.g. if a site that is crawled every few minutes scrapes someone's content, and the originating site is crawled twice per week, there is no way Google will be able to tell with certainty who is the originator.

But for an article that was there for 2 years and then scraped - this is obvious, isn't it?

The only way I can see this being solved is if Google provides some kind of "ping" service where you ping the service with the new URL before you interlink it to your site. In that way Google can get to it, but no-one else knows about it (yet), and you interlink the URL after x amount of time.

But as I am writing this, I can see this is too complicated and does not solve the issue of changed content on already interlinked (known) URLs.
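
The crawl-frequency argument above can be sketched in a few lines. This is an illustration under stated assumptions, not a description of how Google attributes content: the `Copy` record, its fields, and `likely_originator` are hypothetical, and each copy is assumed to have been published at some point within one crawl interval before it was first seen.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Copy:
    """One crawled copy of the same text (hypothetical record)."""
    url: str
    first_seen: datetime        # when the crawler first found this copy
    crawl_interval: timedelta   # how often this site gets crawled


def likely_originator(a, b):
    """Return the copy that must have been published first, or None if
    the crawl timing cannot settle it."""
    earlier, later = sorted((a, b), key=lambda c: c.first_seen)
    # The later-seen copy may have existed, unseen, for up to one crawl
    # interval. If that window reaches back to the earlier sighting, the
    # order of publication is ambiguous.
    if later.first_seen - later.crawl_interval <= earlier.first_seen:
        return None
    return earlier


# A 2-year-old article versus a fresh copy on a frequently crawled site.
original = Copy("https://example.com/price-comparison",
                datetime(2009, 1, 10), timedelta(days=3.5))
scraped = Copy("https://scraper.example/copy",
               datetime(2011, 1, 10), timedelta(minutes=5))
print(likely_originator(original, scraped).url)
```

In the two-year case the function returns the original; when a fast-crawled scraper is seen shortly before a slow-crawled source, it returns None, which is the situation described as impossible to call with certainty.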


 2:22 am on Jan 24, 2011 (gmt 0)

Technically, Google is a content farm: almost nothing original there except their corporate blogs, which are insignificant given the scale of Google.

Pot, kettle, black, all that stuff.


 3:01 am on Jan 24, 2011 (gmt 0)

But for an article that was there for 2 years and then scraped - this is obvious, isn't it?

Obviously only to us unedukated types... If we had a doctorate of any kind I'm sure we would see the faulty reasoning in ranking the original content source over the later found duplicate.

ADDED: I've gone down this road before, aakk9999 ... Aggressively

IMO It's one of the stupidest things they do... They want to find content farms and weed them out, and IMO it's much simpler if they just rank the originally found source and let the content 'owners' fight it out if there's a dispute over who truly published or authored the content originally... It's what I think they should (have to?) do if they are truly an unbiased information retrieval system.

Before anyone tells me how tough it is or why they can't, please read my arguments in the linked thread...

### # ###

Apologies for getting a bit OT there for a minute.


 11:47 am on Jan 24, 2011 (gmt 0)

I think we merged two discussions into one.

The author of this topic asked "what attributes Google [..] seeking out content farms".

Some people are analyzing scrapers... from what I have understood, content farms are one thing and scrapers are another.

Google needs two completely different measures to handle content farms and scrapers.

In my opinion scrapers are pretty easy to identify; content farms are not.


 2:04 pm on Jan 24, 2011 (gmt 0)

I think there's so much scope for false positives in the majority of the suggestions made so far, so I'm with the people who think this is all just hot air on Google's part.

The bottom line is, it's up to individual webmasters to do something to counter the dominance of content mills. Waiting for any search engine to fix it will be a wait too long.


 11:34 pm on Jan 24, 2011 (gmt 0)

The attributes scoring model posed by Wheel is bound to be inefficient because it examines only symptoms of creating spammy content: features of the content itself and its backlink structure. That's analogous to identifying a ship by examining its wake.

I imagine the most accurate and efficient test would come from evaluating the content itself: that is, its semantics, grammar, and originality. That has to be where Google may be headed in the short term.

So how to detect this? In the short term, perhaps by identifying word patterns that are symptomatic of this kind of writing. Maybe that's progress, but that still doesn't get to the meaning of the content, which is how we humans evaluate. Here's an example:

There is a kind of content that is grammatically correct and on topic, but lacks any original point of view. That is, it simply rambles on with a list of facts. You know it when you see it. The purpose could be (and often is) to manipulate rank, but it could just as well have been written with genuine intent and simply be immature. Either way, this kind of document is not adding anything unique to the human knowledge pool, and therefore would not be highly valued.

Meaning is subjective - purely human, derived at a moment in time from the conditions present in that moment. But here's the rub: I don't believe computers can determine human value with any kind of accuracy because value constantly changes, based on the workings of the market. And internet connections and Google's ranking are indeed a market.

Google has attempted to use links as a sort of measurable currency, but they are too easily scammed. But systems like eBay or various social media operate on reputation and seem to work pretty well; however they are all closed or restricted in some way. Can a reputation model be applied in a scalable, workable way to the internet, but be relatively free from manipulation?


 2:06 am on Jan 25, 2011 (gmt 0)

You think links are easily scammed? I think content is far more easily scammed.

I don't believe Google can stay ahead of blackhats by algo checking content for grammar. Auto generate your content, run it through a grammar checker. Dark hats were doing this almost a decade ago. I'm sure they're long past grammar checking and on to natural structure - the blackhats, not Google.

Check much beyond that and you're going to penalize everyone who's not a university lit prof.


 8:53 am on Feb 4, 2011 (gmt 0)

Aye, google is a content farm.


How many links per page tops?


 9:32 am on Feb 4, 2011 (gmt 0)

Maybe counting AdSense ad spots and their placement to detect MFA (made-for-AdSense) sites? Or using the publisher ID and tracking its e-mail address to see where else the publisher is active, etc.?


 4:25 am on Feb 4, 2011 (gmt 0)

System: The following 14 messages were spliced on to this thread from: http://www.webmasterworld.com/google/4262600.htm [webmasterworld.com] by tedster - 10:46 am on Feb 4, 2011 (EST -5)

One of the best things an SEO can do is stay ahead of the curve. That means getting an idea of what Google currently considers a problem they want to solve, and then seeing if there's any way that your site might be looked at that way. Whenever a big algo change rolls out, false positives do end up hurting some sites even as the bulk of the issue gets cleaned up.

Where is Google fixing their sights right now? Content Farms. If you've got some time, check out this new video of a panel discussion with Matt Cutts from Google, Harry Shum of Bing and Rich Skrenta of Blekko. Bing Faces Off Against Google Over Search Results [bigthink.com] - it's about 40 minutes long, but I found it all quite worthwhile.

The video brings up a big question. Google says they want to resolve the content farm issue algorithmically - rather than just hand-banning sites the way Blekko has announced they will do. That kind of talk makes me nervous. What would an algorithm be measuring?


 4:31 am on Feb 4, 2011 (gmt 0)

My first thought is that traffic data, user data, should offer some good clues - and we think Google is already measuring and using user data. So I'm thinking that any site with a huge bounce rate (like over 80%) across ALL their landing pages might want to take a look at addressing that.
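
A site owner could run the check suggested above against their own analytics export. The sketch below assumes a hypothetical mapping of landing-page URL to (entrances, bounces) counts; the function name is made up for illustration, and the 80% threshold is the one floated in the post. Whether Google uses bounce data this way is speculation.

```python
def all_landing_pages_bounce_high(pages, threshold=0.80):
    """True if every landing page with traffic has a bounce rate above
    the threshold.

    `pages` maps a landing-page URL to an (entrances, bounces) tuple.
    Pages with no entrances are ignored; a site with no traffic data
    returns False rather than being flagged.
    """
    rates = [bounces / entrances
             for entrances, bounces in pages.values()
             if entrances > 0]
    return bool(rates) and all(rate > threshold for rate in rates)


site = {
    "/how-to-boil-water": (1000, 910),
    "/how-to-open-a-door": (500, 450),
    "/about": (0, 0),
}
print(all_landing_pages_bounce_high(site))
```

The check only fires when the high bounce rate holds across ALL landing pages, so a single page that retains visitors is enough to clear the site.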

I'm also thinking that they're probably going to fold this challenge into the work that their human editorial army does. For context, you may want to read our thread Google Patent - human editorial opinion [webmasterworld.com].

I honestly don't know which approach worries me more - a manual ban like Blekko is doing, or trusting an algorithm to do the job with an even hand.


 5:30 am on Feb 4, 2011 (gmt 0)

I don't have any answers either, Ted, and it worries me too. The biggest hazard of manual banning that I can see would be judgements made by people who only have a superficial knowledge of the subject area they're assessing.

Two facts loom large in the present situation: (1) Google has traditionally put a lot of value on independent editorial links, and (2) strong search rankings are some of the best link bait there is.

Several years ago Mike Grehan described what happens in his article Filthy Linking Rich And Getting Richer! [e-marketing-news.co.uk...]

Content whose main virtue is being easy to find will end up getting linked to more often, liked more often, tweeted more often and so on, than better content which was written with less knowledge of how to suck up to the search engines.

A lot of dubious content has ended up with stronger "signals of quality" than it deserves, for no other reason than that the search engines granted it higher visibility. It's famous because it's famous, not because it's good.

An even bigger problem is that when there's so much plagiaristic sludge at the top, it's a strong disincentive for genuine subject matter experts to write much.

"Google's mission is to organize the world's information and make it universally accessible and useful." [google.com...]

That sounds noble ... but Google has been creating distortions in the world's information even as it tries to organize it.


 5:37 am on Feb 4, 2011 (gmt 0)

I haven't studied the backlink support for content farms very closely, but your comments started me thinking that there might be a big footprint there for Google to look at. There was a recent post on Hacker News by a Google engineer with the screen name "moultano [news.ycombinator.com]":

Some really dramatic changes to how we use links are on the way. (Sorry I can't say anything more specific. This is a really sensitive area.)
(thanks to tristanperry for spotting this.)

Makes me wonder if that is part of the Content Farm game plan.


 1:46 pm on Feb 4, 2011 (gmt 0)

What sort of backlinks would content farms have that others don't? I can't imagine anything.

I think they've somehow weakened authority links over the years by allowing large volumes of low quality links to get you ranked over lower quantities of high quality links. But I can't imagine that's what they're talking about (I'd love it if they did though).


 1:55 pm on Feb 4, 2011 (gmt 0)

I strongly suspect that internal links (a large number of them) could be used for this, especially when there are too many internal links targeting many keyword variations.
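
To make the internal-link idea concrete, one could count how many distinct anchor texts a site uses for internal links to the same page. This is a sketch under assumptions: the links are taken as already-extracted (href, anchor text) pairs, the function name and the sample host `example.com` are hypothetical, and no particular count is known to be a cut-off.

```python
from collections import defaultdict
from urllib.parse import urlparse


def anchor_variations(links, site_host):
    """Count distinct anchor texts per internal target path.

    `links` is an iterable of (href, anchor_text) pairs. A link counts
    as internal if it is relative or its host equals `site_host`.
    """
    variations = defaultdict(set)
    for href, text in links:
        parsed = urlparse(href)
        if parsed.netloc in ("", site_host):
            variations[parsed.path].add(text.strip().lower())
    return {path: len(texts) for path, texts in variations.items()}


links = [
    ("/widgets", "blue widgets"),
    ("/widgets", "cheap blue widgets"),
    ("http://example.com/widgets", "Blue Widgets "),
    ("http://other.example/page", "unrelated external link"),
]
print(anchor_variations(links, "example.com"))
```

A page that collects dozens of slightly different keyword phrasings from its own site would stand out in this count, while ordinary navigation tends to reuse one or two labels per target.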

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved