Forum Moderators: Robert Charlton & goodroi

Duplicate content in forum and articles - will I get penalized?

zshadow

9:47 am on Jul 6, 2008 (gmt 0)

10+ Year Member



Alright, I run a news site that integrates with a forum, and we mirror the articles from the main page over on the forum.

Will the site get penalized for this? It's been running for around 2 years now and ranks high for several keywords, but I know Google can be unpredictable at times, so I want to avoid any possible penalties.

Robert Charlton

7:49 pm on Jul 6, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



zshadow - Welcome to WebmasterWorld. You won't get "penalized," as Google doesn't actually penalize for dupe content. It does filter dupe content, by showing the page version that has the highest PageRank, and not displaying the others.

As such, you're in a situation where you may be splitting your link vote or dissipating PageRank to the copies of the content that aren't being displayed. On the other hand, it might be argued, having this content available to users via navigation creates more opportunity for traffic and inbound links.

I myself would tend to display titles and summaries only on one page or the other... and have only one full version of the article on the site.

Take a look at the Hot Topics [webmasterworld.com] section, pinned to the top of the Google Search forum home page, and look at the Duplicate Content/same domain section. There's no one thread there that's going to discuss exactly what you're asking... but I'd take a look at the Wordpress and possibly the Vbulletin threads for starters, because they probably touch on some of the same issues. Here's a link to the Wordpress thread...

WordPress And Google: Avoiding Duplicate Content Issues
What about posts in few different categories?
[webmasterworld.com...]

tedster

8:08 pm on Jul 6, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There's been a casual over-use of the word "penalty" around duplicate content. The Google staff tells us that a true penalty for duplicate content is quite rare - and when it's given out, it's aimed mostly at scraper sites and other kinds of spammers.

What is most common, as Robert mentioned above, is that only one of the pages will rank, and the others get filtered out of the results - but without a "penalty" being given. This makes a lot of sense - why would Google want to show essentially identical copies of the same information in one set of search results? That would make for very unhappy end users.

Let's look at three kinds of duplicate content situations:

1. Different Domain Duplicates
In this kind of situation, the content is duplicated on several different domains. It's easy to have this happen on sites that publish rss feeds, or through press releases and syndicated articles. Google tries to filter out the copies here, and only display the original source of the content.

That is not easy to do, but whether a site is chosen to be displayed or filtered out of a given result, that is not a penalty. If a domain does nothing except re-publish others' content, it may well be filtered out every time. After a period of demonstrating "nothing original to say," that can feel like a penalty. Indeed, there might be some kind of black mark against this kind of domain, and it certainly would be deserved.

2. Same Domain - Intentional Duplicates
This is the situation you described. You are reproducing the information from your news section for discussion purposes in a forum. And currently you are seeing some good search rankings. Clearly, Google is valuing what you do.

It isn't clear whether the Google Search traffic is mostly coming to your news pages, or to your forums, or to a mix of the two. It might well be a mix, since there would be two slightly different emphases here. In any case, it sounds like you're doing something right, and I would not suggest a casual change made purely out of fear of the phrase "duplicate penalty". However, you might want to study the situation closely to see if the traffic is coming to the best possible version of the content - and if it isn't, some kind of change could be brainstormed.
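
If you want to check that empirically, one rough way is to tally Google search referrals by site section in your server logs. The sketch below is purely illustrative - the log file name, the /news and /forum path prefixes, and the combined log format are my assumptions, not anything from this thread - so adapt it to your own setup:

  import re
  from collections import Counter

  GOOGLE_REF = re.compile(r'google\.[a-z.]+/search')   # match Google search referrers
  counts = Counter()

  # "access.log" in Apache/Nginx combined format is an assumption
  with open("access.log") as log:
      for line in log:
          parts = line.split('"')
          if len(parts) < 4:
              continue
          request, referrer = parts[1], parts[3]       # 'GET /path HTTP/1.1', referrer url
          if not GOOGLE_REF.search(referrer):
              continue
          fields = request.split()
          path = fields[1] if len(fields) > 1 else ""
          if path.startswith("/forum"):                # hypothetical section prefixes
              counts["forum"] += 1
          elif path.startswith("/news"):
              counts["news"] += 1

  print(counts)   # e.g. Counter({'news': 812, 'forum': 397})

Whichever section dominates that tally is the version Google has chosen to display, and that's the place to start any brainstorming.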

3. Same Domain - Accidental Duplicates
This situation occurs when a domain's technology accidentally allows different urls to resolve to the same content page - there's a wide variety of ways this can happen. Here's a list of some of the common ones (a small illustrative sketch follows the list):

  • with-www and no-www versions
  • http and https versions
  • varying the case of the file path
  • changing the order of query string parameters
  • url rewrites that key off a number but allow any keyword spelling in another spot in the file path
  • common CMS set-ups that allow content to be found through different urls
  • for more situations and details, see Canonical URL Issues [webmasterworld.com]
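
To make that list concrete, here's a small Python sketch - purely illustrative, with hypothetical urls, and nothing to do with Google's actual processing - showing how several of those variations collapse to a single page once you normalize them:

  from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

  def canonicalize(url):
      """Collapse some common accidental-duplicate variations to one form."""
      scheme, netloc, path, query, _ = urlsplit(url)   # discard any fragment
      netloc = netloc.lower()
      if netloc.startswith("www."):                    # with-www vs no-www
          netloc = netloc[4:]
      scheme = "http"                                  # http vs https
      path = path.lower() or "/"                       # case of the file path
      query = urlencode(sorted(parse_qsl(query)))      # order of query parameters
      return urlunsplit((scheme, netloc, path, query, ""))

  variants = [
      "http://www.example.com/News/Story?id=7&ref=rss",
      "https://example.com/news/story?ref=rss&id=7",
  ]
  print({canonicalize(u) for u in variants})           # both collapse to one url

In practice you'd close these holes with 301 redirects and consistent internal linking rather than application code - the sketch just shows how many raw urls can stand for one page.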

Any site should plug as many of these technical holes as it can. For one thing, the multiple-url situation wastes the crawl budget that Google assigns to the domain. For another, too many of these can rapidly bog down a website in Google, to the point where almost nothing gets into the main index. And then there's the situation where other websites give links to different versions of the same content, splitting the power that a single version should be getting into several little piles of link juice.

If one of the above factors creates two variations, and another creates two, the domain now has four variations of every url (quadruple content). If a third factor kicks in with two more variants, we're up to eight - octuple content. You can see why Google might not be able to do very much for a domain like that.
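
Here's that multiplication spelled out, with hypothetical hosts and paths:

  from itertools import product

  schemes = ["http", "https"]                   # factor 1: two variants
  hosts   = ["www.example.com", "example.com"]  # factor 2: two variants
  paths   = ["/News/Story", "/news/story"]      # factor 3: two variants

  variants = [f"{s}://{h}{p}" for s, h, p in product(schemes, hosts, paths)]
  print(len(variants))                          # 2 * 2 * 2 = 8 urls for one page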

[edited by: tedster at 10:19 pm (utc) on Aug. 10, 2008]

Robert Charlton

8:37 pm on Jul 6, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



It does filter dupe content, by showing the page version that has the highest PageRank, and not displaying the others.

I should add to my comment above that attributing the dupe filter solely to PageRank is an oversimplification, as the duplication filter is applied in a query-specific fashion.

So, in the case of different organizations of the same content on a site, different inbound links, and different queries, one page or the other might rank.

It might be an interesting test... one I haven't run... to give identical content two different titles and see whether both pages might rank, but for different queries.

Similarly, additional content or different titles on a forum page might cause Google to view that content in a different enough perspective that the forum page and that article page might rank for different queries, or result in indented results for the same query.

tedster

8:47 pm on Jul 6, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For people who are following duplicate and near-duplicate detection methods closely, Google applied for yet another US patent in this area in 2008. The technology was developed by Monika Henzinger. Here's a link to the USPTO [appft1.uspto.gov].
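
For a rough feel for how near-duplicate detection can work - this is an over-simplified illustration, not Google's actual method or anything from the patent - consider shingling: two documents count as near-duplicates when their sets of overlapping word sequences ("shingles") have high Jaccard similarity.

  def shingles(text, n=4):
      """The set of overlapping n-word sequences in a document."""
      words = text.lower().split()
      return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

  def jaccard(a, b):
      """Similarity of two shingle sets: |intersection| / |union|."""
      return len(a & b) / len(a | b) if a | b else 0.0

  doc1 = "the quick brown fox jumps over the lazy dog near the river"
  doc2 = "the quick brown fox jumps over the lazy cat near the river"
  print(round(jaccard(shingles(doc1), shingles(doc2)), 2))  # ~0.38; identical docs score 1.0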

Here are some of the references that have been in Hot Topics [webmasterworld.com] until now. I'm listing them in this thread and removing them from Hot Topics to help condense things:

Duplicate Content [webmasterworld.com] - get it right or perish
Why "www" & "no-www" Are Different [webmasterworld.com] - the canonical duplicate issue
HTTPS versus HTTP [webmasterworld.com] - one more duplicate area
Domain Root vs. index.html [webmasterworld.com] - yet another kind of duplicate
Custom Error Pages [webmasterworld.com] - beware the server header status code
Vbulletin [webmasterworld.com] & Wordpress [webmasterworld.com] - duplicate content pitfalls

Jez123

9:26 am on Jul 29, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



On the topic of duplicate content:

I am seeing, in my SERPs, duplicate copies of .co.uk sites on other domains (often the .com they bought at the same time as the .co.uk, I expect). There are about 3 or 4 sites doing this in the top 10 of my SERPs - an identical copy with a few links (or more) going to their main domain, and they're cleaning up in the SERPs. One is at #1, one is at #2, and one is at #5, and I haven't even looked closely at the others. I thought this was the sort of thing that Google was good at spotting?

It seems that the closer the similarity between the sites linking to you, the better (which I guess we all know) - but if they are identical, it's a winner.

Both the sites at #1 and #2 do have some good links as well, but I am certain they are cleaning up due to the dupe-content links back to their main sites.

I am a bit bitter as I used to be at #1 and now I am at #3 with cheaters ahead of me! :-)

It was suggested to me, when I posed the question on a newsgroup, that Google may well have relaxed its filters so that the rubbish floats to the top and becomes easier to spot manually - I don't buy this, though. Is anyone else seeing dupe content on this scale going unnoticed and unpenalized?