However, for some of my most important search terms (which I am on the first page for), several of my competitors rank higher than I do while using duplicate content. In several situations two of them are higher than my page, and they are using the SAME ~700 words of content, word for word.
How can this be if original content is all-important to search engines? It is frustrating to see my original content indexed lower than copied content. By the way, my PageRank is 3, theirs is 4. Is this the reason?
Incidentally I'm in the same boat. Watching other people soar to the top with little effort when it's taken me 18 months. Truly sickening.
First there is the situation where a page on one domain has the same content as a page on another domain. In this case, Google has a complex algorithm that tries to filter the results so that they are not swamped with copies of the same information in the user's top results. In other words, the Google algorithm will try to pick one domain and filter the others to a lower position (or cluster them in a "similar pages" listing) -- but this is not a penalty inflicted on a site, it's a filter on the results. It might feel like a penalty if your domain is the origin of the content and your page gets filtered out of the first page results, but there's no "black mark" against you in this case.
And yes, the filters are not perfect at all, and sometimes copies of the same content can dominate a result.
A second case occurs when, through some poorly thought-out code, many differing URLs on the SAME domain all resolve to the same content. There are all kinds of errors in planning and configuring a server that can make this happen. A common error, for instance, is putting tracking variables into a query string. A different query string is a different URL! Another mistake is using a redirect to serve a "custom error page" instead of returning a 404 HTTP status.
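Just to make the query-string point concrete, here is a tiny illustration (the tracking parameter names are made up for the example, not taken from anyone's site). To a crawler, each of these is a separate URL, even though the server would hand back the identical page for all three:

from urllib.parse import urlsplit

urls = [
    "http://www.example.com/article.html",
    "http://www.example.com/article.html?ref=newsletter",
    "http://www.example.com/article.html?utm_source=partner&utm_campaign=spring",
]

# A crawler keys its index on the full URL, query string included.
for u in urls:
    print(urlsplit(u).path, "|", urlsplit(u).query)

print(len(set(urls)), "distinct URLs, one piece of content")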
Google tries to select the dupes and then put all but one of them into the "supplemental index". If a domain has just a few instances of duplication like this in the Google index, things tend to go on as normal. But when many, many URLs start showing up, all with identical content, then something seems to get tripped at Google and a site can start to see trouble.
I think a lot of that trouble is just a protection for googlebot - to keep it from spending all kinds of bandwidth grabbing one copy after another of the same thing. But whatever the exact programming reason, URLs that used to bring in traffic can start to drop from the search results.
Case #1 - duplicate content is on different domains
Case #2 - duplicate content is all on the same domain
How to deal with this? In case #1, keep building those "signals of quality" in order to compete against the other domains. Or maybe file a few DMCA complaints! In case #2, locate the problem and fix it -- ensure that any bit of content can only be accessed by one unique URL, and that every unique URL only ever gets one consistent bit of content.
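For case #2, here is a rough sketch of the "one bit of content, one URL" idea -- the tracking parameter names are assumptions for the example, and it is written as a generic Python WSGI wrapper rather than anything specific to your setup. Any request that arrives carrying tracking noise gets a 301 back to the clean URL, so only one address ever serves the content:

from urllib.parse import parse_qsl, urlencode

# Assumed parameter names -- substitute whatever your own tracking links use.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "sessionid"}

class CanonicalURLMiddleware:
    """Wrap a WSGI app and 301-redirect URLs that carry tracking parameters."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        pairs = parse_qsl(environ.get("QUERY_STRING", ""), keep_blank_values=True)
        kept = [(k, v) for k, v in pairs if k not in TRACKING_PARAMS]
        if len(kept) != len(pairs):
            # Rebuild the URL without the tracking noise and redirect permanently,
            # so everything consolidates onto the one clean URL.
            path = environ.get("PATH_INFO", "/") or "/"
            location = path + ("?" + urlencode(kept) if kept else "")
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return self.app(environ, start_response)

The same principle covers any other source of variant URLs: pick one canonical form and 301 everything else to it.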
There are other varieties of duplicate content that can get more complex -- for example, where the duplication is across two or more domains that have a "Hilltop" relationship (the next-to-last domain token is the same, a shared IP address, the domains are heavily interlinked, etc.). This type of duplicate content can cause trouble. Naively configured "domain forwarding" that does not use a 301 redirect, for example, can create this kind of problem, especially if many domains are involved.
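If you want to see how a forwarded domain is actually answering, a request that does not follow redirects shows straight away whether you are getting a proper 301 with a Location header or a duplicate 200 copy of the page (example.org below is just a placeholder for the forwarded domain):

import http.client

# http.client does not follow redirects, so the raw answer is visible.
conn = http.client.HTTPConnection("example.org")   # placeholder: the forwarded domain
conn.request("GET", "/")
resp = conn.getresponse()
print(resp.status, resp.reason, resp.getheader("Location"))   # want: 301 plus the main domain
conn.close()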
For examples of duplicate content handling at Google, I found it can help to search for widely re-published news stories and study how those results are handled.
--------
In the case that steveweber123 mentions in the opening post -- yes, PageRank of the URL can be one factor here. Others might be anchor text within the site or on inbound links, the kind of template that holds the content, the titles of linking pages, and so on.
It looks to me like you can beat a higher PR page if you have more natural inbound links that point DIRECTLY to the page. The competition may still have higher PR because they have lots more links elsewhere and they circulate that PR well, but I think I see evidence that Google will value deep and direct inbound links.
If Google puts the real page in the supplemental index, it is in limbo land. The other copy isn't likely to rank either, for lack of links and other such signals.
When it comes to software written by wetware to control hardware, you should expect some errors somewhere.
In "Case #2" external links may point to different URLs which result in the same page of information displayed. So page rank from the links is split among the different URLs instead of all applying to one page. This is the same problem caused by having multiple domains resolve to the same site without a 301 redirect (for example having www.domain.com and domain.com going to the same site without redirecting one to the other).
-- Roger
As for the "Error Page", it MUST return a HTTP response of "404". Nothing else will do. That will signal to the bot that there is nothing at that URL to be indexed.
Far too often I see a site where if you ask for any page that does not exist (either a page that has gone, or just a simple typo in the URL), you are served a 302 redirect which takes you to a page with some basic site navigation on it.
The problem is, is that that URL will be indexed with the Error Page content, and so will all of the other URLs (potentially an infinite number) that lead to that same Error Page content.
So, once your error page content is indexed under thousands of different URLs, you have a big Duplicate Content problem on your hands; especially if you have, say, 100 real pages of content, and 500 "fake" Error Page URLs indexed. Those mis-indexed URLs are going to trip some trigger to cause you a lot of grief.
Make sure your Error Page really does return "404 Not Found" in the HTTP header. Use WebBug (or a quick script like the one below) to check that it really really really does this.
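If you don't have WebBug handy, a couple of lines of Python do the same check -- request a path you know doesn't exist and look at the status the server actually returns (the host and path below are placeholders):

import http.client

conn = http.client.HTTPConnection("www.example.com")      # placeholder host
conn.request("GET", "/this-page-does-not-exist-12345")     # deliberately bogus path
resp = conn.getresponse()
print(resp.status, resp.reason)   # want: 404 Not Found, not 200 or 302
conn.close()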