|Avoiding duplicate content penalties with republished articles|
Richard Lowe wrote in a post:
|I get quite a bit of traffic from links, which I've gotten by allowing other sites to republish articles as long as they include a link to my site. I'd say about half of my traffic comes from links. |
The issue has been touched on to a degree in other threads about duplicate content, with the possibility of the page with the lower PR being dropped and the one with the higher PR being kept in the index. The wisest course is probably to avoid duplication where possible, which isn't always possible when content is taken by others, including competitors.
But how about in cases where there are pieces written that can be distributed, not necessarily as syndication, but limited distribution as valuable content to several other sites, with or without a link back to the source site?
I'm posting this in the Google forum because there's a possibility of this happening, and Google is the only concern. It's no problem if some of the pages aren't included, because there would still be link traffic if a link were provided, and the primary purpose of doing it is strictly to share the information where it will be useful to people.
It's of concern now because there's been a request for a short piece for a newsletter that's archived on the site and indexed after its distribution, and there are a few other sites where approximately (though not exactly) the same pieces could be published.
There's no sense doing something and finding out after that there's been a penalty.
Two questions regarding this:
1. Is there any way to safely have articles or content on several sites without anyone incurring penalties?
2. If the answer is to have a certain percentage unique or different enough to distinguish them and not be considered exact duplicates, how much is required to be changed, and what percentage needs to be unique?
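No one outside the search engines knows the actual threshold, but the standard academic approach to "how similar is too similar" is shingling: break each page into overlapping word sequences and compare the sets. A minimal sketch follows; the window size, the toy sentences, and reading the result as a percentage are my assumptions, not anything Google has published:

```python
# Rough sketch of duplicate detection by w-shingling and Jaccard
# resemblance (Broder's approach). Illustrative only, not Google's
# actual mechanism or parameters.

def shingles(text, w=4):
    """Return the set of w-word shingles (overlapping word windows)."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Resemblance of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

original = "the quick brown fox jumps over the lazy dog near the river bank"
edited   = "the quick brown fox jumps over the lazy dog near the old mill"

sim = jaccard(shingles(original), shingles(edited))
print(f"resemblance: {sim:.2f}")  # → resemblance: 0.67
```

By this measure, changing the last two words of a thirteen-word sentence still leaves the copies about 67% the same; a production system would fingerprint and sample the shingles rather than compare them all.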
I'm not sure that penalty is the word, Marcia. Users don't want to find the same content listed several times in a search engine, so the engine should try to list only one URL for each.
> 1. Is there any way to safely have articles or content on several sites without anyone incurring penalties?
If you're trying to get the same page listed multiple times, then I can only guess that you are at odds with the search engines.
> ...what percentage needs to be unique?
Good question. Anyone?
Technical information, prescription information, legal briefs, press releases, and classified advertising all repeat their content. The first three in particular have reason for mirrored content. In my research I've seen these repeated across miscellaneous sites, reproduced exactly, with apparently no ill effect. The more popular the theme, the less repetition of the same content I see near the top rankings, but results do begin to cluster as the supply of new content maxes out.
What the threshold is, is a good question. I remember Robert_Charlton asked that once about something else and it turned into a pretty good discussion.
Might be worth looking at the MarketPosition newsletter if anyone wants to check out how syndicated content is dealt with. That's probably one of the most widely distributed publications with the clause that "This publication may be freely redistributed if copied in its ENTIRETY."
>If you're trying to get the same page listed multiple times
Just once actually, but I'd prefer that it be the right one. For any other it's immaterial whether it's listed or not. One won't be, for sure; it'll be in a password-protected membership area.
In cases where some material would be the same but part would have to be different anyway because of a different audience, there might end up being more than one listing. So the threshold for repetition is what I'm mostly concerned with: what percentage needs to be different for a page to be considered unique?
I think I vaguely remember something like 80%, but that might have been for links pages; it was a long time ago.
I would guess Google would treat duplicate content cautiously.
Unless duplicate content is obviously replicated over several pages on two sites (such as mycompany.de and mycompany.com), Google would seem to be playing copyright referee if it punished one of the two. Although Google has the right to do as they please, they would then be treading on slippery ground.
If a search were done in Google for a set of words that exists in the identical content on both pages, but that does not occur in any internal or external inbound link text to those pages, Google could show first the page that has existed longer in their index (by taking the age of links into account and using the unique page content identifier, both ideas from the recent Google programming contest) instead of showing the page with the highest PageRank first. That would probably be fairest.
Discounting the regular penalties, can anyone show me a page that got penalised (grey/white toolbar) for showing the same content as a page on an unrelated site?
> Just once actually, but I'd prefer that it be the right one
Assuming that you can't add a robots exclusion meta tag to the other copies of your page, or that you don't want to because they link back, the 'highest PageRank' approach seems to be the only way Google has of choosing at the moment.
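For reference, the exclusion tag would be an ordinary robots meta tag placed in the head of each republished copy, assuming the other site is willing to add it:

```html
<!-- in the <head> of the duplicate copy: keeps it out of the index
     while still letting spiders follow the link back to the original -->
<meta name="robots" content="noindex,follow">
```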
Otherwise, as far as I know your only option is to make the pages different enough. Usually, having someone else's header, footer and navbar is more than enough.
Some very close mirrors of pages do make it into Google.
> ...playing copyright referee if it punished one of the two...
The word 'punished' worries me. The overwhelming impression I get is that Google aren't trying to punish for this, just that they don't want to list a bunch of identical pages for a given search phrase. If they were trying to punish, then surely they wouldn't merge the PageRank.
Pages got the white/grey Toolbar back in December, but that was fixed. Whether it was a penalty or glitch can be debated; I suspect a glitch (or at least something that Google saw as a mistake).
Paynt, I think that you are referring to Rob C's post here [webmasterworld.com] which dealt with a client mirroring its site content on co-branded newspaper sites.
I am trying to hunt down a paper on how Google identifies similar content, as well as the mechanism used to decide what is sufficiently dissimilar to avoid a penalty. There has been a lot of circumspect discussion around duplicate content.
Something similar to the AltaVista paper which outlined their mechanism (one they are trying to patent) that relied heavily on a site's internal and outbound link structure.
Anyone prepared to point me in the right direction?
[edited by: pete at 1:26 pm (utc) on Aug. 5, 2002]
>taking the age of links into account
That sounds simple, but there can be links pointing to a site that have been around longer, while the links to the specific pages in question may not carry the same age factor.
>and using the unique page content identifier
vitaplease, this I'm not familiar with, I must have missed that along the way.
>Usually, having someone else's header, footer and navbar is more than enough.
I'd hope that would be enough, yet a member here lost his site's rankings entirely when someone *took* his content.
This whole issue of duplicate content, along with questions about multiple domains, is almost a constant topic now, and the fact that it keeps coming up suggests it hasn't been resolved clearly enough to reach a comfort level for a lot of people.
ciml, a lot is getting by at Google right now, so either they're not as adept at finding it or dealing with it as some would think, or one of these days there will have to be a massive purging.
pete, I hope you find those papers, that would make a very good read right about now.
>>and using the unique page content identifier
>vitaplease, this I'm not familiar with, I must have missed that along the way
Thomas Phelps and Robert Wilensky's work on robust hyperlinks and lexical signatures.
I'm not sure if that would do, but that is what I meant.
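For anyone unfamiliar with it, the Phelps and Wilensky idea is a "lexical signature": a handful of terms that are common in a page but rare elsewhere, so that the combination identifies the page even if it moves. A rough sketch of computing one; the scoring formula and the toy corpus are mine for illustration, not theirs:

```python
# Sketch of a lexical signature: score each term in the document by
# term frequency divided by how widespread the term is elsewhere, and
# keep the top few. A real system would use web-scale term statistics,
# not a toy corpus.

from collections import Counter

def lexical_signature(doc, corpus, k=5):
    """Return the k terms scoring highest on tf * rarity for `doc`."""
    tf = Counter(doc.lower().split())
    df = Counter()  # document frequency over the toy corpus
    for other in corpus:
        df.update(set(other.lower().split()))
    scores = {t: count / (1 + df[t]) for t, count in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

corpus = [
    "the cat sat on the mat",
    "the dog ran in the park",
    "a brewing guide for the home",
]
doc = "the rare word zymurgy appears in the brewing chemistry text"
print(lexical_signature(doc, corpus))
```

Rare terms like "zymurgy" rise to the top while ubiquitous ones like "the" drop out, which is what makes the signature a usable page identifier.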
>That's simple, but there can be links pointing to a site that have been around longer, yet specifically it may not have the same age factor as the links to the specific pages in question.
Every single page would start with an original internal link (otherwise it would never be indexed), and that link could carry a date stamp. The content (main body text) could change over time, reducing that effect, but the original content identifier would then say it is a "new" page.
In general, showing only the page with the highest PageRank would be unfair, as Joe Blow's original text will most probably have a lower PageRank than the newspaper site's page that copies the content.
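The tie-break described above can be put in a few lines: among near-duplicate pages, prefer the copy that has been in the index longest, and fall back on PageRank only when the ages match. The field names, values, and the rule itself are speculative, not anything Google has documented:

```python
# Toy sketch of "oldest copy wins" canonical selection among
# near-duplicate pages, with PageRank as the tie-break.
# Entirely hypothetical; illustrates the fairness argument above.

from dataclasses import dataclass

@dataclass
class Page:
    url: str
    first_indexed: int  # e.g. days since first crawl
    pagerank: float

def canonical(duplicates):
    """Pick one page to list: earliest-indexed copy, then highest PR."""
    return min(duplicates, key=lambda p: (p.first_indexed, -p.pagerank))

copies = [
    Page("http://joeblow.example/article.html", first_indexed=100, pagerank=2.1),
    Page("http://bignews.example/reprint.html", first_indexed=250, pagerank=6.8),
]
print(canonical(copies).url)  # the original wins despite its lower PageRank
```

Under a pure highest-PageRank rule the newspaper's reprint would be listed instead, which is exactly the unfairness being objected to.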