Google News Archive Forum

    
Avoiding duplicate content penalties with republished articles
Marcia

Msg#: 4543 posted 12:27 pm on Aug 1, 2002 (gmt 0)

Richard Lowe wrote in a post:

I get quite a bit of traffic from links, which I've gotten by allowing other sites to republish articles as long as they include a link to my site. I'd say about half of my traffic comes from links.

The issue has been touched on to a degree in other threads about duplicate content, with the possibility of the page with the lower PR being dropped and the one with the higher PR being kept in the index. The wisest course is probably to avoid duplication where possible, which isn't always possible when content is taken by others, including competitors.

But what about cases where pieces are written for distribution, not syndication as such, but limited distribution as valuable content to several other sites, with or without a link back to the source site?

I'm posting this in the Google forum because there's a possibility of this happening, and Google is the only concern. It's no problem if some of the pages aren't included, because there would still be link traffic if a link were provided, and the primary purpose of doing it is strictly to share the information where it will be useful to people.

It's of concern now because there's been a request for a short piece for a newsletter that gets archived and indexed after distribution (since it's put on the site), and there are a few other sites where approximately (not exactly) the same pieces could be published.

There's no sense doing something and finding out afterward that there's been a penalty.

Two questions regarding this:

1. Is there any way to safely have articles or content on several sites without anyone incurring penalties?

2. If the answer is to have a certain percentage unique or different enough to distinguish them and not be considered exact duplicates, how much is required to be changed, and what percentage needs to be unique?

 

ciml

Msg#: 4543 posted 5:17 pm on Aug 2, 2002 (gmt 0)

I'm not sure that penalty is the word, Marcia. Users don't want to find the same content listed several times in a search engine, so the engine should try to list only one URL for each.

> 1. Is there any way to safely have articles or content on several sites without anyone incurring penalties?

If you're trying to get the same page listed multiple times, then I can only guess that you are at odds with the search engines.

> ...what percentage needs to be unique?

Good question. Anyone?

paynt

Msg#: 4543 posted 5:44 pm on Aug 2, 2002 (gmt 0)

Technical information, prescription information, legal briefs, press releases, and classified advertising all repeat their content. The first three in particular have reason for mirrored content. In my research I've seen these repeated across miscellaneous sites, reproduced exactly, with apparently no ill effect. The more popular the theme, the less repetition of the same content I see near the top rankings, but results do begin to cluster as the supply of new content maxes out.

Where the threshold lies is a good question. I remember Robert_Charlton asked that once about something else and it turned into a pretty good discussion.

skibum

Msg#: 4543 posted 12:47 am on Aug 3, 2002 (gmt 0)

Might be worth looking at the MarketPosition newsletter if anyone wants to check out how syndicated content is dealt with. That's probably one of the most widely distributed publications with the clause that "This publication may be freely redistributed if copied in its ENTIRETY."

Marcia

Msg#: 4543 posted 6:10 pm on Aug 3, 2002 (gmt 0)

>If you're trying to get the same page listed multiple times

Just once actually, but I'd prefer that it be the right one. For any other, it's immaterial whether it's listed or not. One won't be listed for sure; it'll be in a password-protected membership area.

In cases where some of the material would be the same but part would have to be different anyway because of a different audience, there might end up being more than one listing. So the threshold for repetition is what I'm mostly concerned with: what percentage needs to be different for a page to be considered unique?

I think I vaguely remember something like 80%, but that might have been for links pages; it was a long time ago.
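
For what it's worth, the duplicate-detection papers I've seen measure overlap roughly the way sketched below: break each page's text into overlapping word "shingles" and compare the two sets. This is only an illustration of how such a percentage could be computed, not what Google actually does, and the filenames are made up.

def shingles(text, size=4):
    # Break the text into overlapping word n-grams ("shingles").
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(text_a, text_b, size=4):
    # Jaccard similarity of the two shingle sets:
    # 0.0 = nothing in common, 1.0 = identical shingle sets.
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

original = open("original-article.txt").read()      # hypothetical files
republished = open("republished-copy.txt").read()
print("Overlap: %d%%" % round(similarity(original, republished) * 100))

A measure like that at least gives a number to argue about, whatever the real threshold turns out to be.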


vitaplease

Msg#: 4543 posted 7:56 am on Aug 5, 2002 (gmt 0)

I would guess Google would treat duplicate content cautiously.

Unless duplicate content is obviously replicated over several pages on two sites (such as mycompany.de and mycompany.com), Google would in effect be playing copyright referee if it punished one of the two. Although Google has the right to do as it pleases, it would then be treading on slippery ground.

If a search query were done in Google for a set of words that exists in the identical content on both pages, but that does not contain words occurring in any internal or external inbound link texts to those pages, Google could show first the page that has been in its index longer (by taking the age of links into account and using the unique page content identifier - both ideas from the recent Google programming contest) instead of showing the page with the highest PageRank first. At least that would probably be fairest.
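
To make the idea concrete, here is a rough sketch of that tie-break as I imagine it (the field names and URLs are made up, and this is only my suggestion, not how Google actually chooses):

from datetime import date

def pick_canonical(pages, query_terms):
    # If the query matches only the shared body text and none of the
    # inbound link text, prefer the copy that has been indexed longest;
    # otherwise fall back to the copy with the highest PageRank.
    q = set(query_terms)
    if any(q & p["link_text_terms"] for p in pages):
        return max(pages, key=lambda p: p["pagerank"])
    return min(pages, key=lambda p: p["first_indexed"])

pages = [
    {"url": "http://original.example/article", "pagerank": 3,
     "first_indexed": date(2001, 11, 2), "link_text_terms": {"widgets"}},
    {"url": "http://bigportal.example/reprint", "pagerank": 6,
     "first_indexed": date(2002, 6, 20), "link_text_terms": {"widgets", "news"}},
]
print(pick_canonical(pages, ["duplicate", "content"])["url"])

Here the lower-PageRank original wins, because it was indexed first and the query words appear only in the shared body text.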

Discounting the regular penalties, can anyone show me a page that got penalised (grey/white toolbar) for showing the same content as a page on another, unrelated site?

ciml

Msg#: 4543 posted 10:26 am on Aug 5, 2002 (gmt 0)

Marcia:
> Just once actually, but I'd prefer that it be the right one

Assuming that you can't add a robots exclusion meta tag to the other copies of your page, or that you don't want to because they link back, the 'highest PageRank' approach seems to be the only way Google has of choosing at the moment.

Otherwise, as far as I know your only option is to make the pages different enough. Usually, having someone else's header, footer and navbar is more than enough.

Some very close mirrors of pages do make it into Google.

vitaplease:
> ...playing copyright referee if it punished one of the two...

The word 'punished' worries me. The overwhelming impression I get is that Google aren't trying to punish for this, just that they don't want to list a bunch of identical pages for a given search phrase. If they were trying to punish, then surely they wouldn't merge the PageRank.

Pages got the white/grey Toolbar back in December, but that was fixed. Whether it was a penalty or glitch can be debated; I suspect a glitch (or at least something that Google saw as a mistake).

pete

Msg#: 4543 posted 10:39 am on Aug 5, 2002 (gmt 0)

Paynt, I think that you are referring to Rob C's post here [webmasterworld.com] which dealt with a client mirroring its site content on co-branded newspaper sites.

I am trying to hunt down a paper on how Google identifies similar content as well as the mechanism used to make the decision on what is sufficiently dissimilar to avoid penalty. There has been a lot of circumspect discussion around duplicate content.

Something similar to the AltaVista paper which outlined their mechanism (which they are trying to patent) that relied heavily on a site's internal and outbound link structure.

Anyone prepared to point me in the right direction?

Pete

[edited by: pete at 1:26 pm (utc) on Aug. 5, 2002]

Marcia

Msg#: 4543 posted 11:05 am on Aug 5, 2002 (gmt 0)

>taking the age of links into account

That's simple, but there can be links pointing to a site that have been around longer, while the specific pages in question may not have links with the same age factor.

>and using the unique page content identifier

vitaplease, this I'm not familiar with; I must have missed it along the way.

>Usually, having someone else's header, footer and navbar is more than enough.

I'd hope that would be enough, yet a member here lost out on his site totally by someone *taking* his content.

This whole issue of duplicate content, along with questions about multiple domains, is almost a constant topic now, and it keeps coming up, so it doesn't seem to have been resolved clearly enough to reach a comfort level for a lot of people.

ciml, a lot is getting by at Google right now, so either they're not as adept at finding it or dealing with it as some would think, or one of these days there will have to be a massive purging.

pete, I hope you find those papers, that would make a very good read right about now.

vitaplease

Msg#: 4543 posted 11:09 am on Aug 5, 2002 (gmt 0)

>>and using the unique page content identifier

>vitaplease, this I'm not familiar with, I must have missed that along the way

Marcia,

[google.com...]

Honorable Mentions

Thomas Phelps and Robert Wilensky

I'm not sure if that would do, but that is what I meant.
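
If I remember right, their work is the "robust hyperlinks" idea: identify a page by a small lexical signature, a handful of terms that are frequent on the page but rare across the web. A rough sketch of that idea only (the crude TF/DF weighting and the sample document-frequency numbers below are my simplification, not their actual method):

from collections import Counter

def lexical_signature(page_text, doc_freq, k=5):
    # Score each term by how often it appears on the page divided by how
    # many documents in the corpus contain it, then keep the k best-scoring
    # terms as the page's signature.
    tf = Counter(page_text.lower().split())
    scored = {term: count / (1 + doc_freq.get(term, 0)) for term, count in tf.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

# doc_freq would come from a crawl: term -> number of documents containing it.
doc_freq = {"the": 1_000_000, "pages": 500_000, "method": 250_000,
            "flags": 80_000, "duplicate": 40_000, "shingle": 300}
print(lexical_signature("the shingle method flags duplicate pages", doc_freq))

Two copies of the same article would produce the same (or nearly the same) signature, which is what would let an engine recognise them as one piece of content.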

>That's simple, but there can be links pointing to a site that have been around longer, yet specifically it may not have the same age factor as the links to the specific pages in question.

Every single page would start with an original internal link (otherwise it would never be indexed) that could carry a date stamp. The content (main body text) could change over time, reducing that effect, but the original content identifier would then say it is a "new" page.

In general, showing only the highest-PageRank page would be unfair, as Joe Blow's original text will most probably have a lower PageRank than the newspaper site's page that copies the content.
