Forum Moderators: open

Thoughts on Google's Pagerank penalty for duplicate content

discussion on dupe content


heretic

3:20 pm on Nov 1, 2002 (gmt 0)

10+ Year Member



I've read a couple of posts on how Google penalizes some sites for duplicate content. I'm not sure I quite understand exactly how they're doing it, but if I have it right, when you have two pages that differ by only x%, Google penalizes both pages.
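
To make that concrete, here is a rough sketch (my own guess at the mechanics, not anything Google has published) of how an engine might measure that x% overlap, using word "shingles" and Jaccard similarity:

    # Rough sketch, pure guesswork: measure how much two pages overlap
    # using word "shingles" (every run of k consecutive words) and the
    # Jaccard similarity of the two shingle sets.

    def shingles(text, k=5):
        """Return the set of all k-word windows in the text."""
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

    def similarity(page_a, page_b, k=5):
        """Jaccard similarity of the two pages' shingle sets (0.0 to 1.0)."""
        a, b = shingles(page_a, k), shingles(page_b, k)
        if not a and not b:
            return 0.0
        return len(a & b) / len(a | b)

    # Two pages "differing by only x%" would score close to 1.0 here, and a
    # dupe filter might flag any pair above some threshold, say 0.9 (invented).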

The problem Google needs to solve:
If there are a thousand sites mirroring DMOZ.org's content, the "correct" way for Google to handle this is to mostly ignore the duplicate content and only "count" DMOZ.org, UNLESS someone is searching the content from within a particular domain. In short, it wouldn't hurt to index the duplicate content, but you would want those results hidden from general searches. You also would NOT want to penalize the original content, which is DMOZ.org. In most cases, however, Google will have no way to determine dynamically which copy is the "original" one.
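
In code terms, something like this toy sketch (all the names are invented):

    # Toy sketch of the behavior described above: duplicates stay in the
    # index, but at query time only one copy is shown, unless the searcher
    # has restricted the search to a particular domain.

    from collections import namedtuple

    Result = namedtuple("Result", "host fingerprint url")

    def filter_results(results, site_filter=None):
        """results: ranked list of Result objects, best first.
        Returns the list with duplicate content collapsed."""
        if site_filter:
            # Within a single domain, show everything, duplicates included.
            return [r for r in results if r.host == site_filter]
        seen, kept = set(), []
        for r in results:
            if r.fingerprint not in seen:
                seen.add(r.fingerprint)
                kept.append(r)
        return kept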

The Google solution:
Penalize all pages with duplicate content (although dmoz.org seems to be the exception). The penalty results in lower PageRank.

The problem(s) caused by this solution:
There are many legitimate reasons for duplicate content; mirrors, for example. php.net hosts the original, nicely done documentation, but if you are in Sweden you may prefer to view the content from a mirror in Sweden. So perhaps php.net "deserves" the higher PageRank, but should the page in Sweden have a "low" PageRank just because it is a mirror? Maybe. But then a search for "PHP manual +Sweden -America" may give less relevant sites a higher ranking, due to the lower PR of that mirrored page.

But a stronger example of where this causes a problem is where you have a database and want to display the same data in different ways. Let's say it's an articles database: you may want to sort by author, by category, or by title of article. Each view may differ by only 10%, but it is original, good content. Why should all of those pages have a lower PageRank?

Is there a better solution?
It seems to me that rather than penalize ALL of the sites with duplicate content by giving them a lower PR, it would be better to leave the PR as is and apply a kind of "penalty" based on the search being done: when a search comes in, dynamically choose which copy of the duplicate content is "most relevant", "most important", and perhaps "highest PR without any penalty", and hide the rest.
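
As a sketch of what I mean (the relevance-times-PR scoring rule is just my assumption):

    # Sketch of the proposal above: leave PageRank untouched and, per query,
    # pick one representative from each duplicate cluster by combined
    # relevance and PR, hiding the rest.

    from collections import defaultdict

    def collapse_duplicates(results):
        """results: list of (fingerprint, relevance, pagerank, url) tuples.
        Returns the best-scoring result from each duplicate cluster."""
        clusters = defaultdict(list)
        for fp, relevance, pr, url in results:
            clusters[fp].append((relevance, pr, url))
        best = [max(members, key=lambda m: m[0] * m[1])
                for members in clusters.values()]
        return sorted(best, key=lambda m: m[0] * m[1], reverse=True)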

What do you google experts think?

P.S. Did anyone notice Yahoo putting Googlism in their cool websites links today? I searched for "googlism" on Googlism and got almost no results...

JonB

3:35 pm on Nov 1, 2002 (gmt 0)

10+ Year Member



I don't think they penalize, or should penalize, duplicate content. They should definitely just ignore the page with the lower PR, that is, put the site with the higher score into the results.

Otherwise it would be too easy to kill competition: you would just go to a free web provider, download an entire competitor's site with, say, Teleport Pro, upload it, and by the next update both are PR 0.

I think they can't just do a simple "both get PR 0".

jackofalltrades

3:38 pm on Nov 1, 2002 (gmt 0)



I always assumed the dup content penalty was enforced to stop sites dominating the top ten with the same content.

E.g., www.sitea.com, www.siteb.com, etc. all have the same content, but sitea is number 1 in the SERPs, siteb is number 2, and the overall owner of the sites reaps the benefits of dominating that keyword.

This is what Google is penalising, as it is anti-competitive and of no use to the user.

On the issue of databases displaying listings, I don't think there is a problem. Duplicating content within a single site can hold no benefit (other than spammy keyword repetition, but that's a separate issue), so I don't think Google will pay much attention to it.

E.g., 1000 pages with exactly the same content are not going to do the site any good even if Google ignored them. Only one page (well, one page and maybe an index page) will be displayed in the SERPs, so internal dup content holds no benefit and will only serve to alienate visitors.

The DMOZ data-driven sites could probably be subject to a penalty for dup content on such a mass scale, but at the end of the day that dup content is links, not information.

IMHO, it is beneficial for the surfer to have such a large information directory in so many locations; it's more likely that they will find what they want. However, if information were duplicated on the same scale, it would be a different issue, because it holds no benefit for the repeat visitor.

Am I making sense here? I don't know... it's Friday afternoon... :)

JOAT

heretic

4:38 pm on Nov 1, 2002 (gmt 0)

10+ Year Member



I think you are...but I'm finding it a little confusing :)

jackofalltrades

4:42 pm on Nov 1, 2002 (gmt 0)



It must be the time of day! After all, it's such a simple concept... ;)

JOAT

Henley

5:05 pm on Nov 1, 2002 (gmt 0)

10+ Year Member



I tripped over this issue of duplicate content last week (or more precisely, near-duplicate websites) when I was searching for a business-to-business product supplied by mail order. And another issue cropped up too.
Being on the other side of the 'pond', I wanted a UK-based company with some real live individuals to talk to, as follow-on advice was imperative.
So I keyed in UK Long Short Thick Widgets. And what came up?
7 US-based companies
1 UK-based co, a subsidiary of one of the above.
2 UK companies: one only a consultant in the product, the other in educational courses for it. Of the US companies, one was totally off-subject; the others were all the same company with different websites. The content was changed around: one site was for upgrades, another for Macs, another for Microsoft, and so on, but all one product.
Very clever if you can do this and dominate the front page, and it's regarded as OK by Google; but seen from my seat, with only one player to choose from, it was a highly unsatisfactory outcome.
The next truly UK-based co was several pages down. There are signs that this problem of title devaluation is going to be rectified this time round, as I see the UK subsidiary has gone to the top and the other UK co, way down the list, has risen to No. 5. Has anybody else come across this problem of UK not being recognised as a differentiator in last month's dance?

gmoney

5:11 pm on Nov 1, 2002 (gmt 0)

10+ Year Member



"They should definitely just ignore the page with the lower PR, that is, put the site with the higher score into the results" - JonB

If Google were to adopt this policy then I think it would unfairly assume that the higher PR page is the originator of the content.

I think it might be best to do as heretic suggests and "leave the PR as is and apply a kind of 'penalty' based on the search being done". However, I think it would be important for Google to have some way of indicating that the rankings were hindered because of duplicate content.

That way the website owners could sort things out on their own. For example if somebody hijacked my content then I could send them a “friendly” email stating:

It has been brought to my attention that you have taken my original, copyrighted content and incorporated it on your website without my authorization. This has resulted in a loss of search engine traffic to my site which translates to a loss in revenue on my behalf. Please remove all of my content from your website immediately to avoid any further legal action regarding this matter.

If the duplicate site/content was owned by the same person then they would have to live with the “penalty” or stop duplicating their own content on multiple sites.

heretic

5:15 pm on Nov 1, 2002 (gmt 0)

10+ Year Member



Excellent addition! Yes, have a way to indicate that duplicate content was filtered. How about even giving us the option to search the duplicates as well, kind of like the little link they put after some search results saying that some results were omitted, letting us repeat the search with ALL results included... do you know what I'm talking about?

rfgdxm1

5:26 pm on Nov 1, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



IIRC, rather than penalize for duplicate content, doesn't Google bury such results, and you can only see them by clicking on a link at the bottom of the page to display the dup pages?

gmoney

5:27 pm on Nov 1, 2002 (gmt 0)

10+ Year Member



heretic,
Yea, maybe they could add "excessively similar pages". :)

[edited by: gmoney at 5:28 pm (utc) on Nov. 1, 2002]

JonB

5:28 pm on Nov 1, 2002 (gmt 0)

10+ Year Member



gmoney: sending email to someone who intentionally wants to harm your site would do no good; you would not even get a response. Remember, they want to do this to hurt your site in Google, and most probably you would only notice it once you already had the penalty.

Of course, Google would not determine this only by PR. There are other factors Google has: how long the site has been in the index, how many links are pointing to it, the quality of those links, what category it's in, etc. It is easy to filter out a total copy.

Also, what about legitimate mirrors, set up to divide the traffic and bandwidth? Many .org sites probably have mirrors.

I still think punishing BOTH sites would be unfair if there is no cross-linking or other proof that would show it is "cloaking".

Of course, duplicate results and mirrors are no good for search quality. I think Google looks at many, many factors when ranking pages, so a complete mirror would just get a lower rating and thus be eliminated from the results.

excell

5:33 pm on Nov 1, 2002 (gmt 0)

10+ Year Member



Henley - what you are saying about duplicated information is very common in some industries, e.g. travel.

And yes, it is an unsatisfactory outcome for the user when they search on something location-specific.

heretic

5:47 pm on Nov 1, 2002 (gmt 0)

10+ Year Member



What's IIRC?

gmoney:

Yea, maybe they could add "excessively similar pages".

To tell you the truth, I forget what that link they use at the bottom is for... was it for duplicate content being filtered out? Hmm...

gmoney

5:49 pm on Nov 1, 2002 (gmt 0)

10+ Year Member



sending email to someone who intentionally wants to harm your site would do no good

If the first email didn’t help then perhaps the second “not so friendly” email sent by a lawyer might. Perhaps a third “informative” email sent to their hosting company, affiliate sites, sites linking to them etc. might help.

Disclaimer: I am not a lawyer and do not really know what I am talking about. :)

I don’t think Google should play the role of judge and jury when it comes to determining who the originator of the content is. If they tried to do this then they might have to play the role of the defendant. However, I think Google is well positioned to play the role of unbiased evidence gatherers.

gmoney

5:54 pm on Nov 1, 2002 (gmt 0)

10+ Year Member



To tell you the truth, I forget what that link they use at the bottom is for... was it for duplicate content being filtered out? Hmm...

I think you are referring to the "more results from somedomain.com" link. I think it means that there are more than two relevant results within the same domain.
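
Something like this, I'd imagine (a guess at the mechanics, not anything Google has documented):

    # Guess at the "more results from" mechanics: show at most two results
    # per host and tuck the rest behind a link. Purely illustrative.

    from collections import defaultdict

    def crowd_by_host(results, per_host=2):
        """results: ranked list of (host, url) pairs. Returns the trimmed
        list plus the collapsed extras, keyed by host."""
        shown = defaultdict(int)
        kept, extras = [], defaultdict(list)
        for host, url in results:
            if shown[host] < per_host:
                shown[host] += 1
                kept.append((host, url))
            else:
                extras[host].append(url)  # behind the "More results from..." link
        return kept, extras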

austtr

4:27 am on Nov 2, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If Google thinks it's seeing duplication, does it in fact reduce PR, or just bury the site? Take the following hypothetical, which is based on a real case:

The primary site is a USA destination guide where each state has its own pages of content about regions, attractions, transport etc.

Then, several years later, we decide to launch a separate, stand-alone domain as a Florida accommodation guide. We put in the accommodation pages and, as part of building the content, we add tours, theme parks and (this is the important part) re-use the text from the Florida regional pages of the original USA site.

The ODP editors say this is OK, it being a valuable addition to their Florida tourism category. Yahoo takes the $299 and puts it in their directory. The site ranks top 5 on MSN, AV, and FAST for "Florida accommodation".

But that search on Google now has the site wwwaaaaayyyyyyy down at the end of the search results. However, the PR is still the same as it was pre-September.

So we are left trying to figure out whether the site is being slammed for duplicate content, no matter how well intentioned it may be, or whether there is some other factor involved.

europeforvisitors

8:23 am on Nov 2, 2002 (gmt 0)



I used to be affiliated with a large "network of sites" (we'll call it Snout.com), but we parted ways just over 13 months ago. I repackaged my copyrighted content on an independent site about a month later, but the former host continued to display my articles against my wishes for 8 or 9 months. (The content was finally taken down after a lawsuit was filed.)

During that 8- or 9-month period, the same articles were at the "network of sites" and on my new independent site. There were differences in the package or "container" around the article content, but the underlying content was identical except for some minor updates on the new site. In other words, a four-page article titled "Traveling with Widgets" on the old site was still a four-page article titled "Traveling with Widgets" on the new site.

I was worried that Google would consider my new site a "mirror site" of the old one, but that didn't happen. Instead, both sets of articles were listed in Google. In some cases the new version placed higher on Google's SERP; in other cases, the old version placed higher. Over time, the new versions crept up while the old versions slipped down, but neither the old nor the new versions were penalized by Google.

From this experience, I've come to believe that "mirror" content is just one of the factors that Google uses to determine whether a page is legitimate. I'm guessing that, if two pages are absolutely identical (right down to the navigation code), Google may flag one of them as spam. But if the two pages have different page titles, navigation code, etc., Google may think "Oh, that's a legitimate syndication" and leave well enough alone unless it finds other spam techniques like hidden text, repeated keywords, or cloaking. (It would almost have to take that approach, or it would be penalizing every newspaper site that uses stories and features from AP or Reuters.)
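
If I had to guess at the mechanics, it might look something like this (the body extractor is a made-up helper):

    # A guess at the heuristic described above: hash the full page and the
    # article body alone. Identical full pages look like true mirrors;
    # identical bodies inside different "containers" look like syndication.

    import hashlib

    def fingerprint(text):
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    def classify(page_a, page_b, extract_body):
        """extract_body: hypothetical helper that strips titles, navigation,
        and other container markup, leaving just the article text."""
        if fingerprint(page_a) == fingerprint(page_b):
            return "possible mirror/spam"  # identical down to the nav code
        if fingerprint(extract_body(page_a)) == fingerprint(extract_body(page_b)):
            return "possible legitimate syndication"
        return "distinct pages"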

Chris_R

8:27 am on Nov 2, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I agree with europeforvisitors.

See this [google.com] for example.

[edited by: Woz at 9:00 am (utc) on Nov. 2, 2002]
[edit reason] shortened link [/edit]

Brett_Tabke

8:42 am on Nov 2, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



In most cases, however, Google will have no way to determine dynamically which copy is the "original" one.

Not true. There is a very easy way: the oldest page is the accurate one, and it is NOT penalized in any way, shape, or form.

The Google solution:
Penalize all pages with duplicate content (although dmoz.org seems to be the exception). The penalty results in lower PageRank.

Not true. I care for 4 sites with 100% duplicate content. The main site is PR8 and the clones are PR3s. The main site was never affected at all. I've seen that same thing dozens upon dozens of times.

web_india

8:46 am on Nov 2, 2002 (gmt 0)

10+ Year Member



What if a high-PR site decides to use the content EXACTLY as it appears on a low-PR site (unlikely, but just a hypothetical example)? Would it make any difference to either site? Also, I guess the PR 8 site would now rank higher than the other site, right?

[edited by: web_india at 8:51 am (utc) on Nov. 2, 2002]

web_india

8:48 am on Nov 2, 2002 (gmt 0)

10+ Year Member



I care for 4 sites with 100% duplicate content. The main site is PR8 and the clones are PR3s

Brett, do these sites have the same navigation as well, or is only the content the same?

vitaplease

8:50 am on Nov 2, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Heretic,

also have a look at this thread:

[webmasterworld.com...]

heretic

2:22 pm on Nov 2, 2002 (gmt 0)

10+ Year Member



Brett,

Thanks for your input... I always thought that a developer could change the timestamp of a file... so if one PR8 site has a file dated whenever, you could just backdate your copy and then Google would think it's older, hence the original?
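
Though I suppose the engine could sidestep forged timestamps by trusting only its own crawl history, something like this (first_seen is an invented field, nothing Google has confirmed):

    # Sketch: if the engine trusts only the date it first crawled each URL,
    # backdating a file's timestamp or Last-Modified header gains nothing.

    import datetime

    def pick_original(duplicates):
        """duplicates: list of (url, first_seen) pairs, where first_seen is
        the date the crawler itself first fetched the page."""
        return min(duplicates, key=lambda d: d[1])

    pages = [
        ("http://php.net/manual/index.html", datetime.date(2000, 3, 14)),
        ("http://mirror.example.se/manual/index.html", datetime.date(2002, 9, 1)),
    ]
    print(pick_original(pages)[0])  # the page crawled first wins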

Vita, thanks, I will check it out...

europeforvisitors

5:40 pm on Nov 2, 2002 (gmt 0)



Not true. There is a very easy way: the oldest page is the accurate one, and it is NOT penalized in any way, shape, or form.

The oldest page may not be the accurate one, as in the example that I gave. Let's say that (as in my case) a Webmaster moves his content to a new host and the old host leaves the outdated pages on its servers. The new pages are the "authorized" pages and the old pages are, in effect, pirated--but more important, from the user's perspective, is that the old pages aren't being maintained and the new ones are.

Not all such cases involve unauthorized use, of course. Let's take another hypothetical example:

Consider Dr. Doe, a professor of German history at the University of Iowa. Dr. Doe posts a collection of resources on the history department's server at the University of Iowa. Two years later, he's offered a job at Harvard, and he recreates his resource list on Harvard's history department server. But his old pages are still online at the University of Iowa, because (a) he isn't search-savvy and has never heard of Google's "mirror sites" policy, or (b) he wanted to leave his resource list behind at Iowa for other people to use and modify if they wished.

So which list of resources is the most accurate one? To Dr. Doe and most users, the newer list (the one at Harvard) is the list that Google should feature in its search results.