The problem Google needs to solve:
If there are a thousand sites mirroring DMOZ.org's content, the "correct" way for Google to handle this is to largely ignore the duplicate content and only "count" DMOZ.org, unless someone is searching the content from within a particular domain. In summary, it wouldn't hurt to index the dupe content, but you would want those results hidden from searches. You also would NOT want to penalize the original content, which is DMOZ.org. In most cases, however, Google will have no way to dynamically determine which content is the "original" one.
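A minimal sketch of that idea, assuming a hypothetical index in which every duplicate cluster has exactly one member flagged as canonical (the Page structure, field names, and filter below are my own invention, not anything Google has published):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Page:
    url: str
    domain: str
    cluster_id: int      # pages carrying the same content share a cluster
    is_canonical: bool   # e.g., the dmoz.org copy in a cluster of DMOZ mirrors

def visible(page: Page, site_restriction: Optional[str]) -> bool:
    """Duplicates stay indexed, but only show up in site-restricted searches."""
    if site_restriction is not None:
        return page.domain == site_restriction
    return page.is_canonical

def filter_results(results: list[Page], site_restriction: Optional[str] = None) -> list[Page]:
    return [p for p in results if visible(p, site_restriction)]
```

Under this scheme a general search surfaces only the dmoz.org copy, while a search confined to a mirror's domain still finds the mirrored pages, unpenalized.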
The Google solution:
Penalize all pages with duplicate content (although dmoz.org seems to be the exception). The penalty results in lower PageRank.
The problem(s) caused by this solution:
There are many legitimate reasons for duplicate content. For example, mirrors. PHP.net has nice documentation, and that is the original page. But if you are in Sweden, you may prefer to view the content from a mirror in Sweden. So perhaps php.net "deserves" the higher PageRank, but should the page in Sweden have a "low" PageRank just because it is a mirror? Maybe. But then a search for "PHP manual +Sweden -America" may give less relevant sites a higher ranking because of the lower PR on that mirrored page.
But a stronger example of where this causes a problem is where you have a database and want to display the same data in different ways. Let's say it's an articles database. You may want to sort by author, by category, or by title of article. Each view may differ by only 10%. But it is original content, good content. Why should all of the pages have a lower PageRank?
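To make that concrete, here is a toy sketch (the article records are invented) showing how three sort views of one database become near-duplicate pages even though every one of them is original content:

```python
# The same article records rendered in three sort orders. Each "view" page
# carries identical underlying content, just reordered.

articles = [
    {"title": "Intro to Widgets", "author": "Smith", "category": "Basics"},
    {"title": "Advanced Widgets", "author": "Jones", "category": "Expert"},
    {"title": "Widget History",   "author": "Brown", "category": "Basics"},
]

def render(rows):
    return "\n".join(f"{r['title']} by {r['author']} ({r['category']})" for r in rows)

by_title    = render(sorted(articles, key=lambda r: r["title"]))
by_author   = render(sorted(articles, key=lambda r: r["author"]))
by_category = render(sorted(articles, key=lambda r: r["category"]))

# The three pages share every line; a duplicate filter that ignores ordering
# would call them 100% duplicates despite the content being original.
assert set(by_title.splitlines()) == set(by_author.splitlines()) == set(by_category.splitlines())
```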
Is there a better solution?:
Seems to me that rather than penalize ALL of the sites with duplicate content by giving them a lower PR, it would be better to leave the PR as is and decide on a kind of "penalty" based on the search being done. So when a search is done, dynamically choose which of the dupe content is "most relevant," "most important," and perhaps "highest PR without any penalty," and hide the rest.
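As a sketch of that proposal (the Page fields and the relevance() scorer below are hypothetical; this shows the shape of the idea, not how Google actually ranks):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    text: str
    cluster_id: int   # pages with duplicate content share a cluster
    pagerank: float   # left intact; no duplicate penalty is ever applied

def relevance(page: Page, query: str) -> float:
    # Hypothetical relevance: fraction of query terms found in the page.
    terms = query.lower().split()
    if not terms:
        return 0.0
    return sum(t in page.text.lower() for t in terms) / len(terms)

def dedupe(results: list[Page], query: str) -> list[Page]:
    """Per query, surface only the best member of each duplicate cluster."""
    clusters = defaultdict(list)
    for page in results:
        clusters[page.cluster_id].append(page)
    # The losers are merely hidden for this query, never penalized globally.
    return [max(members, key=lambda p: (relevance(p, query), p.pagerank))
            for members in clusters.values()]
```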
What do you Google experts think?
P.S. Anyone notice Yahoo putting googlism in their cool websites link today? I searched googlism on googlism and had almost no responses...
Otherwise it would be too easy to kill competition. You would just go to one free web provider, download an entire competitor's site with, let's say, Teleport Pro, upload it, and at the next update they are both PR 0.
I think they can't just do a simple "both get PR 0."
E.g., www.sitea.com, www.siteb.com, etc. all have the same content, but sitea is number 1 in the SERPs, siteb is number 2, and the overall owner of the sites reaps the benefits of dominating that keyword.
This is what Google is penalising, as it is anti-competitive and of no use to the user.
On the issue of databases displaying listings, I don't think there is a problem. Duplicating content within a single site can hold no benefit (other than spammy keyword repetition, but that's a separate issue), so I don't think Google will pay much attention to it.
E.g., 1,000 pages with exactly the same content are not going to do the site any good even if Google ignored them. Only one page (well, one page and maybe an index page) will be displayed in the SERPs, so internal dup content holds no benefit and will only serve to alienate visitors.
The DMOZ-data-driven sites could probably be subject to a penalty for dup content on such a mass scale, but at the end of the day the dup content is links and not information.
IMHO, it is beneficial for the surfer to have such a large information directory in so many locations; it's more likely that they will find what they want. However, if information were duplicated on the same scale, it would be a different issue, because it holds no benefit for the repeat visitor.
Am I making sense here? I don't know - it's Friday afternoon... :)
JOAT
"They should definitely just ignore the page with lower PR, that is, put the site with the higher score into the results." - JonB
If Google were to adopt this policy then I think it would unfairly assume that the higher PR page is the originator of the content.
I think it might be best to do as heretic suggests and "leave the PR as is and decide a kind of 'penalty' based on the search done." However, I think it would be important for Google to have some way of indicating that the rankings were hindered because of duplicate content.
That way the website owners could sort things out on their own. For example, if somebody hijacked my content, I could send them a "friendly" email stating:
It has been brought to my attention that you have taken my original, copyrighted content and incorporated it on your website without my authorization. This has resulted in a loss of search engine traffic to my site which translates to a loss in revenue on my behalf. Please remove all of my content from your website immediately to avoid any further legal action regarding this matter.
If the duplicate site/content was owned by the same person then they would have to live with the “penalty” or stop duplicating their own content on multiple sites.
Of course Google would not determine this only by PR. There are other factors that Google has: how long the site was in the index, how many links are pointing to it, the quality of those links, what category it is in, etc. It is easy to filter out a total copy.
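Those factors could be folded into a simple score; the weights below are pure guesses on my part, just to show that PR need not be the only input when picking the likely original:

```python
# Hypothetical "which copy is the original?" score built from the factors
# above: time in the index, inbound link count, and link quality.

def originality_score(days_in_index: int, inbound_links: int,
                      avg_link_quality: float) -> float:
    return (1.0 * days_in_index          # older pages are more likely original
            + 5.0 * inbound_links        # well-linked pages have earned trust
            + 500.0 * avg_link_quality)  # links from quality pages count extra

def pick_original(candidates):
    """candidates: list of (url, days_in_index, inbound_links, avg_link_quality)."""
    return max(candidates, key=lambda c: originality_score(*c[1:]))

# pick_original([("dmoz.org/page", 1500, 900, 0.9),
#                ("mirror.example/page", 60, 12, 0.2)])
# keeps the dmoz.org copy and filters out the total copy.
```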
Also, what about legal mirrors, that is, mirrors meant to divide the traffic and bandwidth? Many .org sites probably have mirrors, etc.
I still think punishing BOTH sites would be unfair if there is no cross-linking or other proof that would tell it is "cloaking."
Of course, double results or mirrors are no good for search quality. I think Google looks at many, many factors when ranking pages, so a complete mirror would just get a lower rating and thus be eliminated from the results.
Sending an email to one that intentionally wants to harm your site would do no good.
If the first email didn't help, then perhaps a second "not so friendly" email sent by a lawyer might. Perhaps a third "informative" email sent to their hosting company, affiliate sites, sites linking to them, etc. might help.
Disclaimer: I am not a lawyer and do not really know what I am talking about. :)
I don’t think Google should play the role of judge and jury when it comes to determining who the originator of the content is. If they tried to do this then they might have to play the role of the defendant. However, I think Google is well positioned to play the role of unbiased evidence gatherers.
The primary site is a USA destination guide where each state has its own pages of content about regions, attractions, transport etc.
Then several years later we decide to launch a separate, stand-alone domain as a Florida accommodation guide. We put in the accommodation pages, and as part of building the content we add tours, theme parks, and (this is the important part) re-use the text from the Florida regional pages of the original USA site.
The ODP editors say this is OK, being a valuable addition to their Florida tourism category. Yahoo take the $299 and put it in their directory. The site ranks top 5 with MSN, AV, FAST for "Florida accommodation".
But that search on Google now has the site wwwaaaaayyyyyyy down at the end of the search results. However, the PR is still the same as it was pre-September.
So we are left with trying to figure out if the site is being slammed for duplicate content no matter how well intentioned it may be, or is there some other factor involved?
During that 8- or 9-month period, the same articles were at the "network of sites" and on my new independent site. There were differences in the package or "container" around the article content, but the underlying content was identical except for some minor updates on the new site. In other words, a four-page article titled "Traveling with Widgets" on the old site was still a four-page article titled "Traveling with Widgets" on the new site.
I was worried that Google would consider my new site a "mirror site" of the old one, but that didn't happen. Instead, both sets of articles were listed in Google. In some cases the new version placed higher on Google's SERP; in other cases, the old version placed higher. Over time, the new versions crept up while the old versions slipped down, but neither the old nor the new versions were penalized by Google.
From this experience, I've come to believe that "mirror" content is just one of the factors that Google uses to determine whether a page is legitimate. I'm guessing that, if two pages are absolutely identical (right down to the navigation code), Google may flag one of them as spam. But if the two pages have different page titles, navigation code, etc., Google may think "Oh, that's a legitimate syndication" and leave well enough alone unless it finds other spam techniques like hidden text, repeated keywords, or cloaking. (It would almost have to take that approach, or it would be penalizing every newspaper site that uses stories and features from AP or Reuters.)
see this [google.com] for example.
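One plausible way to draw that line is to strip the "container" (navigation, template code) and compare only the body text, for example with word shingles. A rough sketch under those assumptions; strip_boilerplate() is a crude stand-in, and none of this is Google's documented method:

```python
import re

def strip_boilerplate(html: str) -> str:
    # Crude placeholder: drop markup, keep text. Real navigation/template
    # removal would be far more involved.
    return re.sub(r"<[^>]+>", " ", html)

def shingles(text: str, w: int = 4) -> set:
    # Overlapping w-word windows; robust to small edits and reflowed layout.
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(max(len(words) - w + 1, 0))}

def body_similarity(page_a: str, page_b: str) -> float:
    """Jaccard similarity of the two pages' body-text shingles."""
    a = shingles(strip_boilerplate(page_a))
    b = shingles(strip_boilerplate(page_b))
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Two copies of the same AP story wrapped in different site templates score
# near 1.0 on body text; a page that merely quotes a paragraph scores far lower.
```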
In most cases, however, Google will have no way to dynamically determine which content is the "original" one.
Not true. There is a very easy way: the oldest page is the accurate one, and it is NOT penalized in any way, shape, or form.
The Google solution:
Penalize all pages with duplicate content (although dmoz.org seems to be the exception). The penalty results in lower PageRank.
Not true. I care for 4 sites with 100% duplicate content. The main site is PR 8 and the clones are PR 3s. The main site was never affected at all. I've seen the same thing dozens upon dozens of times.
Not true. There is a very easy way: the oldest page is the accurate one, and it is NOT penalized in any way, shape, or form.
The oldest page may not be the accurate one, as in the example that I gave. Let's say that (as in my case) a Webmaster moves his content to a new host and the old host leaves the outdated pages on its servers. The new pages are the "authorized" pages and the old pages are, in effect, pirated--but more importantly, from the user's perspective, the old pages aren't being maintained and the new ones are.
Not all such cases involve unauthorized use, of course. Let's take another hypothetical example:
Consider Dr. Doe, a professor of German history at the University of Iowa. Dr. Doe posts a collection of resources on the history department's server at the University of Iowa. Two years later, he's offered a job at Harvard, and he recreates his resource list on its history department's server. But his old pages are still online at the University of Iowa because (a) he isn't search-savvy and has never heard of Google's "mirror sites" policy, or (b) he wanted to leave his resources list behind at Iowa for other people to use and modify if they wanted to do so.
So which list of resources is the most accurate one? To Dr. Doe and most users, the newer list (the one at Harvard) is the list that Google should feature in its search results.