|Resolving duplicate content with links to the original|
Matt Cutts announced that links can help resolve duplicate content issues
At PubCon, Matt Cutts said that a link from the copy to the original would help Google resolve duplicate content issues. Google would be able to determine which was the original because the original would have a link from the copies.
This makes sense to me. RSS syndication should make this happen automatically because the syndicated content always has a link back to the site as part of the RSS specification.
I have an issue with cobranded sites. The site is a complete copy of the original but with branding for another site as a "section" of that other site. Currently we block googlebot from the cobranded site with robots.txt even though there is a small amount of original content on the site (with a TON of duplicated pages). Our contract prevents us from putting links back to the original that users can follow. I'm wondering if links in the head like: <link rel=original href="http://example.com/original.html"> might help the search engines resolve the duplicate content so that we could unblock the sites from robots.txt and allow googlebot to find the small amount of original content.
I like the idea very much - but the issue here is that the attribute is not standard. A search for any usage at all turns up just a few examples. There is one mention of the idea in the W3C correspondence [lists.w3.org], but with no follow-up. And it does ask "Will search engines support this tag, if offered?"
A standard approach you might experiment with could be using rel="alternate":
|<link title="Original source for this document" rel="alternate" href="http://www.example.com/page.html"> |
Still, even using your non-standard attribute might be just enough. I'm pretty sure that Google would see the URL, even if the relationship would not be 100% clear.
[edited by: tedster at 3:02 pm (utc) on Nov. 21, 2006]
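For anyone who wants to experiment with the two head-link ideas above, here is a minimal sketch of how a crawler-side script could read such hints out of a page. It uses only Python's standard html.parser. The names LinkHintParser and find_source_hints are made up for this illustration, and nothing here reflects how Google actually treats either rel value.

```python
from html.parser import HTMLParser


class LinkHintParser(HTMLParser):
    """Collect (rel tokens, href) pairs from every <link> element seen."""

    def __init__(self):
        super().__init__()
        self.hints = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, so split it.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if rel_tokens and href:
            self.hints.append((rel_tokens, href))


def find_source_hints(html, rel_values=("original", "alternate")):
    """Return hrefs of <link> elements whose rel matches one of rel_values.

    "original" is the non-standard value proposed in this thread;
    "alternate" is the standard value suggested as a fallback.
    """
    parser = LinkHintParser()
    parser.feed(html)
    return [
        href
        for rel_tokens, href in parser.hints
        if any(token in rel_values for token in rel_tokens)
    ]


page = """
<html><head>
<link rel=original href="http://example.com/original.html">
<link title="Original source for this document" rel="alternate" href="http://www.example.com/page.html">
</head><body>Cobranded copy</body></html>
"""
hints = find_source_hints(page)
print(hints)
```

Both URLs are found regardless of whether the rel value is standard, which fits the guess above that a crawler would at least see the URL even if the relationship is unclear to it.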
"so that we could unblock the sites from robots.txt and allow googlebot to find the small amount of original content. "
While at the same time adding tons of duplicated pages to the index?
So Google is asking people to voluntarily shout to Google that the page is duplicate content? OK, so that will help push all the white hatters into the supplemental index. Meanwhile, the black hatters will continue on and hope for the best (for them). Am I misinterpreting what Matt said?
This is one area where Google has truly thrown the baby out with the bath water. They really need to grasp which site is the original on articles.
I wrote several articles on my site years ago that were related to just one section of my site. Another site, the authority in its field, requested to copy the articles I wrote on their site, and link back to my site identifying me as the author. The original articles had been online on my site for months, and were listed in Google.
Then, when my site dropped in November 2004, all of a sudden the pages on the authority site were credited with those articles instead of my site. And they rank pretty high, too. My site is nowhere to be found.
This is an insult, I am not credited by Google in any way for my original copyrighted work.
If Google had it together on this, they'd know my site is the original since the articles were online on my site for months and months prior to appearing on the authority site, and that site does link to my site on every page where there's an article. Because that site is an authority, I do link to them as well, but only on my links page.
|and that site does link to my site on every page where there's an article |
Do they link to your site, or do they link to the original article? There is a world of difference between the two.
BigDave makes a good distinction. I was just trying to make the point that including a link to the original article location in your syndicated articles is a good idea. If we see two identical articles A and B, and article B has a regular hypertext link to article A, that helps build the case that A is more authoritative.
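The A-and-B case above can be written down as a toy rule. This is only a sketch of the idea as stated in the thread, not a description of Google's algorithm: pick_original is a hypothetical helper, and the a.example and b.example URLs are placeholders.

```python
def pick_original(copies, links):
    """Guess which of several identical pages is the original.

    copies: set of URLs that carry the same article.
    links:  set of (source_url, target_url) hyperlinks.
    A copy gets one vote for each other copy that links directly to it.
    Returns the single top-voted URL, or None when there is no clear winner.
    """
    votes = {url: 0 for url in copies}
    for source, target in links:
        # Only links between two different copies of the article count.
        if source in copies and target in copies and source != target:
            votes[target] += 1
    best = max(votes.values(), default=0)
    winners = [url for url, count in votes.items() if count == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]


copies = {"http://a.example/article.html", "http://b.example/article.html"}

# B links straight to the article on A: A is picked.
deep_link = {("http://b.example/article.html", "http://a.example/article.html")}
print(pick_original(copies, deep_link))

# B links only to A's home page: the rule sees no signal and returns None.
home_link = {("http://b.example/article.html", "http://a.example/")}
print(pick_original(copies, home_link))
```

The second call shows why a link to the domain is not the same as a link to the article: under this rule a home-page link never connects the two copies.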
The links on the authority site are to the domain, not the individual pages. I will write to the site and ask them to link directly to the page on my site where the articles appear, instead of to the domain, since that obviously makes a difference.
So, for legitimate sites we now have a way to flag to Google which one is the original version.
However, for scrapers borrowing content without permission, and none of the sites linking to each other, how is Google going to decide which one is the original?
And here is a quandary... site A publishes an article. Site B scrapes it, and gets sites C, D, and E to also publish it, and C, D, and E also link back to site B. According to this, site B is the authority. Site A just looks like some random site that didn't bother linking back to the original.
How would they react even if sites X, Y, and Z linked back to A, in the face of sites C, D, and E linking back to B? Do you think that they would still realise that A is the "real" site? Here we have two sites, both claiming to be the authority and both having the incoming links to "prove" it?
[edited by: g1smd at 8:45 pm (utc) on Nov. 21, 2006]
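The scraper scenario above is easy to check with a naive tally of "links back to the original". This is a sketch that assumes backlinks between copies are the only signal; a real search engine has others, and count_backlinks is a name invented for this example.

```python
from collections import Counter


def count_backlinks(links):
    """Tally how many copies link to each target.

    links: iterable of (source_site, target_site) pairs.
    """
    return Counter(target for _source, target in links)


# Site B scraped A, then got C, D, and E to republish and link back to B.
scraper_links = [("C", "B"), ("D", "B"), ("E", "B")]
scraper_tally = count_backlinks(scraper_links)
print(scraper_tally.most_common(1))      # B leads; A has no votes at all

# Now suppose X, Y, and Z republish with links back to the real source, A.
contested_links = scraper_links + [("X", "A"), ("Y", "A"), ("Z", "A")]
contested_tally = count_backlinks(contested_links)
print(contested_tally["A"], contested_tally["B"])   # a dead heat
```

With link counts alone, the first case hands the article to the scraper and the second is a tie, so some other evidence would be needed to tell A from B.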
Any way you look at it, it's a bad policy of Google's. Here I am, original copyright holder of multiple works, receiving no credit for them because Google arbitrarily gave that credit to another site.
And once done, apparently, it's done. The other site has had those articles listed exclusively in Google for the past 2 years. I did the work, I published them first, and the other site gets the benefit.
Any way you slice it, it's just wrong.
|And once done, apparently, it's done. The other site has had those articles listed exclusively in Google for the past 2 years. I did the work, I published them first, and the other site gets the benefit. |
Well, that's solely your fault, if you forgot to check for the deep link.
Back when the decision was made to allow the other site to reproduce the articles, Google didn't seem to care about such things. I would have thought a credit to me and a link to my site was all that was necessary. I don't have a crystal ball, so there was no way of knowing they would want a link back to the very page on my site where the article originally appeared.
What do you suppose we're doing TODAY that will get us in trouble in 4 or 5 years? Could it be all those reciprocal links will get you penalized in the future? Who knows what Google will pull out of their bag of tricks in the coming years.
The only thing any of us can depend on is change. And something that is perfectly fine today will no doubt be a problem in the future, based on past history with Google.
Google does not have a crystal ball either. Even 3 or 5 years ago, the way to avoid this was to get links directly to your copy of the work. What if the original was on paper, you published it on the web first, then the copyright holder published it at a later date?
It seems obvious to me that if you start licensing your works, you will risk having the search engine consider a different copy of your work to be more important.
Google's goal, first and foremost, is to make sure their SERPs are not loaded up with duplicates. Most searchers (Google's search customers) don't care whether they get the copy from the original source.
Of course Google would like to have the most authoritative version, which is often the original, but they don't have that crystal ball either. If you want your original version to be the most authoritative, don't license it to sites that kick your butt, and do things that will show authority to the search engines, like getting links directly to your articles.
Don't expect Google to just assume that you are the best source, simply because you consider yourself to be the best source.
In what way does it protect you from scrapers?
In fact, it seems to me, it gives the scrapers an easy means to tell Google that they are the authoritative source.
|What do you suppose we're doing TODAY that will get us in trouble in 4 or 5 years |
Google's goal is to find the best results for its customers.
The way to accomplish that will always change, but the goal will remain the same.
=> Content and Semantics are the Kings
None of Brett's 26 Steps [webmasterworld.com...] will ever get us in trouble
I have tons of content on my sites that people can't find if they search for it in Google. I'm always in the Top 10 at every other SE for keywords, keyphrases, page titles, etc., so I must be doing something right.
When a person searches for Blue Widgets, they might like a bit more info than a picture and a price. And they can find pictures and prices on my site as well, along with a lot of other information at spot #155 on Google.
Yes indeed, Google certainly is maintaining the best search results... (NOT)