|Duplicate Content Threshold... Is there one?|
We inadvertently created a number of duplicate pages on various URLs, which we have since corrected. We are seeing some pretty good results so far. We still have some products that are shown the same, verbatim, on just two sites now.
Does Google have some kind of threshold for duplicate content? Some is OK, but more is not?
There is some kind of threshold for internal URL duplication problems - if you hit it, there's usually a warning in your WMT account. Usually it takes something like a nearly infinite URL space, with actual links that point to the URL variations, to cause a problem.
Tedster, could you please explain
"go into nearly infinite URL space, with actual links that point to the URL variations"
I am not sure I understand
Sometimes a server is configured so that any value at all added as a parameter in a query string will generate the same content at yet another URL. That is an infinite URL space - and it can be deadly. Site search result pages are one of many ways this can happen.
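One way to spot this on your own site is to group crawled URLs by a hash of the content they return. The following is only a minimal sketch in Python: the `pages` dictionary stands in for a crawl you would have to run yourself, and the example.com URLs and the `foo` parameter are made up for illustration.

```python
import hashlib
from collections import defaultdict


def group_by_content(pages):
    """Return lists of URLs that served byte-identical bodies.

    `pages` maps URL -> response body (a hypothetical crawl result).
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    # Only groups with more than one URL are duplicates.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl data: three URL variations return the same page.
pages = {
    "http://example.com/search.php?q=widgets": "<html>widgets</html>",
    "http://example.com/search.php?q=widgets&foo=1": "<html>widgets</html>",
    "http://example.com/search.php?q=widgets&foo=2": "<html>widgets</html>",
    "http://example.com/about": "<html>about</html>",
}
duplicates = group_by_content(pages)
```

An exact hash only catches identical responses. Pages that differ by a timestamp or a session token in the markup would need some normalization first.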
|Sometimes a server is configured so that any value at all added as a parameter in a query string will generate the same content at yet another URL. That is an infinite URL space - and it can be deadly |
Am I right in thinking that adding a canonical META tag would defeat this problem? (ie: with .../search.php?p=1&p=2&p=3...etc just using a canonical call to search.php )
A canonical link is a band-aid for the situation. Yes, it "should" work, but it puts the responsibility on the search engines rather than fixing it on your own server.
Fixing it on your own server is a 100% thing - a canonical link is not. Added to that, if the scripting makes an error and inserts an incorrect href value for the canonical link, the complications on Google can roll on for a long time.
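To illustrate what a server-side fix could look like, here is a rough Python sketch that keeps only the query parameters that actually change the content and reports where a request should be 301-redirected. The parameter names in `ALLOWED_PARAMS` and the example URLs are assumptions for illustration only. Which parameters matter depends entirely on your own application.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical whitelist: the only parameters that change the page content.
ALLOWED_PARAMS = {"q", "page"}


def canonical_url(url):
    """Drop unknown parameters, keep the first value of each known one,
    and sort what remains so every variation maps to a single URL."""
    parts = urlsplit(url)
    kept = {}
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key in ALLOWED_PARAMS and key not in kept:
            kept[key] = value
    query = urlencode(sorted(kept.items()))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def redirect_target(url):
    """Return the URL to 301-redirect to, or None if already canonical."""
    target = canonical_url(url)
    return target if target != url else None
```

With this in place, a request for `http://example.com/search.php?q=widgets&foo=1` would be redirected to `http://example.com/search.php?q=widgets`, so the variations never resolve as separate pages in the first place.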
Thanks Tedster. Considering this must be a very common issue I am surprised that Google doesn't simply ignore duplicate content from decorated URLs. I know you were talking metaphorically but it shouldn't need a "fix" to prevent it harming a site IMO.
[edited by: Simsi at 6:33 pm (utc) on Jan 11, 2012]
From what I can see, Google does make an effort in this direction - ignoring certain duplicate URL issues. You can see some parameters in WMT that are already being ignored, for instance. However, there are so many specific cases, exceptional cases and variations, that site owners are wise to take responsible steps on their own.
Simply put, sometimes URL "decorations" do matter. Even more complex, sometimes a site has some URL parameters that do matter and others that don't. The web as a whole is quite immense and it's full of edge cases.
It is not wise to let indexing go on automatic pilot as determined by some algorithm.
To give a real example, a Java-based web server adds a unique parameter for each reload of a given page.
Reloading the page gives this:
Reloads of Inner Page:
Reloads of Another Inner Page:
The server apparently needs the parameters for other things on the site to work, so stripping them out at the server level doesn't work.
WMT would be a good way to make Google ignore the parameters because the page loads fine without the parameters, which get inserted by the server.
While searching for a review this morning, I could not believe what I was seeing. Duplicate content overload: a well-known .uk review site at #1, then at #8, #9, and #10 their .com and .in domains with the same content. I clicked on the second page to find, in order, com.my, com.sg, .ie, and .ca, then at the bottom another .com, .uk and .in, all from the same company again and with the same content on them. Page 3 was no better, with another 6 results like this.
User friendly? Surely this is unfair for the user and also other websites.
|Duplicate content overload, a well known .uk review site at #1, then at #8,9, and 10 their .com and .in domains with the same content... |
courier - I'd like to avoid opening this up to the discussion of a specific domain or keywords, but the site you're describing sounds very much like TripAdvisor, and we did discuss what appeared to be multiple duplicate pages of TA ranking here....
How Google is Showing the Results
As I mentioned in that thread, most of the pages that were ranking were different paginated reviews, images, etc, and, as such, they were not duplicates. Google for the most part appeared to be showing different pages on the different ccTLDs.
I've since noticed, though, that on some searches, several of the results are the main TA listing pages for a particular query, which are essentially identical on different ccTLDs. So yes, Google is ranking some dupe content on different ccTLDs, but not quite as much as it might appear if you just look at the brand name. This is consistent with how Google handles ccTLDs. When the inbound links are sufficiently localized and independent, and other conditions are satisfied, Google tries to allow ccTLDs to appear in the SERPs. I haven't checked the details of the TA sites in terms of hosting, localization on the pages, etc. PageRank and authority are likely also factors.
|Does google have some kind of threshold for duplicate content. Some is ok, but more is not? |
I've noticed over the years... even way before the Vince Algo update [webmasterworld.com...] which many describe as being all about branding... that if a site had extremely good inbounds, Google would allow significantly more internal duplication than if a site was badly linked. I've never seen highly templated pages in a geo-directory type site, eg, cause nearly the same problem in a high PageRank, high authority site as they do in a low PR, low authority site. I can't give you percentages, as there are a lot of variables.
While I don't have a good before and after Panda comparison to say if this treatment of internal duping has changed, my guess is that Google did not intend for Panda to reduce rankings for sites with high "trust, reputation, authority, and PageRank," which is how, during discussions about branding, Matt Cutts characterized pages that Google wanted to rank.