
Duplicate Content Threshold... Is there one?

   
5:02 pm on Jan 10, 2012 (gmt 0)

10+ Year Member



We inadvertently created a number of duplicate pages on various URLs, which we have since corrected. We are seeing some pretty good results so far. We still have some products that are shown the same, verbatim, on just two sites now.
Does Google have some kind of threshold for duplicate content? Some is OK, but more is not?

Thanks
5:58 pm on Jan 10, 2012 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



There is some kind of threshold for internal URL duplication problems - if you hit it, there's usually a warning in your WMT account. Usually it takes something like a nearly infinite URL space, with actual links that point to the URL variations, to cause a problem.
6:23 pm on Jan 10, 2012 (gmt 0)

10+ Year Member



Tedster, could you please explain

"go into nearly infinite URL space, with actual links that point to the URL variations"

I am not sure I understand

Thanks
8:47 pm on Jan 10, 2012 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Sometimes a server is configured so that any value at all added as a parameter in a query will generate the same content at yet another URL. That is an infinite URL space - and it can be deadly. Site search result pages are one of many ways this can happen.
5:06 pm on Jan 11, 2012 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Sometimes a server is configured so that any value at all added as a parameter in a query will generate the same content at yet another URL. That is an infinite URL space - and it can be deadly


Am I right in thinking that adding a canonical META tag would defeat this problem? (ie: with .../search.php?p=1&p=2&p=3...etc just using a canonical call to search.php )
5:31 pm on Jan 11, 2012 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



A canonical link is a band-aid for the situation. Yes, it "should" work, but it puts the responsibility on the search engines rather than fixing it on your own server.

Fixing it on your own server is a 100% thing - a canonical link is not. Added to that, if the scripting makes an error and inserts an incorrect href value for the canonical link, the complications on Google can roll on for a long time.
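
As a rough illustration of what "fixing it on your own server" could look like, here is a minimal Python sketch, not anyone's actual setup. It assumes the site knows which query parameters really change the content; the allowlist `MEANINGFUL_PARAMS` and the helper names `canonical_url` and `redirect_target` are hypothetical. Any request carrying other parameters would get a 301 to the cleaned URL instead of serving the same content at a new address.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical allowlist: the only query parameters that change the content.
MEANINGFUL_PARAMS = {"q", "page"}


def canonical_url(url):
    """Return the URL with unknown parameters dropped and the rest sorted."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in MEANINGFUL_PARAMS
    ]
    kept.sort()  # a stable order, so ?page=2&q=x and ?q=x&page=2 agree
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


def redirect_target(url):
    """Return the URL to 301 to, or None if the request is already canonical."""
    target = canonical_url(url)
    return None if target == url else target
```

With this in place, a request for `http://example.com/search.php?q=shoes&junk=1` would be redirected to `http://example.com/search.php?q=shoes`, while the clean URL is served directly. The allowlist approach is a design choice: unknown parameters are treated as noise by default, so the URL space stays finite.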
6:32 pm on Jan 11, 2012 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Thanks Tedster. Considering this must be a very common issue I am surprised that Google doesn't simply ignore duplicate content from decorated URLs. I know you were talking metaphorically but it shouldn't need a "fix" to prevent it harming a site IMO.

[edited by: Simsi at 6:33 pm (utc) on Jan 11, 2012]

6:42 pm on Jan 11, 2012 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



From what I can see, Google does make an effort in this direction - ignoring certain duplicate URL issues. You can see some parameters in WMT that are already being ignored, for instance. However, there are so many specific cases, exceptional cases and variations, that site owners are wise to take responsible steps on their own.

Simply put, sometimes URL "decorations" do matter. Even more complex, sometimes a site has some URL parameters that do matter and others that don't. The web as a whole is quite immense and it's full of edge cases.

It is not wise to let indexing go on automatic pilot as determined by some algorithm.
2:01 am on Jan 12, 2012 (gmt 0)

WebmasterWorld Administrator anallawalla is a WebmasterWorld Top Contributor of All Time 10+ Year Member



To give a real example, a Java-based web server adds a unique parameter for each reload of a given page.

Home Page:
e.g. example.com/some/fixed/values/?_afrWindowId=4bo49s9oc_1&_afrLoop=430786379670795&_afrWindowMode=0&_adf.ctrl-state=4bo49s9oc_4

Reloading the page gives this:
example.com/some/fixed/values/?_afrWindowId=4bo49s9oc_1&_afrLoop=430819531485831&_afrWindowMode=0&_adf.ctrl-state=4bo49s9oc_4

Reloading again:
example.com/some/fixed/values/?_afrWindowId=4bo49s9oc_1&_afrLoop=431367959538926&_afrWindowMode=0&_adf.ctrl-state=4bo49s9oc_4

Reloads of Inner Page:
example.com/some/fixed/values/?_afrLoop=431668803284948&_afrWindowMode=0&_adf.ctrl-state=4bo49s9oc_4&
example.com/some/fixed/values/?_afrLoop=431734358055589&_afrWindowMode=0&_adf.ctrl-state=4bo49s9oc_4&

Reloads of Another Inner Page:
example.com/some/fixed/values/?_afrLoop=431772593983704&_afrWindowMode=0&_adf.ctrl-state=4bo49s9oc_4
example.com/some/fixed/values/?_afrLoop=431806008931704&_afrWindowMode=0&_adf.ctrl-state=4bo49s9oc_4

The server apparently needs the parameters for other things on this site to work, so stripping them out at the server level doesn't work.

WMT would be a good way to make Google ignore the parameters because the page loads fine without the parameters, which get inserted by the server.
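
To get a feel for how many crawled URLs collapse to the same page, one could run something like the Python sketch below over a list of URLs from the server logs. It is only an audit aid under stated assumptions: the prefixes in `VOLATILE_PREFIXES` are guessed from the example URLs above, and the helper names `strip_volatile` and `group_by_page` are made up for illustration. It does not change what the server emits.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: these prefixes mark the per-request parameters seen in the
# example URLs (_afrWindowId, _afrLoop, _afrWindowMode, _adf.ctrl-state).
VOLATILE_PREFIXES = ("_afr", "_adf.")


def strip_volatile(url):
    """Return the URL without the parameters assumed to be per-request."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(VOLATILE_PREFIXES)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


def group_by_page(urls):
    """Count how many URL variations map to each stripped URL."""
    return Counter(strip_volatile(url) for url in urls)
```

Feeding the reload URLs above (with a scheme added) through `group_by_page` would show them all counted under one stripped URL, which is the kind of evidence that helps decide which parameters to tell Google to ignore.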
8:16 am on Jan 13, 2012 (gmt 0)



While searching for a review this morning, I could not believe what I was seeing. Duplicate content overload: a well-known .uk review site at #1, then at #8, 9, and 10 their .com and .in domains with the same content. I clicked on the second page to find, in order, com.my, com.sg, .ie, .ca, then at the bottom another .com, .uk and an .in, all from the same company again and with the same content on them. Page 3 was no better, with another 6 results like this.

User friendly? Surely this is unfair for the user and also other websites.
5:59 am on Jan 16, 2012 (gmt 0)

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Duplicate content overload, a well known .uk review site at #1, then at #8,9, and 10 their .com and .in domains with the same content...

courier - I'd like to avoid opening this up to the discussion of a specific domain or keywords, but the site you're describing sounds very much like TripAdvisor, and we did discuss what appeared to be multiple duplicate pages of TA ranking here....

How Google is Showing the Results
http://www.webmasterworld.com/google/4381552.htm [webmasterworld.com]

As I mentioned in that thread, most of the pages that were ranking were different paginated reviews, images, etc, and, as such, they were not duplicates. Google for the most part appeared to be showing different pages on the different ccTLDs.

I've since noticed, though, that, on some searches, several of the results are the main TA listing pages for a particular query, which essentially are identical on different ccTLDs. So yes, Google is ranking some dupe content on different ccTLDs, but not quite as much as it might appear if you just look at the brand name. This is consistent with how Google handles ccTLDs. When the inbound links are sufficiently localized and independent, and other conditions are satisfied, Google tries to allow ccTLDs to appear in the SERPs. I haven't checked the details of the TA sites in terms of hosting, localization on the pages, etc. PageRank and authority are likely also factors.

Does Google have some kind of threshold for duplicate content? Some is OK, but more is not?

I've noticed over the years... even way before the Vince Algo update [webmasterworld.com...] which many describe as being all about branding... that if a site had extremely good inbounds, Google would allow significantly more internal duplication than if a site was badly linked. I've never seen highly templated pages in a geo-directory type site, eg, cause nearly the same problem in a high PageRank, high authority site as they do in a low PR, low authority site. I can't give you percentages, as there are a lot of variables.

While I don't have a good before and after Panda comparison to say if this treatment of internal duping has changed, my guess is that Google did not intend for Panda to reduce rankings for sites with high "trust, reputation, authority, and PageRank," which is how, during discussions about branding, Matt Cutts characterized pages that Google wanted to rank.
 
