Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

This 193 message thread spans 7 pages (this is page 4).
Duplicate Content - Get it right or perish
Setting out guidelines for a site clean of duplicate content

 12:00 am on Aug 26, 2006 (gmt 0)

Probably one of the most critical areas of building and managing a website is dealing with duplicate content. But it's a complex issue, with many elements making up the overall equation: what's in and what's out, what's on site and what's off site, what takes precedence and what doesn't, how one regional domain can or cannot coexist with another's content, what % is the same, and so on, plus how the consequences are treated by Google in the SERPs.

Recently, in one of Matt's videos, he also commented that the matter was complex.

When I looked into these forums [ unless I missed something ] I could see nothing that set out the elements in a high-level format that could be broken down and translated into a framework for easy management.

Does anyone believe they have mastered the comprehensive management of dupe content on Google into a format that can be shared on these forums?



 7:51 pm on Sep 16, 2006 (gmt 0)

>> a site:website.com or site:www.website.com may show pages as supplemental,

but when you try it again for a specific page e.g.

site:website.com/ABCD/ or site:www.website.com/ABC/ it may be "clear" <<

Yes, this happens where the Supplemental Result represents an older version of the page content, and the normal result represents the current content returned by that URL.


 1:48 am on Sep 17, 2006 (gmt 0)

Put it another way, is there a % of dupe content on the overall site which tips the overall balance of the site's pages in the overall rankings, and is G's filter applied to the whole site, regardless of the keyword searches chosen?

I don't see Google ranking sites lower for duplicate content. But I can see Google declining to deep crawl a site, or refusing to index pages from a specific directory, due to a high dupe content %. Then the loss of pages in the main index, or the number of un-refreshed supplemental pages, will negatively impact the site's overall ranking.


 1:47 am on Sep 18, 2006 (gmt 0)

I'm wondering if sites that rectified their link architecture to alleviate duplicate content, by replacing "/index.htm" and "default.htm" with "/" , may find that their pages are subjected to a "Sandbox" effect, where the results are dampened by the filter for a prolonged period of time which I'll call a "recovery period".

The reason for the thought is that whilst our sites are apparently "recovering" by returning to the index, our results only rank well for terms in inverted commas, " - " [ exact match ], but are consistently at the bottom for general searches.

This has been going on for about two months since the remedial work took place.

Has anyone else witnessed similar effects whilst getting their duplicate content sorted out?

[edited by: Whitey at 1:57 am (utc) on Sep. 18, 2006]


 6:54 am on Sep 18, 2006 (gmt 0)

yet for other sites a couple of pages that are say 60% identical can mean severe drops... it's clearly about the type of site rather than the duplication itself...

Please elaborate on this soapy as I am in this boat with one website. Mostly original articles but about 8 articles that are duplicates and posted with permission from other websites. The articles offered great information for our clients, but unfortunately were not blocked from indexing when July 27th hit.


 7:12 am on Sep 18, 2006 (gmt 0)

I mean that as you go up the trust/authority path (in Google terms, not reality), they apply less and less of a penalty for identical misdemeanors.

If you had inbound links to your articles, it may well be that they would balance out or cancel any duplication penalty: one-way links, whether or not organic.


 8:15 am on Sep 18, 2006 (gmt 0)

CainIV - I go along with SoapyStar on this one.

But how are you linking to that content? Have you got a direct link or a "no follow" on it?

We have a PR7 .edu site deep linked, that was introduced about 4 weeks ago. The DC's vary in what site:ourdomain.com/Subcategory1/ shows up for pages as being indexed [ anything from 12 to 153 ] which is heaps better than 1 that was there before!

Although trust rank thresholds on different DC's may vary, I suspect the quality of those links may break the suppression. A better way, though, is to look at the underlying issue of duplicate content on these pages, which will always cause vulnerability if it's not understood and dealt with.

As I mentioned above, I suspect there is a kind of sandbox routine going on in the period after dupe content is fixed, so even when you break this issue, it may not be over indefinitely.


 6:12 pm on Sep 18, 2006 (gmt 0)

Hi guys.

One site of mine suffered on July 27th. The rest stayed perfectly fine and really have not moved in either direction since.

The site I am referring to:

Has had a 301 redirect in place for about 10 months from non www to www.

Was ranking quite high for main keywords and page one for secondary keywords.

Had mostly written content of my own in my niche. Checked other websites using copyscape and no scrapers etc had my content including the homepage content.

Has deep links to articles, links to categories, and a lot of social bookmarking via tags. The website is RSS enabled and I have checked that as well for duplicate issues.

At that time I had not placed a nofollow tag to the articles I had permission to post which were indexed on other websites.

There are no other paths to content other than search engine friendly mod rewritten urls which are static.

My take is to simply remove all articles that may have copies elsewhere on the internet. In the software I use, when I remove an article the old url returns a 404. I am wondering if this would help.

Any help or insight is appreciated:)


[edited by: CainIV at 6:16 pm (utc) on Sep. 18, 2006]


 7:25 pm on Sep 18, 2006 (gmt 0)

The articles offered great information for our clients

If you put that content online for your visitors, why not just exclude it from indexing (robots.txt or robots meta tag) and leave it online? Would probably require a lot less re-coding of links, etc. and your clients are still being fed.
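A minimal sketch of the robots.txt route, checked with Python's standard-library parser. The `/reprints/` directory and the example.com URLs are hypothetical, not taken from the site under discussion. Bear in mind that robots.txt controls crawling; the robots meta tag is the page-level alternative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps crawlers out of a /reprints/ directory
# holding the republished articles (the directory name is an assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /reprints/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(is_crawlable(ROBOTS_TXT, "http://www.example.com/reprints/article-1.html"))
    print(is_crawlable(ROBOTS_TXT, "http://www.example.com/articles/original.html"))
```

Running the check before uploading the file is a cheap way to confirm that only the republished articles are excluded and the original ones stay crawlable.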

Checked other websites using copyscape and no scrapers etc had my content including the homepage content.

That makes me wonder if this is really the problem for you. Duplicate content across domains is a different issue than duplicate content within a domain. If these articles only show up on a couple of sites, I'm not convinced it's the source of your ranking challenges.


 7:35 pm on Sep 18, 2006 (gmt 0)

>> Duplicate content across domains is a different issue than duplicate content within a domain. <<

And for duplicate content across domains, "exact duplicates" (absolutely exact code and content) are treated differently to "syndication duplicates" (navigation code different, at the very least).
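To illustrate the distinction (this is only a sketch, and not a claim about how Google detects either case), an exact duplicate can be caught with a hash of the raw markup, while a syndicated copy with different navigation needs a similarity measure. Word shingles with Jaccard similarity are swapped in here as one common technique; the function names and the shingle size of 3 are assumptions.

```python
import hashlib
import re


def fingerprint(html: str) -> str:
    """SHA-256 of the raw markup: equal only for byte-for-byte copies."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles taken from the visible words of `text`."""
    visible = re.sub(r"<[^>]+>", " ", text.lower())  # crude tag stripping
    words = re.findall(r"[a-z0-9']+", visible)
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Shared shingles divided by all shingles; 1.0 means identical wording."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    original = "<p>the quick brown fox jumps over the lazy dog</p>"
    syndicated = "<div>nav</div><p>the quick brown fox jumps over the lazy dog</p>"
    print(fingerprint(original) == fingerprint(syndicated))
    print(jaccard(original, syndicated))
```

The hash comparison fails as soon as the surrounding template changes, which is why a syndicated copy needs the fuzzier measure.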


 8:34 pm on Sep 18, 2006 (gmt 0)

I have spoken too soon. Found a proxy server site running a script that has an exact copy of my homepage.

The cache date for this page is current - Sept. 17 06

The cache date for my website is gone entirely.
No links to my homepage are relative, they are all absolute

Base ref is in place.

I guess what was old is now new again :(



 5:25 am on Sep 19, 2006 (gmt 0)

If you put that content online for your visitors, why not just exclude it from indexing (robots.txt or robots meta tag) and leave it online? Would probably require a lot less re-coding of links, etc. and your clients are still being fed.

Whitey, thank you for your assistance. I can place a nofollow on the pages now; however, could there already be negative effects, as the pages were not tagged in this manner for the last 6 months?



 7:00 am on Sep 19, 2006 (gmt 0)

CainIV - We had a similar problem, which we fixed about 6 months ago, that was affecting about 60% of the site's content. The pages came back quite quickly. So I think there's hope for you, provided there is nothing else - it looks like just a few pages for you.

What I'm grappling with is this: if the whole site is recovering [ as ours is ], is the site potentially sandboxed for other types of duplicate content? We appear to be experiencing those symptoms in a "form of sandbox".

If anyone can share their experiences of this "sandbox" following the tidying up of their site/s, I'd really like to share the info.

[edited by: Whitey at 7:13 am (utc) on Sep. 19, 2006]


 5:22 am on Sep 20, 2006 (gmt 0)

I'd really like to get to the bottom of this sandboxing possibility, and of the disparity between exact and broad match results that I mentioned previously, which appears connected to recovering from dupe content. It's giving me grief that our results are suppressed and appearing like this.

Have you seen something like this, where a [ "keyword keyword" ] exact match is at the top and non-supplemental, and a broad match [ normal search ] is badly dampened and shows a supplemental for the same pages?

We're experiencing this on a sample of pages:

No 1 result for exact match - No supplemental result - cached 11Sep06

No 87 result for a broad match result - but with a supplemental result - cached 21Apr06

The current pages are clean of duplicate content, with different meta titles, descriptions, and content. What's happening, I wonder?

There appears to be a duplicate content filter that is holding us down and seems set against old cached pages!

And on top of that I have a feeling there is a sandbox filter in place for pages being restored.

[edited by: Whitey at 5:40 am (utc) on Sep. 20, 2006]


 7:56 pm on Sep 23, 2006 (gmt 0)

Here's a thread from last year about duplicate content. It's well worth revisiting (even though some of the links are now broken) and it looks at things from a different angle -- more about true duplicate content and not about accidentally created multiple urls. Particularly note Brett Tabke's post, and the insights about how Google works altogether:


Thanks to Marcia for pointing this one out again!

[edited by: tedster at 8:29 pm (utc) on Sep. 23, 2006]


 8:12 pm on Sep 23, 2006 (gmt 0)

Blimey! A duplicate content thread that I missed out posting in!


 8:00 pm on Sep 28, 2006 (gmt 0)

One question when moving a small website to a new url: Is putting the noindex meta tag on the old site enough? The old site dropped in ranking in the last data refresh and I'm worried about using a 301 redirect because the new url already ranks high and I don't want to lose that. Basically the question is how does Google's memory work? I don't want to rewrite all the text.


 8:04 pm on Sep 28, 2006 (gmt 0)

You can put the noindex meta tag on the old site if you want. That will get it out of the index, and allow the new site not to be in competition with a duplicate.
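For anyone wanting to confirm the tag really is on every old page, here is a rough sketch using Python's standard-library HTML parser. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def has_noindex(html: str) -> bool:
    """True if the page carries a robots meta tag containing 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(has_noindex(page))
```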

The redirect might be better as then all traffic to the old site will seamlessly arrive on the new site. The redirect will get the old site delisted soon enough too.

Technically, the redirect is the correct thing to do, because the content really has moved. The redirect MUST return a 301 status code in the HTTP header.
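A quick way to check that point is to request the old URL without following the redirect and look at the status code and the Location header. The sketch below uses Python's `http.client`, which does not follow redirects on its own; the URLs are placeholders and the helper names are hypothetical.

```python
import http.client
from urllib.parse import urlsplit


def fetch_redirect(url: str, timeout: float = 10.0):
    """HEAD `url` without following redirects; return (status, Location header)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def is_permanent_move(status: int, location, expected_target: str) -> bool:
    """True only for a 301 whose Location is exactly the intended new URL."""
    return status == 301 and location == expected_target


if __name__ == "__main__":
    # Placeholder hosts; substitute the old and new URLs being checked.
    status, location = fetch_redirect("http://old-site.example/page.html")
    print(status, location)
    print(is_permanent_move(status, location, "http://new-site.example/page.html"))
```

A 302 or a meta refresh would fail this check even though a browser visitor ends up on the same page either way.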

Patrick Taylor

 10:00 pm on Sep 28, 2006 (gmt 0)

Blimey! A duplicate content thread that I missed out posting in!

Crikey! g1smd's repeated the same good advice so many times on Webmasterworld - a multiplicate content Google penalty awaits...

So I vote for a special Duplicate Content Forum. Keep up the good (and patient) work, g1smd.


 10:07 pm on Sep 28, 2006 (gmt 0)

LOL. I just had that very same conversation with one of the Moderators here just a few hours ago.

This topic really has "hit the fan" in the last week or so. More people seem to be "getting it" now.


 10:32 pm on Sep 28, 2006 (gmt 0)


Why does Googlebot keep going back to pages it gets a 301 on - over and over and over again? I've done the 301 redirect from www. to non-www. A year ago, at least. I've indicated my preference with Google itself, through the tool they provided for me.

It should be very obvious to Google that I don't want www. to be indexed. I never intended for it to be a duplicate site, and I've done all I can to make it evident it's not.

Yet, I've seen Googlebot hit the same page at least 5 times in the last 12 hours, getting a 301 every time. How many times does it have to go back to check? And isn't there a better use of Google resources?

Why? What more can I do?
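Counting those repeat visits from the raw logs can be sketched as follows. This assumes the Apache "combined" log format and identifies Googlebot by its user-agent string only, which can be spoofed; the sample lines, addresses and paths are invented for illustration.

```python
import re
from collections import Counter

# Assumes Apache "combined" format: request line, status, size, referer, agent.
LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) .*"(?P<agent>[^"]*)"$'
)


def googlebot_hits(lines):
    """Count requests per (path, status) whose user-agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line.strip())
        if match and "googlebot" in match.group("agent").lower():
            counts[(match.group("path"), int(match.group("status")))] += 1
    return counts


# Invented sample lines (documentation-range IP addresses).
SAMPLE = [
    '192.0.2.1 - - [28/Sep/2006:10:01:10 +0000] "GET /page.html HTTP/1.1" 301 0 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.1 - - [28/Sep/2006:14:30:02 +0000] "GET /page.html HTTP/1.1" 301 0 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.7 - - [28/Sep/2006:14:31:00 +0000] "GET /page.html HTTP/1.1" 301 0 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]

if __name__ == "__main__":
    for (path, status), hits in googlebot_hits(SAMPLE).most_common():
        print(path, status, hits)
```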


 10:35 pm on Sep 28, 2006 (gmt 0)

You need do nothing. Google will forever check the status of every URL that it has ever seen just in case the status ever changes again in the future.

It has to be that way. Google spiders a vast number more URLs than they index content for, and they index a vast quantity more URLs than they show in the search results.

That's correct and it couldn't be any other way.

The amount of time fetching your 301 response is minuscule. It can fetch dozens of those in the same time that it takes to download one page of normal content. Don't worry about it.
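For reference, the kind of www to non-www 301 being discussed can be sketched as a small WSGI middleware. This is an illustration under assumptions, not the poster's setup: `example.com` is a placeholder for the preferred hostname, and the sketch ignores ports and `SCRIPT_NAME` for brevity.

```python
CANONICAL_HOST = "example.com"  # placeholder for the preferred (non-www) hostname


def canonical_host_middleware(app, canonical_host=CANONICAL_HOST):
    """Wrap a WSGI app so that any other hostname gets a 301 to the canonical one."""

    def wrapper(environ, start_response):
        host = environ.get("HTTP_HOST", "").lower()
        if host and host != canonical_host:
            scheme = environ.get("wsgi.url_scheme", "http")
            path = environ.get("PATH_INFO", "/") or "/"
            query = environ.get("QUERY_STRING", "")
            target = f"{scheme}://{canonical_host}{path}"
            if query:
                target += "?" + query
            # The response has no body, which is why fetching it is so cheap.
            start_response(
                "301 Moved Permanently",
                [("Location", target), ("Content-Length", "0")],
            )
            return [b""]
        return app(environ, start_response)

    return wrapper


def site(environ, start_response):
    """Stand-in for the real site."""
    body = b"hello"
    start_response(
        "200 OK",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    return [body]


application = canonical_host_middleware(site)
```

The redirect response carries only headers, so a crawler re-checking it costs the server far less than serving a normal page.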


 11:00 pm on Sep 28, 2006 (gmt 0)

You don't need to do more, and nobody can control how others link to you, so Google hitting the old URLs is nothing to be concerned about, other than making sure your own links are going to the correct URLs and seeing if any major/friendly links to you are to the correct URL.


 1:17 am on Sep 29, 2006 (gmt 0)

G1SMD And Others ---

Need a bit of advice... We seem to be suffering from supplemental hell and possibly a few other issues. From reading all of the posts, I seem to notice a few of the symptoms.

One that caught my eye early in this thread was that a certain directory could end up being penalized due to a few entries. We have noticed this also. Some of our directories do great, while others do terribly. Would it be okay to move some of the content out of these directories to others using a 301? I want to subdivide the articles into more concise directories. Would this cause problems?

We also tie all of our articles together using index pages. Each index page contains about 30 links to articles. The entry in the index page contains a small snippet from the article. Usually the first paragraph. This first paragraph is also used in the meta-description tag. Could this be causing us issues? Could it be considered duplicate?

Appreciate all thoughts and help.


 6:48 am on Sep 29, 2006 (gmt 0)

"a certain directory could end being penalized due to a few entries"

That's silly. Many sites have exactly one file in every directory.


 10:05 am on Sep 29, 2006 (gmt 0)

>>>>>We also tie all of our articles together using index pages. Each index page contains about 30 links to articles. The entry in the index page contains a small snippet from the article. Usually the first paragraph. This first paragraph is also used in the meta-description tag. Could this be causing us issues? Could it be considered duplicate? >>>

good question..anyone?


 8:14 pm on Sep 29, 2006 (gmt 0)

Since we are on the topic of dupe, I have a question.

I have a website that has multiple urls to a particular page.

All but the main url (the one I prefer) are supplemental, naturally.

A search of those caches shows a May 6th date (06) and a different site design, as I had switched designs since then.

Can this still harm me? Is it safer at this point to amalgamate and go through the cleanup with noindexing and 301s?


 9:46 pm on Sep 29, 2006 (gmt 0)

Yes, all of the alternative URLs need to return a 301. That is the fix.


As for the index page problem, there is a possibility that it is seen as duplicate content.

As an experiment, I took one long page of content and split it into two. Google dropped the long page (behind the "click for omitted results" message) within days and listed only the two shorter ones instead.

However, I wonder if you link to /folder/index.html and Google lists www.domain.com/folder/ instead. That would be a far bigger problem.
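The /folder/index.html versus /folder/ case can be sketched as a small normalizer that maps each alternative URL to the one form it should 301 to. The list of index filenames is an assumption based on the names mentioned in this thread, and the function names are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Index filenames seen in this thread; extend for the server in question.
INDEX_FILE = re.compile(r"(?<=/)(?:index|default)\.html?$", re.IGNORECASE)


def canonical_url(url: str) -> str:
    """Strip a trailing index filename so /folder/index.html becomes /folder/."""
    parts = urlsplit(url)
    path = INDEX_FILE.sub("", parts.path) or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def redirect_map(urls):
    """Map each non-canonical URL to the canonical URL it should 301 to."""
    mapping = {}
    for url in urls:
        target = canonical_url(url)
        if target != url:
            mapping[url] = target
    return mapping


if __name__ == "__main__":
    seen = [
        "http://www.domain.com/folder/index.html",
        "http://www.domain.com/folder/",
        "http://www.domain.com/default.htm",
    ]
    for source, target in redirect_map(seen).items():
        print(source, "->", target)
```

The same function can be run over a site's internal links, so that the site never links to a form that then has to be redirected.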


 1:09 am on Sep 30, 2006 (gmt 0)

I'm wishing anyone who takes this on the very best - perhaps Matt, Vanessa , GG and Adam could give some thought on how they could support this.


I don't believe it is a complex issue at all. :o

A heart transplant is a complex issue. What google wants you to think is not a complex issue.

Who has heard or watched these google representatives truly resolve a problem? They are vague at best, and they are skilful in sublime vagueness and have turned it into an art form.

How often, if ever, has Matt, Adam or the other misleadingly named google representative told you that the www in your domain is in fact a sub-domain? And that google has a definite dislike for sub domains?

The www is doing ZERO. Nothing. And it is a problem that is staring webmasters in the face, and it is still a problem because these representatives want you to keep being puzzled.

I've never come across a host provider that asks me how I want the www treated. They all either don't know or don't care. How many webmasters have been asked by a host, "Why are you hosting your site on our servers in a way that is very vulnerable to google?"

Canonical issues, duplicate content issues and a myriad of other problems do not exist in themselves; they manifest because google creates the environment for them.

These representatives have never informed webmasters properly about anything. Adam ventured into a powerful thread recently to proclaim the existence of peanut butter and sightings of Elvis eating hamburgers to deflect webmasters from the valid points being discussed.

If a webmaster continues to think that there is any value in the www without knowing why it is there, other than that it means World Wide Web, which it does not, then we will never progress and will continue to play a game set out by google to keep webmasters guessing. The www was a gimmick the pioneers of the internet needed for other purposes.

I've never heard these representatives give anything away other than cryptic clues.

Millions of websites are in exactly the state I am explaining here, and for anybody to say it is the webmasters' fault would be an unfair comment.

Let us assume a simple website is to be created like a fortress against duplicate content. Firstly, there is no such thing as duplicate content. Google creates it. So you are safe before you start. Use that safety to your advantage. Don't just read google's misleading webmaster guidelines.

When purchasing a name for this simple fortress type website think first that the registrar has no or little knowledge regarding search technology. He only wants your money.

A standard and arcane method is used to sell domain names, often by ill equipped registrars.

When you purchase a domain for this simple website as the example here, the responsibility is now yours to protect the domain from google. Google will cause your domain duplicate content and canonical issues if you present this domain to google as it is.

How will google find this domain to mistreat it? Google has to ask the global DNS whether the domain exists. Since it exists, google is then told that the name is parked at cheapskateregistrars nameservers. Now, this is the place to make sure that google is never given the opportunity to apply duplicate content penalties or canonical issues to your website.

Here you can merge the two versions of the domain together so that they resolve to one only. A far better choice would be to get rid of the www subdomain, since google has shown a dislike for subdomains and their algo is forever changing. Leave the fewest possible variants to minimize problems in the future.

WWW is nothing but a subdomain. That is why it is near impossible to create a subdomain such as mysite.www.simple-site.com. So this explains why, if you only had simple-site.com and you wanted a subdomain for ware-wolf-woes, you then get ware-wolf-woes.simple-site.com, and if you wanted to abbreviate it, you now have www.simple-site.com. We are now back to the www, full circle.

Get rid of the www subdomain and be left only with simple-site.com in the DNS A records. Now it becomes totally impossible for google to create a problem regarding www versus non www.

No agent, browser, crawler, spider or harvesting bot can go wrong. Your website now answers to one name only. It is impossible for a mix-up. And the risks and problems google might throw your way are eradicated at source. Regarding www and non www, that is.

These google representatives are employees of google. They should ooze with technical replies and foster confidence in webmasters but they are trained only to keep you guessing.

Google does not tell you that if you do not make sure that your website answers only to one name that it will give you penalties. It simply goes ahead and does it. Nor does google own up to its responsibilities that it is its harvesting crawlers that pick up damaging links to your website. In fact, google misleads you by saying another website cannot harm your site.

We have say;
[simple-site.com...] just like a usual website. Unprotected.

It ranks nicely after your efforts and you are pleased with the looks and hits to your website. You have say 50 pages of that site nicely crawled by google and all indexed. Some wiz kid webmaster in a remote Chinese village creates a scraper site and is using a php based automated linking process that is going to point to your website. He leaves out the www. Ahhh, he has actually left out the subdomain. Don't forget.

A link now is visible at the wiz kid's site pointing to [simple-site.com...] The wiz kid used a cheap computer and uploaded the site in between feeding chickens and pigs on his land. Unintentionally creating a killer link that google says is impossible.

And google is the potential killer. Its harvesting bot has detected the killer link. Potentially, this link can create 50 duplicate pages of your website.

Pagerank is going to be split. Duplicate content penalties are going to abound. Untold problems are going to beset your website by google because you are going to be caught cheating.

The harvesting bot informs google's database that a new link exists and is put aside for later processing by deepcrawl bots. A time bomb is ticking away and it is going to explode.

Google instructs a deepcrawl bot to GET info for [simple-site.com...] The bot must first ask DNS whether the domain exists. It is told that it exists on nameserver NS01CHEAPSKATES, and the killer bot goes there and sees that it points to the server the simple site is on. At the poorly configured server, hosted by a one man band who knows nothing about search technology, the bot makes a request. BANG.... It is given a 200 OK, because this is the very first time the site has been requested under that name. Now the bot takes the contents of the index page, full of relative links, to google's notoriously ill equipped algo. A trained eye would spot in the same raw logs that 2 minutes ago a google deepcrawl bot had requested [simple-site.com...] and was given a 304 Not Modified. Here is the evidence: two websites exist, and that is cheating in the eyes of google.

The process now continues until all 50 pages are in google without the www and all 50 pages with the www and 50 duplications are to be processed by the duplicate content algo. Red flag after red flag is raised against the unprotected site.

Sensational names are given to google's updates. Bourbon is the chosen name amongst webmasters to celebrate the event. The owner of www.simple-site.com makes a first post, frustrated: MY SITE HAS TANKED. What happened? Who can help me?

If the DNS A records said that the site only answers to one name, then no duplication would result. No canonical issues. No vulnerability. The near useless host provider would be academic.

Now we know that another webmaster can indeed tank your website.
Base refs etc are untidy and dangerous to a novice. Serverside redirects and many other things can be done but that is another story.

Please do not use this example as a basis to fix your website. I'm simply making a point about how webmasters are misled by google.

Sorry to have made such a long post.


[edited by: AlgorithmGuy at 1:39 am (utc) on Sep. 30, 2006]


 3:18 am on Sep 30, 2006 (gmt 0)

Remember, all, that there are way more G's here than most of us know about.

Analyze top natural listings and you will find out it has not much to do with www versus non www and all in between, for G's sake. In fact, trying to make your site free of errors for any SE doesn't do anything for top placements anymore (for now). Pre-Florida was a hint of that for some in G, and some notice that other engines do copy G and always seem to have a little update just after G does.

Look around you in your natural money keywords now in G, MSN or Yahoo. All the same: tons of small sites with tons of one way sig-n-blog sites. Tons of the same bull as always, including large payoff sites. It always will be this way. It's just getting much more obvious as the SEs struggle to figure out advertising revenue.

[edited by: tedster at 1:00 am (utc) on Oct. 4, 2006]


 3:20 am on Sep 30, 2006 (gmt 0)

Let's not debate Google's ethics or intentions. Let's help each other get our sites properly indexed. That's the topic of the thread -- "guidelines for a site clean of duplicate content".

Jordo needs a drink

 6:01 am on Sep 30, 2006 (gmt 0)

This thread could be accessed using:


And www.webmasterworld.com/?jordo_needs_a_drink=yes can access the webmasterworld homepage also...

My point in showing that is that you never know how another site is going to link to you.

Even if you think you've done everything you could to eliminate the duplicate content/sup issue, at least for the home page, everyone should do a Google search on:
site:www.mysite.com "your homepage title" without the quotes
site:www.mysite.com "your homepage description" without the quotes

I did both of those for my own site and it came up with some very interesting results. Then I did it for a couple of very well known sites and came up with some dupe issues. (webmasterworld being one of them ;) )
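One defensive measure against made-up query strings like the one above can be sketched as a whitelist: work out the URL with only the parameters the page actually uses, and 301 anything else to it. This is an illustration under assumptions; the function name and the `page` parameter are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_unknown_params(url: str, allowed=frozenset()) -> str:
    """Return `url` with every query parameter not in `allowed` removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path or "/", urlencode(kept), "")
    )


if __name__ == "__main__":
    print(strip_unknown_params("http://www.webmasterworld.com/?jordo_needs_a_drink=yes"))
    # 'page' is a hypothetical parameter the site is assumed to use.
    print(strip_unknown_params("http://www.example.com/forum?page=4&x=1", {"page"}))
```

If the cleaned URL differs from the requested one, the server can answer with a 301 to the cleaned form instead of serving a second copy of the page.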


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved