
Google SEO News and Discussion Forum

Duplicate Content - Get it right or perish
Setting out guidelines for a site clean of duplicate content
Whitey




msg:3060900
 12:00 am on Aug 26, 2006 (gmt 0)

Probably one of the most critical areas of building and managing a website is dealing with duplicate content. But it's a complex issue with many elements making up the overall equation: what's in and what's out, what's on site and what's off site, what takes precedence and what doesn't, how one regional domain can or cannot coexist with another's content, what percentage of sameness matters, and so on - and how the consequences are treated by Google in the SERPs.

Recently, in one of Matt's videos, he also commented that the matter was complex.

When I looked through these forums [ unless I missed something ] I could see nothing that described the elements in a high-level format that could be broken down and translated into a framework for easy management.

Does anyone believe they have mastered the comprehensive management of dupe content on Google into a format that can be shared on these forums?

 

Quadrille




msg:3061147
 8:55 am on Aug 26, 2006 (gmt 0)

The management of dupe content is complex simply because there is no one single way to be sure - it all depends on how you build your site, how you link within it, whether (and how) you syndicate or share content.

For example, most of my sites are plain vanilla HTML; none of them have dupe problems because I do not duplicate, clone my stuff on article farms, or reprint other people's stuff.

I don't syndicate my stuff, because I use it to attract visitors (and return visitors) to MY sites. I don't use other people's stuff for similar reasons, and if I do like something, I'll simply link to it. I 301 from domain to www.domain as a matter of routine. For me, it isn't complex.

But for someone who buys in content, or syndicates content, or uses a dynamic, database system ... there could be problems, depending on how they buy content, how they syndicate, or how they operate their database and content management.

So the 'rules' are not the same for everyone!

The simple rule is 'Don't Dupe' - it's the practice that's complex :)
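The domain-to-www 301 Quadrille mentions is usually a one-line server rewrite rule. Purely as an illustration of the same idea in application code, here is a minimal sketch (Python/WSGI, with www.example.com as a stand-in hostname rather than anyone's actual setup):

CANONICAL_HOST = "www.example.com"  # placeholder hostname

def canonical_host_middleware(app):
    # Wraps a WSGI app: 301 any request on another hostname to the
    # same path and query string on the canonical www hostname.
    def wrapped(environ, start_response):
        host = environ.get("HTTP_HOST", "").split(":")[0].lower()
        if host and host != CANONICAL_HOST:
            location = "http://" + CANONICAL_HOST + environ.get("PATH_INFO", "/")
            if environ.get("QUERY_STRING"):
                location += "?" + environ["QUERY_STRING"]
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return app(environ, start_response)  # already on the canonical host
    return wrapped

Wrapped around a site's WSGI application, every request that arrives on the bare domain answers with a 301 to the matching URL on the www hostname.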

g1smd




msg:3061167
 9:18 am on Aug 26, 2006 (gmt 0)

>> Does anyone believe they have mastered the comprehensive management of dupe content on Google into a format that can be shared on these forums? <<

I thought that I had already repeated the various points ad nauseam, for the last year or more, both in the posts about managing supplemental results, and in the posts exposing flaws in the design of forum and cart software?

Whitey




msg:3061268
 1:05 pm on Aug 26, 2006 (gmt 0)

So the 'rules' are not the same for everyone!

Quadrille - that's my concern - it's complex, and it would be good if there were some method of articulating the main areas and then fleshing out the elements, which would likely lead to "management" of the issues arising from those elements.

I thought that I had already repeated the various points ad nauseam, for the last year or more, both in the posts about managing supplemental results, and in the posts exposing flaws in the design of forum and cart software?

g1smd - I shouldn't have said *nothing* - my big mistake - these and several other things you have pointed out have been a huge help, in particular the internal content, linking/architecture, and your recent post which Matt "chimed in on" - thanks.

But I was thinking more along the lines of the issue that Quadrille describes as a problem of complexity, and how that could be headlined or sectioned and then expanded on over time into a single reference. Maybe I'm asking too much - I'm not sure - but I thought I'd put out the call anyway. It's largely an editorial question, as I sense that most of the knowledge just needs to be brought together.

Quadrille




msg:3061389
 4:10 pm on Aug 26, 2006 (gmt 0)

I think you are quite right; the g1smd's of this world have provided many useful answers over the months (years even), responding to similar but unique questions from all kinds of people; it would be good for someone to gather all that into one comprehensive article (though I suspect that would require a few caveats and exceptions!).

It would take a brave person to try, I reckon ...

tedster




msg:3061539
 6:45 pm on Aug 26, 2006 (gmt 0)

This is quite a coincidence -- the stars must be aligned! g1smd and I have been in conversation over the past day or so talking about creating such a thread for easy reference. It's coming, as time permits. I agree very much that we need it.

Whitey




msg:3061690
 10:40 pm on Aug 26, 2006 (gmt 0)

Certainly it would take someone very brave, because the complexity means that even the best brains could leave out some key areas.

I'm wishing anyone who takes this on the very best - perhaps Matt, Vanessa, GG and Adam could give some thought to how they could support this.

What I'm afraid of is that we enter into this and accidentally leave out some key area in all the complexity.

g1smd




msg:3061788
 12:51 am on Aug 27, 2006 (gmt 0)

Take 2:

Should be easy enough. If you want a head start, read all my really long posts of the last month or so (anything talking about supplemental results, duplicate content, duplicate titles and meta descriptions, and multiple-URL bugs in forum software packages).

It's at least a 3-dimensional problem.

On one axis, all the pages of your site (just the canonical URL for each one). You have control over this axis. There is one entry for each "page" of content.

On another axis, all the alternative URLs for the same content (the duplicate content: www and non-www, multiple domains, and differing URL parameters), some of which may already be marked as supplemental results. These need to be fixed so that only one URL per page of content can be indexed. The fix is the 301 redirect for all alternative URLs. Once the fixes are in place, the supplemental results that already existed will continue to show for another year, but can be safely ignored. A few weeks after the fixes are in place, non-canonical URLs that are now redirected or show 404 will be converted to supplemental results and will continue to show for another year. Those too can safely be ignored once they reach that state. You have some control over this axis while those URLs are not shown as supplemental results.

On the final axis, there are two things: the dust trail of supplemental results representing older versions of the same content (for currently indexed URLs) where those pages were edited long ago, and the supplemental results for pages that were deleted long ago. These can both safely be ignored. Google holds on to them for a year before cleaning them up. You do not have control over this axis.

The goal is to lengthen the first axis while reducing the second axis to one entry. The length of the first axis is also reduced (BAD!) by the appearance of pseudo-duplicate content (caused by too-similar titles and meta descriptions), so making sure that all titles and meta descriptions are unique is also part of lengthening that axis.
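One practical way to spot the pseudo-duplicate problem described above is to fetch your canonical URLs and group them by title and meta description. A rough sketch using only the Python standard library (the URL list is a placeholder, and this is an illustration rather than a recommended tool):

# Sketch: flag canonical URLs that share a <title> or meta description.
# The URL list below is a placeholder for your own pages.
from collections import defaultdict
from html.parser import HTMLParser
from urllib.request import urlopen

class TitleMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.description = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data.strip()

urls = ["http://www.example.com/", "http://www.example.com/widgets.html"]
by_title, by_description = defaultdict(list), defaultdict(list)
for url in urls:
    page = urlopen(url).read().decode("utf-8", errors="replace")
    parser = TitleMetaParser()
    parser.feed(page)
    by_title[parser.title].append(url)
    by_description[parser.description].append(url)

for text, group in list(by_title.items()) + list(by_description.items()):
    if len(group) > 1:
        print("Shared across %d URLs: %r" % (len(group), text))

Any title or description printed here is a candidate for rewriting so that every page carries something unique.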

g1smd




msg:3061810
 1:24 am on Aug 27, 2006 (gmt 0)

:-)

Discuss...

Halfdeck




msg:3061899
 3:48 am on Aug 27, 2006 (gmt 0)

Early this year, I created a few pages under one directory to test how Google reacts to duplicate content.

I created one original page with around 300+ words of text, and then various copies of that page, some copies sharing 60% of the content of the original, others 90%+. PR distribution for all pages is identical - one inlink from the domain root to each page, and no outgoing links.

Initially (pre-BD), all pages found their way into the main index (including a page 100% identical to the original). A month or two after Big Daddy roll out, all the pages in that directory vanished from the index. Surprising to me, since I expected at least the original copy and pages with less than 70% similarity to stay in the main index and to have other pages either drop or turn supplemental.

It looks to me like Google "banned" the directory and refuses to index anything inside it. It could be due to lack of trust/PR, except Google has the rest of the domain in its main index.

I assume Googlebot prefers to crawl/index trusted, valuable, frequently-updated sites first. If it finds 100 near-duplicates under one directory, and knows there are still 100,000 in that directory left to crawl, it would be more efficient to skip that directory, instead of spending time actually crawling it knowing none of it is worth keeping in the main index.
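The 60% and 90%+ figures above depend on how similarity is measured. One common rough measure is the Jaccard overlap of word shingles; the sketch below only illustrates that measure and makes no claim about how Google actually compares pages:

# Sketch: estimate the overlap between two pages as the Jaccard
# similarity of their five-word shingles. Illustrative only.
import re

def shingles(text, size=5):
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(text_a, text_b, size=5):
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

original = "placeholder text for the original 300-word page ..."
near_copy = "placeholder text for a copy that shares most of the page ..."
print("Estimated overlap: %.0f%%" % (100 * similarity(original, near_copy)))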

Whitey




msg:3061922
 5:13 am on Aug 27, 2006 (gmt 0)

I think we're talking turkey.

How about centralising this key area into a format, summarising it and referencing it out to threads? As new issues arise, the central reference can be updated in a subsequent post.

Just so that we can say .... it's here [ didn't you see that, Whitey - are you without glasses! ]. Is this clear or unclear? Oops - we missed this. Do you know what, this has just come to light.

A bit of this has been going on already, and it would save the time spent hunting around for references.

The more I think about it, it's more of an editorial issue, which needs to be flexible enough to permit new additions/revelations/angles to be posted over a lengthy period.

[ and gee, I'd be the last person to do this correctly, either technically or editorially! ]

g1smd




msg:3061995
 9:42 am on Aug 27, 2006 (gmt 0)

Back to my 3-axis thinking.

The goal is to increase search engine visibility.

The first axis does this by showing more pages in the index. Publish more pages of unique content to increase visibility. Try also to avoid the "shortening" of this axis caused by using the same title and/or meta description on multiple pages. Make them all different.

The third axis also indirectly helps in this goal of increasing visibility, by showing supplemental results for the old version of the content on those pages. You have little or no control over this. It just is.

All of the URLs on the second axis are dragging PageRank down for all of the canonical URLs listed along the first axis, by having multiple URLs for the same content. Whether this is non-www and www, or multiple domains (.com and .co.uk), or dynamic URLs with multiple differing parameter orders, is irrelevant. All of these things are negative factors, and should be avoided.

Aim to get all the alternatives out of the index. If they then turn up as supplemental results, you'll need to wait a year for them to disappear; but as long as the redirects etc are in place, the extra entries cannot harm things.

g1smd




msg:3062015
 10:49 am on Aug 27, 2006 (gmt 0)

Recently related:

[webmasterworld.com...] - especially Page 2 onwards.

[webmasterworld.com...]

[webmasterworld.com...]

Older:

[webmasterworld.com...]

[webmasterworld.com...] (parts).

[edited by: g1smd at 11:01 am (utc) on Aug. 27, 2006]

schalk




msg:3062018
 11:01 am on Aug 27, 2006 (gmt 0)

g1smd

Are you essentially saying that we must forget about those pages that go supplemental?

I fear I have been guilty of quite a few points you have mentioned in your previous posts (thanks for this valuable information)

I have been guilty of the following

1) Google has indexed both an http and an https version of some pages. (No idea how this has happened.) I am now 301 redirecting the https to http

2) We also had deep pages pointing back to home page default.htm instead of / (I have also sorted this)

My point is, if I sort all these problems, can I expect the supplementals to return to the index, or should I forget about them and essentially rename the pages as new?

g1smd




msg:3062020
 11:07 am on Aug 27, 2006 (gmt 0)

Be careful what you mean by "page". To me a "page" of content can have multiple URLs if there is an error in the site design. The optimum is for each "page" to be indexed once with just one URL that can be used to access it. What often happens is that a "page" gets indexed with multiple URLs - and then you are in trouble. The alternatives steal PageRank, turn supplemental, and their cache ages so that it no longer matches the real on-page content.

.

URLs that are supplemental, that are for live pages returning a "200 OK" for searches based on current content, and that are a duplicate of some other URL with the same content (non-www and www, multiple domains, multiple dynamic parameters, http and https), all need to be fixed with a 301 redirect to the canonical URL.

Supplemental results that represent older content at the canonical URL, can be safely ignored.

Supplemental results which represent duplicate URLs, but have already been fixed using the redirect, can also be ignored.

In those cases the supplemental result will be dropped after a year.

.

[google.com...]

g1smd




msg:3062032
 11:37 am on Aug 27, 2006 (gmt 0)

Take this thread, for example: [webmasterworld.com...]

.

This thread could be accessed using:

www.webmasterworld.com/google/3060898.htm
www.webmasterworld.com/google/3060898-1-30.htm
www.webmasterworld.com/google/3060898.htm&printfriendly=1
www.webmasterworld.com/google/3060898-1-30.htm&printfriendly=1

In this case, the other three variants would dynamically have a meta robots noindex tag added to the page to keep them out of the index.
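As a rough sketch of how a template might add that tag dynamically (Python; the canonical-URL test below is a simplification invented for illustration, not WebmasterWorld's actual code):

# Sketch: emit a meta robots noindex tag on the page variants that
# should stay out of the index; the canonical-URL test is simplified.
import re

CANONICAL_THREAD = re.compile(r"^/google/\d+\.htm$")

def robots_meta(path, query=""):
    # Return a noindex tag for non-canonical variants, '' for the canonical URL.
    if CANONICAL_THREAD.match(path) and not query:
        return ""
    return '<meta name="robots" content="noindex,follow">'

print(robots_meta("/google/3060898.htm"))                     # canonical: no tag
print(robots_meta("/google/3060898-1-30.htm"))                # paginated: noindex
print(robots_meta("/google/3060898.htm", "printfriendly=1"))  # print view: noindex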

.

The thread might also be available at:

webmasterworld.com/google/3060898.htm
webmasterworld.com/google/3060898-1-30.htm
webmasterworld.com/google/3060898.htm&printfriendly=1
webmasterworld.com/google/3060898-1-30.htm&printfriendly=1

It might also be at:

www.someotherdomain.com/google/3060898.htm
www.someotherdomain.com/google/3060898-1-30.htm
www.someotherdomain.com/google/3060898.htm&printfriendly=1
www.someotherdomain.com/google/3060898-1-30.htm&printfriendly=1

In all those other cases you would set up a dynamic 301 redirect, with everything going to the single correct URL on the correct domain.
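A sketch of that dynamic redirect as a plain function (Python; the hostnames follow the example above, and the rule that collapses the paginated path down to the single thread URL is a simplification for illustration - the on-domain variants are the ones handled by the noindex tag sketched earlier):

# Sketch: compute the 301 Location for requests that arrive on the wrong
# or bare domain, so they all point at the single correct URL.
import re
from urllib.parse import urlsplit

CANONICAL_HOST = "www.webmasterworld.com"

def canonical_location(url):
    parts = urlsplit(url)
    # Collapse paginated paths like /google/3060898-1-30.htm down to the
    # single thread URL (simplified rule, for illustration only).
    path = re.sub(r"-\d+-\d+(\.htm)$", r"\1", parts.path)
    return "http://%s%s" % (CANONICAL_HOST, path)

for variant in ["http://webmasterworld.com/google/3060898.htm",
                "http://www.someotherdomain.com/google/3060898-1-30.htm"]:
    print(variant, "-> 301 ->", canonical_location(variant))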

.

Your measure of success is in how many of the correct URLs get fully indexed, not how many supplemental results for incorrect URLs hang around in the index after the fix is put in place. The supplemental results for redirected URLs will hang around for a year; you cannot control them. Ignore them, they are not harming things.

However, supplemental results for URLs that return "200 OK" for current content searches are a warning sign that something needs to be fixed.

Sidenote: don't be fooled by searches that return supplemental results when you search for old content that used to be on the page, but is no longer there. That is normal. Google holds on to those "dust trail" results for a year.

Likewise, Google holds on to 404 pages for a year, marking them as supplemental. Make sure that they really do return a 404 HTTP status code, then move on.
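Both checks - that a fixed duplicate URL now answers with a 301, and that a removed page really returns a 404 rather than a "200 OK" error page - are easy to script. A sketch using Python's standard library (http.client is used because it does not follow redirects; the URLs are placeholders):

# Sketch: report the raw HTTP status (and any Location header) for a
# list of URLs, to confirm that 301s and 404s are really in place.
import http.client
from urllib.parse import urlsplit

URLS = [
    "http://example.com/mypage.htm",            # expect 301 to the www URL
    "https://www.example.com/mypage.htm",       # expect 301 to the http URL
    "http://www.example.com/deleted-page.htm",  # expect 404
]

for url in URLS:
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc)
    conn.request("HEAD", parts.path or "/")
    response = conn.getresponse()
    print(url, response.status, response.getheader("Location", ""))
    conn.close()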

Bewenched




msg:3062178
 4:27 pm on Aug 27, 2006 (gmt 0)


However, supplemental results for URLs that return "200 OK" for current content searches are a warning sign that something needs to be fixed.

What is there to fix if the content is current?

g1smd,

I'm pretty much in the same boat as schalk with the https/http versions which caused dupes, along with the whole www/non-www garbage.

What you're saying, g1smd, is that those URLs are basically going to stay supplemental for a year.

The real question is, if the canonical problems have been fixed and we need our traffic back from those pages, should we

1. move the pages and 404 the supplemental ones.
2. move the pages and 301 redirect them
3. abandon the entire directory if all or most of it went supplemental, and move the pages.

On a side note, I have started seeing some of the recent supplementals climb out of supplemental status, but no PageRank is given.. in fact the bar is grey, but the results are in Google?

Beachboy




msg:3062185
 4:42 pm on Aug 27, 2006 (gmt 0)

If you're concerned about duplicate content, then rewrite it. If you don't want to do it yourself, there is a very well known online classifieds website out there with a "gigs" category. Place a free advertisement for a writer to do it. And suddenly you have no more dupe content worries.

SuddenlySara




msg:3062221
 5:46 pm on Aug 27, 2006 (gmt 0)

I think tedster and g1smd should write an ebook together and sell it!
Their thoughts and information have been tops on all of these google issues.

Bewenched




msg:3062275
 7:02 pm on Aug 27, 2006 (gmt 0)

Well .. it wasn't duplicate content until the site got spidered under SSL .. so there are two versions ..

https://www.example.com/mypage.htm
and
http://www.example.com/mypage.htm

[edited by: tedster at 11:36 pm (utc) on July 1, 2007]
[edit reason] switch to example.com - it can never be owned [/edit]

g1smd




msg:3062362
 8:56 pm on Aug 27, 2006 (gmt 0)

>> >> However, supplemental results for URLs that return "200 OK" for current content searches are a warning sign that something needs to be fixed. << <<

>> What is there to fix if the content is current? <<

What you will find is that you either have duplicate content (multiple URLs for the same content), in which case you need to install noindex tags or redirects on the alternatives; or you have pseudo-duplicate content (too-similar titles or meta descriptions), which needs rewriting. That is what needs fixing. The site should return each piece of content under a single "200 OK" URL.

>> I'm pretty much in the same boat as schalk with the https/http versions which caused dupes, along with the whole www/non-www garbage. <<

Yes, that is another type of duplicate which you fix with either a redirect or noindex tag on the https.

>> What you're saying, g1smd, is that those URLs are basically going to stay supplemental for a year. <<

Once you fix a problem on the site, the URL will continue to show as a supplemental result for a year. It might not rank, but the other URL, the canonical URL, will usually quickly regain position.

>> The real question is if the canonical problems have been fixed and we need our traffic back from those pages... <<

The canonical URL for a page of content should rank quite quickly after fixing the alternative URLs by making them issue redirects or noindex tags.

>> should we...

>> 1. move the pages and 404 the supplemental ones. <<

No. The URL will continue to show as supplemental for a year.

>> 2. move the pages and 301 redirect them <<

You should have a 301 redirect on the alternative URLs. They will continue to show in the index for a year, and the redirect will take the visitor to the correct page.

>> 3. abandon the entire directory if all or most of it went supplemental, and move the pages. <<

Whatever you do, those URLs will continue to show as supplemental for a year. Make sure that only one URL returns a "200 OK" for each piece of content that exists on the site, and all others cannot be indexed.

Refer to the example URLs for this thread, in the post above, for guidance.

Halfdeck




msg:3062489
 12:26 am on Aug 28, 2006 (gmt 0)

>>On a side note, I have started seeing some of the recent supplementals climb out of supplemental status, but no PageRank is given.. in fact the bar is grey, but the results are in Google?

Though the last supplemental cache refresh for my sites happened around Aug 2005, as GoogleGuy hinted in one of these threads, I'm hoping Google is moving toward a more frequent supplemental cache refresh.

The delay between a cache refresh and pages returning to the index implies Google does those two things separately:

1. Retrieves the content of a URL listed in the supplemental database, and either updates the database with the new cache, or stores the new cache in a secondary database (since we're still seeing supplementals occasionally reverting back to 2004/2005 even on gfe-eh.google.com, I assume Google isn't letting go of that information. Anyway, maintaining multiple versions of data while working on a project, as a safety precaution, isn't unheard of).

2. Re-evaluates a page for inclusion in the main index.

For example, Google recently refreshed the cache on one of my pages, which is unique, validated, content-heavy, with unique META description/title, though a bit short on PageRank. I expected the page to be included in the main index during the supplemental cache refresh, but it wasn't. A few days later, I see the page pop up in the main index (both checks were done on gfe-eh.google.com).

In this case, the page remaining in the supplemental index after the cache refresh was probably due to its previous supplemental status, not due to re-evaluation of the freshly cached page.

g1smd




msg:3062503
 12:38 am on Aug 28, 2006 (gmt 0)

>> Though the last supplemental cache refresh for my sites happened around Aug 2005 <<

Often a URL is already in the main index when you search for the current content, and only appears to be in the supplemental database when you search for some words that are in an old version of the content that used to be at that URL.

That is, a page called /latest.events.html lists events happening in Spring. Later on, the page is updated: the Spring information is removed and the Summer stuff put in its place.

Some weeks later: If you search for stuff matching any Summer events you see the URL as a normal result. However, if you search for any of the Spring events, the URL is returned as a Supplemental Result, the snippet shows Spring information, but both the cache and the real page only show Summer stuff.

In that case there is no fixing to do. Google will hold on to that supplemental result for a while and then one day will completely drop it out of view.

I have seen that happen in late 2005, early 2006 and again in the last week or so.

.

When duplicate content is also a factor (multiple URLs for the same page of content), things get a lot more complicated, as different URLs may represent older and older content cached long ago and now frozen for a year. You can't immediately influence those listings. Get the redirect to the canonical URL in place, then wait up to a year for the extra listings to be dropped. If you don't place the redirect, the URL will periodically update its cache and continue to show up as a Supplemental Result for ever more.

Halfdeck




msg:3062532
 1:04 am on Aug 28, 2006 (gmt 0)

Often a URL is already in the main index when you search for the current content, and only appears to be in the supplemental database when you search for some words that are in an old version of the content that used to be at that URL.

That is, a page called /latest.events.html lists events happening in Spring. Later on, the page is updated: the Spring information is removed and the Summer stuff put in its place.

Some weeks later: If you search for stuff matching any Summer events you see the URL as a normal result. However, if you search for any of the Spring events, the URL is returned as a Supplemental Result, the snippet shows Spring information, but both the cache and the real page only show Summer stuff.

Good point. As an example, I just took a unique snippet off a blog post I published on May 8 and ran it through Google, which returned /blog/ (my blog front page) listed as supplemental, though it's listed in the main index if I run a site: search.

The cache is dated Aug 26, 2006, though the snippet I searched for is nowhere to be found (the latest entry on that page is July 17).

As you said earlier, every page in the main index seems to have its old shadow hiding in the supplemental index.

[edited by: Halfdeck at 1:18 am (utc) on Aug. 28, 2006]

Whitey




msg:3062536
 1:13 am on Aug 28, 2006 (gmt 0)

Just opening this out a bit, what's the situation with these duplicate content scenarios:

Affiliate Feeds:

www.mysite.com/my-books-content-and-title/ links to the site that feeds it with a partial XML feed, so that approximately 50% of the content is the same, although structured differently - www.theirsite.com/my-books-content-and-title/

- Can both sets of content coexist in the SERPs?
- Which content will take precedence in the SERPs, my site or their site?

Multiple regional domains [ same content ]:

Several sites have the same content and the same domain name with a different regional extension, and the owner wants to publish each site's results for its own market audience. Site 1 is mysite.co.uk on Google.co.uk; Site 2 is mysite.ca; Site 3 is mysite.com; and so on.

Will Google restrict Site 1 from publishing its results on Google.co.uk, and so on? i.e. will Google ignore the "duplicate content" on the regional SERPs because there are no other similar results on the regional Google ".co.uk" site itself?

Offsite page effects on referring links

Given that there are different types of supplementals, can the status of a link from a referring page in the index affect the receiving page's search result ranking?

i.e. if the referring page has a "supplemental status" [ noting that there are different types of supplementals ]

Meta Titles , Descriptions and content

from [webmasterworld.com...]

Building the Perfect Page - Part II - The Basics

Developing an effective <title> element. - [webmasterworld.com...]

Title Tags: A badly written title will sink your site

How to sabotage your web site without even knowing it. - [webmasterworld.com...]

Given that large db-driven sites have limits on the unique data fields and combinations they can pull in and display, and given that it is now even more necessary to have different titles and descriptions, approximately how different should they be, and how should that differentiation be weighted and structured?

Will 50 sites all with the same meta title pose a duplicate content problem in the eyes of Google?

e.g. "Blue Widgets in Memphis" Site A , Site B and so on

Past guidelines on Meta Descriptions

Has anything changed from this?

Maybe the similar-character percentages and the structure of the phrases are more of an issue? I may have missed it, but I didn't see it mentioned, and now might be a good time to think about this.

http://www.webmasterworld.com/forum5/4785.htm

Building the Perfect Page - Part III - The Basics
Developing an effective META Description Tag.

meta Description Tag (metadata)

The meta Description Tag usually consists of 25 to 30 words or less using no more than 160 to 180 characters total (including spaces). The meta description also shows up in many search engine results as a summary of your site.


CainIV




msg:3062741
 6:02 am on Aug 28, 2006 (gmt 0)

One question I have is: if I have all original content on my website, and each page is unique, can I use a snippet of the first paragraph of each unique page as its own meta description? Will this work, or will it cause some kind of duplicate issue?

Also, what is the effect if I do not use meta keywords at all?

Thanks for your insight, and great post guys,

Todd

tedster




msg:3062753
 6:26 am on Aug 28, 2006 (gmt 0)

That "snippet of the first paragraph" approach is one I've used in several cases. I've had only good results. And I've never known of a problem by omitting the meta keywords tag either.

Whitey




msg:3062764
 6:42 am on Aug 28, 2006 (gmt 0)

CainIV - When you say a snippet from the first paragraph which is unique, do you mean 100% unique?

The reason I'm asking is that some folks say a paragraph is unique even though they have only changed some of the words in each paragraph.

It may make it relevant to the user, but Google may throw it out as being too similar [ something I'm seeing ].

Therefore a key factor could be the density of the unique terms within the snippet.

g1smd




msg:3063276
 4:41 pm on Aug 28, 2006 (gmt 0)

>> As you said earlier, every page in the main index seems to have its old shadow hiding in the supplemental index. <<

YES! You've got it. Those are the supplemental results that you can safely ignore.

Google only recently deleted all of the older supplemental results from 2005 June to 2005 November, and then at the same time they also created new supplemental results for pages edited or deleted since 2005 December.

The stuff to be worried about, and fixing up, is where the same content appears at multiple URLs. You need to ensure that each piece of content can only be indexed at one canonical URL.

trinorthlighting




msg:3063302
 4:57 pm on Aug 28, 2006 (gmt 0)

Duplicate content is complex! Imagine all the ecommerce sites using manufacturers' descriptions or technical data that is duplicated.
