Forum Moderators: Robert Charlton & goodroi
Recently, in one of Matt's videos, he also commented that the matter was complex.
When I looked into these forums [ unless I missed something ] I could see nothing that described the elements in a high-level format that could be broken down and translated into a framework for easy management.
Does anyone believe they have mastered the comprehensive management of dupe content on Google into a format that can be shared on these forums?
The DMCA procedure takes too long. Has anybody had any experience of reporting the site to the plagiarist's ISP?
Scraper sites do also come into the equation, but only where they publish exact or almost-exact copies. Normal publishing of the same articles on multiple unrelated sites is not so much of a factor, because the rest of the navigation, content, and code on those pages will be quite different.
[edited by: g1smd at 11:41 pm (utc) on Aug. 28, 2006]
Honestly, it's not really worth it unless both you and the person you are suing have deep pockets. Additionally, the person you are suing could live in another country and might be impossible to locate.
>> The DMCA procedure takes too long. Has anybody had any experience of reporting the site to the plagiarist's ISP? <<
Yes. I wrote a firmly worded notification to the site owner, cc'ing the Admin Contact, and the management of the site's ISP/hosting company of an intent to file a DMCA complaint.
Within 48 hours corrective action had taken place to the point where I felt no need to submit the DMCA complaint.
It is a good idea to document your taking exception to someone's infringing on your branded materials.
The stuff to be worried about, and fixing up, is where the same content appears at multiple URLs.
This has been hashed around, but what percentage of duplicated content posted on external URLs triggers this 'filter' in Google, if you will?
I notice some high-end article websites doing fine in Google, although they are publishing content that is duplicated on other websites, and most of them have the same meta description, title, and keywords.
Does this mean that these websites have been deemed to have the original content, while the rest are filtered for the same content? What decides who gets to wear that hat: internal PageRank, TrustRank, age?
A hard and fast answer, please: if I have content that was published on my website long ago, and that others have since republished (long before this fiasco began), is it wiser to delete it or, better yet, rewrite it?
Todd
>> Recently, in one of Matt's videos, he also commented that the matter was complex. When I looked into these forums [ unless I missed something ] I could see nothing that described the elements in a high-level format that could be broken down and translated into a framework for easy management. Does anyone believe they have mastered the comprehensive management of dupe content on Google into a format that can be shared on these forums? <<
Ah yes, it seems that I have comprehensively managed duplicate content on these forums!
This thread could be accessed using:
www.webmasterworld.com/google/3060898.htm
www.webmasterworld.com/google/3060898-1-30.htm
www.webmasterworld.com/google/3060898.htm&printfriendly=1
www.webmasterworld.com/google/3060898-1-30.htm&printfriendly=1
We are having a problem on our site that may have been caused by this issue of different URLs accessing basically the same content.
On our site, a page about blue widgets would have a lot of text and info about blue widgets, and then a comments box at the bottom. People could post their info there. It was possible to navigate back/forwards within the comments box and also to sort by different criteria, which led to URLs like blue_widgets.php?comments_page=2 or blue_widgets.php?comments_sort=date.
All these pages would have had the same main content about blue widgets, just showing different comments. Many of these are listed in supplemental now.
We put 'noindex' tags on all the pages a couple of months ago, apart from the main 'blue widget' page. Is this enough even though these have already been listed as Supplemental? I may be splitting hairs here; I only bring this up because you mention using noindex to stop these pages getting into the Supplemental index, but then mention using redirects after they are already in there.
The pages that are shown as Supplemental for a site: search seem to be very volatile, however - the cache dates on them are actually going backwards. A few months ago all pages were moved to Supplemental, but were fairly recent versions of the page. A month or so ago, the Supplemental results were then pages from around early May. Now the Supplemental results are only showing yet another block from early March.
Thanks for any advice.
Yes! And I thought that I had spelt it out quite clearly in several threads already. Which bits are you not getting? I'll try to find another way to explain them again.
.
>> We put 'noindex' tags on all the pages a couple of months ago apart from the main 'blue widget' page, is this enough even though these have already been listed in supplemental? <<
That is the correct approach. You have one canonical URL that can be indexed. Google will now drop the other URLs from their index after one year has passed.
.
>> but then mention using redirects after they are already in there <<
Check my example post again: [webmasterworld.com...]
Only the URLs with differing parameters need the meta robots noindex tag. These will all be on the same domain as the canonical URL.
The URLs at non-www, on other domains, etc, need the 301 redirect to the canonical form, so that all content appears at one domain in the search results.
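As a concrete sketch (the page and parameter names here are taken from the earlier blue-widgets post, so adjust to your own site): the parameter variants would each carry this in their head section, while the bare canonical page omits it:

```html
<!-- On blue_widgets.php?comments_page=2, blue_widgets.php?comments_sort=date, etc. -->
<!-- but NOT on the canonical blue_widgets.php itself -->
<meta name="robots" content="noindex,follow">
```

The "follow" part lets the spider still pass through any links on those variant pages, even though the pages themselves stay out of the index.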
.
The cache date shown for a supplemental result sometimes depends on the search query you used.
Google has multiple results for a single URL. The normal index holds the current result. The supplemental result may contain data about what was on the previous version of the page.
During a supplemental update, Google:
- clears away older supplemental results for URLs that no longer exist or have been redirected for a long time (like a year or more),
- creates newer supplemental results (refreshes them) for duplicate content that is still duplicate and still served with "200 OK",
- creates brand new supplemental results for URLs that have been edited or redirected very recently (and will hold on to those for a year or more).
Those updates occurred in at least August 2005, February 2006(?), and August 2006.
Anyhow, we use subdomains (which can be useful with multiple production servers that access shared assets, like images), so it's inconvenient to 301 everything besides www.example.com, although I guess I don't have much choice. I think it makes something of an argument for absolute rather than relative links. In our case we use mostly relative links, so one link to one of those subdomains can result in our entire site (or most of it) being indexed that way.
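The relative-link point is easy to demonstrate: a relative href resolves against whichever hostname the crawler fetched the page from. A small illustration (the subdomain name img2.example.com is made up here):

```python
from urllib.parse import urljoin

# The same relative link "blue.html" resolves against whichever
# host the crawler happened to reach the page through:
via_www = urljoin("http://www.example.com/widgets/", "blue.html")
via_sub = urljoin("http://img2.example.com/widgets/", "blue.html")

print(via_www)  # http://www.example.com/widgets/blue.html
print(via_sub)  # http://img2.example.com/widgets/blue.html
```

So a single inbound link to a subdomain can seed a crawl in which every relative link keeps the site on that subdomain - duplicating the whole site at a second hostname.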
I would just like to ask what happens if the site uses a CMS, or something similar, that does all sorts of redirects to reach the destination page, as they always do.
Most CMS systems I know of (JSP, ASP, etc.) do this, and redirect to a URL that, even worse, includes a session ID!
What I am asking is: can Google crawl this, will it find duplicate content (it will), and why, in this day and age, is this still a problem?
My answer is that -
a) this type of shop is not good for Google
b) but after years of this type of shop it should be
My question is: how is this solved - not for me and the people I know, but for the innocent people who buy shop software to publish their online shop (a shop that will never make it into Google)?
I have just reviewed what I have written and I know the answer - us techies will sort it.
What happens to the rest of them?
Why is it fair that we can fiddle about technically and help them - what about the sites that don't have a clue?
The only guys that win at the moment know this. I know, I have deployed thousands of e-commerce projects that have needed to be re-written.
But aside from the fact I am a great guy - who re-writes the other guys stuff?
Aren't we going backwards if we now have to re-write everything for the new Google?
What happens if he bought a site that doesn't use all of the above?
There are studies that prove that the average Joe on the street doesn't buy a "known" e-commerce platform that is nice and open source - because he is not a techie.
In fact in the UK it just doesn't happen.
So lobbying just doesn't cut it. In fact lobbying is bobbins quite frankly.
In fact unless you are from the US the word "lobby" is crap - "lobby" means nothing over here. We don't "lobby" - and the fact that US companies "lobby" US government pisses us off big time - that is where scandals originate from.
How will "lobbying" help the thousands of businesses that don't use your off-the-shelf software? The majority of e-commerce providers in the UK don't use off-the-shelf crap, so how do we lobby that?
When I managed physical-world shops, I eventually ended up being responsible for the implications of electrical codes, plumbing codes, health regulations, carpentry, architectural choices, and even the quality of the soil under the shop and the seasonal changes in the water table. None of it seemed directly related to selling our widgets, but it comes with the territory of doing business in a shop.
And so I think it goes on the web. If you want people to locate you through Google (and you certainly can set up shop on the web without that approach) then you need to stay in touch with the evolution of their technology and with how your chosen technologies may interact with theirs.
There are at least two distinct issues here --
1. content intentionally published at several different locations
2. content that was intentionally published at only one address, but that the domain's server technology unintentionally allows to be accessed through many different URLs.
It's that second case that seems to blind-side people so very often. It's this second situation we need to look at most of all.
What happens in the best possible scenario is that Google chooses what seems to be the "best" address for that content and filters out the rest, so that search results don't end up being a choice of ten addresses for the same exact thing. But sometimes Google gets flooded with so many addresses that appear to be the same content that something goes "tilt" and big problems appear for the site.
Google is only working to give their end users a good search experience. If you have good content related to the topic searched on, then Google would like to show your page in the results too. But if you create a big "cloud" of alternate addresses, then you are making things very difficult for them. And truth be told, any one site probably needs Google more than Google needs any one particular site.
That's the reality of the situation, as I understand it.
I said earlier that I thought the alignment between software and the SERPs frankly sucks [ it's like pre-Windows days - way too technical and uncoordinated - IMHO ], which demonstrates, if I'm right, that apart from a global handful of webmasters, few have any clue about the implications of the way they do things with online CMSs.
Three developers that I know well still pump out SE-incompatible CMS solutions for top-branded sites and clients - I mean, we need to get some alignment happening [ sorry - I'm getting excited :) ]
How about a big red sticker on the box :
GUARANTEED - will not cause you any duplicate content problems.
GOOGLE COMPATIBLE
Money back guarantee.
Once one does it, everybody else will do it or lose out. G could be instrumental in some QC drive.
And what would be great is if Google kept the manufacturers in the loop about designing their products properly. Just a bit of PR once in a while would help raise the awareness. Just my thought - I'll clock off now!
[edited by: Whitey at 6:20 am (utc) on Aug. 30, 2006]
My site is indexed in Google like this: www.mysite.com and www.mysite.com/index.php - probably some links point back to my home page with the .php on them.
I didn't think it was a huge problem until I discovered that the two have different dates in the cache!
How can I fix this problem?
I had a major headache with this, but a site of mine was on ASP with shared hosting, so I had no access to IIS. In the end I did a 301 redirect from index.asp to the domain, and changed the default page, which is now not linked to at all (as all links to the home page now point to the domain).
There is some longer code that will handle all index pages on the site even those in folders and subfolders. See the thread at [webmasterworld.com...] for some examples.
First, redirect all instances of *.domain.com/*/*/index.php to the www.domain.com/*/*/ version.
Next redirect all non-www URLs to the www version. This catches all the other URLs that need redirecting.
Do the index redirect first. Do NOT do the non-www redirect first.
You need to avoid redirecting domain.com/index.php over to www.domain.com/index.php and then on to www.domain.com/. Doing it the other way avoids this redirection chain.
You want this to happen:
- domain.com/index.php redirects directly to www.domain.com/
- www.domain.com/index.php redirects directly to www.domain.com/
- domain.com/ redirects directly to www.domain.com/
- domain.com/anything.else redirects directly to www.domain.com/anything.else
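Those four rules can be sketched in .htaccess roughly like this (a sketch assuming Apache with mod_rewrite enabled, and www.domain.com standing in for your canonical host - adjust the names to suit):

```apache
RewriteEngine On

# Index redirect FIRST: /index.php on any hostname goes straight to the canonical root
RewriteRule ^index\.php$ http://www.domain.com/ [R=301,L]

# Then the non-www redirect catches every other URL on a non-canonical host
RewriteCond %{HTTP_HOST} !^www\.domain\.com$ [NC]
RewriteRule ^(.*)$ http://www.domain.com/$1 [R=301,L]
```

With the rules in this order, domain.com/index.php hits the first rule and lands directly on www.domain.com/ in a single hop, instead of chaining through www.domain.com/index.php.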
The redirected URL will continue to appear as a supplemental result for about a year after the redirection is implemented. The cache for that URL will be frozen. It will continue to show the content just as it was, a few weeks before the redirect was implemented. You cannot change that action. Ignore it.
# Index redirect first, so any /index.php goes straight to www.mysite.com/
RewriteRule ^index\.php$ http://www.mysite.com/ [R=301,L]
# Then send any non-empty, non-www hostname to the www version, keeping the path
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.mysite\.com$ [NC]
RewriteRule (.*) http://www.mysite.com/$1 [R=301,L]
I also have multiple paths to the same categories (which leads to different URLs for the same page), which is another issue... Oh, and g1smd - your PM box is full.
If I remove every article from my site that duplicates an article on another site, and return a 404 in a header check for that URL, would you expect the same effect on ranking to be seen (assuming the change in late July was caused by duplicate content)?
Google will continue to show the redirected URLs as supplemental results for one year before dropping them. Don't worry about that, they are not harming things as long as the redirect is installed and working.
For example...
www.mysite.com/widget/
www.mysite.com/widget/default.asp
www.mysite.com/widget/default.asp?Partner=ABC
www.mysite.com/widget/default.asp?Partner=DEF
If all these URLs had identical page content and the partner parameter is only used for tracking referring sites, would there be a duplicate content penalty?
Would Google resolve these URLs?