Forum Moderators: Robert Charlton & goodroi
Recently, in one of Matt's videos, he also commented that the matter was complex.
When I looked through these forums [ unless I missed something ] I could see nothing that described the elements in a high-level format that could be broken down and translated into a framework for easy management.
Does anyone believe they have mastered the comprehensive management of dupe content on Google into a format that can be shared on these forums?
You need to make sure that three of the four variations are served with a <meta name="robots" content="noindex"> tag on the page, so that only one variation can be indexed.
OR
You need to set up the server so that any URL with extra parameters just does a 301 redirect to the canonical form of the URL. That will help your PageRank a little too.
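For the redirect approach, a minimal .htaccess sketch might look like this (assuming the duplicates are created by a hypothetical "sessionid" query parameter; substitute whatever parameter names your site actually appends):

```apache
RewriteEngine On
# If the query string contains the (hypothetical) sessionid parameter,
# 301-redirect to the same path with the query string stripped.
# The trailing "?" in the substitution discards the query string.
RewriteCond %{QUERY_STRING} (^|&)sessionid= [NC]
RewriteRule ^(.*)$ /$1? [L,R=301]
```

Test the pattern against a handful of URLs before deploying; a rule that matches too broadly will redirect pages you wanted to keep.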
To recap, we have had problems with the following, which have been causing a large number of pages to go supplemental, primarily since April.
1) HTTPS pages being indexed for some, but not all, pages. A 301 redirect is now in place.
2) Deep pages were pointing to default.htm instead of /
3) Many pages with little content
4) Many pages with similar title or meta description tags
5) Poor inbound links
We have sorted points 1 and 2 and are in the process of addressing 3, 4 and 5.
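For point 2, one way the fix can be done (a sketch, assuming Apache with default.htm as the DirectoryIndex) is an external 301 from every .../default.htm request to the bare folder URL:

```apache
RewriteEngine On
# Only fire on real client requests for default.htm (THE_REQUEST holds
# the original request line), not on internal DirectoryIndex rewrites,
# otherwise this rule would loop.
RewriteCond %{THE_REQUEST} /default\.htm [NC]
RewriteRule ^(.*)default\.htm$ /$1 [L,R=301]
```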
We are an e-commerce site, so we tend to include the brand name of the product in the title of each page. That means we may have several hundred products with the same brand name as the initial part of the title. I had thought this could be a big problem and was considering changing it, but I have seen many competitors' websites using the same methods, and yet they are not supplemental; they also have little content on each page.
So in summary: could I be going too far in trying to fix my site? Maybe I am wasting my time if Google has other problems that could be impacting me.
It would be a fantastic feature of Google Sitemaps if it listed all supplemental pages and also indicated the reason why they were supplemental.
It is running phpBB (16,000 posts). I had done a mod_rewrite to have search engine friendly URLs.
I figured out it was from 2 issues:
1) Multiple URLs pointing to the same page
2) The same meta description on all pages
I modified robots.txt to tell the spiders to ignore all URLs but the "correct URL". However, the same meta description on all pages is still a problem.
What I have done to try to fix it is to add the "correct URL" to the robots.txt file as well, basically telling the spiders to skip ALL the files.
Next, I added an .htaccess 301 redirect from the now-blocked "correct URL" to a "new correct URL" (with the duplicated meta description removed).
I am assuming that blocking ALL the old pages via robots.txt and starting with new correct ones is the fastest way to fix the problem. I was considering just having a redirect from the correct URL to the new correct URL, without adding anything to robots.txt, but I thought that having all the old URLs blocked and starting clean was best.
Anyone care to comment on this path I chose to "fix it"?
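One caveat with the plan above: a URL that robots.txt blocks will never be fetched by Googlebot, so a 301 placed on that blocked URL will never be seen, and the old URLs can sit in the index for a long time. If the goal is to pass the old URLs' value to the new ones, it is usually better to leave them crawlable and let the redirect do the work. A sketch, using made-up URL patterns rather than your actual rewrite scheme:

```apache
RewriteEngine On
# 301 each old rewritten topic URL to its new equivalent.
# "old-topic" and "topic" are placeholder patterns only.
RewriteRule ^old-topic-([0-9]+)\.html$ /topic-$1.html [L,R=301]
```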
I had thought that this could be a big problem, so was considering changing this, but I have seen many competitors web sites using the same methods and yet they are not supplemental, they also have little content on each page.
What one domain can get away with doesn't necessarily apply to another domain. An older, more established domain, for example, can get away with more because it has Google's trust that it's not out to spam its index. A newer site, or a site with other shortcomings, doesn't necessarily have that luxury. In that case, I'd take a more conservative approach.
It produces results consistently at the top of the SERPs for "exact match searches", and consistently at the bottom of the SERPs for "broad match searches".
Is this a sign that more work has to be done on dupe content or that there is something else in play?
When Google loses its massive market share, or that share dwindles, then duplicate content will again be insignificant to the webmaster.
It takes boffins no time at all to introduce these kinds of problems in a search engine and if it was not duplicate content causing a problem it would be something else.
The best thing to do for yourself is to fix these problems on your site if you want Google to play right by you but at the same time you should spend some energy telling people who don't know any better to give the other search engines a try too. Some may even be converted which in the long run is good for your duplicate content problems.
Google does not use the meta description on internal pages; who said it was crucial?
How can I 301-redirect non-www requests to www.domain for a site on shared hosting? (The "normal redirect statements" for that purpose in .htaccess do not work as well as rewrite code.)
Or is this special task solved by the Google webmaster feature "Preferred Domain"?
Thanks
RewriteEngine On
RewriteCond %{HTTP_HOST} ^maindomain\.com [NC]
RewriteRule ^(.*)$ http://www.maindomain.com/$1 [L,R=301]
If you have some other domain that also needs to deliver the user to the same website then also add:
RewriteCond %{HTTP_HOST} ^otherdomain\.com [NC]
RewriteRule ^(.*)$ http://www.maindomain.com/$1 [L,R=301]
RewriteCond %{HTTP_HOST} ^www\.otherdomain\.com [NC]
RewriteRule ^(.*)$ http://www.maindomain.com/$1 [L,R=301]
All of these redirects preserve the original folder and filename request in the redirect.
You only need their "preferred domain" tool if you cannot set up the redirect on your site. Even then I hear from some that it is not reliable.
Be aware that redirected URLs will continue to appear in the SERPs as Supplemental Results for one year after the redirect is put in place. This is normal. You cannot change that - and you don't need to. Everything is still OK if that does happen.
I cannot use a redirect because many pages are active listings, so I'm wondering if it is enough to add this inside robots.txt:
Disallow: /cgi-bin/
as I did?
Will the 48,000 pages inside the cgi-bin directory be removed after one year? Or should I do something else?
Thanks in advance for your answer.
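On the robots.txt point: note that a Disallow line only takes effect inside a User-agent group, so the file needs at least:

```
User-agent: *
Disallow: /cgi-bin/
```

Also be aware that robots.txt only stops crawling; URLs already in the index can linger as URL-only or Supplemental entries. Google's URL removal tool (which at the time accepted robots.txt-blocked directories) is the usual way to hurry that along.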
Maybe only a crawl for meta title/description content to change - so potentially days.
PR and backlinks [ 1 update only ] have occurred; I cannot verify how long it takes for results to return, but I'm hoping it will be in the next month or so. Our only "clue" is exact match results which rank, which seems to suggest to me that another BL update will help complement the 1st update.
However, we may not be a perfect example [ I just hope we haven't missed something! ]
However, it does seem to me that Google is slowly but surely sifting this problem from their side -- at least when the dupe urls only appear in links from external domains. At least I am optimistic -- in recent weeks I've seen a few troubles sort out with no intervention from the site owners.
We have found a competitor site that offers a 'home service'. They have used the same (1500 word) article to create hundreds of pages, but each page is focused on a unique city - the only unique content on the page is the city name in the title, h1 tag, and in the content, and the page also lists a unique dealer contact info for that city's region.
Essentially, all they are changing out is the city names and contact info; the rest of the page is an exact duplicate (1,500 word article!).
BUT - they rank #1 on G for almost all the locations they have created pages for when searching: "home service city" (without quotes)...
Is this white, black, or grey? - and if it's black or grey, why does Google allow this type of duplicate content? :)
Thanks in advance.
Oh yeah - they link to all these pages from their home page...
Whitey Re: your index page.
If it is one that does, out of hundreds that do not, it is unlikely to cause a lot of grief. The "/" will be "stronger".
However, do ask them to amend their link, and/or set up an "index to / redirect" on your site.
Sounds like quite a low quality site
Quite the opposite in my experience. I watch major brands do similar things, and it appears it's the trust element that allows them to get away with it. It appears the focus is on the MO of spam sites rather than duplicate content itself. Therefore it appears Google is not bothered with duplication if it trusts you not to be a spammer or low-end website.
So, it seems that if G trusts you're not a spammer - you can spam.
Interesting. Seems to be verified by all the reports of large or corporate sites getting away with what us smaller guys get hammered for. We're not 'known' (i.e. trusted), thus disposable?
g1smd - I need to consider this technique for our site if it is proven to be working - but here is my moral dilemma, tell me what you think:
When a user searches 'home service cityname', and they come to my competitor's page - the page itself is useful - it tells them what they need to know about the topic, and provides a local resource they can contact for further help. So, if the page is useful, BUT the content is duplicate, other than the cityname - IS THIS REALLY SPAM?
This is killing me, because I don't want to be considered a spammer, but I want to do what's best for my business...
Is 'grey' the most beautiful color?
Firstly, a site:website.com *** keyword
It seems to reveal 2-3 commonly used words around the core term, in this case "duplicate content"
e.g. site:www.webmasterworld.com *** duplicate content
Note there are no supplementals as everything is unique, but try this on another site and see what you get.
Not only is it picking up common terms around the core term; if you try it on other sites, you may find that "commonly" used terms throughout the site are also highlighted. For example, observe ISO currency codes which may exist on an e-commerce site.
I wonder if this means that G is accounting for the use of common terms surrounding the core term to decide if it's duplicate or not.
Secondly, a site:website.com or site:www.website.com may show pages as supplemental,
but when you try it again for a specific page e.g.
site:website.com/ABCD/ or site:www.website.com/ABCD/ it may be "clear"
Is this site command [ even if it is "buggy" ] saying anything about the way Google is analysing the number of words/characters and the positioning of terms and characters it calculates to establish pages as "duplicate" and assign "supplemental" status to them?
Yet for other sites, a couple of pages that are say 60% identical can mean severe drops... it's clearly about the type of site rather than the duplication itself...
Put it another way: is there a % of dupe content on the overall site which tips the balance of the site's pages in the rankings, and is G's filter applied to the whole site, regardless of the keyword searches chosen?