Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

This 193 message thread spans 7 pages: < < 193 ( 1 2 3 4 [5] 6 7 > >     
Duplicate Content - Get it right or perish
Setting out guidelines for a site clean of duplicate content

 12:00 am on Aug 26, 2006 (gmt 0)

Probably one of the most critical areas of building and managing a website is dealing with duplicate content. But it's a complex issue, with many elements making up the overall equation: what's in and what's out, what's on site and what's off site, what takes precedence and what doesn't, how one regional domain can or cannot coexist with another's content, what percentage counts as the same, and so on, and how the consequences are treated by Google in the SERPs.

Recently, in one of Matt's videos, he also commented that the matter is complex.

When I looked through these forums [unless I missed something], I could see nothing that laid out the elements in a high-level format that could be broken down and translated into a framework for easy management.

Does anyone believe they have mastered the comprehensive management of dupe content on Google into a format that can be shared on these forums?


Jordo needs a drink

 6:01 am on Sep 30, 2006 (gmt 0)

This thread could be accessed using:


And www.webmasterworld.com/?jordo_needs_a_drink=yes can also reach the WebmasterWorld homepage...

My point in showing that is that you never know how another site is going to link to you.

Even if you think you've done everything you can to eliminate the duplicate content/supplemental issue, at least for the home page, everyone should do a Google search on:
site:www.mysite.com "your homepage title"
site:www.mysite.com "your homepage description"
(replacing the quoted placeholder with your site's actual title and description)

I did both of those for my own site and it came up with some very interesting results. Then I did it for a couple of very well known sites and came up with some dupe issues. (webmasterworld being one of them ;) )
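For anyone running these checks across a few sites, a trivial sketch like the one below (hypothetical helper, placeholder domain and strings) builds the two queries above so you can paste them straight into a Google search box:

```python
# Hypothetical helper: build the two duplicate-check queries suggested
# above for a given site's homepage title and meta description.
def dup_check_queries(domain, title, description):
    """Return the site: queries for spotting homepage duplicates."""
    return [
        f'site:{domain} "{title}"',
        f'site:{domain} "{description}"',
    ]

for q in dup_check_queries("www.example.com",
                           "Example Widgets - Home",
                           "Hand-made widgets since 1999"):
    print(q)
```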


 11:25 am on Sep 30, 2006 (gmt 0)

Let's not debate Google's ethics or intentions. Let's help each other get our sites properly indexed. That's the topic of the thread -- "guidelines for a site clean of duplicate content".


Sorry, but I'm convinced that google's unethical activities will work against a site clean-up.

The points I raised earlier point out just how determined google is to make sure that the internet works for google and not for webmasters.

Just let any one of these representatives enter this thread and I will again unleash the kind of questions at them that make them cower, or start advertising peanut butter and Elvis sightings.

No matter how you go about cleaning up your website there will always be a negative factor that nullifies your attempts. For instance, I've disclosed many ways how a site can be tanked by any webmaster. Yet google assures webmasters that nobody can influence your ranking in google's index.

Google also uses misleading ploys. Its button to treat versions of a site is a sublime tactic, a gimmick, and it is misleading. Some webmasters think it is a 301.

I just cannot believe some of the things I am reading.

One desperate webmaster recently asked me to help find out what had happened to his top-ranking site for a competitive keyword. He was given much well-intentioned advice by members here. He got into such a mess that nearly every aspect of his site was altered, when in fact he was still the number 1 ranking site for his niche.

The command we know about that removes this filter reveals his website at number one position. In the organic results he now ranks 120.

The best way to understand how google tricks the webmaster and the end user is to accept the fact that there are TWO RESULTS when you do a search. It happens in the blink of an eye. First, the results where your site is on the first page are produced. The filter then removes sites from that page and displays what google wants displayed. Your site is made invisible. You now think your site has tanked, when in fact it has not. It is filtered.

This is a google mechanism that is specifically designed to filter out websites to make that competitive keyword work for google. By removing relevant sites, the chances of the end user clicking the pay per click ads increase.

The more competitive a keyword the more filtration of relevance is made by google.

While such unethical tricks are played by google, it simply becomes impossible to create a foundation or a quick reference to follow in order to clean up a site. The site mentioned above DID NOT TANK. Google's organic results show the site indexed; therefore no penalty was inflicted on it. A relevance filter picked out the site at random and summarily demoted it simply for being a well-constructed site for the competitive keyword.

There is simply no point in adjusting that website. It is the quintessential site, perfectly created and designed as the authority of its niche, and thus is a prime target for google to remove from the first page results. Its presence hinders google's pay per click revenue.

And when you consider that google reserves the right to display your website's contents under another website's URL, how on earth are you going to clean up your website? The canonical issue is misunderstood by webmasters. Google reserves the right to reflect what it sees on the internet. If it determines that a different URL from yours is better equipped to display your contents, that is exactly what google will do. That is google's canonicalization process. It is different from what we think.

Google itself will duplicate your website at another URL belonging to another site if it deems that a better canonical representation of your website. So how in God's good name are you going to make sure you do not create duplicate content in google?

Time after time you read of webmasters running a site: command. How many know that google relaxed the capabilities of these commands and adjusted what they display? The inurl: command no longer does what it did a year ago. It, again, displays what google will allow you to see, not what you want to see.

Evidence of google's canonical methods is hidden. Its algo and other software often contradict each other.

Chasing one's tail is not the way to clean up a mess that does not exist.

[edited by: AlgorithmGuy at 12:04 pm (utc) on Sep. 30, 2006]


 1:03 pm on Sep 30, 2006 (gmt 0)

...there are way more G's here than most of us know about.

Suddenly Sara makes a good point there.

As for AlgorithmGuy's two long posts, I must say I was told essentially the same thing over a year ago regarding the www. I was told that it is not necessary, and the story as it was told to me was that it was used as a marketing device to make sure people knew, when they saw print ads, or ads on TV or radio, that they needed to go to the Internet to view the site.

Seems silly, but it makes sense when you consider a world full of AOL users, many of whom don't even realize that the AOL community alone isn't the Internet! There's a whole world outside of AOL. This is a true story: a friend of mine could not find my site because he wasn't going out onto the Internet; he was just staying within the confines of AOL searching for it! It was difficult explaining to him that he needed to go out onto the web to find it...

That said, I have redirected my site to the non-www, and it hasn't helped so far. I'm going to look into the nameserver issue, though. That's a great point. If the bots want to find my site, they'll have to use the correct URL - the one I choose - for it.

The thing that makes me really angry is this: While I'm spending hours and hours researching trying to find out why people can't find my site, I'm not creating new pages that others would find helpful. I would bet I could have had 500 or more hand coded, unique, original content pages online now that I don't have because I'm trying to fix the ones I already have that no one can find. They used to be #1, and now they're buried.

I haven't done anything different, Google has. And they're doing a disservice to their customers, because their customers aren't finding the good stuff at #145, or #201. So, in some respects, Google needs to take their own advice and stop manipulating their index and return results that people find helpful.

And I know, for some of the terms I check, that isn't happening now, and hasn't been happening for a while.

[edited by: AndyA at 1:04 pm (utc) on Sep. 30, 2006]


 1:21 pm on Sep 30, 2006 (gmt 0)

Jordo needs a drink,

There is no possible way that google would tank webmasterworld. Even if it had a billion duplicate pages. No possible way exists for webmasterworld to have a canonical or duplicate content issue. Not long ago a major change was made by the mods here. If it were your site, it would have gone into oblivion. Webmasterworld remained unaffected. In fact, it was assisted by google to feature where it belongs in google's serps.

Other authority sites such as msn, fifa, aol, yahoo, cnn, ebay etc cannot be affected in any way by any penalties.

Such sites MUST exist in google. Google knows that if it tanks fifa, millions of surfers will use alternative search engines. If google allowed its penalties on such sites, google would self destruct.

But its penalties apply to normal sites.

[edited by: AlgorithmGuy at 1:27 pm (utc) on Sep. 30, 2006]


 1:34 pm on Sep 30, 2006 (gmt 0)


Indeed, I believe you are right -- the results get filtered -- and what doesn't pass the filter gets dumped into the supplemental results bucket.

They are trying to improve search results for their audience -- searchers. (this is how they would explain it)

They are trying to weed out bad webmasters that intentionally produce spam so that the search results will be "better".

Unfortunately, good webmasters are having "good" results caught in the filter too.

Webmasters that learn about the 301 redirects and other anti-duplicate content measures and implement them eventually get their results restored. (hopefully)

"Eventually" does seem like an awfully long time, though, and it does feel like it is unjust.

So it is to be 1 url that returns "200 OK" per page of content, 1 "domain" per site, and any additional urls/domains that arrive at the same content should either return a 301 redirect to the preferred url/domain or return a 404.
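That rule of thumb can be sketched in a few lines of code. A minimal sketch, with made-up hostnames: exactly one canonical host answers 200 OK, known alias hosts 301 to it, and anything else gets a 404:

```python
# Sketch of the "1 URL per page, 1 domain per site" rule above.
# Hostnames here are placeholders, not a real configuration.
CANONICAL_HOST = "www.example.com"
ALIAS_HOSTS = {"example.com", "www.example.net", "example.net"}

def respond(host, path):
    """Return (status, location) for an incoming request."""
    if host == CANONICAL_HOST:
        return 200, None                              # the one URL that serves content
    if host in ALIAS_HOSTS:
        return 301, f"http://{CANONICAL_HOST}{path}"  # point variants at the canonical
    return 404, None                                  # unknown host: nothing here

print(respond("example.com", "/widgets.html"))
# (301, 'http://www.example.com/widgets.html')
```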


 2:18 pm on Sep 30, 2006 (gmt 0)

Unfortunately, good webmasters are having "good" results caught in the filter too.


Yes, and possibly deliberately by google.

I just looked at a site a webmaster asked me to review. I truly believe that this webmaster has been dealt a death blow by google simply because his website is relevant to the search terms and topics of interest the end user may look for.

A long established website that catered for a specific niche.

A child could now create a single page and rank higher than the authority site that google has demoted.

This is all crap thrown in the face of webmasters by google. Scraper sites galore rank ahead of the authority website. Even for the vaguest keywords, unrelated sites that simply have a link to the authority site tower in rank well above it.

It is as though google has demoted the site because it gets in the way of google's pay per click. It is too relevant and would provide the end user what they look for. In google's eyes this is not good for its business model.

This is a classic canonical issue. Google has deemed that other websites better represent the authority site. Even a simple link from an empty page is, canonically, a better representative of the authority site regarding its contents. This really is how google determines canonicalization. If the empty pages that contain a simple link were also to have the authority site's contents, that is how google would still display its results: other URLs deemed by google to represent that particular content of the original and legitimate site.

This is the way google reflects what it sees on the web. It has no commitment to index or to rank a website, whether we agree with that or not.

Should this webmaster now go hunting around in a desperate attempt to look for a basis of a collective method to fix his website?

There is not a damned thing wrong with his website to fix.
It is simply google removing relevance from the eye of the end user.


[edited by: AlgorithmGuy at 2:36 pm (utc) on Sep. 30, 2006]


 2:38 pm on Sep 30, 2006 (gmt 0)

>> This is all crap thrown in the face <<


I've got my cleanup kit and I'm cleaning it off my face.

There is nothing I can do about google.

Jordo needs a drink

 3:58 pm on Sep 30, 2006 (gmt 0)

There is no possible way that google would tank webmasterworld.

That wasn't the point of the post. I'm just saying webmasters need to check their sites even if they think they've corrected everything. I just used webmasterworld as an example, just as others did.


 4:46 pm on Sep 30, 2006 (gmt 0)

That wasn't the point of the post. I'm just saying webmasters need to check their sites even if they think they've corrected everything. I just used webmasterworld as an example, just as others did.

Jordo needs a drink,

"That wasn't the point of the post"

Hmmm, you make it sound as though I've done something wrong. Should I not have clarified what some webmasters may or may not know, or just kept quiet?

Your trigger-happy readiness to admonish a webmaster is another reason that google will always win against webmasters; we are very quick to snap at each other.

I doubt very much if I miss focal points of a discussion. Especially to do with google.

The topic of this discussion is to help find ways to avoid as best possible the duplicate content problems in google. My points are very valid and pertinent to the thread.

And I am not here to miss points raised.

You actually raised a good point and I expanded on it, making it clearer in my view, so that webmasters who may not have been aware of it were made aware.

If my comments are useless and off target, I'd be quite happy not to post and to mind my own business.

If you want my comment about missing a point: you say that webmasters should check their websites even if they think they got it right. Check what, when there are only deceptive guidelines to go by?

I'm afraid that sort of comment is no good to a webmaster whose website has tanked in google. At least my comments dig deep. Deep under google's skin.


[edited by: AlgorithmGuy at 4:51 pm (utc) on Sep. 30, 2006]

Jordo needs a drink

 5:23 pm on Sep 30, 2006 (gmt 0)

Your trigger happy display to admonish a webmaster

Who did I admonish? The only thing I did was tell you that google tanking webmasterworld wasn't the point of my post.

The topic of this discussion is to help find ways to avoid as best possible the duplicate content problems in google.

Exactly! And that's what my post did. Give another example and a way for webmasters to check their sites for the unexpected dup's.


 7:37 pm on Sep 30, 2006 (gmt 0)

Exactly! And that's what my post did. Give another example and a way for webmasters to check their sites for the unexpected dup's.

Jordo needs a drink,

OK, we are on the same playing field.

Let us assume a webmaster has done everything he/she knows to make sure no duplicate content exists. I'd like to clear up something here before we go on.

A website can actually benefit from having very similar or near-identical pages in google. At least that is what I see in the indented results, so long as they are on the same domain. One way of getting a competitor off that first page is to try to get an indented result.

Google is not restricted to giving 10 URLs on its first page of organic results. In extreme cases there may be only 5 websites in a result, all with indents. The more unique a site is across all its pages, the less chance of an indent. So this slightly contradicts the duplicate content issue.

As for duplicate content within the same domain warranting a penalty, I doubt that penalty exists, because google is clever enough to determine that it comes from the same site. It rewards the webmaster with an indented listing and a link indicating this site has more of what you are looking for. Despite what I think of google, it does a good job with this anomaly.

Splitting of PageRank, or a sudden link that causes yet more duplication and further splitting of PageRank, may show up as a sudden drop in google ranking during a rolling update or a main update, sometimes sending the site into oblivion.

Google is indeed sensitive to a sudden mass of links relating to a website. The sudden link above makes a whole website's worth of links accessible to google at once, based on a non-resolving site with relative links.

If a 100-page site had 100 relative links where all pages can access each other, then that rogue link pointing to a vulnerable site could have devastating effects on that site: constant splitting of PageRank, etc. Somehow, very few sites seem to recover even after everything has been done to make sure all duplications are rectified.


[edited by: AlgorithmGuy at 7:44 pm (utc) on Sep. 30, 2006]


 8:00 pm on Sep 30, 2006 (gmt 0)

>> There is no possible way that google would tank webmasterworld. <<

When Brett put a noindex tag on the site Google delisted the whole lot within weeks.

Once it was removed back it came.

However, the site is carefully designed to avoid most Duplicate Content issues. Most of the fixes discussed here in the forum have already been implemented on the site itself.


 8:46 pm on Sep 30, 2006 (gmt 0)

>> That said, I have redirected my site to the non-www, and it hasn't helped so far. <<

In what way hasn't it helped? And how do you know?

If you are looking at the number of www pages still showing as Supplemental, then you are looking at the wrong thing. They will stay in the index for one year before being dropped, but they do NOT count as Duplicate Content once the redirect is applied to the URL. For your site, you should be watching how many non-www pages get fully indexed. Any that remain Supplemental after a few months still have a problem that needs fixing.

[edited by: g1smd at 9:08 pm (utc) on Sep. 30, 2006]


 3:29 pm on Oct 3, 2006 (gmt 0)

Hmm, how Google manages to put my site into 90% supplemental listings, while a guy who registered a domain in August and copied my site (pages, site structure, tags, content, everything) is not supplemental with MY content is beyond me.

I'm glad MSN and Yahoo haven't fallen for this.

Time to tell somebody to take the content they've stolen down...


 5:50 pm on Oct 3, 2006 (gmt 0)

It's all because you're now the duplicate....


 12:18 am on Oct 4, 2006 (gmt 0)

We also tie all of our articles together using index pages. Each index page contains about 30 links to articles. The entry in the index page contains a small snippet from the article. Usually the first paragraph. This first paragraph is also used in the meta-description tag. Could this be causing us issues? Could it be considered duplicate?

We noticed the same issue on our site. I guess you would need to re-write these snippets, as they seem to be duplicate content.

[edited by: tedster at 12:25 am (utc) on Oct. 4, 2006]
[edit reason] fix quote formatting [/edit]


 1:45 am on Oct 4, 2006 (gmt 0)

Pikono - Did rewriting the snippets fix your issue? Curious if this has been our curse for a while.


 3:55 pm on Oct 4, 2006 (gmt 0)

We haven't decided yet what to do with these snippets.

We are considering getting rid of these pages and including the snippets on other pages that have a lot of unique text, so that they will represent a small % of the total unique content.


 7:03 am on Oct 9, 2006 (gmt 0)

g1smd> Lobby the manufacturer to make a better product. It's the only way that change will occur. That change will benefit everyone.

You would have thought that would be a good idea, but a lot of these Open Source folks feel that "doing this SEO stuff" will somehow taint them and that it is "all black magic and voodoo". While they seem to have understood the need for meaningful URLs rather than ?id=3943&lang=en etc, it seems that a lot of the more complex stuff really goes over their heads, or they simply just don't believe it. Even where I have given examples of where their product is failing, they won't believe what is in front of their eyes.


 4:18 pm on Oct 9, 2006 (gmt 0)

Just to keep this material in a relevant thread... this is the fix for both the index page vs. "/" and the non-www vs. www "duplicate content" problems:

The check for index pages should also force the domain to the www version in the rewrite, and the index check should be both domain insensitive (working for both www and non-www index pages), and should occur before any check for non-www URLs:

RewriteEngine on

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html? [NC]
RewriteRule ^(([^/]*/)*)index\.html?$ http://www.domain.com/$1 [R=301,L]

RewriteCond %{HTTP_HOST} ^domain\.com [NC]
RewriteRule ^(.*)$ http://www.domain.com/$1 [R=301,L]

First, this forces all index pages, both index.html and index.htm to / for both non-www and www, and forces them all to be on www. The redirect works for index pages both in the root and in any folders, and the 301 redirect preserves the folder name in the redirect.

Secondly, for all pages that are on non-www the other 301 redirect forces the domain to be www. This second directive is never used by index pages as the first directive will have already converted all of them.

This code goes in the .htaccess file on an Apache web server.
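For anyone who wants to sanity-check which URL a given request should end up at, here is a rough Python model of those two rules (placeholder domain, and only a simulation of the rewrite logic; always verify the live server's actual response headers as well):

```python
import re

# Rough Python model of the two .htaccess rules above.
CANONICAL = "http://www.example.com/"   # placeholder canonical domain

def redirect_target(host, path):
    """Return the 301 Location for a request, or None for no redirect."""
    # Rule 1: index.html / index.htm in any folder -> folder root on www.
    # Runs first and is host-insensitive, as the post describes.
    m = re.match(r"^((?:[^/]*/)*)index\.html?$", path.lstrip("/"), re.I)
    if m:
        return CANONICAL + m.group(1)
    # Rule 2: non-www host -> www host, same path.
    if host.lower() == "example.com":
        return CANONICAL + path.lstrip("/")
    return None  # already canonical: serve the page with 200 OK

print(redirect_target("example.com", "/dir/index.htm"))
# http://www.example.com/dir/
```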


 7:52 pm on Oct 10, 2006 (gmt 0)


Would it be possible to give such specific instructions for IIS? I have taken care of mine through my hosts, but a lot of sites on IIS face the canonical problem too, and their owners don't know how to explain to their hosts what needs to be done.


 8:11 pm on Oct 10, 2006 (gmt 0)

Windows servers are a different beast. The latest thinking on this forum about the default.asp --> domain root fix on IIS is that you need a third party rewrite plug-in -- although, if a plug-in can do it, then there's got to be a way to do directly what the plug-in does. We just haven't uncovered it yet. See this thread:

domain vs. domain/default.asp - a Windows server problem [webmasterworld.com]

The no-www and with-www canonical issue can be handled in the IIS "Internet Services Manager" dialogue


Domain Level
"non-www" to "with-www"
Permanent 301 redirect in IIS

1. After the www.example.com website is set up, now set up
example.com (without the www) in Internet Services Manager.

2. Select the example.com web site in Internet Services
manager and enter the properties.

3. In the Home Directory tab, change the option button "When
connecting to this resource the content should come from" to
"A redirection to a URL".

4. In the "Redirect to" box, enter http://www.example.com$S$Q

5. Check the checkbox that says "A permanent redirection for
this resource." Otherwise you get a 302 redirect, the default.


In #4 above, $S is the requested URL path and $Q is the query string. Those variables make the 301 redirect work for any full URL, including the query string, adding only the "www" and retaining the entire URL rather than redirecting everything to the domain root.
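A small sketch of how that "Redirect to" value composes, assuming $S expands to the requested path and $Q to the query string as described in this post (hostname is a placeholder):

```python
# Sketch of the expansion of 'http://www.example.com$S$Q'.
# Assumes $S = requested path, $Q = query string, per the post above.
def iis_redirect(path, query=""):
    """Compose the 301 target URL for a redirected request."""
    q = "?" + query if query else ""
    return "http://www.example.com" + path + q

print(iis_redirect("/widgets/list.asp", "page=2"))
# http://www.example.com/widgets/list.asp?page=2
```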

[edited by: tedster at 8:14 pm (utc) on Oct. 10, 2006]


 8:13 pm on Oct 10, 2006 (gmt 0)

The problem with IIS is that there is nothing in the default configuration to allow you to do the index rewrite yourself. You need to get something like ISAPI_Rewrite installed. Oh, and most sites on Windows boxes don't have enough Admin rights to actually install it themselves. Further, most hosts will refuse to allow you to install it, and refuse to install it for you.

Tedster details how to solve the www vs. non-www problem on IIS boxes, but I have found a few people that say that they didn't even have access to that.

Ever wondered why I always recommend Apache for web hosting?

[edited by: g1smd at 8:37 pm (utc) on Oct. 10, 2006]


 8:33 pm on Oct 10, 2006 (gmt 0)

RewriteEngine on

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html? [NC]
RewriteRule ^(([^/]*/)*)index\.html?$ [domain.com...] [R=301,L]

RewriteCond %{HTTP_HOST} ^domain\.com [NC]
RewriteRule ^(.*)$ [domain.com...] [R=301,L]

First, this forces all index pages, both index.html and index.htm to / for both non-www and www, and forces them all to be on www. The redirect works for index pages both in the root and in any folders, and the 301 redirect preserves the folder name in the redirect.

What if my site is already indexed by Google with all of the sub folders having the index.html shown. For example:

(1) [domain.com...]

Would I still want to do this? In other words if I have everything redirected with this so that it will redirect to:

(2) [domain.com...]

Couldn't this cause me problems where the original url (1) has high page rank and also ranks very well for google searches? Wouldn't the second url be considered a different "new" page and not rank as well?


 8:39 pm on Oct 10, 2006 (gmt 0)

The question I would ask is how many external links point to www.domain.com/folder/ itself?

If none, then I would be tempted to leave things alone.

If many, I would make the change now, before Google "flips" at some point in the future... as they surely will at some point.

Perhaps I would start by changing several folders and monitoring what happens to them for the next 3 or 4 months.


 8:49 pm on Oct 10, 2006 (gmt 0)

I see what you mean g1smd. Thanks. So if I did want to keep the index.html for the sub folders, but did want to get rid of if only for the main domain (which uses index.html as the default), then would the following be the right way to do that?

RewriteCond %{HTTP_HOST} ^index\.html[NC]
RewriteRule ^(.*)$ [domain.com...] [L,R=301]

RewriteCond %{HTTP_HOST} ^domain\.com [NC]
RewriteRule ^(.*)$ [domain.com...] [R=301,L]


 9:57 pm on Oct 10, 2006 (gmt 0)

Your first example is invalid, your domain name {HTTP_HOST} is not "index.html".

Your second example is the standard non-www to www fix that you will still need to be in place.


The first part should read:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html? [NC]
RewriteRule ^index\.html?$ http://www.domain.com/ [R=301,L]

That will redirect for both index.html and for index.htm but only for those found in the root.
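A quick Python check of that root-only behaviour (placeholder domain): only /index.html or /index.htm at the domain root redirects, and subfolder index pages are left alone:

```python
import re

# Simulation of the root-only index redirect described above.
def root_index_redirect(path):
    """Return the 301 target, or None when the URL is left untouched."""
    if re.fullmatch(r"/index\.html?", path, re.I):
        return "http://www.example.com/"  # placeholder canonical root
    return None

print(root_index_redirect("/index.htm"))
# http://www.example.com/
print(root_index_redirect("/folder/index.html"))
# None
```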

[edited by: g1smd at 10:31 pm (utc) on Oct. 10, 2006]


 10:03 pm on Oct 10, 2006 (gmt 0)

Got it. Thanks g1smd your posts have really helped me a lot....

natural number

 8:13 pm on Oct 12, 2006 (gmt 0)

Hey G1smd, here is my dupe content question:

I have a main page, called example.
Example has a gallery on it.
The title of example is "Widget Example."

When you click on the pics in example.html, the pages displaying the widget have titles like this:
Widget Example 2.
Widget Example 3.
Widget Example 4.
Widget Example 5.

Am I walking dangerously close to duplicate content? These other pages just show pictures. What should I do if I am close?
thanks in advance,


 8:56 pm on Oct 12, 2006 (gmt 0)

It is a bit close. If you want them to be indexed, I would add a few more words to each title.

You'll have trouble getting any rankings if there is little or no on-page text, and they will all be treated as duplicates in the end.

However, if those pages JUST contain images, then I would use <meta name="robots" content="noindex"> on all of those so that those pages don't get indexed at all.

Search engines index text, so why not just work on getting each of the main text pages and gallery pages (with added text) indexed and ranked, and don't bother with Google looking at the image sub-pages at all?


 2:52 am on Oct 13, 2006 (gmt 0)

Why is it that we get old supplemental results when searching the .com, but no supplementals on various regional Google sites, using the site: tool?

I believe Matt said site: results should be largely fixed, which conflicts with our experience.

This has been going on for 3-4 weeks.
