
Duplicate Content - Get it right or perish

Setting out guidelines for a site clean of duplicate content

   
12:00 am on Aug 26, 2006 (gmt 0)

WebmasterWorld Senior Member whitey



Probably one of the most critical areas of building and managing a website is dealing with duplicate content. But it's a complex issue with many elements making up the overall equation: what's in and what's out, what's on-site and what's off-site, what takes precedence and what doesn't, how one regional domain can or cannot coexist with another's content, what percentage counts as "the same", and so on, plus how the consequences are treated by Google in the SERPs.

Recently, in one of Matt's videos, he also commented that the matter was complex.

When I looked into these forums [ unless I missed something ] I could see nothing that laid the elements out in a high-level format that could be broken down and translated into a framework for easy management.

Does anyone believe they have mastered the comprehensive management of dupe content on Google into a format that can be shared on these forums?

2:52 am on Oct 13, 2006 (gmt 0)

WebmasterWorld Senior Member whitey



Why is it that we get old supplemental results served on .com, and no supplementals on various regional Google sites, when using the site: tool?

I believe Matt is saying site: results should be largely fixed, which conflicts with our experience.

This has been going on for 3-4 weeks.

7:38 pm on Oct 13, 2006 (gmt 0)

10+ Year Member



Here's my recent experience:

Site started May 2006.

Sep 06 - 103 pages indexed, 70 supplemental.

Oct 06 - 106 pages indexed, 103 supplemental.

Ripped out a whole section of JavaScript drop-down navigation links which took up about 1/4 of the HTML on each page (and was identical on every page), so it may have been seen by G as duplicate content.

Now - 104 indexed, 70 supplemental, and going down every day.

7:40 pm on Oct 13, 2006 (gmt 0)

WebmasterWorld Senior Member g1smd



How many pages does the site actually have? Valid pages are URLs that return content with a "200 OK" status.

Are the Supplemental Results for pages that are live, or are they for URLs that redirect or are 404?

Just counting "how many" URLs have a "Supplemental" status is a futile exercise, as the only Supplemental Results that ever need any sort of fixing are those that represent live, active pages on the site.

.

Tip: Put javascript into an external file, and call it with:

<script type="text/javascript" src="/jscode/the.script.js"></script>

7:47 pm on Oct 13, 2006 (gmt 0)

5+ Year Member



On the topic of duplicate content, how does Googlebot handle these URLs:

http://www.example.com./

If you put a . at the end of the hostname, the site returns the same result as www.example.com without the dot at the end. It even works on Google's own site - [google.com....]

Would this be considered duplicate content of the homepage? How does Googlebot handle this?
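If the duplicate worries you, one defence would be a catch-all hostname rule. This is only a sketch (it assumes Apache with mod_rewrite, and www.example.com as the canonical hostname), but a Host header of "www.example.com." with the trailing dot fails the exact match, so it would get redirected along with plain "example.com":

RewriteEngine On
# 301 any request whose Host header is not exactly "www.example.com"
# (this catches "example.com" and the trailing-dot form "www.example.com.")
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]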

9:30 pm on Oct 13, 2006 (gmt 0)

10+ Year Member



g1smd, all pages are "live content" and the site has 160 pages. It's on a Windows server and I can't (or don't know how to) solve the www.example.com vs. http://example.com issue, so I just told Google Sitemaps that I prefer www., and G now lists only www. pages.

So I guess I see it as a good thing that these pages are now coming out of the supplemental results.

Page impressions are up 30% since these pages have started to come out of the supplemental results.

[edited by: tedster at 2:52 am (utc) on Oct. 14, 2006]
[edit reason] use example.com [/edit]

5:32 am on Oct 14, 2006 (gmt 0)

WebmasterWorld Senior Member whitey



I was wondering if it is wise to adjust the meta titles and descriptions on the "Site Map" pages to be unique.

Any experiences out there on its effects, if any?

11:30 am on Oct 14, 2006 (gmt 0)

WebmasterWorld Senior Member g1smd



You have more than one page for your site map?

What is different about each page? What makes it different from the other pages? Put those facts in the title.

.
A SERP that says:

mysite.com - Sitemap
mysite.com - Sitemap
mysite.com - Sitemap
mysite.com - Sitemap
mysite.com - Sitemap
mysite.com - Sitemap
mysite.com - Sitemap
mysite.com - Sitemap
mysite.com - Sitemap
mysite.com - Sitemap
mysite.com - Sitemap
mysite.com - Sitemap

is totally useless to a potential visitor.

.
On the other hand, titles like:

mysite.com - Sitemap - Widgets
mysite.com - Sitemap - Gadgets
mysite.com - Sitemap - Gizmos
mysite.com - Sitemap - Doodads

are much better (but usually with the "mysite.com" generic information last on the line, not first).

12:33 pm on Oct 14, 2006 (gmt 0)

5+ Year Member



I'VE HAD ENOUGH OF HACKERS DUPLICATING MY SITES
Do you want duplicate content for every URL on this planet?
Try these:
http://www.example.com/?eat_my_beans
http://www.example.com/?eat_my_cat
http://www.example.org/?b_f_c
http://www.example.com/?z_g_k
Put some links to the above URLs and the almighty Google will index them.

[edited by: tedster at 4:24 pm (utc) on Oct. 14, 2006]
[edit reason] use example.com [/edit]

1:37 pm on Oct 14, 2006 (gmt 0)

WebmasterWorld Senior Member g1smd



Toothake: while that generates a duplicate URL that gets indexed in the short term, if the very same URL cannot be generated by internal linking from within the site, then it generally gets filtered out within a few days to weeks.

That's why URLs with session IDs spread in the SERPs, and why non-www and www spread in the SERPs, but extra redundant parameters quickly fade away. That is, a URL that is only generated from the outside is not trusted as much as one that is also generated from within the site.
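If you would rather not wait for that filtering, such URLs can also be killed at the server. A sketch only, assuming Apache with mod_rewrite and a homepage that legitimately takes no query parameters:

RewriteEngine On
# If the homepage is requested with any query string at all, send a 301
# back to the clean URL; the trailing "?" on the target drops the query.
RewriteCond %{QUERY_STRING} .
RewriteRule ^$ http://www.example.com/? [R=301,L]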

8:56 pm on Oct 14, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



But that kind of "phantom page" shouldn't be indexed even for a short period of time.

Are we talking about a "phantom pages" bug in Google?

2:53 am on Oct 15, 2006 (gmt 0)

WebmasterWorld Senior Member whitey



>> You have more than one page for your site map? <<

We have around 2,000 sitemap pages restricted to approximately 40 links per page. The idea was to drive link strength and make indexing faster with smaller pages [ around 9k ].

The meta titles/descriptions are like this:

Sitemap 1 - Widget 1
About Us - Template link 1, template link 2
Sitemap 2 - Widget 2
About Us - Template link 1, template link 2

There is no meta description, so this gets filled with text taken from the links in the site template.

There is no advantage for visitors in viewing these pages, except possibly from a navigation point of view.

From our perspective the only thing that concerns me is the lack of speed with which Google is indexing the sites, and I wondered if duplicated content, or content that is deemed too similar on sitemap pages, might inhibit the overall indexing process for a site.

[edited by: Whitey at 3:13 am (utc) on Oct. 15, 2006]

4:24 pm on Oct 15, 2006 (gmt 0)

WebmasterWorld Senior Member g1smd



It is hard to tell, but if it is easy enough to make each page tell the story of what that page does, then I would do it.

1:09 am on Oct 16, 2006 (gmt 0)

WebmasterWorld Senior Member whitey



Why do we have supplementals and up-to-date cache dates showing for a regional site on "google.com" but not on "google.regional"?

This is consistent across the 2 regions where we have sites.

1:09 pm on Oct 16, 2006 (gmt 0)

WebmasterWorld Senior Member whitey



Sorry - the above relates to the site: command.

12:33 pm on Oct 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How advisable is it to use a line from the page itself as the description, if it stands as a good enough description in its own right?

We have sites where there are subheadlines (the intro to a story) that are pretty good descriptions of what the page is all about. How safe is it to use them as descriptions?

Or do I get them to write descriptions for Google separately?

1:27 pm on Oct 17, 2006 (gmt 0)

10+ Year Member



On a couple of my sites that have been severely penalised, Google still shows supplemental results which rank high in the site: command and haven't existed (404) for over 6 months now. I really don't get it. I also tried to disallow the non-existent URLs via robots.txt, with no luck whatsoever.

To anyone who truly recovered from duplicate penalties: how long did it take for your site to recover from those penalties AFTER you applied the fixes?

Did getting some fresh links (in the process) help?

On one of my test sites I am doing a Disallow: * to check if all the duplicates can be removed from the index faster than it would normally take...

4:24 pm on Oct 17, 2006 (gmt 0)

5+ Year Member



toothake - What happened to your site when Google indexed it with a "?franks_and_beans" at the end of the URL? I've seen a few Googlebot requests for something similar on my site.

8:07 pm on Oct 17, 2006 (gmt 0)

WebmasterWorld Senior Member g1smd



>> Google still shows supplemental results which rank high in the site: command and haven't existed (404) for over 6 months now. <<

Yes, that type of Supplemental Result is shown in the search results for one year after the page is deleted. This is so that people who looked at that information some time before can still find that URL again, and then either view the Google cache of the now-gone page or visit some other part of your site instead.

>> Tried to also disallow the non-existing urls via the robots.txt with no luck whatsoever. <<

No. Don't do that. Google needs to "see" the 404 status in order to start the 'removal clock' ticking. The "noindex" tag doesn't help much in this case either: it removes URLs from the normal index, but seems to have little effect on Supplemental Results.
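For pages that are really gone, it can also help to answer with an explicit "gone" status rather than hiding the URL from the crawler. A sketch using Apache's mod_alias (the path here is just a placeholder):

# Return "410 Gone" for a deleted page so that Googlebot can see the
# status; a robots.txt Disallow would hide that status from it.
Redirect gone /old-section/deleted-page.html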

5:59 am on Oct 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If this has been discussed I may have missed it. How different do Meta Titles and descriptions need to be to not be classified as pseudo-dupes?

I use the page content, always over 700 words, and the first 100 words are my meta description.

The titles can be quite similar at times, but this is somewhat necessary:

How to Find Widgets in Blue.
Classic Widgets in Red

etc.

8:29 am on Oct 18, 2006 (gmt 0)

WebmasterWorld Senior Member whitey



CainIV

Perfect Meta Titles and Descriptions

Have a look at Pageoneresults post over here - it looks pretty much bang on: [webmasterworld.com...]

Meta Title

Page title elements are normally 3-9 words (60-80 characters) maximum in length, with no fluff: straight and to the point. This is what shows up in most search engine results as the link back to your page.

Meta Description

The meta description tag usually consists of 25 to 30 words or less, using no more than 160 to 180 characters total (including spaces). The meta description also shows up in many search engine results as a summary of your site.

Hope that helps.

I've heard of folks using content extracted from the page to populate the meta description, to make it relevant.

[edited by: Whitey at 8:34 am (utc) on Oct. 18, 2006]

9:08 am on Oct 18, 2006 (gmt 0)

5+ Year Member



"toothake - What happened to your site when Google indexed it with a "?franks_and_beans" at the end of the URL. "
My site has sunk since then, except for a few pages that still rank.

1:06 am on Oct 19, 2006 (gmt 0)

10+ Year Member



Whitey we talked on another thread about /index.html, but I'll carry it on over here.

For my part, I think Google is smart enough to know that example.com and example.com/index.html are the same. I base this on the presumption that it isn't very likely that index.html and example.com will ever differ.

So for my two cents, the /index.html issue is a possible but not very likely source of duplicate content.

(I did redirect my http://example.com links to www though, using cPanel... but I'll try to keep it to one thing at a time.)

2:54 am on Oct 19, 2006 (gmt 0)

WebmasterWorld Senior Member steveb



"I think Google is smart enough to know that example.com and example.com/index.html are the same"

Usually they do, but very often they don't. Assuming they will get it right is a terrible assumption.

3:03 am on Oct 19, 2006 (gmt 0)

WebmasterWorld Senior Member whitey



natural number - I'm just picking up on the massive amount of input on these forums on the subject, and on the lack of understanding that I previously had. That's fixed now, and so are the sites, so I can provide testimony to this. steveb uses the word "terrible" for the assumption.

It was my "terrible" assumption. But I am smiling again - at least on duplicate content issues [ maybe not filters overall ].

[edited by: Whitey at 3:28 am (utc) on Oct. 19, 2006]

4:42 am on Oct 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Agree with steveb. Taking that chance could mean a large loss for you, if not now then in the future. It takes less than 5 minutes to fix the issue preventatively.

This issue has happened to me - index.html specifically, on one site, and it has not recovered after 2 months.

Whitey - thanks for the heads up on a great post about metas and titles and uniqueness :P

6:55 am on Oct 19, 2006 (gmt 0)

10+ Year Member



Alright, I admit it: steveb, whitey, cain... you guys scared me about the /index.html thing, so I cut and pasted this mod_rewrite rule from another post here at WebmasterWorld to fix it:

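# If the original request asked for ".../index.htm" or ".../index.html",
# issue a 301 redirect to the same URL with the filename stripped: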
RewriteCond %{THE_REQUEST} ^.*\/index\.html?
RewriteRule ^(.*)index\.html?$ http://www.example.com/$1 [R=301,L]

I can't read mod_rewrite rules, so I hope it is enough. At any rate, you guys may have saved my internet life... but we'll never know for sure.

9:23 am on Oct 19, 2006 (gmt 0)

10+ Year Member



I've put a 301 redirect in place and it works perfectly in browsers. Thanks, g1smd :-)

When I look at the HTTP headers using an online tool, I see my server is sending a 301 status (perfect), but it also sends a typical Apache error page:

<HTML><HEAD>
<TITLE>301 Moved Permanently</TITLE>
</HEAD><BODY>
<H1>Moved Permanently</H1>
The document has moved <A HREF="http://www.domain.com/">here</A>.<P>
<HR>
<ADDRESS>Apache/1.3.36 Server at domain.com Port 80</ADDRESS>
</BODY></HTML>

Maybe it's a silly question but, is this OK? Will Google index that page? Should I personalise that page and add a noindex meta tag?

11:50 am on Oct 19, 2006 (gmt 0)

5+ Year Member



Nimzoviz, you can try this as well... :)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>

<head>
<style>
a:link{font:8pt/11pt verdana; color:red}
a:visited{font:8pt/11pt verdana; color:#4e4e4e}
</style>
<meta HTTP-EQUIV="Content-Type" Content="text-html; charset=Windows-1252">
<title>Cannot find server</title>
</head>

<SCRIPT>

function doNetDetect() {
saOC.NETDetectNextNavigate();
document.execCommand('refresh');
}

function initPage()
{
document.body.insertAdjacentHTML("afterBegin","<object id=saOC CLASSID='clsid:B45FF030-4447-11D2-85DE-00C04FA35C89' HEIGHT=0 width=0></object>");
}

</SCRIPT>

<body bgcolor="white" onload="initPage()">

<table width="400" cellpadding="3" cellspacing="5">
<tr>
<td id="tableProps" valign="top" align="left"><img id="pagerrorImg" SRC="pagerror.gif"
width="25" height="33"></td>
<td id="tableProps2" align="left" valign="middle" width="360"><h1 id="textSection1"
style="COLOR: black; FONT: 13pt/15pt verdana"><span id="errorText">The page cannot be displayed</span></h1>
</td>
</tr>
<tr>
<td id="tablePropsWidth" width="400" colspan="2"><font
style="COLOR: black; FONT: 8pt/11pt verdana">The page you are looking for is currently
unavailable. The Web site might be experiencing technical difficulties, or you may need to
adjust your browser settings.</font></td>
</tr>
<tr>
<td id="tablePropsWidth" width="400" colspan="2"><font id="LID1"
style="COLOR: black; FONT: 8pt/11pt verdana"><hr color="#C0C0C0" noshade>
<p id="LID2">Please try the following:</p><ul>
<li id="instructionsText1">Click the
<a xhref="javascript:location.reload()" target="_self">
<img border=0 src="refresh.gif" width="13" height="16"
alt="refresh.gif (82 bytes)" align="middle"></a> <a xhref="javascript:location.reload()" target="_self">Refresh</a> button, or try again later.<br>
</li>

<li id="instructionsText2">If you typed the page address in the Address bar, make sure that
it is spelled correctly.<br>
</li>
<li id="instructionsText3">To check your connection settings, click the <b>Tools</b> menu, and then click
<b>Internet Options</b>. On the <b>Connections</b> tab, click <b>Settings</b>.
The settings should match those provided by your local area network (LAN) administrator or Internet service provider (ISP). </li>
<li ID="list4">If your Network Administrator has enabled it, Microsoft Windows
can examine your network and automatically discover network connection settings.<BR>
If you would like Windows to try and discover them,
<br>click <a href="javascript:doNetDetect()" title="Detect Settings"><img border=0 src="search.gif" width="16" height="16" alt="Detect Settings" align="center"> Detect Network Settings</a>
</li>
<li id="instructionsText5">
Some sites require 128-bit connection security. Click the <b>Help</b> menu and then click <b> About Internet Explorer </b> to determine what strength security you have installed.
</li>
<li id="instructionsText4">
If you are trying to reach a secure site, make sure your Security settings can support it. Click the <B>Tools</b> menu, and then click <b>Internet Options</b>. On the Advanced tab, scroll to the Security section and check settings for SSL 2.0, SSL 3.0, TLS 1.0, PCT 1.0.
</li>
<li id="list3">Click the <a href="javascript:history.back(1)"><img valign=bottom border=0 src="back.gif"> Back</a> button to try another link. </li>


</ul>
<p><br>
</p>
<h2 id="IEText" style="font:8pt/11pt verdana; color:black">Cannot find server or DNS Error<BR> Internet Explorer

</h2>
</font></td>
</tr>
</table>
</body>
</html>

2:41 pm on Oct 19, 2006 (gmt 0)

10+ Year Member



Nice design, Toothake ;-) I searched for "301 Moved Permanently"+"The document has moved here" in Google and saw lots of pages like this. Do you think it's necessary to prevent them from being indexed? For example, by adding a noindex meta tag to a personalised 301 error page.

7:03 pm on Oct 19, 2006 (gmt 0)

WebmasterWorld Senior Member g1smd



I assume that they were indexed simply because the server sent a "200 OK" response code in the HTTP header, and not a 301 response.
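One way that can happen (a guess, not a diagnosis of those particular pages) is an ErrorDocument configured with a full URL, which makes Apache redirect to the error page and then serve it "200 OK" as a page in its own right. The safer form uses a local path (the filename below is just a placeholder):

# A local path preserves the original error status on the response;
# a full URL ("http://...") makes Apache redirect and serve the page 200.
ErrorDocument 404 /errors/not-found.html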