Forum Moderators: Robert Charlton & goodroi


Serious analysis of Duplicate content penalties

         

Clark

11:18 pm on Sep 21, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Has anyone done some serious analysis into the extent of the Duplicate content filter?

If "hello" were not a stop word and you used it 10 times, that would not be duplicate content.

If you quote a sentence from another site, that is not duplicate content.

How about if you quote a page out of 100 pages? If you quote two paragraphs?

What if you quote a paragraph but your page is only one paragraph long?

What if you quote a paragraph and your page is 10,000 times as long?

I imagine you can see what I'm getting at.

nsqlg

6:56 am on Sep 24, 2005 (gmt 0)

10+ Year Member



Maybe using the removal tool would be a good idea for print pages.

nickied

11:20 am on Sep 24, 2005 (gmt 0)

10+ Year Member



Clark:

I'm talking about dupe content from your own site only.

I've got hundreds of pages returned by an allinurl: search which don't exist and never existed. Things like offset=45, or negative offsets such as offset=-50, where offset=10 is the 2nd page of 10 widgets. I never had URLs with offsets of 15, 25, 35, etc. Also never had URLs such as offset=-357 (negative), etc.

9,710 allinurl: pages, to be exact. Actual pages according to Google: about 816, half of them URL-only. Actual pages according to me: about 1,000 - 1,050.

The allinurl: pages have cache dates "as retrieved on Jan 27, 2005", "as retrieved on Jun 24", etc., as I expect many members here also have.

The site was cleaned up with 301s for non-www, etc. around June. An XML sitemap was generated and the supplementals vanished quickly. On 4th July the page count more than quadrupled to 3,440 and many pages returned to supplemental. After that, the count kept increasing.

Got to believe there's a dup penalty here. Thinking of turning the custom error page off, returning 404s for the bad pages, and waiting. (I'm not a coder, btw, so doing the URL rewrites will not be easy here, and there are hundreds of pages to be done.) Or should I just wait for the next "update" and hope the old caches go away?
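
For anyone in the same position, a minimal sketch of what "returning 404s for the bad pages" could look like. The page size, item count and output below are invented for illustration only, not anyone's actual code: the idea is just to validate the offset parameter and send a real 404 status for anything that never maps to a page.

```python
# Rough sketch: reject offsets that never corresponded to a real listing page.
PAGE_SIZE = 10
TOTAL_ITEMS = 95  # hypothetical number of widgets on the site

def offset_is_valid(raw_offset):
    """True only for offsets like 0, 10, 20 ... that fall inside the catalogue."""
    try:
        offset = int(raw_offset)
    except (TypeError, ValueError):
        return False
    return offset >= 0 and offset % PAGE_SIZE == 0 and offset < TOTAL_ITEMS

def respond(raw_offset):
    # CGI-style sketch: send a genuine 404 status for offsets that never existed,
    # instead of serving the normal template (which Google would index again).
    if not offset_is_valid(raw_offset):
        print("Status: 404 Not Found")
        print("Content-Type: text/html")
        print()
        print("<h1>Not Found</h1>")
    else:
        print("Content-Type: text/html")
        print()
        print("<h1>Widgets %s onwards</h1>" % raw_offset)

respond("-357")  # one of the stray offsets: gets a genuine 404
```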

Thanks.

AndyA

11:54 am on Sep 24, 2005 (gmt 0)

10+ Year Member



stargeek: Yes, it was around the middle of December. The drop off in the statistics is quite sharp.

Clark

5:47 pm on Sep 24, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Nicki,
Wish I could help. This is the one issue where I've seen very little comment that I thought was insightful, authoritative or helpful. Not here, not anywhere. And I've been concerned about this issue for a long time. It really BEGS to be addressed by Google, both with official guidance and by changing their algos. Why should we be afraid of printer-friendly pages, or of people linking to ?var=val pages, or of using rel=nofollow?

When you use a lot of software and have lots of content and a fair number of domains, doing 301s and nofollows becomes impossible.

Google, please talk about this somewhere.

annej

11:37 pm on Sep 24, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here is another problem I am finding. I moved some of my pages to a new URL in the process of reorganizing a bit. These pages still come up in Google searches, though they are listed as supplemental pages. I don't understand why Google hasn't just dropped them.

The old pages bring up a "404 moved page". It is customized with a link to the homepage. Could that be the problem?

It does look like I have duplicate pages even though I don't. After my Bourbon problem I know that whole sites can be downrated for this.

nickied

11:59 pm on Sep 24, 2005 (gmt 0)

10+ Year Member



Clark:

Wish I could help.

Not a problem.

This is the one issue where I've seen very little comment ... (snip)

Maybe this will help get it on the radar for GG.

Regards.

nickied

12:02 am on Sep 25, 2005 (gmt 0)

10+ Year Member



annej:

The old pages have a "404 moved page" come up. It is customized with a link to the homepage. Could that be the problem?

Possible. Are you using a custom 404 page made through a control panel or something? I had to shut mine off (custom 404) in order to use the G removal tool. Check to see if they are really returning a 404 code.
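
A quick way to check, if you have Python handy (the URL below is just a placeholder for your own site): request a page that shouldn't exist and see what status code the server actually sends back.

```python
# Minimal status check: a custom error page is only safe if this prints 404, not 200.
import urllib.request
import urllib.error

def status_of(url):
    try:
        return urllib.request.urlopen(url).getcode()
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises for 4xx/5xx; the code is what we want

print(status_of("http://www.example.com/this-page-does-not-exist"))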

bumpski

11:01 am on Sep 25, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



annej:

I've found that Google must attempt to crawl a non-existent page at least 3 times to remove the page completely from its index(es).

This may be bad advice, BUT, one thing that may work is putting absolute links to your non-existent pages (pages you want, and have, removed), perhaps even on your home page. Then allow Google to crawl these links, getting your hopefully-present 404 error at least 3 times. Then remove your absolute links.

Using Firefox and the "Live HTTP headers" extension is one way to check for correctly formatted 404 errors in your response headers.

The response header string from live HTTP headers:
"HTTP/1.x 404 Not Found" (Hope this isn't overkill)

When I wanted to fix www vs non-www problems this was the only way to eliminate all pages incorrectly showing as non-www versions in Google's index. Basically there must be a link to the non-existent page you want removed, until the page is actually removed from Google's index. This could take 3 crawls, perhaps up to 3 (maybe 4) months worst case!

Otherwise the page will remain "orphaned" (and uncrawled) in the supplemental index, probably forever! Again, a disclaimer: use this info at your own risk!

I have one page that has been orphaned and non-existent, and it remains in Google's index 3 years after it was removed! I've left it there for posterity! Anyone in the world could make it go away by linking to it! (If they could find it, and they can!) Even if you click on the link Google provides in the SERPs, Google will not remove the page until it is crawled 3 times.

Finally, I've also seen in my research that some say the link to the non-existent page to be removed must come from off-site. I was successful just linking from the home page of the same site.

econman

12:12 pm on Sep 25, 2005 (gmt 0)

10+ Year Member



In reading this thread, there seems to be very little consideration of the difference between a penalty and a filter.

The title of the thread uses the word "penalties" yet I thought there was no penalty for having the same content appear more than once on your own site.

With respect to content on the same site, I thought Google was filtering out duplicates. It applies a filter in an attempt (not always successful) to prevent the same content from appearing twice in the Google SERPs.

If it is a filter, not a penalty, there would seem to be little or no harm (and little or no SEO benefit) from providing duplicate content on your site, to the extent you want to do that (e.g. to make the site easier to navigate, or provide some other benefit to users).

Am I wrong?

Clark

5:08 pm on Sep 25, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Excellent point.

That's exactly what I'd like to know. Is dupe content within a site only a filter or a penalty?

andrea99

6:56 pm on Sep 25, 2005 (gmt 0)



Picking up from my earlier post in this thread:

Many of the meta descriptions that I changed (merely appended part of the site title) have now been indexed, and the site:mydomain.com search is now listing them separately instead of lumping them into the "...omitted some entries very similar..." group.

I strongly suspect this lifts some penalty; unfortunately, my Google referrals have not increased. If anything they are slightly lower. I'll give this some more time before trying to figure out what's happening.
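
For what it's worth, the "quick and dirty" change amounted to something like the sketch below. The file names and the simple string handling are illustrative only (a site built from a template would do this in one place instead of editing files): append each page's own title to its existing meta description so no two descriptions are identical.

```python
# Rough sketch only: appends the page's <title> text to its meta description.
# "site/*.html" is a made-up path; back up files before running anything like this.
import glob
import re

for path in glob.glob("site/*.html"):
    with open(path, encoding="utf-8") as f:
        html = f.read()

    title = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
    if not title:
        continue
    suffix = " - " + title.group(1).strip()

    new_html, changed = re.subn(
        r'(<meta\s+name="description"\s+content=")([^"]*)(")',
        lambda m: m.group(1) + m.group(2) + suffix + m.group(3),
        html,
        flags=re.I,
    )
    if changed:
        with open(path, "w", encoding="utf-8") as f:
            f.write(new_html)
```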

andrea99

7:16 pm on Sep 25, 2005 (gmt 0)



Whoa! I should have looked at my stats again before making that last post: Google referrals have just TRIPLED over the last hour. I've learned not to get too excited over these kinds of spikes, but I have only seen it happen like this a few times over the past few years, and it has always been a very good thing.

On closer inspection of the logs it does appear that G is now picking up the pages with altered meta descriptions.

I'm hesitant to say "back in fat city" but I definitely am typing this with a big smile on my face.

I'll send detailed stats on request.

wiseapple

10:10 pm on Sep 25, 2005 (gmt 0)

10+ Year Member



Andrea99 -
So the key is to use meta-description tags which are completely unique and not used anywhere else on the site?

Do you use snippets of your articles anywhere else on your site?

Trying to figure out if we should scrap the use of snippets of our articles on index pages. We have the following situation.

- We use a snippet of the article on an index page.
- We also use the snippet in the meta-description tag.
- The snippet also exists in the article.

We are wondering if this causes filtering or a penalty. It is the last thing I can think of.

wiseapple

12:15 am on Sep 26, 2005 (gmt 0)

10+ Year Member



I ran into this on Matt Cutts' blog.

---------------------------------------------------
The only things we can think of (since we don’t use any black hat SEO) is:

1.) We use titles and descriptions in our sub sections to introduce contents of our articles, which is the same as the title and description at the top of our articles and related articles, as well as the meta title and description.
---------------------------------------------------

This is the third site I know of that is structured this way and is dead in the water.

It seems as though if you use a "snippet" of an article in other places on your site, plus use it in the description meta-tag, you will be filtered or penalized.

Another possible issue is that scrapers are also taking the description meta-tag and using it on their sites. A snippet could end up spread across a thousand sites. Therefore, when they do the filtering, your site gets swept up with the others.

Anyone with similar setup that has issues?

andrea99

12:37 am on Sep 26, 2005 (gmt 0)



So the key is to use meta-description tags which are completely unique and not used anywhere else on the site?

It is difficult to generalize. But I outlined what worked for me and the results are dramatic. I was able to make the change globally inside half an hour; they were all spidered within the next day and the new results are coming online now.

I wish I could post the graph of hourly hits, a thing of beauty--jumps to more than double at 3:00 pm and holds there...

Probably would be wise to avoid duplicate text entirely. My quick and dirty fix on the meta descriptions worked but I'm going to go back and rewrite each one this coming week.

Hey, wouldn't it be great if I could use this time to create real content? :)

wiseapple

1:04 am on Sep 26, 2005 (gmt 0)

10+ Year Member



Andrea99 -
Previous to updating your description meta-tags, were many of your pages supplemental? Did you check the cache dates on your supplementals? Were they from Nov. 2004? Dec. 2004? Or Feb. 2005?

Thanks.

andrea99

1:30 am on Sep 26, 2005 (gmt 0)



>> were many of your pages supplemental?

Last week virtually all of them (~400 pages) were showing in the index as "supplemental." Now there are only a few listed that way, but only 185 pages are indexed at the moment. Hopefully the rest will fill in soon.

In August most of my cached pages were dated July though there were some anomalies with caches from January showing up. That was while the entire domain was banned, and occasionally the caches would disappear as well.

So very confusing... But at least my nightmare is over (for now). Who knows what unpleasant surprise lurks. :)

annej

7:06 am on Sep 26, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the suggestions. I already put my 404 redirect back to the default. I had already tried the idea of linking to the removed pages from a live page in the hope it would help get rid of them in Google. I thought that had worked, but now they are back. I may have to do it again and leave the links up longer. I really hate to have to link to non-existent pages on my site though.

johan

11:36 am on Sep 26, 2005 (gmt 0)

10+ Year Member



The printer-friendly version is fine, because I was hit by Bourbon and did not change it to get out. When I was hit by Bourbon I don't think anyone really knew what was happening. But what we did learn was that dupe content is not really an internal issue but one across domain names.

I had .com, .net and .org versions of my website that had redirects, including one with the full URL as a test domain from when I moved hosting! I changed all these to be safe, so they were either deleted or now go to a one-page website that simply says "This domain is owned by widgets; please visit www.widgets.co.uk to see the site." Then I complained to Google, who sent out the usual irrelevant stock reply, and then I was out.

Bourbon was so screwed up that no one really knew for sure.

stargeek

12:00 pm on Sep 26, 2005 (gmt 0)

10+ Year Member



"stargeek: Yes, it was around the middle of December. The drop off in the statistics is quite sharp."

I'm hearing a lot about the mid-December "non-update" as a precursor to Bourbon in May. I'm not sure why the mid-December changes were never recognized as an update.

Mobillica

1:26 pm on Sep 26, 2005 (gmt 0)

10+ Year Member



I am in the same boat, I think I have been hit by a slight duplicate penalty.

What I have done is use the command site:www.mydomain.com and go through the web pages looking for pages with no description, then amend these, upload them, and wait for reindexing. (A sketch of that kind of check is below.)

I heard that these are the pages that could have been causing the dup penalty?
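
A sketch of that kind of audit, run against a local copy of the site rather than against Google itself (the directory name is an assumption): list every page that has no meta description at all.

```python
# Walk a local mirror of the site and flag pages missing a meta description.
# "site" is a placeholder directory; nothing here talks to Google.
import os
import re

DESCRIPTION = re.compile(r'<meta\s+name=["\']description["\']', re.I)

for root, _dirs, files in os.walk("site"):
    for name in files:
        if not name.lower().endswith((".html", ".htm")):
            continue
        path = os.path.join(root, name)
        with open(path, encoding="utf-8", errors="ignore") as f:
            if not DESCRIPTION.search(f.read()):
                print("no description:", path)
```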

wiseapple

8:30 pm on Sep 26, 2005 (gmt 0)

10+ Year Member



Bringing this back on topic... Does anyone have concrete examples of how duplicate penalties can occur?

- Can duplicate penalties occur because of onsite factors?

- Are duplicate penalties only because of offsite copying?

Any thoughts?

Trisha

9:37 pm on Sep 26, 2005 (gmt 0)

10+ Year Member



- Can duplicate penalties occur because of onsite factors?

- Are duplicate penalties only because of offsite copying?

I don't claim to be an expert, but based on my experience, I'd say yes to the first question - and add that both onsite and offsite factors are involved. But there appears to be some sort of threshold - like the percentage of pages on a site that are duplicates - before an entire site is hurt by it. Or maybe the presence of other seemingly 'spammy' factors - which could very well be unintentional.

Things like using datafeeds in a non-creative manner can trigger it, for example. Also, I'm suspicious - but could be wrong - about using articles that people have sent in, which could have been used on other sites too, and about pages with too little content other than site navigation; those might be able to trigger it too.

My guess is that almost all the time Google gets it right when it comes to who had something first, so I don't think someone copying your stuff is likely to hurt, but who knows. I would guess further that a site map might help in that case though.

RockyB

9:49 pm on Sep 26, 2005 (gmt 0)

10+ Year Member



Now, this is probably going to sound really stupid to a lot of you, but I'm still pretty much a novice in this game.

On my site, I have copies of articles I have written in the past. These articles have also been spread to various article banks to act as backlink attractors. Most of these are now on 6-7 different sites as well as mine.

So what should I do with my copies of the articles? I still want them available to my visitors to read if they wish, but at the same time I don't want to be hit by a penalty. Shall I remove them altogether, leave them as they are, or put them in the robots.txt exclude list?

Thanks in advance for your help.

wiseapple

10:02 pm on Sep 26, 2005 (gmt 0)

10+ Year Member



Here is another question - has anyone experienced recovery from a dup content penalty? How long did it take to recover once the duplicated items were removed?

wiseapple

10:21 pm on Sep 26, 2005 (gmt 0)

10+ Year Member



What effect can scrapers have on your site?

Scrapers come by the site and scrape pages, mostly taking the titles and meta-descriptions of your articles. If you use these same titles and meta-descriptions throughout your site, can this be considered duplicate content? And if Google tries to remove scrapers, could it be possible that your site gets caught up in a removal because you are using the same titles and meta-descriptions as the scraper?

Trisha

11:01 pm on Sep 26, 2005 (gmt 0)

10+ Year Member



has anyone experienced recovery from a dup content penalty? How long did it take to recover once the duplicated items were removed?

Yes, at least as far as I know that was what the penalty was for. How long is hard to answer, in part because I didn't keep good records of when I made changes! The site that just came back had changes made to it in June or July, I think - sorry, I can't remember which. On the other hand, another site has not come back yet, and I think it may have had less of a problem. I could have made the changes on it a bit later though. The recovery time may depend on how many duplicates there are too. My suggestion would be to clean it up as much as you possibly can, set up a site map, send in a reinclusion request, and hope for the best.

So what should I do with my copies of the articles?

If the articles were on your site way before the other sites then Google probably knows where they originated from and the others would be considered duplicates, not yours. But if you are really concerned you could rewrite the ones on your site so that they are different enough that they wouldn't be seen as duplicates, or just block Google from them as you suggested. (A rough check of the robots.txt approach is sketched below.)
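
If you do go the robots.txt route, a small sanity check like this (standard library only; the domain and paths are made up) confirms the rule really covers the duplicate article copies before you rely on it:

```python
# Sketch: verify which URLs your robots.txt actually blocks for Googlebot.
# example.com and the paths are placeholders.
import urllib.robotparser

robots = urllib.robotparser.RobotFileParser()
robots.set_url("http://www.example.com/robots.txt")
robots.read()

for page in ("/articles/reprinted-widget-article.html", "/index.html"):
    url = "http://www.example.com" + page
    verdict = "allowed" if robots.can_fetch("Googlebot", url) else "blocked"
    print(page, "->", verdict)
```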

annej

7:47 pm on Sep 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If the articles were on your site way before the other sites then Google probably knows where they originated from and the others would be considered duplicates, not yours.

I used to think first published was considered original, but now I'm not so sure. Based on recent experience it seems the page with the highest ranking is seen as the original, and the lower-ranked page then gets a supplemental listing.

stargeek

8:03 pm on Sep 27, 2005 (gmt 0)

10+ Year Member



"I used to think first published was considered original but now I'm not so sure. Based on recent experience it seems the page with the highest ranking is seen as the origianl and the lower ranked page then gets a supplimentary listing. "
I would echo this; it's definitely not the oldest page, but the most linked-to / highest-ranking one.

AndyA

8:30 pm on Sep 27, 2005 (gmt 0)

10+ Year Member



I can verify the oldest page has nothing to do with Google's determination on originality. I had an unusual phrase in a page I wrote that went online in 2000. A few years after that, long after Google had indexed it, another site copied it in its entirety. They even forgot to modify a link at the bottom, which went to my site, which is how I found out about it.

Messages to the Webmaster went unanswered, and my original page was nowhere to be found in Google when I searched for the unique phrase in quotes. I finally got fed up and filed a DMCA on the perp. Within a few days, his entire site had been removed by his host, and within a week my original page was back on Google.

Kind of insulting, actually, that Google did this. Especially since the site was on Angel Fire, and kind of cheesy compared to mine. (In my opinion, of course.)
