Forum Moderators: Robert Charlton & goodroi

Serious analysis of Duplicate content penalties

         

Clark

11:18 pm on Sep 21, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Has anyone done some serious analysis into the extent of the Duplicate content filter?

If "hello" were not a stop word and it was used 10 times, that is not duplicate content.

If you quote a sentence from another site, that is not duplicate content.

How about if you quote a page out of 100 pages? If you quote two paragraphs?

If you quote a paragraph but your page is only a paragraph long?

If you quote a paragraph and your page is 10,000 times as long?

I imagine you can see what I'm getting at.

RockyB

9:25 pm on Sep 27, 2005 (gmt 0)

10+ Year Member



Looks like I'm gonna block half of my articles then. Ah well, I'm still pretty much sandboxed anyway; I just wanted to know in case this may have an effect in the future.

Plenty of time to write more content.

annej

4:30 pm on Sep 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



like the percent of pages on a site that are duplicates before an entire site is hurt by it.

This is the key, I think. I don't mind if a single page gets filtered out, but in the Bourbon update the whole site was downgraded.

Has anyone experienced recovery from a dup content penalty?

Mine recovered but I'm still not sure why. I had fixed my 302 hijack problem, but there also appeared to be a readjustment in the algo after GoogleGuy realized how many regular sites were being affected.

I think with Bourbon I got caught in a penalty or filter that was aimed at scraper sites. The problem is if it happened once it could happen again.

travelin cat

8:43 pm on Sep 29, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



We have a travel site that specializes in a specific sector. We have pages for 152 destination cities, where the copy on each page is different but the description, keyword, and title meta tags are virtually the same with just the name of the city changed.
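
To illustrate the pattern (city names and copy hypothetical), two pages differ only in the city name:

  <title>Widget Travel Guide - Springfield</title>
  <meta name="description" content="Plan your widget trip to Springfield with our guide.">
  <meta name="keywords" content="widget travel, Springfield">

  <title>Widget Travel Guide - Shelbyville</title>
  <meta name="description" content="Plan your widget trip to Shelbyville with our guide.">
  <meta name="keywords" content="widget travel, Shelbyville">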

Is the general consensus that this would be considered duplicate content even though the content of the pages is all different?

thanks....

DamonHD

8:53 pm on Sep 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi,

I have the same meta description/keywords for every page (40,000+) on one site and it does not seem to hurt.

They are right for every page, and G does sometimes use the meta description in the SERPs when it doesn't seem to be able to pick a good snippet.

Totally white-hat.

Rgds

Damon

HenryUK

3:32 pm on Oct 12, 2005 (gmt 0)

10+ Year Member



Hoping to bump this one up with particular emphasis on supplemental results.

One of the sites that I run is a site with user-generated content (it's an advertising play - free ads at a basic level, then subject to various kinds of upsell). There's a reasonable amount of churn (3000 or so new items per month, with about the same number coming off, average life of a data item about six months).

All searching users access the content through a search form, so I have an alternative browse index, dynamically created from the db, which allows spiders to come in and grab the content.

In my fairly niche sector, we have had an absolute lock on positions one and two for [widget type][UK town] for about three years.

However, we have dropped down the rankings some time in the past couple of weeks, quite severely, and the first result that we have for any given search is often one that is quite a poor match compared with other indexed pages from the site.

Running a "site:" search combined with a typical phrase shows in many cases that our results have been downgraded to "Supplemental Results", which means that they are obviously less competitive on search results.

I can see how these pages may have been hit by a dupe content penalty: for one thing, there may be 100 or more [type of widgets] available in certain UK towns, the unique descriptions tend to be fairly short, much of the data is the same and there is identical text (eg "Welcome to the site, well done for finding it, here's how to do x and y if you're interested") on every one of these pages.

It may be to do with recent algo changes; it may also be to do with the fact that we have an increasingly large dataset; and it may finally be something to do with the fact that Google keeps trying to hit pages that are no longer on our site and, instead of giving up, indexes the same identical error page.

What would others recommend as a way of dealing with this? As far as I can see I have the following options:

1) reduce the amount of identical text on the pages (not too keen on this as these are the landing pages for a number of new users and I want to help them to understand what kind of a search they have stumbled upon)

2) reduce the number of pages that get indexed (not keen on this as I don't want to stop particular pages - which may be highly relevant to the user - from being found; see the robots.txt sketch after this list)

3) other options that I don't know about!
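
For reference, the robots.txt sketch for option 2 in its simplest form would keep spiders out of part of the browse index (path hypothetical):

  User-agent: *
  Disallow: /browse/

but as I say, I'm not keen on hiding pages that may be highly relevant to users.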

All feedback welcome.

cheers

Lorel

5:18 pm on Oct 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Nickied:


I've got hundreds of pages returned by an allinurl: search which don't exist / never existed.

Could this domain have been owned by someone else previously? This could be why those strange urls are appearing.

annej


Here is another problem I am finding. I moved some of my pages to new URLs in the process of reorganizing a bit. These pages still come up in Google searches, though they are listed as supplemental pages. I don't understand why Google hasn't just dropped them.

The old URLs bring up a "404 moved page" error. It is customized with a link to the homepage. Could that be the problem?

Google had those old pages indexed and may be applying the URLs of the moved pages to the contents of the 404 page. Since the 404 page always brings up the same content, the other pages get penalized with Supplemental Results.
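
If the pages were moved rather than removed, the cleaner fix is a 301 redirect from each old URL to its new home, so Google transfers the old listing instead of indexing your custom 404 page over and over. On Apache that's one line per page in .htaccess (paths and domain hypothetical):

  Redirect permanent /old-page.html http://www.example.com/new-page.html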

RockyB


On my site, I have copies of articles I have written in the past. These articles have also been spread to various article banks to add as backlink attractors. Most of these are now on 6-7 different sites as well as mine.

So what should I do with my copies of the articles? I still want them available for my visitors to read if they wish, but at the same time I don't want to be hit by a penalty. Shall I remove them altogether, leave them as they are, or put them in the robots.txt exclude list?

I would leave them up as long as they are not tagged as supplemental but be prepared to take them down if they are (or disallow in robots.txt).

I advise my clients never to post their articles on their own pages but instead (before they post them elsewhere) to post them in a dated newsletter on another website as third-party proof of who wrote the original.

Henry UK

Any page whose duplicate content is drawn up automatically needs some original content, and that needs to be at least 12% of the body text. With thousands of pages on the site that's a big job, but it's either fix it or disallow Google from those pages, and I would do the latter via both a meta tag and robots.txt.
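
The two exclusions look like this. In robots.txt (path hypothetical):

  User-agent: Googlebot
  Disallow: /duplicate-articles/

and in the head of each affected page:

  <meta name="robots" content="noindex,follow">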

HenryUK

2:21 pm on Oct 13, 2005 (gmt 0)

10+ Year Member



thanks Lorel

I think maybe the 12% figure has been increased of late, but I've made changes to the site (substantial cuts to the shared text) that should push the percentage much higher than that... I'll let you know how it works.

nickied

3:26 pm on Oct 13, 2005 (gmt 0)

10+ Year Member



Lorel:

Nickied:
I've got hundreds of pages returned by an allinurl: search which don't exist / never existed.

Could this domain have been owned by someone else previously? This could be why those strange urls are appearing.

No, the domain was started by me. I've found part of the problem. About a year ago I had pages of 5 widgets each, since changed to 10 (offset=10). (I previously reported that I never had urls with 15, 25, 35, etc., which was wrong.) G has the pages of 5 in its cache. These recently turned up again and are part of the ever-increasing number of pages being returned. The other part of the problem is that G has indexed pages such as offset=-117 (that's a negative). No such pages ever existed, and I have no idea how G would spider such a page (nothing links to these odd ones). The negative offsets do return the valid main page in the particular category, probably due to poor php/db coding. Not being a coder, this is something I'll have to have fixed in the future.
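
For whoever ends up fixing it, the guard presumably amounts to something like this (a sketch only, not tested; the parameter name and page size are taken from the urls above):

  <?php
  // Reject offsets that can't belong to a real page, instead of
  // silently serving the main category page (which creates dupes).
  $pageSize = 10;
  $offset = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;
  if ($offset < 0 || $offset % $pageSize != 0) {
      header('HTTP/1.0 404 Not Found');
      exit;
  }
  // ...otherwise run the usual db query with LIMIT/OFFSET...
  ?>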

Now up to around 12k pages indexed on a site of just under 1k pages.

nfinland

6:41 pm on Oct 13, 2005 (gmt 0)

10+ Year Member



From a post I made to my blog (SEO category). I'm referring to what has happened to Matt Cutts' blog. You might know; otherwise try using Google to find out...

Earlier, Google denied that someone else could hurt your rankings in Google. This has changed, and Google's webmaster FAQ pages now say: "There's almost nothing a competitor can do to harm your ranking or have your site removed from our index."

The fact seems to be that anyone can use Google's duplicate content filter to get a site GoogleWashed and steal its rankings and traffic.

lobo235

9:23 pm on Oct 13, 2005 (gmt 0)

10+ Year Member



I read caveman's post [webmasterworld.com] about duplicate content recently and decided to try and salvage one of my sites that was hit with a duplicate content penalty. My site was originally designed with links to all the categories on the site on every single page. When my site first started I had about 8 categories, so this was not a big deal. My site grew rapidly though, and I ended up with closer to 60 categories. Each category has a 1-3 word title which was used as the link text. When added up, the category links, navigation links, titles, headings, and other things that displayed on each page totaled about 50%-80% of the page content. This was obviously a bad thing because pretty much every page that had these category links got dropped out of the search results.

Well, to make a long story short, I have trimmed all the pages and instead of linking all categories I only link 1-5 categories that relate to the page's content. I have taken out any extra bloat that could be considered duplicate content. And I have optimized my urls for each page so instead of using show_widget_235.html I now use green_widget_with_blue_stripes.html so that it's more descriptive and hopefully has some good keywords in the url. I then used a 301 permanent redirect from show_widget_235.html to green_widget_with_blue_stripes.html.
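
For anyone wanting to do the same, each of these redirects is a single line in .htaccess (Apache assumed; domain hypothetical):

  Redirect 301 /show_widget_235.html http://www.example.com/green_widget_with_blue_stripes.html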

I am wondering if this strategy will help me to get back into the Google search results again. I have done fine all along with other search engines but I saw a 50% drop in traffic when Google dropped my site so I would like to get back in the game.

Has anyone had any luck using a method like this to get relisted?

nickied

8:04 pm on Oct 14, 2005 (gmt 0)

10+ Year Member



FWIW: an update re: the inflated number of pages and dup content. Today allinurl is reporting down from 12k to just under 1k pages, which is close to the correct number. Also, the yourcache tool is reporting the same at almost half the data centers; the other dcs are now around 9k, down from 12k. Of the pages being returned, about 30% are URL-only. These had full descriptions shortly after G introduced the XML sitemap scheme and the XML sitemap was uploaded.

discrete298

8:47 pm on Oct 14, 2005 (gmt 0)

10+ Year Member



lobo235, you may end up in the sandbox, I fear.

caveman

9:12 pm on Oct 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



lobo235, FWIW, we try to avoid large scale changes to sites, especially things like renaming files. That kind of change can make a bad situation worse, unfortunately.

erny

9:17 pm on Oct 14, 2005 (gmt 0)



Keep on moaning, but Google has no ears. We have lost our pages while other useless sites rank at the top without any content, just Google AdSense.
Have a look at
[google.co.uk...]
and check the page at #3.

econman

1:02 pm on Oct 17, 2005 (gmt 0)

10+ Year Member



That example is pretty funny.

The #3 result is an empty page from a big site that just says:

"We currently have no information about"

annej

4:10 am on Oct 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Outside of my now-resolved 302 hijack experience, my original articles have until now always been fine in Google SERPs. Today I found one that is designated "supplemental". <sigh>

I don't really want to take it down as it's a good one. I'm wondering if putting a noindex tag on it would be good enough to avoid a penalty?

I am much more concerned about dup content after Bourbon.
