Forum Moderators: Robert Charlton & goodroi
My understanding is that this filter is sitewide rather than page by page, and that it can only be removed by top level Google people - do others see this as a problem for many sites?
It's on a page basis, not a domain basis, but if two or more sites carry mostly the same content, then often you'll see that the more powerful site wins across many searches.
<snipped an irrelevant point>
I mean the site that tends to have more powerful pages. ;-)
I don't know if that's the URL with higher PR, or the URL that is listed higher without the near-dupe filter.
> How do you define "wins"
The page that is listed even without clicking &filter=0
Then they realise the filter brought down the scrapers but also damaged legit sites, so they dial the filter back a little, and the scrapers start floating to the top again.
The supplemental portion of the algo is probably still buggy, and I see some authority websites getting a lot of supplemental pages.
I mean the site that tends to have more powerful pages. ;-)
I don't know if that's the URL with higher PR, or the URL that is listed higher without the near-dupe filter.
I can absolutely confirm that it is not the higher PR pages that are winning. My personal theory is that the pages being displayed are getting there by hitting the semantically related kw mix best, and that there are also other sitewide factors now clearly coming into play.
I have a feeling this removal is not automatic but manual - any thoughts on that?
Though it may not seem so on first glance, there is a related thread here [webmasterworld.com] in the Supporters Forum. (If you haven't ponied up to be a Supporter yet, and you're dealing with this issue, now would be a good time.) ;-)
I have no clue how you would know what 'most sites' do unless you maintain a hecka large db that's kept very fresh, but in any event this is not the case where we're looking.
FWIW, my guess is that it won't stay this way for too long, simply because whatever dials have been turned are resulting in some pretty high quality pages being dropped in favor of pages that are much lower in overall importance but are newer, and/or have guessed right or figured out other key elements of the algo that (IMO) relate to kw's.
Insightful stuff here and in supporter's forum. I think you are on to something with the idea that they have turned up the restrictions beyond what is reasonable in an effort to knock out a higher percentage of scrapers.
I'd suggest it may even be calculated as a way to "purge" the scrapers and assume the legitimate sites will stick it out where the scrapers will bail out quickly, but that's wild speculation on my part.
I can absolutely confirm that it is not the higher PR pages that are winning. My personal theory is that the pages being displayed are getting there by hitting the semantically related kw mix best, and that there are also other sitewide factors now clearly coming into play.
Could your experience be consistent with Google keeping whichever page would score highest for a search without the near-dupe filter?
I think I see an easy way to check for that. Find two pages where one only reappears with &filter=0, pick searches that would favour one or the other, and see if the remaining URL (with the filter in place) switches according to the search.
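Once you've collected the result lists by hand, the comparison itself is simple. Here's a small sketch of the scoring step (the URLs and rankings below are invented for illustration; how you gather the SERPs is up to you):

```python
def kept_top_scorer(unfiltered, filtered, pair):
    """True if the near-dupe URL Google kept in the filtered SERP is also
    the higher-ranked of the pair in the unfiltered (&filter=0) SERP.

    `unfiltered`/`filtered` are ranked lists of URLs; `pair` is the two
    near-duplicate URLs under test.
    """
    a, b = pair
    kept = a if a in filtered else b
    better = a if unfiltered.index(a) < unfiltered.index(b) else b
    return kept == better

# Invented data: page A outranks page B without the filter, and A is
# also the one that survives filtering -- consistent with the theory.
unfiltered = ["site1.com/a", "other.com/x", "site2.com/b"]
filtered = ["site1.com/a", "other.com/x"]
print(kept_top_scorer(unfiltered, filtered, ("site1.com/a", "site2.com/b")))  # True
```

If that comes back False for searches biased toward either page, the "keep whichever would score highest" explanation doesn't hold for that pair.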
Will semantics come into play before the CRC, or the fingerprinting of the web content? Say similar terms like price, value, cost, worth (I know this is a bad example): are they the same, or considered close enough to count as duplicate content in a page?
Will Google decide on a default and apply the filter upon rating the full content of the page for similarity, before ranking?
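To make the CRC-versus-semantics question concrete, here's a toy comparison (my own sketch, not anything we know about Google's method): an exact CRC32 fingerprint breaks on a single synonym swap, while word shingles still show heavy overlap between the two near-duplicate pages.

```python
import zlib

def crc_fingerprint(text):
    # Exact checksum of the normalized text: swap one word
    # (price -> cost) and the whole value changes.
    return zlib.crc32(" ".join(text.lower().split()).encode())

def shingles(text, k=3):
    # All overlapping k-word sequences in the text.
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # Overlap between the two pages' shingle sets, 0.0 to 1.0.
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

page1 = "the best price for fluffy yellow widgets with pink trim"
page2 = "the best cost for fluffy yellow widgets with pink trim"

print(crc_fingerprint(page1) == crc_fingerprint(page2))  # False
print(round(jaccard(page1, page2), 2))                   # 0.45
```

So a pure checksum would call these two pages unrelated, while a shingle-overlap measure flags them as near-dupes even with the synonym swap; whether the swapped word is a *semantic* match (price vs. cost) never enters into either calculation.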
There are pages on my site that are obviously exact terms found elsewhere but don't get filtered, yet there are others that have only a small portion of duplicate content and go into supplemental results. There are pages with no duplicate content that go into supplemental results too.
Other than buggy, it could be google has 'shorted' itself out. Just ban the scrapers! hehe.
...and that there are also other sitewide factors now clearly coming into play.
Yes, I think that the 'other sitewide factors' are very much utilized in the new SERP's too.
Unfortunately, Google has become much more elitist in nature, favoring 'authorities' in any field.
E.g. Bush mentioning 'spirit' would make him an expert in spiritual matters and philosophy @Google.
Anyway, IMO PR is still very important for a total sitewide score.
The thing is, the site's 'reputation' (according to Google, that is) comes first, before the relevancy of the page. Thus even if your page is non-duplicate content, you will still rank lower, much lower, than pages on sites that (again according to Google) have little or no duplicate content site-wide.
CIML, as is so often the case when dup issues come up, I think there are multiple sorts of issues being discussed in this thread, partly my fault. If joeduck was referring only to pages that reappear when the &filter=0 is used, then I've been OT part of the time, because I'm actually exploring what appear to be two similar but not necessarily related examples, only one of which brings all pages up with the &filter=0 search.
Case One
This is actually a variety of cases, but all behaving similarly. The &filter=0 always adds the missing page back, sometimes under the winner, sometimes above the winner. But indented results are factoring in here too, and the differences between result sets are subtle. So much so that it is hard to work out what the determining factor(s) are.
But actually, I'm wondering if a lot of what people are discussing re dup content in various threads right now involves searches where the &filter=0 does not bring the missing pages back, as in the following:
Case Two
The filter does not help the vanished pages show. The vanished pages had previously been top performing, with TBPR of 6 (homepage TBPR 7). Now, similar (competitive) pages are in the place of our vanished pages. The competitive pages have TBPR of 4. Notably, the competitive pages stole part of what appears on our pages, though the competitive pages are much newer and show far lower TBPR.
I don't think that this is a hand check, partly for reasons I can't discuss, and partly because our vanished pages are more useful than the newer competitive pages showing in their place.
I don't know what the latest thinking is on &filter=0 but IMO, this only shows 'acceptable, similar, related' pages that were not shown on the filtered search. It's still very possible to be filtered for duplication of various kinds and not show up with the &filter=0 in place.
Marcia has an interesting theory that she posted in the Supporters thread I referenced in msg#11 above. If I understand her correctly, she's wondering if perhaps some dup filtering is occurring at a different time than before, possibly in the BlockRank phase, if they're even using that (I'd guess they are). The more I think about her theory and the way she connects it to the original BackRub paper, the more convinced I am that it would explain what I'm seeing here. But as I noted over there, if they're doing this, it seems to be a very blunt instrument approach to an issue that rests on subtleties (near duplicated pages, or stolen snippets of content).
This might account for some of the many recent complaints about quality subpages vanishing in the Bourbon update.
In one case, MY site was in the "similar results" list; in another it was way, way down in the listings. I don't know what criteria they use, but PR, inbound links from external sites, and the age of the text don't seem to be among them. The date-time stamp of the page, maybe; I've wondered about that, but rewarding webmasters for not keeping their sites up to date would be a curious way to run a search engine.
1. Spam with lots of duplicate pages.
2. Repetition, e.g. Google's 1460 pages about "Googol, Milton, etc."
[search.yahoo.com...]
3. Specialist niche site with a page about
"fluffy yellow widget pink trim",
"fluffy yellow widget yellow trim",
"fluffy yellow widget blue trim",
and so on.
So it tells you nothing about quality, relevance, or anything useful.
It's funny hearing engineers cover for Google with complicated discussions, but to paraphrase Google's own quality guides, "Another test would be 'would I feel comfortable explaining this to other search engines'?"
Here, search for "google googol milton":
[google.com...]
The first entry from Google.com comes in at 56, and it's a FRENCH page, and I'm searching in ENGLISH. Google has determined that the best page to present to me from its own site is the French one, and that Chinese pages (position 29) should outrank that.
Now let's run the test on other engines:
[search.yahoo.com...]
[search.msn.com...]
[walhello.info...]
[kanoodle.com...]
Google's site is top. Only Vroosh lets us down by showing it 2nd.
Clearly they have made a mess of it themselves, and the only people who can clean up the mess are Google.
If joeduck was referring only to pages that reappear when the &filter=0 is used, then I've been OT part of the time, because I'm actually exploring what appear to be two similar but not necessarily related examples, only one of which brings all pages up with the &filter=0 search.
You are totally on topic as far as I'm concerned Caveman. I've noted both the cases you describe, but assumed (wrongly/simplistically?) that they are related to a sitewide downgrade that somehow devalues all results from the unlucky site.
Some of those devalued pages make it into "omitted results" and some do not. I'm not sure I'm understanding Marcia's observations but will review her posts here and in supporters. Google is damaging my brain again.
The dup issue as it relates to site-wide issues within a site isn't the same thing as dup issues with external or scraped pages, particularly when it comes to IDF and/or site-wide anchor text.
Marcia - it certainly makes sense that the two would be treated differently. Do you think &filter=0 removes both types? Are there any other filtering terms?
My site uses a navigational menu, but the choices in that menu are the keywords that are also in the titles of my pages. Is it possible that that kind of optimized menu could be causing an over-optimization penalty or a dup content penalty (always using the same anchor text)?
So what is the best way to make a site-wide menu still optimized, but that does not feel spammy to Google?
Thanks
The newer dupe factor doesn't remove URLs, but causes the rankings to be lower. This appears to apply to URLs across all searches, and is therefore believed to be applied at index time (or periodically). It helps to keep all Google result sets relatively free of high ranking feed pages.
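That "demote rather than remove" description could be sketched like this (a toy model with an invented demotion factor, not Google's actual pipeline): URLs sharing a content fingerprint form a cluster, the strongest member keeps its score, and the rest take a fixed demotion that then applies uniformly across every search.

```python
DEMOTION = 0.5  # invented factor for illustration

def index_time_demotion(pages):
    """pages: {url: (fingerprint, base_score)} -> {url: effective_score}.

    Applied once at index time, so the demotion follows the URL into
    every result set rather than being recomputed per query.
    """
    # Pick the strongest page in each fingerprint cluster as canonical.
    best = {}
    for url, (fp, score) in pages.items():
        if fp not in best or score > pages[best[fp]][1]:
            best[fp] = url
    # Everyone else in the cluster stays indexed, but ranks lower.
    return {
        url: score if best[fp] == url else score * DEMOTION
        for url, (fp, score) in pages.items()
    }

pages = {
    "original.com/page": ("abc123", 0.9),
    "scraper.net/copy":  ("abc123", 0.6),   # same fingerprint, weaker page
    "unique.org/post":   ("def456", 0.7),
}
scores = index_time_demotion(pages)
print(scores["scraper.net/copy"])  # 0.3: still indexed, just ranked lower
```

The point of the sketch is the behaviour ciml describes: the scraped copy never disappears from the index, it just loses across all searches at once.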
<<Is it possible that that kind of optimized menu could be causing a over optimization penalty or a dup content penalty ( using always the same anchor text?>>
No problem, patchacoutek.
I wouldn't be so sure. It could be that your experience tells otherwise, but I agree with the following:
" Synthetically generated web graphs, which are usually indicative of an intent to spam, are based on coordinated decisions, causing the profile of growth in anchor words/bigrams/phrases to likely be relatively spiky.
One reason for such spikiness may be the addition of a large number of identical anchors from many documents. "