New Duplicate Content Filter?

Forum Moderators: Robert Charlton & goodroi

irldonalb

5:20 pm on Dec 17, 2007 (gmt 0)

I just searched Google for a movie quote. It was actually a quote from shortbus1662’s Liar Liar post here [webmasterworld.com] - "Because if I take it to small claims court"

After just one result I got the standard Google message -

In order to show you the most relevant results, we have omitted some entries very similar to the 1 already displayed.
If you like, you can repeat the search with the omitted results included.

I've seen this message before but I don't recall it blocking different websites. I thought it was used to block the same page/content within a site.

Is this new? It seems ridiculous, considering the sites have nothing in common except that line.

tedster

9:59 pm on Dec 17, 2007 (gmt 0)

I've seen this from time to time. Even though the filtered sites may be somewhat different, I guess the idea is that relative to this particular query, they are pretty much duplicates.

lorien1973

10:32 pm on Dec 17, 2007 (gmt 0)

That's gonna put a real damper on electronics stores that all have essentially the same item descriptions (content) for many of their pages.

Marcia

4:13 am on Dec 18, 2007 (gmt 0)

It's what I call the "similarity" filter, and it isn't new. It'll block pages with the same page title/meta description/page_top text (and/or, most usually "and"), no matter what site they're on.

Technically, it may not be duplicate pages, and there is another paper (or patent application) that's more relevant, and deals more with query specific clustering and filtering, but this one gives a good idea of the concept:

Detecting query-specific duplicate documents [patft.uspto.gov]

The idea is that "portions" of pages are being used, what other papers sometimes refer to as "footprints" or "fingerprints."

[edited by: Marcia at 4:32 am (utc) on Dec. 18, 2007]

tedster

4:54 am on Dec 18, 2007 (gmt 0)

Good point, Marcia. The Abstract section of that patent says it all: "An improved duplicate detection technique that uses query-relevant information to limit the portion(s) of documents to be compared for similarity."

"That's gonna put a real damper on electronics stores that all have essentially the same item descriptions (content) for many of their pages."

If someone is searching on the words that are in that duplicated copy, then they probably don't need a full selection of websites where it appears - at least not on the first result set. But if their query is more like a brand name and model number, then because the duplicate detection is query-dependent, the filtering of results would also be different and usually much more relaxed.
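
To make the query-dependent idea concrete, here is a rough Python sketch. It is not Google's method or the patent's actual procedure; the function names, the two-word window, the three-word shingles, the 0.6 threshold, and the two made-up store pages are all assumptions chosen for illustration. It keeps only the words near the query terms in each page, compares those parts, and drops a lower-ranked page whose query-relevant part is too similar to one already kept.

```python
import re

def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def query_relevant_part(doc, query, window=2):
    """Keep only the words within `window` positions of any query term."""
    words = tokens(doc)
    terms = set(tokens(query))
    keep = set()
    for i, word in enumerate(words):
        if word in terms:
            keep.update(range(max(0, i - window), min(len(words), i + window + 1)))
    # Kept words are simply joined in order, even if the windows are not adjacent.
    return [words[i] for i in sorted(keep)]

def shingles(words, k=3):
    """Set of overlapping k-word sequences."""
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b):
    """Jaccard similarity of the shingle sets of two word lists."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def filter_results(ranked_docs, query, threshold=0.6):
    """Walk the ranked list; keep a doc only if its query-relevant part
    is not too similar to the part of any doc already kept."""
    kept, kept_parts = [], []
    for doc in ranked_docs:
        part = query_relevant_part(doc, query)
        if all(similarity(part, p) < threshold for p in kept_parts):
            kept.append(doc)
            kept_parts.append(part)
    return kept

# Two hypothetical store pages sharing one manufacturer sentence.
store_a = ("Acme X100 camera. Captures stunning 12 megapixel photos with fast "
           "autofocus. Free shipping from Store A on all orders this week.")
store_b = ("Buy the Acme X100 at Store B today. Captures stunning 12 megapixel "
           "photos with fast autofocus. Ask about our extended warranty plans.")

# Query taken from the shared description: only store_a survives.
print(len(filter_results([store_a, store_b], "stunning megapixel photos")))  # 1
# Brand and model query: the text around the terms differs, so both stay.
print(len(filter_results([store_a, store_b], "Acme X100")))  # 2
```

The same two pages count as duplicates for one query and not for the other, which is the behavior described above for shared item descriptions versus brand-and-model searches.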

Robert Charlton

7:12 am on Dec 18, 2007 (gmt 0)

From the patent:

...a final set of query results, wherein the final set of query results is a sub-set of the ranked query results, and wherein the final set of query results does not include any two query results corresponding to documents that have similar query-relevant parts....

Google's treatment of dupe content is very definitely query related, and it also seems to relate to how a site might be linked. Inbound links related to a query will mitigate what's seen as duplicate content. (You might say that this, in a way, reflects the off-page portion of Google's ranking algo).

I've seen scraped pages temporarily get filtered out as dupes for, say, a full ten-word sentence in quotes that doesn't contain any "optimized" terms, yet continue to rank as normal for a competitive phrase for which they had good inbound anchor text.

I also continue to see multiple copies (that are pretty much identical) of authoritative articles rank on different domains for competitive searches if both copies have good backlinks.

ChicagoFan67

8:29 am on Dec 18, 2007 (gmt 0)

I was ranking No. 1 for a four-word search phrase (which is the page title) for well over a year. Just this last weekend gone, someone copied the entire content, including image links, of a five-page tutorial from my site and posted it on a forum. That forum page now ranks #1 and my page has been dumped in the supplementals. The poster used my page title as the title of their post. I'm still ranking well for other related search phrases.

I'm so tempted to force removal of that post by switching those hotlinked images with something much less savoury.

The interesting thing is, that forum page is not accessible unless someone is registered and logged into that site. There is currently no cache of the page, although there was one showing 2 days ago.

[edited by: ChicagoFan67 at 8:49 am (utc) on Dec. 18, 2007]

irldonalb

9:55 am on Dec 18, 2007 (gmt 0)

"It's what I call the "similarity" filter, and it isn't new. It'll block pages with the same page title/meta description/page_top text (and/or, most usually "and"), no matter what site they're on."

I do remember Marcia talking about this before but I thought it wouldn't come down to sentences. Am I right in saying that famous quotes, presidential speeches, etc. will be included in this filter?

potentialgeek

3:59 am on Dec 20, 2007 (gmt 0)

Why does Google still find it so difficult to figure out who is the original source of text, and who copied it? Is this really an engineering nightmare?

You know there are sites/programs that "track" text, to see who, if anyone, lifts content from other sites. Why can't Google create code to do the same thing or buy the company?

It has cached site pages, so why doesn't it compare the cached text, and stop rewarding thieves with top SERPs!?!

Hint to Google: if one site has content before another, the one that got it later might be the thief.

p/g

BradleyT

5:39 am on Dec 20, 2007 (gmt 0)

"Why does Google still find it so difficult to figure out who is the original source of text, and who copied it? Is this really an engineering nightmare?"

Using the OP's quote as an example, what if the director of Liar Liar started a website and put that quote on there - how would Google know that website should be considered the original source of the text?

Whitey

8:08 am on Dec 28, 2007 (gmt 0)

Is there a difference between "duplicate content" and "similar content" as far as Google's filters are concerned?

I think not, although the filters may be applied in different ways to different situations. I think it should all be called "duplicate content" in "Webmaster language" to avoid confusion.

If a block of text is identical, then it's a duplicate block of content. Google will often filter it as described.

If a page contains the same block of text amongst other differing text, then it may not be considered duplicate, depending on what proportion of the page / URL that block makes up.
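
The "proportion" point can be sketched in Python as follows. This is only an illustration under assumptions: the function names, the four-word shingles, and the sample review page are made up, and nothing here is known to match how Google measures it. The function returns the fraction of one page's four-word sequences that also appear on another page.

```python
import re

def word_shingles(text, k=4):
    """Set of overlapping k-word sequences from lowercase word tokens."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def shared_proportion(page, other, k=4):
    """Fraction of `page`'s k-word shingles that also occur in `other`."""
    own = word_shingles(page, k)
    if not own:
        return 0.0
    return len(own & word_shingles(other, k)) / len(own)

# The movie quote from the opening post, and a hypothetical longer page
# that contains it amongst other text.
quote = "Because if I take it to small claims court"
review = ("My review of the film covers the plot, the cast and the soundtrack "
          "in detail. " + quote + " is the line everyone remembers.")

print(shared_proportion(quote, review))   # the quote is wholly contained in the review
print(shared_proportion(review, quote))   # only a small share of the review is the quote
```

Measured from the quote-only page, everything is shared; measured from the longer page, the shared block is a minority of the text, which is the situation where it "may not be considered duplicate".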