Forum Moderators: phranque

Duplicate Content? How is it determined?

Need your opinion or expertise.

6:37 am on Mar 8, 2006 (gmt 0)

5+ Year Member

OK, so I know duplicate content is penalized. Let's say I have site A and site B on two different URLs, two different hosts, and two different IP addresses. I have an article on site A, and on site B I have the same article with a few extra words or a few words substituted.

Now, does a search engine bot pick up on this? Or does duplicate content simply mean a word-for-word match?

6:58 am on Mar 8, 2006 (gmt 0)

10+ Year Member

On one level, sites that are mirrors of each other may be identified when sites are crawled, and a choice may be made not to index some mirror sites.

During the serving of results, snippets and titles may be compared using fingerprints of the text surrounding the query words, and some results may be filtered out of the search results.

There are a couple of patents from Google on duplicate content that describe approaches that Google is likely using. Chances are good that similar methods are employed in other search engines.
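
To make the second idea concrete, here is a minimal sketch of filtering results by a fingerprint of the text around the query words. It is only an illustration and is not taken from the patents: the function names, the five-word window, and the use of SHA-1 are all assumptions chosen for the example.

```python
import hashlib
import re


def query_snippet(text, query, window=5):
    """Return the words within `window` positions of any query term."""
    words = re.findall(r"\w+", text.lower())
    terms = set(re.findall(r"\w+", query.lower()))
    keep = set()
    for i, word in enumerate(words):
        if word in terms:
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            keep.update(range(start, end))
    return " ".join(words[i] for i in sorted(keep))


def snippet_fingerprint(text, query, window=5):
    """Hash the query-centred snippet (SHA-1 is an arbitrary choice here)."""
    snippet = query_snippet(text, query, window)
    return hashlib.sha1(snippet.encode("utf-8")).hexdigest()


def filter_results(results, query, window=5):
    """Keep only the first URL for each distinct snippet fingerprint.

    `results` is a list of (url, page_text) pairs in ranked order.
    """
    seen = set()
    kept = []
    for url, text in results:
        fp = snippet_fingerprint(text, query, window)
        if fp not in seen:
            seen.add(fp)
            kept.append(url)
    return kept
```

With this sketch, two pages whose text is identical around the query words collapse to one result even if the rest of each page differs, which is why this kind of filtering depends on the query.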

7:39 am on Mar 8, 2006 (gmt 0)

5+ Year Member

Can you link some?

7:51 am on Mar 8, 2006 (gmt 0)

10+ Year Member

Here are the two Google patents on duplicate content:

Detecting duplicate and near-duplicate files [patft.uspto.gov]

Detecting query-specific duplicate documents [patft.uspto.gov]

These were granted over two years ago and filed with the USPTO more than five years ago, so even if Google did use them, it may have moved on since then. Regardless, both describe methods that look like they would be effective.

7:54 am on Mar 8, 2006 (gmt 0)

WebmasterWorld Senior Member marcia is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Google patents:

Detecting duplicate and near-duplicate files [patft.uspto.gov]

Detecting query-specific duplicate documents [patft.uspto.gov]

Stanford paper, authors N. Shivakumar and Hector Garcia-Molina:

Finding near-replica documents on the web [dbpubs.stanford.edu]

8:01 am on Mar 8, 2006 (gmt 0)

5+ Year Member

Thank you.

I wonder to what extent a "fingerprint" of a document is taken. For example:

A website has the opening sentence:

Welcome to my site.


Welcome to my website


Welcome to this website

etc. Could those all basically have a fingerprint?
Would they all be considered duplicate content or near-duplicate content?

I guess no one really knows how far this kind of system goes.
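
One widely described way to compare texts like the three openings above is word shingling with Jaccard similarity. The sketch below is only an illustration of that general technique, not a claim about what any search engine actually does; the function names and the shingle size of 2 are assumptions made for the example.

```python
def shingles(text, k=2):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of overlap / size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


openings = [
    "Welcome to my site",
    "Welcome to my website",
    "Welcome to this website",
]

sets = [shingles(s) for s in openings]
for i in range(len(openings)):
    for j in range(i + 1, len(openings)):
        score = jaccard(sets[i], sets[j])
        print(f"{openings[i]!r} vs {openings[j]!r}: {score:.2f}")
```

Run as written, the first two openings score 0.50 and the other two pairs score 0.20. A single short sentence gives very few shingles, so scores like these swing a lot on one changed word; a comparison over a whole page would be far less sensitive to a greeting line.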

2:18 am on Apr 7, 2006 (gmt 0)

10+ Year Member

Hi, my search engine notes duplicate content as it indexes a website; the results show where the content is duplicated and how close the match is.
The spider does this as the results come back to the search engine.
We often get pages of duplicate results. Some are where a site has similar content (75% content match), but some are just duplicate content hosted on a related website (owner's name or details the same). Quite often the domain names are similar or are versions of the same domain, like a dot net and a dot com site.
One webmaster, who shall remain nameless, registered ten domains with similar names, all hosted on the same server. The content was the same down to the file sizes and dates; the only difference was the links pages (they all pointed to themselves). YAWN!
Thing is, if we were to index them all, they would still score the same and show in the same results, so does he expect the visitor to go to all his sites or just some of them?
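
For anyone wondering what a "how close the match is" figure could look like, here is a rough sketch using Python's standard difflib module. It is not how the search engine described above works (that is not stated anywhere in this thread); the function names and the 75% and 99% cut-offs are assumptions picked purely for illustration.

```python
from difflib import SequenceMatcher


def match_percent(page_a, page_b):
    """Word-level similarity of two page texts, as a percentage (0 to 100)."""
    words_a = page_a.lower().split()
    words_b = page_b.lower().split()
    return 100.0 * SequenceMatcher(None, words_a, words_b).ratio()


def classify(page_a, page_b, similar_at=75.0, duplicate_at=99.0):
    """Label a pair of pages using hypothetical percentage thresholds."""
    pct = match_percent(page_a, page_b)
    if pct >= duplicate_at:
        label = "duplicate"
    elif pct >= similar_at:
        label = "similar"
    else:
        label = "distinct"
    return label, pct


if __name__ == "__main__":
    print(classify("one two three four", "one two three five"))
```

Comparing every page against every other page this way gets slow quickly on a large index, which is one reason fingerprint-style methods are usually discussed instead.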