Now, does a search engine bot pick up on this? Or is duplicate content simply a word-for-word match?
When results are served, snippets and titles may be compared using fingerprints of the text surrounding the query words, and some results may then be filtered out of the search results.
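To make that idea concrete, here is a rough sketch of query-time filtering: take the words around each query term, hash them, and drop any result whose hash has already been seen. This is only an illustration, not what any search engine is known to run; the function names, the `window` size, and the use of SHA-1 are all assumptions made up for the example.

```python
import hashlib
import re


def snippet(text, query, window=5):
    """Return the words within `window` positions of any query term."""
    words = re.findall(r"\w+", text.lower())
    terms = set(re.findall(r"\w+", query.lower()))
    keep = set()
    for i, word in enumerate(words):
        if word in terms:
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            keep.update(range(start, end))
    return " ".join(words[i] for i in sorted(keep))


def fingerprint(text):
    """Hash a snippet; SHA-1 is an arbitrary choice for this sketch."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def filter_results(docs, query, window=5):
    """Keep only the first document for each distinct snippet fingerprint."""
    seen = set()
    kept = []
    for doc in docs:
        fp = fingerprint(snippet(doc, query, window))
        if fp not in seen:
            seen.add(fp)
            kept.append(doc)
    return kept


docs = [
    "Site A header. Cheap blue widgets for sale today only. Footer A",
    "Totally different nav. Cheap blue widgets for sale today only. Other footer text",
    "Red widgets are great",
]
unique = filter_results(docs, "widgets", window=2)
```

Here the first two pages differ in their headers and footers, but the text around "widgets" is identical, so only one of them survives the filter. An exact hash like this only catches snippets that match word for word; a real system would presumably need something more tolerant.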
There are a couple of patents from Google on duplicate content that describe approaches that Google is likely using. Chances are good that similar methods are employed in other search engines.
Detecting duplicate and near-duplicate files [patft.uspto.gov]
Detecting query-specific duplicate documents [patft.uspto.gov]
These were granted over two years ago and filed with the USPTO more than five years ago, so even if Google did use them, it may well have moved on since then. Regardless, both describe methods that look like they would be effective.
Stanford paper, authors N. Shivakumar and Hector Garcia-Molina:
Finding near-replica documents on the web [dbpubs.stanford.edu]
I wonder to what extent a "fingerprint" of a document is taken. For example:
A website has the opening sentence:
Welcome to my site.
Another
Welcome to my website
Another
Welcome to this website
etc. Could those all basically end up with the same fingerprint?
Would that all be considered duplicate content or near duplicate content?
I guess no one really knows how far this kind of system goes.
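One way to get a feel for it is to try a textbook near-duplicate measure on those three sentences: split each into overlapping word pairs ("shingles") and compare the sets with Jaccard similarity. To be clear, this is a generic technique and not necessarily what the patents or any search engine actually do; the shingle size `k=2` and the function names are assumptions picked for the example.

```python
import re


def shingles(text, k=2):
    """Return the set of k-word sequences (shingles) in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short to form a full shingle: treat the whole text as one.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=2):
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


s1 = "Welcome to my site."
s2 = "Welcome to my website"
s3 = "Welcome to this website"

sim_12 = jaccard(s1, s2)  # shares "welcome to" and "to my": 2 of 4 shingles
sim_13 = jaccard(s1, s3)  # shares only "welcome to": 1 of 5 shingles
sim_23 = jaccard(s2, s3)  # shares only "welcome to": 1 of 5 shingles
```

With these settings the first two sentences score 0.5 and the other pairs 0.2, so even one changed word moves the score a lot on a short sentence. On a full page, a single shared opening line would contribute only a small fraction of the shingles, so under a measure like this it would take far more overlap than a similar greeting to look like a near duplicate. Where a real engine sets its threshold is anyone's guess.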