Here are some Google patents on detecting and handling duplicate content. They should give you a picture of the various approaches Google has been trying:
Methods and apparatus for estimating similarity [patft.uspto.gov]
Detecting duplicate and near-duplicate files [patft.uspto.gov]
Detecting query-specific duplicate documents [patft.uspto.gov]
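
For a rough feel of what "estimating similarity" between two documents can mean in practice, here is a minimal sketch of one well-known general technique: word shingling combined with Jaccard similarity. This is an illustration only and is not taken from any of the patents listed above. The function names and the shingle size `k=3` are assumptions chosen for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than k words: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0  # two empty documents are considered identical
    return len(a & b) / len(a | b)


def similarity(doc_a, doc_b, k=3):
    """Estimate how similar two documents are, from 0.0 to 1.0."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k))


if __name__ == "__main__":
    page_1 = "cheap blue widgets for sale in our online store today"
    page_2 = "cheap blue widgets for sale in our online shop today"
    print(similarity(page_1, page_2))
```

A score near 1.0 suggests near-duplicate pages, and a score near 0.0 suggests unrelated ones. Where to set the cutoff is a judgment call, and a real system at search-engine scale would likely need something far more efficient than comparing full shingle sets pairwise.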