msgraph - 4:50 pm on May 23, 2003 (gmt 0)
Basically there is no clean answer. If you want to understand some of the technology or theories out there, then you need to start reading. Sure, you can take the easy way out and ask site owners for examples and theories, but what better way to learn than from those who study these problems for the search engines themselves?

A good starting point for the challenges is here: Section 4, Duplicate Hosts, in Algorithmic Challenges in Web Search Engines [internetmathematics.org], published in Volume 1.1 of the Journal of Internet Mathematics by Monika R. Henzinger (Research Director - Google, Inc.), 2003. Follow and read every reference listed in that section and you will get a good idea of how duplication detection works and what its challenges are.

Note: This does not imply that Google currently employs any or all of these methods, although I'm sure they use a large part of them.

The bottom line is that straight or very-near duplication, similar site structures, and similar sites hosted on the same server can be detected easily. When you start to get into paragraph/article duplication, things get fuzzy and detection is very difficult, with the "determined" authority beating out the rest.
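To make the "very-near duplication can be detected easily" point concrete, here is a minimal sketch of one well-known approach: break each page into overlapping word sequences ("shingles") and compare the two sets by their overlap. This is only an illustration. The function names, the shingle size of 4 and the sample text are my own assumptions, and nothing here claims to be what Google or any other engine actually runs.

```python
def shingles(text, k=4):
    """Return the set of overlapping k-word sequences in text.

    k=4 is an arbitrary choice for this sketch, not a known production value.
    """
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=4):
    """Jaccard overlap of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents count as identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample pages that differ only in the last word.
page_a = "the quick brown fox jumps over the lazy dog near the river bank"
page_b = "the quick brown fox jumps over the lazy dog near the river shore"
print(resemblance(page_a, page_b))
```

Two pages that differ by a word or two score close to 1.0, which is why straight copies are the easy case. It also hints at why paragraph/article duplication is harder: one lifted paragraph inside an otherwise different page contributes only a small share of the shingles, so the overall score stays low.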
There have been loads of threads on how search engines, specifically Google, try to detect duplicates or near duplicates.