Duplicates and the challenges search engines face


msgraph - 4:50 pm on May 23, 2003 (gmt 0)


There have been loads of threads on how search engines, specifically Google, try to detect duplicates or near duplicates.

Basically, there is no clean answer. If you want to understand some of the technology or theories out there, then you need to start reading. Sure, you can take the easy way out and ask site owners for examples and theories, but what better way to learn than from those who study these problems for the search engines themselves?

A good starting point for the challenges is here:

Section 4. Duplicate Hosts

Algorithmic Challenges in Web Search Engines [internetmathematics.org]

Published in Internet Mathematics, Volume 1, No. 1 (2003), by Monika R. Henzinger (Research Director, Google, Inc.)

Follow and read every reference listed in that section and you will get a good idea of how duplicate detection works and the challenges it faces.

Note: This does not imply that Google currently employs any or all of these methods, although I'm sure they use a large part of them.

The bottom line is that straight or very-near duplication, similar site structures, and similar sites hosted on the same server can be detected easily. When you start to get into duplication of individual paragraphs or articles, things get fuzzy and detection becomes very difficult, with the "determined" authority (the copy the engine decides is the authoritative one) beating out the rest.
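
To make the easy case versus the fuzzy case a bit more concrete, here is a minimal sketch of one well-known near-duplicate technique: word shingling compared with Jaccard resemblance. This is only an illustration. Nothing in this thread says that Google uses exactly this, and the function names, the tokenizer, and the shingle size k=4 are all assumptions chosen for the example.

```python
import re


def shingles(text, k=4):
    """Return the set of k-word shingles (overlapping word windows) in text.

    The tokenizer (lowercase, letters and digits only) and the default k
    are illustrative assumptions, not values taken from any search engine.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat the whole thing as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=4):
    """Jaccard similarity of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are trivially identical
    return len(sa & sb) / len(sa | sb)


def containment(part, whole, k=4):
    """Fraction of part's shingles that also appear in whole."""
    sp, sw = shingles(part, k), shingles(whole, k)
    if not sp:
        return 0.0
    return len(sp & sw) / len(sp)
```

Two pages that differ by a word or two score a resemblance close to 1, which is the easy case. A page that lifts a single paragraph into otherwise original text scores a low resemblance against the source page even though containment of that paragraph is high, and a score like that says nothing about which copy came first. That is roughly where things get fuzzy.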


Thread source: http://www.webmasterworld.com/sem_seo_research/522.htm