I'm taking my Algorithms course this semester :o and I can't figure out what kind of algorithm Google uses to detect that two pages are duplicates.
google " 2004 Google - Searching 4,285,199,774 web pages "
if they have 4,285,199,774 pages how can they know if 2 pages are the same?
(4,285,199,774!) for each page? sure no ...
any idea?
Obviously, it is much more complicated than that, but it is the concept of pattern recognition that is the heart of duplicate page detection.
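One common family of techniques behind near-duplicate detection is shingling: fingerprint each page by hashing its overlapping word n-grams ("shingles") and compare compact sketches instead of full texts. Here is a minimal toy sketch of that idea (the function names, parameters, and sample texts are my own illustration, not Google's actual method):

```python
import hashlib

def shingles(text, w=4):
    """Split text into overlapping word w-grams ('shingles')."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}

def fingerprint(shingle_set, k=8):
    """Keep only the k smallest shingle hashes as a compact sketch."""
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                    for s in shingle_set)
    return set(hashes[:k])

def resemblance(a, b):
    """Jaccard similarity of two sketches: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

p1 = "the quick brown fox jumps over the lazy dog near the river bank"
p2 = "the quick brown fox jumps over the lazy dog near the river bend"
p3 = "completely different content about search engine index structures"

s1, s2, s3 = (fingerprint(shingles(p)) for p in (p1, p2, p3))
print(resemblance(s1, s2))  # high: near-duplicates
print(resemblance(s1, s3))  # zero: no shared shingles
```

The point is that two almost-identical pages share almost all their shingles, so their sketches overlap heavily, while unrelated pages overlap hardly at all. The Stanford papers linked below discuss schemes of this kind in depth.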
Evaluating Strategies for Similarity Search on the Web [dbpubs.stanford.edu]
Finding replicated web collections [dbpubs.stanford.edu]
This is the full text of Google's patent
Detecting duplicate and near-duplicate files [patft.uspto.gov]
Enjoy!
> (4,285,199,774!)
2) Finding duplicates wouldn't take a factorial number of comparisons, just roughly the square divided by two, because you don't have to check each pair in both directions: n(n-1)/2 comparisons for n pages.
That means a brute-force attack on an unordered set would be two nested loops, with the inner one counting down from about 2^32 steps to 1.
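The size of that quadratic bound is easy to check with a little arithmetic:

```python
# Number of unordered pairs among n pages: n * (n - 1) / 2,
# far smaller than n! but still astronomically many for n ~ 4.3 billion.
n = 4_285_199_774
pairs = n * (n - 1) // 2
print(f"{pairs:.3e}")  # on the order of 9.2e18 comparisons - still infeasible
```

So even the "square divided by two" count rules out brute force, which is why sorting or indexing by some per-document key matters so much.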
I guess Google works with dozens of different indices on the stored documents, such as length, type, date, checksum, etc. If you have the URLs sorted by e.g. length, you only have to compare each entry with its predecessor, which is well within the scope of a single PC, provided the page caches are stored elsewhere and are easily accessed.
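The sort-and-compare-neighbors idea above could be sketched like this (a toy illustration using a checksum as the sort key; the URLs and page bodies are made up):

```python
import hashlib

def checksum(content: str) -> str:
    """A cheap per-document key; identical pages get identical checksums."""
    return hashlib.md5(content.encode()).hexdigest()

pages = {
    "http://a.example/page": "hello world",
    "http://b.example/copy": "hello world",      # exact duplicate
    "http://c.example/other": "something else",
}

# Sort (key, url) records by key, then compare each record only with its
# predecessor: O(n log n) for the sort instead of O(n^2) pairwise checks.
records = sorted((checksum(body), url) for url, body in pages.items())
duplicates = [
    (records[i - 1][1], records[i][1])
    for i in range(1, len(records))
    if records[i][0] == records[i - 1][0]
]
print(duplicates)  # the two URLs serving identical content
```

An exact checksum only catches byte-identical pages, of course; catching near-duplicates needs a fuzzier fingerprint, but the sort-then-scan-neighbors structure stays the same.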