
Duplication algorithm

How does Google know?

         

Blackguy

5:33 am on Oct 15, 2004 (gmt 0)

10+ Year Member



Hello, I have a little question about Google duplicate entries...

I'm taking my algorithms course this semester :o and I can't think of what kind of algorithm Google uses to see that two pages are duplicates.

Google: "2004 Google - Searching 4,285,199,774 web pages"

If they have 4,285,199,774 pages, how can they know whether two pages are the same?

(4,285,199,774!) for each page? Surely not...
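For scale: comparing every unordered pair of pages is n(n-1)/2 comparisons, not n! — still far too many to be practical. A quick sketch, using the page count quoted above:

```python
# Brute-force duplicate detection compares every unordered pair of pages:
# n * (n - 1) / 2 comparisons (each pair checked once, order irrelevant).

n = 4_285_199_774  # page count from the 2004 Google homepage, quoted above
pairs = n * (n - 1) // 2
print(pairs)  # roughly 9.2 * 10**18 comparisons -- clearly infeasible
```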

Any ideas?

Brett_Tabke

12:36 pm on Oct 15, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Each page is given a sliding checksum. A simple comparison lookup of the checksum value is done, and then the page itself is examined to determine whether it has a duplicate. This is nearly the same way that templates are stripped out...

Obviously, it is much more complicated than that, but it is the concept of pattern recognition that is the heart of duplicate page detection.
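A minimal sketch of the idea described above, assuming a simple polynomial hash over sliding word "shingles" and Jaccard overlap as the similarity measure — Google's actual algorithm is not public, and all names here are illustrative:

```python
def shingle_hashes(text, k=4, base=257, mod=(1 << 31) - 1):
    """Return the set of hashes of all sliding k-word windows in the text."""
    words = text.lower().split()
    hashes = set()
    for i in range(len(words) - k + 1):
        # Polynomial hash of the k-word window (an assumed, simple scheme).
        h = 0
        for ch in " ".join(words[i:i + k]):
            h = (h * base + ord(ch)) % mod
        hashes.add(h)
    return hashes

def similarity(a, b):
    """Fraction of shared shingle hashes (Jaccard similarity)."""
    ha, hb = shingle_hashes(a), shingle_hashes(b)
    if not ha and not hb:
        return 1.0
    return len(ha & hb) / len(ha | hb)
```

Identical pages score 1.0, unrelated pages near 0.0, and near-duplicates land in between — so a threshold on this score flags duplicate candidates without comparing full page texts.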

Marcia

3:49 am on Oct 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If it's for a course, you might want to include these in your reading. From the Stanford document server:

Evaluating Strategies for Similarity Search on the Web [dbpubs.stanford.edu]

Finding replicated web collections [dbpubs.stanford.edu]

This is the full text of Google's patent

Detecting duplicate and near-duplicate files [patft.uspto.gov]

Enjoy!

Oliver Henniges

11:29 am on Oct 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



1) By now it's far more than 4,285,199,774 pages.

> (4,285,199,774!)

2) Finding duplicates wouldn't be a factorial function, just a square divided by two, because you don't have to check each pair in both directions.
That means you'd have two loops, with the inner one decreasing from 2^32 down to 1 steps, in a brute-force attack on an unordered set.

I guess Google is working with dozens of different indices, such as length, type, date, checksum, etc. of the stored documents. If you have the URLs ordered by, e.g., length, you'd only have to compare each URL with its predecessor, which is well within the scope of a single PC, provided the page caches are stored elsewhere and easily accessed.
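The sort-one-index idea above can be sketched as follows: order pages by a cheap key and only run the expensive full comparison on neighbours that share the key. CRC32 as the checksum and the URLs and page bodies are illustrative assumptions, not anything the thread specifies:

```python
import zlib

# Hypothetical pages: two identical, one different.
pages = {
    "a.example/page1": b"the quick brown fox",
    "b.example/copy":  b"the quick brown fox",
    "c.example/other": b"something else entirely",
}

# Sort URLs by checksum; duplicate candidates end up adjacent.
ordered = sorted(pages, key=lambda url: zlib.crc32(pages[url]))

duplicates = []
for prev, cur in zip(ordered, ordered[1:]):
    if zlib.crc32(pages[prev]) == zlib.crc32(pages[cur]):
        # Checksums can collide, so confirm with a full comparison.
        if pages[prev] == pages[cur]:
            duplicates.append((prev, cur))

print(duplicates)  # [('a.example/page1', 'b.example/copy')]
```

Sorting costs O(n log n) instead of the O(n^2) of all-pairs comparison, which is what makes the predecessor-only check feasible.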

cyberprosper

11:41 am on Oct 18, 2004 (gmt 0)

10+ Year Member



Can anybody explain in English what that patent says?