
Duplication algorithm

How does Google know?

         

Blackguy

5:33 am on Oct 15, 2004 (gmt 0)

10+ Year Member



Hello, I have a little question about Google duplicate entries...

I'm taking my algorithms course this semester :o and I can't think of what kind of algorithm Google uses to see that two pages are duplicates.

Google: "2004 Google - Searching 4,285,199,774 web pages"

If they have 4,285,199,774 pages, how can they know whether two pages are the same?

(4,285,199,774!) for each page? Surely not...
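For scale: comparing every unordered pair of pages is n(n-1)/2 comparisons, not n! — still far too many to be practical. A quick sketch, using the page count quoted above:

```python
# Brute-force duplicate detection compares every unordered pair of pages:
# n * (n - 1) / 2 comparisons (each pair checked once, order irrelevant).

n = 4_285_199_774  # page count from the 2004 Google homepage, quoted above
pairs = n * (n - 1) // 2
print(pairs)  # roughly 9.2 * 10**18 comparisons -- clearly infeasible
```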

Any ideas?

Brett_Tabke

12:36 pm on Oct 15, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Each page is given a sliding checksum. A simple comparison lookup of the checksum value is done, and then the page itself is examined to determine whether it has a duplicate. This is nearly the same way that templates are stripped out...

Obviously, it is much more complicated than that, but it is the concept of pattern recognition that is the heart of duplicate page detection.
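A minimal sketch of the idea described above, assuming a simple polynomial hash over sliding word "shingles" and Jaccard overlap as the similarity measure — Google's actual algorithm is not public, and all names here are illustrative:

```python
def shingle_hashes(text, k=4, base=257, mod=(1 << 31) - 1):
    """Return the set of hashes of all sliding k-word windows in the text."""
    words = text.lower().split()
    hashes = set()
    for i in range(len(words) - k + 1):
        # Polynomial hash of the k-word window (an assumed, simple scheme).
        h = 0
        for ch in " ".join(words[i:i + k]):
            h = (h * base + ord(ch)) % mod
        hashes.add(h)
    return hashes

def similarity(a, b):
    """Fraction of shared shingle hashes (Jaccard similarity)."""
    ha, hb = shingle_hashes(a), shingle_hashes(b)
    if not ha and not hb:
        return 1.0
    return len(ha & hb) / len(ha | hb)
```

Identical pages score 1.0, unrelated pages near 0.0, and near-duplicates land in between — so a threshold on this score flags duplicate candidates without comparing full page texts.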

Marcia

3:49 am on Oct 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If it's for a course, you might want to include these in your reading. From the Stanford document server:

Evaluating Strategies for Similarity Search on the Web [dbpubs.stanford.edu]

Finding replicated web collections [dbpubs.stanford.edu]

This is the full text of Google's patent

Detecting duplicate and near-duplicate files [patft.uspto.gov]

Enjoy!

Oliver Henniges

11:29 am on Oct 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



1) By now it's far more than 4,285,199,774 pages.

> (4,285,199,774!)

2) Finding duplicates wouldn't be a factorial function, just a square divided by two, because you don't have to check each pair in both directions.
That means you'd have two loops, with the inner one decreasing from 2^32 down to 1 steps, in a brute-force attack on an unordered set.

I guess Google is working with dozens of different indices, such as length, type, date, checksum, etc. of the stored documents. If you have the URLs ordered by, e.g., length, you'd only have to compare each URL with its predecessor, which is well within the scope of a single PC, provided the page caches are stored elsewhere and easily accessed.
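The sort-one-index idea above can be sketched as follows: order pages by a cheap key and only run the expensive full comparison on neighbours that share the key. CRC32 as the checksum and the URLs and page bodies are illustrative assumptions, not anything the thread specifies:

```python
import zlib

# Hypothetical pages: two identical, one different.
pages = {
    "a.example/page1": b"the quick brown fox",
    "b.example/copy":  b"the quick brown fox",
    "c.example/other": b"something else entirely",
}

# Sort URLs by checksum; duplicate candidates end up adjacent.
ordered = sorted(pages, key=lambda url: zlib.crc32(pages[url]))

duplicates = []
for prev, cur in zip(ordered, ordered[1:]):
    if zlib.crc32(pages[prev]) == zlib.crc32(pages[cur]):
        # Checksums can collide, so confirm with a full comparison.
        if pages[prev] == pages[cur]:
            duplicates.append((prev, cur))

print(duplicates)  # [('a.example/page1', 'b.example/copy')]
```

Sorting costs O(n log n) instead of the O(n^2) of all-pairs comparison, which is what makes the predecessor-only check feasible.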

cyberprosper

11:41 am on Oct 18, 2004 (gmt 0)

10+ Year Member



Can anybody explain in English what that patent says?