What are Google Fingerprints?

Fingerprints in "duplicate and near-duplicate detection" techniques.

   
12:51 am on Jul 29, 2005 (gmt 0)

10+ Year Member



From one of Google's patents regarding duplicate content:

----------------------------------------
Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints match.

----------------------------------------

Does anyone know what qualifies as a "fingerprint"?

11:40 am on Jul 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Isn't that explained in the same patent filing? Do you have a link?

12:35 pm on Jul 29, 2005 (gmt 0)

10+ Year Member



Patent 6658423.

It's just hashing of paragraphs (an order-sensitive hash and a non-order-sensitive hash). The hashes are sorted into lists, and documents are then compared by matching up the hashes.

Similar to sja65's suggestion, which has the same problem I've pointed out before:

[webmasterworld.com...]
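
For anyone who wants to see the three steps from the quoted abstract in concrete form, here is a minimal Python sketch. It is only one reading of the abstract, not the patent's actual method: treating paragraphs as the "parts", picking a list by hashing each part, using SHA-1, and the names `fingerprints`, `near_duplicates`, `num_lists` and `order_sensitive` are all assumptions made for illustration.

```python
import hashlib


def _digest(text):
    """Stable hex digest of a string (SHA-1 is an arbitrary choice here)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def fingerprints(document, num_lists=4, order_sensitive=True):
    """Return a set of (list index, digest) pairs, one per populated list."""
    # (i) Extract parts: paragraphs separated by blank lines (assumption).
    parts = [p.strip() for p in document.split("\n\n") if p.strip()]

    # (ii) Assign each part to one of a fixed number of lists,
    # chosen here by the part's own hash (assumption).
    lists = [[] for _ in range(num_lists)]
    for part in parts:
        lists[int(_digest(part), 16) % num_lists].append(part)

    # (iii) Generate one fingerprint from each populated list.
    result = set()
    for index, bucket in enumerate(lists):
        if not bucket:
            continue
        if not order_sensitive:
            # Sorting makes the fingerprint ignore paragraph order.
            bucket = sorted(bucket)
        result.add((index, _digest("\n\n".join(bucket))))
    return result


def near_duplicates(doc_a, doc_b, **options):
    """Two documents count as near-duplicates if any fingerprint matches."""
    return bool(fingerprints(doc_a, **options) & fingerprints(doc_b, **options))
```

With this layout, editing or adding a single paragraph only changes the fingerprint of the list that paragraph lands in, so a document whose paragraphs are spread over several lists still matches its original on the untouched lists.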

12:54 pm on Jul 29, 2005 (gmt 0)

10+ Year Member



I'm not sure if the moderators will let me link to a CRM114 paper, but it's worth reading and comparing against the Google patent.

[crm114.sourceforge.net...]

"In this talk, we will examine the Sparse Binary Polynomial Hash (SBPH) filtering technique, a generalization of the Bayesian method that can match mutating phrases as well as single words."

I hope they aren't granted that patent; many more intelligent algorithms already presuppose hashing of text blocks.
It would be like granting a patent on the rubber tyre when other companies already make high-performance multi-layer radials with water pressure ejection...
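
To make the SBPH idea from the quoted CRM114 abstract a little more tangible, here is a rough Python sketch of generating phrase features from a sliding window. It reflects my understanding of the technique rather than CRM114's actual code: the window of 5 tokens, whitespace tokenizing, keeping the first token of each window fixed, the `<skip>` placeholder, CRC32 as the hash, and the name `sbph_features` are all assumptions.

```python
import zlib
from itertools import product


def sbph_features(text, window=5):
    """Return (phrase, hash) pairs for every full window of tokens.

    Each window keeps its first token and either keeps or skips each of
    the remaining tokens, giving 2 ** (window - 1) phrases per window.
    """
    tokens = text.split()
    features = []
    for start in range(len(tokens) - window + 1):
        win = tokens[start:start + window]
        for mask in product((False, True), repeat=window - 1):
            phrase = [win[0]]
            for token, keep in zip(win[1:], mask):
                phrase.append(token if keep else "<skip>")
            # Trailing skips carry no information, so trim them.
            while len(phrase) > 1 and phrase[-1] == "<skip>":
                phrase.pop()
            joined = " ".join(phrase)
            features.append((joined, zlib.crc32(joined.encode("utf-8"))))
    return features
```

Because some features skip over words, a phrase with one word swapped out still shares many hashed features with the original, which is what lets this kind of scheme match "mutating phrases".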