
What are Google Fingerprints?

Fingerprints in "duplicate and near-duplicate detection" techniques.

   
12:51 am on Jul 29, 2005 (gmt 0)

10+ Year Member



From one of Google's patents regarding duplicate content:

----------------------------------------
Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints match.

----------------------------------------

Does anyone know what qualifies as a "fingerprint"?

11:40 am on Jul 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Isn't that explained in the same patent filing? Do you have a link?

12:35 pm on Jul 29, 2005 (gmt 0)

10+ Year Member



It's patent number 6658423.

It's just hashing of paragraphs (an order-sensitive hash and a non-order-sensitive hash). They sort the hashes into lists and then match them up for comparison.

It's similar to sja65's suggestion, which has the same problem I've pointed out before:

[webmasterworld.com...]
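
To make the three steps quoted from the patent abstract concrete, here is a rough Python sketch. It is not the patent's actual method: splitting the document into words, using 4 lists, picking a list by hashing the part, and using SHA-1 are all assumptions on my part, and the names `fingerprints` and `near_duplicates` are made up for illustration.

```python
import hashlib

NUM_LISTS = 4  # assumed; the abstract only says "a predetermined number"


def _hash(text: str) -> int:
    """Stable 64-bit hash of a string (SHA-1 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


def fingerprints(document: str, num_lists: int = NUM_LISTS) -> set:
    # (i) Extract parts. Words are an assumption; paragraphs would also work.
    parts = document.lower().split()

    # (ii) Assign each part to one of the lists, chosen here by its hash.
    lists = [[] for _ in range(num_lists)]
    for part in parts:
        lists[_hash(part) % num_lists].append(part)

    # (iii) One fingerprint per populated list. Joining in document order
    # makes this an order-sensitive hash of the list's contents.
    return {(i, _hash(" ".join(lst))) for i, lst in enumerate(lists) if lst}


def near_duplicates(doc_a: str, doc_b: str) -> bool:
    # Near-duplicates if any one of their fingerprints matches.
    return bool(fingerprints(doc_a) & fingerprints(doc_b))
```

In this sketch, changing one word only disturbs the list the old word was in and the list the new word lands in, so the fingerprints of the other lists still match. That is why a lightly edited copy is still flagged.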

12:54 pm on Jul 29, 2005 (gmt 0)

10+ Year Member



I'm not sure if the moderators will let me link to a CRM114 paper, but it's worth reading and comparing against the Google patent.

[crm114.sourceforge.net...]

"In this talk, we will examine the Sparse Binary Polynomial Hash (SBPH) filtering technique, a generalization of the Bayesian method that can match mutating phrases as well as single words."
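
For anyone wondering how hashing can match "mutating phrases", here is a simplified Python sketch of the SBPH idea as I understand it. It is not CRM114's code: the window of 5 tokens, the `<skip>` placeholder, the choice to always keep the newest token, and the use of CRC32 are assumptions for illustration, and the function names are hypothetical.

```python
import zlib
from itertools import product


def sbph_features(tokens, window=5):
    """Order-preserving sub-phrases for each position of a sliding window.

    The newest token is always kept; each older token in the window is
    either kept or replaced by a "<skip>" placeholder.
    """
    features = []
    for end in range(len(tokens)):
        start = max(0, end - window + 1)
        older = tokens[start:end]
        newest = tokens[end]
        # Every keep/skip combination of the older tokens in the window.
        for mask in product((False, True), repeat=len(older)):
            phrase = [tok if keep else "<skip>" for tok, keep in zip(older, mask)]
            phrase.append(newest)
            features.append(" ".join(phrase))
    return features


def sbph_hashes(tokens, window=5):
    # Each sub-phrase is hashed; a classifier would then count these hashes.
    return {zlib.crc32(f.encode("utf-8")) for f in sbph_features(tokens, window)}
```

A full window of 5 tokens yields 16 sub-phrases, so a phrase with one word swapped out still shares the sub-phrases that skip that word.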

I hope they aren't granted that patent; a lot of more intelligent algorithms presuppose hashing of text blocks.
It would be like granting a patent on the rubber tyre when other companies already make high-performance multi-layer radials with water pressure ejection...

 
