What are Google Fingerprints?

Fingerprints in "duplicate and near-duplicate detection" techniques.



12:51 am on Jul 29, 2005 (gmt 0)

10+ Year Member

From one of Google's patents regarding duplicate content:

Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints match.
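For anyone trying to picture the three steps in that excerpt, here is a rough Python sketch. It is not taken from the patent: treating "parts" as lowercase words, using 4 lists, picking a list by hashing the word, and the helper names (extract_parts, stable_hash, fingerprints, near_duplicates) are all assumptions made for illustration.

```python
import hashlib
import re

NUM_LISTS = 4  # assumed; the excerpt only says "a predetermined number of lists"


def extract_parts(text):
    # Step (i): extract parts. Assumption: a "part" is a lowercase word.
    return re.findall(r"\w+", text.lower())


def stable_hash(s):
    # A hash that is the same on every run (Python's built-in hash() is not).
    return int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")


def fingerprints(text, num_lists=NUM_LISTS):
    # Step (ii): assign each part to one list. Assumption: list chosen by hash.
    lists = [[] for _ in range(num_lists)]
    for part in extract_parts(text):
        lists[stable_hash(part) % num_lists].append(part)
    # Step (iii): one fingerprint per populated list.
    return {
        (index, stable_hash(" ".join(parts)))
        for index, parts in enumerate(lists)
        if parts
    }


def near_duplicates(a, b):
    # "Near-duplicates if any one of their fingerprints match."
    return bool(fingerprints(a) & fingerprints(b))
```

With this setup, a small edit only disturbs the lists the changed words fall into, so the untouched lists still produce matching fingerprints. That is presumably the point of using several fingerprints instead of one hash for the whole document.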


Does anyone know what qualifies as "fingerprints?"


11:40 am on Jul 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Isn't that explained in the same patent filing? Do you have a link?


12:35 pm on Jul 29, 2005 (gmt 0)

10+ Year Member


It's just hashing of paragraphs (an order-sensitive hash and a non-order-sensitive hash). They sort the hashes into lists and then match them up for comparison.
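To make the two kinds of hash concrete, a minimal sketch follows. It assumes a paragraph is split into whitespace-separated lowercase words and uses SHA-256 purely as a convenient stand-in; the function names are made up and nothing here is claimed to be what Google actually does.

```python
import hashlib


def _digest(s):
    # Stand-in hash function; any stable hash would do for the illustration.
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


def ordered_hash(paragraph):
    # Order-sensitive: the words are hashed in the order they appear.
    return _digest(" ".join(paragraph.lower().split()))


def unordered_hash(paragraph):
    # Order-insensitive: the words are sorted first, so reordering
    # the paragraph does not change the result.
    return _digest(" ".join(sorted(paragraph.lower().split())))
```

Shuffling the words of a paragraph changes ordered_hash but leaves unordered_hash the same, so the second one still catches text that has only been rearranged.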

Similar to sja65's suggestion, which has the same problem I've pointed out before:



12:54 pm on Jul 29, 2005 (gmt 0)

10+ Year Member

I'm not sure if the moderators will let me link to a CRM114 paper, but it's worth reading and comparing against the Google patent.


"In this talk, we will examine the Sparse Binary Polynomial Hash (SBPH) filtering technique, a generalization of the Bayesian method that can match mutating phrases as well as single words."
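A rough sketch of the idea behind SBPH, as I understand it: slide a short window over the tokens and, for each position, emit every combination of the earlier words in the window (kept or skipped) together with the newest word. The window length of 5, the "*" placeholder and the function name are assumptions for illustration, and a real filter would hash each feature into a bucket rather than keep the strings.

```python
from itertools import product

WINDOW = 5  # assumed window length


def sbph_features(tokens, window=WINDOW):
    features = []
    for end in range(len(tokens)):
        start = max(0, end - window + 1)
        older = tokens[start:end]   # up to window-1 preceding tokens
        newest = tokens[end]        # the newest token is always kept
        # Every keep/skip pattern over the older tokens; skipped slots become "*".
        for mask in product((False, True), repeat=len(older)):
            kept = [tok if keep else "*" for tok, keep in zip(older, mask)]
            features.append(" ".join(kept + [newest]))
    return features
```

For example, "click here to buy now" and "click here to order now" both produce the feature "click here to * now", which is how a phrase with one word swapped can still match.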

I hope they aren't granted that patent; a lot of more intelligent algorithms presuppose hashing of text blocks.
It would be like granting a patent on the rubber tyre when other companies already make high-performance multi-layer radials with water pressure ejection...

