| What are Google Fingerprints? Fingerprints in "duplicate and near-duplicate detection" techniques. |
tama

msg:806257 | 12:51 am on Jul 29, 2005 (gmt 0) | From one of Google's patents regarding duplicate content: ---------------------------------------- Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints match. ---------------------------------------- Does anyone know what qualifies as "fingerprints"?
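The three steps in the quoted abstract can be sketched roughly as follows. This is only an illustration of the wording above, not the patent's actual method: treating lowercase words as the "parts", using four lists, assigning each part to a list by its hash, and using SHA-1 are all assumptions, and every function name is made up for the example.

```python
import hashlib
import re
from typing import List, Set, Tuple

NUM_LISTS = 4  # assumption: the abstract only says "a predetermined number"


def _hash64(text: str) -> int:
    """Stable 64-bit hash of a string (SHA-1 is an arbitrary choice here)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def extract_parts(document: str) -> List[str]:
    """Step (i): extract parts. Assumption: parts are lowercase words."""
    return re.findall(r"\w+", document.lower())


def fingerprints(document: str, num_lists: int = NUM_LISTS) -> Set[Tuple[int, int]]:
    """Steps (ii) and (iii): fill the lists, then hash each populated list."""
    lists: List[List[str]] = [[] for _ in range(num_lists)]
    for part in extract_parts(document):
        # Assumption: a part's own hash decides which list it lands in,
        # so the same word always goes to the same list.
        lists[_hash64(part) % num_lists].append(part)
    # One fingerprint per populated list, tagged with the list index so
    # that list 0 of one document is only compared with list 0 of another.
    return {
        (index, _hash64(" ".join(parts)))
        for index, parts in enumerate(lists)
        if parts
    }


def near_duplicates(doc_a: str, doc_b: str) -> bool:
    """Near-duplicates if any one fingerprint matches."""
    return bool(fingerprints(doc_a) & fingerprints(doc_b))
```

Under these assumptions, changing a single word disturbs at most two lists (the one the old word was in and the one the new word lands in), so the fingerprints of the remaining populated lists still match and the two documents are still flagged as near-duplicates.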
|
HitProf

msg:806258 | 11:40 am on Jul 29, 2005 (gmt 0) | Isn't that explained in the same patent filing? Do you have a link?
|
ncgimaker

msg:806259 | 12:35 pm on Jul 29, 2005 (gmt 0) | Patent 6658423. It's just hashing of paragraphs (an order-sensitive hash and an order-insensitive hash). They then match up the hashes, sorting them into lists for comparison. It's similar to sja65's suggestion, which has the same problem I've pointed out before: [webmasterworld.com...]
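To make the difference between the two kinds of hash concrete, here is a minimal sketch. It follows the description in this post rather than the patent text, and the function names and the use of SHA-1 are assumptions made for the example.

```python
import hashlib
from typing import List


def _paragraph_digest(paragraph: str) -> bytes:
    """Hash one paragraph on its own (SHA-1 is an arbitrary choice)."""
    return hashlib.sha1(paragraph.encode("utf-8")).digest()


def order_sensitive_hash(paragraphs: List[str]) -> str:
    """Combine paragraph hashes in document order.

    Moving a paragraph changes the result.
    """
    combined = hashlib.sha1()
    for paragraph in paragraphs:
        combined.update(_paragraph_digest(paragraph))
    return combined.hexdigest()


def order_insensitive_hash(paragraphs: List[str]) -> str:
    """Sort the paragraph hashes before combining them.

    Any reordering of the same paragraphs gives the same result.
    """
    combined = hashlib.sha1()
    for digest in sorted(_paragraph_digest(p) for p in paragraphs):
        combined.update(digest)
    return combined.hexdigest()


original = ["First paragraph.", "Second paragraph.", "Third paragraph."]
shuffled = ["Third paragraph.", "First paragraph.", "Second paragraph."]

# Same paragraphs, different order: only the order-insensitive hash matches.
same_when_ordered = order_sensitive_hash(original) == order_sensitive_hash(shuffled)
same_when_sorted = order_insensitive_hash(original) == order_insensitive_hash(shuffled)
```

With this sketch, a page that merely shuffles its paragraphs would escape the order-sensitive hash but still be caught by the order-insensitive one.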
|
ncgimaker

msg:806260 | 12:54 pm on Jul 29, 2005 (gmt 0) | I'm not sure if the moderators will let me link to a CRM114 paper, but it's worth reading and comparing against the Google patent. [crm114.sourceforge.net...] "In this talk, we will examine the Sparse Binary Polynomial Hash (SBPH) filtering technique, a generalization of the Bayesian method that can match mutating phrases as well as single words." I hope they aren't granted that patent; a lot of more intelligent algorithms presuppose hashing of text blocks. It would be like giving a patent on the rubber tyre when other companies already make high-performance multi-layer radials with water pressure ejection...
|