What are Google Fingerprints?

Fingerprints in "duplicate and near-duplicate detection" techniques.

12:51 am on July 29, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:June 20, 2004
posts:158
votes: 0


From one of Google's patents regarding duplicate content:

----------------------------------------
Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints match.

----------------------------------------
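For anyone trying to picture the three steps in that excerpt, here is a rough sketch in Python. It is only an illustration under stated assumptions, not the patent's actual method: the "parts" are taken to be lowercase words, a part is assigned to a list by hashing it modulo the number of lists, and the helper names (`extract_parts`, `stable_hash`, `fingerprints`, `near_duplicates`) and the default of four lists are all made up for the example.

```python
import hashlib


def stable_hash(s):
    # 64-bit integer derived from SHA-1; stable across runs, unlike built-in hash().
    return int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")


def extract_parts(text):
    # Assumption: a "part" is a lowercase, whitespace-delimited word.
    return text.lower().split()


def fingerprints(text, num_lists=4):
    # (ii) assign each extracted part to one of a fixed number of lists
    # (assumption: the list is chosen by hashing the part itself).
    lists = [[] for _ in range(num_lists)]
    for part in extract_parts(text):
        lists[stable_hash(part) % num_lists].append(part)
    # (iii) generate one fingerprint per populated list; empty lists are skipped.
    return {
        (index, stable_hash(" ".join(parts)))
        for index, parts in enumerate(lists)
        if parts
    }


def near_duplicates(a, b, num_lists=4):
    # Two documents count as near-duplicates if any one fingerprint matches.
    return bool(fingerprints(a, num_lists) & fingerprints(b, num_lists))
```

With this scheme, swapping a single word only disturbs the list the old word was in and the list the new word lands in, so the fingerprints of the untouched lists still match. That seems to be the point of comparing per-list fingerprints instead of one hash for the whole document.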

Does anyone know what qualifies as "fingerprints?"

11:40 am on July 29, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 30, 2002
posts:1377
votes: 0


Isn't that explained in the same patent filing? Do you have a link?

12:35 pm on July 29, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 6, 2005
posts:91
votes: 0


Patent 6658423.

It's just hashing of paragraphs (an order-sensitive hash and a non-order-sensitive hash). The hashes are sorted into lists and then matched up for comparison.
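To make the two kinds of hash concrete, here is a minimal Python sketch. It is an assumption about one way to do it, not what the patent specifies: the order-insensitive version simply sorts the words before hashing, and the function names are hypothetical.

```python
import hashlib


def _digest(s):
    # 64-bit integer derived from SHA-1 of the string.
    return int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")


def order_sensitive_hash(paragraph):
    # Words are hashed in their original order, so reordering changes the hash.
    return _digest(" ".join(paragraph.lower().split()))


def order_insensitive_hash(paragraph):
    # Words are sorted first, so any reordering of the same words hashes the same.
    return _digest(" ".join(sorted(paragraph.lower().split())))


a = "the quick brown fox"
b = "fox brown quick the"
same_words = order_insensitive_hash(a) == order_insensitive_hash(b)
same_order = order_sensitive_hash(a) == order_sensitive_hash(b)
```

Here `a` and `b` contain the same words in a different order, so the order-insensitive hashes agree while the order-sensitive ones do not.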

Similar to sja65's suggestion, which has the same problem I've pointed out before:

[webmasterworld.com...]

12:54 pm on July 29, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 6, 2005
posts:91
votes: 0


I'm not sure if the moderators will let me link to a CRM114 paper, but it's worth reading and comparing against the Google patent.

[crm114.sourceforge.net...]

"In this talk, we will examine the Sparse Binary Polynomial Hash (SBPH) filtering technique, a generalization of the Bayesian method that can match mutating phrases as well as single words."

I hope they aren't granted that patent; a lot of more intelligent algorithms presuppose hashing of text blocks.
It would be like giving a patent on the rubber tyre when other companies already make high-performance multi-layer radials with water pressure ejection...
