Detecting duplicate and near-duplicate files - (deprecated) SEM Research Topics forum at WebmasterWorld - WebmasterWorld

Detecting duplicate and near-duplicate files

Google patent - Henzinger & Pugh

vitaplease

8:34 am on Jan 9, 2004 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Detecting duplicate and near-duplicate files [patft.uspto.gov] - December 2, 2003

Inventors: Pugh; William (Kensington, MD); Henzinger; Monika H. (Menlo Park, CA)
Assignee: Google, Inc. (Mountain View, CA)
Filed: January 24, 2001

It was filed a while ago, but I have not seen it mentioned here yet.

Among other things:

...In response to the detected duplicate documents, the present invention may also function to eliminate duplicate documents (e.g., keeping the one with best PageRank, with best trust of host, that is the most recent) Alternatively, the present invention may function to generate clusters of near-duplicate documents, in which a transitive property is assumed (i.e., if document A is a near-duplicate of document B, and document B is a near-duplicate of document C, then document A is considered a near-duplicate of document C). Each document may have an identifier for identifying a cluster with which it is associated. In this alternative, in response to a search query, if two candidate result documents belong to the same cluster and if the two candidate result documents match the query equally well (e.g., have the same title and/or snippet) if both appear in the same group of results (e.g., first page), only the one deemed more likely to be relevant (e.g., by virtue of a high PageRank, being more recent, etc.) is returned...

I found the part with the "best trust of host" interesting.
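
The transitive clustering described in the quoted passage can be sketched with a union-find structure. This is only an illustration of the idea, not the patent's implementation: the function name `cluster_near_duplicates`, the document labels, and the input format (a list of already-detected near-duplicate pairs) are all assumptions made for the example.

```python
def cluster_near_duplicates(doc_ids, near_duplicate_pairs):
    """Map each document to a cluster identifier, treating
    near-duplication as transitive (A~B and B~C implies A~C)."""
    parent = {doc: doc for doc in doc_ids}

    def find(doc):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]
            doc = parent[doc]
        return doc

    for a, b in near_duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    return {doc: find(doc) for doc in doc_ids}


# Hypothetical input: A~B and B~C were detected; D matches nothing.
clusters = cluster_near_duplicates(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
```

Here A, B and C end up with the same cluster identifier even though A and C were never compared directly, while D stays in a cluster of its own. A search front end could then keep only one result per cluster on a given results page.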

brotherhood of LAN

8:45 am on Jan 9, 2004 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

>interesting

A lot of people talk about "penalties"; on the flip side, those who rank higher could simply be "more trustworthy".

hashing each of the extracted parts to generate a hash value for each of the extracted parts....
The method of claim 15 wherein the act of determining a fingerprint uses a hashing function with a low probability of collision

I wonder whether the hash includes HTML markup, extra spaces, a word added at the beginning or end, etc. Either way, the "low probability of collision" sounds worrying. Considering how many ways there are of putting words into a sentence in a given order, they must have a very good way of hashing it all into a 4/8/whatever-byte number :)
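
On the question of how worrying a "low probability of collision" is, the standard birthday approximation gives a rough feel for the numbers. The sketch below assumes an ideal, uniformly distributed hash; the function name and the item counts are made up for illustration and say nothing about what hash width Google actually uses.

```python
import math


def collision_probability(num_items, hash_bits):
    """Approximate chance that at least two of num_items distinct inputs
    share a hash value, assuming a uniform hash of hash_bits bits.
    Uses the birthday approximation p ~ 1 - exp(-n(n-1) / (2 * 2^bits))."""
    pairs = num_items * (num_items - 1) / 2
    return -math.expm1(-pairs / 2.0 ** hash_bits)


# Hypothetical collection sizes, chosen only to show the trend.
for bits in (32, 64, 128):
    print(bits, collision_probability(1_000_000_000, bits))
```

With a billion items, a 4-byte hash is practically certain to collide somewhere, an 8-byte hash still has a noticeable chance, and a 16-byte hash makes an accidental collision negligible. So the width of the fingerprint matters far more than the number of possible word orderings.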

zgb999

10:47 am on Jan 9, 2004 (gmt 0)

10+ Year Member

I checked one of the articles mentioned in the patent:
Some applications of Rabin's fingerprinting method [citeseer.nj.nec.com]
[pdf file]

But I couldn't find the other article mentioned in the patent:
M. O. Rabin. Fingerprinting by random polynomials. Center for Research in
Computing Technology, Harvard University, Report TR-15-81, 1981.

I have to admit that I didn't understand much of the first article, but I wonder if anybody knows where to find the second one.

Even better: Can anybody explain how the fingerprints are generated and how the documents are compared?

As far as I have understood it so far, all words (e.g. in the body text of an HTML document) are counted, and if another document has the same count for most of those words, the two are considered near-duplicates.

Is that correct?
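
One possible reading of the claim quoted earlier in the thread ("hashing each of the extracted parts to generate a hash value for each of the extracted parts") is sketched below: extract parts (here, words), use each part's hash to assign it to one of a fixed number of lists, fingerprint each list, and call two documents near-duplicates if any fingerprint matches. This is an assumption-laden illustration rather than the patented method. The number of lists, the word-level parts, the lowercasing, and the use of SHA-1 instead of Rabin fingerprints are all choices made for the example.

```python
import hashlib
import re

NUM_LISTS = 4  # hypothetical; the thread does not say how many lists are used


def stable_hash(text):
    """64-bit hash taken from SHA-1 (a stand-in for a Rabin fingerprint)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprints(document, num_lists=NUM_LISTS):
    """Return one fingerprint per list (None for a list that stayed empty)."""
    words = re.findall(r"\w+", document.lower())  # drops punctuation and spacing
    lists = [[] for _ in range(num_lists)]
    for word in words:
        # The word's own hash decides which list it lands in.
        lists[stable_hash(word) % num_lists].append(word)
    return [stable_hash(" ".join(part)) if part else None for part in lists]


def near_duplicates(doc_a, doc_b):
    """True if at least one non-empty list has the same fingerprint in both."""
    pairs = zip(fingerprints(doc_a), fingerprints(doc_b))
    return any(fa is not None and fa == fb for fa, fb in pairs)


original = "the quick brown fox jumps over the lazy dog near the river bank today"
edited = original.replace("dog", "cat")
unrelated = "fingerprint hashing algorithms detect copied web pages"
```

Under this scheme, swapping one word changes at most two lists (the one the old word left and the one the new word joined), so the remaining fingerprints still agree. That is different from comparing word counts: word order within each list matters, and a single matching fingerprint is enough.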

[edited by: msgraph at 3:00 pm (utc) on Jan. 9, 2004]
[edit reason] fixed link [/edit]

Smiley

9:27 am on Jan 12, 2004 (gmt 0)

10+ Year Member

>I found the part with the "best trust of host" interesting.

I agree. Any ideas how "best trust of host" could be determined?

contrast compare

5:27 am on Jan 13, 2004 (gmt 0)

10+ Year Member

Where can you find all of these Google patents? I'd like to read them all. ;)

zgb999

10:29 am on Jan 13, 2004 (gmt 0)

10+ Year Member

I just got the Axandra newsletter today, which covers all patents issued to search engines in 2003. This might answer part of your question:
[axandra.com]

George

8:09 pm on Jan 19, 2004 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Funny, "best trust of host" hit me between the eyes too.
I had a bit of a think about how hosts could be split:

1 Free
2 Commercial

Is there another way?
Would Demon in the UK suffer, for example, as there are so many domains on an IP?
Would one domain per IP address be considered more trustworthy?
Or would it be a function of the links into an IP address, or a Class C block?
Or links out? (less likely perhaps)

There can only be a limited number of ways of judging it though. Anything to add?