Inventors: Pugh; William (Kensington, MD); Henzinger; Monika H. (Menlo Park, CA)
Assignee: Google, Inc. (Mountain View, CA)
Filed: January 24, 2001
This was filed a while ago, but I have not seen it mentioned here yet.
Among other things, it says:
...In response to the detected duplicate documents, the present invention may also function to eliminate duplicate documents (e.g., keeping the one with best PageRank, with best trust of host, that is the most recent) Alternatively, the present invention may function to generate clusters of near-duplicate documents, in which a transitive property is assumed (i.e., if document A is a near-duplicate of document B, and document B is a near-duplicate of document C, then document A is considered a near-duplicate of document C). Each document may have an identifier for identifying a cluster with which it is associated. In this alternative, in response to a search query, if two candidate result documents belong to the same cluster and if the two candidate result documents match the query equally well (e.g., have the same title and/or snippet) if both appear in the same group of results (e.g., first page), only the one deemed more likely to be relevant (e.g., by virtue of a high PageRank, being more recent, etc.) is returned...
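The transitive clustering described in that passage can be pictured with a small union-find sketch. This is only an illustration of the "if A~B and B~C then A~C" idea, not the patent's actual implementation; the function name `cluster_near_duplicates` and the input format (a list of already-detected near-duplicate pairs) are assumptions made for the example.

```python
def cluster_near_duplicates(doc_ids, near_duplicate_pairs):
    """Assign each document a cluster identifier, treating
    near-duplication as transitive (hypothetical helper)."""
    # Every document starts out as its own cluster representative.
    parent = {d: d for d in doc_ids}

    def find(d):
        # Walk up to the representative, shortening the path as we go.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for a, b in near_duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Merge the two clusters.
            parent[root_b] = root_a

    # The cluster identifier is simply the representative document.
    return {d: find(d) for d in doc_ids}


# A~B and B~C were detected; A~C was never compared directly.
clusters = cluster_near_duplicates(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
```

Here A, B and C end up with the same cluster identifier while D stays on its own, so at query time only one of A, B, C would need to be shown if they matched equally well.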
I found the part about "best trust of host" interesting.
A lot of people talk about "penalties"; on the flip side, those who rank higher could simply be "more trustworthy".
hashing each of the extracted parts to generate a hash value for each of the extracted parts....The method of claim 15 wherein the act of determining a fingerprint uses a hashing function with a low probability of collision
I wonder if the hash includes HTML markup, extra spaces, a word added at the beginning or end, etc. Either way, the "low probability of collision" sounds worrying. Considering how many ways there are to arrange words in a sentence, they must have a very good way of hashing it into a 4/8/whatever-byte number :)
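To make the question concrete, here is a rough sketch of "hash each extracted part" under assumptions of my own: markup is stripped crudely, whitespace and case are normalized, parts are fixed runs of four words, and SHA-1 truncated to 64 bits stands in for whatever fingerprint function is really used (the patent cites Rabin fingerprints, which this is not). The names `normalize`, `part_fingerprints` and `expected_colliding_pairs` are hypothetical.

```python
import hashlib
import re


def normalize(text):
    # Assumption: tags, letter case and runs of whitespace are ignored.
    text = re.sub(r"<[^>]+>", " ", text)
    return " ".join(text.lower().split())


def part_fingerprints(text, words_per_part=4):
    # Assumption: a "part" is a fixed-size run of consecutive words.
    words = normalize(text).split()
    parts = [
        " ".join(words[i:i + words_per_part])
        for i in range(0, len(words), words_per_part)
    ]
    # SHA-1 truncated to 8 bytes is a stand-in for the real fingerprint.
    return [
        int.from_bytes(hashlib.sha1(p.encode("utf-8")).digest()[:8], "big")
        for p in parts
    ]


def expected_colliding_pairs(n_values, bits=64):
    # Birthday-style estimate: number of pairs divided by number of hash values.
    return n_values * (n_values - 1) / 2 / 2 ** bits


same = part_fingerprints("<b>Hello</b>   world  foo bar") == part_fingerprints("hello world foo bar")
shifted = part_fingerprints("extra hello world foo bar") == part_fingerprints("hello world foo bar")
```

With this particular normalization, markup and extra spaces do not change the fingerprints (`same` is true), but a single word added at the front shifts every fixed-size part (`shifted` is false). That is one reason a real system would need a more careful way of choosing parts. The collision estimate depends only on how many values are hashed and how many bits are kept, not on how many word orderings are possible.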
But I couldn't find the other article mentioned in the patent:
M. O. Rabin. Fingerprinting by random polynomials. Center for Research in
Computing Technology, Harvard University, Report TR-15-81, 1981.
I have to admit that I didn't understand much of the first article, but I wonder if anybody knows where to find the second one.
Even better: Can anybody explain how the fingerprints are generated and how the documents are compared?
As far as I understand it so far, all words (e.g. in the body text of an HTML document) are counted, and if another document has the same count for most of those words, it is considered a near-duplicate.
Is that correct?
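To spell out the reading above (and only that reading; nothing here confirms it is what the patent does), a word-count comparison could look roughly like this. The function name `word_count_similarity` and the idea of reporting a fraction of matching words are assumptions for illustration.

```python
from collections import Counter


def word_count_similarity(text_a, text_b):
    """Fraction of distinct words in text_a whose count is identical
    in text_b (hypothetical measure, not taken from the patent)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    if not counts_a:
        return 0.0
    matching = sum(1 for word, count in counts_a.items() if counts_b.get(word) == count)
    return matching / len(counts_a)


reordered = word_count_similarity("the cat sat on the mat", "on the mat the cat sat")
one_word_changed = word_count_similarity("a b c d", "a b c e")
```

Note that `reordered` comes out as a perfect match even though the word order is completely different, so a pure word-count scheme would treat shuffled text as a duplicate. Hashing parts of the document, as the claims describe, is sensitive to order within each part.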
1 Free
2 Commercial
Is there another way?
Would Demon in the UK suffer, for example, as there are so many domains on an IP?
Would one domain to one IP address be considered more trustworthy?
Or would it be a function of the links into an IP address, or C Class?
Or links out? (less likely perhaps)
There can only be a limited number of ways of judging it though. Anything to add?