Forum Moderators: Robert Charlton & goodroi
Beware of duplicate content!
A similarity engine generates compact representations of objects called sketches. Sketches of different objects can be compared to determine the similarity between the two objects. The sketch for an object may be generated by creating a vector corresponding to the object, where each coordinate of the vector is associated with a corresponding weight. The weight associated with each coordinate in the vector is multiplied by a predetermined hashing vector to generate a product vector, and the product vectors are summed. The similarity engine may then generate a compact representation of the object based on the summed product vector.
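The abstract above is essentially describing a random-projection sketch (what's become known as simhash). Here's a minimal, hedged sketch of the idea in Python — the MD5-based hashing vector and the 64-bit width are my own illustrative choices, not anything the patent specifies:

```python
import hashlib

def simhash(weighted_features, bits=64):
    """Generate a compact bit-sketch from (feature, weight) pairs.

    Each feature is hashed to a pseudo-random +/-1 "hashing vector";
    the weighted hashing vectors are summed, and the sign of each
    coordinate of the summed product vector gives one bit of the sketch.
    """
    totals = [0.0] * bits
    for feature, weight in weighted_features:
        # derive a stable pseudo-random bit pattern for this feature
        h = int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest(), "big")
        for i in range(bits):
            # bit i of the hash chooses +1 or -1 for coordinate i
            totals[i] += weight if (h >> i) & 1 else -weight
    # collapse the summed product vector to one bit per coordinate
    return sum(1 << i for i, t in enumerate(totals) if t > 0)
```

Because similar feature sets push most coordinates in the same direction, near-duplicate documents end up with sketches that differ in only a few bits.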
We do want to keep this thread focused on the patent, so I suggest that people take site-specific questions to another spot.
Thomas Phelps and Robert Wilensky, for their project, Robust Hyperlinks. Traditional hyperlinks are very brittle, in that they are useless if the page later moves to a different URL. This project improves upon traditional hyperlinks by creating a signature of the target page, selecting a set of very rare words that uniquely identify the page, and relying on a search engine query for those rare words to find the page in the future.
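The rare-word signature idea above is easy to sketch. This is only an illustration of the general approach, not Phelps and Wilensky's actual implementation; `doc_freq` here is a hypothetical word-to-document-frequency table:

```python
def lexical_signature(page_text, doc_freq, k=5):
    """Pick the k rarest words on a page as a search-engine signature.

    `doc_freq` is a hypothetical {word: documents-containing-it} table;
    rarer words identify the page more uniquely, so they sort first.
    """
    words = {w.lower() for w in page_text.split() if w.isalpha()}
    ranked = sorted(words, key=lambda w: doc_freq.get(w, 0))
    return ranked[:k]
```

Feeding the returned words into a search engine query would then, in principle, relocate the page even after its URL changes.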
There are situations where duplicate content is “legal”
Sure, but that doesn't mean Google wants (or that users want) duplicate content in search results.
Press releases are a good example: The same release from Widgetco about its new WC-1 digital camera might turn up on 50 different photo sites, but from a user's point of view, there's no value in having all 50 copies of the press release listed in a Google SERP.
I tried to find this current patent there, but it said: "Your search - ininventor:Charikar - did not match any documents." The same held true for a search for the title or specific keywords. Is it just too new, or is Google Patents broken in a similar way to the link: command?
I'm new to patents: is there a considerable increase in the number of patents released by Google? Does the new Big Daddy infrastructure allow Google to implement the functionality of such patents more easily than before?
As for this specific patent: I was again wondering about the status of "words" as the basis of a vector matrix (or of an overall analysis like in this thread [webmasterworld.com]). It sounds as if an arbitrary deletion of spaces (or substitution of spaces with hyphens) would disable the detection of duplicate content. It cannot be that easy, can it?
[edited by: Oliver_Henniges at 9:33 pm (utc) on Jan. 2, 2007]
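One hedged guess at why space-stripping alone probably wouldn't defeat detection: nothing forces the "discrete elements" to be whitespace-delimited words. If the engine shingled on character n-grams instead (my illustrative choice here, not something the patent states), deleting spaces or swapping them for hyphens would barely change the feature set:

```python
def char_shingles(text, n=5):
    """Character n-gram shingles, ignoring whitespace and punctuation.

    Because non-alphanumeric characters are stripped before shingling,
    deleting spaces or replacing them with hyphens leaves the shingle
    set essentially unchanged.
    """
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Set overlap: 1.0 means identical shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0
```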
By the way, we mostly find these through a combination of alerts, feeds, and
just plain keeping lots of feelers out. Plus we have a lot of mods on the job here
-- that improves our chance of catching a story early on.
[edited by: tedster at 8:17 pm (utc) on Feb. 10, 2008]
SUMMARY OF THE INVENTION
....
The method includes generating a vector corresponding to the object, each coordinate of the vector being associated with a corresponding weight and multiplying the weight associated with each coordinate in the vector by a corresponding hashing vector to generate a product vector. The method further includes summing the product vectors and generating the compact representation of the object using the summed product vectors.
That is 200-year-old mathematics. How it could possibly stand up in any courtroom is beyond my understanding. Boo.
Other than that, interesting leisure reading.
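For what it's worth, the payoff of the summed-product-vector method quoted above is that comparing two documents reduces to comparing two small bit strings. A minimal sketch of that comparison step (the 64-bit width is an assumption, not from the patent):

```python
def hamming_similarity(sketch_a, sketch_b, bits=64):
    """Fraction of agreeing bits between two integer bit-sketches.

    Under a random-projection sketch, this fraction approximates the
    cosine similarity of the underlying weighted vectors.
    """
    differing = bin(sketch_a ^ sketch_b).count("1")
    return 1.0 - differing / bits
```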
Is it just me or does this sound like a forum killer?
All forums inherently place the titles and tags in exactly the same places on every page. Would forums be penalized further by this? It's no secret that big G already dislikes forums and dynamic content, so webmasters go to great lengths to make their forums more friendly. Is this the return of webmasters needing to create every page differently, manually and one by one?
I second your opinion, though in practice this supposedly depends on the lawyers one can afford.
Indeed, the patent is worded very broadly: the vector analysis is primarily based on "words" or phrases, but "...the concepts described could also be implemented based on any object that contains a series of discrete elements." The question is whether it HAS been implemented that way, or whether Google has developed some other (prior) engines to "normalize" text bodies according to their lexemes beyond typing mistakes.
To me, one of the key future issues is the fact that Google now probably has the infrastructure to COMBINE and TWEAK all such patents quite easily. As Anna Patterson said here [acmqueue.com]:
"The really hard problem with crawlers is to perform dynamic duplicate elimination—eliminating both duplicate URLs and duplicate content..."
It looks as if crawling and evaluation of websites under this new infrastructure are performed in one and the same big (ever-changing) process, from time to time shoveling large parts of the results out to the world in "data refreshes." And most of us are wondering how PageRank calculation fits into this scheme.
Has anyone yet done a synopsis of some of these patents? Maybe "similarity engine 124", "server device 110", "memory 109" and all these labels remain the same across the filings? What did the latest patents on PageRank say in this respect?
Detecting duplicate and near-duplicate files [patft.uspto.gov]
Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints
to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to
one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the
populated lists. Two documents may be considered to be near-duplicates if any one of their
fingerprints match.
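The three steps in that abstract — extract parts, assign them to lists, fingerprint each populated list — can be sketched roughly as follows. This is my own guess at a concrete reading, with words as the "parts", hash-modulo bucket assignment, and MD5 as the fingerprint function, none of which the patent pins down:

```python
import hashlib
from collections import defaultdict

def fingerprints(document, num_lists=4):
    """(i) extract words, (ii) assign each to one of `num_lists` lists
    by hash, (iii) fingerprint each populated list."""
    lists = defaultdict(list)
    for word in document.lower().split():
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        lists[h % num_lists].append(word)
    return {
        hashlib.md5(" ".join(words).encode("utf-8")).hexdigest()
        for words in lists.values()
    }

def near_duplicates(doc_a, doc_b):
    # per the abstract: a match on ANY one fingerprint suffices
    return bool(fingerprints(doc_a) & fingerprints(doc_b))
```

The appeal of the scheme is that a small edit only disturbs the list(s) its words hash into, so the other fingerprints still match.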
Detecting query-specific duplicate documents [patft.uspto.gov]
An improved duplicate detection technique that uses query-relevant information to limit the portion(s)
of documents to be compared for similarity is described. Before comparing two documents for similarity,
the content of these documents may be condensed based on the query. In one embodiment,
query-relevant information or text (also referred to as "snippets") is extracted from the documents and
only the extracted snippets, rather than the entire documents, are compared for purposes of
determining similarity.
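The snippet-based comparison described above might look something like this. The fixed word window around each query hit is my own simplification; the patent doesn't specify how the query-relevant text is bounded:

```python
def snippet(document, query, window=3):
    """Condense a document to the words surrounding query-term hits —
    a stand-in for the "snippets" the abstract describes."""
    words = document.lower().split()
    terms = set(query.lower().split())
    keep = set()
    for i, w in enumerate(words):
        if w in terms:
            keep.update(range(max(0, i - window),
                              min(len(words), i + window + 1)))
    return " ".join(words[i] for i in sorted(keep))

def query_duplicates(doc_a, doc_b, query):
    # only the extracted snippets, not the whole documents, are compared
    return snippet(doc_a, query) == snippet(doc_b, query)
```

The point is that two pages can count as duplicates *for a given query* even if their boilerplate differs — exactly the press-release scenario discussed earlier in the thread.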