Forum Moderators: Robert Charlton & goodroi
Beware of duplicate content!
A similarity engine generates compact representations of objects called sketches. Sketches of different objects can be compared to determine the similarity between the two objects. The sketch for an object may be generated by creating a vector corresponding to the object, where each coordinate of the vector is associated with a corresponding weight. The weight associated with each coordinate in the vector is multiplied by a predetermined hashing vector to generate a product vector, and the product vectors are summed. The similarity engine may then generate a compact representation of the object based on the summed product vector.
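The abstract above is essentially describing a random-projection sketch (what's become known as simhash). Here's a minimal, hedged sketch of the idea in Python — the MD5-based hashing vector and the 64-bit width are my own illustrative choices, not anything the patent specifies:

```python
import hashlib

def simhash(weighted_features, bits=64):
    """Generate a compact bit-sketch from (feature, weight) pairs.

    Each feature is hashed to a pseudo-random +/-1 "hashing vector";
    the weighted hashing vectors are summed, and the sign of each
    coordinate of the summed product vector gives one bit of the sketch.
    """
    totals = [0.0] * bits
    for feature, weight in weighted_features:
        # derive a stable pseudo-random bit pattern for this feature
        h = int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest(), "big")
        for i in range(bits):
            # bit i of the hash chooses +1 or -1 for coordinate i
            totals[i] += weight if (h >> i) & 1 else -weight
    # collapse the summed product vector to one bit per coordinate
    return sum(1 << i for i, t in enumerate(totals) if t > 0)
```

Because similar feature sets push most coordinates in the same direction, near-duplicate documents end up with sketches that differ in only a few bits.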
We do want to keep this thread focused on the patent, so I suggest that people take site-specific questions to another spot.
Thomas Phelps and Robert Wilensky, for their project, Robust Hyperlinks. Traditional hyperlinks are very brittle, in that they are useless if the page later moves to a different URL. This project improves upon traditional hyperlinks by creating a signature of the target page, selecting a set of very rare words that uniquely identify the page, and relying on a search engine query for those rare words to find the page in the future.
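The rare-word signature idea above is easy to sketch. This is only an illustration of the general approach, not Phelps and Wilensky's actual implementation; `doc_freq` here is a hypothetical word-to-document-frequency table:

```python
def lexical_signature(page_text, doc_freq, k=5):
    """Pick the k rarest words on a page as a search-engine signature.

    `doc_freq` is a hypothetical {word: documents-containing-it} table;
    rarer words identify the page more uniquely, so they sort first.
    """
    words = {w.lower() for w in page_text.split() if w.isalpha()}
    ranked = sorted(words, key=lambda w: doc_freq.get(w, 0))
    return ranked[:k]
```

Feeding the returned words into a search engine query would then, in principle, relocate the page even after its URL changes.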
There are situations where duplicate content is “legal”
Sure, but that doesn't mean Google wants (or that users want) duplicate content in search results.
Press releases are a good example: The same release from Widgetco about its new WC-1 digital camera might turn up on 50 different photo sites, but from a user's point of view, there's no value in having all 50 copies of the press release listed in a Google SERP.
I tried to find this current patent there, but it said: "Your search - ininventor:Charikar - did not match any documents." The same held true for a search for the title or specific keywords. Is it just too new, or is Google Patents broken in a similar way to the link: command?
I'm new to patents: is there a considerable increase in the number of patents released by Google? Does the new Big Daddy infrastructure allow Google to implement the functionality of such patents more easily than before?
As for this specific patent: I was again wondering about the status of "words" as the basis of a vector matrix (or of an overall analysis like in this thread [webmasterworld.com]). It sounds as if an arbitrary deletion of spaces (or substitution of spaces with hyphens) would disable the detection of duplicate content. It cannot be that easy, can it?
[edited by: Oliver_Henniges at 9:33 pm (utc) on Jan. 2, 2007]
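One hedged guess at why space-stripping alone probably wouldn't defeat detection: nothing forces the "discrete elements" to be whitespace-delimited words. If the engine shingled on character n-grams instead (my illustrative choice here, not something the patent states), deleting spaces or swapping them for hyphens would barely change the feature set:

```python
def char_shingles(text, n=5):
    """Character n-gram shingles, ignoring whitespace and punctuation.

    Because non-alphanumeric characters are stripped before shingling,
    deleting spaces or replacing them with hyphens leaves the shingle
    set essentially unchanged.
    """
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Set overlap: 1.0 means identical shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0
```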
By the way, we mostly find these through a combination of alerts, feeds, and
just plain keeping lots of feelers out. Plus we have a lot of mods on the job here
-- that improves our chance of catching a story early on.
[edited by: tedster at 8:17 pm (utc) on Feb. 10, 2008]
SUMMARY OF THE INVENTION
....
The method includes generating a vector corresponding to the object, each coordinate of the vector being associated with a corresponding weight and multiplying the weight associated with each coordinate in the vector by a corresponding hashing vector to generate a product vector. The method further includes summing the product vectors and generating the compact representation of the object using the summed product vectors.
That is 200-year-old mathematics. How it could possibly stand up in any courtroom is beyond my understanding. Boo.
Other than that, interesting leisure reading.
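For what it's worth, the payoff of the summed-product-vector method quoted above is that comparing two documents reduces to comparing two small bit strings. A minimal sketch of that comparison step (the 64-bit width is an assumption, not from the patent):

```python
def hamming_similarity(sketch_a, sketch_b, bits=64):
    """Fraction of agreeing bits between two integer bit-sketches.

    Under a random-projection sketch, this fraction approximates the
    cosine similarity of the underlying weighted vectors.
    """
    differing = bin(sketch_a ^ sketch_b).count("1")
    return 1.0 - differing / bits
```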
Is it just me or does this sound like a forum killer?
All forums inherently place the titles and tags in exactly the same places on every page. Would forums be penalized further by this? It's no secret that big G already dislikes forums and dynamic content, so webmasters go to great lengths to make their forums more friendly. Is this the return of webmasters needing to create every page differently, manually and one by one?
I second your opinion, though in practice this supposedly depends on the lawyers one can afford.
Indeed, the patent is worded very broadly: the vector analysis is primarily based on "words" or phrases, but "...the concepts described could also be implemented based on any object that contains a series of discrete elements." The question is whether it HAS been implemented that way, or whether Google has developed some other (prior) engines to "normalize" text bodies according to their lexemes beyond typing mistakes.
To me, one of the key future issues is the fact that Google now probably has the infrastructure to COMBINE and TWEAK all such patents quite easily. As Anna Patterson said here [acmqueue.com]:
"The really hard problem with crawlers is to perform dynamic duplicate elimination—eliminating both duplicate URLs and duplicate content..."
It looks as if crawling and evaluation of websites under this new infrastructure are performed in one and the same big (ever-changing) process, from time to time shoveling large parts of the results out to the world in "data refreshes." And most of us are wondering how PageRank calculation fits into this scheme.
Has anyone yet done a synopsis of some of these patents? Maybe "similarity engine 124", "server device 110", "memory 109" and all these labels remain the same across the filings? What did the latest patents on PageRank say in this respect?
Detecting duplicate and near-duplicate files [patft.uspto.gov]
Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints
to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to
one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the
populated lists. Two documents may be considered to be near-duplicates if any one of their
fingerprints match.
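The three steps in that abstract — extract parts, assign them to lists, fingerprint each populated list — can be sketched roughly as follows. This is my own guess at a concrete reading, with words as the "parts", hash-modulo bucket assignment, and MD5 as the fingerprint function, none of which the patent pins down:

```python
import hashlib
from collections import defaultdict

def fingerprints(document, num_lists=4):
    """(i) extract words, (ii) assign each to one of `num_lists` lists
    by hash, (iii) fingerprint each populated list."""
    lists = defaultdict(list)
    for word in document.lower().split():
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        lists[h % num_lists].append(word)
    return {
        hashlib.md5(" ".join(words).encode("utf-8")).hexdigest()
        for words in lists.values()
    }

def near_duplicates(doc_a, doc_b):
    # per the abstract: a match on ANY one fingerprint suffices
    return bool(fingerprints(doc_a) & fingerprints(doc_b))
```

The appeal of the scheme is that a small edit only disturbs the list(s) its words hash into, so the other fingerprints still match.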
Detecting query-specific duplicate documents [patft.uspto.gov]
An improved duplicate detection technique that uses query-relevant information to limit the portion(s)
of documents to be compared for similarity is described. Before comparing two documents for similarity,
the content of these documents may be condensed based on the query. In one embodiment,
query-relevant information or text (also referred to as "snippets") is extracted from the documents and
only the extracted snippets, rather than the entire documents, are compared for purposes of
determining similarity.
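The snippet-based comparison described above might look something like this. The fixed word window around each query hit is my own simplification; the patent doesn't specify how the query-relevant text is bounded:

```python
def snippet(document, query, window=3):
    """Condense a document to the words surrounding query-term hits —
    a stand-in for the "snippets" the abstract describes."""
    words = document.lower().split()
    terms = set(query.lower().split())
    keep = set()
    for i, w in enumerate(words):
        if w in terms:
            keep.update(range(max(0, i - window),
                              min(len(words), i + window + 1)))
    return " ".join(words[i] for i in sorted(keep))

def query_duplicates(doc_a, doc_b, query):
    # only the extracted snippets, not the whole documents, are compared
    return snippet(doc_a, query) == snippet(doc_b, query)
```

The point is that two pages can count as duplicates *for a given query* even if their boilerplate differs — exactly the press-release scenario discussed earlier in the thread.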