Forum Moderators: Robert Charlton & goodroi

Google's percentage threshold for duplicate content

MLHmptn

9:26 am on May 11, 2010 (gmt 0)

10+ Year Member



I've never seen this question posted before, but I'm presuming somebody in SEO land has figured this out. Let's say I have 25 widgets that all share a common description, except the specifications (sizes, colors, etc.) are different. Now, we all know that if pages are too much alike, they are filtered and either thrown into supplemental results or simply not indexed, because Google sees them as duplicate content. My question is whether anybody knows an approximate threshold of on-page keyword density (or copy) before Google treats something as duplicate content. It would obviously be easy to just provide specs, and that would eliminate the duplicate content, but SEO-wise it hinders the on-page factors and consequently ranking.

tedster

11:41 am on May 11, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The technology goes quite a bit beyond percentage calculations, as I understand it. Here's a thread about a 2001 Google duplicate content patent application [webmasterworld.com]. Even back then, the language they used to describe their heuristic was about "estimating similarity" rather than "duplicate detection".

SUMMARY OF THE INVENTION
....
The method includes generating a vector corresponding to the object, each coordinate of the vector being associated with a corresponding weight and multiplying the weight associated with each coordinate in the vector by a corresponding hashing vector to generate a product vector. The method further includes summing the product vectors and generating the compact representation of the object using the summed product vectors.


So, no simple text match percentages ;(
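
For the curious, here's a minimal Python sketch of what that summary seems to describe (essentially what's now called simhash-style similarity estimation). The md5-derived +1/-1 "hashing vectors" and the plain term-frequency weights are my own assumptions for illustration; the patent doesn't pin those details down:

import hashlib
from collections import Counter

def simhash(text, bits=64):
    # term -> weight (plain term frequency here; the patent leaves weighting open)
    weights = Counter(text.lower().split())
    totals = [0] * bits  # the "summed product vector"
    for term, weight in weights.items():
        h = int(hashlib.md5(term.encode()).hexdigest(), 16)
        for i in range(bits):
            # bit i of the term's hash plays the +1/-1 "hashing vector" coordinate
            totals[i] += weight if (h >> i) & 1 else -weight
    # compact representation: keep only the sign of each summed coordinate
    return sum(1 << i for i in range(bits) if totals[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

page1 = "red widget available in sizes small medium large free shipping"
page2 = "blue widget available in sizes small medium large free shipping"
print(hamming(simhash(page1), simhash(page2)))

Near-duplicate pages land noticeably fewer bits apart than the roughly 32 of 64 you'd expect for unrelated pages, so "similarity" becomes a cheap Hamming-distance check rather than a text-match percentage.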

jk3210

4:51 pm on May 11, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ha, you'll like this...

Lately I've been wondering whether Google isn't attempting to replicate something like the DNA->RNA genetic-transcription process to produce a digital equivalent of a "genetic code" for each page.

I mean, if your goal is to record and analyze every bit of information in the universe, it would probably make your job easier if you were using a language similar to the one nature uses to describe it.

pontifex

5:42 pm on May 11, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



“We took the human genome, cut it into 173 puzzle pieces and rearranged it to make a pig,” explains animal geneticist Lawrence Schook. “Everything matches up perfectly. The pig is genetically very close to humans.”


That would explain why the net is full of p*rn, but not how the vector generation is done on documents.

Because there are so many documents out there, the usage of different words and "stop words" (to use the search engine lingo), plus their order, must pretty much define the vector.

IMHO, 25% different words and a different order of the sentences should make it hard for an algorithm to detect duplication!
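
One quick way to sanity-check that hunch is with word shingles, i.e. overlapping runs of k consecutive words, which is a common similarity heuristic (though not necessarily what Google uses). This little Python sketch, with made-up example sentences, shows why reordering hurts detection less than rewording:

def shingles(text, k=3):
    # break text into overlapping k-word "shingles"
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # overlap of two shingle sets: 1.0 = identical, 0.0 = disjoint
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

original  = "the quick brown fox jumps over the lazy dog near the river"
reordered = "near the river the quick brown fox jumps over the lazy dog"
reworded  = "the quick brown fox leaps over the sleepy dog near the river"

print(jaccard(original, reordered))  # ~0.67: moving a clause only breaks shingles at the seam
print(jaccard(original, reworded))   # 0.25: each changed word wipes out up to k shingles

So shuffling order alone leaves a lot of shingle overlap intact; you'd also need to change a fair share of the words, which fits the 25% intuition.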

P!