Forum Moderators: ergophobe
Now, this got me wondering. Do search engines really have the capacity to detect duplicate content? At the SMX conference, a Yahoo rep said that Yahoo checks and 'filters' duplicate content at crawl time. Can that really be true?
I mean, look at it this way. There are billions of websites and webpages on the internet. Is it even remotely possible for a search engine to take document A and compare it against the countless documents out there to find duplication? I really doubt that.
In the case of Copyscape, I pinpoint a document and check it against other sources present on the web. So I perform just a single check. I have an objective and a base point. But what base point do the search engines have?
Could it be that they manually identify sources that allow articles to be reprinted, like EzineArticles, and scan all articles present on those sites for duplication elsewhere?
Other than this method, I really doubt the search engines can actually detect content duplication unless Doc A links to Doc B, Doc A and B being exact replicas.
The patent really does not talk much about how the actual comparison is going to work.
Given that there are billions of webpages out there, comparing one with another doesn't seem spiderly possible. What would be the underlying concept behind this patent? I really doubt if anyone can explain that in plain English. Seems like a make-believe kind of a thing to me.
Let's take the simplest case - where you are only interested in detecting an EXACT duplicate. Of course, this wouldn't work in a practical sense, since EXACT duplicates are unlikely. But let's just do it as an exercise.
This is really easy - simply run an MD5 or similar hashing algorithm on the page. This is quick to do, and is done only once, when the page is spidered.
The probability of two different pages generating the exact same MD5 is vanishingly low. One can pretty much assume that two documents with the same MD5 are identical. In any case, a rote comparison can be done among pages that have identical MD5s to make a final determination.
Simply create a database index of the MD5s, and now you can tell just about instantly whether duplicates exist, and also very quickly access all of the duplicates.
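The exact-duplicate case described above is simple enough to sketch. This is a toy illustration, not anyone's production crawler: the in-memory dict stands in for the database index, and the URLs are made up.

```python
import hashlib

def page_fingerprint(html: str) -> str:
    """Return an MD5 hex digest of the page's raw content."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()

# Toy "index" mapping fingerprint -> list of URLs that share it.
index: dict[str, list[str]] = {}

def add_page(url: str, html: str) -> list[str]:
    """Index a page; return URLs already indexed with the same digest."""
    fp = page_fingerprint(html)
    duplicates = list(index.get(fp, []))  # copy before we mutate the index
    index.setdefault(fp, []).append(url)
    return duplicates

add_page("http://a.example/", "<p>Hello world</p>")
dupes = add_page("http://b.example/", "<p>Hello world</p>")
# dupes now lists the earlier page with the identical digest
```

Because the digest is computed once at spider time, the duplicate lookup is a single index hit, no matter how many billions of pages are stored.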
The problem with using an MD5 for this is that changing just one letter in the document generates a radically different MD5 value - MD5 was designed to do that. But imagine a different hashing algorithm that would generate a nearby number for a document that differs by one letter, and a nearby-but-further number for a document that differs by one or two words.
What Google is doing apparently is a bit more complex than this - they mention a vector, so they are digesting a page into a series of values, not just one, then summing those values (each multiplied by a weight).
I would bet that they have a number of algorithms that each boils some aspect of the document down to a value that can be compared linearly, such that the closer the values, the more likely the documents being compared are identical or similar.
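Nobody outside Google knows the exact algorithm, but a simhash-style fingerprint is the usual guess for what a "vector of weighted values" buys you, and it delivers exactly the "nearby number for a nearby document" property described above. A minimal sketch (word-level features, equal weights - both simplifying assumptions):

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Similar documents yield fingerprints differing in few bits."""
    totals = [0] * bits
    for word in text.lower().split():
        # Hash each feature (here, each word) to a bit pattern.
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            totals[i] += 1 if (h >> i) & 1 else -1
    # The sign of each summed component becomes one fingerprint bit.
    return sum(1 << i for i in range(bits) if totals[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"  # one word changed
doc_c = "completely unrelated text about search engine patents"
```

Changing one word nudges only the components where that word's hash tipped the balance, so doc_a and doc_b land close in Hamming distance while doc_c lands far away - and "find fingerprints within k bits" is an indexable query, unlike "compare every pair of pages".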
The most obvious parameters of a document that might be measured thusly have trivial hashing algorithms - in fact, the hash is equal to the parameters - word count and letter count. If document A is 1000 words long, and document B is 10000 words long, they aren't likely to be identical. If they are 1000 and 1001 words long, more likely.
Now add letter frequencies, most-occurring words and their frequencies, etc.
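Those trivial parameters are cheap to compute and to index. A quick sketch of the "rule out fast" idea - the feature set and the 1% tolerance are arbitrary choices for illustration:

```python
from collections import Counter

def cheap_profile(text: str) -> dict:
    """Trivially indexable features - the 'hash' is the measurement itself."""
    words = text.lower().split()
    return {
        "word_count": len(words),
        "letter_count": sum(len(w) for w in words),
        "top_words": [w for w, _ in Counter(words).most_common(3)],
    }

def plausibly_duplicate(a: dict, b: dict) -> bool:
    """Quickly rule a pair OUT; passing this check proves nothing by itself."""
    limit = 0.01 * max(a["word_count"], b["word_count"], 1)
    close = abs(a["word_count"] - b["word_count"]) <= limit
    return close and set(a["top_words"]) == set(b["top_words"])
```

A 1000-word document and a 10000-word document fail instantly; 1000 vs 1001 survives to the next, more expensive round of checks.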
As another example, there are statistical tests for the "grade level" of writing. (Most word-processing packages have this buried somewhere...) A document written at a 10th-grade level isn't likely to be identical to a document written at a college-graduate level.
Measure spelling errors. Grammar errors. Regionalisms. "Voice". (Oh, webwork will like that one! :) ) etc, etc, etc.
It's a game of divide and conquer. How can you quickly rule-out vast numbers of documents as "not duplicate" using simple, easily-compared and indexed values?
Now, start out by doing a search, but disable the parts of Google's search algorithm that deal with PR, links, authority, trust, etc. In other words, just a search for the terms the user asked for, with no other fiddling. You've already narrowed the universe of possible duplicates, and I'd imagine that's where the rest of the magic takes off from.
But I think one could still get away with it by altering the doc well enough. What if I split the original doc in two, put it over two pages, changed the heading and a few lines (the first and last), and also changed some words to their synonyms?
I did a quoted search for the phrase "Hair care is an overall term for parts of hygiene" - the original article is from Wikipedia, but Wikipedia ranks second for the term.