Basically, there is no clean answer. If you want to understand some of the technology or theories out there, then you need to start reading. Sure, you can take the easy way out and ask site owners for examples and theories, but what better way to learn than from those who study these problems for the search engines themselves?
A good starting point for the challenges is here:
Section 4. Duplicate Hosts
Algorithmic Challenges in Web Search Engines [internetmathematics.org]
published in Volume 1.1, Journal of Internet Mathematics, by Monika R. Henzinger (Research Director - Google, Inc.), 2003
Follow and read every reference listed in that section and you will get a good idea of how duplication detection works and what its challenges are.
Note: This does not imply that Google currently employs any or all of these methods although I'm sure they use a large part of them.
The bottom line is that straight or very-near duplication, similar site structures, and similar sites hosted on the same server can be detected easily. When you start to get into paragraph/article duplication, things get fuzzy and detection is very difficult, with the "determined" authority beating out the rest.
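To make the "very-near duplication" case concrete, here is a minimal sketch of word shingling with a Jaccard resemblance score, one technique described in the duplicate-detection literature. It is only an illustration: the function names and the shingle width `k=4` are my own assumptions, and nothing here is meant as what Google or any other engine actually runs.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word runs) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=4):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are trivially identical
    return len(sa & sb) / len(sa | sb)


page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the lazy cat"
print(resemblance(page_a, page_b))
```

Two pages that differ by a word or two score close to 1.0. A single syndicated paragraph dropped into an otherwise different page barely moves the score, which is one way to see why paragraph-level duplication is the fuzzy part.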
I'm a yacht broker captain ... not a miracle worker! (Going to have to go back to school to undertake that paper a second time!)
Think I'll stick to reading WebmasterWorld. It's more than enough for me to digest and attempt to understand! :)
But just ignore all that :) I'm worried that many see the math and just close the documents, when in fact there is usually loads of info in easy-to-read descriptions, at least if you are familiar with how sites are hosted and work.
The best tidbits of information to take from these papers are what can be achieved now and what cannot be achieved.
The duplicate host detection problem is easier than mirror detection since the URLs between duphosts differ only in the hostname component.
Of course, the hard questions are, what sketch to choose and how to avoid comparing all pairs of hosts. Since there are millions of different hosts, comparing all pairs is simply infeasible.
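As a rough illustration of those two questions, the sketch below fingerprints each host by a few of its path hashes and then compares only hosts that land in the same bucket, instead of every pair. The sketch choice (the smallest MD5 hashes of the paths), the `size` parameter, and the function names are assumptions of mine for the example, not the method from the paper.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def host_sketch(paths, size=3):
    """Keep the `size` smallest path hashes as a compact fingerprint of a host."""
    hashes = sorted(hashlib.md5(p.encode("utf-8")).hexdigest() for p in set(paths))
    return frozenset(hashes[:size])


def candidate_pairs(hosts, size=3):
    """hosts: dict mapping hostname -> list of paths seen on that host.

    Returns the pairs of hostnames that share at least one sketch element.
    Only these pairs would need a full (expensive) comparison.
    """
    buckets = defaultdict(set)
    for name, paths in hosts.items():
        for h in host_sketch(paths, size):
            buckets[h].add(name)
    pairs = set()
    for names in buckets.values():
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical hostnames and paths, for illustration only.
hosts = {
    "a.example": ["/x", "/y", "/z"],
    "b.example": ["/x", "/y", "/z"],
    "c.example": ["/p", "/q"],
}
print(candidate_pairs(hosts))
```

The work is roughly one pass over the hosts plus the pairs inside each bucket, so hosts with nothing in common never get compared at all.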
See? These types of papers point out that some limits have to be set on calculating things. Some of these comparisons take a huge load of resources, and that's what many people do not understand. Google, and many other search engines, are not an all-seeing eye that can detect every little item that breaks their rules. They might have the capability to do so, but then we would be having updates every 6 months to a year or so.
A perfect example is cheat codes for video games. If you type in the right search terms you'll get perhaps 200 results of almost the same cheat codes and hints with the only difference being the navigation structure of the sites. Why do these sites never get pulled? Because they are all on different hosts around the world and they have different linking structures. Now if you had ten sites with the same cheat codes sitting on the same server or host, a search engine could detect them in no time and penalize them as such.
Also, I would agree that this is a very grey area, and is confusing at best. For instance, many press releases, articles, etc. are widely distributed and/or syndicated. And the idea that we all should be Pulitzer Prize candidates creating "unique, important, compelling" (you name it) content is clearly unachievable.
I think, at one time, Brett commented that when it comes to boilerplate product descriptions and the like, differing navigation structure and layouts are sufficient to avoid a duplication penalty.
Who really knows? I'm not even sure Googleguy knows.
Searches for numerous products show the same content in many of the top results in some product categories, on a number of different syndication partners (e.g. shoppingengine.portal.com). You have to scroll at least halfway down the first page to find something that is not duplicate content.
That category is big enough that a manual filter could probably knock out a lot of that duplication from some of the major sites if there is not a programming work around.
Yeah, host is like "spam", one of those words that has several meanings. Heh, sometimes these papers get really confusing because they'll use the label host in different parts of the paper for different "types of hosts".
What they are referring to here is host as in site, yet like I said above, they'll also use host as in a web server hosting multiple sites. Looking again at my cheat codes example above, I could have worded it better.
The duplicate host detection problem is easier than mirror detection since the URLs between duphosts differ only in the hostname component.
She could have worded this better, because mirror detection as defined in (Bharat and Broder 99) Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content [www8.org] is almost the same thing as what she describes as the easier duphosts problem:
Hence, we define two hosts to be mirrors if: i. A high percentage of paths (that is, the portions of the URL after the hostname) are valid on both web sites...
So basically, from what I gather, she is talking about something along the lines of dmoz clones, where sites are almost exactly the same except for different color schemes, logos, etc. Plus the URLs are all the same apart from the hostname, like:
www.dmoz.org/Reference/Museums/History/
www.dmozclone.com/Reference/Museums/History/
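Here is a small sketch of the path-overlap idea in that quoted definition: group URLs by hostname, then measure what share of the smaller host's paths also show up on the other host. It only checks that the same paths were seen on both hosts, not that the pages behind them match, and the function names are my own assumptions.

```python
from urllib.parse import urlsplit


def paths_by_host(urls):
    """Group URL paths by lowercase hostname."""
    hosts = {}
    for url in urls:
        # Accept scheme-less URLs like the examples above.
        parts = urlsplit(url if "://" in url else "http://" + url)
        hosts.setdefault(parts.netloc.lower(), set()).add(parts.path or "/")
    return hosts


def path_overlap(paths_a, paths_b):
    """Share of the smaller host's paths that also appear on the other host."""
    if not paths_a or not paths_b:
        return 0.0
    return len(paths_a & paths_b) / min(len(paths_a), len(paths_b))


urls = [
    "www.dmoz.org/Reference/Museums/History/",
    "www.dmozclone.com/Reference/Museums/History/",
]
hosts = paths_by_host(urls)
print(path_overlap(hosts["www.dmoz.org"], hosts["www.dmozclone.com"]))
```

With only the one shared path in the sample, the overlap comes out as 1.0. On a real crawl you would want many paths per host before trusting the number.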
Dmoz is probably the best example to look at when seeing how they treat duplication. Many of the sites and pages are still in the results; they just don't have any serious weight to them where they can rank well.
The paper I listed above in this post and this one, A Comparison of Techniques to Find Mirrored Hosts on the WWW (1999) [citeseer.nj.nec.com], explain fairly well what are considered duphosts and mirrors. You'll also find information on sites hosted on the same server, external linkage, and so on. Pretty good reading, and they lay out great examples.
Skibum:
That category is big enough that a manual filter could probably knock out a lot of that duplication from some of the major sites if there is not a programming work around.
And I think that is all they can do with most of this duplication filtering; manually tweak the algo to treat special areas that have gotten out of hand.
I think we all have seen examples (not just in Google) of duplicate content that would be easy to filter out but is not. The dMoz clones were mentioned as a good example: easy to filter out, so why are they not? Also, you'll often find the same pages indexed under http and www - why? On top of that comes the indexing of CSS files, JS files, 404 files, robots.txt files, search engines' own search results, etc. All in all, a lot of pages that there is absolutely no reason to keep in the index - except for being biggest ...
I am sorry, but I do not think the search engines really want to filter out all the garbage in their indexes. Sure, the engineers want that, but the management don't. As long as top management (and the press) think index size is a good metric for quality, I don't think we will see any major reduction of duplicate content.
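For the http/www case specifically, a crawler could in principle fold both hostname variants onto one key before indexing. This is only a sketch under that assumption. `canonical_key` is a hypothetical helper, and since a bare hostname and its www variant are not guaranteed to serve the same content, a real system would presumably confirm the pages match before merging them.

```python
from urllib.parse import urlsplit


def canonical_key(url):
    """Collapse the www and non-www variants of a URL onto one lookup key."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + (parts.path or "/")


seen = {}
for url in ["http://example.com/page.html", "http://www.example.com/page.html"]:
    # Group every URL variant under its canonical key.
    seen.setdefault(canonical_key(url), []).append(url)
print(seen)
```

Both sample URLs end up under the single key example.com/page.html, so only one of them would need to stay in the index.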
[google.com...]
This is of particular interest now, as I'm having to start dealing with blatant duplication, innocent as the original intentions were. Two sites identical in layout, same products, same page naming - all linked into one shopping cart on one of the sites.
[edited by: msgraph at 10:05 pm (utc) on Sep. 20, 2003]