msgraph - 6:44 pm on May 24, 2003 (gmt 0)
Yeah, host is like "spam", one of those words that has several meanings. Heh, sometimes these papers get really confusing because they'll just label something "host" in different parts of the paper when they mean different "types of hosts". What they are referring to here is host as in site, yet like I said above, they'll also use host as in a web server hosting multiple sites.

Looking again at my cheat codes example above, I could have worded it better. She could have worded this better because mirror detection as defined in (Bharat and Broder 99) Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content [www8.org] is almost the same thing as what she describes as being easier to detect with the duphosts problem:

i. A high percentage of paths (that is, the portions of the URL after the hostname) are valid on both web sites...

So basically, from what I gather, she is talking about something along the lines of dmoz clones, where sites are almost exactly the same except for different color schemes, logos, etc. Plus the URLs are all the same, like:

www.dmoz.org/Reference/Museums/History/

Dmoz is probably the best example to look at when seeing how they treat duplication. Many of the sites and pages are still in the results; they just don't carry enough weight to rank well.

The paper I listed above in this post and this one, A Comparison of Techniques to Find Mirrored Hosts on the WWW (1999) [citeseer.nj.nec.com], explain fairly well what is considered duphosts and mirrors. You'll also find information on sites hosted on the same server, external linkage, and so on. Pretty good reading, and they lay out great examples.

Skibum: And I think that is all they can do with most of this duplication filtering: manually tweak the algo to treat special areas that have gotten out of hand.
>>I was unable to tell exactly what they were referring to in using the term "host"

The duplicate host detection problem is easier than mirror detection since the URLs between duphosts differ only in the hostname component.
Hence, we define two hosts to be mirrors if:
www.dmozclone.com/Reference/Museums/History/
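
To make the duphost idea a bit more concrete, here is a rough Python sketch of the "same paths, different hostname" check. This is only my own illustration, not the method from either paper: it compares path strings from URL lists you already have (it does not fetch anything to confirm a path is actually valid on both sites), and the function names and the 0.8 threshold are made up for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def split_host_path(url):
    """Return (hostname, path) for a URL, with or without a scheme."""
    if "://" not in url:
        url = "//" + url  # lets urlsplit treat the leading part as the host
    parts = urlsplit(url)
    return (parts.hostname or "").lower(), parts.path or "/"


def paths_by_host(urls):
    """Group the known paths under each hostname."""
    table = defaultdict(set)
    for url in urls:
        host, path = split_host_path(url)
        table[host].add(path)
    return table


def path_overlap(paths_a, paths_b):
    """Fraction of the smaller host's paths that also appear on the other host."""
    if not paths_a or not paths_b:
        return 0.0
    shared = paths_a & paths_b
    return len(shared) / min(len(paths_a), len(paths_b))


def looks_like_duphost(paths_a, paths_b, threshold=0.8):
    """Hypothetical cutoff: flag the pair when most paths match."""
    return path_overlap(paths_a, paths_b) >= threshold


urls = [
    "www.dmoz.org/Reference/Museums/History/",
    "www.dmoz.org/Arts/",
    "www.dmozclone.com/Reference/Museums/History/",
    "www.dmozclone.com/Arts/",
    "www.dmozclone.com/about.html",
]
table = paths_by_host(urls)
print(path_overlap(table["www.dmoz.org"], table["www.dmozclone.com"]))
```

Since only the hostname differs between duphosts, a plain set intersection on the paths is enough here. Real mirror detection would need more than this, because mirrors can rearrange their paths.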
That category is big enough that a manual filter could probably knock out a lot of that duplication from some of the major sites if there is no programming workaround.