Page is a not externally linkable
- Google
-- Google News Archive
---- Google problems from having all sites on same server


ILuvSrchEngines - 7:17 pm on Apr 25, 2004 (gmt 0)


Well, you may be right about the whois. This nutty stuff is really dry read. I may have fell asleep while reading it the first time. :-)

2.1 Detecting Host Affiliation

We define two hosts as affiliated if one or both of the following is true:
They share the same first 3 octets of the IP address.
The rightmost non-generic token in the hostname is the same.
We consider tokens to be substrings of the hostname delimited by "." (period). A suffix of the hostname is considered generic if it is a sequence of tokens that occur in a large number of distinct hosts. E.g., ".com" and ".co.uk" are domain names that occur in a large number of hosts and are hence generic suffixes. Given two hosts, if the generic suffix in each case is removed and the subsequent right-most token is the same, we consider them to be affiliated.

E.g., in comparing "www.ibm.com" and "ibm.co.mx" we ignore the generic suffixes ".com" and ".co.mx" respectively. The resulting rightmost token is "ibm", which is the same in both cases. Hence they are considered to be affiliated. Optionally, we could require the generic suffix to be the same in both cases.

The affiliation relation is transitive: if A and B are affiliated and B and C are affiliated then we take A and C to be affiliated even if there is no direct evidence of the fact. In practice some non-affiliated hosts may be classified as affiliated, but that is acceptable since this relation is intended to be conservative.

In a preprocessing step we construct a host-affiliation lookup. Using a union-find algorithm we group hosts, that either share the same rightmost non-generic suffix or have an IP address in common, into sets. Every set is given a unique identifier (e.g., the host with the lexicographically lowest hostname). The host-affiliation lookup maps every host to its set identifier or to itself (when there is no set). This is used to compare hosts. If the lookup maps two hosts to the same value then they are affiliated; otherwise they are non-affiliated.


Thread source:: http://www.webmasterworld.com/google_archive/23527.htm
Brought to you by WebmasterWorld: http://www.webmasterworld.com