Forum Moderators: open

Message Too Old, No Replies

How Altavista determines mirrors by IP or hostname

good reading since other SE's might use similar methods

         

msgraph

3:26 pm on Sep 24, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There have been a lot of discussions lately about having multiple domains on virtual servers and how the search engines deal with it. Here is AV's point of view and I'm sure others out there follow the same measures. Not sure if AV has or still uses this type of matching system but it is good reading nonetheless.


a. IP address based methods for determining a list of potential mirrors

FIG. 5 is an example of an IP (Internet Protocol) address format in the TCP/IP protocol. An IP address is made up of four "octets." Each octet of an IP address can have a value from 0 through 255 (one byte). If exactly two hosts translate into the same IP address, it is likely that they are mirrors accessing the same server, and that the two hostnames are just aliases. In contrast, a match on the first three octets signifies web servers on a same IP subnet. Such web servers are usually part of the same organization, and are often mirrored hosts, replicated to handle heavy traffic.

When many hosts resolve to the same IP address, however, it is often indicative of a web server that is a "virtual host," i.e., that is acting as a server to more than one web site. For example, certain "big IP" servers use the same IP address for many thousands of distinct web sites. These considerations led to the methods of FIGS. 6(a) and 6(b).

FIGS. 6(a) and 6(b) are flow charts showing methods of using IP addresses to detect mirrors. It will be understood that all methods discussed herein are preferably implemented as software instructions stored in a memory and executed by one or more appropriate processors. The steps of Stage I and Stage II are preferably performed by software 146 of FIG. 1 or by some similar or equivalent software.

It will also be understood that any constant numerical values used herein are provided for the purpose of example only and are not to be interpreted in a limiting sense. The methods of the present invention can be used with any appropriate constant values and any appropriate threshold values and/or weights.

In FIG. 6(a), step 602 inspects the IP addresses in the input information and clusters or groups together hosts that have the same IP address. Thus, the largest number of possible IP address clusters that can be found corresponds to the largest possible number of IP addresses (i.e., one host per IP address cluster). The clusters are ranked according to the number of hosts in each. Clusters with a smallest number of members (greater than one) arc most likely to be mirrors.

The method of FIG. 6(b) is performed similarly to the method of FIG. 6(a) except that, instead of considering an entire IP address, the method considers only the first three octets of an IP address.

b. URL based methods for determining a list of potential mirrors

FIG. 7 is an example of a format of a URL (Uniform Resource Locator) used to address resources in the world wide web. A URL includes an access method (such as "http," "ftp," etc.); a host name (for example,www.host1.com); and a path identifying the path on which a particular page is located in the host structure (for example, ABC/xy/z/hmtl). Paths can be any arbitrary reasonable length.

URL strings provide information about a potential mirroring relationship between hosts in two different ways:

1) similar hostnames suggest that hosts belong to the same or related organizations.

2) similar paths indicate a possible replication of directories. This is because paths usually reflect the host's directory structure. The following embodiment uses "term vector matching" to compute the likelihood that a pair of hosts are mirrors. Term vector matching is described in G. Salton and C. Buckley, "Term Weighting Approaches in Automatic Text Retrieval," Information Processing and Management, 24(5), pages 513-523, 1988, which is herein incorporated by reference. Based on the type of term used, various methods can be used, each with an appropriate weighting scheme. The method of FIGS. 8(a)-8(c) use terms. Possible terms include: bi-grams (see below), host names, paths, and portions of a URL, such as terms preceding a slash ("/") or other terminator character.

FIGS. 8(a)-8(c) show an example of how to determine host pairs in accordance with a URL. FIG. 8(a) is a flow chart showing a method of using URLs to detect mirrors, using "bi-grams" (also called "shingles") as terms. As described below, bi-grams are pairs of tokens from a URL along with associated depth information. Instead of two tokens, n-grams with more than two tokens can be used as well. For large data sets, such as the initial data set described herein, reducing the number of terms to be considered is a priority. One way to do this is to ignore any hosts that contribute a small number of URLs to the collection. Such hosts are either small in size and hence unlikely to have much mirrored content, or poorly sampled, in which case the samples in the collection may not be very representative. The described method considers only hosts that have at least 100 URLs in the collection of data. A further data reduction is obtained as follows: for every host, we first sort the list of URLs by alphabetical order, then consider only those paths whose strings upon hashing yield a value that is 0 mod m. Any conventional hashing method for strings can be used. For example, m can initially be 5. The aim of this step is to increase the correlation between the selected paths. If a path is selected on one host, it increases the likelihood that the same path is selected on its mirror.

After the first 20 paths are sampled this way, m is doubled, and this doubling is repeated at certain thresholds. The doubling thresholds are picked so that the sample count from a host is sub-linearly proportional to size. This ensures that, on the one hand, more paths are sampled from larger hosts than smaller hosts but, on the other hand, larger hosts do not dominate the list of terms. Alternately, different or no input sample reduction need be performed.

Steps 802 and 804 of FIG. 8(a) generate "bi-grams" for the paths of the input data sampled as described above. Bi-grams are pairs of tokens from a URL (along with associated depth information) and are used as terms in the example method shown. An example of bi-grams is shown in FIG. 8(b). In step 802, each URL is converted to lower case. Paths may or may not be case-sensitive, but the hostname is. Even so, case is preferably ignored. Any non-alphabetical chacters (such as numbers) are treated as word breaks when choosing tokens. Thus, in FIG. 8(b), a URL of www.a.b.com/cellblock16/inmates/me/personal/foo.html will yield tokens of (cellblock, inmates, me, personal, foo, and html). In step 804, bi-grams are generated by treating every continuous pair of tokens as a term and by adding depth information concerning the level of the term in the path. For example, the example of FIG. 8(b) generates the following set of bi-grams for URL

[ab.com...] {(Cellblock, inmates,0), (inmates, me,1), (me, personal,2), (personal, foo,3), and (foo, html,4)}. The depth information is useful when trying to find mirror hosts that share the same path structure. By fragmenting the path, we also associate depth information with the path.

.....more info on link

Method and apparatus for finding mirrored hosts by analyzing urls [164.195.100.11]