Paper here [research.compaq.com]
Measuring Index Quality Using Random Walks on the Web
Monika R. Henzinger, Allan Heydon,
Michael Mitzenmacher, and Marc Najork
"Recent research has studied how to measure the size of a search engine, in terms of the number of pages indexed. In this paper, we consider a different measure for search engines, namely the quality of the pages in a search engine index. We provide a simple, effective algorithm for approximating the quality of an index by performing a random walk on the Web, and we use this methodology to compare the index quality of several major search engines."
"For each of the extracted links, ensure that it is an absolute URL (derelativizing it if necessary), and add it to the list of URLs to download, provided it has not been encountered before."
I always figured absolute URLs were the way to go.
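For anyone curious what that quoted step looks like in practice, here is a rough sketch in Python. It is not Mercator's code (the paper doesn't show any here); the function name `extract_new_urls`, the example.com URLs, and the choice to drop `#fragment` parts are all my own assumptions for illustration.

```python
from urllib.parse import urljoin, urldefrag

def extract_new_urls(base_url, hrefs, seen):
    """Make each href absolute against base_url; return only unseen URLs.

    `seen` is a set of absolute URLs encountered so far and is updated
    in place. Dropping fragments is an assumption, not from the paper.
    """
    new_urls = []
    for href in hrefs:
        # urljoin leaves absolute URLs alone and resolves relative ones.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in seen:
            seen.add(absolute)
            new_urls.append(absolute)
    return new_urls

# Hypothetical page and links, for illustration only.
seen = set()
frontier = extract_new_urls(
    "http://example.com/docs/index.html",
    ["page2.html", "../about.html", "http://example.com/docs/page2.html"],
    seen,
)
print(frontier)
```

The relative link and the absolute link to page2.html resolve to the same string, so it is queued only once. That is the practical point: the spider ends up with absolute URLs whichever way you write them.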
This should explain the 64-bit checksum, straight from DEC themselves:
Explains their fingerprint method [wing.rug.nl]
Just finished looking through all of it and the references. That was one of the best documents I've read in a long time explaining certain spidering methods. Now I understand why they have grabbed almost every page on all my domains with various Mercator spiders.
I wonder whether, once they finish tweaking Mercator, they will use all the data collected to power other search engines, similar to what Inktomi does. Or will they keep it for themselves? Probably the latter.
They say they struggle in two particular areas:
Alternative paths on the same host;
Replication across different hosts;
They still download these types of duplicates, but after running the content-seen test they process only the first copy. This means the URLs in the duplicate copies are not followed, as they are assumed to lead to duplicate content.
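To make the effect concrete, here is a toy content-seen test in Python. Treat it as a sketch under assumptions: the 64-bit fingerprint is a truncated BLAKE2b hash standing in for whatever fingerprint method they really use, and the `pages` dict, the `crawl` function, and the example domains are made up for illustration.

```python
import hashlib
from collections import deque

def fingerprint64(content: bytes) -> int:
    """64-bit document fingerprint (truncated BLAKE2b, a stand-in only)."""
    digest = hashlib.blake2b(content, digest_size=8).digest()
    return int.from_bytes(digest, "big")

def crawl(pages, start_urls):
    """Crawl a hypothetical in-memory web: url -> (body, absolute links).

    Returns the URLs whose content was actually processed.
    """
    content_seen = set()          # fingerprints of processed documents
    url_seen = set(start_urls)    # URLs already queued
    queue = deque(start_urls)
    processed = []
    while queue:
        url = queue.popleft()
        if url not in pages:
            continue
        body, links = pages[url]  # the "download" always happens
        fp = fingerprint64(body)
        if fp in content_seen:
            continue              # duplicate content: links are NOT followed
        content_seen.add(fp)
        processed.append(url)
        for link in links:
            if link not in url_seen:
                url_seen.add(link)
                queue.append(link)
    return processed

# Two regional domains sharing a byte-identical index page.
index = b"<html>corporate index</html>"
pages = {
    "http://example.co.uk/": (index, ["http://example.co.uk/local.html"]),
    "http://example.de/": (index, ["http://example.de/local.html"]),
    "http://example.co.uk/local.html": (b"UK content", []),
    "http://example.de/local.html": (b"DE content", []),
}
processed = crawl(pages, ["http://example.co.uk/", "http://example.de/"])
print(processed)
```

In this toy run the second index page is fetched but skipped as a duplicate, so the .de local page is never queued even though its content is unique.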
On a practical level this could have serious implications for many "valid" sites. Imagine a large, geographically spread company with hosted sites in many different countries/domains, all using the same initial "corporate" index page, which leads to localised content within each site. I have seen this with a recent client with over 150 regional branches/domains: their "local" domains were effectively banned from many search engines.