Mercator: A Scalable, Extensible Web Crawler

Allan Heydon and Marc Najork

NFFC

2:30 pm on Feb 1, 2001 (gmt 0)

"This paper describes Mercator, a scalable, extensible web crawler written entirely in Java. Scalable web crawlers are an important component of many web services, but their design is not well-documented in the literature."

Paper here [research.compaq.com]

msgraph

2:34 pm on Feb 1, 2001 (gmt 0)

Thanks NFFC!

I have been searching all over the world for some kind of info on Mercator and now it's good to finally see something.

The crunched version of Google's crawling methods is a perfect addition as well.

NFFC

2:47 pm on Feb 1, 2001 (gmt 0)

There are also a few published research papers regarding the use of Mercator:

Measuring Index Quality Using Random Walks on the Web

Monika R. Henzinger, Allan Heydon,
Michael Mitzenmacher, and Marc Najork

"Recent research has studied how to measure the size of a search engine, in terms of the number of pages indexed. In this paper, we consider a different measure for search engines, namely the quality of the pages in a search engine index. We provide a simple, effective algorithm for approximating the quality of an index by performing a random walk on the Web, and we use this methodology to compare the index quality of several major search engines."

Paper here [research.compaq.com]

msgraph

2:49 pm on Feb 1, 2001 (gmt 0)

I can see this topic reaching a level of over 1000 posts by the end of today. 700 of them from me.

"For each of the extracted links, ensure that it is an absolute URL (derelativizing it if necessary), and add it to the list of URLs to download, provided it has not been encountered before."

I always figured absolute URLs were the way to go.
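
For anyone who wants to see the quoted step in concrete terms, here is a minimal Python sketch of "derelativize, then add if not encountered before". It is not Mercator's code (Mercator is written in Java). The function name `extract_new_urls`, the example.com URLs, and the choice to drop `#fragment` parts before comparing are all assumptions made for illustration.

```python
from urllib.parse import urljoin, urldefrag


def extract_new_urls(base_url, hrefs, seen):
    """Resolve each href against base_url and return the ones not seen before.

    `seen` is a set of absolute URLs and is updated in place.
    """
    fresh = []
    for href in hrefs:
        # urljoin turns a relative link into an absolute one ("derelativizing").
        # Dropping the fragment is an assumption here, not something the
        # quoted passage specifies.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in seen:
            seen.add(absolute)
            fresh.append(absolute)
    return fresh


# Hypothetical page and links, for illustration only.
seen_urls = set()
queue = extract_new_urls(
    "http://example.com/docs/index.html",
    [
        "page2.html",                          # relative link
        "../about.html",                       # relative link, parent directory
        "http://example.com/docs/page2.html",  # already absolute, same page as the first
        "page2.html#top",                      # same page again, with a fragment
    ],
    seen_urls,
)
print(queue)
```

The four links collapse to two queue entries because the comparison happens on the absolute form, which is the point of resolving before checking the "seen" set.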

msgraph

4:05 pm on Feb 1, 2001 (gmt 0)

"The content-seen test would be prohibitively expensive in both space and time if we saved the complete contents of every downloaded document. Instead, we maintain a data structure called the document fingerprint set that stores a 64-bit checksum of the contents of each downloaded document."

This should explain the 64-bit checksum, straight from DEC themselves:

Explains their fingerprint method [wing.rug.nl]
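
A rough Python sketch of the idea in the quote: keep only a 64-bit value per document instead of the document itself. The class name `DocumentFingerprintSet` borrows the paper's term, but everything else is an assumption. In particular, the checksum here is just the first 8 bytes of a SHA-256 digest, used as a stand-in, and is not the checksum function the authors describe.

```python
import hashlib


class DocumentFingerprintSet:
    """Stores one 64-bit fingerprint per document seen so far."""

    def __init__(self):
        self._fingerprints = set()

    @staticmethod
    def fingerprint(content: bytes) -> int:
        # Stand-in checksum (an assumption): first 8 bytes of SHA-256,
        # interpreted as an unsigned 64-bit integer.
        digest = hashlib.sha256(content).digest()
        return int.from_bytes(digest[:8], "big")

    def seen_before(self, content: bytes) -> bool:
        """Content-seen test: True if identical content was recorded earlier."""
        fp = self.fingerprint(content)
        if fp in self._fingerprints:
            return True
        self._fingerprints.add(fp)
        return False


fps = DocumentFingerprintSet()
first = fps.seen_before(b"<html>same page</html>")   # new content
second = fps.seen_before(b"<html>same page</html>")  # byte-identical copy
other = fps.seen_before(b"<html>other page</html>")  # different content
print(first, second, other)
```

Each document costs 8 bytes of fingerprint regardless of its size, which is the space saving the quote refers to. Note that a test like this only catches byte-identical copies.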

msgraph

6:56 pm on Feb 1, 2001 (gmt 0)

Sorry NFFC but I have to say thanks again.

Just finished looking through all of it and the references. That was one of the best documents I've read in a long time explaining certain spidering methods. Now I understand why they have grabbed almost every page on all my domains with various Mercator spiders.

I wonder if, when they finish tweaking Mercator, they will use all the data collected to power other search engines, similar to what Inktomi does. Or will they keep it for themselves? Probably the latter.

tedster

8:00 pm on Feb 1, 2001 (gmt 0)

Very interesting, especially the fingerprinting checksum scheme to avoid duplicates.

I noted that, even with this fingerprint technique, Mercator still downloaded 8.5% duplicates. I'm not clear where the loophole is. Did anyone catch that?

NFFC

8:27 am on Feb 2, 2001 (gmt 0)

>still downloaded 8.5% duplicates

They say they struggle in two particular areas:
Alternative paths on the same host;
Replication across different hosts;

They still download these types of duplicates, but after running the content-seen test they process only the first copy. This means that the URLs in the duplicate copies are not followed, as they are assumed to lead to duplicate content.
On a practical level this could have serious implications for many "valid" sites. Imagine a large, geographically spread company with hosted sites in many different countries/domains, all using the same initial "corporate" index page, which leads to localised content within the differing sites. I have seen this with a recent client with over 150 regional branches/domains: their "local" domains were effectively banned from many search engines.
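
The scenario above can be reproduced with a toy crawl loop in Python. This is only a sketch under stated assumptions: the `PAGES` dictionary, the example.com hostnames, and the `crawl` function are hypothetical, and the fingerprint is a truncated SHA-256 stand-in rather than any real crawler's checksum.

```python
import hashlib

# Hypothetical in-memory "web": URL -> (page body, outgoing links).
# Both index pages share the same body but link to different local content.
PAGES = {
    "http://www.example.com/": ("Welcome to Corp", ["http://www.example.com/products"]),
    "http://www.example.com/products": ("Global products", []),
    "http://fr.example.com/": ("Welcome to Corp", ["http://fr.example.com/produits"]),
    "http://fr.example.com/produits": ("Produits locaux", []),
}


def crawl(seeds):
    """Breadth-first crawl that skips link extraction for duplicate content."""
    frontier = list(seeds)
    seen_urls = set(seeds)
    seen_content = set()
    processed = []
    while frontier:
        url = frontier.pop(0)
        body, links = PAGES[url]  # the "download" still happens
        fp = hashlib.sha256(body.encode()).digest()[:8]  # stand-in 64-bit checksum
        if fp in seen_content:
            continue  # duplicate content: its links are never followed
        seen_content.add(fp)
        processed.append(url)
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return processed


processed = crawl(["http://www.example.com/", "http://fr.example.com/"])
print(processed)
```

In this toy run the French index page is fetched but discarded as a duplicate, so `http://fr.example.com/produits` is never queued even though its content is unique.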