


Mercator: A Scalable, Extensible Web Crawler

Allan Heydon and Marc Najork

     
2:30 pm on Feb 1, 2001 (gmt 0) - NFFC (Senior Member)


"This paper describes Mercator, a scalable, extensible web crawler written entirely in Java. Scalable web crawlers are an important component of many web services, but their design is not well-documented in the literature."

Paper here [research.compaq.com]

2:34 pm on Feb 1, 2001 (gmt 0) - Senior Member


Thanks NFFC!

I have been searching all over the world for some kind of info on Mercator and now it's good to finally see something.

The crunched version of Google's crawling methods is a perfect addition as well.

2:47 pm on Feb 1, 2001 (gmt 0) - NFFC (Senior Member)


There are also a few published research papers regarding the use of Mercator:

Measuring Index Quality Using Random Walks on the Web

Monika R. Henzinger, Allan Heydon, Michael Mitzenmacher, and Marc Najork

"Recent research has studied how to measure the size of a search engine, in terms of the number of pages indexed. In this paper, we consider a different measure for search engines, namely the quality of the pages in a search engine index. We provide a simple, effective algorithm for approximating the quality of an index by performing a random walk on the Web, and we use this methodology to compare the index quality of several major search engines."

Paper here [research.compaq.com]
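
As a very rough illustration of the random-walk idea in that abstract, here is a sketch in Java (the language Mercator itself is written in). It is not the paper's actual algorithm; outLinks() and isIndexed() are made-up placeholders, and the real walk and quality weighting are more involved. The idea: follow random out-links, sample some of the visited pages, and score an index by the fraction of sampled pages it contains.

```java
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Rough sketch of estimating index quality via a random walk on the web
// graph: follow random out-links, sample visited pages, and report the
// fraction of sampled pages that a given search engine has indexed.
// outLinks() and isIndexed() are placeholders, not APIs from the paper.
public class RandomWalkSketch {

    private final Random random = new Random();

    public double estimateIndexQuality(String seedUrl, int steps, SearchIndex index) {
        String current = seedUrl;
        int sampled = 0;
        int indexed = 0;
        for (int i = 0; i < steps; i++) {
            List<String> links = outLinks(current);
            if (links.isEmpty()) {
                current = seedUrl;              // dead end: restart the walk
                continue;
            }
            current = links.get(random.nextInt(links.size()));
            if (i % 100 == 0) {                 // sample every 100th step
                sampled++;
                if (index.isIndexed(current)) {
                    indexed++;
                }
            }
        }
        return sampled == 0 ? 0.0 : (double) indexed / sampled;
    }

    // Placeholder: fetch the page and return the URLs it links to.
    private List<String> outLinks(String url) {
        return Collections.emptyList();
    }

    // Placeholder for "does search engine X have this URL in its index?".
    interface SearchIndex {
        boolean isIndexed(String url);
    }
}
```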

2:49 pm on Feb 1, 2001 (gmt 0) - Senior Member


I can see this topic reaching a level of over 1000 posts by the end of today. 700 of them from me.

"For each of the extracted links, ensure that it is an absolute URL (derelativizing it if necessary), and add it to the list of URLs to download, provided it has not been encountered before."

I always figured absolute URLs were the way to go.
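
For anyone curious what that step looks like in code, here is a minimal sketch in Java (the language Mercator is written in). The class and method names are my own, not Mercator's; java.net.URL takes care of resolving a relative link against the page it was found on.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

// Sketch of the step quoted above: derelativize each extracted link and
// add it to the download queue only if it has not been seen before.
public class LinkExtractorSketch {

    private final Set<String> seenUrls = new HashSet<>();
    private final Queue<String> frontier = new LinkedList<>();  // URLs still to download

    // base = URL of the page the link was found on,
    // link = the (possibly relative) href extracted from that page.
    public void addLink(URL base, String link) {
        try {
            // java.net.URL resolves a relative reference against the base URL.
            URL absolute = new URL(base, link);
            String normalized = absolute.toExternalForm();

            // Set.add returns false if the URL was already present,
            // so each URL is enqueued at most once.
            if (seenUrls.add(normalized)) {
                frontier.add(normalized);
            }
        } catch (MalformedURLException e) {
            // Skip links that cannot be parsed as URLs.
        }
    }
}
```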

4:05 pm on Feb 1, 2001 (gmt 0) - Senior Member


"The content-seen test would be prohibitively expensive in both space and time if we saved the complete contents of every downloaded document. Instead, we maintain a data structure called the document fingerprint set that stores a 64-bit checksum of the contents of each downloaded document."

This should explain the 64-bit checksum, straight from DEC themselves:

Explains their fingerprint method [wing.rug.nl]
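
As a rough illustration of that content-seen test, here is a self-contained sketch in Java: keep a set of 64-bit fingerprints of downloaded documents and treat a document as a duplicate if its fingerprint is already in the set. The actual fingerprint scheme is the one described at the link above; the 64-bit FNV-1a hash below is only a stand-in so the example runs on its own.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Sketch of the document fingerprint set: store a 64-bit checksum of each
// downloaded document and report whether a new document's checksum has
// already been seen. FNV-1a is used here as a stand-in 64-bit hash; it is
// not the fingerprint scheme Mercator actually uses.
public class ContentSeenSketch {

    private final Set<Long> documentFingerprints = new HashSet<>();

    // Returns true if a document with identical content was seen before.
    public boolean seenBefore(String documentBody) {
        long fp = fnv1a64(documentBody.getBytes(StandardCharsets.UTF_8));
        // Set.add returns false if the fingerprint was already present.
        return !documentFingerprints.add(fp);
    }

    // 64-bit FNV-1a hash of a byte array.
    private static long fnv1a64(byte[] data) {
        long hash = 0xcbf29ce484222325L;       // FNV offset basis
        for (byte b : data) {
            hash ^= (b & 0xffL);
            hash *= 0x100000001b3L;            // FNV prime
        }
        return hash;
    }
}
```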

6:56 pm on Feb 1, 2001 (gmt 0) - Senior Member


Sorry NFFC but I have to say thanks again.

Just finished looking through all of it and the references. That was one of the best documents explaining spidering methods that I've read in a long time. Now I understand why they have grabbed almost every page on all my domains with various Mercator spiders.

I wonder whether, once they finish tweaking Mercator, they will use all the data collected to power other search engines, similar to what Inktomi does, or keep it for themselves. Probably the latter.

8:00 pm on Feb 1, 2001 (gmt 0) - tedster (Senior Member)


Very interesting, especially the fingerprinting checksum scheme to avoid duplicates.

I noted that, even with this fingerprint technique, Mercator still downloaded 8.5% duplicates. I'm not clear where the loophole is. Did anyone catch that?

8:27 am on Feb 2, 2001 (gmt 0) - NFFC (Senior Member)


>still downloaded 8.5% duplicates

They say they struggle in two particular areas:
- alternative paths on the same host;
- replication across different hosts.

They still download these types of duplicates, but after running the content-seen test they process only the first copy. This means that the URLs from the duplicate documents are not followed, as they are assumed to lead to duplicate content.
On a practical level this could have serious implications for many "valid" sites. Imagine a large, geographically spread company with hosted sites in many different countries/domains, all using the same initial "corporate" index page that leads to localised content within the different sites. I have seen this with a recent client with over 150 regional branches/domains; their "local" domains were effectively banned from many search engines.
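
To make that concrete, here is a minimal sketch of the crawl-loop decision being described, assuming the simple content-seen test sketched earlier in the thread (download() and extractLinks() are made-up placeholders, not Mercator APIs): links are extracted only from the first copy of identical content, so a second host serving the same "corporate" index page never has its local links followed.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the behaviour described above: a document whose content has
// already been seen is treated as a duplicate and its links are NOT
// extracted, so pages reachable only through the duplicate copy are never
// crawled. download() and extractLinks() are placeholders, not Mercator APIs.
public class DuplicateSkipSketch {

    // Stand-in for the document fingerprint set (Mercator stores 64-bit
    // checksums rather than full document bodies).
    private final Set<String> seenContent = new HashSet<>();

    public void process(String url) {
        String body = download(url);
        if (!seenContent.add(body)) {
            // Duplicate content: the first copy was already processed,
            // so this copy's links are not followed.
            return;
        }
        for (String link : extractLinks(body)) {
            // ... derelativize and enqueue the link, as in the earlier sketch ...
        }
    }

    private String download(String url) {            // placeholder fetch
        return "";
    }

    private List<String> extractLinks(String body) { // placeholder link parser
        return Collections.emptyList();
    }
}
```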