
SEM Research Topics Forum

    
Mercator: A Scalable, Extensible Web Crawler
Allan Heydon and Marc Najork
NFFC
Msg#: 38 posted 2:30 pm on Feb 1, 2001 (gmt 0)

"This paper describes Mercator, a scalable, extensible web crawler written entirely in Java. Scalable web crawlers are an important component of many web services, but their design is not well-documented in the literature."

Paper here [research.compaq.com]

 

msgraph
Msg#: 38 posted 2:34 pm on Feb 1, 2001 (gmt 0)

Thanks NFFC!

I have been searching all over the world for some kind of info on Mercator and now it's good to finally see something.

The crunched version of Google's crawling methods is a perfect addition as well.

NFFC
Msg#: 38 posted 2:47 pm on Feb 1, 2001 (gmt 0)

There are also a few published research papers regarding the use of Mercator:

Measuring Index Quality Using Random Walks on the Web

Monika R. Henzinger, Allan Heydon, Michael Mitzenmacher, and Marc Najork

"Recent research has studied how to measure the size of a search engine, in terms of the number of pages indexed. In this paper, we consider a different measure for search engines, namely the quality of the pages in a search engine index. We provide a simple, effective algorithm for approximating the quality of an index by performing a random walk on the Web, and we use this methodology to compare the index quality of several major search engines."

Paper here [research.compaq.com]
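To make the abstract concrete, here is a rough sketch (in Java, since that is what the thread's crawler is written in) of how a random-walk quality estimate could work. This is only a reading of the abstract, not the paper's actual algorithm; the outLinks and isIndexed hooks are placeholders for a fetcher/parser and an index-membership check that you would have to supply.

import java.util.List;
import java.util.Random;
import java.util.function.Function;
import java.util.function.Predicate;

// Conceptual sketch only: follow random out-links (with occasional resets,
// like PageRank's "random surfer"), and score an engine by the fraction of
// visited pages it has indexed.
public class IndexQualitySketch {
    private final Random rng = new Random();

    public double estimateQuality(String seedUrl,
                                  Function<String, List<String>> outLinks, // placeholder: fetch a page, return its links
                                  Predicate<String> isIndexed,             // placeholder: is this URL in the engine's index?
                                  int steps,
                                  double resetProbability) {
        String current = seedUrl;
        int indexed = 0;
        for (int i = 0; i < steps; i++) {
            if (isIndexed.test(current)) {
                indexed++;
            }
            List<String> links = outLinks.apply(current);
            if (links.isEmpty() || rng.nextDouble() < resetProbability) {
                current = seedUrl;                              // reset, simplified here to a single seed page
            } else {
                current = links.get(rng.nextInt(links.size())); // jump to a random out-link
            }
        }
        return (double) indexed / steps; // crude quality estimate over the sampled walk
    }
}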

msgraph
Msg#: 38 posted 2:49 pm on Feb 1, 2001 (gmt 0)

I can see this topic reaching a level of over 1000 posts by the end of today. 700 of them from me.

"For each of the extracted links, ensure that it is an absolute URL (derelativizing it if necessary), and add it to the list of URLs to download, provided it has not been encountered before."

I always figured absolute URLs were the way to go.
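In Java terms (the paper says Mercator is written entirely in Java), the derelativize-then-dedupe step could look something like the sketch below. This is illustrative only, not Mercator's code; the class and method names are made up.

import java.net.URI;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Illustrative sketch: resolve each extracted link against the page it was
// found on, and only enqueue URLs that have not been encountered before.
public class LinkExtractorSketch {
    private final Set<String> seenUrls = new HashSet<>();

    public void handleLink(String baseUrl, String href, Queue<String> frontier) {
        // Derelativize: "../foo.html" found on "http://example.com/a/b.html"
        // resolves to the absolute URL "http://example.com/foo.html".
        String absolute = URI.create(baseUrl).resolve(href).toString();

        // "...provided it has not been encountered before."
        if (seenUrls.add(absolute)) {
            frontier.add(absolute);
        }
    }
}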

msgraph
Msg#: 38 posted 4:05 pm on Feb 1, 2001 (gmt 0)

"The content-seen test would be prohibitively expensive in both space and time if we saved the complete contents of every downloaded document. Instead, we maintain a data structure called the document fingerprint set that stores a 64-bit checksum of the contents of each downloaded document."

This should explain the 64-bit checksum, straight from DEC themselves:

Explains their fingerprint method [wing.rug.nl]
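To make the content-seen test concrete, here is a minimal sketch of a document fingerprint set. The real scheme is the one described in the DEC note linked above; the 64-bit FNV-1a hash below is only a stand-in so the example is self-contained.

import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Sketch of a "document fingerprint set": store one 64-bit checksum per
// downloaded document instead of keeping the full contents.
public class DocumentFingerprintSet {
    private final Set<Long> fingerprints = new HashSet<>();

    /** Returns true if this content has been seen before (the content-seen test). */
    public boolean contentSeen(String documentBody) {
        long fp = fnv1a64(documentBody.getBytes(StandardCharsets.UTF_8));
        return !fingerprints.add(fp);
    }

    // 64-bit FNV-1a hash, used here purely as a stand-in fingerprint function.
    private static long fnv1a64(byte[] data) {
        long hash = 0xcbf29ce484222325L;    // FNV offset basis
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;         // FNV prime
        }
        return hash;
    }
}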

msgraph
Msg#: 38 posted 6:56 pm on Feb 1, 2001 (gmt 0)

Sorry NFFC but I have to say thanks again.

Just finished looking through all of it and the references. That was one of the best documents I've read in a long time explaining certain spidering methods. Now I understand why they have grabbed almost every page on all my domains with various Mercator spiders.

I wonder if, when they finish tweaking Mercator, they will use all the data collected to power other search engines, similar to what Inktomi does, or keep it for themselves. Probably the latter.

tedster
Msg#: 38 posted 8:00 pm on Feb 1, 2001 (gmt 0)

Very interesting, especially the fingerprinting checksum scheme to avoid duplicates.

I noted that, even with this fingerprint technique, Mercator still downloaded 8.5% duplicates. I'm not clear where the loophole is. Did anyone catch that?

NFFC
Msg#: 38 posted 8:27 am on Feb 2, 2001 (gmt 0)

>still downloaded 8.5% duplicates

They say they struggle in two particular areas:
Alternative paths on the same host;
Replication across different hosts;

They still download these types of duplicates, but after running the content-seen test they discard all but the first copy. This means that the URLs extracted from those documents are not followed, as they are assumed to lead to duplicate content.
On a practical level this could have serious implications for many "valid" sites. Imagine a large, geographically spread company with hosted sites in many different countries/domains, all using the same initial "corporate" index page that leads to localised content within the different sites. I saw this with a recent client with over 150 regional branches/domains: their "local" domains were effectively banned from many search engines.
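To illustrate the mechanism described above, here is an assumed sketch (not Mercator's code, and it reuses the DocumentFingerprintSet sketch from earlier in the thread): the duplicate copy is still fetched, but once its fingerprint matches something already seen, no links are extracted from it, so everything behind the shared index page is never reached.

import java.net.URI;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of how a content-seen test gates link extraction: duplicates are
// still downloaded, but their links are never followed.
public class ContentSeenGateSketch {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    private final DocumentFingerprintSet seen = new DocumentFingerprintSet();
    private final Queue<String> frontier = new ArrayDeque<>();

    public void process(String pageUrl, String html) {
        if (seen.contentSeen(html)) {
            return; // identical body already processed: its links are not followed
        }
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            frontier.add(URI.create(pageUrl).resolve(m.group(1)).toString());
        }
    }
}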
