Efficient Crawling Through URL Ordering

Junghoo Cho, Hector Garcia-Molina, Lawrence Page

NFFC

8:31 am on Jan 28, 2001 (gmt 0)

"In this paper we study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first. Obtaining important pages rapidly can be very useful when a crawler cannot visit the entire Web in a reasonable amount of time. We define several importance metrics, ordering schemes, and performance evaluation measures for this problem. We also experimentally evaluate the ordering schemes on the Stanford University Web. Our results show that a crawler with a good ordering scheme can obtain important pages significantly faster than one without."

A very interesting paper that outlines a system for crawling the web using a version of PageRank. It offers some worthwhile insights into the importance of off-page criteria, "a function of its location, not of its contents", and their possible uses in a ranking system.

[www-db.stanford.edu...]

NFFC

10:39 am on Mar 4, 2001 (gmt 0)

Efficient Crawling Through URL Ordering, or: how do I get those spiders to visit?

The paper outlines many of the methods that a modern search engine may use when deciding which pages to crawl.

The basics:

A crawler/spider/bot is a program that retrieves pages from the web, most commonly for inclusion in a search engine's index. Put simply, a crawler starts off with an initial URL, retrieves the page and extracts any URLs from it. Those URLs are added to a list/queue, the URLs in that queue are crawled in turn, and so on.
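To make the loop above concrete, here is a minimal sketch in Python. It is only an illustration: `TOY_WEB`, `crawl` and `fetch_links` are made-up names, and the "web" is a small in-memory dictionary rather than real HTTP fetching.

```python
from collections import deque

# Hypothetical toy web: page URL -> URLs linked from that page.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def crawl(seed, fetch_links):
    """Visit pages in first-in, first-out order, starting from the seed URL."""
    queue = deque([seed])   # URLs waiting to be crawled
    seen = {seed}           # URLs already queued, so nothing is queued twice
    visited = []            # URLs in the order they were crawled
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):   # "retrieve the page and extract URLs"
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("http://example.com/", lambda url: TOY_WEB.get(url, []))
for url in order:
    print(url)
```

A plain queue like this treats every URL as equally important; the ordering question the paper studies is what to use in place of that first-in, first-out rule.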

In an ideal world the crawler would simply retrieve every web page in the queue, but because resources are limited [bandwidth and storage cost money!] it has to choose which pages to retrieve. Additionally, with the rapid growth of the web, there may simply not be enough time to crawl all the pages in the queue.
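A rough sketch of what "making choices" could look like: the crawler gets a fixed page budget and always picks the known URL with the most links seen pointing at it so far. This is in the spirit of a backlink-count ordering, not the paper's exact algorithm, and `TOY_WEB`, `crawl_with_budget` and `budget` are hypothetical names for illustration.

```python
# Hypothetical toy web: page URL -> URLs linked from that page.
TOY_WEB = {
    "http://example.com/": [
        "http://example.com/news",
        "http://example.com/about",
        "http://example.com/products",
    ],
    "http://example.com/news": ["http://example.com/products"],
    "http://example.com/about": [
        "http://example.com/products",
        "http://example.com/contact",
    ],
    "http://example.com/products": ["http://example.com/"],
    "http://example.com/contact": [],
}


def crawl_with_budget(seed, fetch_links, budget):
    """Crawl at most `budget` pages, most-linked-to known URL first."""
    inlinks = {seed: 0}   # URL -> links seen so far pointing at it
    done = set()          # URLs already crawled
    visited = []          # crawl order
    while len(visited) < budget:
        frontier = [u for u in inlinks if u not in done]
        if not frontier:
            break
        # max() keeps the first of several equal candidates, and the dict
        # preserves discovery order, so ties go to the earliest-seen URL.
        url = max(frontier, key=lambda u: inlinks[u])
        done.add(url)
        visited.append(url)
        for link in fetch_links(url):
            inlinks[link] = inlinks.get(link, 0) + 1
    return visited


visited = crawl_with_budget("http://example.com/", lambda u: TOY_WEB.get(u, []), budget=3)
for url in visited:
    print(url)
```

With only three fetches available, the products page is crawled ahead of the about page because two links to it have been seen by then, which is the kind of effect the incoming-links advice in this thread relies on.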

Factors that may help in getting your pages crawled:

Probably the simplest way to ensure that your pages are crawled is to have as many incoming links as possible, both to the index page and to subpages. Taking Google as an example, they claim that they will not crawl your site unless it has at least one incoming link. If you only have one link, make sure it points at the index page. In my experience they will follow links to subpages and list them, but the absence of any links to the index page will prevent a full crawl.

The "quality" of the links pointing to your site is also a factor. A listing at ODP/Yahoo etc. is almost a guarantee that your site will be crawled.

There are other factors that may come into play, which the paper refers to as Location Metrics. Roughly, these are factors that can be seen simply by looking at the URL. For example, some SEs may consider .com domains to be more important than .tv domains, index pages may be prioritized, or URLs with fewer slashes may be considered more worthy of a visit.
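As a sketch of how a location metric might be computed from nothing but the URL, here is a small Python example. The weights, the list of index filenames and the function name `location_score` are all invented for illustration; no search engine is known to use these particular numbers.

```python
from urllib.parse import urlsplit

# Invented weights, purely illustrative.
TLD_POINTS = {"com": 10, "tv": 5}
DEFAULT_TLD_POINTS = 7
INDEX_NAMES = ("", "index.html", "index.htm")


def location_score(url):
    """Score a URL using only its location: TLD, directory depth, index page."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    tld = host.rsplit(".", 1)[-1]
    score = TLD_POINTS.get(tld, DEFAULT_TLD_POINTS)

    path = parts.path or "/"
    depth = max(path.count("/") - 1, 0)   # directories below the site root
    score -= depth                        # fewer slashes -> higher score

    last_segment = path.rsplit("/", 1)[-1]
    if last_segment in INDEX_NAMES:       # index pages get a small bonus
        score += 2
    return score


urls = [
    "http://example.tv/index.html",
    "http://example.com/products/widgets/blue.html",
    "http://example.com/",
]
for url in sorted(urls, key=location_score, reverse=True):
    print(location_score(url), url)
```

Because the score needs no page fetch, a crawler could apply something like it to URLs it has only just discovered, before any link-based signal is available.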

With arguably the two most important SEs [Google, Fast] regularly deep crawling, getting both the structure of your site and your external link program right is as important as optimising individual pages.

2_much

6:06 am on Mar 5, 2001 (gmt 0)

Great summary NFFC, this is very useful info...

Once again this depicts the importance of incorporating "links" into our optimization efforts...

At the end of Google week most of us agreed that Google is currently the leader in SE technology...and this paper was co-authored by one of the creators of this engine...so it's essential to take notice of this kind of info.

Thanks NFFC!!!