
SEM Research Topics Forum

    
Efficient Crawling Through URL Ordering
Junghoo Cho, Hector Garcia-Molina, Lawrence Page
NFFC
msg:816685
 8:31 am on Jan 28, 2001 (gmt 0)

"In this paper we study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first. Obtaining important pages rapidly can be very useful when a crawler cannot visit the entire Web in a reasonable amount of time. We define several importance metrics, ordering schemes, and performance evaluation measures for this problem. We also experimentally evaluate the ordering schemes on the Stanford University Web. Our results show that a crawler with a good ordering scheme can obtain important pages significantly faster than one without."

A very interesting paper which outlines a system for crawling the web using a version of PageRank. It provides some worthwhile insights into the importance of off-page criteria, "a function of its location, not of its contents", and their possible uses in a ranking system.

[www-db.stanford.edu...]

 

NFFC
msg:816686
 10:39 am on Mar 4, 2001 (gmt 0)

Efficient Crawling Through URL Ordering, or: How do I get those spiders to visit!

The paper outlines many of the methods a modern-day search engine may use when deciding which pages to crawl.

The basics:

A crawler/spider/bot is a program that retrieves pages from the web, most commonly for inclusion in a search engine's index. Put simply, a crawler starts off with an initial URL, retrieves the page and extracts any URLs from it. Those URLs are added to a list/queue, the URLs from that list are crawled in turn, and so on.
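
For illustration only, here is a rough Python sketch of that fetch-and-queue loop. The LinkExtractor class, the crawl function and the 50-page limit are my own inventions for the example, not anything taken from the paper or from any particular engine:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=50):
    """Fetch a page, queue the URLs it links to, repeat (breadth-first)."""
    queue = deque([seed_url])
    seen = {seed_url}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                    # skip pages that fail to download
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)      # never queue the same URL twice
                queue.append(absolute)
    return seen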

In an ideal world the crawler would simply retrieve every web page in the queue, but because of limited resources [bandwidth and storage cost money!] it has to make choices about which pages to retrieve. Additionally, with the rapid growth of the web, there may simply not be enough time to crawl all the pages in the queue.
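
The paper's answer is to order that queue by an estimated importance metric, and one of its simpler ordering schemes is backlink count: fetch first the URL that the most already-crawled pages link to. A minimal sketch of that idea, with the function names and frontier structure being my own assumptions:

import heapq
from collections import defaultdict

# Backlink counts estimated from the pages crawled so far, as in the
# paper's backlink-count ordering scheme.
backlinks = defaultdict(int)

def record_links(extracted_urls, frontier):
    """Credit each URL found on a crawled page with one more backlink
    and add any unseen URLs to the crawl frontier (a plain list here)."""
    for url in extracted_urls:
        backlinks[url] += 1
        if url not in frontier:
            frontier.append(url)

def next_to_crawl(frontier):
    """Pop the frontier URL with the most known incoming links."""
    # heapq is a min-heap, so negate the count to get the largest first
    heap = [(-backlinks[url], url) for url in frontier]
    heapq.heapify(heap)
    best = heapq.heappop(heap)[1]
    frontier.remove(best)
    return best

A real crawler would keep a proper priority queue rather than rebuilding the heap on every pop, but the ordering idea is the same.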

Factors that may help in getting your pages crawled:

Probably the simplest way to ensure that your pages are crawled is to have as many incoming links as possible, both to the index page and to subpages. Taking Google as an example, they claim that they will not crawl your site unless you have at least one incoming link. If you only have one link, make sure it points at the index page; in my experience they will follow links to subpages and list them, but the absence of any links to the index page will prevent a full crawl.

The "quality" of the links pointing to your site is also a factor. A listing at ODP/Yahoo etc is almost a guarantee that your site will crawled.

There are other factors that may come into play, which the paper refers to as location metrics. Roughly, these are factors that can be seen simply by looking at the URL itself. For example, some SEs may consider .com domains more important than .tv domains, index pages may be prioritised, or URLs with fewer slashes may be considered more worthy of a visit.
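
As a toy illustration of a location metric, here is a scoring function that looks only at the URL text. The specific rules and weights are invented for the example, not taken from the paper or from any engine:

from urllib.parse import urlparse

def location_score(url):
    """Toy location metric: score a URL using nothing but its text."""
    parsed = urlparse(url)
    score = 0
    if parsed.netloc.endswith(".com"):
        score += 2                      # treat .com as more important
    score -= parsed.path.count("/")     # fewer slashes, shallower page
    if parsed.path in ("", "/", "/index.html"):
        score += 3                      # index pages prioritised
    return score

urls = ["http://example.com/", "http://example.tv/a/b/c/page.html"]
print(sorted(urls, key=location_score, reverse=True))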

With arguably the two most important SEs [Google, Fast] regularly deep crawling, getting both the structure of your site and your external link program right is as important as optimising individual pages.

2_much
msg:816687
 6:06 am on Mar 5, 2001 (gmt 0)

Great summary NFFC, this is very useful info...

Once again this underlines the importance of incorporating "links" into our optimization efforts...

At the end of Google week most of us agreed that Google is currently the leader in SE technology... and this paper was co-authored by one of the creators of that engine... so it's essential to take notice of this kind of info.

Thanks NFFC!!!
