Very interesting paper which outlines a system of crawling the web using a version of PageRank. Some worthwhile insights are provided into the importance of off-page criteria, "a function of its location, not of its contents", and their possible uses in a ranking system.
[www-db.stanford.edu...]
The paper outlines many of the methods that the modern day search engine may use when deciding which pages to crawl.
The basics:
A crawler/spider/bot is a program that retrieves pages from the web, most commonly for inclusion in a search engine's index. Put simply, a crawler starts off with an initial URL, retrieves the page and extracts any URLs from it. The URLs are then added to a list/queue; the URLs from this list are then crawled in turn, and so on.
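To make the loop concrete, here is a minimal sketch in Python. It is not how any particular search engine does it: the PAGES dictionary is a made-up stand-in for real HTTP fetches, the example.* URLs are hypothetical, and the regex link extraction is deliberately crude.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> page HTML. Stands in for real fetches.
PAGES = {
    "http://a.example/": '<a href="http://a.example/one">1</a> <a href="http://b.example/">b</a>',
    "http://a.example/one": '<a href="http://a.example/">home</a>',
    "http://b.example/": '<a href="http://a.example/one">1</a> <a href="http://c.example/">c</a>',
}

HREF = re.compile(r'href="([^"]+)"')


def extract_urls(html):
    """Return every href value found in the page, in document order."""
    return HREF.findall(html)


def crawl(seed, fetch=PAGES.get):
    """Start from one URL, retrieve it, queue its links, repeat until the queue is empty."""
    queue = deque([seed])
    seen = {seed}          # URLs already queued, so nothing is queued twice
    order = []             # URLs actually retrieved, in crawl order
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue       # page could not be retrieved
        order.append(url)
        for link in extract_urls(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


print(crawl("http://a.example/"))
```

The seen set is what stops the crawler going round in circles when pages link back to each other.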
In an ideal world the crawler would simply retrieve every web page in the queue, but because resources are limited [bandwidth and storage cost money!] it has to choose which pages to retrieve. Additionally, with the rapid growth of the web, there may simply not be enough time to crawl all the pages in the queue.
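A rough sketch of what "making choices" could look like when the crawler can only afford a fixed number of fetches. As an assumption for illustration, it orders the queue by the number of links seen so far pointing at each URL, which is a much simpler stand-in for the PageRank-style ordering the paper discusses. The LINKS graph and the budget values are made up.

```python
# Hypothetical link graph: URL -> URLs it links to.
LINKS = {
    "home": ["a", "b", "c"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": [],
}


def crawl_with_budget(seed, links, budget):
    """Crawl at most `budget` pages, always taking the most linked-to URL next."""
    backlinks = {seed: 0}  # links seen so far pointing at each known URL
    frontier = {seed}      # discovered but not yet crawled
    crawled = []
    while frontier and len(crawled) < budget:
        # Most known backlinks first; ties broken alphabetically for repeatability.
        url = min(frontier, key=lambda u: (-backlinks[u], u))
        frontier.remove(url)
        crawled.append(url)
        for target in links.get(url, []):
            backlinks[target] = backlinks.get(target, 0) + 1
            if target not in crawled:
                frontier.add(target)
    return crawled


print(crawl_with_budget("home", LINKS, 3))
```

With a budget of three, page "b" never gets fetched even though it was discovered first alongside "a" and "c", because "c" had picked up a second incoming link by the time the crawler had to choose.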
Factors that may help in getting your pages crawled:
Probably the simplest way to ensure that your pages are crawled is to have as many incoming links as possible, both to the index page and to subpages. Taking Google as an example, they claim that they will not crawl your site unless you have at least one incoming link. If you only have one link, make sure it points at the index page. In my experience they will follow links to subpages and list them, but the absence of any links to the index page will prevent a full crawl.
The "quality" of the links pointing to your site is also a factor. A listing at ODP/Yahoo etc. is almost a guarantee that your site will be crawled.
There are other factors that may come into play, which the paper refers to as Location Metrics. Roughly, these are factors that can be seen simply by looking at the URL. For example, some SEs may consider .com domains to be more important than .tv domains, index pages may be prioritized, or URLs with fewer slashes may be considered more worthy of a visit.
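As a sketch of the idea, a score like this can be worked out from the URL alone, without fetching the page. The weights, the TLD_BONUS table and the INDEX_NAMES set are assumptions made up for this example; nothing here is taken from the paper or from any real engine.

```python
from urllib.parse import urlsplit

# Assumed weights, for illustration only.
TLD_BONUS = {"com": 2, "org": 1}
INDEX_NAMES = {"index.html", "index.htm"}


def location_score(url):
    """Score a URL from its text alone: TLD, index page or not, and path depth."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    tld = host.rsplit(".", 1)[-1]
    path = parts.path
    segments = [s for s in path.split("/") if s]
    is_index = path == "" or path.endswith("/")
    if segments and segments[-1] in INDEX_NAMES:
        is_index = True
        segments.pop()     # do not count the index filename as extra depth
    # Favoured TLD and index pages add to the score; each path level subtracts.
    return TLD_BONUS.get(tld, 0) + (1 if is_index else 0) - len(segments)


urls = [
    "http://example.com/a/b/page.html",
    "http://example.tv/",
    "http://example.com/",
]
for u in sorted(urls, key=location_score, reverse=True):
    print(location_score(u), u)
```

Sorting the crawl queue by a score like this would send the crawler to short, index-like URLs before deeply nested ones.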
With arguably the two most important SEs [Google, Fast] regularly deep crawling, getting both the structure of your site and your external link program right is as important as optimising individual pages.
Once again this shows the importance of incorporating "links" into our optimization efforts...
At the end of Google week most of us agreed that Google is currently the leader in SE technology...and this paper was co-authored by the creator of this engine...so it's essential to take notice of this kind of info.
Thanks NFFC!!!