The study says: "The Google search engine attempts to maintain a fresh index by crawling over 3 billion pages once a month, with more frequent crawls of hand-selected sites that are known to change more often. In addition, it offers access to cached copies of pages, to obviate problems arising from some of the crawled URLs being out-of-date or having disappeared entirely."
"To improve the freshness of results returned by search engines and allow them to spend more of their efforts crawling and indexing pages which have changed, it is interesting and important to answer some questions about the dynamic nature of the web. How fast does the web change? Does most of the content remain unchanged once it has been authored, or are the documents being continuously updated? Do pages change a little or a lot? Is the extent of change correlated to any other property of the page? Do pages change and then change back? How consistent are mirrors and near-mirrors of pages?"
Any thoughts ...
I've often wondered whether Google actually tracks how often a page changes and bases freshbot frequency on that, or whether freshbot visits simply become more frequent the higher the PR.
Since my sites cover a history topic, the main changes are new articles and small corrections or additions. My sites still seem to get visited every day or two by Google. I do go into articles and add information or interesting related links as I come across them. The fact that an article can be fluid like this is unique to the Internet. I have even somewhat changed my view on one controversial topic in my field. Where else could you change an article just because you changed your mind?
I have noticed that over the last year, Google's top ten results have included at least 4 or 5 Web sites whose content hasn't changed in two to three years.
In some topics old information may still be the best. Are you seeing a lot of pages in the top 10 that are really outdated in their information?
I think we sometimes imagine Google is able to do more than it actually can in terms of these things.
"In order to shed light on the issue, we turned to our sampled documents, selecting documents from Germany with high change rate. Careful examination of the first few pages revealed more than we cared to see: of the first half dozen pages we examined, all but one contained disjoint, but perfectly grammatical phrases of an adult nature together with a redirection to an adult web site. It soon became clear that the phrases were automatically generated on the fly, for the purpose of "stuffing" search engines such as Google with topical keywords surrounded by sensible-looking context, in order to draw visitors to the adult web site. Upon further investigation, we discovered that our data set contained 1.03 million URLs drawn from 116,654 hosts (4,745 of them being outside the .de domain), which all resolved to a single IP address. This machine is serving up over 15% of the .de URLs in our data set!
"We speculate that the purpose of using that many distinct host names as a front to a single server is to circumvent the politeness policies that limit the number of pages a web crawler will attempt to download from any given host in a given time interval, and also to trick link-based ranking algorithms such as PageRank into believing that links to other pages on apparently different hosts are non-nepotistic, thereby inflating the ranking of the pages in the clique."
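For anyone who wants to check their own crawl data for the pattern the study describes (lots of distinct host names all resolving to one IP address), here is a rough Python sketch. It is not the study's method, just one way to group URLs by resolved IP and report what share of the data set each IP serves. The function names, the `min_hosts` threshold, and the made-up DNS table are all assumptions for illustration; in real use you could pass something like `socket.gethostbyname` as the resolver.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_hosts_by_ip(urls, resolve):
    """Group URLs by the IP address their host name resolves to.

    `resolve` is any callable mapping a host name to an IP string
    (hypothetical hook; e.g. socket.gethostbyname in real use).
    """
    hosts_by_ip = defaultdict(set)   # ip -> distinct host names seen
    urls_by_ip = defaultdict(int)    # ip -> number of URLs served
    cache = {}                       # resolve each host name only once
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip malformed URLs with no host part
        if host not in cache:
            cache[host] = resolve(host)
        ip = cache[host]
        hosts_by_ip[ip].add(host)
        urls_by_ip[ip] += 1
    return hosts_by_ip, urls_by_ip


def suspicious_ips(urls, resolve, min_hosts=3):
    """Return (ip, host_count, url_count, share_of_urls) for every IP that
    fronts at least `min_hosts` distinct host names, largest first.

    `min_hosts` is an arbitrary illustrative threshold.
    """
    hosts_by_ip, urls_by_ip = group_hosts_by_ip(urls, resolve)
    total = sum(urls_by_ip.values())
    report = [
        (ip, len(hosts), urls_by_ip[ip], urls_by_ip[ip] / total)
        for ip, hosts in hosts_by_ip.items()
        if len(hosts) >= min_hosts
    ]
    return sorted(report, key=lambda row: row[2], reverse=True)


# Made-up DNS table and URLs, using documentation-reserved addresses.
fake_dns = {
    "a.example.de": "192.0.2.1",
    "b.example.de": "192.0.2.1",
    "c.example.com": "192.0.2.1",
    "news.example.org": "198.51.100.7",
}
sample_urls = [
    "http://a.example.de/1",
    "http://a.example.de/2",
    "http://b.example.de/",
    "http://c.example.com/",
    "http://news.example.org/",
]
for row in suspicious_ips(sample_urls, fake_dns.get):
    print(row)
```

Keep in mind that plenty of legitimate sites share an IP through ordinary virtual hosting, so a high host count on its own proves nothing. It only tells you which machines deserve a closer look.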