
PageTurner vs PageRank

The reason for Google's May update?

         

zafile

4:58 am on Jun 2, 2003 (gmt 0)



I found an interesting paper titled "A Large-Scale Study of the Evolution of Web Pages" at [research.microsoft.com...].

The study says: "The Google search engine attempts to maintain a fresh index by crawling over 3 billion pages once a month, with more frequent crawls of hand-selected sites that are known to change more often. In addition, it offers access to cached copies of pages, to obviate problems arising [from] some of the crawled URLs being out-of-date or having disappeared entirely."

"To improve the freshness of results returned by search engines and allow them to spend more of their efforts crawling and indexing pages which have changed, it is interesting and important to answer some questions about the dynamic nature of the web. How fast does the web change? Does most of the content remain unchanged once it has been authored, or are the documents being continuously updated? Do pages change a little or a lot? Is the extent of change correlated to any other property of the page? Do pages change and then change back? How consistent are mirrors and near-mirrors of pages?"

Any thoughts?

annej

5:35 am on Jun 2, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Interesting. I only skimmed it and read the summary, but it does make one think about how much change is true change and how much is either automated or just a small edit, like correcting the spelling of something.
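One rough way to tell a spelling fix from a real rewrite is to compare two snapshots of a page by their overlapping word sequences ("shingles"). The sketch below is only an illustration: the function names, the shingle size of 3, and the sample sentences are assumptions made for this example, and it is not presented as the method used in the paper or by Google.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Very short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(old, new, k=3):
    """Jaccard similarity of the shingle sets: 1.0 = identical, 0.0 = nothing shared."""
    a, b = shingles(old, k), shingles(new, k)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical page snapshots, invented for this example.
old = "the battle was fought in the spring of 1862 near the river"
typo_fix = "the battle was fought in the spring of 1862 near the rivers"
rewrite = "new research suggests the encounter took place a year later"

print(similarity(old, typo_fix))  # high score: only the final shingle differs
print(similarity(old, rewrite))   # no shared shingles at all
```

A one-word correction leaves most shingles intact, while a rewritten paragraph shares almost none, so a score like this would separate "a little" change from "a lot".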

I've often wondered whether Google actually tracks how often a page changes and bases freshbot frequency on that, or whether freshbot visits simply become more frequent the higher the PR.
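To make the first possibility concrete, here is a purely hypothetical revisit policy in which the crawl interval adapts to observed changes. Nothing here describes how Google actually schedules freshbot; the halve/double rule, the 1 to 30 day limits, and the name `next_interval` are all assumptions made for illustration.

```python
def next_interval(current_days, changed, min_days=1, max_days=30):
    """Hypothetical policy: revisit sooner after a change, later otherwise.

    The interval is halved when the page changed since the last visit and
    doubled when it did not, then clamped to [min_days, max_days].
    """
    proposed = current_days / 2 if changed else current_days * 2
    return max(min_days, min(max_days, proposed))


# Invented visit history: True means the page had changed on that visit.
interval = 8
for changed in [True, True, False, True]:
    interval = next_interval(interval, changed)

print(interval)
```

Under a policy like this, a frequently edited page would drift toward daily visits regardless of its PR, which is the alternative being wondered about here.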

Since my topic is history, the main changes are new articles and small corrections or additions. My sites still seem to get visited every day or two by Google. I do go into articles and add information or interesting related links as I come across them. The fact that an article can be fluid like this is unique to the Internet. I have even somewhat changed my view on one controversial topic in my field. Where else could you change an article just because you changed your mind?

zafile

5:57 am on Jun 2, 2003 (gmt 0)



I keep track of Google's Top Ten search results for a particular kind of query made up of "country" + "business" (for example: malaysia property).

I have noticed that over the last year, Google's Top Ten results have included at least 4 or 5 Web sites whose content hasn't changed in two to three years.

annej

4:54 pm on Jun 2, 2003 (gmt 0)




I don't think freshness (page changes) affects PR or keyword rankings in the serps at all, unless the freshbot finds new information that moves the page up for a given search phrase.

In some topics old information may still be the best. Are you seeing a lot of pages in the top 10 that are really outdated in their information?

I think we sometimes imagine Google can do more in this area than it actually can.

zafile

6:30 pm on Jun 2, 2003 (gmt 0)



The study found "that pages in .de, the German domain, exhibit a significantly higher rate and degree of change than those in any other domain. 27% of the pages we sampled from .de underwent a large or complete change every week, compared with 3% for the web as a whole. Even taking the fabled German industriousness into account, these numbers were hard to explain.

"In order to shed light on the issue, we turned to our sampled documents, selecting documents from Germany with high change rate. Careful examination of the first few pages revealed more than we cared to see: of the first half dozen pages we examined, all but one contained disjoint, but perfectly grammatical phrases of an adult nature together with a redirection to an adult web site. It soon became clear that the phrases were automatically generated on the fly, for the purpose of "stuffing" search engines such as Google with topical keywords surrounded by sensible-looking context, in order to draw visitors to the adult web site. Upon further investigation, we discovered that our data set contained 1.03 million URLs drawn from 116,654 hosts (4,745 of them being outside the .de domain), which all resolved to a single IP address. This machine is serving up over 15% of the .de URLs in our data set!

"We speculate that the purpose of using that many distinct host names as a front to a single server is to circumvent the politeness policies that limit the number of pages a web crawler will attempt to download from any given host in a given time interval, and also to trick link-based ranking algorithms such as PageRank into believing that links to other pages on apparently different hosts are non-nepotistic, thereby inflating the ranking of the pages in the clique."
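The pattern described in the quote, many distinct host names resolving to one IP address, can be looked for by grouping hosts by their resolved address. This is a minimal sketch, assuming a host-to-IP mapping has already been collected (for example from crawl logs). The host names, the addresses (taken from ranges reserved for documentation), the function names, and the threshold of 3 are all made up for illustration.

```python
from collections import defaultdict


def hosts_by_ip(host_to_ip):
    """Group host names by the IP address they resolve to."""
    groups = defaultdict(set)
    for host, ip in host_to_ip.items():
        groups[ip].add(host)
    return groups


def suspicious_ips(host_to_ip, min_hosts=3):
    """Return IPs serving at least min_hosts distinct host names."""
    return {
        ip: sorted(hosts)
        for ip, hosts in hosts_by_ip(host_to_ip).items()
        if len(hosts) >= min_hosts
    }


# Hypothetical sample data, not taken from the study.
sample = {
    "a.example.de": "192.0.2.10",
    "b.example.de": "192.0.2.10",
    "c.example.com": "192.0.2.10",
    "history.example.org": "198.51.100.7",
}

flagged = suspicious_ips(sample)
for ip, hosts in flagged.items():
    print(ip, len(hosts), hosts)
```

A high host count per IP is not proof of spam on its own, since shared hosting also puts many sites on one address, so a real check would need a much higher threshold and a look at the pages themselves.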