Note that I'm just talking about ONE iteration. I think the normal G routine is to run multiple iterations, with the number of iterations probably dictated by how many passes it takes until few PR changes occur in further iterations...
The reason I'm wondering is a comment by someone about Freshbot making its pages "sticky" in the dB until it needs to recalculate PR for a few "days". I always felt that it took a long, long time to calculate PR, and that this was the major reason for the monthly update vs a rolling one. If it were as quick as a couple of days, why wouldn't G just crawl continuously, and as soon as the last PR iterations were completed (meaning every few days), run the usual sanity/data-validation tests, flick the switch, and start over again? I assumed the reason was that a PR iteration took a week or more...
Amidst all this talk, I haven't seen anyone discussing some of the PR changes that have to happen to accommodate a rolling update. The thing with PR calculation is that you have to start fresh whenever you have new data ("new" meaning that some pages are removed too). So you can't simply begin your new iteration with your last PR numbers, because the PR that was granted by pages that are now gone has to be accounted for (downgraded) somehow in a rolling update. UNLESS G is keeping stats on exactly what gave each page its PR, NOTING changes when deep/freshbot grabs new data, AND MODIFYING PR ON THE FLY. But again, I can't imagine that to be the case, because you change a small piece of the PR pie and it's like dominoes... each change of data affects PR webwide...
So if you were G, how would you handle PR in a rolling update? The answer to that question really requires knowing how long one iteration takes. So I'm trying to get into Google's head but can't do it without knowing the answer to that question :)
Thanks.
I thought it took something like 100 iterations of estimating PR values before they started to converge on the real values.
There was a paper about the block algorithm, which reduces the CPU time. But there are also a number of other algorithms that improve the calculations.
So you can't simply begin your new iteration with your last PR number because the PR that was granted by pages that are now gone have to be accounted for (downgraded) somehow in a rolling update. ... But again, I can't imagine that to be the case, because you change a small piece of the PR pie and it's like dominoes...each change of data affects PR webwide...
There is no problem starting with the old PR data. Of course, adding new pages and removing old ones influences not only the pages directly connected, but also a large number of other ones. However, the final solution is independent of the initial values, and an estimate is better than nothing.
I think it should be possible to calculate an iteration in less than one day. Therefore, it would be possible to calculate PR on the fly. E.g. start with the old PR values, make three iterations and take these values as the new ones. Of course, these wouldn't be the 'final' values, because they are not necessarily stable. (The situation would strongly depend on the changes made in the neighbourhood of a page.) However, in my opinion it would improve the situation, since you would get a PR within two or three days (PR would be fresh). Even if the new PR isn't stable, it would be better than the stable solution from the deep crawl some weeks ago (i.e. it is better to have a PR which is between the old and the new one than to keep the old PR). Also, the situation would stabilise within a few more days.
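A minimal sketch of that warm start, assuming a toy three-page graph with made-up "old" PR values (d = 0.85, classic per-page formula — none of this is Google's actual code, just an illustration):

```python
# Rolling-update sketch: seed the power iteration with the previous
# PR vector and run only a few passes. Graph and starting values are
# made up for illustration.
DAMPING = 0.85

def pagerank_step(pr, out_links):
    """One pass of PR(A) = (1-d) + d * sum(PR(t)/C(t)) over all pages."""
    new_pr = {page: 1 - DAMPING for page in pr}
    for page, targets in out_links.items():
        if targets:
            share = DAMPING * pr[page] / len(targets)
            for t in targets:
                new_pr[t] += share
    return new_pr

links = {"A": ["B"], "B": ["A", "C"], "C": []}   # toy webgraph
pr = {"A": 1.2, "B": 1.4, "C": 0.6}              # "old" PR values

for _ in range(3):        # three iterations, as suggested above
    pr = pagerank_step(pr, links)
```

Starting from the old values, the estimates already drift toward the new equilibrium after a couple of passes, and a later run can simply pick up where this one left off.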
You may be also interested in this thread [webmasterworld.com]
So, whatever time it used to take to calculate PR, now it's shorter, for sure. This might help explain the change in the way that Google updates.
In this paper, we make the following simple observation: the convergence rates of the PageRank values of individual pages during application of the Power Method is nonuniform. That is, many pages converge quickly, with a few pages taking much longer to converge. Furthermore, the pages that converge slowly are generally those pages with high PageRank.
On an AMD Athlon 1533MHz machine with a 6-way RAID-5 disk volume and
2GB of main memory, each application of Algorithm 1 on the 80M page LARGEWEB
dataset takes roughly 10 minutes. Given that computing PageRank generally requires anywhere from 30-100 applications of Algorithm 1, depending on the desired error, the need for fast methods for graphs with billions of nodes is clear.
And that's only for a dataset of 80M pages...
Computing PageRank, the ranking algorithm behind the Google search engine, for a billion Web pages can take several days.
That's probably as close to an informed answer as you are likely to find, given that Google, Inc. is not likely to volunteer the information. By the way, one member of the Stanford team recently said that the "five times faster" buzz that has been all over the media is not right. He said that using their technique at Google would probably result in a 30 percent rate increase. That correction is at the end of this article [wired.com].
That 30 percent speedup would probably get eaten up by an expanding web within a year or two. I think PageRank's days are numbered.
Assuming that you scale the stopping criterion with the number of pages n, the answer is as follows (as far as I remember): for simple algorithms (such as the original or the block algorithm) the CPU time needed scales with n^2.
By the way, the concept of block algorithms isn't very new. One advantage is that you can easily parallelize these kinds of algorithms. However, there are faster algorithms. Also, one has to mention that Kamvar et al. compare their block algorithm with a simple (slow) algorithm. I expect that Google has implemented a more 'state of the art' one.
What *might* change is the number of iterations, but that really depends more on the shape of the webgraph than the size.
While PR can take up a good chunk of the calculation time, there are a lot of other calculations going on to build the index at the same time. PR has nothing to do with building the new index. PR is *only* a number that is associated with a page.
On a per page basis, indexing takes far more time than generating PR. The difference is that the indexing work can be done at the time that the page is crawled. PageRank (as originally defined) has to be done on a complete set of data.
I do think it might be possible to run a continuous update by doing what a previous poster mentioned. Use freshie to add or delete pages and run the PR algo on a snapshot. As soon as it is done, run it again on a new snapshot. It seems that this would be a workable way to run a continuous update of the index with PR updated every couple of weeks. That would just give us some mini-dances that are only PR related, not inclusion related.
I expect that when increasing the number of pages, the PR calculation time would increase linearly. The calculation is only done once per page per iteration. What *might* change is the number of iterations, but that really depends more on the shape of the webgraph than the size.
Even if the number of iterations is the same, the scaling is not necessarily linear. The reason is that during an iteration there are calculations which scale with n (i.e. are linear), while there are other ones which scale with n^2. For example, calculating the norm of a vector is linear, while the matrix * vector multiplication scales with n^2. Therefore, the whole process scales with n^2. The situation is a little more complicated because the corresponding matrix is a sparse one. However, this shouldn't change the situation, though it also depends on the implementation of the algorithm.
PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn))
is run once per page per iteration. The only variable in the size of this calculation is the number of incoming links.
Addition, subtraction and division are all very straight forward.
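Read literally, the per-page formula is just this — with hypothetical inbound pages and assumed PR values and out-link counts:

```python
# Evaluate PR(A) = (1-d) + d*(PR(t1)/C(t1) + ... + PR(tn)/C(tn))
# for one page, given made-up (PR(ti), C(ti)) pairs for its inlinks.
d = 0.85
inbound = [(4.0, 10), (2.5, 5), (0.3, 1)]  # (PR of linker, its out-link count)

pr_a = (1 - d) + d * sum(pr_t / c_t for pr_t, c_t in inbound)
# -> 0.15 + 0.85 * (0.4 + 0.5 + 0.3) = 1.17
```

As the poster says, the cost per page is driven only by the number of incoming links, which is why the per-iteration work looks linear from this angle.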
In fact, I see very few of the preprocessing steps as being anything other than linear. They could be done in a nonlinear fashion, but there really is little reason to when your goal is real world results instead of mathematical purism.
Of course I might be forgetting a step or two; I did not go over every step of setting up to run the calculations. If I'm missing something, please feel free to let me know what I missed.
You could rewrite that equation with a vector PR (n dimensional, where n is the number of pages) and some kind of (static) transition matrix M:
PR_(k+1) = (1-d) + d * M * PR_k    (k is the iteration number)
Thus you mainly have to calculate the product of a matrix M (the transition matrix) and a vector PR (the PR vector of the previous iteration). (And that is how you compute PR in practice - at least for the simplest algorithm.) Therefore, the time for one iteration shouldn't scale linearly, but with n^2.
As already said, the situation is more complicated, because M is a sparse matrix, i.e. most of the entries are zero. This could change the situation in principle, but it depends on the implementation of the algorithm. However, I think the scaling behaviour is not affected.
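A sketch of the sparse version on a made-up three-page graph: only the non-zero entries of M are stored, so one iteration touches one entry per link rather than n^2 matrix cells.

```python
# PR_(k+1) = (1-d) + d * M * PR_k with M kept sparse: row i lists
# (j, 1/C(j)) for every page j that links to page i. Toy graph assumed.
d = 0.85
M = {
    0: [(1, 0.5)],   # page 1 (two out-links) links to page 0
    1: [(0, 1.0)],   # page 0 (one out-link) links to page 1
    2: [(1, 0.5)],   # page 1 also links to page 2
}
pr = [1.0, 1.0, 1.0]

new_pr = [(1 - d) + d * sum(w * pr[j] for j, w in M.get(i, []))
          for i in range(len(pr))]
```

The inner sum only runs over the stored (non-zero) entries, which is exactly why the next poster's point about linear scaling in the number of links holds.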
One thing I don't understand... say it takes 10 mins for 80M pages; then 3,000M pages should be calculable in an hour or less using a small cluster.
What I would do then: have this cluster continuously recalculate PR, one iteration every hour. Removing a page would then slowly fade out its effect over the next 50 hours (50 iterations), i.e. about 2 days. So after removing a PR6 page, one day later it would still have the effect of a link from a PR3 page, and two days later the page would have the proper PR, as if the linking page were gone. Same for adding backlinks: one day later these backlinks would have half their final effect.
In fact, using extrapolation, as described in the Stanford papers (use CiteSeer), you could always have the "proper" estimated PR for all pages after just one iteration. Every hour it would get a bit closer, and using the new methods, only 10-15 iterations may be enough for a good PR.
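One of the acceleration tricks in those Stanford papers is Aitken's delta-squared extrapolation: from three successive iterates, estimate the limit each component is heading toward. A hedged sketch (toy one-component sequence, not the papers' full algorithm):

```python
# Aitken delta-squared: given iterates x0, x1, x2 of a sequence that
# converges roughly geometrically, estimate its limit componentwise.
def aitken(x0, x1, x2):
    est = []
    for a, b, c in zip(x0, x1, x2):
        denom = c - 2 * b + a
        # Fall back to the latest iterate when the formula is unstable.
        est.append(c if abs(denom) < 1e-12 else a - (b - a) ** 2 / denom)
    return est

# A sequence converging to 1.0 with ratio 1/2 is recovered exactly:
limit = aitken([0.0], [0.5], [0.75])   # -> [1.0]
```

Applied to three successive PR vectors, this kind of estimate is what lets you publish a "projected" PR long before the plain power iteration has fully converged.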
Since PR is inherently approximate anyway, I think this would be VERY workable, with the additional "benefit" of making changes gradual and SERPs less shaky.
I think this would not only be practical but an actual improvement over the existing system.
I wish GoogleGuy or a technical staff member would join in and point out other relevant material that may not be strictly published. Scientific publishing is such a slow and inefficient process ;)
SN
The scaling is indeed linear, because only the non-zero entries of M are stored. Therefore, calculating the product M * PR requires just n * (average number of links per page) steps (instead of n^2). Thus an estimate for one iteration would be 4.5 hours (for the simple algorithm and the conditions mentioned above). More complicated algorithms increase the time per iteration but decrease the number of iterations (which is roughly less than 100 for the simple algorithm).
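That ballpark can be reproduced as a back-of-envelope scaling of the paper's measured "10 minutes per iteration for 80M pages", assuming a web of roughly 2.1 billion pages (an assumed figure for illustration, not Google's actual index size):

```python
# Scale the measured per-iteration time linearly with page count.
measured_pages = 80_000_000       # LARGEWEB dataset from the paper
measured_minutes = 10.0           # measured time per iteration
target_pages = 2_100_000_000      # assumed size of the indexed web

minutes = measured_minutes * target_pages / measured_pages
hours = minutes / 60              # ~4.4 hours per iteration
```

Multiply by the ~100 iterations of the simple algorithm and you are back in the "several days" range quoted earlier in the thread.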