Forum Moderators: open


How long does one iteration of PR calculation take?

Very specifically: not the whole update process, just the PR calculation.

         

Clark

10:14 pm on May 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Never seen anyone discuss this specifically. The old way of doing things was a monthly PR calculation, anchor analysis, and whatever else G needs to do to assign rankings to all the words in its db. But what I'm wondering, considering the changes coming up, is how long one full iteration takes for G to calculate PR? A day? A couple of days? Does anyone have insight?

Note that I'm just talking about ONE iteration. I think the normal G routine is to do multiple iterations, with the number probably dictated by how many passes it takes until further iterations produce few PR changes...

The reason I'm wondering is a comment by someone about Freshbot making its pages "sticky" in the dB until it needs to recalculate PR for a few "days". I always felt that it took a long, long time to calculate PR and that this was the major reason for the monthly update vs a rolling one. If it were as quick as a couple of days, why wouldn't G just crawl continuously: as soon as the last PR iterations were completed (meaning every few days), run the usual sanity/data-validation tests, flick the switch and start over again? I assumed the reason was that a PR iteration took a week or more...

Amidst all this talk, I haven't seen anyone discussing some of the PR changes that have to happen to accommodate a rolling update. The thing with PR calculation is that you have to start fresh whenever you have new data ("new" meaning that some pages are removed, too). So you can't simply begin your new iteration with your last PR numbers, because the PR that was granted by pages that are now gone has to be accounted for (downgraded) somehow in a rolling update. UNLESS G is keeping stats on exactly what gave each page its PR, NOTING changes when deep/freshbot grabs new data, AND MODIFYING PR ON THE FLY. But again, I can't imagine that to be the case, because you change a small piece of the PR pie and it's like dominoes... each change of data affects PR webwide...

So if you were G, how would you handle PR in a rolling update? The answer to that question really requires knowing how long one iteration takes. So I'm trying to get into Google's head but can't do it without knowing the answer to that question :)
Thanks.

quotations

6:53 am on May 26, 2003 (gmt 0)

10+ Year Member



I think I remember (vaguely at this time of night) reading something about this ... I don't think it is so much that the calculation takes a long time as that it takes a lot of iterations of the "estimation" before you get decent values.

I thought it took something like 100 rounds of estimating the PR values before they started to converge on the real values.
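For readers who want to see what "iterations until convergence" means concretely, here is a minimal power-iteration sketch on a toy four-page graph. The graph, the damping factor d=0.85, and the tolerance are illustrative assumptions only; Google's actual values and scale are unknown.

```python
def pagerank(links, d=0.85, tol=1e-8, max_iter=200):
    """links[i] lists the pages that page i links to."""
    n = len(links)
    pr = [1.0 / n] * n                      # uniform starting estimate
    for iteration in range(1, max_iter + 1):
        new = [(1.0 - d) / n] * n
        for i, outs in enumerate(links):
            if outs:                         # distribute PR over out-links
                share = d * pr[i] / len(outs)
                for j in outs:
                    new[j] += share
            else:                            # dangling page: spread evenly
                for j in range(n):
                    new[j] += d * pr[i] / n
        delta = sum(abs(a - b) for a, b in zip(new, pr))
        pr = new
        if delta < tol:                      # successive estimates agree
            return pr, iteration
    return pr, max_iter

# toy graph: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0 ; 3 -> 2
ranks, iters = pagerank([[1, 2], [2], [0], [2]])
```

Even on this tiny graph the loop needs on the order of a hundred passes before successive estimates agree to within the tolerance, which is in line with the 30-100 iteration figures quoted later in the thread.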

bcc1234

7:15 am on May 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If I remember correctly, the original paper stated that you can stop iterating once the relative ranks of pages stop shifting.
But I've never seen any info describing the number of iterations Google makes over their whole index. Would be interesting to know.

doc_z

10:41 am on May 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The time for one iteration depends on many factors, such as the number of indexed pages, the algorithm used for the calculation, the speed of the computer(s), the architecture (i.e. a single computer or a parallel machine) and so on. Therefore, it's hard to give an answer. I think a rough guess would be 'some' hours. The time for the whole calculation also depends on the stopping criteria. In practice the total number of iterations is probably less than 100.

There was a paper about the block algorithm, which reduces the CPU time. But there are also a number of other algorithms which improve the calculation.

So you can't simply begin your new iteration with your last PR numbers, because the PR that was granted by pages that are now gone has to be accounted for (downgraded) somehow in a rolling update. ... But again, I can't imagine that to be the case, because you change a small piece of the PR pie and it's like dominoes... each change of data affects PR webwide...

There is no problem starting with the old PR data. Of course, adding new pages and removing old ones influences not only the pages directly connected, but also a large number of others. However, the final solution is independent of the initial values, and an estimate is better than nothing.

I think it should be possible to calculate an iteration in less than one day. Therefore, it would be possible to calculate PR on the fly: e.g. start with the old PR values, make three iterations and take these values as the new ones. Of course, these wouldn't be the 'final' values, because they are not necessarily stable. (The situation would strongly depend on the changes made in the neighbourhood of a page.) However, in my opinion it would improve the situation, since you would get a PR within two or three days (PR would be fresh). Even if the new PR isn't stable, it would be better than the stable solution from the deep crawl some weeks ago (i.e. it is better to have a PR which is between the old and the new one than to keep the old PR). Also, the situation would become stable within a few more days.
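A rough numerical illustration of this warm-start idea (the toy graph, damping factor, and tolerance are assumptions for the sketch, not Google's data): after a small link change, resuming iteration from the old PR vector settles in noticeably fewer passes than restarting from a uniform guess.

```python
def one_iteration(links, pr, d=0.85):
    """One PR pass; links[i] lists the pages that page i links to."""
    n = len(links)
    new = [(1.0 - d) / n] * n
    for i, outs in enumerate(links):
        targets = outs if outs else list(range(n))   # dangling page
        share = d * pr[i] / len(targets)
        for j in targets:
            new[j] += share
    return new

def iterations_needed(links, start, tol=1e-10, max_iter=1000):
    """Iterate from the given starting vector until estimates settle."""
    pr = list(start)
    for k in range(1, max_iter + 1):
        new = one_iteration(links, pr)
        if sum(abs(a - b) for a, b in zip(new, pr)) < tol:
            return new, k
        pr = new
    return pr, max_iter

old_graph = [[1, 2], [2], [0], [2], [0, 3]]
new_graph = [[1, 2], [2], [0], [2], [0]]    # page 4 dropped its link to 3

old_pr, _ = iterations_needed(old_graph, [0.2] * 5)
_, cold = iterations_needed(new_graph, [0.2] * 5)   # fresh uniform start
_, warm = iterations_needed(new_graph, old_pr)      # resume from old PR
```

Here `warm` comes out well below `cold`: the old vector is already close to the new fixed point, so most of the convergence work has effectively been done.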

You may be also interested in this thread [webmasterworld.com]

julinho

11:34 am on May 26, 2003 (gmt 0)

10+ Year Member



There were some recent threads (which linked to some papers by Google techs) talking about improvements on the way that PR is calculated (basically, they are looking for a faster convergence of PR).

So, whatever time it used to take to calculate PR, it's shorter now, for sure. This might help explain the change in the way that Google updates.

egomaniac

3:43 am on May 27, 2003 (gmt 0)

10+ Year Member



I remember reading that it took many days or even weeks to calculate PR once the deepcrawl was done. Can't remember where I read that, though. I believe it was here at WebmasterWorld, but I have no idea if it was just someone's opinion or if there was any basis in fact.

Slud

3:50 am on May 27, 2003 (gmt 0)

10+ Year Member



I think I got the impression from some paper that it was between 10 and 14 days. I'll see if I can dig up a source for that.

vitaplease

6:10 am on May 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Adaptive Methods for the Computation of Pagerank [dbpubs.stanford.edu]

In this paper, we make the following simple observation: the convergence rates of the PageRank values of individual pages during application of the Power Method is nonuniform. That is, many pages converge quickly, with a few pages taking much longer to converge. Furthermore, the pages that converge slowly are generally those pages with high PageRank.

On an AMD Athlon 1533MHz machine with a 6-way RAID-5 disk volume and 2GB of main memory, each application of Algorithm 1 on the 80M page LARGEWEB dataset takes roughly 10 minutes. Given that computing PageRank generally requires anywhere from 30-100 applications of Algorithm 1, depending on the desired error, the need for fast methods for graphs with billions of nodes is clear.

And that's only for a dataset of 80M pages...
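The paper's observation is easy to reproduce on a toy graph: record the iteration at which each page first settles near its final value. This sketch is purely illustrative (the graph, damping factor, and tolerance are assumptions, and it is not the paper's actual adaptive algorithm), but it shows an unlinked page settling immediately while well-linked pages take dozens of passes.

```python
def pagerank_history(links, d=0.85, iters=200):
    """Run power iterations and keep every intermediate PR vector."""
    n = len(links)
    in_links = [[] for _ in range(n)]        # invert the link graph
    for i, outs in enumerate(links):
        for j in outs:
            in_links[j].append(i)
    pr = [1.0 / n] * n
    history = [pr]
    for _ in range(iters):
        pr = [(1.0 - d) / n
              + d * sum(pr[j] / len(links[j]) for j in in_links[i])
              for i in range(n)]
        history.append(pr)
    return history

links = [[1, 2], [2], [0, 1], [2]]           # page 3 has no in-links
hist = pagerank_history(links)
final = hist[-1]
# first iteration at which each page is within 1e-6 of its final value
settle = [next(k for k, vec in enumerate(hist)
               if abs(vec[i] - final[i]) < 1e-6)
          for i in range(len(links))]
```

Page 3, which nothing links to, is fixed after a single pass; the well-linked pages keep moving for many more. That is exactly the nonuniform convergence the paper exploits by not recomputing already-settled pages.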

Clark

11:47 am on May 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Would I be correct to assume that as the web gets bigger, the computation grows faster than linearly? IOW, that 160M pages is more than twice as large a computation as 80M?

Kackle

2:09 pm on May 27, 2003 (gmt 0)



In the National Science Foundation press release [eurekalert.org] regarding the Stanford project they funded on speeding up PageRank, there is this sentence:

Computing PageRank, the ranking algorithm behind the Google search engine, for a billion Web pages can take several days.

That's probably as close to an informed answer as you are likely to find, given that Google, Inc. is not likely to volunteer the information. By the way, one member of the Stanford team recently said that the "five times faster" buzz that has been all over the media is not right. He said that using their technique at Google would probably result in a 30 percent rate increase. That correction is at the end of this article [wired.com].

That 30 percent speedup would probably get eaten up by an expanding web within a year or two. I think PageRank's days are numbered.

doc_z

3:55 pm on May 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Would I be correct to assume that as the web gets bigger, the computation grows faster than linearly? IOW, that 160M pages is more than twice as large a computation as 80M?

Assuming that you scale the stopping criteria with the number of pages n, the answer is as follows (as far as I remember): for simple algorithms (such as the original or the block algorithm) the CPU time needed scales with n^2.

By the way, the concept of block algorithms isn't very new. One advantage is that you can easily parallelize these kinds of algorithms. However, there are faster algorithms. Also, one has to mention that Kamvar et al. compare their block algorithm with a simple (slow) algorithm. I expect that Google has implemented a more 'state of the art' one.

BigDave

4:48 pm on May 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I expect that when increasing the number of pages, the PR calculation time increase would be linear. The calculation is only done once per page per iteration.

What *might* change is the number of iterations, but that really depends more on the shape of the webgraph than the size.

While PR can take up a good chunk of the calculation time, there are a lot of other calculations going on to build the index at the same time. PR has nothing to do with building the new index. PR is *only* a number that is associated with a page.

On a per page basis, indexing takes far more time than generating PR. The difference is that the indexing work can be done at the time that the page is crawled. PageRank (as originally defined) has to be done on a complete set of data.

I do think it might be possible to run a continuous update by doing what a previous poster mentioned. Use freshie to add or delete pages and run the PR algo on a snapshot. As soon as it is done, run it again on a new snapshot. It seems that this would be a workable way to run a continuous update of the index with PR updated every couple of weeks. That would just give us some mini-dances that are only PR related, not inclusion related.

doc_z

5:09 pm on May 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I expect that when increasing the number of pages, the PR calculation time increase would be linear. The calculation is only done once per page per iteration.

What *might* change is the number of iterations, but that really depends more on the shape of the webgraph than the size.

Even if the number of iterations is the same, the scaling is not necessarily linear. The reason is that during an iteration there are calculations which scale with n (i.e. are linear), while others scale with n^2. For example, calculating the norm of a vector is linear, while the matrix * vector multiplication scales with n^2. Therefore, the whole process scales with n^2. The situation is a little more complicated because the corresponding matrix is a sparse one. However, this shouldn't change the situation, though it also depends on the implementation of the algorithm.

BigDave

6:46 pm on May 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



From what I can see, all the non-linear aspects would be done in the preprocessing, not during the individual iterations.

PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn))

is run once per page per iteration. The only variable in the size of this calculation is the number of incoming links.

Addition, subtraction and division are all very straight forward.

In fact, I see very few of the preprocessing steps as being anything other than linear. They could be done in a nonlinear fashion, but there really is little reason to when your goal is real world results instead of mathematical purism.

Of course I might be forgetting a step or two; I did not go over every step of setting up to run the calculations. If I'm missing something, please feel free to let me know what I missed.
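Writing that per-page formula out as code makes the cost argument concrete: updating one page touches only its incoming links, so a full sweep over all pages is linear in the total number of links. The three-page graph and d=0.85 are illustrative assumptions; this uses the original non-normalised form of the formula, in which the PR values sum to the number of pages rather than to 1.

```python
def update_page(a, in_links, out_degree, pr, d=0.85):
    """PR(A) = (1-d) + d*(PR(t1)/C(t1) + ... + PR(tn)/C(tn))"""
    return (1.0 - d) + d * sum(pr[t] / out_degree[t] for t in in_links[a])

# toy graph: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0
in_links = {0: [2], 1: [0], 2: [0, 1]}       # who links to each page
out_degree = {0: 2, 1: 1, 2: 1}              # C(t): out-link counts
pr = {0: 1.0, 1: 1.0, 2: 1.0}

for _ in range(60):                          # one full sweep per iteration
    pr = {a: update_page(a, in_links, out_degree, pr) for a in pr}
```

The work in `update_page` is one multiply-add per incoming link, which is the sense in which a single iteration is "once per page" and linear in the graph size.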

doc_z

7:25 pm on May 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn))

You could rewrite that equation with a vector PR (n-dimensional, where n is the number of pages) and some kind of (static) transition matrix M:

PR_k+1 = (1-d) + d M*PR_k (where k is the iteration number)

Thus you mainly have to calculate the product of the matrix M (the transition matrix) and the vector PR (the PR vector of the previous iteration). (And that is how you compute PR in practice - at least for the simplest algorithm.) Therefore, the time for one iteration shouldn't scale linearly, but with n^2.

As already said, the situation is more complicated, because M is a sparse matrix, i.e. most of the entries are zero. This could change the situation in principle, but it depends on the implementation of the algorithm. However, I think the scaling behaviour is not affected.

killroy

12:39 pm on May 29, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, one of the papers treats this matter. Since the matrix is so large and sparse, most of the standard matrix methods don't work. So it is basically solved as a set of equations, and therefore has to be quadratic.

One thing I don't understand... say it takes 10 mins for 80m pages. Then 3000m pages should be calculable in an hour or less using a small cluster.

What I would do then: have this cluster continuously recalculate PR, one iteration every hour. Removing a page would then slowly fade out its effect over the next 50 hours (50 iterations), i.e. about 2 days. So if you remove a PR6 page, one day later it would still have the effect of a link from a PR3 page, and two days later it would have the proper PR as if the page were gone. Same for new backlinks: one day later those backlinks would have half their final effect.

In fact, using interpolation, as described in the Stanford papers (use CiteSeer), you could always have the "proper" estimated PR for all pages after just one iteration. Every hour it would get a bit closer, and using the new methods, only 10-15 iterations may be enough for a good PR.

Since PR is inherently approximate anyway, I think this would be VERY workable, having the additional "benefit" of making changes gradual and SERPs less shaky.

I think this would not only be practical but an actual improvement over the existing system.

I wish GoogleGuy or a technical staff member would join and point out other relevant material that may not be strictly published. Scientific publishing is such a slow and inefficient process ;)

SN

percentages

12:53 pm on May 29, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How long does one iteration of PR calculation take?

Iteration is very confusing, so I will work on the basis of a page instead.

Currently, and on average, about 23 seconds per page! for one PC running at 1GHz. Now guess how many PCs Google has on this task and you have the answer :)

doc_z

3:05 pm on May 29, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have to correct my statement. The situation changes because of the sparse matrix.

The scaling is indeed linear, because only the non-zero entries of M are stored. Therefore, calculating the product M * PR requires just n * (average number of links per page) steps (instead of n^2). Thus an estimate for one iteration would be 4.5 hours (for the simple algorithm and the conditions mentioned above). More complicated algorithms increase the time for one iteration, but decrease the number of iterations (which is roughly less than 100 for the simple algorithm).
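To make that correction concrete, here is a sketch (with an invented random graph; the sizes are illustrative assumptions) that counts the work in one sparse iteration: one multiply-add per stored link, i.e. n times the average number of links, rather than the n^2 a dense matrix-vector product would cost.

```python
import random

def sparse_iteration(out_links, pr, d=0.85):
    """One PR pass storing only the non-zero entries of M (the links)."""
    n = len(pr)
    new = [(1.0 - d) / n] * n
    ops = 0
    for i, outs in enumerate(out_links):
        share = d * pr[i] / len(outs)
        for j in outs:
            new[j] += share                  # one update per stored link
            ops += 1
    return new, ops

random.seed(0)
n, avg_links = 1000, 10
graph = [random.sample(range(n), avg_links) for _ in range(n)]
pr, ops = sparse_iteration(graph, [1.0 / n] * n)
# ops comes out as n * avg_links = 10,000, far below the n*n = 1,000,000
# multiply-adds a dense matrix-vector product would need
```

Scaling this per-link cost up to a multi-billion-page web graph is what puts one iteration in the range of hours rather than weeks, as estimated above.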