Forum Moderators: open


No. of iterations for Page Rank calculation


harini

9:59 am on Sep 21, 2004 (gmt 0)

10+ Year Member



Hi,

I crawled/spidered through a server starting from a root URL and extracted 200 pages. I also know which pages link to which other pages within this list of 200 pages. I now want to calculate the PageRank of these 200 pages.

This is what I do :-
---------------------
1) I initialize the page rank of all the 200 pages to 0.
2) For each of the 200 pages, I repeatedly compute :-

PR(A) = (1-d) + d ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )

where A : the page whose PageRank I want to compute
PR(A) : the PageRank of A
T1, T2, ... Tn : the pages that point to A
C(T1) : the number of outgoing links of T1
d = 0.85
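The update rule above can be sketched in plain Python. The link structure here is a hypothetical dict mapping each page to the list of pages it links to, and PR is initialized to 1 rather than 0 (both converge for d < 1):

```python
# A minimal sketch of the iteration described above.
# "links" is a hypothetical dict: page -> list of pages it links to.
def pagerank(links, d=0.85, iterations=40):
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # start from 1 rather than 0
    # precompute the number of outgoing links C(T) for each page
    out_count = {p: len(targets) for p, targets in links.items()}
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # sum PR(T)/C(T) over all pages T that link to a
            incoming = sum(pr[t] / out_count[t]
                           for t in pages if a in links[t])
            new_pr[a] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# tiny example: three pages in a cycle, where every PR converges to 1
print(pagerank({"a": ["b"], "b": ["c"], "c": ["a"]}))
```

This is the Jacobi-style scheme described in the replies below: all new values are computed from the previous iteration's values before any are overwritten.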

Now, my questions are :-
---------------------------
1) When do I stop iterating? When the change in the PR values of the nodes across 2 consecutive iterations is less than some threshold, say 0.0001?

2) How many iterations does this usually take? I ran my algorithm for 40 iterations and it still doesn't seem to converge.

3) When convergence occurs, the average PR is approximately 1? (Will this happen always, or only when every page passes on its vote to - i.e. points to - at least one other page?)

Hope to see a reply soon :)

Thanks a bunch,
Harini

DerekH

12:52 am on Sep 30, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



harini wrote
1) When do I stop iterating? When the change in the PR values of the nodes across 2 consecutive iterations is less than some threshold, say 0.0001?

Well, what are you trying to calculate?
a) PR using an old formula?
b) Your position in the results?
c) Something else?

If it's (a), then the formula you have is one of the last published formulae, but where will you get the incoming page ranks from - Google will only give you some, not all of these....
If it's (b), then PR is only a small contributor to your position in the SERPS - what on earth are you worried about the accuracy of one contributor (of unknown importance) for?
If it's (c), ask us again!

Sorry to sound jaundiced, but PR isn't the be-all and end-all of the SERPS.

DerekH

doc_z

8:38 am on Sep 30, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



1) When do I stop iterating? When the change in the PR values of the nodes across 2 consecutive iterations is less than some threshold, say 0.0001?

Normally, one would take a stopping criterion of the following form:
sum over all pages A of (PR_i+1(A) - PR_i(A))^2 < N * epsilon^2,
where N is the number of pages, i denotes the iteration and epsilon is some small fixed number.
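That criterion is straightforward to sketch in code. This is an illustrative helper, not anything from the thread; it takes the PR dicts from two consecutive iterations:

```python
# Sketch of the stopping criterion above: stop when the summed squared
# change across one iteration drops below N * epsilon**2.
def converged(pr_old, pr_new, epsilon=1e-4):
    n = len(pr_old)
    delta = sum((pr_new[p] - pr_old[p]) ** 2 for p in pr_old)
    return delta < n * epsilon ** 2
```

In the main loop you would call `converged(pr, new_pr)` after each iteration and break out once it returns True.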

2) How many iterations does this usually take? I ran my algorithm for 40 iterations and it still doesnt seem to converge.

This strongly depends on the algorithm used for the calculation, the linking structure of the system, and the initial values used for the calculation. For the Jacobi iteration you used and 200 pages, 40 iterations should normally be good enough. However, I would start with different initial values - taking PR=1 should speed up your calculation.

3) When convergence occurs, average PR is approximately 1? (Will this happen always or only when all links pass on votes to - i.e. point to - at least one other link?)

The average PR is one if there are no dead ends, i.e. pages with no outgoing links. Otherwise, it's smaller than one.
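A two-page toy example makes the dead-end effect visible. This is an illustrative sketch (the page names are made up), reusing the same Jacobi-style update from earlier in the thread:

```python
# Illustration of the dead-end remark: page "b" has no outgoing links,
# so PR leaks out of the system and the average drops below 1.
def pagerank(links, d=0.85, iterations=50):
    pr = {p: 1.0 for p in links}
    for _ in range(iterations):
        # dict comprehension reads the old pr before reassignment (Jacobi)
        pr = {a: (1 - d) + d * sum(pr[t] / len(links[t])
                                   for t in links if a in links[t])
              for a in links}
    return pr

pr = pagerank({"a": ["b"], "b": []})
print(sum(pr.values()) / len(pr))  # well below 1
```

Here nothing links to "a", so PR(a) settles at 1-d = 0.15, and PR(b) at 0.15 + 0.85*0.15; the average is far below one because "b" passes nothing back.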

Some general remarks:

- You are neglecting external incoming links. These links play an important role in the PR value of a page. To simulate the effect of external incoming links, you can think of adding an external source.

- d=0.85 is the original damping factor. Google has changed this value.

- There are better (= faster) iteration schemes for calculating PR. However, for 200 pages this isn't important.

harini

10:30 pm on Oct 1, 2004 (gmt 0)

10+ Year Member



Thanks doc_Z :).

1) I am trying to calculate the PR of all the pages (which is about 200) from a specific server. This is just for a project. So I can still keep d=0.85 right? (irrespective of whether Google has changed it or not).

2) As per the requirements, it is sufficient to consider just these 200 pages, so I guess it should be ok that I am not taking the external links into account.

doc_z

9:37 am on Oct 2, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is just for a project. So I can still keep d=0.85 right?

Yes, to examine general aspects of the PR calculation, the exact value of d doesn't matter.

I forgot to mention that the number of iterations also depends on the damping factor 0 < d <= 1. A higher damping factor leads to more iterations, and for a higher damping factor the initial values play a more important role.

... and Welcome to WebmasterWorld harini.

doc_z

11:48 am on Oct 2, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I am trying to calculate the PR of all the pages (which is about 200) from a specific server.

For this purpose you have to decide how self links and multiple links (to the same page) are handled. I would suggest ignoring self links and treating multiple links as a single link.
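That cleanup is easy to do as a preprocessing step before the PR calculation. A hypothetical sketch, where `raw_links` maps each page to the raw list of link targets extracted by the crawler:

```python
# Drop self links and collapse duplicate links before computing PR,
# as suggested above. Preserves the original link order.
def clean_links(raw_links):
    cleaned = {}
    for page, targets in raw_links.items():
        seen = []
        for t in targets:
            if t != page and t not in seen:  # skip self links and repeats
                seen.append(t)
        cleaned[page] = seen
    return cleaned

print(clean_links({"a": ["a", "b", "b", "c"]}))  # {'a': ['b', 'c']}
```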

I crawled/spidered through a server starting from a root URL and extracted 200 pages.

Which spider did you use?

harini

11:14 pm on Oct 2, 2004 (gmt 0)

10+ Year Member



> For this purpose you have to decide how self links and multiple links
> (to the same page) are handled. I would suggest ignoring self links
> and treating multiple links as a single link.
oh yes.. that is what I am doing :)

> Which spider did you use?
Websphinx

Critter

12:39 am on Oct 3, 2004 (gmt 0)

10+ Year Member



I got a semi-intelligent question:

It seems to me that PageRank, with its "iterations", would be well-suited to calculus, as the PageRank for a particular page or pages clearly approaches a "limit".

I'm really not into math, so I may be way off. Anyone ever done anything with this?

doc_z

6:55 pm on Oct 3, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Critter, I'm not sure if I understand your question correctly (my English isn't that good), but I'll try to answer it.

PR calculation is linear algebra (not calculus). For d<1 it's solving a set of linear equations, while for d=1 it's an eigenvalue problem. For d<1 you can easily write down the answer, but you have to invert a matrix. Solving the set of linear equations is normally done numerically by some iteration scheme. During these iterations the PR values approach their 'limit'. Of course, you could also invert the matrix and get the exact values in one step. However, this would be much more time consuming.
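The direct approach doc_z describes can be sketched concretely. For d < 1, the iteration's fixed point solves the linear system (I - d*W) p = (1-d) * 1, where W[a][t] = 1/C(t) if t links to a and 0 otherwise. A sketch using plain Gaussian elimination (the link dict is hypothetical, as before):

```python
# Direct (non-iterative) PageRank for d < 1: build the system
# (I - d*W) p = (1-d)*1 and solve it by Gaussian elimination.
def pagerank_direct(links, d=0.85):
    pages = list(links)
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    # augmented matrix [I - d*W | (1-d)*1]
    m = [[0.0] * (n + 1) for _ in range(n)]
    for i in range(n):
        m[i][i] = 1.0
        m[i][n] = 1.0 - d
    for t, targets in links.items():
        for a in targets:
            m[idx[a]][idx[t]] -= d / len(targets)
    # forward elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # back substitution
    pr = [0.0] * n
    for r in range(n - 1, -1, -1):
        pr[r] = (m[r][n] - sum(m[r][c] * pr[c]
                               for c in range(r + 1, n))) / m[r][r]
    return dict(zip(pages, pr))

# three-page cycle: the exact answer is PR = 1 for every page
print(pagerank_direct({"a": ["b"], "b": ["c"], "c": ["a"]}))
```

This gives the exact solution in one pass, but elimination costs O(n^3), which is why iterative schemes win for large systems; at 200 pages either approach is instant.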