Forum Moderators: open
I crawled/spidered through a server starting from a root URL and extracted 200 pages. I also know which pages link to which other pages within this set of 200. I now want to calculate the PageRank of these 200 pages.
This is what I do :-
---------------------
1) I initialize the page rank of all the 200 pages to 0.
2) For each of the 200 pages, I repeatedly compute :-
PR(A) = (1-d) + d ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )
where A : page I want to compute page rank of
PR(A) : page rank of A
T1, T2... Tn : pages that point to A
C(T1) : Number of outgoing links of T1
d = 0.85
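The steps above can be sketched roughly as follows (the function and variable names are my own, and I start the PR values at 1 rather than 0, which tends to converge faster):

```python
# links[p] is the list of pages that page p links OUT to.
def pagerank(links, d=0.85, iterations=40):
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # initial guess; PR=1 converges faster than 0
    # For each page, the pages that point to it (the T1..Tn in the formula).
    incoming = {p: [q for q in pages if p in links[q]] for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # PR(A) = (1-d) + d * sum over Ti of PR(Ti)/C(Ti)
            new_pr[a] = (1 - d) + d * sum(pr[t] / len(links[t]) for t in incoming[a])
        pr = new_pr
    return pr
```

For a simple cycle (A -> B -> C -> A) this settles at PR = 1 for every page, which matches the "average PR is 1" observation discussed below.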
Now, my questions are :-
---------------------------
1) When do I stop iterating? When the change in the PR values of the nodes across 2 consecutive iterations is less than some threshold, say 0.0001?
2) How many iterations does this usually take? I ran my algorithm for 40 iterations and it still doesn't seem to converge.
3) When convergence occurs, is the average PR approximately 1? (Will this always happen, or only when every page passes on its vote - i.e. points to - at least one other page?)
Hope to see a reply soon :)
Thanks a bunch,
Harini
1) When do I stop iterating? When the change in the PR values of the nodes across 2 consecutive iterations is less than some threshold, say 0.0001?
Well, what are you trying to calculate?
a) PR using an old formula?
b) Your position in the results?
c) Something else?
If it's (a), then the formula you have is one of the last published formulae, but where will you get the incoming page ranks from - Google will only give you some, not all of these....
If it's (b), then PR is only a small contributor to your position in the SERPS - what on earth are you worried about the accuracy of one contributor (of unknown importance) for?
If it's (c), ask us again!
Sorry to sound jaundiced, but PR isn't the be all and end all of SERPS.
DerekH
1) When do I stop iterating? When the change in the PR values of the nodes across 2 consecutive iterations is less than some threshold, say 0.0001?
Normally, one would take a stopping criterion of the following form:
sum over all pages of (PR(A)_{i+1} - PR(A)_i)^2 < N*epsilon^2,
where N is the number of pages, i denotes the iteration and epsilon is some small fixed number.
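As a sketch, that stopping test could look like this (function name and epsilon value are my own choices):

```python
# pr_old and pr_new map page -> PR value from two consecutive iterations.
def converged(pr_old, pr_new, epsilon=1e-4):
    n = len(pr_new)
    # Sum of squared changes across all pages, compared against N*epsilon^2.
    sq = sum((pr_new[p] - pr_old[p]) ** 2 for p in pr_new)
    return sq < n * epsilon ** 2
```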
2) How many iterations does this usually take? I ran my algorithm for 40 iterations and it still doesn't seem to converge.
This strongly depends on the algorithm used for the calculation, the linking structure of the system, as well as the initial values used for the calculation. For the Jacobi iteration you used and 200 pages, 40 iterations should normally be good enough. However, I would start with different initial values - taking PR=1 should speed up your calculations.
3) When convergence occurs, is the average PR approximately 1? (Will this always happen, or only when every page passes on its vote - i.e. points to - at least one other page?)
The average PR is one if there are no dead ends, i.e. pages with no outgoing links. Otherwise, it's smaller than one.
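To see the dead-end effect concretely, here is a tiny two-page example of my own, where page "b" has no outgoing links, so PR leaks out of the system and the average settles below 1:

```python
links = {"a": ["b"], "b": []}   # "b" is a dead end
pr = {"a": 1.0, "b": 1.0}
for _ in range(50):
    pr = {
        "a": 0.15,                   # nothing links to a: PR(a) = (1-d)
        "b": 0.15 + 0.85 * pr["a"],  # only a links to b, and a has 1 outlink
    }
avg = sum(pr.values()) / len(pr)
# pr["a"] = 0.15, pr["b"] = 0.2775, so avg = 0.21375 < 1
```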
Some general remarks:
- You are neglecting external incoming links. These links play an important role for the PR value of a page. To simulate the effect of external incoming links, you can think of adding an external source.
- d=0.85 is the original damping factor. Google has changed this value.
- There are better (=faster) iteration schemes to calculate PR. However, for 200 pages this isn't important.
1) I am trying to calculate the PR of all the pages (which is about 200) from a specific server. This is just for a project. So I can still keep d=0.85 right? (irrespective of whether Google has changed it or not).
2) As per the requirements, it is sufficient to consider just these 200 pages, so I guess it should be ok that I am not taking the external links into account.
This is just for a project. So I can still keep d=0.85 right?
Yes, to examine general aspects of the PR calculation, the exact value of d doesn't matter.
I forgot to mention that the number of iterations also depends on the damping factor 0 < d <= 1. A higher damping factor leads to more iterations. Also, for a higher damping factor the initial values play a more important role.
... and Welcome to WebmasterWorld harini.
I am trying to calculate the PR of all the pages (which is about 200) from a specific server.
For this purpose you have to decide how self links and multiple links (to the same page) are handled. I would suggest ignoring self links and treating multiple links as a single link.
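That cleanup step could be sketched like this (the helper name and the dict-of-lists layout are my own assumptions about how the crawl data is stored):

```python
def normalize(raw_links):
    """Drop self links and collapse duplicate links into a single link."""
    clean = {}
    for page, targets in raw_links.items():
        seen = []
        for t in targets:
            if t != page and t not in seen:  # skip self links and repeats
                seen.append(t)
        clean[page] = seen
    return clean
```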
I crawled/spidered through a server starting from a root URL and extracted 200 pages.
Which spider did you use?
PR calculation is linear algebra (not calculus). For d<1 it amounts to solving a set of linear equations, while for d=1 it's an eigenvalue problem. For d<1 you can easily write down the answer, but you have to invert a matrix. In practice, the set of linear equations is solved numerically by some iteration scheme; during these iterations the PR values approach their 'limit'. Of course, you could also invert the matrix and get the exact values in one step. However, this would be much more time consuming.
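A sketch of that direct route: write the equations as (I - d*M) x = (1-d)*1, where M[a][t] = 1/C(t) when t links to a, and solve the system exactly instead of iterating. The helper below uses plain Gaussian elimination to stay dependency-free; all names are my own.

```python
def pagerank_direct(links, d=0.85):
    pages = list(links)
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    # Build A = I - d*M and the right-hand side b = (1-d)*1.
    A = [[0.0] * n for _ in range(n)]
    b = [1.0 - d] * n
    for i in range(n):
        A[i][i] = 1.0
    for t, targets in links.items():
        if targets:
            w = d / len(targets)
            for a in targets:
                A[idx[a]][idx[t]] -= w
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (b[r] - s) / A[r][r]
    return dict(zip(pages, x))
```

For 200 pages either route is fast; the iteration only wins for much larger systems.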