Relation between PR and ToolbarPR

         

doc_z

9:37 pm on Feb 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Two days ago there was speculation from NickCoons [webmasterworld.com] about the relation between PR and ToolbarPR. However, the numbers given couldn't be true, for two reasons:

- The average PR is lower than 1. Therefore, the sum of the PRs of all pages indexed by Google must be lower than 3,083,324,652 (at the moment). This is not the case.

- As mentioned before, the average PR is lower than 1. There are a large number of high-ToolbarPR sites, so that approx. 95% of all pages have a PR lower than 1. This would lead to approx. 95% of pages having ToolbarPR 0. This seems unrealistic. It would also be a bad idea for Google to give most pages the same ToolbarPR 0.

The following relation looks more realistic to me:
ToolbarPR -> PR
 0 -> 0.15 - 0.16
 1 -> 0.16 - 0.25
 2 -> 0.25 - 1.15
 3 -> 1.15 - 10
 4 -> 10 - 100
 5 -> 100 - 1,000
 6 -> 1,000 - 10,000
 7 -> 10,000 - 100,000
 8 -> 100,000 - 1,000,000
 9 -> 1,000,000 - 10,000,000
10 -> 10,000,000 and above
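
For anyone who wants to play with the chart above, here is a minimal lookup sketch. It only encodes the guessed ranges from this post, not anything Google has confirmed. The function name `toolbar_pr` is made up for illustration, and treating each boundary value as belonging to the higher notch is an assumption, since the chart does not say which side a boundary falls on.

```python
import bisect

# Upper bounds of raw PR for ToolbarPR 0..9, copied from the chart above.
# Values at or above the last bound are treated as ToolbarPR 10.
UPPER_BOUNDS = [0.16, 0.25, 1.15, 10, 100, 1_000, 10_000,
                100_000, 1_000_000, 10_000_000]

PR_FLOOR = 0.15  # lowest raw PR in the chart


def toolbar_pr(raw_pr: float) -> int:
    """Map a raw PR value to a ToolbarPR notch under the guessed chart.

    Assumption: a value exactly on a boundary counts as the higher notch.
    """
    if raw_pr < PR_FLOOR:
        raise ValueError("raw PR is below the 0.15 floor used in the chart")
    # bisect_right counts how many upper bounds are <= raw_pr,
    # which is exactly the notch index.
    return bisect.bisect_right(UPPER_BOUNDS, raw_pr)


for value in (0.15, 0.2, 1.0, 5, 250, 50_000_000):
    print(value, "->", toolbar_pr(value))
```

Swapping in a different list of bounds is enough to try out any other proposed relation.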

Are there any better ideas? Or does anybody know the exact relation?

BigDave

10:08 pm on Feb 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Every page in the index must have another page pointing to it. Therefore there are no pages in the index with the lowest allowable PR.

ciml

10:08 pm on Feb 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



doc_z, you describe the 80/20 rule (Pareto's Law). In these situations, logarithmic scales are most helpful.

From "The PageRank Citation Ranking: Bringing Order to the Web":

The bar graphs and percentages shown are a log of the actual PageRank with the top page normalized to 100%...

From PR 0 to at least PR 6, this remains the case in my experience (except that we get to see 0 to 10, not 0 to 100). I would be surprised if the scale didn't keep its shape until the highest numbers, but who knows if there's a little tweaking at the very top? The difference between each Toolbar notch and the one before is hotly debated. Some say it's a factor of about six or seven, I say considerably higher, some people say lower.

It is important to be careful when mixing the various scales involved. In "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Page & Brin write:

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

The raw PageRank values aren't helpful in analysing Google; what matters is having accurate values on some reliable scale.

<added>
> Therefore there are no pages in the index with the lowest allowable PR.

A pedantic comment: If the lowest allowable PR is allowed then presumably there could be a page with that value? :-)

The lowest I've found ought to be about -4 on the scale (numbers below 1 are negative on a log scale, so that shouldn't be as daft as it may sound). I can't tell you whether it actually had less PR than the other Toolbar PR0 pages above it, or whether there's a floor.
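
To illustrate why values below 1 come out negative, here is a minimal sketch. It assumes the notch is simply the logarithm of the value on some scale, with the factor per notch left open because that factor is exactly what is debated. The helper `log_notch` and the two example bases are illustrative only.

```python
import math


def log_notch(value: float, base: float) -> float:
    """Position of `value` on a log scale with `base` as the factor per notch."""
    if value <= 0:
        raise ValueError("a log scale needs a positive value")
    return math.log(value) / math.log(base)


# Compare two candidate factors; anything below 1 lands below zero.
for base in (6, 10):
    notches = [round(log_notch(v, base), 2) for v in (0.0001, 0.5, 1, 1000)]
    print("base", base, notches)
```

With a factor of 10, a value of 0.0001 sits at about -4; with a smaller factor the same value sits lower still.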

doc_z

10:35 pm on Feb 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



ciml,

the scale is exactly logarithmic, apart from the fact that I am leaving off the 0.15 for numbers larger than 10 (for simplicity). My scale just takes the offset of 0.15 into account.
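
As a sketch of what that means in practice: the upper bound of each notch is the 0.15 offset plus a power of ten, and the offset is dropped once the power reaches 10. The names `OFFSET`, `BASE` and `upper_bound` are chosen here for illustration, and the factor of 10 is the one implied by the chart, not a known Google value.

```python
OFFSET = 0.15  # floor used in the chart
BASE = 10.0    # factor between notches implied by the chart (an assumption)


def upper_bound(toolbar: int, simplify: bool = True) -> float:
    """Upper raw-PR bound of a ToolbarPR notch: OFFSET + BASE ** (toolbar - 2).

    With simplify=True the offset is left off once the power reaches 10,
    which is the simplification described above.
    """
    power = BASE ** (toolbar - 2)
    if simplify and power >= 10:
        return power
    return OFFSET + power


for notch in range(10):
    print(notch, round(upper_bound(notch), 2))
```

Changing `BASE` to 6 or 7 shows how quickly the top notches shrink under the smaller factors some people argue for.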

BigDave,

Of course, no page can have a PR of 0.15 (the lower bound), since every page must have another page pointing to it.

rfgdxm1

11:14 pm on Feb 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>Of course, no page can have a PR of 0.15 (the lower bound), since every page must have another page pointing to it.

Incorrect. A page URL can be hand-submitted to Google via its website. Such a page could be in Google with no pages pointing to it.

BigDave

11:39 pm on Feb 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have never seen a page stay in during an update if it has no pages pointing to it.

I could be mistaken about this, but I have never seen it, and everything I have read seems to imply that you must be able to reach the page from one of the seed sites for it to make it into the index.

TheComte

2:06 am on Feb 25, 2003 (gmt 0)

10+ Year Member



I have never seen a page stay in during an update if it has no pages pointing to it.

I agree. I know of at least one site that was ranked PR4 with only one link. After that link was removed, the site disappeared from the index. On request, the link was added back, and the site reappeared in a couple of months. I think the owner has since learned the value of backlinks.

NickCoons

3:42 am on Feb 25, 2003 (gmt 0)

10+ Year Member



doc_z,

<Two days ago there was speculation from NickCoons about the relation between PR and ToolbarPR. However, the numbers given couldn't be true, for two reasons:>

I didn't mean to confuse anyone and imply that those were real-world numbers. I thought I had indicated clearly enough that it was just an example set :-).

<- The average PR is lower than 1. Therefore, the sum of the PRs of all pages indexed by Google must be lower than 3,083,324,652 (at the moment). This is not the case.>

Where did you read that the average PR is lower than 1? I've read in several places that the average PR is 1, including a particular website that I believe has been posted to this forum before, but I don't know if posting the URL is allowed.

doc_z

9:39 am on Feb 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



NickCoons,

the average PR is only 1 if the link structure (of the whole internet) has no "dead ends". However, there are a number of pages with no outgoing links. There are also PDFs and other documents with no backlinks. Therefore the average PR is (a little bit) lower than 1.
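
A toy calculation shows the effect. This is a minimal sketch, assuming the published formula PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ...) with d = 0.85, which matches the 0.15 floor used in this thread. The two tiny graphs and the names `pagerank` and `average` are made up for illustration, and nothing here models how Google actually treats dangling pages.

```python
def pagerank(links, d=0.85, iterations=200):
    """Iterate PR(p) = (1 - d) + d * sum(PR(q) / outlinks(q)) over a link graph.

    `links` maps each page to the list of pages it links to.
    A page with an empty list is a dead end and passes on nothing.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {p: 1.0 - d for p in pages}
        for src, targets in links.items():
            if targets:
                share = d * pr[src] / len(targets)
                for t in targets:
                    new[t] += share
        pr = new
    return pr


def average(pr):
    return sum(pr.values()) / len(pr)


# No dead ends: every page links onward, so no PR is lost.
loop = pagerank({"A": ["B"], "B": ["C"], "C": ["A"]})

# One dead end: C receives a link but has no outgoing links.
dead_end = pagerank({"A": ["B", "C"], "B": ["A"], "C": []})

print("average without dead ends:", round(average(loop), 3))
print("average with a dead end:  ", round(average(dead_end), 3))
```

In the second graph every page still has a backlink, yet the average drops below 1 because the PR flowing into C is never passed on.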

I also agree with BigDave that sites with no incoming links will not appear in the index.

Markus

12:22 pm on Feb 25, 2003 (gmt 0)

10+ Year Member



doc_z, your chart looks very good but IMO your numbers are generally too high. I doubt that any page in the index can have a fraction of total PR on the order of 1/300. There is an interesting study that includes an experiment on PageRank distribution:

[citeseer.nj.nec.com...]

It indicates (for a 1.69 million document testbed) that the highest PR page can hardly exceed a fraction of 0.001 of the total PR. Furthermore, it indicates that there are very few pages with a PR of such an order. Such pages could be those that appear to have a toolbar PR of 11 which a member here has found out by looking at the Google Directory. So, a real PR of (possibly considerably) less than 1,000,000 should be enough for a toolbar PR of 10 (at the actual size of the index).

This would mean that either the log scale factor is lower than ten or that your chart is not correct in the low PR regions. According to the PR log plot in the paper above, about 40 to 50% of all pages appear to have a real PR lower than 1. 40 to 50% of all pages having a toolbar PR of 0, 1 or 2 sounds pretty accurate to me. So, IMO, the log scale factor would have to be lower than 10, but ciml certainly disagrees on that. :)
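
The arithmetic behind the 1/300 and 0.001 figures can be sketched as below. It assumes the total PR equals the number of indexed pages, i.e. an average PR of exactly 1, which is only approximately true. The index size is the figure quoted earlier in the thread; the helper names are made up for illustration.

```python
# Index size quoted earlier in the thread. Treating it as the total PR
# assumes an average PR of exactly 1 (the thread argues it is slightly less).
INDEX_SIZE = 3_083_324_652
TESTBED_SIZE = 1_690_000  # the 1.69 million document testbed


def fraction_of_total(pr, n_pages):
    """Share of the total PR held by one page with raw PR `pr`."""
    return pr / n_pages


def pr_from_fraction(fraction, n_pages):
    """Raw PR of a page that holds `fraction` of the total PR."""
    return fraction * n_pages


# The chart's threshold of 10,000,000 for ToolbarPR 10, as a share of the total:
share = fraction_of_total(10_000_000, INDEX_SIZE)
print("1 /", round(1 / share))  # roughly 1/308

# A 0.001 ceiling expressed as raw PR at both sizes:
print(round(pr_from_fraction(0.001, INDEX_SIZE)))    # roughly 3.08 million
print(round(pr_from_fraction(0.001, TESTBED_SIZE)))  # roughly 1,690
```

Note that the 0.001 ceiling was observed on a much smaller testbed, so it need not carry over unchanged to the full index; the top share may well shrink as the collection grows.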

BTW, the average PR has to be slightly lower than 1. Not so much because of orphan pages (I doubt that there are so many) but because of "dead ends" (dangling links). They are certainly removed from the database before the PR calculations, but when their PR is computed afterwards, their average PR has to be lower than 1. (Larry explains how it works in one of his papers.)