Forum Moderators: open
[edited by: MOOSBerlin at 6:35 pm (utc) on Oct. 24, 2002]
What you describe makes perfect sense. The numbers don't work out exactly like that, but the logarithmic scale on the Toolbar PageRank graph will reflect the link structure of the Web to a good degree. PageRank graphs could be based on a rank distribution, but it would seem much easier to use some log of the raw data and let the natural structure of the Web bring it into line.
Power law distributions (Zipf, Pareto) tend to reflect the properties of large populations. Similar curves can be seen with referrers or browser IP addresses if you look at the logs of a popular site (as Jakob Nielsen did some time ago) or the aggregate logs over a number of sites (as I do when I should be working).
This list should be completed... if you are talking about the number of sites whose highest-PageRank page is...
Some older threads:
PR10 and PR9 sites.
[webmasterworld.com...]
Number of PR 10 and 9 sites:
[webmasterworld.com...]
Geographical differences:
[webmasterworld.com...]
PR10: 9 pages (8.59)
PR9: 74 pages (8.59x8.59) etc
PR8: 634 pages
PR7: 5,445 pages
PR6: 46,770 pages
PR5: 401,753 pages
PR4: 3,451,057 pages
PR3: 29,644,581 pages
PR2: 254,646,947 pages
PR1: 2,187,417,279 pages
This totals 2,475,614,547 (the closest a two-decimal-place factor can get). The above assumes PR0 pages are not included in the index. Is that a correct assumption?
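The geometric progression behind that list is easy to verify; a quick sketch that just raises the 8.59 factor from above (small rounding differences of a page or two remain, depending on where you round):

```python
# Sanity check of the list above: each Toolbar PR step down holds
# roughly 8.59x as many pages as the one above it.
factor = 8.59
counts = {pr: round(factor ** (11 - pr)) for pr in range(10, 0, -1)}
total = sum(counts.values())
# e.g. counts[10] == 9, counts[9] == 74, counts[6] == 46770,
# and total comes out within a few pages of 2,475,614,547
```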
The PageRank plots in that paper indicate that more than 40 percent of the pages have a real PageRank smaller than 1 (or 1/N, depending on the algo used). If we start the logarithmic scaling of Toolbar PR at 0.15 (the minimal PR of a page at a damping factor of 0.85) and if we assume a factor of 7 for the scaling, we would have more than 40 percent PR0 pages. At least for low Toolbar PR, the scaling seems to work in another way.
Another interesting result is that there are some high PR pages that each have a fraction of 0.0005 to 0.001 of the total PR. This could apply to the whole web as well (self-similar behaviour of the web), so we may be able to figure out realistic values for the Toolbar scaling.
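To make the PR0 point concrete: if Toolbar PR n covered real PR from 0.15 * 7^n upward (an assumption pieced together from the figures in this thread, not a known formula), the buckets would look like this:

```python
d = 0.85
min_pr = 1 - d   # 0.15, the floor PR of any page
factor = 7       # assumed Toolbar scaling factor

# guessed lower bound of real PR for each Toolbar PR value
buckets = {n: min_pr * factor ** n for n in range(11)}
# buckets[1] is about 1.05, so under this guess every page with
# real PR below ~1 (over 40% of the web, per the paper) shows PR0
```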
 # Name         URL                  Backlinks
*********************************************
 1 Google       www.google.com         243 000
 2 Yahoo!       www.yahoo.com          657 000
 3 Microsoft    www.microsoft.com      125 000
 4 USA Today    www.usatoday.com       102 000
 5 Sun          www.sun.com            111 000
 6 Apple        www.apple.com           82 000
 7 Lycos        www.lycos.com          193 000
 8 Real         www.real.com            69 600
 9 Macromedia   www.macromedia.com      50 200
10 Adobe        www.adobe.com          125 000
11 W3.org       www.w3.org (or w3c)     57 000
12 NASA         www.nasa.gov            87 100
13 Netscape     www.netscape.com        92 600
14 DMOZ         dmoz.org               804 000
15 NSF          www.nsf.gov             50 900
16 Energy.gov   www.energy.gov          40 700
17 White House  www.whitehouse.gov      36 500
18 First Gov    www.firstgov.gov        59 800
19 MIT          mit.edu                 43 300
20 Mac          www.mac.com             14 200
21 Nature       www.nature.com          16 600
Good point; some of those PR10 pages Chris mentions may have only a handful of links.
The way I see the 0.15 figure, it is shared among all URLs in the index. So our starting value may be 0.15 / 2,469,940,685 = 0.00000000006 for a page with no inbound links.
With an effective Toolbar log base (i.e. log base taking into account final normalisation) of 7, that's about PR -12. My experience is that Googlebot spiders until about PR -4; that's 0.0004 with an effective Toolbar log base 7.
This suggests that Google needs a PR of about 6.6 million times the base amount in order to follow the link. That would seem unlikely, so please someone feel free to point out the error(s) of this thought stream.
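Redoing that log-base-7 arithmetic (purely illustrative; the 2,469,940,685 index size and the log base 7 are this thread's assumptions, not known Google constants):

```python
import math

N = 2_469_940_685            # index size assumed in this thread
base_pr = 0.15 / N           # starting PR of a page with no inbound links
crawl_floor = 0.0004         # apparent crawl cut-off from this thread

def log7(x):
    return math.log(x) / math.log(7)

print(log7(base_pr))          # about -12
print(log7(crawl_floor))      # about -4
print(crawl_floor / base_pr)  # about 6.6 million
```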
> Another interesting result is that there are some high PR pages that each have a fraction of 0.0005 to 0.001 of the total PR...
In a power law distribution, what proportion of the Web would have about that amount of PR? The top 140/2,469,940,685, by any chance?
@ dantehemann: I think PR0 pages are included in the index, as the index is Google's database. There are also some PR-none pages (grey toolbar) in the index, which have been spidered and get their PageRank after the next dance.
@ all: The above also works with a small correction term:
PR10: correction (20, 30, or 40) + 7.xx
PR9: correction + 7.xx*7.xx
PR8: correction + 7.xx*7.xx*7.xx
and so on.
It depends on the algo you use:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Total PR equals the number of pages (N). Average PR is 1. Minimum PR is (1-d) or
PR(A) = (1-d) / N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Total PR is 1. Average PR is 1/N. Minimum PR is (1-d)/N.
In the end, both are the same. The second is the probability distribution of the random surfer reaching a page after clicking an infinite number of links. The first is an expected value for the same thing when the random surfer starts N times. Important for the log scale is that there is a minimum PR. I prefer the first algo because you don't have to calculate with so many decimal places, although I think that Google uses the second one.
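The equivalence of the two algos is easy to see on a toy graph. This sketch (an illustration, not Google's implementation) iterates both versions and shows they differ only by the factor N:

```python
d = 0.85
# toy web: A links to B and C; B links to C; C links to A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
N = len(pages)

def pagerank(base):
    """Iterate PR(A) = base + d * sum(PR(T)/C(T)) to convergence."""
    pr = {p: base for p in pages}
    for _ in range(200):
        pr = {p: base + d * sum(pr[q] / len(links[q])
                                for q in pages if p in links[q])
              for p in pages}
    return pr

pr1 = pagerank(1 - d)        # total N, average 1, floor (1-d)
pr2 = pagerank((1 - d) / N)  # total 1, average 1/N, floor (1-d)/N
# pr2[p] equals pr1[p] / N for every page p
```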
> With an effective Toolbar log base (i.e. log base taking into account final normalisation) of 7, that's about PR -12. My experience is that Googlebot spiders until about PR -4; that's 0.0004 with an effective Toolbar log base 7.
The crawl may be based on PR, but I don't think that this applies to very low PR regions. It's a question of how many links a page is away from the starting points of the crawl (which probably is DMOZ). Imagine a page that is 50 links away from DMOZ, but every page in that row has 1000 other links on it. The page may be crawled but the PR of that page could become infinitely close to 0.15 (or 0.15 / N).
IMO, the log scale doesn't apply to low PR regions. Otherwise we would see the Toolbar PR distributions that have been mentioned in this thread and the Toolbar would be pretty worthless and definitely very frustrating for webmasters.
> In a power law distribution, what proportion of the Web would have about that amount of PR? The top 140/2,469,940,685, by any chance?
The paper that I've mentioned used a 1,500,000-page sample of the web, and there were 3 pages that each had a fraction of 0.0005 to 0.001 of the total PR. In that paper, they calculated PageRanks for another 100,000-page sample, and the power law distribution was almost the same (with a factor of 1/x^2.1). I cannot really imagine that the results apply fully to a web of 2.5 billion pages, but it sounds reasonable that a small number of pages (the root pages of Google and Yahoo?) hold such a fraction of the total PR. If there are still 140 PR10 pages, PR10 probably starts at a fraction of 0.0001 of the total PR of the web. (Just a guess. I'll have to think about it...)
And also, this is very nice work on the update process explanation [dance.efactory.de].
I think both offer some background worth reading.
gmoney, a "real" PR of 1 equals 1/1,690,000 with the algo used in that paper and at a sample of 1,690,000 pages. 1/1,690,000 is about 6e-7. So, if you add the first 5 fractions of PR in the plot, you should get the total fraction of pages with PR < 1. This is roughly 0.11 + 0.09 + 0.12 + 0.06 + 0.04 = 0.42.
This is just an estimate based on a plot. Above all, we don't know how they aggregated the pages into fractions. Intuitively, I'd say that more than half of the pages must have a PR smaller than one, because the average PR of a page is one and in such a skewed distribution the median has to be below the average.
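The median-below-average point can be illustrated by sampling from a power law with the 1/x^2.1 density mentioned above (a synthetic sample, not real PageRank data):

```python
import random

random.seed(42)
# paretovariate(1.1) has density ~ 1/x^2.1, as in the cited paper
sample = sorted(random.paretovariate(1.1) for _ in range(100_000))
mean = sum(sample) / len(sample)
median = sample[len(sample) // 2]
# in such a heavily right-skewed distribution the median sits
# well below the mean, so most pages are below average
```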
Never checked your profile site until Brett mentioned it.
Well done, very nice overview.
> The crawl may be based on PR, but I don't think that this applies to very low PR regions. It's a question of how many links a page is away from the starting points of the crawl (which probably is DMOZ). Imagine a page that is 50 links away from DMOZ, but every page in that row has 1000 other links on it. The page may be crawled but the PR of that page could become infinitely close to 0.15 (or 0.15 / N).
Interesting idea. If true, it could mean that you are better off making a site map with 1000 links to your inner pages than having inner pages link in the A -> B -> C -> D -> E manner.
> Important for the log scale is that there is a minium PR.
I am intrigued by that comment. In a thread [webmasterworld.com] last month, gmoney convinced me that the normalisation constant (for Toolbar purposes, after PageRank calculation is finished) is equivalent to a change in the log base when comparing different PR values.
When considering the results in power law distribution mode, not absolute values, I would expect the 1.5 million page sample to behave in a very similar way to Google's 2.5 billion URL index.
vitaplease:
> if true, it could mean that you are better off making a site-map with 1000 links to your inner pages than having inner pages linking in the A -> B -> C -> D -> E etc manner.
Indeed, but I find that this is not the case. For links within a domain, I find that a chain of links (A -> B -> C etc.) with about 250 links on each page is crawled only about two links past Toolbar PR0, but with three links on each page is crawled about nine links past PR0. This corresponds approximately to -4 on the Toolbar PageRank scale.
I often ponder whether this is actually a PageRank limit, or whether it's like Markus' "how many links a page is away from the starting points of the crawl" approach, where following from one page with two links out counts for as much distance as following two links deep with one link on each page.
In the former, the decay factor would multiply with depth, while in the latter it presumably would not be taken into account. So if we knew the log scale and decay factor, maybe we could answer that question?
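One way to test the PageRank-limit reading of those numbers: if each hop down a chain passes on a fraction d/k of the PR (k links per page), the 250-link and 3-link cases can be compared against an assumed crawl cut-off. This sketch (the 1e-5 cut-off is a guess, not an observed constant) happens to reproduce the two-versus-nine pattern:

```python
import math

d = 0.85  # damping factor

def chain_depth(outlinks, cutoff=1e-5):
    """Links down a chain before the PR passed on drops below
    `cutoff`, when every page in the chain carries `outlinks` links."""
    per_hop = d / outlinks
    return math.floor(math.log(cutoff) / math.log(per_hop))

print(chain_depth(250))  # 2 links, matching the 250-link observation
print(chain_depth(3))    # 9 links, matching the 3-link observation
```

That the same cut-off fits both cases hints at a PR limit rather than a pure link-distance limit, though it hardly settles the question.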