
Sorry for my poor English, but could it be that PageRank is just a simple logarithmic algo that classifies the more than 2 billion pages/sites on the Web into 11 classes (0 and 1-10)?

For example:

2.471.658.428 sites/pages in the web, starting with 7.5 sites at PR10 and a factor of 7 from one class to the next

that means

- PR10: 8 sites (really 7.5)

- PR9: 53 sites (7.5x7)

- PR8: 368 sites (7.5x7x7)

- PR7: 2.573 sites (7.5x7x7x7)

- PR6: 18.008 sites (and so on)

- PR5: 126.053 sites

- PR4: 882.368 sites

- PR3: 6.176.573 sites

- PR2: 43.236.008 sites

- PR1: 302.652.053 sites

- PR0: 2.118.564.368 sites

That also means:

- PR9 and higher: 60 sites

- PR8 and higher: 428 sites

- PR7 and higher: 3.000 sites

- PR6 and higher: 21.008 sites

- PR5 and higher: 147.060 sites

- PR4 and higher: 1.029.428 sites

- PR3 and higher: 7.206.000 sites

- PR2 and higher: 50.442.008 sites

- PR1 and higher: 353.094.060 sites

- PR0 and higher: 2.471.658.428 sites
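The counts in both lists can be reproduced with a few lines of integer arithmetic. This is only a sketch under one assumption: the k-th class below the top (k = 0 for PR10) holds 7.5 × 7^k sites, shown rounded up, and the "and higher" figures are sums of the unrounded values. The function names are made up for illustration.

```python
# Assumed model: 7.5 * 7**k sites in the k-th class from the top (k = 0 is PR10).
# Integer arithmetic avoids float rounding surprises on values ending in .5.

def class_count(k):
    """Sites in the k-th class from the top, i.e. 7.5 * 7**k rounded up."""
    return (15 * 7**k + 1) // 2          # 15 * 7**k is always odd

def cumulative_count(k):
    """Sites in the top k+1 classes: sum of 7.5 * 7**j for j <= k, rounded half up."""
    exact_times_4 = 5 * (7**(k + 1) - 1)  # geometric sum, scaled by 4
    return (exact_times_4 + 2) // 4

for k in range(11):
    pr = 10 - k
    print(f"PR{pr}: {class_count(k):>13,}   PR{pr} and higher: {cumulative_count(k):>13,}")
```

Changing the 7 or the 7.5 shows how sensitive the lower classes are to the factor, since PR0 alone accounts for most of the total.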

any ideas?

[**edited by**: MOOSBerlin at 6:35 pm (utc) on Oct. 24, 2002]


Don't think so...

Welcome to WebmasterWorld [webmasterworld.com], MOOSBerlin.

What you describe makes perfect sense. The numbers don't work out exactly like that, but the logarithmic scale on the Toolbar PageRank graph will reflect the link structure of the Web to a good degree. PageRank graphs *could* be based on a rank distribution, but it would seem much easier to use some log of the raw data and let the natural structure of the Web bring it into line.

Power law distributions (Zipf, Pareto) tend to reflect the properties of large populations. Similar curves can be seen with referrers or browser IP addresses if you look at the logs of a popular site (as Jakob Nielsen did some time ago) or the aggregate logs over a number of sites (as I do when I should be working).

| Name | URL | Backlinks |
|---|---|---|
| Google | www.google.com | 243 000 |
| Yahoo! | www.yahoo.com | 657 000 |
| Microsoft | www.microsoft.com | 125 000 |
| USA Today | www.usatoday.com | 102 000 |
| Sun | www.sun.com | 111 000 |
| Apple | www.apple.com | 82 000 |
| Lycos | www.lycos.com | 193 000 |
| Real | www.real.com | 69 600 |
| Macromedia | www.macromedia.com | 50 200 |
| Adobe | www.adobe.com | 125 000 |
| W3.org | www.w3.org | 57 000 |
| NASA | www.nasa.gov | 87 100 |
| Netscape | www.netscape.com | 92 600 |
| DMOZ | dmoz.org | 804 000 |

This list should be completed...

MoosBerlin,

If you are talking about the number of sites whose highest-PageRank page is...

Some older threads:

PR10 and PR9 sites.

[webmasterworld.com...]

Number of PR 10 and 9 sites:

[webmasterworld.com...]

Geographical differences:

[webmasterworld.com...]

I think it's closer to log base 8.59.

PR10: 9 pages (8.59)

PR9: 74 pages (8.59x8.59) etc

PR8: 634 pages

PR7: 5,445 pages

PR6: 46,770 pages

PR5: 401,753 pages

PR4: 3,451,057 pages

PR3: 29,644,581 pages

PR2: 254,646,947 pages

PR1: 2,187,417,279 pages

This totals 2,475,614,547 (the closest that two decimal places can get). The above assumes PR0 pages are not included in the index. Is that a correct assumption?
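For anyone who wants to play with the base, here is a rough sketch that solves for it by bisection instead of trying two-decimal values by hand. It assumes ten classes (PR1 to PR10, no PR0 pages) holding base^1, base^2, ... pages, and it takes 2,469,940,685 as the index size, which is the figure quoted elsewhere in this thread; both are assumptions, and the function names are made up.

```python
def geometric_total(base, classes=10):
    """Sum of base**1 + base**2 + ... + base**classes."""
    return sum(base**k for k in range(1, classes + 1))

def solve_base(target, classes=10, lo=1.0, hi=100.0, steps=200):
    """Bisection: the total grows with the base, so narrow [lo, hi] around target."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if geometric_total(mid, classes) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

INDEX_SIZE = 2_469_940_685  # assumed target: index size quoted in this thread

base = solve_base(INDEX_SIZE)
print(f"base = {base:.4f}")
for k in range(1, 11):
    print(f"PR{11 - k}: {round(base**k):,} pages")
```

The solved base lands just under 8.59, which fits the remark that 8.59 is the closest two-decimal value.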

ciml, the authors of Using PageRank to Characterize Web Structure [citeseer.nj.nec.com] calculated PageRanks for large subsets of the web. They show that PageRank indeed follows a power law (except for pages with very low PageRank). Interestingly, they also show that PageRank and the number of inbound links are only very little correlated.

The PageRank plots in that paper indicate that more than 40 percent of the pages have a real PageRank smaller than 1 (or 1/N, depending on the algo used). If we start the logarithmic scaling of Toolbar PR at 0.15 (the minimal PR of a page at a damping factor of 0.85) and if we assume a factor of 7 for the scaling, we would have more than 40 percent PR0 pages. At least for low Toolbar PR, the scaling seems to work differently.
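To make that consequence concrete, here is a minimal sketch of the scale being discussed. It is purely hypothetical: it assumes Toolbar PR0 starts at a real PR of 0.15 and that each further Toolbar step needs 7 times more real PR, which is the very assumption the paragraph above argues against. The function name and defaults are made up.

```python
def toolbar_pr(real_pr, minimum=0.15, factor=7.0, top=10):
    """Hypothetical log scale: PR0 covers [minimum, minimum*factor), PR1 the next band, etc."""
    level = 0
    upper = minimum * factor       # upper edge of the current band
    while real_pr >= upper and level < top:
        level += 1
        upper *= factor
    return level

for real in (0.15, 0.5, 1.0, 2.0, 8.0, 100.0):
    print(f"real PR {real:>6}: Toolbar PR{toolbar_pr(real)}")
```

On this scale every page with a real PR below about 1.05 shows PR0, so a web where over 40 percent of pages sit below 1 would show at least that many PR0 pages.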

Another interesting result is that there are some high PR pages that each have a fraction of 0.0005 to 0.001 of the total PR. This could also apply to the whole web (self-similar behaviour of the web), so we may be able to figure out realistic values for the Toolbar scaling.

Here are the 21 sites with PR10 that I know of:

| # | Name | URL | Backlinks |
|---|---|---|---|
| 1 | Google | www.google.com | 243 000 |
| 2 | Yahoo! | www.yahoo.com | 657 000 |
| 3 | Microsoft | www.microsoft.com | 125 000 |
| 4 | USA Today | www.usatoday.com | 102 000 |
| 5 | Sun | www.sun.com | 111 000 |
| 6 | Apple | www.apple.com | 82 000 |
| 7 | Lycos | www.lycos.com | 193 000 |
| 8 | Real | www.real.com | 69 600 |
| 9 | Macromedia | www.macromedia.com | 50 200 |
| 10 | Adobe | www.adobe.com | 125 000 |
| 11 | W3.org | www.w3.org (or w3c) | 57 000 |
| 12 | NASA | www.nasa.gov | 87 100 |
| 13 | Netscape | www.netscape.com | 92 600 |
| 14 | DMOZ | dmoz.org | 804 000 |
| 15 | NSF | www.nsf.gov | 50 900 |
| 16 | Energy.gov | www.energy.gov | 40 700 |
| 17 | White House | www.whitehouse.gov | 36 500 |
| 18 | First Gov | www.firstgov.gov | 59 800 |
| 19 | MIT | mit.edu | 43 300 |
| 20 | Mac | www.mac.com | 14 200 |
| 21 | Nature | www.nature.com | 16 600 |

Markus:

> Interestingly, they also show that PageRank and the number of inbound links are only very little correlated.

Good point, some of those PR10 pages Chris mentions may only have a handful of links.

The way I see the 0.15 figure, it is shared among all URLs in the index. So our starting value may be (0.15/2,469,940,685 ≈ 0.00000000006) for a page with no inbound links.

With an effective Toolbar log base (i.e. log base taking into account final normalisation) of 7, that's about PR -12. My experience is that Googlebot spiders until about PR -4; that's 0.0004 with an effective Toolbar log base 7.

This suggests that Google needs a PR of about 6.9 million times the base amount in order to follow the link. That would seem unlikely, so please someone feel free to point out the error(s) of this thought stream.
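The arithmetic in this post is easy to re-run. The sketch below assumes an effective Toolbar log base of 7, a damping factor of 0.85 and a crawl cut-off at level -4; all three are guesses from this thread, not known Google values. Evaluated directly, the starting value comes to roughly 6.07e-11, which is about -12.1 on a base-7 scale, and the gap to the crawl cut-off is roughly 6.9 million.

```python
import math

INDEX_SIZE = 2_469_940_685   # URLs in the index, as quoted in the post
DAMPING = 0.85               # damping factor, so (1 - d) = 0.15
LOG_BASE = 7                 # assumed effective Toolbar log base
CRAWL_LIMIT_LEVEL = -4       # assumed level at which crawling stops

start_value = (1 - DAMPING) / INDEX_SIZE       # PR of a page with no inbound links
start_level = math.log(start_value, LOG_BASE)  # same value on the log scale
crawl_limit_value = LOG_BASE ** CRAWL_LIMIT_LEVEL
ratio = crawl_limit_value / start_value        # how far above the minimum the cut-off sits

print(f"starting value:    {start_value:.3e}")
print(f"starting level:    {start_level:.2f}")
print(f"crawl limit value: {crawl_limit_value:.6f}")
print(f"limit / start:     {ratio:,.0f}")
```

The result moves a lot with the assumed base, so it is worth re-running with other values before drawing conclusions about the crawl limit.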

> Another interesting result is that there are some high PR pages that each have a fraction of 0.0005 to 0.001 of the total PR...

In a power law distribution, what proportion of the Web would have about that amount of PR? The top 140/2,469,940,685 by any chance?

For a quick start, go to Google and then Google Images, Google Groups, All about Google, then try the same with Yahoo (I didn't), etc. Interestingly, the URL reached by clicking on Google Images from Google's homepage was PR0, but when I went there from the toolbar I reached the PR10 URL. Remember, it's all about the page, not the domain.

@ all: Thanks for the replies; it was a surprise to me that there are more than 20 PR10 pages.

@ dantehemann: I think PR0 pages are included in the index, as the index is Google's database. There are also some pages with no PR (grey toolbar) in the index, which have been spidered and get their PageRank after the next dance.

@ all: The above also works with a small correction:

PR10 - 20 or 30 or 40 (as a correction variable) + 7.xx

PR 9 - correction variable + 7.xx*7.xx

PR 8 - correction variable + 7.xx*7.xx*7.xx

and so on.

> The way I see the 0.15 figure, it is shared among all URLs in the index. So our starting value may be (0.15/2,469,940,685 ≈ 0.00000000006) for a page with no inbound links.

It depends on the algo you use:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Total PR equals the number of pages (N). Average PR is 1. Minimum PR is (1-d) or

PR(A) = (1-d) / N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Total PR is 1. Average PR is 1/N. Minimum PR is (1-d)/N.

In the end, both are the same. The second is a probability distribution for the random surfer reaching a page after clicking on an infinite number of links. The first is an expected value for the same thing when the random surfer starts N times. Important for the log scale is that there is a minimum PR. I prefer the first algo because it avoids calculating with so many decimal places, although I think that Google uses the second one.
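A small sketch may help to see that the two formulas differ only by the factor N. It iterates both on a made-up four-page graph (the pages and links are invented for illustration; every page has at least one outgoing link, which the totals below rely on) with d = 0.85.

```python
def pagerank(links, d=0.85, normalized=False, iterations=200):
    """Iterate PR(A) = base + d * sum(PR(T)/C(T)) over all pages T linking to A.

    base is (1 - d) for the first formula and (1 - d) / N for the second.
    """
    pages = sorted(links)
    n = len(pages)
    base = (1 - d) / n if normalized else (1 - d)
    pr = {p: (1.0 / n if normalized else 1.0) for p in pages}
    for _ in range(iterations):
        nxt = {p: base for p in pages}
        for src, targets in links.items():
            share = d * pr[src] / len(targets)   # PR(T) / C(T), damped
            for t in targets:
                nxt[t] += share
        pr = nxt
    return pr

# Hypothetical graph: D links out but has no inbound links
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

pr_first = pagerank(links, normalized=False)
pr_second = pagerank(links, normalized=True)

for p in sorted(links):
    print(f"{p}: first = {pr_first[p]:.4f}   second = {pr_second[p]:.4f}")
print("totals:", round(sum(pr_first.values()), 6), round(sum(pr_second.values()), 6))
```

Page D, with no inbound links, stays at the minimum in both versions: (1-d) in the first and (1-d)/N in the second.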

> With an effective Toolbar log base (i.e. log base taking into account final normalisation) of 7, that's about PR -12. My experience is that Googlebot spiders until about PR -4; that's 0.0004 with an effective Toolbar log base 7.

The crawl may be based on PR, but I don't think that this applies to very low PR regions. It's a question of how many links a page is away from the starting points of the crawl (which probably is DMOZ). Imagine a page that is 50 links away from DMOZ, but every page in that row has 1000 other links on it. The page may be crawled but the PR of that page could become infinitely close to 0.15 (or 0.15 / N).

IMO, the log scale doesn't apply to low PR regions. Otherwise we would see the Toolbar PR distributions that have been mentioned in this thread and the Toolbar would be pretty worthless and definitely very frustrating for webmasters.

> In a power law distribution, what proportion of the Web would have about that amount of PR? The top 140/2,469,940,685 by any chance?

The paper that I've mentioned used a 1.500.000 pages sample of the web and there were 3 pages that had a fraction of 0.0005 to 0.001 of the total PR. In that paper, they calculated PageRanks for another 100.000 pages sample, and the power law distribution was almost the same (with a factor of 1/x^2.1). I cannot really imagine that the results completely apply to a web of 2.5 billion pages but it sounds reasonable that a small number of pages (the root pages of Google and Yahoo?) have such a fraction of the total PR. If there are still 140 PR10 pages, PR10 probably starts at a fraction of 0.0001 of the total PR of the web. (Just a guess. I'll have to think about it...)

"The PageRank plots in that paper indicate that more than 40 percent of the pages have a real PageRank smaller than 1" - Markus

Thanks for the reference to the paper. However, I can't seem to be able to determine how you arrived at that 40% number from the graphs. Any help would be appreciated.

Excellent link there Markus - I'd not read that one.

[citeseer.nj.nec.com...]

And also, this is a very nice piece of work explaining the update process [dance.efactory.de].

I think both offer some background worth reading.

Thanks for the link, Brett. :)

gmoney, a "real" PR of 1 equals 1/1690000 with the algo used in that paper and at a sample of 1.690.000 pages. 1/1690000 is about 6e-7. So, if you add the first 5 fractions of PR in the plot, you should get the total fraction of pages with PR < 1. This is roughly 0.11 + 0.09 + 0.12 + 0.06 + 0.04 = 0.42.

This is just an estimation based on a plot. Most of all, we don't know how they aggregated the pages to a fraction. Intuitively, I'd say that more than half of the pages must have a PR smaller than one, because the average PR of a page is one and in such a distribution the median has to be below the average.
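The median-below-average argument can be checked on an idealised distribution. The sketch below assumes a pure Pareto distribution whose density falls off as 1/x^2.1 (the exponent mentioned in this thread, i.e. shape 1.1), rescaled so the average is 1. It ignores the deviation at very low PR that the paper reports, so treat it as an illustration, not a reconstruction of their data.

```python
import statistics

ALPHA = 1.1    # Pareto shape; a density proportional to 1/x**2.1 has shape 2.1 - 1
N = 100_000    # number of grid points standing in for pages

# Deterministic "sample": the Pareto quantile function on an even grid of probabilities
values = [(1 - (i + 0.5) / N) ** (-1 / ALPHA) for i in range(N)]

mean = sum(values) / N
scaled = [v / mean for v in values]          # rescale so the average PR is 1

median = statistics.median(scaled)
below_average = sum(v < 1 for v in scaled) / N

print(f"median of rescaled values: {median:.3f}")
print(f"share of pages below the average: {below_average:.1%}")
```

With a tail this heavy a handful of huge values pulls the average up, so the median ends up well below 1 and most pages sit below the average.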

Markus,

never checked your profile site until Brett mentioned it.

Well done, very nice overview.

> The crawl may be based on PR, but I don't think that this applies to very low PR regions. It's a question of how many links a page is away from the starting points of the crawl (which probably is DMOZ). Imagine a page that is 50 links away from DMOZ, but every page in that row has 1000 other links on it. The page may be crawled but the PR of that page could become infinitely close to 0.15 (or 0.15 / N).

Interesting idea, if true, it could mean that you are better off making a site-map with 1000 links to your inner pages than having inner pages linking in the A -> B -> C -> D -> E etc manner.

Markus, I agree that either model can be used to think about PageRank flow. I suspect that the second one (with the normalisation constant) may be more helpful for finding a minimum PR value.

> Important for the log scale is that there is a minimum PR.

I am intrigued by that comment. In a thread [webmasterworld.com] last month, gmoney convinced me that the normalisation constant (for Toolbar purposes, after PageRank calculation is finished) is equivalent to a change in the log base when comparing different PR values.

When considering the results in power law distribution mode, not absolute values, I would expect the 1.5 million page sample to behave in a very similar way to Google's 2.5 billion URL index.

vitaplease:

> if true, it could mean that you are better off making a site-map with 1000 links to your inner pages than having inner pages linking in the A -> B -> C -> D -> E etc manner.

Indeed, but I find that this is not the case. For links within a domain, I find that a chain of links (A -> B -> C etc.) with about 250 links on each page is crawled only about two links past Toolbar PR0, but with three links on each it is crawled about nine links past PR0. This corresponds approximately to -4 on the Toolbar PageRank scale.

I often ponder over whether this is actually a PageRank limit, or if it's like Markus' "how many links a page is away from the starting points of the crawl" approach, where following from one page with two links out counts for as much distance as following two links deep with one link on each page.

In the former, the decay factor would multiply by how deep, while in the latter it would presumably not be taken into account. Therefore, if we knew the log scale and decay factor, then maybe we could answer that question?

This 33 message thread spans 2 pages: 33
