Forum Moderators: open
[edited by: MOOSBerlin at 6:35 pm (utc) on Oct. 24, 2002]
What you describe makes perfect sense. The numbers don't work out exactly like that, but the logarithmic scale on the Toolbar PageRank graph will reflect the link structure of the Web to a good degree. PageRank graphs could be based on a rank distribution, but it would seem much easier to use some log of the raw data and let the natural structure of the Web bring it into line.
Power law distributions (Zipf, Pareto) tend to reflect the properties of large populations. Similar curves can be seen with referrers or browser IP addresses if you look at the logs of a popular site (as Jakob Nielsen did some time ago) or the aggregate logs over a number of sites (as I do when I should be working).
This list should be completed... if you are talking about the number of sites whose highest-PageRank page is...
Some older threads:
PR10 and PR9 sites.
[webmasterworld.com...]
Number of PR 10 and 9 sites:
[webmasterworld.com...]
Geographical differences:
[webmasterworld.com...]
PR10: 9 pages (8.59)
PR9: 74 pages (8.59x8.59) etc
PR8: 634 pages
PR7: 5,445 pages
PR6: 46,770 pages
PR5: 401,753 pages
PR4: 3,451,057 pages
PR3: 29,644,581 pages
PR2: 254,646,947 pages
PR1: 2,187,417,279 pages
This totals 2,475,614,547 (the closest a two-decimal-place factor can get). The above assumes PR0 pages are not included in the index. Is that a correct assumption?
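The geometric progression behind that list is easy to verify; a quick sketch that just raises the 8.59 factor from above (small rounding differences of a page or two remain, depending on where you round):

```python
# Sanity check of the list above: each Toolbar PR step down holds
# roughly 8.59x as many pages as the one above it.
factor = 8.59
counts = {pr: round(factor ** (11 - pr)) for pr in range(10, 0, -1)}
total = sum(counts.values())
# e.g. counts[10] == 9, counts[9] == 74, counts[6] == 46770,
# and total comes out within a few pages of 2,475,614,547
```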
The PageRank plots in that paper indicate that more than 40 percent of the pages have a real PageRank smaller than 1 (or 1/N, depending on the algo used). If we start the logarithmic scaling of Toolbar PR at 0.15 (the minimal PR of a page at a damping factor of 0.85) and if we assume a factor of 7 for the scaling, we would have more than 40 percent PR0 pages. At least for low Toolbar PR, the scaling seems to work in another way.
Another interesting result is that there are some high PR pages that each have a fraction of 0.0005 to 0.001 of the total PR. This could apply to the whole web as well (self-similar behaviour of the web), so we may be able to figure out realistic values for the Toolbar scaling.
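To make the PR0 point concrete: if Toolbar PR n covered real PR from 0.15 * 7^n upward (an assumption pieced together from the figures in this thread, not a known formula), the buckets would look like this:

```python
d = 0.85
min_pr = 1 - d   # 0.15, the floor PR of any page
factor = 7       # assumed Toolbar scaling factor

# guessed lower bound of real PR for each Toolbar PR value
buckets = {n: min_pr * factor ** n for n in range(11)}
# buckets[1] is about 1.05, so under this guess every page with
# real PR below ~1 (over 40% of the web, per the paper) shows PR0
```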
 # Name         URL                  Backlinks
*********************************************
 1 Google       www.google.com         243 000
 2 Yahoo!       www.yahoo.com          657 000
 3 Microsoft    www.microsoft.com      125 000
 4 USA Today    www.usatoday.com       102 000
 5 Sun          www.sun.com            111 000
 6 Apple        www.apple.com           82 000
 7 Lycos        www.lycos.com          193 000
 8 Real         www.real.com            69 600
 9 Macromedia   www.macromedia.com      50 200
10 Adobe        www.adobe.com          125 000
11 W3.org       www.w3.org (or w3c)     57 000
12 NASA         www.nasa.gov            87 100
13 Netscape     www.netscape.com        92 600
14 DMOZ         dmoz.org               804 000
15 NSF          www.nsf.gov             50 900
16 Energy.gov   www.energy.gov          40 700
17 White House  www.whitehouse.gov      36 500
18 First Gov    www.firstgov.gov        59 800
19 MIT          mit.edu                 43 300
20 Mac          www.mac.com             14 200
21 Nature       www.nature.com          16 600
Good point; some of those PR10 pages Chris mentions may have only a handful of links.
The way I see the 0.15 figure, it is shared among all URLs in the index. So our starting value may be 0.15 / 2,469,940,685 = 0.00000000006 for a page with no inbound links.
With an effective Toolbar log base (i.e. log base taking into account final normalisation) of 7, that's about PR -12. My experience is that Googlebot spiders until about PR -4; that's 0.0004 with an effective Toolbar log base 7.
This suggests that Google needs a PR of about 6.6 million times the base amount in order to follow the link. That would seem unlikely, so please someone feel free to point out the error(s) of this thought stream.
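Redoing that log-base-7 arithmetic (purely illustrative; the 2,469,940,685 index size and the log base 7 are this thread's assumptions, not known Google constants):

```python
import math

N = 2_469_940_685            # index size assumed in this thread
base_pr = 0.15 / N           # starting PR of a page with no inbound links
crawl_floor = 0.0004         # apparent crawl cut-off from this thread

def log7(x):
    return math.log(x) / math.log(7)

print(log7(base_pr))          # about -12
print(log7(crawl_floor))      # about -4
print(crawl_floor / base_pr)  # about 6.6 million
```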
> Another interesting result is that there are some high PR pages that each have a fraction of 0.0005 to 0.001 of the total PR...
In a power law distribution, what proportion of the Web would have about that amount of PR? The top 140/2,469,940,685, by any chance?
@ dantehemann: I think PR0 pages are included in the index, as the index is Google's database. There are also some PR-none pages (grey toolbar) in the index, which have been spidered and get their PageRank after the next dance.
@ all: The above also works with a small correction term:
PR10: correction (20, 30, or 40) + 7.xx
PR9: correction + 7.xx*7.xx
PR8: correction + 7.xx*7.xx*7.xx
and so on.
It depends on the algo you use:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Total PR equals the number of pages (N). Average PR is 1. Minimum PR is (1-d) or
PR(A) = (1-d) / N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Total PR is 1. Average PR is 1/N. Minimum PR is (1-d)/N.
In the end, both are the same. The second is the probability distribution of the random surfer reaching a page after clicking an infinite number of links. The first is an expected value for the same thing when the random surfer starts N times. Important for the log scale is that there is a minimum PR. I prefer the first algo because you don't have to calculate with so many decimal places, although I think that Google uses the second one.
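The equivalence of the two algos is easy to see on a toy graph. This sketch (an illustration, not Google's implementation) iterates both versions and shows they differ only by the factor N:

```python
d = 0.85
# toy web: A links to B and C; B links to C; C links to A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
N = len(pages)

def pagerank(base):
    """Iterate PR(A) = base + d * sum(PR(T)/C(T)) to convergence."""
    pr = {p: base for p in pages}
    for _ in range(200):
        pr = {p: base + d * sum(pr[q] / len(links[q])
                                for q in pages if p in links[q])
              for p in pages}
    return pr

pr1 = pagerank(1 - d)        # total N, average 1, floor (1-d)
pr2 = pagerank((1 - d) / N)  # total 1, average 1/N, floor (1-d)/N
# pr2[p] equals pr1[p] / N for every page p
```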
> With an effective Toolbar log base (i.e. log base taking into account final normalisation) of 7, that's about PR -12. My experience is that Googlebot spiders until about PR -4; that's 0.0004 with an effective Toolbar log base 7.
The crawl may be based on PR, but I don't think that this applies to very low PR regions. It's a question of how many links a page is away from the starting points of the crawl (which probably is DMOZ). Imagine a page that is 50 links away from DMOZ, but every page in that row has 1000 other links on it. The page may be crawled but the PR of that page could become infinitely close to 0.15 (or 0.15 / N).
IMO, the log scale doesn't apply to low PR regions. Otherwise we would see the Toolbar PR distributions that have been mentioned in this thread and the Toolbar would be pretty worthless and definitely very frustrating for webmasters.
> In a power law distribution, what proportion of the Web would have about that amount of PR? The top 140/2,469,940,685, by any chance?
The paper that I've mentioned used a 1,500,000-page sample of the web, and there were 3 pages that each had a fraction of 0.0005 to 0.001 of the total PR. In that paper, they calculated PageRanks for another 100,000-page sample, and the power law distribution was almost the same (with a factor of 1/x^2.1). I cannot really imagine that the results apply fully to a web of 2.5 billion pages, but it sounds reasonable that a small number of pages (the root pages of Google and Yahoo?) hold such a fraction of the total PR. If there are still 140 PR10 pages, PR10 probably starts at a fraction of 0.0001 of the total PR of the web. (Just a guess. I'll have to think about it...)
And also, this is very nice work on the update process explanation [dance.efactory.de].
I think both offer some background worth reading.
gmoney, a "real" PR of 1 equals 1/1,690,000 with the algo used in that paper and at a sample of 1,690,000 pages. 1/1,690,000 is about 6e-7. So, if you add the first 5 fractions of PR in the plot, you should get the total fraction of pages with PR < 1. This is roughly 0.11 + 0.09 + 0.12 + 0.06 + 0.04 = 0.42.
This is just an estimate based on a plot. Above all, we don't know how they aggregated the pages into fractions. Intuitively, I'd say that more than half of the pages must have a PR smaller than one, because the average PR of a page is one and in such a skewed distribution the median has to be below the average.
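The median-below-average point can be illustrated by sampling from a power law with the 1/x^2.1 density mentioned above (a synthetic sample, not real PageRank data):

```python
import random

random.seed(42)
# paretovariate(1.1) has density ~ 1/x^2.1, as in the cited paper
sample = sorted(random.paretovariate(1.1) for _ in range(100_000))
mean = sum(sample) / len(sample)
median = sample[len(sample) // 2]
# in such a heavily right-skewed distribution the median sits
# well below the mean, so most pages are below average
```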
Never checked your profile site until Brett mentioned it.
Well done, very nice overview.
> The crawl may be based on PR, but I don't think that this applies to very low PR regions. It's a question of how many links a page is away from the starting points of the crawl (which probably is DMOZ). Imagine a page that is 50 links away from DMOZ, but every page in that row has 1000 other links on it. The page may be crawled but the PR of that page could become infinitely close to 0.15 (or 0.15 / N).
Interesting idea. If true, it could mean that you are better off making a site map with 1000 links to your inner pages than having inner pages link in the A -> B -> C -> D -> E manner.
> Important for the log scale is that there is a minium PR.
I am intrigued by that comment. In a thread [webmasterworld.com] last month, gmoney convinced me that the normalisation constant (for Toolbar purposes, after PageRank calculation is finished) is equivalent to a change in the log base when comparing different PR values.
When considering the results in power law distribution mode, not absolute values, I would expect the 1.5 million page sample to behave in a very similar way to Google's 2.5 billion URL index.
vitaplease:
> if true, it could mean that you are better off making a site-map with 1000 links to your inner pages than having inner pages linking in the A -> B -> C -> D -> E etc manner.
Indeed, but I find that this is not the case. For links within a domain, I find that a chain of links (A -> B -> C etc.) with about 250 links on each page is crawled only about two links past Toolbar PR0, but with three links on each page is crawled about nine links past PR0. This corresponds approximately to -4 on the Toolbar PageRank scale.
I often ponder whether this is actually a PageRank limit, or whether it's like Markus' "how many links a page is away from the starting points of the crawl" approach, where following from one page with two links out counts for as much distance as following two links deep with one link on each page.
In the former, the decay factor would multiply with depth, while in the latter it presumably would not be taken into account. So if we knew the log scale and decay factor, maybe we could answer that question?
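One way to test the PageRank-limit reading of those numbers: if each hop down a chain passes on a fraction d/k of the PR (k links per page), the 250-link and 3-link cases can be compared against an assumed crawl cut-off. This sketch (the 1e-5 cut-off is a guess, not an observed constant) happens to reproduce the two-versus-nine pattern:

```python
import math

d = 0.85  # damping factor

def chain_depth(outlinks, cutoff=1e-5):
    """Links down a chain before the PR passed on drops below
    `cutoff`, when every page in the chain carries `outlinks` links."""
    per_hop = d / outlinks
    return math.floor(math.log(cutoff) / math.log(per_hop))

print(chain_depth(250))  # 2 links, matching the 250-link observation
print(chain_depth(3))    # 9 links, matching the 3-link observation
```

That the same cut-off fits both cases hints at a PR limit rather than a pure link-distance limit, though it hardly settles the question.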