If by "root" you mean the log factor, then I can follow part of your write-up.
Please do not forget that the PageRank gift is diluted by the number of links on the page linking to you, that there is a damping factor to consider, plus all the new undocumented variables the Googlers have added in recent years :).
To discuss the main point of your post:
Which log factor would be ideal?
I am no mathematician, but I would say the ideal log factor would have to do with the total number of links and webpages in the index, and would present some form of normalised distribution.
Related questions I ask myself are:
- Should one average Pagerank 9 link be credited as much as 100 average Pagerank 7 links (all other things being equal and considering a log factor of 10)? I do not think so.
- Should the number of links dilution be linear at all times? I would say not, if the page also has outbound links to external sites.
- Should the link dilution or damping factor be the same from an index page and an internal page? I would think links from a homepage are more important.
- Should a site get more PageRank credit if it has an even distribution of inbound links, PageRank-wise? (Is it generally considered interesting, or are there just a few important pages giving all the credit?)
> Should one average Pagerank 9 link be credited as much as 100 average Pagerank 7 links (all other things being equal and considering a log factor of 10)? I do not think so.
Why not? The PR 9 page will have 100 times as many incoming links (all other things being equal)...
> Should the number of links dilution be linear at all times? I would say not, if the page also has outbound links to external sites.
It pretty much has to be linear, otherwise the algorithm doesn't converge. The earlier links on a page may transfer more than the later ones, but the total has to be fixed.
> Should the link dilution or damping factor be the same from an index page and an internal page? I would think links from a homepage are more important.
Already taken into account - if the homepage is more important it will probably have a higher Pagerank.
> Should a site get more Pagerank credit if it has an even distribution of inbound links, PageRank-wise?
This may affect rankings, but I don't think it can affect PageRank per se (without greatly complicating the PR calculation).
My impression is that the PageRank algorithm hasn't changed much, if at all, since the original Google - the refinements have gone into the ranking algorithm, of which PR is just one indirect input.
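To make the linearity and convergence point concrete, here is a minimal sketch of the standard power-iteration computation from the Brin and Page papers. The four-page graph, damping factor, and iteration count are illustrative assumptions, not anything Google has published:

```python
# Minimal PageRank power iteration, following the simplified formula
# PR(p) = (1-d)/N + d * sum(PR(q)/outlinks(q) for q linking to p).
# The four-page graph here is made up purely for illustration.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}
        for q, outs in links.items():
            if outs:  # each outlink gets an equal (linear) share
                share = d * pr[q] / len(outs)
                for p in outs:
                    new[p] += share
            else:  # dangling page: spread its rank evenly (one common fix)
                for p in pages:
                    new[p] += d * pr[q] / n
        pr = new
    return pr

demo = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(demo))
```

The equal share per outlink is exactly the "linear dilution" being discussed; if that share grew superlinearly with the number of links, total PageRank would no longer be conserved from one iteration to the next and the computation could fail to settle.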
> Should one average Pagerank 9 link be credited as much as 100 average Pagerank 7 links (all other things being equal and considering a log factor of 10)? I do not think so.
> Why not? The PR 9 page will have 100 times as many incoming links (all other things being equal)...
I understand; maybe this example makes what I mean clearer.
It's not about a PR9 site not deserving to be PR9 because it is receiving 100 times more links.
A webpage of obscurity buys one PR9 link and (through the limited number of links on the PR9 page) receives a PR8. Is this fair?
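To put rough numbers on that example, here is a minimal sketch, assuming a toolbar log base of 10, a damping factor of 0.85, and 10 links on the PR9 page (all three are assumptions made purely for the sake of arithmetic):

```python
# Back-of-envelope version of the example above: an obscure page
# receives one link from a PR9 page. The base, damping factor and
# link count below are illustrative guesses, not known Google values.
import math

BASE = 10   # assumed toolbar log base
D = 0.85    # damping factor

pr9_raw = BASE ** 9       # raw PageRank implied by toolbar PR9
outlinks = 10             # assume the PR9 page carries 10 links

gift = D * pr9_raw / outlinks     # raw PageRank passed per link
toolbar = math.log(gift, BASE)    # back onto the toolbar scale
print(f"gift = {gift:.3g} raw PR, roughly toolbar PR {toolbar:.1f}")
# -> roughly toolbar PR 7.9, i.e. close to the PR8 in the example
```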
> Should the number of links dilution be linear at all times? I would say not, if the page also has outbound links to external sites.
> It pretty much has to be linear, otherwise the algorithm doesn't converge. The earlier links on a page may transfer more than the later ones, but the total has to be fixed.
I cannot argue about how feasible these distortions would be computation-wise.
>...the refinements have gone into the ranking algorithm, of which PR is just one indirect input.
I agree, but what about importance? For example, take an incoming link with anchor text containing "blue-widget" from a PR9 page. Would you say that counts more in ranking than 100,000 incoming links with anchor text containing "blue-widget" from 100,000 separate PR4 pages (or whatever the number should be)?
I do not think so.
> I agree, but what about importance? For example, take an incoming link with anchor text containing "blue-widget" from a PR9 page. Would you say that counts more in ranking than 100,000 incoming links with anchor text containing "blue-widget" from 100,000 separate PR4 pages (or whatever the number should be)?
Theoretically - yes. Forget the PR itself for a moment.
One link from CNN's home page would make you rich.
100k links from 100k janies' guestbooks might bring you some traffic.
I'm guessing that's the beauty of the algo.
It's important to keep in mind that the log base is just a Toolbar thing. We do not know that Google's factoring of PageRank into their weighting mechanisms scales by the same curve. It seems likely that the scaling is logarithmic (study Pareto or Zipf for why), but I don't know anyone who's tested it properly. Or if they have, they're not telling.
vitaplease:
> Should one average Pagerank 9 link be credited as much as 100 average Pagerank 7 links
I agree with danny. That's half the point of PageRank. I think you'll agree that it works better than the earlier link popularity attempts of other engines.
If a "webpage of obscurity" buys one PR9 link then they've benefited from the link map of the Web, and therefore PageRank. If the PR9 site adds a page on a new and obscure topic, is this any different?
> Should the number of links dilution be linear at all times?
Yes, IMO. Otherwise the integrity of PageRank is under threat.
If you give extra weight (over what would be given now) to pages with many links, then we might be able to create a PageRank 'perpetual motion machine'. In other words, it might be possible to create a PageRank sink, causing a great perturbation in the fabric of the index and preventing convergence.
If you give less weight (over what would be given now) to pages with many links, then you'll not achieve much, as the PR transfer is already heavily damped in that case.
> Should a site get more Pagerank credit if it has an even distribution of inbound links, PageRank-wise?
I agree with danny again. We might even be back to the rank sink problem by doing that. On the other hand, factoring in a 'traditional link pop' element separately from PageRank is sensible, and I suspect that Google may even be doing that now (not 100% sure).
danny:
> If the log base were 2, PageRank would have to range from 0 to maybe 25 in order to fit the same variation currently fitted into 0 to 10.
The scale is normalised, so that isn't a problem. The top of notch 10 on the Toolbar is the highest PR page.
vitaplease:
> Should one average Pagerank 9 link be credited as much as 100 average Pagerank 7 links

ciml:
> I agree with danny. That's half the point of PageRank. I think you'll agree that it works better than the earlier link popularity attempts of other engines.
I am not arguing against PageRank, nor denying its success over other link popularity attempts. I am just wondering if, above a certain PageRank threshold, the "PageRank gift" is out of proportion.
If Google treated a link from a PageRank 9 page as, e.g., ten times as important as a PageRank 7 link (PageRank-gift-wise) instead of 100 times, it would put things more into perspective. It would also limit the effect of buying high-PageRank links.
Is Google's vote (a link from a PR10 Google webpage) really worth 1000 times more than a vote from a PageRank 7 page? No, Google just happens to get an extravagant number of links web-wide through the search service it provides (the same goes for Adobe with its PDF download).
I think Google does not follow/index many of the outbound external links on its pages for exactly that reason (remember the newspapers Google was linking to?).
> The scale is normalised, so that isn't a problem. The top of notch 10 on the Toolbar is the highest PR page.
This seems like a reasonable thing for Google to do, but I would be interested to hear a few reasons why we can be confident that this is true.
Also, could somebody expand on what is meant by "normalized" in the context of ciml's post? Does it refer to multiplying the logs of raw PageRank values by some number so that the maximum result is just less than 11?
If the toolbar were normalized in this manner then it would seem that the PR log scale (PR log base) is defined by the highest PageRank page (i.e. multiplying a log by a number is effectively like changing the log’s base).
> “The scale is normalised, so that isn't a problem. The top of notch 10 on the Toolbar is the highest PR page.” - ciml
> This seems like a reasonable thing for Google to do but I would be interested to hear a few reasons why we can be confident that this is true. - gmoney
How else can it be?
I believe to normalize in that context means to apply a function that would convert a specified domain into a specified range.
So something like y = f(x), for x ∈ (0, 33453464564645] -> y ∈ (0, 10].
Below are a couple of doubtful possibilities, but they are enough to make me wonder:
Maybe Google picks the log scale so that one of their pages is the cutoff point for PR10 and then calls every higher page a PR10 as well.
Maybe Google picks the log scale so there are a certain number (or percentage) of PR10 (or higher) pages and if it happens that a few are PR11 they just call them PR10.
Maybe Google selects the log scale to give a “best fit” of what Google thinks is the most desirable distribution of the number of pages at each toolbar PR level.
Anyway, I agree that ciml's statement is the most likely possibility, but I was just wondering if anybody had more thoughts on the matter. A couple of things that trouble me with the “top of notch 10” idea are that the scale is based on a spike in the data, and that I don't know how it will affect the distribution of pages at each toolbar level as the internet grows.
“. . . apply a function . . .” – bcc1234
I gave one possibility for the function in my earlier post but I am not sure if it is the same function that ciml was referring to.
> Maybe Google picks the log scale so that one of their pages is the cutoff point for PR10 and then calls every higher page a PR10 as well.
Why would that matter?
I'm pretty sure Google does not use that scale to rank the results.
They might use it to assign different crawl depths and for other tasks that require classification, because those would have to be relative to the absolute max PR, but that's not so important, imo.
But in the case of comparing two URLs and deciding which one should go first, I don't see a point in performing additional conversions to the absolute PR of the URLs.
It's like comparing 15000 ~ 14854 and 0.157346 ~ 0.13654: in both cases the first one is higher.
How would one benefit from knowing their function for converting absolute PR into 1-10 scale?
First off: can anybody tell me how to use the Quote function? TIA.
ciml,
You mentioned a survey about the average number of links per page.
Well, in "The PageRank Citation Ranking: Bringing Order to the Web", section 3, they wrote: "Since there are about 11 links on an average page (depending on what you count as a link)..."
Given the authority of the authors ;-), could you please elaborate on your reasoning for relating the log base to the number of links per page?
Thanks,
PS: after previewing my post, I have another favour to ask: how to enable the smiling faces?
“How would one benefit from knowing their function for converting absolute PR into 1-10 scale?” - bcc1234
I am trying to find out the relationship between the toolbar PR and the absolute PageRank in order to get a better idea of the absolute PageRank of various Web pages. I feel that Google intertwines absolute PageRank with many of the “100 variables” it uses to determine ranking. I have no idea how their algorithm does this and I probably never will. However, I am pretty confident that the higher the absolute PageRank, the better it is for the rankings. By understanding the relationship between absolute PageRank and the toolbar PR, I hope to get a better idea of the answers to questions like (from a PageRank perspective): which directory listing is better, which paid directory is a better value, who benefits most from a particular link exchange, etc.
The questions I asked in my earlier posts are just efforts to put another piece in the puzzle. By themselves, they aren't of any value, but hopefully if a lot of the puzzle can be put together it may be of some value. Mostly, though, I am just curious about PageRank, and at times the pursuit of understanding it can be fun. Besides, trying to put the puzzle together usually spins off into many interesting and beneficial tangents.
> Is Google's vote (a link from a PR10 Google webpage) really worth 1000 times more than a vote from a Pagerank 7 page?
In Google's case, I'd say yes. I don't think that the world is as interested in who I link to from PR7 pages as they are in who Google links to from their PR10 page. To put it another way, anything that Google links to is probably pretty important for many people.
If we were comparing PR8s and PR6s, then I would find it less easy to disagree.
gmoney:
> If the toolbar were normalized in this manner then it would seem that the PR log scale (PR log base) is defined by the highest PageRank page (i.e. multiplying a log by a number is effectively like changing the log’s base).
Changing the log base would be different. It would affect the relative dilution of the number of links on a page from update to update. I can't say that that doesn't happen but a changing log base should, I believe, change the decay factor. The decay factor seems pretty constant.
julinho, many thanks for the "links on an average page" quote from Bringing Order to the Web. I don't know how many times I've read that, yet this little nugget passed me by. On that basis, the Toolbar log scale should be eleven (or whatever Google now believes that figure to be). I doubt that considerably, but it wouldn't surprise me if the use of PageRank in weighting did follow such a figure. This is pure speculation on my part; it just seems to make sense.
Toolbar PR = log_b(PageRank + d)
Where “b” is the log base (log scale), "PageRank" is defined in the Brin and Page papers with the average PageRank set to 1, and “d” is the damping factor usually set around 0.85 (I just included it so there wouldn’t be negative numbers for the toolbar - maybe it is something else like 0.8 or 0.9).
1) Can you please define what you mean by Toolbar log scale?
To force the toolbar to take values between 0 and 11 you could include various constants (c1 and c2) as follows:
Toolbar PR = c1*log_b(PageRank + d) # This is the approach I mentioned earlier.
Toolbar PR = log_b(c2*(PageRank + d)) # I am starting to assume that this is what you mean by normalization?
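To see what the two constants do, here is a quick sketch (the base and constant values are placeholder guesses, not known Google figures):

```python
# gmoney's two candidate toolbar formulas, side by side. The base and
# the constants below are placeholder guesses, not known Google values.
import math

def toolbar_v1(pagerank, b=8.0, c1=1.0, d=0.85):
    return c1 * math.log(pagerank + d, b)    # Toolbar PR = c1*log_b(PageRank+d)

def toolbar_v2(pagerank, b=8.0, c2=1.0, d=0.85):
    return math.log(c2 * (pagerank + d), b)  # Toolbar PR = log_b(c2*(PageRank+d))

for pr in (1, 100, 10_000, 1_000_000):
    print(pr, round(toolbar_v1(pr), 2), round(toolbar_v2(pr), 2))
```

Note that c2 only shifts every toolbar value by the constant log_b(c2), while c1 rescales them; that is the earlier point that multiplying a log by a number is effectively a change of base.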
2) Can you please define what you mean by normalization?
”Changing the log base would be different. It would affect the relative dilution of the number of links on a page from update to update. I can't say that that doesn't happen but a changing log base should, I believe, change the decay factor. The decay factor seems pretty constant.” - ciml
With my interpretation of the log scale and my approach to forcing the toolbar to take values between 0 and 11, there is no effect on the "relative dilution" or "decay factor" regarding PageRank, but it would have an impact on what is perceived through the toolbar. I am under the impression that actual PageRank is what is important in rankings and that the Toolbar is just a blurred window showing us rough outlines of PageRank, often just guesses.
ciml, I value your comments and I hope you can find the time to answer my two questions above. :)
ciml,
I agree with others who said that what really matters is the absolute value of PageRank (which, as far as I can see, doesn't depend on any log base), not its graphical representation (which depends on the base). Google could change that hypothetical base at any time, and the rankings (what really matters to Google and surfers) would remain the same.
That's why I find it strange that you try to correlate that base to the number of links on a page; actually, as the web branches out, I think that the average number of links per page is growing (more good sites to link to/from).
The point of knowing that base, whatever it is, is that you would have another piece of info to measure your sites against the others. Suppose my PR is 4 and I know my competitors have PR 7; if the base is 5, I can estimate that I would have to collect about 125 times the PageRank I have gathered so far to reach PR 7 (which seems feasible); if, however, the base is 11, then I would have to go after about 1300 times the PageRank I already have (maybe time to start new sites or explore other niches).
Hope this was clear.
:)
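julinho's arithmetic is easy to verify: going from toolbar PR 4 to PR 7 is three notches, i.e. a factor of base**3 in raw PageRank (sketch, using his two hypothetical bases):

```python
# Three notches at base 5 vs. base 11, per julinho's example.
for base in (5, 11):
    print(base, base ** 3)   # -> 125 and 1331 (his "about 1300")
```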
Yes, c2 is the normalisation constant. The PageRank value has already been normalised at each iteration. The normalisation variable in that process is 1/Sum_Of_All_PageRank, to keep the total PageRank the same for each iteration. It's in the backrub papers; I can't see why or how Google would change that.
The c2 figure is 10/Max(log_b(PageRank)), where Max() is just the highest value in the set. This sets the top URL in the index to 10, and the rest below.
For the values I use above, 10 on the Toolbar would be from 9 to 10, 9 would be from 8 to 9, and 0 would be from negative infinity to zero.
This is just my model of how it works; GoogleGuy may well sit behind his desk and chuckle about these threads.
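Turning that model into a sketch (the PageRank values and the base are invented; only the c2 = 10/Max(log_b(PageRank)) normalisation follows the description above):

```python
# Sketch of the normalisation described above: pick c2 so that the
# highest-PR URL in the index lands exactly on toolbar 10. The PR
# values and the base are invented for illustration.
import math

def toolbar(pageranks, b=10.0):
    c2 = 10.0 / max(math.log(pr, b) for pr in pageranks.values())
    return {url: c2 * math.log(pr, b) for url, pr in pageranks.items()}

fake_index = {"top-url": 1e9, "big-site": 2e7, "ordinary-page": 3e3}
for url, t in toolbar(fake_index).items():
    print(url, round(t, 2))   # top-url comes out at exactly 10.0
```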
> I am under the impression that actual PageRank is what is important in rankings and that the Toolbar is just a blurred window...
Yes, it's very blurred. If the Toolbar had a higher resolution then someone would have posted the log base months ago.
As for being important to rankings, it's as julinho says. You can see the value of a link relative to other links. Is it better to have a link from a Toolbar PR7 page with 100 links or a Toolbar PR6 page with 14 links? It depends on the log base. If you believe that the base is 6 then the PR6 page with 14 links is better, if you believe it's 10 then the PR7 page with 100 links is better.
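That comparison is easy to work through numerically. A sketch, assuming (as elsewhere in this thread) that toolbar notch n corresponds to a raw PageRank of roughly base**n, with 85% of it passed on:

```python
# ciml's example: a PR7 page with 100 links vs. a PR6 page with 14.
# Assumes raw PR ~ base**notch and a damping factor of 0.85; both are
# working assumptions from this thread, not known Google values.
D = 0.85

def per_link_gift(notch, links, base):
    return D * base ** notch / links

for base in (6, 10):
    pr7 = per_link_gift(7, 100, base)
    pr6 = per_link_gift(6, 14, base)
    better = "PR7/100-link page" if pr7 > pr6 else "PR6/14-link page"
    print(f"base {base}: {better} wins ({pr7:.3g} vs {pr6:.3g})")
```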
Also, does a PR5 page with a good title beat a PR6 page with a lousy title? Most people reading here probably have a feel for that, but if you want to measure and predict then you need to know the scale of your tool.
The reason I would correlate the Toolbar log base to the mean or median number of links per page is just that the Toolbar is meant to be a predictor of importance. Any log scale fits the Pareto 80/20 principle and therefore the Web (search for Jakob Nielsen's Zipf curve article for more on that), so the log base doesn't matter. The notches might as well map to something, so why not the average decay from one link to another?
Welcome to WebmasterWorld, julinho.
Toolbar PR = c2*(log_b(PageRank)) – ciml
After changing the log base to incorporate c2, the equation reads:
Toolbar PR = log_b2(PageRank)
Where “b2” (log base) is such that this equation is mathematically identical to ciml’s equation. I'll refer to this equation as the (proposed) "PR log scale equation".
Next I’ll show how the "PR log scale equation" forces “b2” to be less than 8.7. Since Google currently has 2.5*10^9 web pages, the maximum possible PageRank of any page is 2.5*10^9 (i.e. all other pages would have a PageRank of 0). From this you can determine that:
b2 < (2.5*10^9)^(1/10) = 8.7
Thus the maximum possible log base for the 2.5 billion pages index is 8.7. Obviously, one page does not come close to hogging all the PageRank and it follows that the log base does not come close to 8.7.
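That worst case is a one-line check:

```python
# gmoney's worst case: one page holds all the PageRank of a
# 2.5-billion-page index, so b2**10 = 2.5e9 at toolbar notch 10.
N = 2.5e9
print(round(N ** (1 / 10), 2))  # -> about 8.7
```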
Next I try to get a more accurate upper bound on the log base. The first post in How many sites per PageRank toolbar value? [webmasterworld.com] lists the approximate number of sites with pages above various Toolbar PR values. Using these numbers for PR10 through PR6 only (since there was some skepticism about PR5), I found that:
b2 < 4.3 (the log scale/base is less than 4.3)
The following efforts were made to try and maximize the upper bound on the log scale.
- Only the maximum page on each site was included in the analysis.
- Every toolbar PR value was assumed to be the lowest possible.
- I only included three PR10s, as indicated in the referenced post, even though I currently know of at least 9 sites with a PR10 page, and many of these sites have multiple PR10 pages.
- I omitted all PR5 through PR0 pages from the analysis.
In conclusion: our current "PR log scale equation" relating the Toolbar PR to actual PageRank, combined with our estimates of the number of pages at each Toolbar PR value, indicates that the base of our log scale is less than 4.3 (perhaps substantially less). This is very different from the widely held belief that the log scale is between 6 and 10.
Maybe we need to reevaluate our "PR log scale equation". Maybe we need to reevaluate our estimates for the number of pages at each Toolbar PR value. Maybe somebody needs to check my calculations (but I have approached this from a couple of different angles in another post [webmasterworld.com] and always seem to arrive at the log base around 3 or something).
I know that this doesn’t seem to agree with the decay of Toolbar PR values through web pages on sites. I don’t know the reason for the decay observations but I don’t think it is consistent to justify these observations by saying that the log scale is some high number like 10 or something.
I would appreciate any comments to try and help me figure this all out. I realize that my suggestion that the log base is 3 or 4 is a bit outlandish but the analysis I did on the "PR log scale equation" seems legitimate to me. I would especially appreciate critique on the assumptions, calculations, and even the "PR log scale equation".
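For anyone who wants to replicate this, here is a sketch of one way to get such a bound. The counts per toolbar level below are placeholders, not the figures from the linked thread; the constraint is that the summed raw PageRank implied by the known high-PR sites cannot exceed the index total of N:

```python
# Sketch of a bound like gmoney's. The counts per toolbar level are
# placeholders, NOT the real figures from the thread he cites.
# Constraint: sum(count_i * b**i) <= N, since total PageRank is N.
N = 2.5e9
counts = {10: 3, 9: 50, 8: 1000, 7: 20000, 6: 300000}  # hypothetical

def total_pr(b):
    return sum(c * b ** level for level, c in counts.items())

lo, hi = 1.0, 10.0
while hi - lo > 1e-6:          # bisect for the largest feasible base
    mid = (lo + hi) / 2
    if total_pr(mid) <= N:
        lo = mid
    else:
        hi = mid
print(round(lo, 2))
```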
Your arithmetic goes beyond my capabilities, but a log scale of 4 is well below what many would observe with their own sites.
It is often easiest to check with links from your index page to new pages, when the index page is also listed in the Google Directory or when you are flipping between PageRank 5 and 6.
- Some observations: there are indeed many more PR10 and PR9 pages around. That link is old and only considers the estimated number of sites whose highest page is PR9 or PR10.
- The damping factor could be a lot higher, or could vary?
> Since Google currently has 2.5*10^9 web pages, the maximum possible PageRank of any page is 2.5*10^9 (i.e. all other pages would have a PageRank of 0). From this you can determine that:
- Google is very conservative in listing the number of pages in its index. Do a search for "the" and you will find more results than the 2.5 billion at the bottom of their homepage.
- Not sure if this is relevant to your latest message, but earlier the average number of links per page was mentioned:
[webmasterworld.com...]
In this thread you will see it can be as high as 24.
> Should the number of links dilution be linear at all times?
ciml:
> Yes, IMO. Otherwise the integrity of PageRank is under threat.
> If you give extra weight (over what would be given now) to pages with many links, then we might be able to create a PageRank 'perpetual motion machine'. In other words, it might be possible to create a PageRank sink, causing a great perturbation in the fabric of the index and preventing convergence.
> If you give less weight (over what would be given now) to pages with many links, then you'll not achieve much, as the PR transfer is already heavily damped in that case.
I guess the 'perpetual motion machine' argument stands.
But doesn't every page that previously had no outgoing links create PageRank when it starts linking out? Normalisation just evens it out.
My main point is: how is Google going to reward hubiness?
I would suppose through ranking benefits and not through Pagerank benefits?
"Developing efficient implementations for large-scale mathematical problems, such as running Google's Pagerank™ algorithm on a graph of 3 billion nodes and 20 billion edges."
Some of these 3 billion nodes will be non-HTML documents without outbound links.
If you try to calculate the log scale, remember that there is a minimum PR for each page, which is 0.15, or 0.15/N (N = number of pages) when you take the normalisation that ciml mentioned into consideration. Hence, the maximum PR of a page is not N but 0.85*N + 0.15 (or 0.85 + 0.15/N).
IMO, there is no strict log scale. The larger the web becomes, the higher the real PR values of important pages get when you compute them by the simple (1-d) + d (...) algo and the lower the real PR values get when you compute them by the (1-d)/N + d (...) algo. So, you would have to find a pretty complex new formula once per month.
Regarding the damping factor: the length of a random walk follows a geometric (discrete exponential) distribution with a mean of d/(1-d). So, at a damping factor of 0.85, the random surfer follows on average 5.67 links. At 0.9 he follows 9 links, and at 0.8 only 4. The 5.67 sounds pretty good to me. I didn't find out about that myself; check [levien.com...] (Brett posted it once, so I think there's no problem about it).
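A quick Monte Carlo check of that mean, purely illustrative:

```python
# Check the claim above: the number of links a random surfer follows
# before teleporting has mean d / (1 - d).
import random

def average_walk_length(d, trials=200_000):
    total = 0
    for _ in range(trials):
        while random.random() < d:  # keep following links with prob. d
            total += 1
    return total / trials

for d in (0.8, 0.85, 0.9):
    print(d, round(average_walk_length(d), 2),
          "expected", round(d / (1 - d), 2))
```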
Regarding the integrity of PR, I think there is no problem with giving more weight to links from certain pages. On the one hand, everything can be normalised. On the other hand, you can implement another variable in the algo which is interpolated during the iterations so that PR converges. However, I don't think there is any "hubiness" benefit in either the PR algo or the ranking algos.
ciml:
> Changing the log base would be different. It would affect the relative dilution of the number of links on a page from update to update.
I now believe this to be wrong. I'm trying to see the mistake. My initial guess is that I forgot that it's not a one-way street. To get back from "ToolbarPR = c2*(log_b(PageRank))" you need "PageRank = b^(ToolbarPR/c2)".
Because we need to get values from the Toolbar before estimating the PR of the pages linked, c2 affects more than I imagined.
vitaplease:
> But doesn't every page with previously no outgoing links linking out, create Pagerank? ...
Yes, which I think is why they're removed for iterating and then put back in.
> ...normalisation just equals it out.
If you keep iterating and normalising, then at each step you push a little more PR into the rank sink.