Forum Moderators: open
The 12th World Wide Web conference has ended. As reported earlier, a major theme was Google becoming up to 5 times faster and how a team from Stanford University were to achieve it. To help in that feat there will be three new main ways or groups which wil help with regards to the way PageRank is to be newly calculated.
They are...
Blockrank:
On average, internal links take up to 80% of all links per page. Currently each of these internal links is calculated individually, the new method will group the internal links together and calculate them just the once, thereby gaining an increase of approximately 3 times the current calculation speed.
Extrapolation:
This deals with some pretty heavy mathematics that I’m not even going to begin getinto. Your best bet, if you are not mathematically challenged like myself lol, is to click on the link Extrapolation. Very roughly, it is said to act in a similar way as Alexa by factoring in number of visits / traffic.
Adaptive PageRank:
The final main group is Adaptive Pagerank. Quote: "PageRank of pages that have converged are not recomputed at each iteration after convergence." This suggests that pages that have a high PageRank and have been in the system for sometime may well not need to be recalculated in certain circumstances.
More informations about: http://www.stanford.edu/~sdkamvar/research.html
News sources:
http://asia.cnet.com/newstech/systems/0,39001153,39133747,00.htm
Quadratic extrapolation and in that paper, there are references to 'teleportation'.
Did you play dungeons & dragons as a kid, too? I know when I read a 'teleportation' into a paper on algorithms, and information retrieval, I keep thinking my wizard isn't high enough level to cast a 4th level spell yet. :) lol
Back on topic though - thanks for the link! Perhaps there is some new stuff that I haven't read yet.
[edited by: jeremy_goodrich at 5:32 am (utc) on May 28, 2003]
Extrapolation:
This deals with some pretty heavy mathematics that I’m not even going to begin getinto. Your best bet, if you are not mathematically challenged like myself lol, is to click on the link Extrapolation. Very roughly, it is said to act in a similar way as Alexa by factoring in number of visits / traffic.
Sorry, but I don't see that in the paper at all. The extrapolation method relies on no new data, just new calculations.
Very roughly, it is said to act in a similar way as Alexa by factoring in number of visits / traffic.
thats means, that simply used a known factor to extrapolate the other sites... alexa do the same to calculate the traffic rankings...they computes all alexa hits to get the "known factor" to computes the rankings.
As a first step, Alexa computes the reach and number of page views for all sites on the Web on a daily basis. Source [pages.alexa.com]
PageRank is iterative, startign at 1, each iteration brings it closer to the real value.
so if you start at 1, the first iteration yields 3 and the second 5, you can make a linear extrapolation to guess the real value to be at 7. That is the first part of the paper. Then it goes on to quadratic extrapolation, basically taking 3 consecutive deltas and extrapolating on a quadratic curve. If you've ever programmed graphical quadratic and cubic splines you'll find it very easy to follow.
I wondered why, at least the linear extrapolation, it wasn't in teh original pagerank computation, as it'S a common sense method for iterative algorithms.
SN.
[toolbar.google.com...]
Add a critical mass of Advanced Toolbar users into the mix (so that browsing data is fed back to Google) and you can start to see where Google is going with all this.
Very roughly speaking, page rank of a page is the number of incoming links divided by the respective numbers o links on the page linking to your page, multiplied by the PR of the source page. In other words, links from high PR pages count more, and links from apges with lots of other links on them count less.
Now the crux of the matter is that to know the PR of a page you need to know the PR of all the pages linking in, and ditto for those. It's th eclassical chicken and egg problem.
So what google does is to assume (pretend) that all pages have PR1. Then it gos to each page, and recalculates it's PR using the asumed Pr of 1 from all te other pages. Now it has a sligtly better approximation of the true PR for this page.
This is repeated about 50 to 100 times (50 in the original paper, 100 in more recent papers) for which empirical evidence (read: trial and error) has shown to yield accurate enough results (never perfect mind you.)
Now running this formula on 3 billion pages 100 times takes it's time as you can imagine.
These papers mostly deal with how to use information to skip parts of this process. The extrapolation for examply uses the fact that each iteration has a better approximation (rather than equally bad and finally, suddenly correct) to skip a few steps and get to a good anseer quicker.
The block algorithm uses the fact that a website has many internal links and few external links. In theory you could pretend each site is one page, calculate PR for each "site" as a whole, then take each site in turn and distribute te site Pr to each page. In praxis it's somewhere halfway between the traditional one and what I jsut described.
If you want it more clarified, and others are interested, I might write this up neatly and the Mods could do a locked thread or something...
Hope this helps.
SN
nice summary.
I just want to add some comments:
The solution of the iteration is indepent from the initial values. (You can easily show that for a damping factor d<1 an unique solution exists. For d=1 the situation changes because you will have a number of degenerated eigen vectors.) Also, in practice one would use the PR values of the last PR calculation as initial values to speed up the calculation.
One main advantage of the block algorithm is that you can easily introduce a parallized version.
Some comments on the standford papers (mentioned above):
1) While the calculation for d=1 corresponds to the determination of an eigen vector, it corresponds to solving a set of linear eqautions for d<1. However, they (Kamvar et. al.) always consider the calculation as determining eigen vectors. This is unnecessary complicated (and from a principle point of view not correct). They also get problems with dangling pages (dead ends) with wouldn't appear if they simply solve the equation system.
2) If you look at the improvement in CPU time for realistic scenarios (a realistic value of d and a realistic number of pages) you will find that in most of the cases they are small.
3) There are well-known (different) algorithms which are probably faster.
im far from a math whiz above adding and subtracting, but that your explanation makes close enough sense.
my question is why make the process faster? other than the obvious reason (faster serps) which are already lightning fast, what is the big picture? more pages? larger index? making room for additional points in the algo?
in our world of "bigger, better, faster, more"... i find it hard to believe the "faster" is the only point of address here.
any other ideas? or am i alone here?
Extrapolation:
This deals with some pretty heavy mathematics that I’m not even going to begin getinto. Your best bet, if you are not mathematically challenged like myself lol, is to click on the link Extrapolation. Very roughly, it is said to act in a similar way as Alexa by factoring in number of visits / traffic.
thats means, that simply used a known factor to extrapolate the other sites... alexa do the same to calculate the traffic rankings...they computes all alexa hits to get the "known factor" to computes the rankings.
OK, as long as you understand that they are not using any new data. From your initial post you made it sound like google was using traffic data..."by factoring in number of visits / traffic." This is not the case.
From your initial post you made it sound like google was using traffic data..."by factoring in number of visits / traffic." This is not the case.
you are right.. that´s what i wanted to say :) Google dont use any traffic/visit data to computes the known factor...but some thoughts.. they can use gg toolbar values...
Making it faster, from the obvious effect of cutting several days off the update cycle, and/or lettign them use less hardware, will mean by using the same tiem they could calculate several differently weighted PageRanks. The idea is to have topical pageranks, for example one that weights commercial sites heaveier, or informational sites and so on, so One could search different versions of the PR with a particular type of pages emphasized.
Ultimately, the ideal would be to have a pagerank for each user, emphasizing their favourite pages and pages benefitial to their general interests. But that, of course would require a speed up of a about 8 or 9 magnitudes (about 100 million times faster). This doesn'T seem feasable with the current system.
The BlockRank paper outlines the speed tests they did, which has a theoretical speed increase of a factor of 5 with the same dampening factor as the original Google papers (anatomy of a large scale hypertextual search engine etc).
So this new technique, BlockRank, can also be used in conjunction with the other adaptive method or the Quadratic Extrapolation method.
If you read through the papers closely, you'll noticed a few parallels between the math techniques that speed things up with adaptive control systems, which have been used in industrial manufacturing for years.
Given a set of unkowns, the system has to approximate a better value of the end result, even though it could figure out precisely what the values are, it's faster to simply approximate. By using fuzzy sets, they can arrive at the same or close enough to the same final values that makes the end results similar enough to not lessen the index quality.
SN
but some thoughts.. they can use gg toolbar values...
Indeed they could, a scary thought IMO. I imagine they at least collect this data even if they aren't acting on it.
The idea is to have topical pageranks, for example one that weights commercial sites heaveier, or informational sites and so on, so One could search different versions of the PR with a particular type of pages emphasized.
Another benefit is the potential for quicker pagerank updates (google dances) under the current system. Combine this with what many theorize to be the purpose behind current freshbot crawling and you could have much faster complete updates. I imagine we'll see this sort of continuous update system before topical/contextual pageranks, simply because it would be much easier to implement.
my question is why make the process faster?
As already said, to run a continuous update (and to provide personalized PageRank respectively results).
See also this thread. [webmasterworld.com]