What goes on technically at Google during an update?
Has anyone thought about this?
The processing required to compute PageRank across the entire web must be incredible. I wonder what tools they use and how they do it? Think about it.
In order to calculate PageRank for a page (setting aside every variable affecting PR other than links and the value of the pages linking to it), they need to count how many backlinks there are for every single page. Then they need to evaluate the importance of every single one of those backlinking pages. Seems like recursive logic. How do you know the importance of a page until you know the importance of the link, but the importance of the link requires knowing the importance of THAT page?
I have to think that in the first round, it counts links and assigns PR based purely on quantity. Then it does another iteration, bumping up PR for pages linked from high-PR pages. And then does another few iterations again? Anyone else have thoughts on this?
Yes, that is how it is done. There has been lots of discussion here about this topic. You can read the original academic articles by the founders of Google. People have created PR calculator tools where you enter a parameter telling how many iterations you want to see.
The calculations approach (in a mathematical sense) the PR of the page with increasing iterations. They stop when they run out of time or they have done a sufficient number of iterations to approximate the solution.
The site search should lead you to that stuff.
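The iterative scheme described above can be sketched in a few lines of Python. This is a toy illustration, not Google's actual code: the three-page graph is invented, and the 0.85 damping factor is the value from the original PageRank paper.

```python
def pagerank(links, iterations=50, damping=0.85):
    """Iteratively approximate PR. `links` maps each page to its out-links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # first round: same value for every page
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # each page passes an equal share of its current PR down every link
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank  # the next iteration starts from these values
    return rank

# Toy graph: A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(ranks)  # C ends up highest: everyone ultimately feeds PR into it
```

Each pass uses only the previous pass's values, which is exactly the "another few iterations" idea: the numbers settle toward a fixed point rather than being computed in one shot.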
you got it right, I think.
As far as I understand, the PR calculation is repeated about 40-50 times, each time starting with the PR value of the previous iteration.
What I do not know is what PR is used in the first iteration. It could either be identical for all pages, so all start with something like PR 1 (as you said), or it could be the current PR calculated in the last dance. Given the nature of the formula used, I think the difference resulting from the initial values could be shown to converge to zero over an infinite number of iterations.
In the actual world with finite iterations, I think taking last month's values would get better results faster.
If nobody knows what they start from, I think you should be able to tell by watching the dance. If they start from scratch, the listings at the beginning of the dance should differ more from the current index than those at the end.
[edited by: engine at 6:10 pm (utc) on Feb. 26, 2003]
[edit reason] No urls thanks ( see TOS #13) [/edit]
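A quick experiment supports the convergence point above: given enough iterations, the starting values wash out. The graph, the damping factor of 0.85, and the pretend "last month's" vector here are all made-up illustrations.

```python
def pagerank(links, start, iterations=100, damping=0.85):
    # One toy PR iteration loop, seeded from an arbitrary starting vector.
    n = len(links)
    rank = dict(start)
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in links}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

graph = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
uniform = {p: 1 / 3 for p in graph}        # "start from scratch"
skewed = {"A": 0.9, "B": 0.05, "C": 0.05}  # pretend "last month's" values
diff = max(abs(pagerank(graph, uniform)[p] - pagerank(graph, skewed)[p])
           for p in graph)
print(diff)  # effectively zero: both starts converge to the same PR
```

The practical question in the thread is not whether the two starts agree in the limit (they do), but how many iterations the warm start saves on a finite run.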
The difference between starting from scratch and starting from the previous month's values is at most a few iterations. The logistics of seeding the database with the previous values might take as much effort and time as running those extra iterations. I'd call it a wash.
The reason that there would not be much difference is that whole sections of the web appear and disappear each month. The old pages will probably become stable very quickly, but the new pages will have PR flowing towards them like a wave, getting one page closer with each iteration. Then the new pages, and the pages that they link to, also have to settle.
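The wave effect is easy to see on a toy chain. Here a hypothetical old page O keeps its previous PR while three newly discovered pages start at zero; the link contribution reaches each new page one hop per iteration (graph, seed values, and damping factor are all illustrative assumptions).

```python
damping = 0.85
# Old page O links into a chain of brand-new pages: O -> N1 -> N2 -> N3 -> O.
links = {"O": ["N1"], "N1": ["N2"], "N2": ["N3"], "N3": ["O"]}
n = len(links)
rank = {"O": 1.0, "N1": 0.0, "N2": 0.0, "N3": 0.0}  # new pages seeded at zero

n3_history = []
for i in range(5):
    new_rank = {p: (1 - damping) / n for p in links}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank
    n3_history.append(round(rank["N3"], 4))

print(n3_history)  # N3 sits near the baseline until iteration 3, then jumps:
                   # O's PR needed three hops (three iterations) to reach it
```

So a page k links away from the established mass of the web needs at least k iterations before its PR reflects those links, which is why the new sections take the longest to settle.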
>In the actual world with finite iterations, I think taking last month's values would get better results faster.
>If they start from scratch, the listings at the beginning of the dance should be more different from the current than at the end.
If I understood correctly how PR was calculated, the algorithm needs the same starting value for each page to converge accurately. The starting value is irrelevant, as long as it's the same for all pages. All examples I've seen show a starting value of 1.
One thing we don't know, even if they start from scratch every month, is when the results are made public. After a couple of iterations? After 5, 10...?
The algorithm converges pretty quickly, so maybe a few iterations would give estimates good enough to start showing. If that were true, Google would of course be hiding results "up to a point" in the computational phase.
If this makes sense, these first few iterations could probably be happening right now, and we'd see the new index showing up once the results start converging.
Not sure this makes sense ... I'm not a mathematician :)
They could start out with all sorts of random numbers and the results would converge to the same point. Since it doesn't really matter, and the average value in the ideal system is 1, then why not just start with that.
There is a huge amount of computation that goes into indexing other than PR. The dance is only the final moments. All the factors that are involved in the indexing have to be calculated for every page. And they also need to preprocess as much of the searching factors as they can. Who knows when they do what part, but it is all going on long before the dance as we know it starts.
Yep, they can start with any numbers and the result will be the same. But of course they need to choose a number wisely so that it doesn't require too many iterations to reach the convergence point.
I don't think they configure the iteration count as a static number. Instead, the iteration stops when it meets a criterion where the difference between the current value and the last value is smaller than some preferred tolerance, maybe 0.01.
yes, I'd agree with that: the algorithm would stop once the average difference from the last iteration is negligible.
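That stopping rule can be sketched like this: iterate until the total change between successive PR vectors drops below a tolerance, instead of running a fixed count. The graph, the damping factor, and the 0.0001 tolerance are assumptions for illustration.

```python
def pagerank_until_stable(links, damping=0.85, tol=0.0001, max_iter=200):
    """Iterate PR until successive vectors differ by less than `tol`."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for iteration in range(1, max_iter + 1):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        # total change across all pages since the last iteration
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:  # negligible change: we've converged, stop early
            return rank, iteration
    return rank, max_iter

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks, iters = pagerank_until_stable(graph)
print(iters)  # converges well before max_iter on a small graph
```

A tighter tolerance buys accuracy at the cost of more iterations, which matches the earlier point that they stop when the approximation is good enough or time runs out.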
I do not think this would take an extra amount of seeding; the database records will hold the PR value for search anyway, so the question is: drop the old values, or use them. And I do not think that in terms of backlinks the web is that variable. Not starting from scratch, with yahoo and geocities/home/something/anything/nobodycares.html having the same PR in the first iteration, would be a good idea, imho. It would save a few iterations, and that is a lot if one iteration needs thousands of Linux servers for at least some minutes ;-).
Hmm... My brain's melting a bit.. but here goes..
BigDave, I was going to agree, but actually I think you would have to reset the PR values..
I don't think you'd get the same results if you didn't (sure, if the index was the same, you'd get the same result after 1 iteration).
I'm sure that you'd see an age effect if you didn't reset. It depends how the index has changed... and what the link-path distance across the index is from your high-ranking pages to your new pages. OK, so if the internet follows the 'six degrees of separation' rule, then maybe it wouldn't matter, and maybe it would normalize quickly enough that you wouldn't have to reset... but my gut says it would be easier to do so...
In answer to Clark's original question, I think you're pretty much on. But I would do it the other way round: look at each page one by one, and share your 85% worth of PR (is it still 85%?) between all the pages you link to. More of a push method than a pull method. I don't think the final raw PR calc is likely to be that much work when you look at it against spidering/parsing 3 billion pages of HTML! ;)
I'm not sure about this 'do you have to reset the PR' thing.. I've changed my mind about a dozen times while I've been typing.
must hit submit.. must hit submit..
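The push-versus-pull point a few lines up can be made concrete: one iteration can either push each page's share outward along its forward links, or pull contributions in from each page's backlinks, and both produce the same numbers. The toy graph and 0.85 damping factor are assumptions.

```python
def step_push(links, rank, damping=0.85):
    """One iteration, push style: each page distributes its PR forward."""
    n = len(links)
    out = {p: (1 - damping) / n for p in links}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            out[target] += share  # push the share out along each link
    return out

def step_pull(links, rank, damping=0.85):
    """One iteration, pull style: each page sums over its backlinks."""
    n = len(links)
    backlinks = {p: [] for p in links}
    for page, outlinks in links.items():
        for target in outlinks:
            backlinks[target].append(page)
    return {
        p: (1 - damping) / n
           + damping * sum(rank[q] / len(links[q]) for q in backlinks[p])
        for p in links
    }

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = {p: 1 / 3 for p in graph}
matches = all(abs(step_push(graph, rank)[p] - step_pull(graph, rank)[p]) < 1e-12
              for p in graph)
print(matches)  # True: the two formulations agree
```

The choice is an implementation detail: push only needs the forward-link table the crawler already has, while pull requires inverting it into a backlink table first.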
brotherhood of LAN
>What goes on technically at Google during an update?
I don't know the math, but Google must grab their full index in less than 28 days at a time (per update).
If they made a table containing the source and destination of every link, yeah, there must be trillions of hyperlinks they have to take into account before they begin. But I guess all they need to know is how many links are in there to start processing PR... then when they process the first one, they know its value is 1/totaldocs... and it starts passing on its PR (AFAIK! hopefully someone will correct me if I'm wrong).
So after they have all the hyperlinks they are going to include in their PR calc, they run it, and towards the end it sits on www2/www3 (?). Who knows outside Google.
Perhaps this patent [webmasterworld.com] has something to do with updating PageRank "on the fly"... I don't know, the page keeps timing out on me ;)
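The link-table idea above can be sketched like this: a flat list of (source, destination) pairs, out-link counts derived from it, and every page seeded at 1/totaldocs before iterating. The domain names here are invented for illustration.

```python
# A made-up flat link table, as the post imagines: (source, destination) pairs.
link_table = [
    ("a.example", "b.example"), ("a.example", "c.example"),
    ("b.example", "c.example"), ("c.example", "a.example"),
]

# Collect every page that appears on either side of a link.
pages = sorted({p for pair in link_table for p in pair})
total_docs = len(pages)

# Count out-links per page; needed to know each page's share per link.
out_degree = {p: 0 for p in pages}
for source, _ in link_table:
    out_degree[source] += 1

# Seed every page identically at 1/totaldocs, as the post suggests.
rank = {p: 1.0 / total_docs for p in pages}
print(total_docs, rank["a.example"])
```

From here the iteration only needs this table plus the current rank vector, which is why just knowing the link counts is enough to start processing PR.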