Forum Moderators: open
First, let me explain a few concepts the way I understand them:
1) Google has 8 billion pages in their index.
2) To calculate PageRank, Google must make several "iterations" through the data. On the first iteration (over all 8 billion pages), Google has to record the outbound links from every page. On the second iteration, some pages gain rank because they have incoming links. But this is not enough; several more iterations must be completed in order to get a good reading and establish a rank.
3) The computing power required to do numerous iterations across 8 billion pages must be enormous.
4) Google uses "supplemental results" in addition to the main index, alluding to the idea that PageRank may only be established for the first 4 billion or so pages, and the rest is just used to "fill in".
5) Before Google moved to doing (allegedly visible) updates only once per quarter, there were numerous problems with Google keeping to their monthly schedule. People would become alarmed.
6) Even before the quarterly updates, Google was using "Freshbot" to help bring in new data between monthly updates. Please check me on this: Freshbot results did not have PageRank.
7) We have been told that even though there is no update to what we see in the little green bar, there is actually a "Continuous PageRank Update".
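The iteration described in point 2 can be sketched in a few lines. This is a toy illustration on a hypothetical four-page link graph (made-up data, not Google's actual implementation):

```python
# Toy PageRank power iteration on a hypothetical four-page link graph.
# links[p] lists the pages that page p links to (illustrative data only).
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = len(links)
d = 0.85                      # the standard damping factor
pr = [1.0 / n] * n            # first pass starts from a uniform guess

for _ in range(500):          # repeated passes over the whole graph
    new = [(1 - d) / n] * n
    for page, outs in links.items():
        share = pr[page] / len(outs)   # rank flows out along each link
        for target in outs:
            new[target] += d * share
    delta = max(abs(a - b) for a, b in zip(new, pr))
    pr = new
    if delta < 1e-10:         # values have settled; stop iterating
        break
```

Each pass spreads rank one hop along the links; the values only settle after many passes, which is why a single sweep over 8 billion pages is not enough to establish a rank.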
I find a continuous update of PageRank implausible. A true calculation requires passes across the entire dataset of 8 billion pages multiple times. We have already seen signs of trouble in the past (missed updates), attempts to remedy the problem (Freshbot), and additional measures to hide what is really going on (quarterly updates). Most likely, we are now in an age of "PageRank Lite".
And here is the "kicker": we have this mysterious "sandbox effect" (aka "March Filter") that seems to place a penalty on new links and/or new sites. Could it be a result of Google's failure to calculate PageRank across the entire index?
IMHO, Yes!
Quietly, Google has been building more datacenters. Recently, they opened a new one in Atlanta, GA, but there was no public announcement. There is not even a sign on the building. If you walk up to the door, the voice on the intercom won't tell you that you are at a Google facility either (source: Atlanta Journal-Constitution).
With the number of datacenters Google already has, the main reason for adding more is probably not uptime and dependability. Though those things are important, they certainly have plenty of datacenters, and you rarely see problems with downtime or server speed. The reason for adding these datacenters (quietly) must be that they need more computing power to calculate PageRank.
I believe I have provided many examples to support the idea that continuous updating of PageRank is indeed a farce. I also feel that this "sandbox effect" is a result of the inability to do complete calculations across the entire dataset (especially new additions to the data).
I look forward to hearing what others think.
With Googlebot constantly revisiting my sites, I can say that (at least on my sites) CPU is happening. I see recently added pages appearing in the SERPs and others taking higher positions. I can only see this as a sign that Google is giving PR to the new pages, and more to the older ones that are being linked back to.
I agree with your assumption that the Sandbox Effect is due to the "inability" to complete calculations, though I don't consider this an inability. All of my latest sites took less than two weeks (not months, as some people have suggested) to be spidered and to appear in Google's listings. This was before I went on a linking campaign.
Let's hear from some others on this CPU subject. The sandbox effect has been covered extensively.
However, I don't agree that there are fundamental problems with a continuous PR update. Even with 8 billion pages, one iteration should take less than a day. Using the PR values of the last iteration as input and just updating the linking structure would lead to an almost continuous update. Even for complex, large new sites, PR should be almost stable within a month.
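The warm-start idea can be illustrated on a hypothetical tiny graph (the function and all figures below are mine, purely for illustration): after a small change to the link structure, reusing the previous ranks as the starting vector converges in noticeably fewer passes than restarting from a uniform guess.

```python
# Sketch of "use the PR values of the last iteration as input"
# on a made-up four-page graph.
def pagerank(links, start, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate to convergence; return the ranks and the pass count."""
    n = len(links)
    pr = list(start)
    for i in range(1, max_iter + 1):
        new = [(1 - d) / n] * n
        for page, outs in links.items():
            share = pr[page] / len(outs)
            for target in outs:
                new[target] += d * share
        delta = max(abs(a - b) for a, b in zip(new, pr))
        pr = new
        if delta < tol:
            return pr, i
    return pr, max_iter

old_links = {0: [1], 1: [2], 2: [0], 3: [0]}
new_links = {0: [1], 1: [2], 2: [0], 3: [2]}   # one link changed today
uniform = [0.25] * 4

old_pr, _ = pagerank(old_links, uniform)       # yesterday's full run
_, cold = pagerank(new_links, uniform)         # recompute from scratch
_, warm = pagerank(new_links, old_pr)          # warm-start from old ranks
```

On this toy graph the warm start reaches the same tolerance in fewer passes than the cold start, which is the whole appeal of a continuous update.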
Using the PR values of the last iteration as input and just updating the linking structure would lead to an almost continuous update.
Exactly my point...the last complete update was a long time ago. They are just using their "old" update.
8 billion pages in one day? Questionable at best. Why would they be using two indexes (e.g. the supplemental one) if it were easy to do one iteration in one day?
Again, these domains were not assisted by any other domains and were not registered early in the year. I'm talking about domains newly registered as of September or so. About 11 different domains with different content and linking structures, and none suffered the sandbox effect.
Exactly my point...the last complete update was a long time ago. They are just using their "old" update.
"Continuously" does not necessarily mean that the final (stable) PR value is calculated for each configuration. Even a "continuous" PR update with a few (2-3) iterations a day would be better than the old monthly update, i.e. it results in a faster propagation of PR.
8 billion pages in one day? Questionable at best.
The calculation itself (for one iteration) isn't very time consuming. The problematic point might be getting the data.
The calculation itself (for one iteration) isn't very time consuming.
So, approximately, how is the number of calculations related to the number of pages, and how many calculations are currently required to complete one iteration? Also, how much memory is required?
I still have a hard time believing that an iteration can be completed in less than a day.
Kaled.
Nobody ever stated that calculations were done on the full data set. Also, remember that what you call a "true calculation" does not imply one fixed "true" value for a page - instead, it is an approximation of that true value, as the calculation comes closer with each iteration but never reaches the true value. Even if it did, the underlying data would change along the way. There's a whole body of mathematics dealing with such issues.
Let's say the real issue is something else: How much data and how many iterations do you really need in order to compute values that are "good enough"? Perfect doesn't exist on this scale, it's all tolerances / probabilities / deviations / quality levels / whateveryoucallit.
So, the task is to minimize the use of resources while maximizing the output quality. That's a pretty standard issue in the field of operations analysis, and you could probably express it in algebra if you wanted to, but computers tend to use brute force instead, as that's what they're made for :)
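The "good enough" trade-off can be put in rough numbers: with damping factor d, each pass shrinks the remaining error by about a factor of d, so the number of passes grows only with the log of the tolerance. A back-of-envelope sketch (d = 0.85 is the published damping factor; the tolerance values are arbitrary):

```python
import math

# Passes needed until 0.85**passes drops below each tolerance:
# roughly log(tol) / log(0.85).
d = 0.85
passes = {tol: math.ceil(math.log(tol) / math.log(d))
          for tol in (1e-2, 1e-4, 1e-8)}
```

Loose tolerances are cheap, while chasing a near-"perfect" value costs several times as many full passes over the data - exactly the resources-versus-quality trade-off described above.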
>> "Continuously" does not necessarily mean that the final (stable) PR value is calculated for each configuration
Indeed, a continuous calculation implies that there is no final, or stable value, as the calculation is always in progress (and new data is most probably fed in continuously as well).
Ideally, we'll see a slower (more continuous, heh) development in ranking for sites instead of the update jumps known from the past.
Still, when major changes occur (such as doubling the index size(*)), you would risk that the continuous method couldn't cope with all the new material in a satisfying manner. Then you would perhaps find it worthwhile to do a real full-scale PR calculation, just like in the old days, in order to get a new (set of) reference point(s).
For quality assurance, this should be done once in a while anyway - say, one or two times a year or so - just to make sure that the continuous values don't degenerate too much (e.g. if calculation speeds can't keep up with new data being added or something).
... come to think of it; if you clean out the sand from your eyes and put away your "great google conspiracy" glasses (and the 32 bit lenses too) - that's quite an interesting post, huh?
;)
-----
(*) ... which, btw, didn't happen overnight - you just don't do that. It's been proceeding since February 2004 [webmasterworld.com]. What happened overnight was that the figure on the Google front page changed.
So, approximately, how is the number of calculations related to the number of pages, and how many calculations are currently required to complete one iteration? Also, how much memory is required?
Assuming that the standard iteration scheme (Jacobi) is used, the number of operations per page should be less than 100, yielding less than 800 billion operations per iteration. This shouldn't be the problem.
A rough estimate for the memory would be #pages * 18 bytes (2 doubles + 1 integer), i.e. about 144 GB - this is a problem.
However, using block techniques and parallelization should solve the problem.
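The arithmetic behind those two estimates, using the poster's own figures (8 billion pages, at most 100 operations and 18 bytes per page):

```python
pages = 8_000_000_000

# Operation count per Jacobi sweep (upper bound from the post).
ops = pages * 100        # 8e11 operations per iteration

# Memory for the rank vectors alone: 2 doubles + 1 integer per page.
ram = pages * 18         # bytes; comes to 144 GB
```

Even at 2004-era speeds, 8 x 10^11 simple operations are tractable, while 144 GB just for the rank data (before any link structure) explains why memory, not arithmetic, is called the real problem.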
A rough estimate for the memory would be #pages * 18 bytes (2 doubles + 1 integer)
I would have thought that a database would be required that contained, for every page, a list of links on that page. That would increase the memory requirements considerably. Additionally, I'm not convinced that this problem lends itself readily to distribution (amongst many computers) but parallel processing (many CPUs sharing memory) might be ok I guess. Also, even if an iteration can be performed quickly, continuous updating of the database could be a real problem.
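The link database concern can be quantified under assumed averages (10 outbound links per page and a 4-byte page id are illustrative figures of mine, not known values):

```python
pages = 8_000_000_000
avg_outlinks = 10        # assumed average, for illustration only
bytes_per_id = 4         # one 32-bit page id per stored link

link_store = pages * avg_outlinks * bytes_per_id   # 320 GB
```

That dwarfs the rank vectors themselves, so the adjacency data would indeed dominate memory - one reason it would likely be streamed from disk or sharded across machines rather than held in RAM on a single box.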
Google like to use networks of cheap PCs but this problem seems comparable to global weather modelling to me and that is typically the domain of the super-computer.
Kaled.
The point is that there is no hint that this behaviour is caused by a change to a continuous PR update, or by an increase in PR calculation time (from the growing number of pages).
The 18 bytes per page was a very rough lower bound for the memory, because I neglected the transition matrix. The real memory requirements are even higher, i.e. the problem is even worse.
Database access would heavily slow down the calculation.
I think they are going mostly off old (pre-March 2004) data, and just doing their best to infuse this data with Freshbot type data. Alternately, they may be continuing to calculate PageRank on a smaller dataset.
Also evident is they are probably not calculating PageRank on "supplemental results". I'll bet these supplemental results are post-March 2004, but do not contain enough criteria to make them part of a competitive keyword dataset.
Is any of this starting to make sense?
I truly think I am on to something here.
I'm not too sure about that. Difficulty, that is. As implied in my post above, it's not unreasonable to believe that they could find perfectly good reasons not to perform it, even though they could do it. Still, once in a while they should probably do it anyway.
One page of my site shows toolbar PR 0 yet it's in the Google directory with a PR of 3. Also, another page has a toolbar PR of 6 but is in the Google directory with a PR of 4.
Given that it is widely believed that PR has been devalued over the last year or so, it seems likely that Google are less concerned now with keeping it up to date. Furthermore, the technology to calculate it was presumably designed with monthly dances in mind; therefore, it is likely that iterations take several days or even weeks to complete.
Kaled.