Is Continuous PageRank Updating a Farce?

         

dvduval

12:43 am on Dec 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would appreciate it if someone could shed some light on this "continuous updating" that Google is supposed to be doing.

First, let me explain a few concepts the way I understand them:
1) Google has 8 billion pages in its index.
2) To calculate PageRank, Google must do several "iterations" through the data. On the first iteration (over all 8 billion pages), Google has to record all the outbound links from ALL of the pages. On the second iteration, some pages gain rank because they have incoming links. But this is not enough; several more iterations must be completed in order to get a good reading and establish a rank.
3) The computing power required to do numerous iterations across 8 billion pages must be enormous.
4) Google uses "supplemental results" in addition to the main index, alluding to the idea that PageRank may only be established for the first 4 billion or so pages, while the rest is just used to "fill in".
5) Before Google moved to doing (allegedly visible) updates only once per quarter, there were numerous problems with Google keeping to its monthly schedule. People would become alarmed.
6) Even before the quarterly updates, Google was using "Freshbot" to help bring in new data between monthly updates. Please check me on this: Freshbot results did not have PageRank.
7) We have been told that even though there is no update to what we see in the little green bar, there is actually a "Continuous PageRank Update".
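To make the "iterations" idea in point 2 concrete, here is a toy power-iteration sketch in plain Python (my own illustration on a four-page graph - obviously not Google's actual code; the damping factor 0.85 is the one from the original Brin/Page paper):

```python
# Toy PageRank power iteration. links[p] lists the pages p links out to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
n = len(links)
d = 0.85                                  # damping factor
pr = {p: 1.0 / n for p in links}          # start uniform

for _ in range(20):                       # each pass = one "iteration"
    new_pr = {p: (1 - d) / n for p in links}
    for page, outlinks in links.items():
        share = d * pr[page] / len(outlinks)
        for target in outlinks:           # pass rank along outbound links
            new_pr[target] += share
    pr = new_pr
```

After a handful of passes the values settle: C (linked from three pages) ends up highest, D (no inbound links) lowest. The point of this thread is that doing this over 8 billion pages, many times over, is the expensive part.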

I find continuous updating of PageRank implausible. In order to get a true calculation, it requires calculations across the entire dataset of 8 billion pages, multiple times. We have already seen signs of issues in the past (missed updates), attempts to remedy the problem (Freshbot), and additional measures to hide what is really going on (quarterly updates). Most likely, we are now in an age of "PageRank Lite".

And here is the "kicker": we have this mysterious "sandbox effect" (aka "March Filter") that seems to place a penalty on new links and/or new sites. Could it be a result of Google's failure to calculate PageRank across the entire index?

IMHO, Yes!

Quietly, Google has been building more datacenters. Recently, they opened a new one in Atlanta, GA, but there was no public announcement. There is not even a sign on the building. If you walk up to the door, the voice on the intercom doesn't tell you that you are at a Google facility either (source: Atlanta Journal-Constitution).

With the number of datacenters Google already has, the main reason for adding more is probably not uptime and dependability. Though those things are important, they certainly have plenty of datacenters, and you rarely see problems with downtime or server speed. The reason for (quietly) adding these datacenters must be that they need more computing power to calculate PageRank.

I believe I have provided many examples to support the idea that continuous updating of PageRank is indeed a farce. I also feel that this "sandbox effect" is a result of an inability to do complete calculations across the entire dataset (especially the new additions to the data).

I look forward to hearing what others think.

scoreman

2:20 pm on Dec 13, 2004 (gmt 0)

10+ Year Member



In CPU I trust. Sandbox is a farce.

With constant revisits from Googlebot on my sites, I can say that (at least on my sites) CPU is happening. I see recently added pages appearing in the SERPs and others taking higher positions. I can only read this as a sign that Google is giving PR to the new pages, and higher PR to the older ones that are being linked back to.

I agree with your assumption that the sandbox effect is due to the "inability" to complete calculations, though I don't consider it an inability. All of my latest sites took less than two weeks (not months, as some people have suggested) to be spidered and to appear in Google's listings. This was before I went on a linking campaign.

Let's hear from some others on this CPU subject. The sandbox effect has been covered extensively.

skunker

2:33 pm on Dec 13, 2004 (gmt 0)

10+ Year Member



Scoreman,
As I understand it, anyone can get indexed in Google easily; however, the sandbox effect occurs when you cannot get ranked in that index for your search term.

scoreman

3:30 pm on Dec 13, 2004 (gmt 0)

10+ Year Member



Skunker, sorry, I forgot to mention that little tidbit. I did get ranked for those search terms within two weeks.

hugo_guzman

3:47 pm on Dec 13, 2004 (gmt 0)

10+ Year Member



I'm also getting new content indexed and ranking well (for niche 2-4 word terms) within days.

skunker

4:13 pm on Dec 13, 2004 (gmt 0)

10+ Year Member



Yeah, you can rank quickly if your domain (and, I think, site) was already registered/created before a certain date (March? February?).

dvduval

5:42 pm on Dec 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So whatever "continuous updating" is going on, I am saying that it is not based on a true PageRank update but on something else related to Freshbot (or similar). I am also saying that it seems likely Google stopped doing complete updates a long time ago, most likely because their system can't compute PageRank for 8 billion pages anymore. And finally, new sites are not part of the last complete update (probably in March). I think Google has some problems, and this "continuous updating" is weak at best.

doc_z

6:12 pm on Dec 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I agree that Google has some problems.
See also this discussion about the status of Google's PR calculation. [webmasterworld.com]

However, I don't agree that there are problems in principle with a continuous PR update. Even with 8 billion pages, one iteration should take less than a day. Using the PR values of the last iteration as input and just updating the linking structure would lead to an almost continuous update. Even for large, complex new sites, PR should be almost stable within a month.
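The warm-start idea - reuse the last PR values and just refresh the link data - can be sketched like this (toy Python, my own construction, not anything Google has published):

```python
def pagerank_step(links, pr, d=0.85):
    """One Jacobi-style pass, warm-started from the previous PR values."""
    n = len(links)
    new_pr = {p: (1 - d) / n for p in links}
    for page, outlinks in links.items():
        share = d * pr[page] / len(outlinks)
        for target in outlinks:
            new_pr[target] += share
    return new_pr

# Converged values from the last full update...
links = {"A": ["B"], "B": ["A"]}
pr = {"A": 0.5, "B": 0.5}

# ...then the crawler discovers a new page C linking to A:
links["C"] = ["A"]
pr["C"] = 0.0            # new page enters with no rank yet

# A few warm-started passes instead of a full recomputation:
for _ in range(3):
    pr = pagerank_step(links, pr)
```

Because the old values are already close to the answer, a handful of passes is enough for C's presence to start feeding rank into A - the "almost continuous" behaviour described above. (Every page in this toy graph has outlinks, so dangling-node handling is skipped.)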

dvduval

6:26 pm on Dec 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Using the PR values of the last iteration as input and just updating the linking structure would lead to an almost continuous update.

Exactly my point...the last complete update was a long time ago. They are just using their "old" update.

8 billion pages in one day? Questionable at best. Why would they be using two indexes (e.g. the supplemental one) if it were as easy as doing one iteration in one day?

scoreman

8:13 pm on Dec 13, 2004 (gmt 0)

10+ Year Member



"Yea, you can rank quickly if your domain (and I think site) was already registered/created before a certain date (March? February?)."

Again, these domains were not assisted by any other domains and were not registered early in the year. I'm talking about domains newly registered as of September or so: about 11 different domains with different content and linking structures, and none suffered any sandbox effect.

doc_z

9:09 pm on Dec 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Exactly my point...the last complete update was a long time ago. They are just using their "old" update.

"Continuously" does not necessarily mean that the final (stable) PR value is calculated for each configuration. Even a "continuous" PR update with a few (2-3) iterations a day would be better than the old monthly update, i.e. it would result in a faster propagation of PR.

8 billion pages in one day? Questionable at best.

The calculation itself (for one iteration) isn't very time-consuming. The problematic part might be getting the data.

Clark

11:54 pm on Dec 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Supplemental can mean a lot of things. Maybe the page is less "worthy"? Or it hasn't been visited in a long time, so we'll show it as a last resort; but since it's supplemental, don't put too much faith in that page.

kaled

11:26 am on Dec 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The calculation itself (for one iteration) isn't very time consuming.

So, approximately, how is the number of calculations related to the number of pages, and how many calculations are currently required to complete one iteration? Also, how much memory is required?

I still have a hard time believing that an iteration can be completed in less than a day.

Kaled.

claus

11:57 am on Dec 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> I find continuous update of PageRank implausible. In order to get a true calculation it requires
>> calculations across the entire dataset of 8 billion pages multiple times.

Nobody ever stated that calculations were done on the full data set. Also, remember that what you call a "true calculation" does not imply one fixed "true" value for a page - instead, it is an approximation of that true value: the calculation comes closer with each iteration, but it will never reach the true value. Even if it did, the underlying data would change along the way. There's a whole body of mathematics dealing with such issues.

Let's say the real issue is something else: how much data and how many iterations do you really need in order to compute values that are "good enough"? Perfect doesn't exist at this scale; it's all tolerances / probabilities / deviations / quality levels / whatever you call it.

So, the task is to minimize the use of resources while maximizing the output quality. That's a pretty standard problem in the field of operations research, and you could probably express it in algebra if you wanted to, but computers tend to use brute force instead, as that's what they're made for :)

>> "Continuously" does not necessarily mean that the final (stable) PR value is calculated for each configuration

Indeed, a continuous calculation implies that there is no final or stable value, as the calculation is always in progress (and new data is most probably fed in continuously as well).

Ideally, we'll see a slower (more continuous, heh) development in rankings for sites instead of the update jumps known from the past.

Still, when major changes occur (such as a doubling of the index size(*)), you would risk that the continuous method couldn't cope with all the new stuff in a satisfactory manner. Then you would perhaps find it worthwhile to do a real full-scale PR calculation, just like in the old days, in order to get a new (set of) reference point(s).

For quality assurance, this should be done once in a while anyway - say, one or two times a year or so - just to make sure that the continuous values don't degenerate too much (e.g. if calculation speeds can't keep up with new data being added, or something).

... come to think of it: if you clean the sand out of your eyes and put away your "great Google conspiracy" glasses (and the 32-bit lenses too) - that's quite an interesting post, huh?

;)
-----
(*) ... which, btw, didn't happen overnight - you just don't do that. It's been proceeding since February 2004 [webmasterworld.com]. What happened overnight was that the figure on the Google front page changed.



Edit: Added answer to original poster.

doc_z

4:02 pm on Dec 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So, approximately, how is the number of calculations related to the number of pages, and how many calculations are currently required to complete one iteration? Also, how much memory is required?

Assuming that the standard iteration scheme (Jacobi) is used, the number of operations per page should be less than 100, yielding fewer than 800 billion operations per iteration. This shouldn't be the problem.

A rough estimate for the memory would be #pages * 18 bytes (2 doubles + 1 integer) - i.e. this is a problem.

However, using block techniques and parallelization should solve that problem.
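For anyone who wants to check the arithmetic, here is the back-of-envelope version (the 1e9 operations/sec machine speed is my own rough assumption for 2004-era hardware, not a figure from this thread):

```python
pages = 8_000_000_000             # index size claimed at the time
ops_per_page = 100                # doc_z's upper bound
total_ops = pages * ops_per_page  # 800 billion operations per iteration

# At ~1e9 simple operations/sec on one machine, a single iteration is
# on the order of minutes of pure CPU time - not the bottleneck:
seconds_single_cpu = total_ops / 1e9          # 800 seconds

# doc_z's memory floor: 18 bytes per page (2 doubles + 1 integer).
bytes_needed = pages * 18
gigabytes = bytes_needed / 2**30              # roughly 134 GB
```

So the raw operation count is cheap, but ~134 GB (before even storing the link graph) was far beyond one 2004 machine, which is why block techniques and parallelization come into it.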

dvduval

4:41 pm on Dec 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok, so it's not a problem, and there are no regular updates to visible pagerank, and google never had problems in previous updates, and there is no sandbox. Now I understand.

kaled

5:14 pm on Dec 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A rough estimate for the memory would be #pages * 18 bytes (2 double + 1 integer)

I would have thought that a database would be required containing, for every page, a list of the links on that page. That would increase the memory requirements considerably. Additionally, I'm not convinced that this problem lends itself readily to distribution (amongst many computers), but parallel processing (many CPUs sharing memory) might be OK, I guess. Also, even if an iteration can be performed quickly, continuously updating the database could be a real problem.

Google likes to use networks of cheap PCs, but this problem seems comparable to global weather modelling to me, and that is typically the domain of the supercomputer.

Kaled.

doc_z

5:28 pm on Dec 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



dvduval, you should read the posts more carefully. Nobody is saying that there are no problems, that there are regular updates to visible PR, or that there is no sandbox.

The point is that there is no evidence that this behaviour is caused by a change to a continuous PR update or by an increase in PR calculation time (due to an increase in the number of pages).

dvduval

5:33 pm on Dec 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sorry, doc_z; I acknowledge your position, but I agree with kaled that more computing power is likely required than your initial hypothesis suggests.

doc_z

6:51 pm on Dec 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The estimate of 800 billion operations assumes that all the data are already stored in memory.

The 18 bytes per page was a very rough lower bound for the memory, because I neglected the transition matrix. Therefore, the memory requirements are even higher, i.e. the problem is even worse.

Database access would heavily slow down the calculation.
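To put a (very rough) number on the neglected transition matrix: assume an average of 10 outlinks per page, each stored as a 5-byte page ID (8 billion pages don't fit in 32-bit IDs, hence the fifth byte). Both figures are my own assumptions, not anything from this thread:

```python
pages = 8_000_000_000
avg_outlinks = 10            # assumed average outlinks per page
bytes_per_link = 5           # 8e9 pages need more than 32-bit IDs

link_store = pages * avg_outlinks * bytes_per_link   # ~400 GB for the graph
pr_vectors = pages * 18                              # the 18-bytes-per-page bound

total_tb = (link_store + pr_vectors) / 2**40         # ~0.5 TB overall
```

Half a terabyte was far beyond the RAM of any single 2004 machine, which is why database access (or partitioning the graph across many machines) becomes the real cost.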

dvduval

10:26 pm on Dec 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Okay, good, so we are in some agreement that there is a strong likelihood that Google is now having difficulty doing a complete PageRank analysis?

I think they are going mostly off old (pre-March 2004) data and just doing their best to infuse it with Freshbot-type data. Alternately, they may be continuing to calculate PageRank on a smaller dataset.

Also evident is that they are probably not calculating PageRank on "supplemental results". I'll bet these supplemental results are post-March 2004 but do not contain enough criteria to be part of a competitive keyword dataset.

Is any of this starting to make sense?

I truly think I am on to something here.

claus

10:40 pm on Dec 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> having difficulty doing a complete PageRank analysis

I'm not too sure about that - the "difficulty" part, that is. As implied in my post above, it's not unreasonable to believe that they could find perfectly good reasons not to perform it, even though they could do it. Still, once in a while they should probably do it anyway.

kaled

1:32 am on Dec 19, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's been said many times but is often forgotten - don't believe toolbar PR.

One page of my site shows toolbar PR 0, yet it's in the Google Directory with a PR of 3. Another page has a toolbar PR of 6 but is in the Google Directory with a PR of 4.

Given that it is widely believed that PR has been devalued over the last year or so, it seems likely that Google is now less concerned with keeping it up to date. Furthermore, the technology to calculate it was presumably designed with monthly dances in mind; therefore, it is likely that iterations take several days or even weeks to complete.

Kaled.