---- Why does the 'Google Lag' exist?


Scarecrow - 5:41 pm on Oct 2, 2004 (gmt 0)


I am definitely not an expert, but I believe the PR algorithm is heavily based on a 32-bit hardware architecture. As far as I know, PR is calculated by approximation through about 100 iterations over the 4.29 billion by 4.29 billion matrix, which means a huge number of calculations.
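For anyone who hasn't seen the iteration spelled out, here is a minimal sketch of the textbook PageRank approximation on a four-page toy web. The damping factor of 0.85, the `toy_web` graph, and the function name are illustrative assumptions; nothing here is Google's actual code, and the real matrix is far too large to handle this way.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by repeated iteration.

    links maps each page to the list of pages it links to.
    The damping value and iteration count are illustrative assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform guess

    for _ in range(iterations):
        # Every page receives a small baseline share on each pass.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" collects links from every other page.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

Every pass touches every link once, so the cost per iteration grows with the number of links, and each link is a pair of docIDs that has to be read.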

This is correct. I don't know whether they would use a 64-bit integer to expand beyond 4.29 billion for this calculation, or use one extra byte for a total of 5 bytes, and mask out the bits in the extra byte that aren't used. If space is the primary consideration, they will go with 5 bytes. In the inverted indexes, the space taken up by the docID is extremely important.
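The two options can be sketched roughly as follows. This is only an illustration of the trade-off: the 35-bit ID width, the little-endian layout, and all of the function names are assumptions made for the example, not anything known about Google's storage format.

```python
import struct

# Hypothetical layout: 35 bits of docID, 5 spare bits in the fifth byte.
ID_BITS = 35
ID_MASK = (1 << ID_BITS) - 1


def pack_5_byte(doc_id, flags=0):
    """Store a docID in 5 bytes, with the unused high bits holding flags."""
    if not 0 <= doc_id <= ID_MASK:
        raise ValueError("docID does not fit in ID_BITS")
    if not 0 <= flags < (1 << (40 - ID_BITS)):
        raise ValueError("flags do not fit in the spare bits")
    return ((flags << ID_BITS) | doc_id).to_bytes(5, "little")


def unpack_5_byte(buf):
    """Read a 5-byte docID, masking out the bits that are not part of the ID."""
    return int.from_bytes(buf[:5], "little") & ID_MASK


def pack_8_byte(doc_id):
    """Store a docID as a plain unsigned 64-bit integer: no masking, 3 more bytes."""
    return struct.pack("<Q", doc_id)


def fits_in_32_bits(doc_id):
    """True if the docID can still be stored as an unsigned 32-bit integer."""
    try:
        struct.pack("<I", doc_id)
        return True
    except struct.error:
        return False


big_id = 5_800_000_000  # above the 32-bit ceiling of 4,294,967,295
packed = pack_5_byte(big_id, flags=3)
restored = unpack_5_byte(packed)
```

The 5-byte version saves three bytes per docID but needs the extra mask on every read, while the 8-byte version reads in one step and costs the space.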

But for the old, classic PageRank calculation, assuming that they haven't abandoned this entirely by now, it's possible they'd go for speed over space. In this case it may be that a 64-bit integer requires fewer CPU cycles than an extra byte with masking.

But the point is that you have increased your CPU cycles for reading and writing the docID either way you do it -- whether you use 64 bits or one extra byte beyond the 32 bits.

The classic PR calculation, before Google crashed in April 2003, took several days after a crawl of the entire web. That was using the 32-bit integer. How many times do you think they need to read and write the docID during these few days? It's a huge number. Now add extra CPU cycles to every read and write. It's a massive performance hit.

I've long assumed that they've blown off the classic PageRank calculation since April 2003. I think it would take weeks instead of days to calculate it, as soon as you accommodate numbers above 4.29 billion, assuming that the original formula is used. In fact, there is a huge amount of evidence that the PageRank on new sites is approximated, based on values inherited before the April 2003 Cassandra crash.

But PageRank is just a Google fetish anyway. You can do perfectly well without that insane, recursive formula using a matrix of the entire web. All you want is a number that indicates page quality that is independent of any search terms relevant to the page. This allows a pre-sort of the inverted indexes, and cuts your access time for filling search requests to about one percent of what it might be otherwise.
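To make the pre-sort idea concrete, here is a small sketch under stated assumptions: the `quality` scores, the sample documents, and the function names are all hypothetical, and a real engine would combine several terms and other signals. The point is only that once each posting list is ordered by a query-independent score, the best few results sit at the front and the rest of the list never has to be read.

```python
from itertools import islice


def build_index(docs, quality):
    """Build an inverted index whose posting lists are pre-sorted by quality.

    docs maps docID -> text; quality maps docID -> a query-independent score.
    Both are hypothetical inputs made up for this example.
    """
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    # Sort once at build time, best pages first.
    for postings in index.values():
        postings.sort(key=lambda d: quality[d], reverse=True)
    return index


def top_k(index, term, k):
    """Return the k best docIDs for a term by reading only the list's front."""
    return list(islice(index.get(term, []), k))


docs = {
    1: "google lag theory",
    2: "google index size",
    3: "google pagerank lag",
}
quality = {1: 0.2, 2: 0.9, 3: 0.5}

index = build_index(docs, quality)
best_for_google = top_k(index, "google", 2)
best_for_lag = top_k(index, "lag", 1)
```

Any number that orders pages by quality would work as the sort key here; nothing in the lookup depends on how that number was computed.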

But then, I've been arguing the 4-byte theory now for 16 months, and all the SEO wags have been steadily denouncing me. I finally gave up. I realized that the SEO wags have to put me down even if they privately agree, because they're in the business of telling people that they know how to predict Google rankings. The "capacity problem" theory gets in their way, and requires that I be denounced.

And another thing, I'm tired of the "+the" argument that shows 5.8 billion. Try allintext, allinurl, allinanchor and allintitle with +the and you also get the same 5.8 billion. Anyone who thinks that Google does anything beyond an extremely crude extrapolation for any count above 1000 should know that Google has better things to do with their CPU power than to provide accurate counts on the fly for stop words. And even if they aren't extrapolating, isn't it possible that they're counting the main index plus the supplemental index plus the URL-only index plus the "lag" index?

Who cares if the "+the" count is real or not? Reminds me of Clinton, who said that it depends on what the definition of the word "is" is.


Thread source: http://www.webmasterworld.com/google_archive/25989.htm
Brought to you by WebmasterWorld: http://www.webmasterworld.com