Why does the 'Google Lag' exist?




Trying to understand its purpose.

bakedjake

1:43 am on Sep 29, 2004 (gmt 0)

I had some in-depth discussion this weekend with some friends about the sandbox. Every theory on how to beat it kept coming back to one central problem - no one is sure why it exists.

I feel very strongly that until we have a good grasp on why it exists, it will be very hard to beat.

I don't buy the explanation that it's intended to be a method of stopping spam. Why? One, it is doing too much collateral damage. Two, if you accept the 80/20 principle (20% of spammers are doing 80% of the spamming), and you realize that there are already multiple ways of beating the sandbox that all of those spammers are aware of, the spam explanation doesn't make sense anymore.

So, why does the sandbox exist?

The most obvious effect of the sandbox is that it prevents new domains (not pages) from ranking for any relatively competitive term. So, start thinking like a search engine - what would be the benefit of this?

leveldisc

2:05 pm on Oct 2, 2004 (gmt 0)

Also, as I understand it, although related, the sandbox and Google lag are not the same thing. The sandbox affects pages already in the index, seemingly penalising some of them. The lag is a time period delaying the acceptance of pages not yet in the index.

They are the same thing. Someone at WebmasterWorld doesn't like the term sandbox so called it google lag.

I don't believe there is a delay in getting pages indexed. The delay is in getting pages ranked.

BeeDeeDubbleU

2:41 pm on Oct 2, 2004 (gmt 0)

Full stop!

dazzlindonna

2:50 pm on Oct 2, 2004 (gmt 0)

I got flamed for taking the thread offtopic by asking why someone here doesn't like the term sandbox. And maybe it is offtopic, but perhaps it isn't. (I have no way to judge without having the answer). If WebmasterWorld doesn't like the term sandbox, perhaps the reason has some reflection on the sandbox theory itself. I would know if it did or not, if I knew why the term is frowned upon here. So...why is it frowned upon? Does the reason have anything to do with the theory itself? If not, then I'll shut up about it. :)

BeeDeeDubbleU

3:07 pm on Oct 2, 2004 (gmt 0)

Summary of the thread "Why does the Google Lag exist?" ...

1. The Google lag does exist, of that there is no doubt.

2. The Google lag is not related to sites being indexed. Sites are still being indexed very quickly by Google. This is all about ranking.

3. Why does it exist? No one has yet come up with a theory that lots of people like.

bakedjake

3:58 pm on Oct 2, 2004 (gmt 0)

Does the reason have anything to do with the theory itself?

No. Now let's drop the name issue and move on.

Google Lag = sandbox

I'm using them as interchangeable terms. Thanks.

Scarecrow

5:41 pm on Oct 2, 2004 (gmt 0)

I definitely am not an expert, but I believe the PR algorithm is heavily based on a 32-bit hardware architecture. As far as I know, PR is calculated by approximation through about 100 iterations over the 4.29 billion by 4.29 billion matrix, which means a huge number of calculations.

This is correct. I don't know whether they would use a 64-bit integer to expand beyond 4.29 billion for this calculation, or use one extra byte for a total of 5 bytes, and mask out the bits in the extra byte that aren't used. If space is the primary consideration, they will go with 5 bytes. In the inverted indexes, the space taken up by the docID is extremely important.

But for the old, classic PageRank calculation, assuming that they haven't abandoned this entirely by now, it's possible they'd go for speed over space. In this case it may be that a 64-bit integer requires fewer CPU cycles than an extra byte with masking.

But the point is that you have increased your CPU cycles for reading and writing the docID either way you do it -- whether you use 64 bits or one extra byte beyond the 32 bits.
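To make the space trade-off concrete, here is a minimal Python sketch. The helper names (`pack_docid_5`, `unpack_docid_5`, `pack_docid_8`) are hypothetical and say nothing about how Google actually stores docIDs; the sketch only shows where the 4.29 billion ceiling comes from and what a 5-byte ID with masking looks like next to a plain 64-bit one.

```python
import struct

# A 32-bit unsigned integer can label 2**32 documents: the "4.29 billion" ceiling.
MAX_32 = 2**32
MASK_40 = 2**40 - 1  # keeps only the 40 bits a 5-byte docID can hold


def pack_docid_5(docid: int) -> bytes:
    """Store a docID in 5 bytes (40 bits), little-endian."""
    if not 0 <= docid <= MASK_40:
        raise ValueError("docID does not fit in 5 bytes")
    return docid.to_bytes(5, "little")


def unpack_docid_5(raw: bytes) -> int:
    """Read a 5-byte docID: pad to 8 bytes, load as 64-bit, mask off the rest."""
    (value,) = struct.unpack("<Q", raw + b"\x00\x00\x00")
    return value & MASK_40


def pack_docid_8(docid: int) -> bytes:
    """Store the same docID as a plain 64-bit integer."""
    return struct.pack("<Q", docid)


first_id_past_32_bits = MAX_32  # does not fit in 4 bytes
five = pack_docid_5(first_id_past_32_bits)
eight = pack_docid_8(first_id_past_32_bits)
print(len(five), len(eight), unpack_docid_5(five) == first_id_past_32_bits)
```

The 5-byte form saves three bytes per stored ID but needs the extra pad-and-mask step on every read, which is the speed-versus-space choice described above. Whether that step is actually slower than a native 64-bit load depends on the hardware, and the sketch does not measure it.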

The classic PR calculation, before Google crashed in April 2003, took several days after a crawl of the entire web. That was using the 32-bit integer. How many times do you think they need to read and write the docID during these few days? It's a huge number. Now add extra CPU cycles to every read and write. It's a massive performance hit.
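As a rough illustration of why the docID is touched so often, here is a toy power-iteration PageRank in Python. It is a sketch under assumptions: the damping factor of 0.85 and the handling of pages with no outlinks follow the commonly published formulation, not anything confirmed about Google's production code, and the `docid_reads` counter is a hypothetical stand-in for memory traffic.

```python
def pagerank(links, iterations=100, damping=0.85):
    """links: {doc_id: [doc_ids it links to]}. Returns (ranks, docid_reads)."""
    nodes = set(links) | {dst for targets in links.values() for dst in targets}
    n = len(nodes)
    ranks = {doc: 1.0 / n for doc in nodes}
    docid_reads = 0  # hypothetical counter: one per docID touched

    for _ in range(iterations):
        new_ranks = {doc: (1.0 - damping) / n for doc in nodes}
        dangling = 0.0
        for src in nodes:
            targets = links.get(src, [])
            docid_reads += 1 + len(targets)  # the source ID plus every target ID
            if not targets:
                dangling += ranks[src]  # page with no outlinks
                continue
            share = damping * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        # Spread the rank held by dangling pages evenly over all pages.
        for doc in nodes:
            new_ranks[doc] += damping * dangling / n
        ranks = new_ranks
    return ranks, docid_reads


toy_web = {1: [2], 2: [1, 3], 3: [1]}
ranks, reads = pagerank(toy_web)
print(max(ranks, key=ranks.get), reads)
```

On this three-page toy web, each iteration touches 3 page IDs plus 4 link targets, so 100 iterations mean 700 docID reads. Scale the same loop to billions of pages and links and any extra cost per read is multiplied accordingly.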

I've long assumed that they blew off the classic PageRank calculation ever since April 2003. I think it would take weeks instead of days to calculate it, as soon as you accommodate numbers above 4.29 billion, assuming that the original formula is used. In fact, there is a huge amount of evidence that the PageRank on new sites is approximated, based on values inherited before the April 2003 Cassandra crash.

But PageRank is just a Google fetish anyway. You can do perfectly well without that insane, recursive formula using a matrix of the entire web. All you want is a number that indicates page quality that is independent of any search terms relevant to the page. This allows a pre-sort of the inverted indexes, and cuts your access time for filling search requests to about one percent of what it might be otherwise.
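The pre-sort idea can be sketched in a few lines of Python. This is an illustration only: the `quality` scores, the function names, and the sample documents are all made up, and a real engine would use far more compact structures. The point is that if every posting list is already ordered by a query-independent score, a query can stop as soon as it has enough hits.

```python
from collections import defaultdict


def build_index(docs, quality):
    """docs: {doc_id: text}; quality: {doc_id: query-independent score}."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Pre-sort every posting list once, best quality first.
    return {
        term: sorted(ids, key=lambda d: (-quality[d], d))
        for term, ids in postings.items()
    }


def top_k(index, terms, k):
    """Up to k docs containing all terms, in quality order, stopping early."""
    lists = [index.get(term.lower(), []) for term in terms]
    if not lists or any(not plist for plist in lists):
        return []
    lists.sort(key=len)  # walk the shortest list
    others = [set(plist) for plist in lists[1:]]
    hits = []
    for doc_id in lists[0]:
        if all(doc_id in other for other in others):
            hits.append(doc_id)
            if len(hits) == k:
                break  # no need to look at lower-quality documents
    return hits


docs = {1: "google lag sandbox", 2: "google sandbox", 3: "google lag"}
quality = {1: 0.2, 2: 0.9, 3: 0.5}
index = build_index(docs, quality)
print(index["google"], top_k(index, ["google", "lag"], 1))
```

For brevity the sketch builds lookup sets for the other posting lists on each query, which a production system would avoid; the early `break` is the part that corresponds to the reduced access time described above.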

But then, I've been arguing the 4-byte theory now for 16 months, and all the SEO wags have been steadily denouncing me. I finally gave up. I realized that the SEO wags have to put me down even if they privately agree, because they're in the business of telling people that they know how to predict Google rankings. The "capacity problem" theory gets in their way, and requires that I be denounced.

And another thing, I'm tired of the "+the" argument that shows 5.8 billion. Try allintext, allinurl, allinanchor and allintitle with +the and you also get the same 5.8 billion. Anyone who thinks that Google does anything beyond an extremely crude extrapolation for counts above 1000 should know that Google has better things to do with their CPU power than to provide accurate counts on the fly for stop words. And even if they aren't extrapolating, isn't it possible that they're counting the main index plus the supplemental index plus the URL-only index plus the "lag" index?

Who cares if the "+the" count is real or not? Reminds me of Clinton, who said that it depends on what the definition of the word "is" is.

isitreal

6:26 pm on Oct 2, 2004 (gmt 0)

Scarecrow, the light blinks on: the question of 64 bit vs 32 bit made me think of something. Nobody is talking about the physical hardware used to run Google. Their old system was built on homemade Linux boxes, running, I'm going to assume, on a 32 bit architecture.

Your point on the overhead involved in going to even a 5 Byte system makes enough sense to explain why they have not gone to it yet.

However....

It is now extremely easy to build very reasonably priced 64 bit white box servers running Linux, running AMD 64 bit processors, for probably the same or less per box than google spent building their 32bit system to begin with. Linux has supported 64 bit processors for a while now, definitely long enough for the technology to have become mature enough to implement on a google type scale.

How many times do you think they need to read and write the docID during these few days? It's a huge number. Now add extra CPU cycles to every read and write. It's a massive performance hit.

With this in mind, let's assume that there will be no need to change calculations done per cycle if they move up to a full 64 bit system. I'm going to assume that this is what Google has been waiting for: a full rebuild of their server farm and an upgrade to a full 64 bit docID. Doing this halfway, to just 5 bytes, would have been silly. Better to hold off, mislead and obfuscate, and keep this process under wraps until the IPO was done, then start work hardcore.

Scarecrow

6:47 pm on Oct 2, 2004 (gmt 0)

It's only been in the last 18 months that Google has even been able to think about going to all-64-bit hardware, because it's only been that long that Google has been absurdly wealthy.

I'm sure they've considered it by now. Lots of considerations are involved. The main ones are cost, CPU throughput, the bill from the power company, etc.

I have no idea if 64-bit hardware would even be feasible for Google. It would take them some effort to figure it out too. They'd have to see if they can get a pricing break for quantity, they'd have to write new assembly-language library routines for compilers, etc. It's a big project.

If I were Google, I'd consider it smarter to keep the unwashed masses dazzled with my branding power, keep the Wall Street pundits hypnotized with new dog-and-pony shows like Gmail, get the IPO going, maintain the stock price by any means necessary until all the lockups expire, and cash in the options.

Then board your new yacht and sail to your private tropical island. No computers required. At that point you don't even need 32 bits!

isitreal

6:56 pm on Oct 2, 2004 (gmt 0)

I whipped out my calculator:

Assume a 20,000 machine server farm:
Assume SATA 2x80 gig per box.
Assume Linux
Assume $500 per box (that's a very high price; they will probably pay much less due to volume buying. I could do this for $500 or so per box with no volume buying).

Power [at 300 watts per unit]: probably only 30% higher than their current usage.

Add in an aggressive hiring campaign for top level programmers.

500x20,000 = 10 million dollars.
This is chicken feed. Double the servers, triple them, it's still chicken feed. The hardware is the least of the difficulties; doing the switchover itself would obviously be the hardest part.
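The back-of-the-envelope numbers above can be checked with a short Python sketch. The inputs are the assumptions stated in this post (20,000 boxes, $500 per box, 300 watts per unit, 2 x 80 GB of disk); the function names are hypothetical, and nothing here reflects Google's real costs.

```python
BOXES = 20_000
COST_PER_BOX_USD = 500
WATTS_PER_BOX = 300
DISK_GB_PER_BOX = 2 * 80


def hardware_cost(boxes, cost_per_box):
    """Total purchase price in dollars."""
    return boxes * cost_per_box


def power_megawatts(boxes, watts_per_box):
    """Total electrical draw in megawatts."""
    return boxes * watts_per_box / 1_000_000


def storage_terabytes(boxes, gb_per_box):
    """Total raw disk in terabytes (1 TB taken as 1000 GB)."""
    return boxes * gb_per_box / 1000


for multiple in (1, 2, 3):  # the farm as stated, doubled, and tripled
    boxes = BOXES * multiple
    print(
        boxes,
        hardware_cost(boxes, COST_PER_BOX_USD),
        power_megawatts(boxes, WATTS_PER_BOX),
        storage_terabytes(boxes, DISK_GB_PER_BOX),
    )
```

Under these assumptions the base farm costs $10 million, draws 6 MW, and holds 3,200 TB of raw disk; tripling it brings the hardware bill to $30 million.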

Further, look at MSN: they put off the MSN release from roughly January until next June or July. Oddly, Windows is also late on their 64 bit stable OS as far as I know. Obviously MSN is going to eat MS dogfood, but they also need to be running these 64 bit machines. It all more or less adds up. There is no point in entering the market with a 32 bit system today.

It would take them some effort to figure it out too.

Yes, and prices have been plummeting on 64 bit hardware, especially on the processors, but now AMD has a full line, all very stable as far as I know. I would say 'it would take them a while' is correct, except I'd change that too: it is taking them a while. Obviously they couldn't start this during IPO time, but they can now, and equally obviously they were never going to admit what the real situation is.

Google can take off to the bahamas, they can rebuild their server farm, they can reprogram everything, they did it before with relatively no resources, now it's just a tiny drop in their cash reserves; they can do it all at once, and if they can't, there will be some unemployed googlers very soon.

SlyOldDog

7:17 pm on Oct 2, 2004 (gmt 0)

Just a possible explanation of ranking vs indexing

If Google uses a 4.29 billion X 4.29 billion matrix for calculating pagerank, it could well be that indexing is not the problem, but there is only space in the matrix to calculate pagerank for 4.29 billion pages.

The other pages in the index all get a nice shiny PR0 :)

No matter how much anchor text you have, you won't get anywhere without a little pagerank!
