|Why does the 'Google Lag' exist?|
Trying to understand its purpose.
I had some in-depth discussion this weekend with some friends about the sandbox. Every theory on how to beat it kept coming back to one central problem - no one is sure why it exists.
I feel very strongly that until we have a good grasp on why it exists, it will be very hard to beat.
I don't buy the explanation that it's intended to be a method of stopping spam. Why? One, it's doing too much collateral damage. Two, if you accept the 80/20 principle (20% of spammers are doing 80% of the spamming), and you realize that there are already multiple ways of beating the sandbox that all of those spammers are aware of, it doesn't make sense anymore.
So, why does the sandbox exist?
The most obvious effect of the sandbox is that it prevents new domains (not pages) from ranking for any relatively competitive term. So, start thinking like a search engine - what would be the benefit of this?
|Also as I understand it, although related the sandbox and google lag are not the same thing. The sandbox affects pages already in the index, seemingly penalising some of them. The lag is a time period delaying the acceptance of pages not yet in the index |
They are the same thing. Someone at WebmasterWorld doesn't like the term sandbox, so they called it Google Lag.
I don't believe there is a delay in getting pages indexed. The delay is in getting pages ranked.
I got flamed for taking the thread offtopic by asking why someone here doesn't like the term sandbox. And maybe it is offtopic, but perhaps it isn't. (I have no way to judge without having the answer). If WebmasterWorld doesn't like the term sandbox, perhaps the reason has some reflection on the sandbox theory itself. I would know if it did or not, if I knew why the term is frowned upon here. So...why is it frowned upon? Does the reason have anything to do with the theory itself? If not, then I'll shut up about it. :)
Summary of the thread "Why does the Google Lag exist?":
1. The Google lag does exist, of that there is no doubt.
2. The Google lag is not related to sites being indexed. Sites are still being indexed very quickly by Google. This is all about ranking.
3. Why does it exist? No one has yet come up with a theory that lots of people like.
|Does the reason have anything to do with the theory itself? |
No. Now let's drop the name issue and move on.
Google Lag = sandbox
I'm using them as interchangeable terms. Thanks.
|I definitely am not an expert, but I believe the PR-algorithm is heavily based on a 32-bit-hardware architecture. As far as I know, PR is calculated by approximation thru about 100 iterations over the 4.29 billion cross 4.29 billion matrix, which means a huge number of calculations. |
This is correct. I don't know whether they would use a 64-bit integer to expand beyond 4.29 billion for this calculation, or use one extra byte for a total of 5 bytes and mask out the unused bits in the extra byte. If space is the primary consideration, they will go with 5 bytes. In the inverted indexes, the space taken up by the docID is extremely important.
But for the old, classic PageRank calculation, assuming that they haven't abandoned this entirely by now, it's possible they'd go for speed over space. In this case it may be that a 64-bit integer requires fewer CPU cycles than an extra byte with masking.
But the point is that you have increased your CPU cycles for reading and writing the docID either way you do it -- whether you use 64 bits or one extra byte beyond the 32 bits.
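To make the 5-byte-with-masking idea concrete, here is a minimal sketch of how a docID could be packed into 5 bytes and unpacked again. This is purely illustrative of the technique under discussion; it is an assumption for the sake of argument, not Google's actual storage format.

```python
import struct

def pack_docid_5(doc_id: int) -> bytes:
    """Pack a docID into 5 bytes (room for about 1.1 trillion IDs)."""
    assert 0 <= doc_id < 2**40
    # Encode as 8 little-endian bytes and keep only the low 5.
    return struct.pack("<Q", doc_id)[:5]

def unpack_docid_5(raw: bytes) -> int:
    # Pad back up to 8 bytes, then mask off the 24 unused high bits.
    # The padding and masking on every read is the extra CPU work
    # (versus a native 32-bit or 64-bit fetch) that the post describes.
    return struct.unpack("<Q", raw + b"\x00\x00\x00")[0] & (2**40 - 1)

doc_id = 5_000_000_000  # already past the 2**32 (4.29 billion) ceiling
assert unpack_docid_5(pack_docid_5(doc_id)) == doc_id
```

A native 64-bit fetch avoids the pad-and-mask step entirely, which is the speed-versus-space trade-off being weighed above.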
The classic PR calculation, before Google crashed in April 2003, took several days after a crawl of the entire web. That was using the 32-bit integer. How many times do you think they need to read and write the docID during these few days? It's a huge number. Now add extra CPU cycles to every read and write. It's a massive performance hit.
I've long assumed that they blew off the classic PageRank calculation after April 2003. I think it would take weeks instead of days to calculate, as soon as you accommodate numbers above 4.29 billion, assuming that the original formula is used. In fact, there is a huge amount of evidence that the PageRank on new sites is approximated, based on values inherited from before the April 2003 Cassandra crash.
But PageRank is just a Google fetish anyway. You can do perfectly well without that insane, recursive formula using a matrix of the entire web. All you want is a number that indicates page quality that is independent of any search terms relevant to the page. This allows a pre-sort of the inverted indexes, and cuts your access time for filling search requests to about one percent of what it might be otherwise.
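For readers who haven't seen the "insane, recursive formula" in action, here is a toy power-iteration PageRank in the style of the original paper, with a made-up four-page web. The ~100 iterations mentioned earlier in the thread correspond to the `iterations` parameter; the real calculation ran over billions of pages, not four.

```python
def pagerank(links, iterations=100, d=0.85):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)
                for q in outs:
                    new[q] += d * share
            else:
                # Dangling page: spread its rank evenly over the whole web.
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

# Hypothetical toy web: everything ultimately points at C.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

Each iteration touches every link in the graph, which is why, at web scale, every extra CPU cycle spent reading and writing docIDs multiplies into days of compute.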
But then, I've been arguing the 4-byte theory now for 16 months, and all the SEO wags have been steadily denouncing me. I finally gave up. I realized that the SEO wags have to put me down even if they privately agree, because they're in the business of telling people that they know how to predict Google rankings. The "capacity problem" theory gets in their way, and requires that I be denounced.
And another thing: I'm tired of the "+the" argument that shows 5.8 billion. Try allintext, allinurl, allinanchor and allintitle with +the and you get the same 5.8 billion. Anyone who thinks that Google does anything beyond an extremely crude extrapolation for result counts above 1,000 should know that Google has better things to do with their CPU power than to provide accurate counts on the fly for stop words. And even if they aren't extrapolating, isn't it possible that they're counting the main index plus the supplemental index plus the URL-only index plus the "lag" index?
Who cares if the "+the" count is real or not? Reminds me of Clinton, who said that it depends on what the definition of the word "is" is.
Scarecrow, the light blinks on: the question of 64-bit vs. 32-bit made me think of something nobody is talking about: the physical hardware used to run Google. Their old system was built on homemade Linux boxes, running, I'm going to assume, on a 32-bit architecture.
Your point about the overhead involved in going to even a 5-byte system makes enough sense to explain why they have not gone to it yet.
It is now extremely easy to build very reasonably priced 64-bit white-box Linux servers with AMD 64-bit processors, probably for the same or less per box than Google spent building its 32-bit system to begin with. Linux has supported 64-bit processors for a while now, definitely long enough for the technology to have matured enough to implement on a Google-type scale.
|How many times do you think they need to read and write the docID during these few days? It's a huge number. Now add extra CPU cycles to every read and write. It's a massive performance hit. |
With this in mind, let's assume there would be no need to change the calculations done per cycle if they moved up to a full 64-bit system. I'm going to assume this is what Google has been waiting for: a full rebuild of their server farm and an upgrade to a full 64-bit docID. Doing this halfway, to just 5 bytes, would have been silly. Better to hold off, mislead and obfuscate, and keep the process under wraps until the IPO was done, then start work hardcore.
It's only been in the last 18 months that Google has even been able to think about going to all-64-bit hardware, because it's only been that long that Google has been absurdly wealthy.
I'm sure they've considered it by now. Lots of considerations are involved. The main ones are cost, CPU throughput, the bill from the power company, etc.
I have no idea if 64-bit hardware would even be feasible for Google. It would take them some effort to figure it out too. They'd have to see if they can get a pricing break for quantity, they'd have to write new assembly-language library routines for compilers, etc. It's a big project.
If I were Google, I'd consider it smarter to keep the unwashed masses dazzled with my branding power, keep the Wall Street pundits hypnotized with new dog-and-pony shows like Gmail, get the IPO going, maintain the stock price by any means necessary until all the lockups expire, and cash in the options.
Then board your new yacht and sail to your private tropical island. No computers required. At that point you don't even need 32 bits!
I whipped out my calculator:
Assume a 20,000 machine server farm:
Assume SATA 2x80 gig per box.
Assume $500 per box (that's a very high price; they will probably pay much less due to volume buying, and I could do this for $500 or so per box with no volume discounts).
Power [at 300 watts per unit]: probably only 30% higher than their current usage.
Add in an aggressive hiring campaign for top level programmers.
$500 x 20,000 = $10 million.
This is chicken feed. Double the servers, triple them, it's still chicken feed. This is the least difficulty, doing the switchover itself would be the most difficult, obviously.
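A quick back-of-envelope check of the numbers above (all figures are the post's own assumptions, not real Google data):

```python
# Assumed inputs from the post above.
boxes = 20_000
cost_per_box = 500      # dollars; deliberately generous estimate
watts_per_box = 300

total_cost = boxes * cost_per_box
total_power_kw = boxes * watts_per_box / 1000

print(f"Hardware: ${total_cost:,}")             # Hardware: $10,000,000
print(f"Power draw: {total_power_kw:,.0f} kW")  # Power draw: 6,000 kW
```

Even tripling the box count keeps the hardware bill under $30 million, which supports the "chicken feed" conclusion.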
Further, look at MSN: they pushed the MSN release back from roughly January to next June or July. Oddly, Windows is also late with its stable 64-bit OS, as far as I know. Obviously MSN is going to eat MS dogfood, but they also need to be running these 64-bit machines. It all more or less adds up. There is no point in entering the market with a 32-bit system today.
|It would take them some effort to figure it out too. |
Yes, and prices have been plummeting on 64-bit hardware, especially on the processors, and now AMD has a full line, all very stable as far as I know. I would say 'it would take them a while' is correct, except I'd change that too: it is taking them a while. Obviously they couldn't start this during IPO time, but they can now, and equally obviously they were never going to admit what the real situation is.
Google can take off to the Bahamas, they can rebuild their server farm, they can reprogram everything. They did it before with relatively no resources; now it's just a tiny drop in their cash reserves. They can do it all at once, and if they can't, there will be some unemployed Googlers very soon.
Just a possible explanation of ranking vs indexing
If Google uses a 4.29 billion X 4.29 billion matrix for calculating pagerank, it could well be that indexing is not the problem, but there is only space in the matrix to calculate pagerank for 4.29 billion pages.
The other pages in the index all get a nice shiny PR0 :)
No matter how much anchor text you have, you won't get anywhere without a little pagerank!
[quote]No matter how much anchor text you have, you won't get anywhere without a little pagerank![/quote]
Untrue, I have PR0 pages ranking on un-lagged sites, and PR5 pages that are google-lagged that are nowhere.
|Google can take off to the Bahamas, they can rebuild their server farm, they can reprogram everything. They did it before with relatively no resources; now it's just a tiny drop in their cash reserves. They can do it all at once, and if they can't, there will be some unemployed Googlers very soon. |
Your argument is convincing. Okay, let's assume that they're holding off for something big to happen, whether it is 64-bit hardware, or whatever.
Then the question becomes, "Why have the band-aids they've applied to their index in the last 18 months been so pathetic? This alone should have been enough to endanger the IPO!"
If Yahoo can do it without PageRank, why can't Google? (True, they both are overly-dependent on keywords in anchor text.)
If one person at Gigablast can do almost all of the programming for a very respectable engine, why does Google have to rely on cute colored logos to keep everyone impressed?
At the very least, I think there's a management problem at Google and their priorities are messed up. But it's an uphill argument when they're all getting stinkin' rich over there in Mountain View. Maybe after Bubble2.0 we'll be able to figure out what happened.
back to the topic: what and why the sandlag?
first, the things we know about sandlagged pages that seem to be agreed upon:
- sites/pages are in the "index" - can find them using site: and similar queries
- they appear in the serps for non-competitive terms and hardly appear for competitive terms
- no clear pattern at this time when and how sites/pages leave the sandbox.
so how would google accomplish this and produce the above symptoms?
A way would be to use filters and penalties implemented by having "if-new-site" logic in their algorithm. This seems too messy considering that there is an easier alternative.
At this point I bring up "supplementals", not because they are related to the sandbox (that has confused somebody previously), but because the symptoms are very similar to those of supplemental pages: pages are in the index and also appear in the serps for non-competitive terms. So why not use the same technology (i.e. a separate index from the main one) to implement this quarantine of new sites? This would avoid any messy "if" programming. What remains is to figure out when and how sites/pages migrate from this separate index to the main index.
Why would google do this? Quarantine new sites? This is where the bigger controversy is. Some claim it is to fight spam. Some (including me) claim google is out of capacity in its main index. Perhaps if we can answer this question, it will help us figure out what criteria google uses to choose which sites leave the box and get integrated into the main index.
> I've been arguing the 4-byte theory now for 16 months
I bow my head and apologize for not having done any research on that.
> I would say 'it would take them a while' is correct, except I'd change that too: it is taking them a while. Obviously they couldn't start this during IPO time, but they can now, and equally obviously they were never going to admit what the real situation is.
> they can do it all at once
I conclude so far that the four-byte-theory all in all is not too unreasonable. Since in the past we all never knew what the 'real situation' was, why not stop reading tea leaves and proceed to more tactical efforts:
As a matter of fact, most of us webmasters are suffering, more or less heavily, from the PageRank of our new sites not having been updated for three months now. Can you imagine a headline in the Financial Times saying "Google facing serious technical problems" or so? Just an idea to maybe accelerate what is going on.
> Untrue, I have PR0 pages ranking on un-lagged sites, and PR5 pages that are google-lagged that are nowhere.
Maybe, but did you, as I did, watch some of them bounce up and down in ranking almost every hour? This is not what we'd expect from a thoroughly working search engine, is it?
|Why have the band-aids they've applied to their index in the last 18 months been so pathetic? This alone should have been enough to endanger the IPO! |
Band-aids were enough while the press and their supporters didn't bother applying the kinds of critical standards they should have. Google has a cute name and company slogan, and for some reason this made everyone roll over and wave their legs in the air rather than just apply the same standards you apply to any other commercial/corporate entity. But think of the damage it would have done if the press had started printing articles about the algo being maxed out: IPO prices would have dropped dramatically; nobody wants a sick company. Then, if you can implement some algo tweaks to force out enough pages to push webmasters into buying AdWords, you boost income and boost pre-IPO bottom lines, presto. Then you work out the engineering headaches all these hacks created afterwards, which is now.
|back to the topic: what and why the sandlag? |
We didn't leave the topic. The thread topic is why it exists; the sandlag [haha] is a phenomenon that is relatively easily explained by physical limitations on the algo.
|This is not what we'd expect from a thoroughly working search engine, is it? |
No, but it is exactly what I would expect from a holding pattern during a full-on system redo. The example I've given before is when your hard drive is basically full: you start shuffling stuff in and out, waiting to add stuff [this is just an analogy; I'm not saying that google is physically out of storage space, that would be stupid]. Then finally one day you break down and realize that not only is it time for a new hard drive, it's time for a new system altogether, since in the meantime everything is faster and has more capacity. This analogy might be more accurate than we realize. Remember that google runs on the same boxes you run on, more or less; it doesn't use supercomputers, so what you see happen on your own whitebox is, more or less, what is happening on google's. And what's happening now is a move to 64 bit computing on Linux.
> what's happening now is a move to 64 bit computing on Linux.
Seems so. The question is: How long'll it take, and, since this is in the interest of most of us, what if anything can we do to accelerate the process?
I assume you all know this joke: "What does a German do if faced with a red traffic light at three o'clock in the morning? He stops his car!" I hate that!
|The question is: How long'll it take |
Brett posted recently to expect a very big change this fall. We'll see. LOL on the stoplight thing, I saw that too, 2 am, no cars, street 8 feet wide, red no walk light, germans stared at us as if we were criminals when we crossed against the light...
17th Feb. GoogleGuy says "Oh S*** boss, something awful has happened. We reached the 32 bit limit 6 months early. We won't be able to index another page until you put 20 million on the line and then we still don't really know what we're doing, so it might take a loooong time to fix it"
Eric Schmidt: "How the hell did that happen? I'd better call back that pesky guy from Merrill Lynch who's been bugging me for ages about the IPO. By the way GoogleGuy you're demoted. New job working on Google holiday logos" (note: end of GoogleGuy on WebmasterWorld)
Merrill Banker: "You guys better IPO right now. I don't think any of my buddies at the pension funds would want these damaged goods. Let's make it an open auction and rip off the public instead."
Larry and Sergey: "Crap, how are we going to fix this mess? I know, we'll make 2 share classes so we retain a voting majority. That way even when the shareholders get pi**ed we'll be able to stop them firing us"
IPO announced 29th April to a huge sigh of relief at Google HQ.
Oh man, the 32 bit argument has to stop.
You show your ignorance in the biggest way. :)
Any URL is going to be stored using a hash algorithm that they've developed--probably not a very complex one at that. If you take a look at the query string parameters when you click on "Cached" on a Google search result page you'll notice that there's a cache:? entry. Most likely the URL's identifier is the question mark portion.
Google uses letters and numbers in this identifier (26 letters plus 10 digits, ignoring case), which gives 36^12, or 4,738,381,338,321,616,896 possibilities, which should keep them going for a while.
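Checking that arithmetic (the 36-character alphabet is the poster's assumption about the identifier format):

```python
# 12-character identifier over a case-insensitive alphanumeric alphabet.
alphabet = 26 + 10       # letters + digits = 36 symbols
possibilities = alphabet ** 12
print(possibilities)     # 4738381338321616896, i.e. about 4.7 quintillion
```

That is roughly a billion times larger than the 2^32 docID space under debate, which is why the two numbers aren't comparable without knowing what the identifier is actually used for.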
And the point is?
What does a numbering system for cached pages have to do with capacity of the index?
From reading the comments of people who actually understand the problem, the main problem seems not to be capacity, but the increase in processing power required once the jump is made to a larger pagerank matrix.
critter, have you read the original white papers? Give them a read again if it's been a while. They were published not that long before google went fully live, this stuff doesn't get changed yearly. But it does need to get changed.
|entry. Most likely the URL's identifier is the question mark portion. |
I read this argument about a year ago. It wasn't impressive then, and it's not impressive now.
Yes, I've read the original white papers.
Where does it say anything about 32 bit integers in there?
Point is, even *if* there was some 32-bit value for document IDs (dubious), it would take *nothing* to assign a version number to the files, update the ID to a longer value, and update the search and crawl programs to recognize the file version as they serve pages and crawl.
With all due respect, I don't think you can say much about what would take *nothing* to do unless you are sitting at Google and know their architecture.
When you are networking large numbers of computers, factors come into play that normally don't, and that many of us would not even think about (I can't speak for you, of course).
Alright, alright. You're all correct then.
Google probably *does* use 32 bit integer document id's and most likely the entire cluster runs on Commodore 64s.
If there's anyone over at Google reading this thread they're peeing themselves laughing right now.
I'm a convert to the idea that Google is migrating to 64-bit Linux with a 64-bit file system. Presently they have a "virtual" 64-bit file system that involves lots of Ethernet links and distribution networking behind the scenes to go out and fetch the data that make up each chunk. With the addressing power of a real 64-bit system, Google would improve performance all across the system, and quite dramatically. If the cost is as low as isitreal says, then it's a complete no-brainer to migrate toward 64-bit computing. Look what they have to go through with their present 32-bit system:
|"Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunk servers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range." |
"The Google File System," by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Critter, a single inverted index consists of one docID per word per web page. Look at the space required for 4 bytes, assuming an average web page of 300 words. I've multiplied by two here because Google uses an average of two docIDs per word per page. That's because they have both a "fancy" and a "plain" inverted index, and also because the docID is used elsewhere in the system:
4-byte docID: 300 words * 4 billion pages * 4 bytes * 2 = 9.6 terabytes
The first thing that happens when a search is requested is a lookup in the inverted index. To distribute this load, multiple copies of this index probably exist in each data center. Multiply the above by some unknown number.
This is a lot of space. That's just the space issue connected with the docID, whether it's all in memory or all on hard disk. Everything I've ever read about inverted indexes mentions the importance of compression. You cannot compress further and get more than 4.29 billion unique ones and zeros in 4 bytes (32 bits).
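The space estimate above can be reproduced directly (all inputs are the post's stated assumptions):

```python
# Assumptions from the post: average page length, index size, docID width,
# and two docID copies per word ("fancy" plus "plain" inverted indexes).
words_per_page = 300
pages = 4_000_000_000
docid_bytes = 4
copies_per_word = 2

total_bytes = words_per_page * pages * docid_bytes * copies_per_word
print(total_bytes / 1e12, "TB")   # 9.6 TB

# Widening the docID to 5 bytes grows the same structure by 25%:
total_bytes_5 = words_per_page * pages * 5 * copies_per_word
print(total_bytes_5 / 1e12, "TB")  # 12.0 TB
```

So even the modest 5-byte expansion adds 2.4 terabytes per copy of the inverted index, before any replication across data centers.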
Now add the performance issue of the extra CPU cycles to fetch an expanded docID. (We're not even talking about calculating PageRank, because I think Google realized this was dead 18 months ago.)
Moving to 64-bit computing makes a lot of sense. They can define a new 5-byte integer type in the math library if they want to save space. But the point is, a 64-bit CPU could fetch this new type in one pass instead of two, and you don't take a performance hit.
Replicate the inverted index 4 times, distribute it over 2,500 machines and that's 12GB per machine. That's not to say that 2,500 machines isn't a low estimate. There's a lot of distribution at Google, we know that much already. Adding another couple thousand machines so that they can take advantage of a larger DocID isn't much of a stretch. Also it's reasonable to think that the *machine id* that the index is stored on is part of the DocID, further expanding the possibilities.
The lookups have to be painfully slow no matter which way you slice it, because of the numbers you get at the top of the results page (typical values fall in the 0.15 to 0.4 seconds range). The Google searches are so slow they *have* to distribute things around just to keep up with the requests. At 0.25 seconds average for a lookup and the probably 4,000 searches per second they get during peak periods they'll need 1,000 machines just to handle the load.
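The "1,000 machines" figure above is a straightforward concurrency estimate (Little's law): the average latency times the peak request rate gives the number of lookups in flight at once, which is a floor on machine count if each machine serves one lookup at a time.

```python
# Assumed figures from the post, not measured Google data.
avg_latency_s = 0.25        # average lookup time
peak_queries_per_s = 4_000  # assumed peak search rate

in_flight = peak_queries_per_s * avg_latency_s
print(in_flight)  # 1000.0 concurrent lookups
```

In practice each machine can overlap many lookups, so the real fleet sizing depends on per-machine throughput, not just latency.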
In my view a reported 10,000 machines at the plex and elsewhere easily handles a distributed inverted index/repository/etc with a larger DocId.
I don't think it's handling lookups that is the main problem when adding more than 4.2 billion pages to the index. The biggest bottleneck would be calculating pagerank. At the moment, using the 32-bit system, they can store each page ID as a straightforward integer. Adding a workaround to handle more than the 32-bit limit would drastically impair the speed of calculating PR.
Solution to pagerank: Do iterations over time, as things are crawled/stored. Don't do them all at once. Then pages move up and down in the index slowly over time, not all at once in a "dance".
I got a semi-intelligent question:
It seems to me that pagerank, with its "iterations" would be well-suited to calculus, as the pagerank for a particular page or pages clearly would approach a "limit".
Anyone ever done anything with this?
In about 2000, google was at 6000 machines, with I think about 1 billion pages indexed. Obviously harddrives have jumped up in size, so each machine can store more data.
Oh, critter, you really need to go back and reread the thing before making the types of comments you're making. Your memory is playing tricks on you, or you just skimmed over this:
|Our compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits. |
Is there something about that sentence that is unclear? Two types of hits, each 2 bytes. That's 2+2 bytes; that's 4 bytes.
Oh man, you're kidding me right?
The two bytes they're talking about are used for the *position of the word* in the document, and have nothing to do with the DocId.
Furthermore, plain hits or fancy hits have some of their bits used for capitalization and such, so the amount of bits available for position information are further reduced. If you'll read further in the paragraph you quoted me you'll see that there's 12 bits of position information (out of 16) for plain hits, and 8 bits of position information (again, out of 16) for fancy hits.
Hits are included in forward barrels, which are shown in the second figure near the aforementioned paragraph. Forward-barrel records start with a docID and wordID, then are filled with "hits", each of which is in the hit-list format outlined above. The document ID length, you'll notice, is not specified in the paper.
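To make the hit layout concrete, here is a sketch of packing a plain hit into 16 bits as the paper describes: 1 capitalization bit, 3 bits of font size, and 12 bits of word position. The bit ordering here is my own choice for illustration; the paper describes the fields but not their order.

```python
def pack_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
    """Pack a plain hit into 16 bits: [cap:1][font:3][position:12]."""
    assert 0 <= font_size < 7      # font value 7 is reserved to flag fancy hits
    position = min(position, 4095)  # out-of-range positions are clamped
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit: int):
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF

hit = pack_plain_hit(True, 3, 42)
assert hit < 2**16                          # the whole hit fits in two bytes
assert unpack_plain_hit(hit) == (True, 3, 42)
```

Note that the docID is stored once per forward-barrel record, outside this 16-bit structure, which is exactly why the 2-byte hit size says nothing about the docID's width.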
If you don't understand the paper (clearly, you don't) please don't quote from it. :)