|Why does the 'Google Lag' exist?|
Trying to understand its purpose.
I had some in-depth discussion this weekend with some friends about the sandbox. Every theory on how to beat it kept coming back to one central problem - no one is sure why it exists.
I feel very strongly that until we have a good grasp on why it exists, it will be very hard to beat.
I don't buy the explanation that it's intended to be a method of stopping spam. Why? One, it does too much collateral damage. Two, if you accept the 80/20 principle (20% of spammers are doing 80% of the spamming), and you realize that there are already multiple ways of beating the sandbox that all of those spammers are aware of, it doesn't make sense anymore.
So, why does the sandbox exist?
The most obvious effect of the sandbox is that it prevents new domains (not pages) from ranking for any relatively competitive term. So, start thinking like a search engine - what would be the benefit of this?
<<< How will looking at an operating system change my perspective on a search engine?
I assume if you are the mod for the linux forums you have a certain amount of scepticism about anything MS says or has said about its Windows products? That's exactly how I would look at anything google says or has said about what it does or how it does it, or why.
Re the sandbox, lag, penalty etc, yes, that's what we're both talking about largely here. Why it exists, and all that. It's very odd behavior. Good also to see the term more precisely defined, it's not just commercial terms though, it's a much wider range than that from what I see. Unless commercial just means x number of results returned? Hard to say. Is it a capacity problem, is it a ranking problem, is ranking being used to deal with a capacity problem, is a capacity problem causing a big glitch in ranking, which is being called a 'lag'? Hard to say. But not hard to call it a problem.
Imagine this: MS releases their new Longhorn. But you can't install any new software on it until the software is 6 months old. That's to thwart potential security holes, or whatever. Paint this picture for any tech company other than Google and you can see how absurd the business model is. Google still is getting a free pass though.
Nice to see that at least a few here have been able to hack this latest version, though I'm not positive that all they did was prelink the domain or something.
Here is where the capacity theory first appeared. Remember the guy joined WebmasterWorld specifically to post the message.
His 1 year timeline matches the lag we see now.
GoogleGuy strenuously denied his comments.
Looks like a leak to me.
|Google still is getting a free pass though. |
Free pass? Just because they might have a policy on holding back new websites?
I started a new site at the end of May 2004. At least Google has all pages in its index and spiders it daily - that gives me some comfort. Yahoo is still showing pages that are now nearly three months old - I know that because I changed the page structure in late July. And Yahoo has about 1/8th of the website in its database even though it Slurps down pages daily.
I've got exactly 1 page in MSN - and I don't care where they get their database from. Ask has my home page as does Wisenut. So tell me how Google is getting a free pass? They are STILL the best engine out there.
If there is no update soon, and this thread continues much longer, we may actually fill up google's page capacity despite what everybody says ;-)
>Looks like a leak to me.
If that wasn't a leak it was a very odd hoax. What is so odd about that post is the specificity of the details. While he says at the end this is just a guess, his post includes things like "They now considering reconstruction of the data tables which involves expanding ID fields to 5 bytes." That isn't consistent with a guess; he could only know that if he had an inside source. And I'd think a hoaxer would make the problem seem more urgent, rather than say the problem will take a while to become evident. Very curious that if this was a hoax, his theory is consistent with what we are seeing now. Google had to do something temporarily about this problem, and that was create the sandbox. The new URLs they decided not to index were mostly new sites.
|His 1 year timeline matches the lag we see now. |
Huh. It sure does.
If they removed even part of the worthless cranked out duplicate swill, the index would probably be no more than 3/4 the size it is now.
Well, I expect dupe filter 2.0 any day now Marcia. We're already seeing parts of it with slow death and this past weekend's update.
But I still don't believe the capacity issues, guys. I just think Google is smart enough to see something like that coming.
But maybe I am giving them too much credit.
>If they removed even part of the worthless cranked out duplicate swill, the index would probably be no more than 3/4 the size it is now.
If Google could easily remove that worthless cranked out duplicate swill, no doubt they would even if they can index 20 times the number of pages that they currently do. And on the theory that they have hit a page indexing limit, Google may have decided to go slow upgrading once they realized that one consequence of going slow, the sandbox, would help keep down the amount of worthless cranked out duplicate swill. If the searcher is looking to buy a widget, there will still be lots of sites he can find selling widgets even if new sites are sandboxed. And for pure informational searches, how many useful pages are there out there on new sites where similar information can't also be found on old sites? Yeah, if I develop a cure for all forms of cancer and put that on a new site, it won't be findable in Google. However, is this a scenario that happens significantly often on the web? The sandbox probably is the most effective way of dealing with worthless cranked out duplicate swill. If Google is quickly indexing new sites, spammers will crank them out faster than Google can identify them and whack 'em.
The sandbox exists for some reason. If it isn't because Google is limited in the number of pages they can index, then that means Google intentionally for some other reason decided to limit the size of the index. There would only be 2 reasons to do this when it wasn't necessary. #1) To fight spam; and/or #2) To give an incentive for new sites to buy Adwords.
>But I still don't believe the capacity issues, guys. I just think Google is smart enough to see something like that coming.
>But maybe I am giving them too much credit.
Or maybe Google did in fact see it coming, and decided to go slow in upgrading because they considered that keeping the size of the index down was in their best interest. Let's assume there really is a capacity issue, and Google has already hit it. Does anyone have data that Google's share of the market has been declining ever since the sandbox effect hit? If not, then Google doesn't have a problem.
©2004 Google - Searching 4,285,199,774 web pages
Pretty close to the index limit size that poster claimed.
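For anyone who wants to check the arithmetic behind "pretty close", here is a minimal sketch. It assumes, as the re5earcher post claimed, that every page is labelled with a 4-byte unsigned ID; nothing in this thread confirms that Google actually works this way, and the names `ID_BYTES`, `MAX_IDS` and `headroom` are made up for illustration.

```python
# Assumption (from the re5earcher post, unconfirmed): one 4-byte unsigned ID per page.
ID_BYTES = 4
MAX_IDS = 2 ** (8 * ID_BYTES)      # 4,294,967,296 distinct IDs in 32 bits
FOOTER_COUNT = 4_285_199_774       # page count from the Google footer quoted above


def headroom(count, limit=MAX_IDS):
    """Return (IDs still unused, fraction of the ID space already used)."""
    return limit - count, count / limit


left, used = headroom(FOOTER_COUNT)
print(f"{left:,} IDs left, {used:.2%} of the ID space used")
# prints: 9,767,522 IDs left, 99.77% of the ID space used
```

Under that assumption the footer count sits within about ten million pages of the 32-bit ceiling. That is consistent with the capacity theory, but it does not prove it; the published count could sit near that figure for other reasons.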
I love GoogleGuy's reaction to the re5earcher "leak."
His first reaction, the same day (June 7, 2003), was "Did anyone catch the IP address of that masked re5earcher? ;) (just kidding)"
Another reaction, on June 16, was:
|One cautionary word of advice: take everything with a grain of salt, and make choices that are common sense to you and work well for your users. For example, there was recently a thread that suggested Google was running out of "address space" to label our documents. I was talking to another engineer here and he said he almost fell out of his chair laughing when he read that. So there's always a lot of theories floating around all the time about why something is this way or that. My advice is to assume that Google wants the most useful, relevant pages to come up first for searchers. Try to build those useful, relevant pages as well as you can, and we'll do our best to find them and rank them accurately for searches. |
I believe that Google got re5earcher's IP address and had a friendly chat with him. On June 14 re5earcher answered a sticky and said,
"hehe, tell them it was a hoax, nothing more :)
and i'm not a google employee :)"
But then, I'm not sure that it was really re5earcher behind the sticky at that point.
<<His 1 year timeline matches the lag we see now>>
>I believe that Google got re5earcher's IP address and had a friendly chat with him.
Or GoogleGuy knew this was a genuine leak, and wanted to plug it and stop even more from leaking out. The problem here is that if this was a genuine leak, GoogleGuy would be expected to say it was just BS. And if this was a hoax, GoogleGuy would also say this was just BS. So we can infer nothing by how GG responded.
I must say I do find it curious that what re5earcher in that post predicted seems to have come true. Of course, coincidences happen all the time.
One year lag time? Why even bother using Google then. I'd prefer a search engine that provides the best results on the web, not the best results of a year ago.
If you look at all of re5earcher's posts the language/tone is really inconsistent. The one referenced above is pretty coherent, and with no underscores.
post #1 here isn't, he misspells "algirithm" and uses lots of underscores.
msg #3 more underscores
skip down to message #14 no underscores in sight and he is back to being pretty coherent.
#18 the underscores return
Was re5earcher more than one person?
Hey, back in the 80's a lot of programmers thought they were being quite clever by using 2 digits to represent years to save some space. Surely *someone* would have plenty of time to fix the problem before Y2K.
Now, apply this to the tools google uses by considering the stable released versions. They are 32 bit tools in native form.
The machines and perforce the os that google uses are 32 bit. Therefore, the tools such as compilers and script engines are also 32 bit.
Now, even *if* they use 64 bit routines, consider that every 64 bit access is actually *2* accesses to the data bus. Consider, then, that the CPU idles for *multiple* wait states during data bus accesses on cache misses, and you have a massive slowdown in doing calculations if you move to > 32 bits, both in data access times and longer code paths.
So, even if they have designed 64 bit workarounds, it remains a workaround. And, as long as they stay on 32 bit boxes, the backend calculation times cannot help but increase on any attempt to move beyond 32 bits. The pipe is only so wide.
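To make the "workaround" idea concrete, here is a rough sketch of how a 64-bit addition can be done when only 32-bit words are available: the value is split into a low and a high word, and the carry is passed from one to the other. Python integers are arbitrary precision, so the 32-bit words are simulated with masks. The function name `add64_with_32bit_words` is hypothetical, and this says nothing about how Google's code actually handles wide values or what the real cost is on their hardware.

```python
MASK32 = 0xFFFFFFFF  # keeps a value inside one simulated 32-bit word


def add64_with_32bit_words(a, b):
    """Add two unsigned 64-bit values using only 32-bit-wide pieces."""
    # Split each operand into a low and a high 32-bit word (two accesses each).
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32

    # Add the low words first and keep whatever overflows as a carry.
    lo = a_lo + b_lo
    carry = lo >> 32
    lo &= MASK32

    # Add the high words plus the carry; wrap around at 64 bits.
    hi = (a_hi + b_hi + carry) & MASK32
    return (hi << 32) | lo


print(hex(add64_with_32bit_words(0xFFFFFFFF, 1)))  # carry moves into the high word
```

One wide addition becomes two narrow additions plus the carry handling, which is the "longer code path" being described. How much that costs in practice depends on the hardware and the compiler.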
The 32 bit limit is immutable in the hardware.
Is it causing a problem? From here it seems to be a reasonable presumption.
As part of their *heritage* google as a matter of image will not move away from their chosen operating system and duct taped white boxes.
The awaited MS search is not hobbled by such considerations. As a matter of fact, it becomes a showcase opportunity.
It's funny if you do a search for re5earcher in G that it got a decent bit of notice (his hoax(?) post).
>Was re5earcher more than one person?
The posts are similar enough they could plausibly be from the same author. The only one that seems somewhat inconsistent is the first one about Google running out of data indexing capacity. My interpretation of this is one of the following:
#1) The first post was written by someone in Google.
#2) The first post was written by someone else who sent it to him, and this person claimed to have a source inside Google.
Note he apparently contradicts himself in the first post. At the end it says "[just a guess but who knows]". Notice the use of brackets there. This would only make sense if he was tacking that on himself to a communication he received from another. If these were all his words, no need to bracket that. Also, he contradicts that this is just a guess by stating above:
"They now considering reconstruction of the data tables which involves expanding ID fields to 5 bytes."
How could he know the expansion would be to 5 bytes, and not say 6?
"This procedure will require 1000 new page index servers and additional storage for temporary tables.
"They are hoping to make this change gradually server by server.
"The completion of the process will take up to one year after that the main URL index will be switched to use 5 bytes ID."
If he just guessed they were running out of data indexing capacity, how could he know that it would take specifically 1000 new page indexing servers, that Google was hoping to make this change gradually, and that this would take up to one year to do? He couldn't. Thus the most reasonable explanation is that he wasn't guessing, but instead this was sent to him by someone else and he decided to run it by the people here for analysis.
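On the "5 bytes, and not say 6" question, it may help to see what each ID width would buy. This is only a sketch of the raw arithmetic for fixed-width unsigned IDs; the helper name `id_capacity` is hypothetical, and the assumption that Google stores one fixed-width ID per URL comes from the quoted post, not from any confirmed source.

```python
def id_capacity(num_bytes):
    """Number of distinct unsigned IDs that fit in num_bytes bytes."""
    return 2 ** (8 * num_bytes)


for width in (4, 5, 6):
    capacity = id_capacity(width)
    growth = capacity // id_capacity(4)
    print(f"{width} bytes: {capacity:,} IDs ({growth:,}x the 4-byte space)")
# prints:
# 4 bytes: 4,294,967,296 IDs (1x the 4-byte space)
# 5 bytes: 1,099,511,627,776 IDs (256x the 4-byte space)
# 6 bytes: 281,474,976,710,656 IDs (65,536x the 4-byte space)
```

A single extra byte already gives 256 times the room, so 5 bytes is a plausible engineering choice when storage per ID matters. That makes the detail believable, but it does not settle whether it came from an insider or from someone doing the same arithmetic.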
>>>While he says at the end this is just a guess, his post includes things like "They now considering reconstruction of the data tables which involves expanding ID fields to 5 bytes." That isn't consistent with a guess; he could only know that if he had an inside source. <<<
Originally he stated that the info was from an inside source and then later edited it to "this is just a guess".
|They are hoping to make this change gradually server by server. |
Could explain all the datacenters that have disappeared over time...
The re5earcher post was edited by the original author - it appears he originally claimed it was from "inside sources" as the first reply asked who were inside sources.
I also think it was interesting that when people were asking for Googleguy to respond, Brett mentioned that the information was "mission critical" or something along those lines - insinuating there may actually be truth to it.
I may be naive, but I don't believe Googleguy would intentionally lie if the whole thing were true (I honestly believe the founders want to stick to the "don't be evil" mantra). While I'm not a programmer, I don't see what the big deal is that Google has to adapt as the web grows and technology evolves. In other words, if it were true, Googleguy could have remained silent, or could have even confirmed that there is truth that the next algorithms will be dealing with limitations of a 32 bit system. Would it really have somehow hurt Google for people to know they had a capacity issue?
Interestingly, reading this re5earcher's other post about helping Google develop a better algorithm, Googleguy responded about the person's suggestion referring to how people could spam. One suggestion (referring to re5earcher's proposed algo) involved people buying up new domains as a way to deal with old domains being penalized.
I personally believe that the Florida update would have encouraged people to buy new domains as a way to spam the algo - and Google put a stop to it with the sandbox. I can't think of how many times I've seen people comment, when talking about penalized domains, that the solution is "get a new domain and start over" - well, the sandbox would put a stop to that - and I think that is why Google is doing it.
>Originally he stated that the info was from an inside source and then later edited it to "this is just a guess"
Ahh...this explains why the person in msg #2 wrote: "Who are internal sources?" re5earcher realized he posted something he shouldn't have. However, note that re5earcher obviously doubted the believability of that source. Why else the subject of "I think google reached its ID capacity limit?" Surely no big shot at Google would leak something like this. It would have to be someone much lower. Perhaps sufficiently low down that maybe they didn't know the full truth.
>Would it really have somehow hurt Google for people to know they had a capacity issue?
Yep. Admitting problems isn't the way to go with an IPO on the horizon.
>I personally believe that the Florida update would have encouraged people to buy new domains as a way to spam the algo - and Google put a stop to it with the sandbox. I can't think of how many times I've seen people comment, when talking about penalized domains, that the solution is "get a new domain and start over" - well, the sandbox would put a stop to that - and I think that is why Google is doing it.
And that would be consistent with GoogleGuy denying they had a data indexing capacity problem. If GG knew the big bosses had approved the sandbox already, then he'd also know that current data indexing capacity would be more than adequate for Google's needs until such time as they could upgrade their systems.
Suppose you had a 1200 page website, but for some technological reason only 1000 pages could appear on the Internet at one time. What would you do? Not publish the newest 200 pages, or not publish what you consider the 200 worst/weakest pages? If Google has a capacity problem, why not just remove all PR0 pages from the index?
This whole supplemental pages phenomenon is weird, the lag time phenomenon is weird, the backlinks and toolbar PR choices are weird. Honestly now, completely leaving aside the quality of the ranked serps, isn't just about everything Google is up to these days just plain weird? Even if you would acknowledge a capacity problem, there would seem a far better arbitrary way to deal with it than what they are doing.
|isn't just about everything Google is up to these days just plain weird? Even if you would acknowledge a capacity problem, there would seem a far better arbitrary way to deal with it than what they are doing. |
Capacity problem + management problem + shifting priorities (from algorithms to ad revenue) = weirdness in the main index
I think the whole thing about re5earcher is very interesting, great find SlyOldDog. The information from re5earcher seems to carry weight or at least ruffle feathers. The fact that GoogleGuy even reviews the merits or problems of "Re5earcherRank" in detail is telling. There is lots of stuff that is bantered about here at WW that GoogleGuy simply ignores.
I would like to share some observations related to the sandbox effect. I am hoping others will share similar observations - perhaps we can find out more about the sandbox by sharing observations.
From my experience, there has only been one time when sandboxed sites were allowed out of the sandbox. This occurred about May or June of this year. Below is my experience on this and why I feel this way.
I helped put out a website in about February or March of this year. This site was indexed and displayed classic sandbox behaviour. It was in the index, for example 'site:' showed all its pages but the site could not rank for anything it should easily have ranked for. For example, a search for the site name put the site on the second page of results - after about 15 other sites which just linked to the sandboxed site.
The site was a small website built by hand promoting the services of a professional in a local market. The site was clean, it had only on-topic links in addition to a DMOZ link. The links were gained quickly but quite naturally.
At the time, the sandbox phenomenon was quite new. When I read here at WW that many others were seeing the same thing happening to their sites, I was at ease. I realized that this was not my doing but something larger at play.
Then in about May or June, (not exactly sure when), I remember reading a thread here at WW that said sites in the sandbox were allowed in. Sure enough, when I checked the site I helped put out, all was fine. The site had good rankings, basically top 10's for the search terms it was optimized for.
As I remember, there was no one saying that their sites were not allowed in at that time so I assumed that all sandbox sites were allowed in.
Do others agree that there was generally only one period in which sites were allowed in from the sandbox? If so, were all sites allowed in or just a few? Have there been other times when sites were allowed in?
If all sites were allowed in at intervals, this would tend to indicate to me that this is more of a capacity issue than it is a spam fighting issue.
|I may be naive, but I don't believe Googleguy would intentionally lie |
"I did not have sex with that woman."
|There is lots of stuff that is bantered about here at WW that GoogleGuy simply ignores. |
Yes - everything lately. Didn't his absence also coincide with the appearance of the sandbox?
|Then in about May or June |
I second that. It was by the second week of May that many sites came out of the sandbox. I haven't seen any site come out completely since then.
Also I have not seen any site come out of it without a dmoz link.
A few people seem to be able to beat the lag. I would love to see an example of that, as I've yet to see it happen (well since May anyway).
"Didn't his absence also coincide with the appearance of the sandbox?"
No. You are off by about five months.
|Also I have not seen any site come out of it without a dmoz link. |
Untrue, I had one come out in May without a DMOZ link.