|Why does the 'Google Lag' exist?|
Trying to understand its purpose.
I had some in-depth discussion this weekend with some friends about the sandbox. Every theory on how to beat it kept coming back to one central problem - no one is sure why it exists.
I feel very strongly that until we have a good grasp on why it exists, it will be very hard to beat.
I don't buy the explanation that it's intended as a method of stopping spam. Why? One, it's doing too much collateral damage. Two, if you accept the 80/20 principle (20% of spammers are doing 80% of the spamming) and realize that there are already multiple ways of beating the sandbox which all of those spammers are aware of, it doesn't make sense anymore.
So, why does the sandbox exist?
The most obvious effect of the sandbox is that it prevents new domains (not pages) from ranking for any relatively competitive term. So, start thinking like a search engine - what would be the benefit of this?
Marcia refers to 'scientific proof' - but if scientific proof were the requirement for any posting regarding Google's algo, all posts, present and past, would have to be pulled.
Let's get realistic.
BTW, there are at least a couple of basic principles surrounding Google's results which could be regarded as 'scientific':
1) Make some money for Google.
2) Make more money for Google.
3) Make a bit more money for Google.
Adwords is the basis, Adsense provides the final proof.
Beyond that: who's to say? But I guess we're trying to subvert it - what's unnatural about that, Marcia?
There is no "scientific" proof possible. Aside from which, most people who run extensive controlled testing on huge amounts of data - which some certainly do, and while it's as scientific as it can get - tend to keep their findings to themselves as trade secrets.
>>Adsense provides the final proof.
It certainly provides the motivation to flood the index with swill. ;)
BTW, is anyone else seeing "sandboxed" sites showing up in the backlinks for a site they link to?
Let's just swing this back a bit.
Why does the 'Google Lag'/sandbox/whatever exist?
* Because Google want to hold pages on new domains back from ranking for the first few months? (Why just newly linked domains, surely it's just as easy to add junk to existing domains? What's wrong with new sites anyway?)
* Because Google have run out of diskspace/addresses? (Very, very unlikely.)
* Because Google have some new ranking/theming/quality system that takes a few months to run? (Months? Seems odd.)
The reason for a 120+ post thread on this subject is that the answer is not obvious. Or maybe it's really quite simple but none of us have suggested it?
ciml, could it be your #1 and #3 combined? And time needed for testing?
Remember a while back someone saying that Google's new algos and indexes are thoroughly tested for quality before they're released for public consumption? It would make sense then, particularly if as some believe there's more than a one step process going on.
|Because Google have some new ranking/theming/quality system that takes a few months to run? (Months? Seems odd.) |
That one, if there's a new system or ordering, might take considerably more testing time. Just pure conjecture on my part of course, but there's too much otherwise unexplainable and there's been too much reason not to expect a somewhat different type of scheme eventually.
[edited by: Marcia at 8:30 pm (utc) on Oct. 1, 2004]
Since everyone shares just a little bit, and as Marcia suggests we keep the rest of it close to the vest because it's a 'trade secret', we can't come to much of a conclusion.
Since things have only moved out of the lag box once, most of us don't have enough data to predict how or why what got out did, or how to develop a reliable workaround.
Thinking back to when this "effect" first appeared - or when people started posting about it, anyway - it was just about the same time that GoogleGuy was posting in here that they were looking at how to handle expired domains being used for their old linkbacks to build up new sites. If memory serves, it had really gotten out of hand, and people were suggesting ways to handle it here.
There was quite some discussion about a way to dampen the ability to use a "new" domain (aka the ones that seem to be having the so-called sandbox problem) as a spam tool. One of the effects people could see with some of the remedies was that they would affect brand new domains as well as expired domains that were transferred, as Google seemed to be going down the path of looking at the modified date in whois databases, instead of the created date, to fight the expired domain problem.
One of the important facts here that has been proven is that old domains a person has had around for quite some time, with a site or just a holding page, have not been affected - whereas brand new domains, or domains transferred since this talk was going on, seem to be the affected sites.
To boil this all down - I believe this "sandbox" effect that everyone claims is just a by-product of the algo change Google has made to handle the old expired domain spam problem.
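Purely to illustrate the by-product theory above (this is speculation about Google's algo, not its actual code): a filter keyed to the whois *modified* date rather than the created date would catch expired-and-reregistered domains, but would sweep up brand-new and transferred domains along with them. The function name and the 8-month window here are invented for the sketch.

```python
from datetime import date, timedelta

# Hypothetical sketch: flag any domain whose whois record changed recently.
# Aimed at expired domains being reregistered for their old linkbacks, it
# also hits brand-new registrations and transfers as a by-product.
LAG_WINDOW = timedelta(days=8 * 30)  # invented threshold, not a known value

def is_lagged(whois_modified: date, today: date) -> bool:
    """Return True if the domain's whois record changed within the window."""
    return today - whois_modified < LAG_WINDOW

# A domain registered (or transferred) in March 2004 is still caught in October:
print(is_lagged(date(2004, 3, 15), date(2004, 10, 1)))   # True
# A domain untouched since 2001 sails through:
print(is_lagged(date(2001, 6, 1), date(2004, 10, 1)))    # False
```

That would match the observation in this thread that old, untouched domains rank normally while anything new or transferred sits in the box.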
It's a total mystery for us. Some say they have gotten by this thing, but boy, they are in the minority, that's for sure. We do OK in this business and Google has been a big part of that, so I'm not going to whine. But this sandbox is really beginning to weigh on our minds. We have launched four new sites since March: unique subjects, unique content, slow natural link building, submitted to all the usual places, basic on-page SEO, etc., etc. All the stuff we have been doing that has brought good results.
However, ever since March of 2004 these sites just won't punch through for any meaningful terms. There you see them, using the various commands, well placed. The bottom line is they just don't make us any money; we are unable to drive the traffic necessary to convert. It's not even close.
All of the good thoughts put forth by everyone are interesting, and maybe the truth lies somewhere in there. But the problem I am having is that for all practical purposes no one has ever gotten out of this thing. Some claim they have, and if that's true, my hat's off to them. But the great majority of us have been stuck going on seven months now. It's inconceivable to me that you could go a year without being able to monetize a site organically, but that time frame is not far off.
Keep the thoughts coming, I for one appreciate each and every one.
If there is one thing you can count on Google doing, it is that they always (pardon the phrase) think outside the box - and FAR outside.
GMail became an internet legend overnight by doing something nobody had ever DREAMED of before. Can you imagine how ridiculous Hotmail felt when they were offering 2 MB for free and Google offered 1000 MB?
I believe whatever Google is doing, it is something big, and in one or two or six months from now half of us will be leaping in the air for joy and the other half will be crying like babies. God I love this business. :)
>>>> Why does the 'Google Lag'/sandbox/whatever exist?
One more theory... perhaps newer sites, especially those focused on competitive terms, face a higher bar / different criteria to rank than old sites, and that's the way it's going to stay. Google got a lot of bad press after Florida for creaming a lot of small business listings, so perhaps they decided to go ahead with their algo changes but grandfather the old sites in. That would make it much harder (needing more links, more themed links, more unique content, a time lag on links, natural linking patterns, etc.) to rank these days, and that is the way it's going to stay.
That way the old businesses can't complain because they aren't getting hit (unless they have datafeeds, duplicate content, auto-generated pages, etc.), and the new site owners can't really complain about "lost traffic" since their sites are new and never had any traffic to lose.
It's just a thought but it's something I've been wondering about.
|Site with loads of decent inbounds, TBPR of 6, scores allinanchor in the top 5. Nowhere to be seen for any query worth a damn. Proof enough for me |
how old is the site?
Call it what you will: sandbox, lag effect, the Man in the Moon...it does exist.
I have a site that went up two years ago that deals with different brands of widgets. Search for "blue widgets" and it's #1 on Google.
I have a site that I submitted to G on June 14th. It's indexed about 300 out of 1500 pages.
Both sites have the brand "blue widgets" on them. Same keyword density, same everything when it comes to SEO.
A search for "blue widgets" finds my new site at about position #350 in the SERPS.
Also back in June, the owner of the company that I did the two year-old site for wanted to add a page about "red widgets." Two weeks later I checked on Google, and that site ranked #1 for "red widgets."
And, yes, I have incoming links for the new site's "blue widgets" page.
How else to explain it?
Lest we stray off in another direction again, and staying with the "why" of the thing, whatever it is:
|To boil this all down - I believe this "sandbox" effect that everyone claims is just a by-product of the algo change Google has made to handle the old expired domain spam problem. |
Those will eventually get in, but not with the same advantages they had; so it sounds like a logical part of the whole picture. Also, the "instant" link pop can't work like it used to, put together with the flooding of the index with a ton of less-than-valuable cranked-out pages and their attendant linking strategies.
Putting it all together with the other things they seem to be tightening up on, including removing loads of pages from the index a few months ago and what seems to be an emphasis on detecting near-duplicate content, all we can conclude overall is that the reason on their part is no different from what they've always claimed their motivations are - to improve the value of search for their users.
We can quibble over the hows and the means and mechanics, but what it all boils down to is that regardless of the methodologies they're using, there's no way they'd sit still forever without resisting and fighting back against what violates their standards of value. None of us would either, in their place.
Pardon me for seeming bemused but isn't it obvious? For some time now (about a year) a theory has existed that Google would not be able to index more than 2^32 pages.
That's 4,294,967,296 pages
Today Google is indexing 4,285,199,774 pages.
That is within 0.2% of its theoretical limit.
GoogleGuy denied it last year, but that is just too much of a coincidence, and GoogleGuy has been wrong before.
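For anyone checking the arithmetic behind this coincidence: a quick Python sketch. The index figure is the one from Google's 2004 home page quoted in this thread; nothing beyond the arithmetic is claimed about Google's systems.

```python
# Headroom between Google's claimed 2004 index size and the number of
# distinct values an unsigned 32-bit docID could take.
MAX_32BIT_IDS = 2 ** 32              # 4,294,967,296 possible IDs
CLAIMED_INDEX_SIZE = 4_285_199_774   # figure from Google's home page

headroom = MAX_32BIT_IDS - CLAIMED_INDEX_SIZE
pct_full = 100 * CLAIMED_INDEX_SIZE / MAX_32BIT_IDS

print(f"IDs remaining: {headroom:,}")     # 9,767,522
print(f"Index is {pct_full:.2f}% full")   # 99.77% full
```

So the claimed index sits within about a quarter of a percent of the 32-bit ceiling, which is the whole basis of the coincidence argument.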
[edited by: SlyOldDog at 11:36 pm (utc) on Oct. 1, 2004]
Assuming there is a reason for lag time, that is, it is deliberate and not a technical meltdown, if it continues I suspect we will see a significantly different implementation of it in the near future.
Tons of sites are avoiding lag time by spamming blogs and guestbooks. At the same time, one site I'm watching is the equivalent of bretttabke.com - an official site for a person with a name that no more than a half dozen people in the world might have. It's listed in dmoz and has lots of high quality links, yet it ranks in the hundreds for the person's name. It is ludicrous to suggest the reason for this is "bad seo".
The point may be that lag time exists in an attempt to accurately weigh if apparent quality is real quality. But if this is the point, Google currently isn't caring about zero quality sites built on the non-authority aspect of the algorithm (they rank just from volume of anchor text links).
It is of course crazy to lag "apparent quality" while you judge its true worth, while letting pure dreck rank via the non-quality aspects of the algorithm. As the results get flooded with more of this total crap, they are either going to have to lag the effect of any link or accept results deliberately skewed to new, low quality sites.
Here is when Google increased their index to the current size. February 17th. So when is this sandbox supposed to have started?
For us it started in March, but that's just the time frame of the oldest site we have in the box.
|Tons of sites are avoiding lag time by spamming blogs and guestbooks. |
1) spamming blogs and gbooks alone won't get you around the sandbox
2) spamming blogs and gbooks was around long before the sandbox started
I've been lurking for a while, but this is my first post in this forum. I have some observations to make. No proof - don't shoot me down, but feel free to point out any errors I've made in my comments; I'm interested in learning rather than winning arguments.
The theory about running out of docIDs is interesting. When I first heard it, I thought it daft - surely Google could get round this, I thought. But considering it further, it may well have some bearing. For ages Google's front page has claimed there are 4,285,199,774 pages in the index, and the maximum number of pages if each were assigned a 32-bit numerical ID would be 4,294,967,296, a fraction of a percent higher. In perspective, the number of pages they claim to have in their index is 99.8% of the maximum that fits in a 32-bit number. This seems a hell of a coincidence to me, especially as it has stayed at this level (at least according to their front page) for a considerable time. Perhaps this also explains why the index used to be built in stages culminating in a Google Dance, but for some time a different rolling system has been used. The old system may have produced more pages than the limit, so it could have been abandoned in favour of a system which gradually adds pages as other pages are deleted.
Given that there may be significant obstacles in upgrading to more than 32 bit indexing system, this may explain why Google has kept the index roughly the same size all this time - they may have been working on a 64 bit system but it's not ready yet.
With too many pages for the index to cope with, this may be a reason why they are imposing barriers to entry for new sites into the system and are also getting more strict on spamming techniques as they are quite happy to kick sites out to free up room for good new sites.
It wouldn't surprise me if they keep sites not yet indexed in a quarantine database and perform a number of tests on them - are they purely affiliate sites, are they dmoz clones, etc. If they are sites with genuinely new content, they would be good candidates to enter the index when space has been freed up by kicking out spammy sites. This perhaps explains why some people in this forum have commented that Google is becoming much less tolerant of duplicate content. If you were Google and had been near your page limit for a while, the last thing you'd want is yet another glorified clone of an existing database.
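The quarantine idea above could be sketched like this. Every function name, check, and threshold here is invented for illustration; nothing is known about what tests, if any, Google actually runs.

```python
# Hypothetical sketch: hold new sites in a quarantine pen and admit one only
# when it passes some quality checks and index space has been freed up.
def admit_from_quarantine(site: dict, free_slots: int) -> bool:
    """site is a dict of invented quality signals; all thresholds are made up."""
    checks = [
        not site.get("is_dmoz_clone", False),          # not a directory clone
        not site.get("is_pure_affiliate", False),      # not a pure affiliate shell
        site.get("duplicate_content_ratio", 0.0) < 0.5,  # mostly original content
    ]
    return free_slots > 0 and all(checks)

# A genuinely new-content site gets in once slots open up:
print(admit_from_quarantine({"duplicate_content_ratio": 0.1}, free_slots=3))  # True
# A glorified clone of an existing database stays out:
print(admit_from_quarantine({"is_dmoz_clone": True}, free_slots=3))           # False
```

The point of the sketch is only the shape of the argument: under a hard index-size cap, admission would depend on both quality tests and freed space, which would look exactly like a lag from the outside.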
If a 64-bit index is nearly ready for launch, I am not surprised they delayed it until after the share flotation, since there may be unforeseen instabilities in the new index, and the company needs a stable image to attract investor confidence in its technology.
Also it just so happens that making it harder to get into the index increases demand for Adwords, so it makes business sense to do exactly what they are doing at the moment until the 64 bit index is ready.
Without wishing to jump to conclusions, a much bigger index would need much faster spidering and perhaps the dramatic increase in spidering we have seen recently is a test for when the new index is launched.
This may be a dumb question, but I will ask anyway. If the docID limit is 4.2 billion or so, why, when I do a query on +the, do I get "Results 1 - 100 of about 5,800,000,000"?
|This may be a dumb question, but I will ask anyway. If the docID limit is 4.2 billion or so, why, when I do a query on +the, do I get "Results 1 - 100 of about 5,800,000,000"? |
Perhaps because the DocID theoretical limit does not apply to Google's current algorithm?
hallelujah to you arthurdaley.
jnmconsulting - yes, this is true re the +the query. But if you look at the Google home page, as touched on by SlyOldDog above, it says:
"©2004 Google - Searching 4,285,199,774 web pages"
and this number has been at this level for months now. Maybe Google's engineers just forgot to update their home page.
Also, the 5B+ result count for the "+the" query has been at that level for a year now. Does that mean no pages containing "the" have been added in a year's time?
The number of pages indexed cannot be the answer, because new pages get included every day - many of us put new pages on existing sites and they are included within 2-7 days. This only affects sites on new domains, or on older domains that never had a site (not even in the archive.org listings), and it has been affecting sites since last fall.
I've heard people theorize that the page limit may be staying the same and that the new pages we make on existing sites are replacing older pages that haven't been updated. That doesn't sound very likely, as it would take a massive amount of horsepower on already hard-working servers - not Google's way of doing things; they look for easier, streamlined approaches these days.
This "sandboxing" all started with Google going public; it is related to that process, imo.
The easiest way to show strong profits is to cut back on labor, hardware, and investment in research.
It could be that the investment banks/firms wanted "assurances" prior to the IPO that G would show immediate, continuing PROFITS. These investment firms cannot profit themselves unless G's stock rises, so if G reports a loss or stagnant profits in its very first 10-Q, the stock price drops and the big investment firms are not going to be happy.
So maybe G is in a conservative mode. Maybe G isn't allowed to make changes on a dime anymore. Maybe changes are stuck in the boardroom now?
Maybe sites are sandboxed because the expense of adding new sites, and all that surrounds those entries, is enough to negatively affect earnings?
Isn't this the strategy MSN used for the longest time? They had high profit margins on their search because they cut back on the expense of updating and adding new sites. That works short term, but they played it so long that now they are having to totally rebuild search.
I don't know - just something to consider. Maybe G just can't shoot from the hip anymore; they are now a public company and every move will need layers of approval.
Maybe this, maybe that, maybe not...
The only fact is the number on their home page, which has not changed since February 17th. The number is nearly 10 million pages short of their theoretical limit, so that still allows them to add new pages to existing sites in minute quantities.
Ciml already debunked most of the theories here. In my opinion correctly.
The Adwords theory doesn't hold water either. New entrants will be likely to spend less on Adwords than bigger organizations who are already in the index and get removed. Existing sites in the index will already have built significant online businesses that need supporting. For many of them it's pay-or-die as opposed to pay-to-play for new entrants.
|Pardon me for seeming bemused but isn't it obvious? For some time now (about a year) a theory has existed that Google would not be able to index more than 2^32 pages. |
Am I missing something? Google is still indexing all of these new sites, at least in my case. The problem is one of ranking - not indexing.
Quite a while ago I suggested my 32-bit theory, but calum once said he wouldn't believe it to be the case. The figure of 4.28 billion pages indexed (i.e. valued in PageRank) remains just below the 4.29 billion given by 2^32. And now, with Google having stayed there for three months, I'd regard this theory as worth discussing again.
I definitely am not an expert, but I believe the PR algorithm is heavily based on a 32-bit hardware architecture. As far as I know, PR is calculated by approximation through about 100 iterations over the 4.29 billion by 4.29 billion link matrix, which means a huge number of calculations.
Note how much Larry Page and Sergey Brin emphasize the factor of speed in their original paper, and I do not think this only concerns request-traffic on the net.
Below this figure you can index the matrix with a 32-bit variable; beyond it you need at least a 64-bit type. I am not an expert on C++ or processor technology either, but I suspect the difference implies much more than just doubling calculation time. It seems to require a complete restructuring of the algorithm, and it might well be impossible to solve at the current state of technology.
I have put quite a lot of effort into improving my websites' PR in the past three months, so please, please falsify my theory.
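To make the "about 100 iterations" concrete: a toy power-iteration PageRank on a three-page link graph. This is only a sketch of the method described in the original Brin/Page paper, not Google's implementation; the damping factor and graph are illustrative, and a real-scale version would be iterating over billions of pages, which is where the 32-bit indexing concern above comes in.

```python
# Toy power-iteration PageRank: repeatedly redistribute rank along links
# until the values settle. Real-scale versions do this over the full web
# link matrix, roughly 100 passes.
DAMPING = 0.85  # conventional damping factor from the PageRank literature

def pagerank(links: dict, iterations: int = 100) -> dict:
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - DAMPING) / len(pages) for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new_rank[target] += DAMPING * rank[page] / len(outlinks)
        rank = new_rank
    return rank

# A symmetric three-page ring: every page ends up with equal rank (1/3 each).
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
print(ranks)
```

Even on this toy graph you can see why the iteration count matters: each pass touches every link, so 100 passes over a multi-billion-row matrix is an enormous amount of work, and the width of the integers used to index it is a real engineering constraint.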
[edited by: ciml at 10:40 pm (utc) on Oct. 7, 2004]
|The sandbox/google lag/whatever is a ranking, not an indexing issue. It is absolutely critical to understand that. |
Yes - to say anything else would be to discuss a different phenomenon
>>The sandbox/google lag/whatever is a ranking, not an indexing issue. It is absolutely critical to understand that.
>>Yes, to say anything else would be to discuss a different phenomena
They are absolutely, positively in the index, just like any other pages. And they're cached just like any others.
|This is G o o g l e's text-only cache of [example.com...] as retrieved on Sep 9, 2004 12:29:04 GMT. |
G o o g l e's cache is the snapshot that we took of the page as we crawled the web.
The page may have changed since that time. Click here for the current page without highlighting.
Click here for the full cached page with images included.
The pages get updated (verified by changing "last updated" on the pages), and in fact as new pages are added to the sites they're indexed as well. One has had a fresh date for the homepage. One is even showing up in the backlinks for another site. They just won't rank for anything.
|It just means you're not able to compete with all the other sites that are better SEOed! |
That would be odd, considering that it is my old sites currently ranking. Maybe I am growing old and out of touch and cannot even compete with my "younger self" :)
A blind man could see that newer sites are being treated differently than older ones, and a 3-year-old can get a new site INDEXED.
>>The sandbox/google lag/whatever is a ranking, not an indexing issue. It is absolutely critical to understand that.
>Yes, to say anything else would be to discuss a different phenomena
Depends on the intended semantics of 'indexing'. It's certainly not a spidering problem as they have spidered many pages which are not in the index/ searchable database. But if there is a limitation on the number of pages in the index and old pages must be kicked out before new ones can enter, then that could be called an 'indexing' issue since the issue is caused by limitations in the size of the index/ searchable database.
Also as I understand it, although related the sandbox and google lag are not the same thing. The sandbox affects pages already in the index, seemingly penalising some of them. The lag is a time period delaying the acceptance of pages not yet in the index.
|Also as I understand it, although related the sandbox and google lag are not the same thing. The sandbox affects pages already in the index, seemingly penalising some of them. The lag is a time period delaying the acceptance of pages not yet in the index |
They are the same thing. Someone at WebmasterWorld doesn't like the term sandbox so called it google lag.
I don't believe there is a delay in getting pages indexed. The delay is in getting pages ranked.