I feel very strongly that until we have a good grasp on why it exists, it will be very hard to beat.
I don't buy the explanation that it's intended to be a method of stopping spam. Why? One, it is doing too much collateral damage. Two, if you accept the 80/20 principle (20% of spammers are doing 80% of the spamming), and you realize that there are already multiple ways of beating the sandbox that all of those spammers are aware of, the explanation doesn't make sense anymore.
So, why does the sandbox exist?
The most obvious effect of the sandbox is that it prevents new domains (not pages) from ranking for any relatively competitive term. So, start thinking like a search engine - what would be the benefit of this?
I have a site that went up two years ago that deals with different brands of widgets. Search for "blue widgets" and it's #1 on Google.
I have a site that I submitted to G on June 14th. About 300 of its 1500 pages are indexed.
Both sites have the brand "blue widgets" on them. Same keyword density, same everything when it comes to SEO.
A search for "blue widgets" finds my new site at about position #350 in the SERPs.
Also back in June, the owner of the company that I did the two-year-old site for wanted to add a page about "red widgets." Two weeks later I checked on Google, and that site ranked #1 for "red widgets."
And, yes, I have incoming links for the new site's "blue widgets" page.
How else to explain it?
To boil this all down - I believe this "sandbox" effect that everyone claims is just a by-product of the algo change Google has made to handle the old expired domain spam problem.
Those will eventually get in, but not with the same advantages they once had, so it sounds like a logical part of the whole picture. Also, the "instant" link pop can't work like it used to, which fits with the index having been flooded by a ton of less_than_valuable cranked-out pages and their attendant linking strategies.
Putting it all together with the other things they seem to be tightening up on, including removing loads of pages from the index a few months ago and what seems to be an emphasis on detecting near-duplicate content, all we can conclude overall is that the reason on their part is no different from what they've always claimed their motivations are - to improve the value of search for their users.
We can quibble over the hows and the means and mechanics, but what it all boils down to is that regardless of the methodologies they're using, there's no way they'd sit still forever without resisting and fighting back against what violates their standards of value. None of us would either, in their place.
That's 4,294,967,296 pages
Today Google is indexing 4,285,199,774 pages.
That is within 0.2% of its theoretical limit.
GoogleGuy denied it last year, but that is just too much of a coincidence, and GoogleGuy has been wrong before.
[edited by: SlyOldDog at 11:36 pm (utc) on Oct. 1, 2004]
Tons of sites are avoiding lag time by spamming blogs and guestbooks. At the same time, one site I'm watching is the equivalent of bretttabke.com -- an official site for a person with a name that no more than a half dozen people in the world might have. It's listed in dmoz and has lots of high quality links, and ranks in the hundreds for the person's name. It is ludicrous to suggest the reason for this is "bad seo".
The point may be that lag time exists in an attempt to accurately weigh if apparent quality is real quality. But if this is the point, Google currently isn't caring about zero quality sites built on the non-authority aspect of the algorithm (they rank just from volume of anchor text links).
It is of course crazy to lag "apparent quality" while you judge its true worth, while letting pure dreck rank via the non-quality aspects of the algorithm. As the results get flooded with more of this total crap, they are either going to have to lag the effect of any link or accept results deliberately skewed to new, low quality sites.
Here is when Google increased their index to the current size. February 17th. So when is this sandbox supposed to have started?
I've been lurking for a while, but this is my first post in this forum. I have some observations to make. No proof, so don't shoot me down, but feel free to point out any errors I've made in my comments. I'm interested in learning rather than winning arguments.
The theory about running out of docIds is interesting. When I first heard it, I thought it daft - surely Google could get round this, I thought. But considering it further, it may well have some bearing. For ages Google's front page has claimed there are 4,285,199,774 pages in the index. And the maximum number of pages if each were assigned a 32-bit numerical id would be 4,294,967,296, a fraction of a percent higher. In perspective, the number of pages they claim to have in their index is 99.8% of the maximum number allowed in a 32-bit number. This seems a hell of a coincidence to me, especially as it has stayed at this level (at least according to their front page) for a considerable time. Perhaps this also explains why the index used to be built in stages culminating in a Google Dance, but for some time a different rolling system has been used. The old system may have produced more pages than the limit, so could have been abandoned in favour of another system which gradually adds pages as other pages are deleted.
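For anyone who wants to check the arithmetic behind the docId theory, here is a quick sketch in Python. The two page counts are the ones quoted in this thread; the 32-bit id width is the theory being discussed, not anything Google has confirmed, and the variable names are just mine for illustration.

```python
# Figures quoted in this thread. The 32-bit docId width is the
# hypothesis under discussion, not a confirmed detail of Google's index.
DOC_ID_BITS = 32
max_ids = 2 ** DOC_ID_BITS        # largest count a 32-bit id can address
claimed_pages = 4_285_199_774     # page count shown on Google's front page

unused_ids = max_ids - claimed_pages   # ids left before the ceiling
fill_ratio = claimed_pages / max_ids   # fraction of the id space in use

print(f"max 32-bit ids: {max_ids:,}")
print(f"claimed pages : {claimed_pages:,}")
print(f"unused ids    : {unused_ids:,}")
print(f"index fill    : {fill_ratio:.2%}")

# For comparison, the ceiling a 64-bit id would give.
print(f"64-bit ceiling: {2 ** 64:,}")
```

That leaves a little under ten million ids spare, which is where the 99.8% figure comes from. It shows the numbers are close, nothing more; it can't tell you whether the closeness is a coincidence.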
Given that there may be significant obstacles in upgrading to an indexing system wider than 32 bits, this may explain why Google has kept the index roughly the same size all this time - they may have been working on a 64-bit system but it's not ready yet.
If there are too many pages for the index to cope with, that may be why they are imposing barriers to entry for new sites and also getting stricter on spamming techniques: they are quite happy to kick sites out to free up room for good new sites.
It wouldn't surprise me if they keep new sites not yet indexed in a quarantine database and perform a number of tests on them - are they purely affiliate sites, are they dmoz clones etc. If they are sites with genuinely new content, then they would be good candidates to enter the index when space has been freed up by kicking out spammy sites. This perhaps explains why some people in this forum have commented that Google is becoming much less tolerant of duplicate content. If you were Google and had been nearly at your page limit for a while, the last thing you would want is yet another glorified clone of an existing database.
If a 64-bit index is nearly ready for launch, I am not surprised they delayed it until after the share flotation, since there may be unforeseen instabilities in the new index, and this is a time when the company needs a stable image to attract investor confidence in their technology.
Also it just so happens that making it harder to get into the index increases demand for AdWords, so it makes business sense to do exactly what they are doing at the moment until the 64-bit index is ready.
Without wishing to jump to conclusions, a much bigger index would need much faster spidering and perhaps the dramatic increase in spidering we have seen recently is a test for when the new index is launched.