Forum Moderators: Robert Charlton & goodroi
If your site is less than a year old you are likely sandboxed.
I can't believe most sites less than a year old are in some sort of penalty box; Google would be useless if they were. So I want to know:
1. Are all sites sandboxed, or do certain traits (like affiliate links, low content) trigger it?
2. How long does it last?
3. How variable is the duration?
4. How do you know your site is being sandboxed?
5. Does the effect taper off or is it a binary thing?
6. What gets you out of the sandbox? Is it merely time or do good links or whatever speed it up?
Thanks.
Yes, that's the point here exactly, and it's what people defending this policy simply aren't grasping. It's probably closer to 15%, but it's the same problem.
The web is new, very, very new, and it's growing fast. The guess of 20% may in fact be correct given the rate of growth of the web. This is not trivial, and it's not a spam issue; this is, again, a full-on, total and absolute system failure. This is not people whining like little kids who can't have their candy; this is people seeing a full-on system failure, recognizing it, and calling Google on it. Defending this policy is very reminiscent of defending Ford when they knew perfectly well that the Explorer had major stability issues, and, what's worse, defending Ford because of what Ford was issuing as press releases and damage control on that problem. Personally, when I see a failure like this, I prefer to just call it what it is: a failure. I'll leave the spin to the company; there are lots of computer and tech media hacks who will faithfully reprint the company spin word for word.
The growth of the web has been almost exponential. In 1999 Google was very proud that they expected to have 1 billion pages indexed; their primary algo was designed to support 4.2 billion pages. They hit that number much sooner than they expected. I think they probably expected a doubling of pages by 2004, which is when that algo got maxed out. That's why calling this policy damage control is not unreasonable. Google doesn't just have to make a new algo, it has to create a whole new Google file system, OS, and algo. Speaking crudely, that means seeing the whole system as one and rebuilding all the components that make it work. See the recent articles on the Google File System, for example, to get a sense of just how major this change is.
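For anyone curious where those numbers come from, here is the arithmetic behind the 2^32 figure this thread keeps citing. Note the 32-bit document ID cap is the thread's assumption, not anything Google has confirmed:

```python
# Illustration only: the thread's assumption is a 32-bit document ID space.
max_doc_ids = 2 ** 32             # 4,294,967,296 -- the "4.2 billion page" ceiling
doubled_index = 2 * max_doc_ids   # ~8.6 billion, the November index-size jump

print(f"32-bit docID ceiling: {max_doc_ids:,} pages")
print(f"Doubled (2 x 2^32):   {doubled_index:,} pages")
```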
Comparing it to a new Windows release is not that unreasonable, except it's much simpler, since it will only run on their own systems; it doesn't need as much error and hardware detection and debugging stuff built in like Windows does. But I think it is very much a process of building a new OS and a new algo to run on it, all on top of Linux of course. And I don't think Google had much warning; I think they did not expect this to happen this fast.
And there are other issues I've come across here: the hardware one, the stability of the 2.6 kernel, a lot of things that made it pretty much impossible to really start work on this issue until 2004. That's essentially now.
So the tests show that if 2-3 new pages were added to the top-10 SERPs, that's as much as 30% of the results for some keywords, and it could even be higher.
It really is making Google look silly; they have a reputation to keep. If the press heard this they'd have a field day. A basic problem and they can't fix it: that's what I call lame, but it could be just bad luck. If it is bad luck, I'd say they had better fix it or they could look rather stupid.
Yes, that's the point here exactly, and it's what people defending this policy simply aren't grasping. It's probably closer to 15%, but it's the same problem.
I wonder what percentage of that 15% or 20% consists of original information? I'd guess that quite a bit is duplicate content (boilerplate product pages, hotel listings, automatically sliced-and-diced directory pages, etc.).
If there really is a limitation on the maximum index size, then maybe Google needs to be more aggressive in purging the index to create more room for the useful new material that users allegedly are desperate to find.
I see you are quite new to the board - FYI there are a great number of intelligent & generous people on this board who offer their time and advice at no cost to anyone - you might want to back off a little with the blanket criticisms.
Yes, you read them a little too quickly. I know there are knowledgeable members here, by the way, but what I'm talking about has been kind of a serious problem on the WebmasterWorld search engine forums since late 2003: the tendency to uncritically accept certain explanations that simply don't fit what you can see with your own eyes, but which are very good for Google to have as the perceived wisdom.
You have to connect all the dots, not just the two that seem most obvious, in this case spam and AdWords/IPO income. The number one dot that I've seen neglected is the capacity issue, and that only started getting discussed again once it became simply too painfully obvious to ignore, almost an entire year after the idea was first suggested. The initial argument was very strong, as was its condemnation, and the spin thrown up to stop it spreading. Then it moved to the dual-index question, again an idea that made total sense logically, but which was again routinely criticized as a wacky idea. Why? Fundamentally for one and only one reason: an apparent total faith that Google can do no wrong, therefore any suggestion of weakness or failure is by definition whacked. I prefer a more reasonable idea, which is that Google can and does do wrong, is a real company, and messes up majorly, like many other software / web-based companies have in the past: HotBot, AltaVista, etc.
Both of these have since clearly been shown to have been true, first with the stuck 2^32 pages-indexed count, which triggered the initial sandbox, and second with the magical 2x2^32 index-size boost in November. That shows this quite clearly: there was one index, it was full, there was no more room; then there was a second index, which was a junk collector for whatever else they wanted to throw in there, like sandboxed pages. That's why the sandboxed pages are available: they are indexed, and always have been. Add to this radical decreases in indexing frequency, much longer times to add large blocks of pages, etc., and the picture I see is of a broken system. Barely working; working, but only by restricting almost all new data from it.
I've been reading these forums for a long time; it was only when I saw that the sandbox was (a) getting longer and longer, and (b) not going away, that I started getting concerned enough to start posting.
I thought these issues would be completely worked out in the middle of last year. Now, not only are they not resolved, the situation is significantly worse.
You're correct to call me on any blanket-type statement, but I'm making it because I've noticed a tendency here to trivialize extremely critical components of Google's operation, especially the hardware/algo restrictions, and extra especially the idea that a company might want to make money and do what all other companies in the world try to do in the same situation, which is to maximize quarterly performance pre-IPO in order to maximize IPO capitalization. This is business 101. It's not a conspiracy, unless you believe that Google isn't a business and really is your friend and wants to help you all.
Keep in mind, if these issues had been discussed openly by Google, and if the press had done their jobs (which doesn't tend to happen in the online media much; they prefer easy handouts to real investigative reporting as a rule), this could very literally have cut 10 billion dollars, or more, from the IPO. Possibly much more: currently Google stock is about 10 times overvalued; imagine if it had come out at a much more reasonable $20-30 per share.
But much more to the point, what worries me is that Google has totally abandoned any efforts at openness they used to engage in, and is totally tight-lipped about these issues.
The capacity issue is and was intrinsically related to the sandbox as far as I can tell. I refuse to believe Google would back this far off their core model of supplying the very best, most up-to-date results from the whole web, not the maybe 85% that they are currently supplying, without a very, very good reason. And neither spam nor anything else is such a good reason.
Again, I'm seeing Yahoo do exactly the same thing, except in a much looser way that avoids easy categorization, but the result is the same: pages are indexed, but only x percent of the whole web is. This lets them let new sites in, but drops pages indexed per site dramatically. I think we're looking at the exact same cause, being treated in two different ways.
Again, if spam is the issue, which I don't for one minute believe, solve the spam problem; don't kill the internet, which is in essence what Google is currently doing. Every year new sites were added, spam was put up, SEOs got busy; this isn't new. To suddenly pretend that a new site cannot have value ignores the fundamental nature of the internet. It's not a trivial decision; it's a complete failure to deal with the internet as far as I'm concerned, and I hope Google either fixes this problem in the next few months, or fails miserably and joins AltaVista etc. as a memory.
It's very basic: if your job is to organize the world's information, do your job. If you can't handle your job, then quit, cash out, do something you can handle, but stop pretending; it's getting boring, to be honest.
<ps>sorry about the typos, new prescription, hard to read the damned screen...</ps>
While I appreciate the time and obvious thought you've put into your post, I haven't seen evidence supporting your suppositions on the sites I run. I started a new website in late December; Googlebot is the most aggressive spider by far, indexing every day, and currently has nearly all of my 200+ pages indexed. I don't think I'm in the sandbox either, although that doesn't lead me to deny its existence. The only delay I've seen in rankings hasn't been all that different from a few years ago, when some sites needed a big update and recognition of a few good backlinks in order to begin ranking well. I'll know within a few weeks how things are playing out with the new site, but I'm not all that worried.
I've also seen no evidence of a decrease in Googlebot indexing with any of my other 8 or 9 websites, the oldest of which is 5+ years.
Restricting new data? Maybe SOME data; certainly not all. Barely working? Don't buy that either.
That would be an example of what I'm talking about, not an example showing it's wrong: Google used to rip through sites almost overnight. I added several thousand pages to a site a few months ago and it took Googlebot almost 3 weeks to process them, and I'm seeing much longer delays in full-site indexing. Like you, I've been watching some of these sites for a while, and it's pretty clear in terms of frequency.
What I want is for Google to fix itself. I couldn't care less whether I'm right or wrong, as long as they start giving searchers the real, growing, living web again.
Working off 85% of the web, as they are currently doing, is being broken. When is it going to stop? At what point? 75%? 60%? If that's not broken, I don't know what is. Even assuming the spam argument, cutting off everything to catch the guilty is failure; there's no way I can think of to explain this except as a failure. Sure, they are indexing old sites fine, and have up-to-date content on old sites; nobody is saying they aren't. But they don't have new sites that could be, and often are, more relevant to the searcher's needs. The web caters to the latest and greatest more than almost any other medium out there. It's not like print, magazines, etc.; the best information often is on a new site. Not always, but often.
But the real kicker is this: Google's success was built around providing all of the web, completely, in the most up-to-date form available. This formula crushed their competition. To believe that they would leave this successful formula lightly is not realistic. They know why they succeeded, and I think they are very aware of the risks they are taking currently, and of course they know exactly why they are taking them. Too bad we don't, LOL... but there are signs, and if you follow the signs without having too many presuppositions, the conclusions I come to point to a broken system. If it's not broken, and this is actually something they did on purpose, then I think Google is dead. But I don't think that.
I got caught up in the belief that if I had more links than someone else, I "deserved" to rank higher than they did. However, this actually makes no sense. My site does not "deserve" to be higher simply because I went out on an aggressive link campaign. Acquiring a bunch of links in this manner makes the Google algo less effective. And this is something Google must combat.
So if they just look at how sites and links have naturally progressed in the past, they probably noticed that a site would SLOWLY get links pointing to it and then the link acquisition rate may increase once the site has been established for a period of time (6-12 months).
So if you have a new site and have acquired a large number of links for it in order to compete on a competitive keyword, Google may not see this as happening naturally and may decide the site shouldn't rank for those competitive keywords yet.
The people continually bringing up the idea of an index capacity issue made me wonder: why didn't Google put the Sandbox into effect retroactively? In other words, when it put this tweak into the algo, why didn't it apply to sites that were 2 months old at the time? It is possible that they didn't want sites to rank and then drop out, or it is possible they didn't have the capability.
First, a new site can outrank some of the older sites (and even rank well for noncompetitive keywords), but is just significantly suppressed from its "rightful" position. Second, the use of the -fsdfdf garbage strings seems to solve the problem and display old and new sites in their "rightful" places.
If there is a capacity problem / dual index, can anyone suggest how these two observations are consistent with it? And how might the Google algorithm be choosing to intermingle the old index and the new index?
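For anyone who hasn't tried the garbage-string test mentioned above, it amounts to comparing a normal query against the same query with a nonsense exclusion term appended. A minimal sketch, assuming a made-up keyword; the idea that the exclusion bypasses a prefiltered result set is this thread's speculation, not documented behaviour:

```python
from urllib.parse import urlencode

def google_url(query: str) -> str:
    # Build a plain Google search URL for the given query.
    return "http://www.google.com/search?" + urlencode({"q": query})

keyword = "blue widgets"                    # hypothetical money keyword
normal = google_url(keyword)                # ordinary query
probed = google_url(keyword + " -fsdfdf")   # exclude a garbage string that matches nothing

# Logically the two result sets should be identical; if the rankings differ,
# something other than plain relevance scoring (a filter) is in play.
print(normal)
print(probed)
```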
My guess is because they knew pretty much when their system would fill up, and decided to flick the switch at that point. My guess, again, is that they simply didn't realize how fast the web would grow, and weren't prepared to deal with it, so they had to apply some hacks.
I'd have to go back and read most of the documented stuff on how Google actually assigns a query, but my guess is that for competitive keywords, the ones a site is sandboxed for, the query gets sent to the primary index servers, the ones sandboxed sites are not allowed into. Then secondary queries, which will by definition be very low volume, get applied to the full 2x2^32 index, both of them; that is, including the second one that has all the garbage URLs, expired pages, pages deleted a year ago, etc. So those queries return non-sandboxed results.
My guess, again, is that you are not actually seeing one single set of SERPs, you are seeing two sets, and it switches over somewhere in the high hundreds to include the secondary-index material. Same when you click 'more results like this', more results from the same site, etc. Since those are very low-volume queries, they can afford to handle them with the full system, since it doesn't happen very often, relatively speaking. I was playing around with this, and noticed that there is a jump in processing time somewhere up in those high hundreds: between every ten results it goes up, but at one point it suddenly goes up quite a bit more than it had been. I would guess that's the exact point it starts drawing results from the secondary index.
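To make the theory concrete, here is a toy sketch of the routing being described. Everything in it (the competitive-term list, the cutoff rank, the two indices and their contents) is the poster's guess restated as code, not documented Google behaviour:

```python
# Toy model of the hypothesized two-index setup. Not how Google works;
# it just restates the speculation above in executable form.

class Index:
    def __init__(self, docs):
        self.docs = docs                          # {url: page text}
    def search(self, query):
        return [url for url, text in self.docs.items() if query in text]

primary = Index({"oldsite.com": "widgets galore"})        # 2^32-capped main index, no new sites
secondary = Index({"newsite.com": "widgets, sandboxed"})  # overflow/junk index, sandboxed pages

COMPETITIVE = {"widgets"}    # hypothetical "money" keywords served from the cheap path
CUTOFF = 800                 # guessed rank where results start coming from both indices

def serve(query, start_rank=0):
    if query in COMPETITIVE and start_rank < CUTOFF:
        return primary.search(query)                          # sandboxed sites never appear
    return primary.search(query) + secondary.search(query)    # expensive full query

print(serve("widgets"))        # ['oldsite.com']  -- money term, primary index only
print(serve("widgets", 900))   # both sites       -- deep results draw on the second index
```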
The problem with thinking that this is link related is that you can score very high for these secondary search terms with a new site. If link and site development were what was being filtered, this simply wouldn't be the case; you wouldn't rank for any keywords at all.
And sometimes that's not for SERPs with only 100 results; it can be several hundred thousand. But it's not a money keyword. I had one term sandboxed, one not; both had around 300,000 results. For the sandboxed term I was in the 200s, for the non-sandboxed term, top 10 or 20. It's just more resource-intensive to query the full database, so my guess is they don't do it unless it's queries that (a) will not make them any money, and (b) won't get used very often.
And by the way, a month or so ago a page slipped out of the sandbox; for a few weeks I was ranking quite well, top 20, for a keyword phrase that the site is in the sandbox for. Then Google caught it (it was a new page, I think that's what tricked it) and removed it.
Again, this is just my best guess. I'd love to see other best guesses that can also deal with the capacity issue, which as far as I'm concerned Google has openly admitted, every search, every day, in their pages-indexed count. If you think you can explain away that fact I'd like to see it, really. But please don't offer an explanation that ignores that fact; I don't see the point of a theory that doesn't explain all the facts.
And come to think of it, I remember noticing a sandbox on one keyword phrase in particular just before Florida, maybe even 2 months before.
It was kind of weird at the time. My guess is that they were experimenting, which tells me a glitch was not the original problem.
The capacity issue had to play some part in all this, as well as spam sites. I think they were fighting spam sites, but this could have been just a good coincidence for G. The prospect of a stock float and rumours that they were hitting full capacity would have created a bit of talk in the boardroom.
Now my guess is the discussion would go something like: 'If we carry on with this exponential growth in websites coming online then we are gonna hit problems.'
Come to think of it, the updates went from monthly to more regular (well, that's what they wanted us to think): THE END OF THE GOOGLE DANCE, THE FIGHT AGAINST SPAM, GOOGLE REACHES THE 4 BILLION MARK, GOOGLE TO FLOAT ON STOCK EXCHANGE, ETC ETC.
Never a mention of 'GOOGLE MAY BE RUNNING AT FULL CAPACITY'. The introduction of AdWords, the amazing 40 PhDs, the speculation about two algos running in tandem.
The speculation about link exchanges: it's all hype as far as I am concerned. If I were Larry Page, I'd be thinking the more complicated it all looked, the further people would be from thinking that they could have capacity problems.
Now let's say that were true; it would certainly have made sense. But hang on, how many people work at Google? Well, the core development team were busy making AdWords, Froogle, and looking at other products to work on. That would have meant fewer resources spent on fixing issues: IF IT'S NOT BROKE, DON'T FIX IT. But in an attempt to curb the problem, Florida, the Sandbox and Hilltop could have been really important.
I think now it's past the speculation point. If they had a problem, it would have been fixed by now. So we've got to ask the question: why is this happening?
Well, non-profit-making sites are dead wood as far as business goes, unless they offer value.
I'd say some of the sites we have probably do offer value, but there is no reason why they should still be filtered out. This part does not make sense; logic only suggests it's a matter of time.
Unless this thing is all about controlling people, profits and the Internet, I am out of guesses on what it is all about.
Technical problems usually get resolved over time. I think the only thing we can do is carry on building good quality sites; hopefully then we will be out of reach of most spammers, and by then Google's algo will be able to separate us from the spammers.
Maybe that's what it will take?
I wonder what percentage of the older 80% or 85% is original information?
More than is the case with the billions of pages that automated tools have churned out in the last six months.
My guess is because they knew pretty much when their system would fill up, and decided to flick the switch at that point
Is the general consensus that the Sandbox went into effect on a certain date? If so, does this mean that if you had your site indexed in Google before this specific date you were fine, and if you were indexed after this date you have been playing in the sand with me? Or did it take place over the course of a few weeks/months?
I'd have to go back and read most of the documented stuff on how Google actually assigns a query, but my guess is that for competitive keywords, the ones a site is sandboxed for, the query gets sent to the primary index servers, the ones sandboxed sites are not allowed into. Then secondary queries, which will by definition be very low volume, get applied to the full 2x2^32 index, both of them; that is, including the second one that has all the garbage URLs, expired pages, pages deleted a year ago, etc. So those queries return non-sandboxed results.
Very interesting posts, 2by4. One thing that doesn't make sense... You say that Google is able to handle the 2x2^32 matrix calculations for low-volume queries because it doesn't have to do those calculations too often. And you seem to be suggesting that Google reruns the entire 2^32 or 2x2^32 matrix every time an individual does a search query. This seems incredibly inefficient, if it's true. Why wouldn't Google simply run the matrix once a day (or once a week) for high-volume search queries and serve up cached results?
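For what it's worth, that kind of precomputation is exactly what a result cache does. Here is a toy sketch of serving popular queries from a periodically refreshed cache; all names and numbers are invented for illustration, not a claim about how Google operates:

```python
import time

class SerpCache:
    """Toy cache: serve precomputed results for popular queries, recompute after a TTL."""
    def __init__(self, ttl_seconds=86400):          # refresh roughly once a day
        self.ttl = ttl_seconds
        self._store = {}                            # query -> (timestamp, results)

    def get(self, query, compute):
        entry = self._store.get(query)
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]                         # fresh enough: serve the cached SERP
        results = compute(query)                    # the expensive full-index ranking pass
        self._store[query] = (time.time(), results)
        return results

cache = SerpCache()
serp = cache.get("conspiracy theory", lambda q: [f"result for {q}"])
print(serp)
```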
Both of these have since clearly been shown to have been true, first with the stuck 2^32 pages-indexed count, which triggered the initial sandbox, and second with the magical 2x2^32 index-size boost in November. That shows this quite clearly: there was one index, it was full, there was no more room; then there was a second index, which was a junk collector for whatever else they wanted to throw in there, like sandboxed pages. That's why the sandboxed pages are available: they are indexed, and always have been. Add to this radical decreases in indexing frequency, much longer times to add large blocks of pages, etc., and the picture I see is of a broken system. Barely working; working, but only by restricting almost all new data from it.
I admit that I didn't read all of your posts, they're just too long and my time is limited. But how about if we apply some logic?
Let's assume that site rankscrap.com is sandboxed and that site rankswell.com is not. Both sites target the query "conspiracy theory". Let's also assume that the word "foo" only occurs on rankswell.com and that the word "bar" occurs on rankscrap.com. How do you explain that the query
"conspiracy theory" ( foo ¦ bar )presents both sites on page one, one above the other? There are only two possible explanations: either the two matches came from the same single index or both indices are queried simultaneously. Whichever explanation you chose, either implies that G serves queries from sandboxed as well as non-sandboxed index entries. If G is able to do that for obscure queries like the above example, it can do so for non-obscure queries and money words as well. The bottom line is: G is in full control over the situation and the sandbox is a feature of the ranking algo, not a bug or problem with the index.
More than is the case with the billions of pages that automated tools have churned out in the last six months.
Yes, to feed Google's own Adsense monster.
So because Google created this situation millions of valid websites are denied the opportunity of featuring in the results? Talk about never giving a sucker an even break ...
So because Google created this situation
You mean the people who actually built those made-for-AdSense sites were innocent bystanders?
I can imagine that, prior to that, they tried on Google the keyphrase they had just found me with on Yahoo, to no avail. Joe Surfer just got the impression that Yahoo knows what he wants while Google does not.
What makes you believe it was a surfer? I bet it was a competitor checking your rankings, or maybe yourself ;-)
I believe that Google is NOT a dynamic search engine, but rather serves a subset of results for any query, previously prepared (we used to call that the Google dance ;). That explains a lot.
1) Exactly the same results for the same search over time.
2) With the (old) limit of 10 words, there is a huge but limited number of possible searches.
3) Sandboxing might rather be some filter Google had to put in place to track down the overwhelming spam.
4) The -adsfdsa -asdfdsa trick fools the filter, as Google does not have a prepared subset for every one of these results and has to serve "simple" dynamic results, before filtering.
5) The -asdfds trick used to work for queries of less than 10 words too, until all those variants were calculated as well.
6) Websites "sandboxed/filtered" might just have something which triggers the spam-filter, what might be some fairly huge algorithm not usable for on the fly results out of a DB with 8'00..... Pages.. (to be new is just one of this filters parameters, but I've seen plenty of before well ranking pages drop without any change on the site since beginning 2004)
7) I've heard from NOBODY that they came out of the so-called sandbox.
Well, I have no real idea about what I've just written ;) or about DBs this size, but that sounds like a simple solution to this "dropped sites" problem.
Always keep in mind, Google is for Joe Surfer, not for us SEOs. For competitive keywords there are plenty of "good" pages around, and the quality of the index stays roughly the same even if 50% of sites showing only the slightest "spam" signs are dropped. Joe does not even notice, and if only 10% of the dropped sites are spam, Google does not care about the 40% "punished" unfairly, as long as it gets all the spam out.
So I think waiting until the sandbox ends could require quite some patience. I'd rather start to think of my sandboxed sites as dead forever, and start panicking that it might happen to my "still OK" sites.
-UserEdit, added point7-
You mean the people who actually built those made-for-AdSense sites were innocent bystanders?
Europe, you are not a recent arrival on the scene. You DO know that anything that is seen as a way of making money from the Internet WILL be exploited by the unscrupulous people who are in this business. Are you trying to tell me that Google could not have forecast that this would happen? I don't think so.
hasm, this could very well be right, but keep in mind that the restriction on the main algo, if I remember right, limited it to 2^32 pages, so the system is built around that, which means most results, non-sandboxed, will get drawn from that data set. I suspect you're right about them not doing this on the fly for the core search terms, but once you get out of the core group, to the non-sandboxed search terms, I'd guess that it might become an actual live search, or something like that. Doing a live search wouldn't be a big deal even if it had to run through two data sets, or indexes, since it doesn't happen very often. Keep in mind, when you see your junk pages in supplemental results, those are very clearly being drawn from another index, IMO.
Other good points raised here too but I have to do some paying work right now and don't have time to reply like I'd like to, LOL.
Except for this: why on earth do some of you persist in believing that a corporation, funded by venture capitalists, would not try to boost its income from the only primary revenue generator available to it prior to its IPO? How can you possibly believe that a group of humans would deliberately take actions that would result in them losing possibly 10 to 20 billion dollars? Are any of you really this naive? If so, I have some really good property I'd like to sell you, and some bridges, slightly used but they get good traffic. Do you have any idea of the type of power and influence this kind of money can get you? We're not talking about 20 dollars here. This is a whole different ballpark, this is serious power; people don't mess around when it comes to this kind of thing, especially experts in the field. If Google needed to boost AdWords income to maintain high initial IPO share prices, then you can be absolutely certain that's what they did.
Google hitting record quarterly profits right before the IPO was not an accident. It was a result of handing over some decisions to the money men for that time period, which obviously cut into their ability to apply pure engineering fixes to the mounting problems, a short-term gamble which has paid off very, very well. Do you really believe that Google had a record quarterly profit right before the IPO by sheer coincidence? I don't. That's business, it's not a conspiracy. Why do some people insist on calling standard business practices a conspiracy? That's really bizarre. Google is a business, and they need capital to compete long term, and they got it, a lot of it. I don't blame them for this, it was a smart move; I just don't like what it did to their product short term.
If Google needed to boost AdWords income to maintain high initial IPO share prices, then you can be absolutely certain that's what they did.
If Google decides that it's time for e-commerce and affiliate sites to buy AdWords, it doesn't need a sandbox to communicate that message. It can turn the dupe-content filter up to "high," adjust its algorithm to give less weight to certain types of pages, deliver separate SERPs for "commercial" and "information" searches, and so on. Google won't even have to apologize: It simply can explain that it's improving the user experience, just as it did when the AdWords team announced the recent crackdown on direct-to-merchant affiliate ads. (Sure, there will always be those who can get around such changes, but that's true of the sandbox, too.)
I'm going to assume that not one of you has ever worked in networking or done real programming. If you had, you'd know that some issues are simply extremely difficult to solve.
I'm sorry, but your comment is not true!<edited :-)>
I do not think Google's algo is nearly as complex as the GNU/Linux kernel or Mozilla or OpenOffice, with more than 20,000,000 lines of code. And I see so much fundamental change in these awesomely complex projects every day...