I feel very strongly that until we have a good grasp on why it exists, it will be very hard to beat.
I don't buy the explanation that it's intended to be a method of stopping spam. Why? One, it's doing too much collateral damage. Two, if you accept the 80/20 principle (20% of spammers are doing 80% of the spamming), and you realize that there are already multiple ways of beating the sandbox that all of those spammers are aware of, the spam explanation doesn't make sense anymore.
So, why does the sandbox exist?
The most obvious effect of the sandbox is that it prevents new domains (not pages) from ranking for any relatively competitive term. So, start thinking like a search engine - what would be the benefit of this?
Google can take off to the Bahamas, they can rebuild their server farm, they can reprogram everything. They did it before with relatively few resources, and now it's just a tiny drop in their cash reserves; they can do it all at once, and if they can't, there will be some unemployed Googlers very soon.
Then the question becomes, "Why have the band-aids they've applied to their index in the last 18 months been so pathetic? This alone should have been enough to endanger the IPO!"
If Yahoo can do it without PageRank, why can't Google? (True, they both are overly-dependent on keywords in anchor text.)
If one person at Gigablast can do almost all of the programming for a very respectable engine, why does Google have to rely on cute colored logos to keep everyone impressed?
At the very least, I think there's a management problem at Google and their priorities are messed up. But it's an uphill argument when they're all getting stinkin' rich over there in Mountain View. Maybe after Bubble2.0 we'll be able to figure out what happened.
first, the things that we know about sandlag pages and believe are agreed upon:
- sites/pages are in the "index" - can find them using site: and similar queries
- they appear in the serps for non-competitive terms and hardly appear for competitive terms
- no clear pattern at this time when and how sites/pages leave the sandbox.
so how would google accomplish this and produce the above symptoms?
A way would be to use filters and penalties implemented by having "if-new-site" logic in their algorithm. This seems too messy considering that there is an easier alternative.
At this point, I bring up "supplementals", not because they are related to the sandbox (that has confused somebody previously), but because the symptoms are very similar to those of supplemental pages: pages are in the index and also appear in the serps for non-competitive terms. so why not use the same technology (i.e. an index separate from the main one) to implement this quarantine of new sites? this would avoid any messy "if" programming. what remains is to figure out when and how sites/pages are migrated from this separate index to the main index.
why would google do this? quarantine new sites? this is where the bigger controversy is. some claim it is to fight spam. some (including me) claim google is out of capacity in its main index. perhaps if we are able to answer this question, it will help us figure out what criteria google uses to choose which sites leave the box and are integrated into the main index.
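A rough sketch of the separate-index idea, in Python. Everything in it is hypothetical: the names `main_index`, `quarantine_index` and `search`, the `min_results` threshold and the example domains are made up for illustration, and nobody outside Google knows whether anything like this is actually in use. It only shows that a fallback lookup, with no "if-new-site" ranking logic at all, would produce the symptoms listed above.

```python
# Hypothetical two-index lookup: new sites live only in quarantine_index.
# The quarantine index is consulted only when the main index is short on results.

def search(term, main_index, quarantine_index, min_results=3):
    """Return results for term, falling back to the quarantine index
    only when the main index has fewer than min_results entries."""
    results = list(main_index.get(term, []))
    if len(results) < min_results:
        # Non-competitive term: main index is thin, so quarantined pages show up.
        results += quarantine_index.get(term, [])
    return results


# Made-up data: one competitive term, one non-competitive term.
main_index = {
    "cheap hotels": ["old1.example", "old2.example", "old3.example"],
}
quarantine_index = {
    "cheap hotels": ["new.example"],
    "purple widget repair": ["new.example"],
}

competitive = search("cheap hotels", main_index, quarantine_index)
non_competitive = search("purple widget repair", main_index, quarantine_index)
print(competitive)       # the new site is absent here
print(non_competitive)   # the new site appears here
```

Under these assumptions the new site is still "in the index", shows up for the obscure term, and is nowhere for the competitive one, which matches the three symptoms without any per-site penalty.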
I bow my head and apologize for not having done any research on that.
> I would say 'it would take them a while' is correct, except I'd change that too: it is taking them a while. Obviously they couldn't start this during IPO time, but they can now, and equally obviously they were never going to admit what the real situation is.
> they can do it all at once
well..
I conclude so far that the four-byte-theory all in all is not too unreasonable. Since in the past none of us ever knew what the 'real situation' was, why not stop reading tea leaves and proceed to more tactical efforts:
As a matter of fact, most of us webmasters have been suffering more or less heavily from the pagerank of our new sites not having been reindexed for three months now. Can you imagine a headline in the Financial Times saying "google facing serious technical problems" or so? Just an idea to maybe accelerate what is going on.
> Untrue, I have PR0 pages ranking on un-lagged sites, and PR5 pages that are google-lagged that are nowhere.
Maybe, but did you, like me, watch some of them bounce up and down in ranking almost every hour? This is not what we'd expect from a thoroughly working search engine, is it?
> Why have the band-aids they've applied to their index in the last 18 months been so pathetic? This alone should have been enough to endanger the IPO!
Bandaids were enough when the press and their supporters didn't bother applying the kinds of critical standards they should have. Google has a cute name and company slogan, and for some reason this made everyone roll over and wave their legs in the air rather than just apply the same standards you apply to any other commercial/corporate entity. However, think of the damage it would have done if the press had started printing articles about the algo being maxed: IPO prices would have dropped dramatically, since nobody wants a sick company. Then if you can implement some algo tweaks to force out enough pages to force webmasters to buy adwords, boost income, boost pre-IPO bottom lines, presto. Then work out the engineering headaches all these hacks create afterwards, now that is.
> back to the topic: what and why the sandlag?
We didn't leave the topic. The topic of the thread is why it exists, and the sandlag [haha] is a phenomenon that is relatively easily explained by physical limitations on the algo.
> This is not what we'd expect from a thoroughly working search engine, is it?
no, but it is exactly what I would expect from a holding pattern ahead of a full-on system redo. The example I've given before is when your hard drive is basically full: you start shuffling stuff in and out, waiting to add stuff [this is just an analogy, I'm not saying that google is physically out of storage space, that would be stupid]. Then finally one day you break down and realize not only is it time for a new hard drive, it's time for a new system altogether, since in the meantime everything has become faster and has more capacity. This analogy might be more accurate than we realize. Remember that google runs on the same boxes you run on, more or less; it doesn't use supercomputers, so what you see happen on your own whitebox is what is happening, more or less, on google's. And what's happening now is a move to 64 bit computing on Linux.
Seems so. The question is: how long will it take and, since this is in the interest of most of us, what if anything can we do to accelerate this process?
I assume you all know this joke: "What does a German do if faced with a red traffic light at three o'clock in the morning? He stops his car!" I hate that!
Oliver
Eric Schmidt: "How the hell did that happen? I'd better call back that pesky guy from Merrill Lynch who's been bugging me for ages about the IPO. By the way GoogleGuy you're demoted. New job working on Google holiday logos" (note: end of GoogleGuy on WebmasterWorld)
Merrill Banker: "You guys better IPO right now. I don't think any of my buddies at the pension funds would want these damaged goods. Let's make it an open auction and rip off the public instead."
Larry and Sergey: "Crap, how are we going to fix this mess? I know, we'll make 2 share classes so we retain a voting majority. That way even when the shareholders get pi**ed we'll be able to stop them firing us"
IPO announced 29th April to a huge sigh of relief at Google HQ.
You show your ignorance in the biggest way. :)
Any URL is going to be stored using a hash algorithm that they've developed--probably not a very complex one at that. If you take a look at the query string parameters when you click on "Cached" on a Google search result page you'll notice that there's a cache:? entry. Most likely the URL's identifier is the question mark portion.
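Google uses capital and small letters and numbers in this identifier. Even counting only 36 symbols (one case of letters plus the digits), that gives 36^12 or 4,738,381,338,321,616,896 possibilities, and treating capitals and small letters as distinct (62 symbols) gives more still, which should keep them going for a while.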
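For anyone who wants to check the arithmetic, here is a quick Python sketch. It assumes a 12-character identifier, as the exponent in the post implies. The variable names are mine, and the comparison with a four-byte ID assumes the "four-byte theory" discussed earlier in the thread means an unsigned 32-bit document ID, which is my reading rather than anything confirmed here.

```python
ID_LENGTH = 12                    # assumed identifier length, taken from the exponent above

ONE_CASE_SYMBOLS = 26 + 10        # letters of one case plus digits
BOTH_CASE_SYMBOLS = 26 + 26 + 10  # capital and small letters plus digits

ids_36 = ONE_CASE_SYMBOLS ** ID_LENGTH
ids_62 = BOTH_CASE_SYMBOLS ** ID_LENGTH

# Assumption: a four-byte document ID is an unsigned 32-bit integer.
four_byte_ids = 2 ** (8 * 4)

print(f"36 symbols, 12 chars: {ids_36:,}")
print(f"62 symbols, 12 chars: {ids_62:,}")
print(f"four-byte ID space:   {four_byte_ids:,}")
print(f"36^12 is about {ids_36 // four_byte_ids:,} times the four-byte space")
```

Either way the cache identifier space is vastly larger than a four-byte ID space, so the identifier format on its own says nothing about how many documents the index can hold.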
What does a numbering system for cached pages have to do with capacity of the index?
From reading the comments of people who actually understand the problem, the main problem seems to be not the capacity, but the increase in processing power required once the jump is made to a larger pagerank matrix.
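To give a feel for why processing grows with the matrix, here is a textbook-style power iteration for PageRank in Python. This is a minimal sketch and not Google's implementation: the function name `pagerank`, the tiny example graphs, the damping value of 0.85 and the fixed iteration count are all assumptions for illustration. The sketch counts how many links are touched, since every iteration has to walk every link in the graph.

```python
def pagerank(links, damping=0.85, iterations=20):
    """links maps each page to the list of pages it links to.
    Returns (rank, link_visits) where link_visits counts links processed."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    link_visits = 0
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
                link_visits += 1
        rank = new_rank
    return rank, link_visits


# Made-up graphs: the second has twice the pages and twice the links.
small = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
large = {**small, "d": ["e", "f"], "e": ["f"], "f": ["d"]}

small_rank, small_visits = pagerank(small)
large_rank, large_visits = pagerank(large)
print(small_visits, large_visits)
```

In this sketch the work per iteration scales with the number of links, so a bigger web graph means proportionally more computation for every recalculation, quite apart from where the page IDs are stored.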