| This 75 message thread spans 3 pages: < < 75 ( 1 2  ) || |
|Is The "Sandbox" Ending?|
Is this Doomsday, or is Something Wonderful about to happen?
The sandbox is coming to an end soon. It will be rolled out gradually, perhaps as fast as google used to be able to incorporate new documents before they started having problems late last winter. Even though they could probably incorporate all the documents above the 2^32 level (2 to the 32nd power, or about 4.3 billion documents) established last winter at once and begin doing their cycles of algorithmic calculations, I think they are just going to introduce documents into the expanded matrix at a gradual rate. That is, if they introduced all the documents into the new, expanded matrix at once, they could do the full set of various algorithmic calculations and we would see tremendous upheaval in the SERPs, on a much larger scale than ever seen before, because their new matrix is simply capable of that; but to demonstrate such power would be an admission that there was a capacity problem. Rather, I think they will roll all the sandboxed documents into the expanded index over the next 6 to 8 weeks, and that the process has already started. It will look pretty much like the old rolling updates google used to do, except it will be strong and nearly continuous, punctuated by periods of stability through the majority of the matrix at any one time.
To admit there was a capacity problem after all this time might be taken by some to be an admission of culpable negligence in their failure to advise potential investors regarding serious technical issues during their IPO period. I don't think there will be any culpable negligence issues because Google will not fail. However, if it did fail, I think the fact that they have withheld such information would make them subject to suit, perhaps even criminally if some of the ones who profited on the IPO were the ones who concealed the capacity problem. They would only be guilty of negligent deception IF THEY FAILED.
It's kind of like you wake up in the middle of the night and there is someone in your bedroom and they make a frightening sound and it's dark and you see something flash towards you and you are so scared you shoot into the darkness, to later find it's a serial killer wanted in a nationwide manhunt and you are a freaking hero and on talk shows everywhere, or, for a change of scene, it's the neighbor's senile grandfather and you are doing 10 to 20 in max lockdown with Bubba Joe who likes to scratch his ass and sniff his fingers when he's not telling you how pretty your eyes are.
I don't have anything that would serve as proof of what I'm saying, but it pretty much stands to reason that if Google has been perfectly mum about the sandbox to this point, that they are not going to so quickly incorporate new and faster expanded technology at such a rate that it requires public statement.
If the sandbox phenomenon is over and/or in the process of ending, what would it likely look like? Would it be rolled out all at once? by topological area? by chronological time in the sandbox? alphabetically? by pages or by domains?
What will the results look like to us as they change? There must be tens of thousands of sites released since last winter that are sandboxed. As they take their place in the higher SERPs, will there be a mad assault, or will it be more like a gradual infiltration? Should we expect to see gradual changes in every area over time, steady like an hourglass, or will we see a week of dramatic change, followed every couple of weeks by more dramatic change, for a couple of months? Or will we just wake up one morning to find that hurricane google has rewritten the face of the internet, with major devastation in its wake and young hopeful sites seeing sunlight for the first time?
With MSN's new engine expected to go online perhaps as early as February, and google's known fondness for upstaging MS, how much later can they wait before they release the sandbox? The SERPs are apparently beginning to change. I've already heard of several people who've claimed their many-month-long-sandboxed site is out of the sandbox. Could it be that this is really the beginning of the end?
I just think google went to this double database until they convert to the new 64 bit system. Right now the new 8 Billion page database is just a cover-up until they finish the upgrade.
You got to remember that Google is cheap. They are the masters of making CHEAP look like a billion dollars.
With a few hundred thousand dollars we could create a search engine that could compete with google's search right now.
The only thing google has that we don't is massive amounts of traffic. That is their biggest problem.
Search isn't trivial. It's basically simple to figure out. If we can figure out how to rank high in the search engines, we can surely create one that works well. That's the easy part.
The hard part is getting the masses to like your engine. And google has that figured out. Because they are innovative and fresh (so to speak).
I think the 64 bit system is in place right now. They are just working on the software to run it right now. The proof is in the new Mozilla 5.0 crawler they have now.
I've seen it crawling the pages of a new site I have, but they are not indexed yet. Then the old crawler hit and those pages got indexed right away. So I think they are about to switch over to the new 64 bit system soon.
|The proof is in the new Mozilla 5.0 crawler they have now. |
I don't think anyone here would accept that as "proof" that their entire underlying database strategy is changing the size of its index keys.
It's a nice supposition, but "proof"? No, I really don't think so :-)
Getting back on topic here...
> The sandbox is coming to an end soon. It will be rolled out gradually, perhaps as fast as google used to be able to incorporate new documents before they started having problems late last winter.
This is wishful thinking.
That doesn't mean it is right or wrong. You might be right. But your declarative statement that "The Sandbox is coming to an end soon" is not supported by any facts. It is based on your "theory" that Microsoft is an imminent threat.
Just because you think that Google might have an incentive to upstage Microsoft soon doesn't mean it will happen in the next few months. Heck, Microsoft had an incentive to get a great search engine out before Christmas shopping and they didn't. So why should that happen quickly now?
Have you been around long enough to recall Windows 1.0, 2.0, 3.0, Win95 and so on? It took them a while to get it right and gain market share. To assert that they will get it right in the next 60 days is probably silly (I am not arrogant enough to say that I have a crystal ball either - they could get it right, but I doubt it).
Microsoft's history is to get things out and improve them over time, incrementally.
I would like to see the sandbox end, but wanting it to happen doesn't mean it will happen.
Commenting on another topic raised in this thread...
The notion of two databases at Google makes some sense, if only based on precedent. For a long time, Inktomi (which Yahoo! bought) had two databases: the BOW database, which was the "core" of the internet, and everything else. This was a public FACT. I sat in on a presentation from Inktomi on this, and it has been extensively commented on here and other places.
So it is quite logical and conceivable that this is going on to some degree or another at google. Heck, they certainly have a spidering priority schedule at google, and there are "supplemental results". So maybe there are 5+ databases, not just two (just as a wild ass guess).
>>Also when I look at their crawling frequency.
>Now this is an interesting observation. I can't personally attest to any slowdown
>in crawling activity, but I've heard other reports from people that I trust as reliable.
>Thanks for bringing this angle up. Anyone else care to comment?
>Seen any slowdowns in spider activity?
I was just studying the logs for a site of mine that I haven't added links to in two months. In case anyone is interested in my data, here are the number of googlebot hits by day.
Nov 1 - 8
Nov 2 - 75
Nov 3 - 2
Nov 4 - 2
Nov 5 - 183
Nov 6 - 8
Nov 7 - 1
Nov 8 - 6
Nov 9 - 5
Nov 10 - 5
Nov 11 - 5
Nov 12 - 5
Nov 13 - 8
Nov 14 - 10
Nov 15 - 12
Nov 16 - 14
Nov 17 - 128
Nov 18 - 19
Nov 19 - 17
Nov 20 - 16
Nov 21 - 17
Nov 22 - 15
Nov 23 - 15
Nov 24 - 14
Nov 25 - 16
Nov 26 - 15
Nov 27 - 15
Nov 28 - 197
Nov 29 - 6
Nov 30 - 25
Dec 1 - 23
Dec 2 - 26
Dec 3 - 18
Dec 4 - 15
Dec 5 - 35
Dec 6 - 46
Dec 7 - 43
Dec 8 - 44
Dec 9 - 30
Dec 10 - 18
Dec 11 - 38
Dec 12 - 34
Dec 13 - 215
Dec 14 - 30
Dec 15 - 28
Dec 16 - 92
Dec 17 - 53
Dec 18 - 39
Dec 19 - 24
Dec 20 - 31
Dec 21 - 24
Dec 22 - 35
Dec 23 - 18
Dec 24 - 32
Dec 25 - 164
Dec 26 - 27
Dec 27 - 19
Dec 28 - 40
Dec 29 - 26
Dec 30 - 9
Dec 31 - 0
EDIT: A gif of the graph is at [qcguide.org...]
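For anyone who would rather crunch the numbers than eyeball the graph, here is a rough Python sketch over the same daily counts. The 100-hit cutoff for what counts as a "spike" is my own assumption, not anything Google publishes, and the median is used as the baseline only so the big crawl days don't skew it.

```python
from statistics import median

# Daily googlebot hits copied from the post above (Nov 1-30, then Dec 1-31).
nov = [8, 75, 2, 2, 183, 8, 1, 6, 5, 5, 5, 5, 8, 10, 12, 14, 128, 19, 17, 16,
       17, 15, 15, 14, 16, 15, 15, 197, 6, 25]
dec = [23, 26, 18, 15, 35, 46, 43, 44, 30, 18, 38, 34, 215, 30, 28, 92, 53,
       39, 24, 31, 24, 35, 18, 32, 164, 27, 19, 40, 26, 9, 0]

# Human-readable labels so no calendar year has to be assumed.
labels = ([f"Nov {d}" for d in range(1, len(nov) + 1)]
          + [f"Dec {d}" for d in range(1, len(dec) + 1)])
hits = nov + dec

SPIKE_THRESHOLD = 100  # assumption: more than this many hits is a "deep crawl" day

# Typical day per month, ignoring the distortion from spike days.
nov_baseline = median(nov)
dec_baseline = median(dec)

# Positions of the spike days, their labels, and the gaps between them in days.
spike_days = [i for i, h in enumerate(hits) if h > SPIKE_THRESHOLD]
spike_labels = [labels[i] for i in spike_days]
gaps = [later - earlier for earlier, later in zip(spike_days, spike_days[1:])]

print("baseline hits/day:", nov_baseline, "(Nov)", dec_baseline, "(Dec)")
print("spike days:", spike_labels)
print("days between spikes:", gaps)
```

On this one log the typical day roughly doubles from November to December, while the big crawls keep arriving every 11 to 15 days. It is a single site, so take it as a data point rather than evidence of anything Google-wide.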
I keep seeing people on this list claiming that Google's results are so much better than the rest. Upon what are these claims based because this is not my experience?
|I keep seeing people on this list claiming that Google's results are so much better than the rest. Upon what are these claims based because this is not my experience? |
Our own observations, maybe? :-)
I'm really curious as to what is going to happen with the sandbox... However, I think it will not be gone in the near future, for various reasons, one being that it is a massive fightback against 90% of the common spam.
<< let's say from a fab built before some years for 200K of 1U/2U HP ProLiants (let's estimate Pentium III 833) to 200K+ of (?) with double Opterons and double RAM maybe.
Also when I look at their crawling frequency. They are for my site still far behind MSIE and Yahoo for the last 5-6 months now. This suggests either they don't love me anymore or they are simply still short in computing power for a half year? >>>
xcomm, good points, since we're just speculating here, my guess is that Google will stay with the same philosophy, simple, cheap, basic. You could be right, but creating dual processor systems would as you note create a huge amount more heat in tight spaces. So I'd guess single processor amd 64s, maybe dual ddr, single ide hard drive, no larger than 120 gigabytes [sata is still I think not fully stable in the 2.6 kernel, but google could easily have had that problem fixed I suspect, redhat has good sata support], it's not that much more expensive, but it's very fast. Everything else basic. Still a lot hotter than the old system. By the way, there was a good hardware article, all very vague, on how much heat was created by google's servers. The numbers are very high, well above the industry standard per cubic foot, or whatever units they use to measure server farm heat output. What creates this heat? Hot processors.
I've seen the google crawl frequency and intensity drop dramatically over the last year. In fact, I've seen nothing but reduced performance by google over the last year. Both slurp and msnbot spider much more aggressively, quickly, and deeply on a consistent basis. Slurp's quick spidering unfortunately seems to have almost zero connection to getting those spidered pages into their main index, which also by the way seems to be quite full, although the symptoms are manifested in slightly different, and more random, ways than Google.
Google is still far ahead of yahoo in terms of getting new content indexed and up fast however, usually a few days. Definitely a lag on large new blocks of content, but for naturally growing sites, it's very fast. MSN beta is also very fast, I raced them some time ago to see which would get a new page indexed and ranked first and it was pretty much a tie.
Something struck me. First of all, I don't believe google runs 200,000 servers; that was from one truly horrible article posted by some google spin master, and I don't think the numbers are anywhere close to that. But I do wonder if at least some of the last generation servers are being used to run things like gmail, which is really not that processor intensive; power requirements would be kept down as well, and heat. I don't know how google is storing its gmail data. My guess is not on local machine hard drives, for many reasons; raid arrays for sure, I think, due to requirements for live real time swapping of failed drive units, much like what phone companies use to store voicemail etc.
>> that was from one truly horrible article posted by some [...]
While i don't believe in calling other members names, i think you might be referring to the Technology Review article referred to in this thread: [webmasterworld.com...]
The quote was "more than 250,000 Linux-based servers". Funny you should mention it, as in that thread (after getting input from you) i calculated (very roughly) that
|250,000 machines @ 70Gb should make room for roughly 875,000,000,000 pages at zero compression |
So, even if we are a bit conservative and let the number of servers be 100K, there's still plenty of storage for a few indexes or so :)
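To make the arithmetic behind that estimate easy to check, here is a rough Python sketch. It assumes decimal units (1 GB = 1,000,000 KB) and reads "70Gb" as 70 gigabytes of disk per machine; the roughly 20 KB per page falls out of the quoted numbers rather than being stated anywhere. The function name and its defaults are just illustrative.

```python
# Back-of-envelope check of the quoted storage estimate.
# Assumptions: decimal units, 70 GB of usable disk per machine, zero compression.
KB_PER_GB = 1_000_000

MACHINES = 250_000
DISK_GB_PER_MACHINE = 70
QUOTED_PAGES = 875_000_000_000

total_kb = MACHINES * DISK_GB_PER_MACHINE * KB_PER_GB
implied_kb_per_page = total_kb / QUOTED_PAGES  # page size implied by the quote


def pages_that_fit(machines, disk_gb=70, kb_per_page=20):
    """Pages storable on `machines` boxes (hypothetical helper, no compression)."""
    return machines * disk_gb * KB_PER_GB // kb_per_page


conservative = pages_that_fit(100_000)

print("implied page size (KB):", implied_kb_per_page)
print("pages on 100K machines:", conservative)
```

With 100K machines the same assumptions still leave room for a few hundred billion pages, which is where the "plenty of storage for a few indexes" remark comes from. This only covers raw disk, not RAM or redundancy.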
<< one truly horrible article posted by some google spin master >>>
Sorry, I wasn't clear, I meant 'published by some google spin master', I was referring to the author of the article, not the person who posted the link to the article. I think it's better for people to decide for themselves who is putting out spin etc on these forums, it's pretty obvious over time.
While I don't remember the math, if your math is right, I'd say Google is running about 1/10th that number of machines, which is what almost all estimates I've read in the past suggest. Nothing like massively inflating numbers to scare the competition, or, more likely, to impress investors, since the competition already knows perfectly well how much hardware is needed to do the job.
With images taking about 50,000 gigabytes (assuming about 50kB average per image, 880,000,000 indexed today), web page html for all pages requiring only about 2500 machines by your math, allowing for significant redundancies, and various other services like gmail etc, you're probably in the right ballpark range with 50-100,000 machines, maybe less [good google hardware article [zdnet.co.uk]].
One thing that has been somewhat under the radar is the new datacenters they are building (at a rather aggressive rate). Why would they be doing this unless they required more computing power? I think it is obvious they are in the midst of some sort of massive rebuild.
Anybody have any data on the locations and start dates of all of the Google datacenters?
(don't expect this would be easy to come by)
|So, even if we are a bit conservative and let the number of servers be 100K, there's still plenty of storage for a few indexes or so :) |
Yes for storage/backup this may be right, but I would assume they will need to have their 2 running 32Bit indexes in RAM!
Let's begin with this calculation:
Index size (the 8 billion page figure, i.e. two 32-bit indexes; a single 2^32 is 4294967296) == 8058044651 Pages
Average HTML page maybe 15KB
-> 8058044651 Pages * 15KB = 120870669765KB = 115271.3GB
If their machines have 2GB of RAM
115271.3GB / 2GB == 57635.6 Servers
(This is about an optimum. Some (sure not much) RAM for GNU/Linux - some machines for other work - some in maintenance - and what about redundancy?)
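The same calculation as a small Python sketch, in case anyone wants to play with the inputs. The page count, the 15 KB average and the 2 GB of RAM are the figures from the post; binary units (1 GB = 1024 * 1024 KB) are assumed because that is what reproduces the 115271.3 GB figure. The `copies` and `usable_fraction` knobs are hypothetical additions for the redundancy and OS overhead mentioned in the parenthesis.

```python
import math

PAGES = 8_058_044_651      # page count used in the post
AVG_PAGE_KB = 15           # assumed average HTML page size
RAM_GB_PER_SERVER = 2      # assumed RAM per machine
KB_PER_GB = 1024 ** 2      # binary units, to match the post's figures

# For reference: one 32-bit document ID space holds 2**32 = 4,294,967,296 pages,
# so the page count above fits within two such spaces.
ONE_32BIT_INDEX = 2 ** 32

total_kb = PAGES * AVG_PAGE_KB
total_gb = total_kb / KB_PER_GB
servers_exact = total_gb / RAM_GB_PER_SERVER


def servers_needed(copies=1, usable_fraction=1.0):
    """Whole servers needed to hold `copies` of the pages in RAM.

    `usable_fraction` is the share of RAM left after the OS (hypothetical knob).
    """
    return math.ceil(servers_exact * copies / usable_fraction)


print("total GB:", round(total_gb, 1))
print("servers, best case:", servers_needed())
print("servers, 2 copies and 90% usable RAM:", servers_needed(2, 0.9))
```

The best case lands on the same 57,6xx servers as above; any redundancy multiplies that directly, which is the point of the question at the end of the post.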
Xcomm, I think you're not quite getting how the system works:
|The process |
Obviously it would be impractical to run the algorithm once on every page for every query, so Google splits the problem down.
When a query comes in to the system it is sent off to index servers, which contain an index of the Web. This index is a mapping of each word to each page that contains that word. For instance, the word 'Imperial' will point to a list of documents containing that word, and similarly for 'College'. For a search on 'Imperial College' Google does a Boolean 'AND' operation on the two words to get a list of what Hölzle calls 'word pages'.
"We also consider additional data, such as where in the page does the word occur: in the title, the footnote, is it in bold or not, and so on."
Each index server indexes only part of the Web, as the whole Web will not fit on a single machine - certainly not the type of machines that Google uses. Google's index of the Web is distributed across many machines, and the query gets sent to many of them - Google calls each one a shard (of the Web). Each one works on its part of the problem.
Google computes the top 1000 or so results, and those come back as document IDs rather than text. The next step is to use document servers, which contain a copy of the Web as crawled by Google's spiders. Again the Web is essentially chopped up so that each machine contains one part of the Web. When a match is found, it is sent to the ad server which matches the ads and produces the familiar results page.
From the link above.
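To make the quoted description a bit more concrete, here is a toy Python sketch of a word-to-pages index split across shards, with a Boolean AND on the query words. It is only an illustration of the idea as described in the article, not Google's actual code: the function names, the two tiny shards and the sample documents are all made up, and ranking, the document servers and the ad server are left out entirely.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_shard(index, query):
    """Boolean AND within one shard: documents containing every query word."""
    words = query.lower().split()
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)


def search(shards, query):
    """Send the query to every shard and merge the document IDs that come back."""
    hits = set()
    for index in shards:
        hits |= search_shard(index, query)
    return sorted(hits)


# Two made-up shards, each indexing only part of the "Web".
shards = [
    build_index({1: "Imperial College London", 2: "Imperial measurement units"}),
    build_index({3: "College football scores", 4: "Imperial College physics"}),
]

result = search(shards, "Imperial College")
print(result)  # [1, 4]
```

Note that what comes back is a list of document IDs rather than text, which matches the article: the page contents live elsewhere, and the index only has to answer "which pages contain these words".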
I'm guessing it has to do with the URLs. Each sector just has a parameter added based on the query string (?sectors=5). This used to give the same PR as the URL without the string added; not so anymore. Give each one of these sectors its own URL. Another point that may be important (but I don't think so) is your site map. You have links to these pages in a noscript tag saying you should have js enabled to see them, yet the page requires no js. Personally, I would remove the noscript tag there.