|Changing the Way We Do Science - with Google Software|
|In February, the National Science Foundation announced the Cluster Exploratory, a program that funds research designed to run on a large-scale distributed computing platform developed by Google and IBM in conjunction with six pilot universities. |
The cluster will consist of 1,600 processors, several terabytes of memory, and hundreds of terabytes of storage, along with the software, including Google File System, IBM's Tivoli, and an open source version of Google's MapReduce. Early CluE projects will include simulations of the brain and the nervous system and other biological research that lies somewhere between wetware and software.
The entire article is an excellent read. The idea is that with enough data, we no longer need to create and test theories. At the O'Reilly Emerging Technology Conference this past March, Google's research director Peter Norvig observed: "All models are wrong, and increasingly you can succeed without them."
I can't say that I agree with this article. Take Google's own example: without the model which says that inbound links confer importance, they would have been lost.
Gene tracking, likewise, would be impossible without established models of gene behaviour.
On the subject of using empirical data correlation in place of theoretical equations: that's not science, that's engineering, and it has been going on for a very, very long time.
An engineer, given the task of determining the required thickness of a new material for a car part, will test the material at various thicknesses, draw a curve, read off the minimum predicted thickness and then add a generous safety margin. That's all this article is talking about - and it certainly is not science.
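That engineering workflow can be sketched in a few lines. This is only an illustration of the idea above - fit a curve to measurements, invert it to find the thickness that just meets the requirement, then scale by a safety margin. All the numbers and the linear-fit assumption are invented for the sketch:

```python
# A made-up sketch of the empirical approach described above: measure
# strength at a few thicknesses, fit a line by least squares, read off
# the minimum thickness meeting the load spec, add a safety margin.

def min_thickness(samples, required_strength, safety_factor=1.5):
    """samples: list of (thickness_mm, measured_strength) pairs,
    assumed roughly linear over the tested range."""
    # Least-squares fit: strength = a * thickness + b
    n = len(samples)
    sx = sum(t for t, _ in samples)
    sy = sum(s for _, s in samples)
    sxx = sum(t * t for t, _ in samples)
    sxy = sum(t * s for t, s in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    # Invert the fitted line to get the thickness that just meets spec,
    # then apply the safety margin on top.
    return safety_factor * (required_strength - b) / a

samples = [(1.0, 90.0), (2.0, 190.0), (3.0, 290.0)]  # strength ≈ 100*t - 10
print(round(min_thickness(samples, required_strength=190.0), 2))  # → 3.0
```

No theory of the material is needed, which is exactly the commenter's point: this is useful engineering practice, not a new kind of science.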
I agree, the author seems mystified by the brave new world of "cloud" computing and has given up trying to make sense of it, preferring instead to imagine some fundamental philosophical change in science.
Google's preference for models for the sake of models, the inbound-link formula being representative of it, is all too prevalent in the modern world, where we are faced with such an influx of data that nobody knows how to analyze it sensibly. Google makes a hash of it, and nobody seems to mind - but turning the web into a popularity contest is no basis on which science can be conducted.
For instance, suppose a genius biologist (a subscriber to the avant-garde biology Wired tells us of?) published the cure for cancer on his personal website last year. In fact the page is still online. But since he tragically got hit by a bus an hour after uploading the paper, he never got around to telling anyone else, nor to all that annoying business of search engine optimization. Googlebot traversed the page, but lacking any external inbound links, it ignored it...
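The anecdote is the inbound-link model's failure mode in miniature. Here is a minimal sketch of that idea (a simplified PageRank-style power iteration - Google's production algorithm is far more elaborate, and the toy "web" below is invented):

```python
# Simplified sketch of link-based ranking (PageRank-style power
# iteration). Not Google's actual algorithm - just the core idea that
# a page's score comes from the scores of pages linking to it.

def pagerank(links, damping=0.85, iters=50):
    """links: dict page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs) if outs else 0.0
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

# A page with no inbound links scores lowest, however valuable it is.
web = {"a": ["b"], "b": ["a"], "orphan": ["a"]}
ranks = pagerank(web)
print(min(ranks, key=ranks.get))  # → orphan
```

However brilliant the orphan page's content, the algorithm can only see its link count - which is the commenter's complaint about popularity standing in for merit.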
In science, consensus is utterly meaningless. All that matters are the facts as they can be proven. In terms of data presentation, peer review is the best we have - Google's PageRank should be entirely discarded.
In other words, this idea of patterns being the be-all and end-all of science, and of models for the sake of models, is very dangerous.
The future ultimately has to be about much more intelligent computing, and pulling apart patterns - smashing through illusions - to see the underlying realities.
This is a good response:
As someone with a Ph.D. in physics I have to take exception with the main idea put forward in this article. It's dangerous to say that science should be reduced to looking at patterns and from that deduce what's going on.
Physics Nobel Prize winner Richard Feynman warned of this in his Caltech commencement speech about Cargo Cult Science [lhup.edu]. What is cargo cult science? I'll let the late great Professor Feynman explain it himself.
In the South Seas there is a cargo cult of people. During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So they've arranged to imitate things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas - he's the controller - and they wait for the airplanes to land. They're doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn't work. No airplanes land. So I call these things cargo cult science, because they follow all the apparent precepts and forms of scientific investigation, but they're missing something essential, because the planes don't land.
The analogy may not be perfect with respect to what's discussed in the article, but the idea is the same. We're bound to run into trouble if we don't have a good model that explains the data that we're looking at.
Also, don't forget that truly good models do more than explain the data that we have. They predict new things and explain other current data that no one previously understood. This happened often in the development of particle physics, where a new model based on symmetry predicted particles that had not yet been seen and the particles were later discovered. The added benefit was that we came to understand that symmetry is a fundamental basis for how things work in the universe. If it wasn't for hypotheses and models, if we had just looked at the data, we would have missed out on this.
That brings me to my next point, that there is a group of people who have had more data than they know what to do with for quite a while now - the particle physicists! Over six petabytes of particle physics data are stored at Fermilab [isgtw.org]. However, when physicists want to study the data they don't just grab a bunch of it and run a statistical analysis without any reason behind the analysis (as proposed in the article). In fact, physicists are very careful to develop an advanced analysis first (it is tested on randomly generated 'Monte Carlo' data). This is called a blind analysis (pdf) [slac.stanford.edu] and is the particle physics equivalent of a double-blind randomized clinical trial in medical research. Only once they know that the analysis is correctly targeted to test a certain hypothesis do they actually run it on real data. Once the analysis is run on real data it is final. There is no going back to tweak it to try to get a 'better' answer. If the answer is surprising or seems incorrect it should not be thrown out because it might just be a new discovery! This is the only way to truly do what one can call science.
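The blind-analysis workflow described above can be shown with a toy example. Everything here is invented for illustration (the distributions, the signal fraction, the significance formula); the point is only the discipline of the two steps - tune on simulation, then run once on data:

```python
# Toy illustration of a blind analysis: tune the selection cut on
# simulated 'Monte Carlo' events (where true labels are known), freeze
# it, then apply it exactly once to the data. All numbers invented.
import random

random.seed(0)

def simulate(n):
    """Monte Carlo generator: (measured value, is_signal) pairs."""
    return [(random.gauss(3.0, 1.0), True) if random.random() < 0.1
            else (random.gauss(0.0, 1.0), False) for _ in range(n)]

# Step 1: pick the cut that maximizes significance on simulation only.
mc = simulate(10_000)
best_cut, best_score = None, -1.0
for cut in [c / 10 for c in range(60)]:
    sig = sum(1 for v, s in mc if s and v > cut)
    bkg = sum(1 for v, s in mc if not s and v > cut)
    score = sig / (sig + bkg) ** 0.5 if sig + bkg else 0.0
    if score > best_score:
        best_cut, best_score = cut, score

# Step 2: the cut is now frozen. Apply it once to the (unlabelled)
# data; the result stands, with no re-tuning for a 'better' answer.
data = [v for v, _ in simulate(10_000)]  # stand-in for real events
excess = sum(1 for v in data if v > best_cut)
print(best_cut, excess)
```

Because the cut was chosen before the data were looked at, a surprising excess cannot be explained away as the analyst fishing for a result - which is the whole point of blinding.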
Now that isn't to say that one cannot learn things by looking at a large corpus of data and trying to find patterns. I think statisticians have been doing this for quite a while. No doubt some interesting things can be learned, such as finding new species, which was mentioned in the article. But that's not the scientific method and won't replace science. For instance, I didn't see anything in the article about the scientist discovering a new model for the evolution of species and proving that the model is correct. To be sure, one might see some patterns, say "Hey, look at that," and form a new hypothesis based on them. But to become a real scientific theory, that model would have to be tested against other data to see if it still held true. To be a great theory it would need to predict new phenomena which have not yet been observed. A truly great theory will even teach us something fundamental about the nature of the universe, or of the body, or human nature, etc.
In summary, the new tools for analyzing large data sets in a statistical way will no doubt be very useful. But they will not replace the scientific method.
I agree with you, physics. The cause-effect model (and it is really just that - a construct or model in consciousness) has done great service for the human race, and it will continue to serve. However, the emergence of statistical techniques for working with super-large data sets has its own place in our tool kit too. And its value will also rest on its predictive abilities.
As I see it, we are moving to penetrate into the "sub-causal" world of mental models, in parallel with the physical penetration of the sub-atomic "reality". If you can make valid predictions with an approach, then where's the problem in not having a model?
The applications of complexity theory to biology may well depend on this kind of approach. For example, contemporary pacemakers, for the heart and other organs including the brain, are still quite brutish. They deliver a mega whack to the organ, smacking it back in line to get the desired behavior. But complexity theory describes the phenomenon of a strange attractor: with the gentlest of touches at exactly the right moment, the heart or brain may be nudged into a completely different (and more healthful) state.
We know the two states exist and that transitions between them can be profound in their effect. It appears that we don't need to address the "cause" of a heart attack or a seizure in order to provide relief, if we have a large enough data model to work from.
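The "gentle nudge between two states" idea can be seen in a toy model. This is not a cardiac model - just an invented overdamped particle in a double-well potential, which has two stable states like the two regimes described above. A tiny kick delivered near the barrier flips the state; the same kick delivered after the system has settled does nothing:

```python
# Toy bistable system: overdamped motion in the double-well potential
# V(x) = x^4/4 - x^2/2, whose stable states are x = -1 and x = +1.
# Illustrates timing-sensitive nudging, not any real physiology.

def settle(x0, kick_step, kick, steps=5000, dt=0.01):
    x = x0
    for step in range(steps):
        if step == kick_step:
            x += kick          # the brief external nudge
        x += dt * (x - x ** 3)  # flow toward one of the two stable states
    return round(x, 3)

# Starting just left of the barrier, the system relaxes into state -1...
print(settle(-0.05, kick_step=None, kick=0.0))   # → -1.0
# ...a tiny, well-timed kick near the barrier flips it to state +1...
print(settle(-0.05, kick_step=0, kick=0.1))      # → 1.0
# ...but the same kick after the state has settled changes nothing.
print(settle(-0.05, kick_step=4000, kick=0.1))   # → -1.0
```

The same perturbation, differing only in timing, either switches the state or is absorbed - which is the intuition behind nudging a heart or brain with "the gentlest of touches at exactly the right moment."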
There has been some remarkable work from Stephen Wolfram on addressing macro-scale phenomena with micro-scale data approaches. When I read his book "A New Kind of Science", it opened me to a wildly different understanding of the phenomenal world.
Traditional science will long be with us, most definitely. Newton serves just fine in much of the practical world, and Einstein needs to get hauled out in other situations. Just so with cause-and-effect theory models compared to no-model data analysis.
This all gets fuzzy around the edges - but so does matter/energy/space/time when you observe it in depth. And none of it accounts for consciousness itself very well.
I doubt that Google's systems will be able to do much better at that, either. AI and machine learning are not the same as consciousness itself, the very ground where mind stuff appears. Science is worked through the manipulation of mind stuff, but consciousness itself is prior, or senior, to it all. The tail cannot wag the dog! (Am I at all coherent at this point?)
Poor Chris Anderson. I have been avidly reading Wired for a while, but he's starting to sound like the Maxim spokesman of tech, and with this article he's talking pretty much gibberish. The cloud is nothing more than a powerful computer, and without a human telling it what to do and how, it's pretty much worthless. Science, models, theories, formulas and experiments will always be needed and will lie at the center of sound technology; patterns, after all, are but a subdivision of math. All the cloud contributes is massive Monte Carlo simulations and analysis of large data sets; without models and theories, I am pretty sure the cloud will one day conclude, based on Google Space data, that the Sun rotates around the Earth. Mr Anderson, we share your enthusiasm for technology, but we don't share your enthusiasm for tech-sensationalism.
|Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. |
Gee, that's really useful. How does it help a suicidal teenager to know that "people do it" and we don't even care why? So this is the "point"?
|The models we were taught in school about "dominant" and "recessive" genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton's laws... |
I first studied genetics in the days when it was pure statistics because there was no way to see inside the human chromosome. Even then, we knew it really wasn't that simple. I'm now working in cancer genetics research in the days of epigenetics and SNPs and we still know it's not that simple. Sure, new data changes the way research is done; the 80-year-old doctor I work with is hanging for dear life onto "linkage studies" (put simply, someone with trait A is more likely to have trait B than is supported by chance, so the genes controlling the two traits are likely to be relatively close to each other), when SNPs can identify the genetic source of disease so much faster and more efficiently that it hardly bears talking about. But SNPs have to be "made" by humans working in labs - humans who still have a lot of things to figure out. Yes, the human genome has been sequenced, but that means we've learned the alphabet - it doesn't mean we know how to read everything it says.
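The parenthetical definition of a linkage study above has a simple numerical core: if trait A and trait B co-occur more often than independence predicts, the genes behind them are likely close together. A minimal sketch, with invented counts:

```python
# Toy version of the linkage intuition described above: compare the
# observed co-occurrence of two traits with what pure chance predicts.
# All counts are invented for illustration.

def cooccurrence_ratio(n_total, n_a, n_b, n_both):
    """Ratio of observed co-occurrences to the count expected if the
    two traits were independent; values well above 1 suggest linkage."""
    expected = n_a * n_b / n_total
    return n_both / expected

# 1000 people, 100 with trait A, 100 with trait B. Chance predicts
# about 10 with both; observing 60 is a strong hint of linkage.
print(cooccurrence_ratio(n_total=1000, n_a=100, n_b=100, n_both=60))  # → 6.0
```

A real linkage study would attach a significance test (e.g. a LOD score) rather than a bare ratio, but this is the "more likely than supported by chance" comparison the comment describes.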
Speaking of the human genome, a disclaimer here: as someone who thinks science is helped by people sharing their research results, I have no love for Venter. I'll choose Francis Collins to be in charge of my genome, thank you:
|Venter can tell you almost nothing about the species he found. He doesn't know what they look like, how they live, or much of anything else about their morphology... By analyzing [data] with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation. |
Collecting "statistical blips" is okay, but if it stops there, what advancement has it performed? If other scientists take his data and run with it, and actually learn something about the species he's found evidence of, that's when biology will be advanced. To say that Venter has "advanced biology more than anyone else of his generation" is... well, it's a lot of things, most of which I can't say here.
Of course, you can ask anyone who uses AdWords or who tries to produce good SEO about the definition of "Google quality." Is that really what you want your doctor to base his treatment decisions on?
Once a week or so, I change the quote I have hanging on my office door. One of my favorites is from Isaac Asimov (who actually wrote a lot more science books than he did science fiction): "The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'"
My personal belief is that, no matter how much data we manage to aggregate, it will still take a human mind to look at it and say, 'That's funny...' and then to go on and make sense of it.
|patterns after all are but a subdivision of math. |
True, but massive computing power is also setting the traditional world of math on its head.
For instance, the "four color theorem" was proved only via automated computer work on a massive scale, and that proof has also been verified only through similar massive computation. It seems this proof cannot be made or checked "by hand", and many mathematicians are more than a bit disturbed by the implications of all that.
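For a small taste of what "computer work" on coloring looks like - and nothing more than a taste, since the actual proofs check thousands of reducible configurations rather than coloring one graph - here is a backtracking search for a 4-coloring of a tiny planar graph. The graph itself is invented:

```python
# Toy machine case-checking: backtracking search for a 4-coloring of a
# small planar graph. Only an illustration of exhaustive search, not
# the technique of the actual four color theorem proofs.

def four_color(adj):
    """adj: dict node -> set of neighbors. Returns a coloring or None."""
    nodes = list(adj)
    colors = {}

    def solve(i):
        if i == len(nodes):
            return True
        node = nodes[i]
        for c in range(4):  # try each of the four colors in turn
            if all(colors.get(nb) != c for nb in adj[node]):
                colors[node] = c
                if solve(i + 1):
                    return True
                del colors[node]  # backtrack
        return False

    return colors if solve(0) else None

# K4 plus a pendant vertex: planar, and the K4 part forces all 4 colors.
graph = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
coloring = four_color(graph)
print(coloring is not None)  # → True
```

Checking one graph takes microseconds; the discomfort mathematicians feel comes from proofs whose case analysis is so vast that only a machine can traverse it.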