Douglas Merrill, former CIO/VP of Engineering at Google, has issued an important warning about big data:
“With too little data, you won’t be able to make any conclusions that you trust. With loads of data you will find relationships that aren’t real… On net, having a degree in math, economics, AI, etc., isn’t enough. Tool expertise isn’t enough. You need experience in solving real world problems, because there are a lot of important limitations to the statistics that you learned in school. Big data isn’t about bits, it’s about talent.”
What Merrill is warning about and what he means by “talent,” I think, is the danger of blindly falling in love with correlations and not being able to develop a model that explains (or predicts) the relationships found. This is what I think David Smith means when he agrees with Merrill, saying that “this is a great illustration of why the data science process is a valuable one for extracting information from Big Data, because it combines tool expertise with statistical expertise and the domain expertise required to understand the problem and the data applicable to it.”
The “talent” of “understanding the problem and the data applicable to it” is what makes a good scientist: the required skepticism, the development of hypotheses (models), and the unending quest to refute them, following the scientific method that has brought us remarkable progress over the course of the last three hundred and fifty years.
But this tradition is threatened by the excitement—too much excitement?—around big data. According to nextgov.com, Farnam Jahanian, chief of the National Science Foundation’s Computer and Information Science and Engineering Directorate, believes that “Big data has the power to change scientific research from a hypothesis-driven field to one that’s data-driven.” Explains Bill Perlowitz, chief technology officer of Wyle science: “In hypothetical science, you propose a hypothesis, you go out and gather data and you see if your hypothesis is supported. That limits your exploration to what you can imagine. It also limits the number of relationships you can explore because the human mind can only go so far. The shift with data-driven science and big data is that first we collect the data and then we see what it tells us. We don’t have a pretense that we understand what those relationships are, or what information we may find.”
Sure, just take for example that fella Einstein, who had the “pretense” to speculate about the universe without having any data, big or small, to support his limited imagination…
At least one commentator on the nextgov post is not buying the next-big-thing-in-science, observing that “With Big Data, any clever analyst can find the data set with the right, spurious correlation to prove his bias. How not to fall for the ‘revelation-de-jour’ will be the Big Data challenge.”
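The commenter's point about spurious correlations is easy to reproduce. The sketch below is illustrative only: the function names and parameters are invented for this example and come from none of the sources quoted here. It generates series of pure random noise, which are unrelated by construction, and reports the strongest pairwise correlation it can find among them.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def max_spurious_correlation(n_series, n_obs, seed=0):
    """Strongest |correlation| among n_series independent noise series.

    Every series is independent Gaussian noise (an assumption of this toy
    setup), so any correlation found is spurious by construction.
    """
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_series)]
    return max(abs(pearson(a, b)) for a, b in itertools.combinations(series, 2))


if __name__ == "__main__":
    for n_series in (5, 50, 200):
        best = max_spurious_correlation(n_series, n_obs=20)
        print(f"{n_series:4d} unrelated series -> strongest |r| = {best:.2f}")
```

With the same seed, the smaller collections are subsets of the larger ones, so the strongest correlation found can only grow as more unrelated series are added. That is the sense in which “loads of data” produce relationships that aren’t real.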
I’m not sure how much this misguided excitement around big data is a clear and present danger to science right now. But the threat to sound business decisions is quite evident and some scholars are fighting back. Technology Review reports that Daniel Gayo-Avello, at the University of Oviedo in Spain, “knocks Twitter’s predictive crown off altogether.” After reviewing the work of researchers who claim that Twitter’s data can predict election results, Gayo-Avello concluded that it is flawed because of the simple fact that “social media is not a representative and unbiased sample of the voting population.”
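Gayo-Avello's objection can be illustrated with a toy simulation. Everything in it is hypothetical: the population size, the true level of support, and the assumption that one candidate's supporters are twice as likely to be on social media are made-up numbers chosen only to show the mechanism, not estimates of any real electorate.

```python
import random


def simulate_poll(pop_size=200_000, true_support=0.48,
                  p_online_if_a=0.30, p_online_if_b=0.15,
                  random_sample_size=1_000, seed=1):
    """Compare a large biased sample with a small random one.

    All parameters are hypothetical. Supporters of candidate A are assumed
    to be twice as likely to be "online" as supporters of candidate B.
    """
    rng = random.Random(seed)
    # True = supports candidate A, False = supports candidate B.
    population = [rng.random() < true_support for _ in range(pop_size)]
    # The "social media" sample: large, but membership depends on preference.
    online = [v for v in population
              if rng.random() < (p_online_if_a if v else p_online_if_b)]
    # A classic poll: small, but every voter is equally likely to be drawn.
    poll = rng.sample(population, random_sample_size)
    return {
        "true": sum(population) / pop_size,
        "biased_big": sum(online) / len(online),
        "biased_n": len(online),
        "random_small": sum(poll) / random_sample_size,
    }


if __name__ == "__main__":
    r = simulate_poll()
    print(f"true support for A:            {r['true']:.3f}")
    print(f"biased sample (n={r['biased_n']}):  {r['biased_big']:.3f}")
    print(f"random sample (n=1000):        {r['random_small']:.3f}")
```

Under these assumptions the biased sample is dozens of times larger than the random poll and still lands much further from the truth. Adding more observations shrinks the noise but does nothing to the bias.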
Gayo-Avello is joined, also on the pages of Technology Review, by Wharton’s Peter Fader, who responds unequivocally to a question about businesses that “promise to take a Twitter stream or a collection of Facebook comments and then make some prediction”:
“That is all ridiculous. If you can get me a really granular view of data—for example, an individual’s tweets and then that same individual’s transactions, so I can see how they are interacting with each other—that’s a whole other story. But that isn’t what is happening. People are focusing on sexy social-media stuff and pushing it much further than they should be. The important part, as both scientists and businesspeople, is to understand what our limits are and to use the best possible science to fill in the gaps. All the data in the world will never achieve that goal for us.”
Amazing. Apparently Gayo-Avello and Fader never heard that more data beats sampling (“the big data blasphemy,” as Meta S. Brown correctly labeled it) and that science has finally thrown off the shackles of hypothesis-making.
While a lot of money will continue to drive blind exploration in both science and business, I’m certain that the advancements and triumphs of the future will come from cool minds developing imaginative models and theories and testing them with the help of new big data tools and technologies. The trouble with big data may be just the hype surrounding it.
If you want to extract a few pounds of gold out of a mountain, you can dig the whole mountain and process millions of tons of rock, and get all the gold. Or you can use smart strategies to detect where gold is likely to be located (e.g. metal sensors, rock sampling, spatial statistics to estimate lode locations) and get 50% of the gold. The first technique produces a negative ROI, the second one produces a positive ROI. The first technique is equivalent to processing raw big data, the second is equivalent to processing carefully selected small data.
Depending, of course, on the relative costs of digging vs. metal sensors, sampling, etc., and some digging, which had better be less than 50% of the cost of digging the whole mountain. 🙂
Also keeping in mind that the capital investment in sensors, statistical models, etc. can be used on the next mountain, but the expense of digging and processing is limited to a single jobsite.
Excellent article that is right on point. I agree that it is not economically feasible to process an entire mountain for a few pounds of gold. However, it is not only feasible but desirable to process an entire corpus of data and enable fast, intuitive exploration and discovery of its information (data is big, information is small). This is becoming known as totality and exploration. At least two cutting edge tech firms are racing down this path – SPLUNK and SpeedTrack.
Vincent, your analogy suggests that if you’re in the business of digging up mountains for gold, you should first invest in developing the right machines for the job.
Using the ‘big data’ Haynes Technique, I would imagine a machine that would simply eat through a mountain, breaking it up into grains of sand; then it’s easy to filter out the gold with a $20 metal detector. My machine would also run quickly, unattended, and cost next to nothing to hire and operate.
I don’t care where the gold is – apart from it’s in them there hills.
I agree with Douglas Merrill. While I like the idea of positive ROI being a result of processing carefully selected small data, I am well aware that dimensionality reduction, the essence of any meaningful data analysis, is not the focus of data mining or big data scientists at the present time. It will, however, become the key effort in the future. As Merrill said, “having a degree in math, economics, AI, etc., isn’t enough. Tool expertise isn’t enough. You need experience in solving real world problems, because there are a lot of important limitations to the statistics that you learned in school. Big data isn’t about bits, it’s about talent.” That talent is very rare indeed.
This article is bang on. I am curious to see which companies have mastered the art of analyzing their own data warehouse versus an ocean of loosely coupled data. The problem with big data is that it comes with an enormous amount of noise, and unfortunately, that noise is subject to wide variances because the data is not sourced through controlled processes.
The traditional data warehouse is still a challenge to most companies because they are unable to align their source systems in a consistent fashion. As a result, data anomalies arise and send executives in a tailspin, culminating in a frenzy of activity to determine root cause. Imagine what root cause analysis will look like when the data source is relatively unpredictable.
Leveraging information effectively is a science, and in the end, the hype of Big Data will have to settle down into a realistic extension of proven data analysis techniques. In business intelligence, one often tries to differentiate between when information is actionable and when it isn’t. A huge amount of time and energy (translation: $) can be spent pursuing something that indicates a strong relationship/correlation, but if that information cannot be turned into measurable actions to realize the value proposition, then what has been gained? Big Data does provide a larger repository of possibilities, but it will fall far short of the promise if the proper methodologies are not put in place to mitigate the inevitable spike in false positives.
As a traditional data warehouse/BI practitioner I’ve had the same suspicions as voiced in this post, but couldn’t articulate them even to myself so clearly. I’m glad to find there are like-minded people out there and that my suspicions are justified. Still, I’m sure there are appropriate venues for big data approaches, but they need a lot more provisos and caution than we typically hear in the hype.
A similar discussion, albeit in the context of data mining, was posted yesterday here http://www.information-management.com/blogs/data-model-mining-DM-politics-Breiman-Schrodt-10022472-1.html?goback=%2Egde_2013423_member_115788255
with links to similar debates among statisticians and econometricians.
I’m really happy to see critical opinions about big data hype.
Many have trouble mastering “standard” data, and big aspirations will lead them off the road.
Start with a good reason WHY and only then dig into hard rock.
At the end of the day, whether we start from a cognitive approach: proposing and testing hypotheses (traditional statistical analysis), or whether we use the non-cognitive approach of having the data speak to us (data-mining), the relative value will be determined by real world results. If there is not an underlying story that makes sense in the non-cognitive approach, then it will ultimately fail in the real world.
Very nice article in terms of highlighting the importance of finding the “smart” predictive models. What I mean by “smart” is considering the business profitability goal and the required data pre-processing and post-processing decisions. However, this article could be more sentimentally positive while communicating this idea to the business/data science community. Thanks for the useful article.
Thanks! and I appreciate your point about being more positive. I blogged again today on the subject–I hope in a more balanced and less sarcastic way. See https://whatsthebigdata.com/2012/05/30/machines-vs-models-noise-vs-signal/
What’s new that’s been said here?
I have a theory that there are far more ‘good scientists’ who don’t hold any degree than those that do. Big data analytics is child’s play – what’s all the fuss about?
Perhaps Google and many other blue-chip companies are being let down by their recruitment agencies (or in-house policies) because they won’t consider the vast army of talented individuals who have not been processed through the university (uniformity) mills.
If you want to find the special talent – why not try looking outside the universities?
Analyzing big data is like weather forecasting. You might improve your forecast by adding more data points, but after all the process itself (the weather) is chaotic in the mathematical sense and defeats prediction beyond a certain time horizon. So certain efforts are simply hopeless.
And one thing that statistical analysis of data without testing hypotheses can never master is finding the difference between correlation and causality.
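The weather-forecast comparison in the comment above can be made concrete with a standard textbook example of chaos, the logistic map. This is a minimal sketch, not a weather model: the map, the starting value, and the threshold are choices made for this illustration. It measures how many steps two almost identical starting points take to end up far apart.

```python
def logistic_step(x, r=4.0):
    """One step of the logistic map x -> r * x * (1 - x); chaotic for r = 4."""
    return r * x * (1.0 - x)


def divergence_step(x0, delta, threshold=0.1, max_steps=200):
    """First step at which two trajectories differ by more than threshold.

    The trajectories start at x0 and x0 + delta. Returns None if they stay
    within threshold of each other for max_steps steps.
    """
    x, y = x0, x0 + delta
    for step in range(1, max_steps + 1):
        x, y = logistic_step(x), logistic_step(y)
        if abs(x - y) > threshold:
            return step
    return None


if __name__ == "__main__":
    for delta in (1e-4, 1e-7, 1e-10):
        step = divergence_step(0.2, delta)
        print(f"initial error {delta:.0e} -> forecasts diverge at step {step}")
```

Because errors in this map grow roughly exponentially, a large improvement in the precision of the starting measurement typically buys only a handful of additional predictable steps.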