The Big Data Debate: Correlation vs. Causation

In the first quarter of 2013, the stock of big data experienced sudden declines followed by sporadic bouts of enthusiasm. The volatility—a new big data “V”—continues this month, and Ted Cuzzillo summed up the recent negative sentiment in “Big data, big hype, big danger” on SmartDataCollective:

“A remarkable thing happened in Big Data last week. One of Big Data’s best friends poked fun at one of its cornerstones: the Three V’s. The well-networked and alert observer Shawn Rogers, vice president of research at Enterprise Management Associates, tweeted his eight V’s: ‘…Vast, Volumes of Vigorously, Verified, Vexingly Variable Verbose yet Valuable Visualized high Velocity Data.’ He was quick to explain to me that this is no comment on Gartner analyst Doug Laney’s three-V definition. Shawn’s just tired of people getting stuck on V’s.”

Indeed, all the people who “got stuck” on Laney’s “definition” conveniently forgot that he first used the “three Vs” to describe data management challenges in 2001. Yes, 2001. If big data is a “revolution,” how come its widely used “definition” is based on a dozen-year-old analyst note?

Ranting about how “blogs and articles yammer on with the benefits of ‘big data,’” Cuzzillo correctly observes that they are simply “repeating promises made years ago about the benefits of small data and small analytics. This is old decision support super-sized and warmed over, the ‘new and improved’ that won’t satisfy any better than the original but which costs much, much more.”

Cuzzillo is joined by a growing chorus of critics who challenge some of the breathless pronouncements of big data enthusiasts. Specifically, it looks like the backlash theme-of-the-month is correlation vs. causation, possibly in reaction to the success of Viktor Mayer-Schönberger and Kenneth Cukier’s recent big data book, in which they argued for dispensing “with a reliance on causation in favor of correlation” (see my discussion of the book and this argument).

In “Steamrolled by Big Data,” The New Yorker’s Gary Marcus declares that “Big Data isn’t nearly the boundless miracle that many people seem to think it is.” He concedes that “Big Data can be especially helpful in systems that are consistent over time, with straightforward and well-characterized properties, little unpredictable variation, and relatively little underlying complexity.” But Marcus warns that “not every problem fits those criteria; unpredictability, complexity, and abrupt shifts over time can lead even the largest data astray. Big Data is a powerful tool for inferring correlations, not a magic wand for inferring causality.” Calling for “a sensitivity to when humans should and should not remain in the loop,” Marcus quotes Alexei Efros, “one of the leaders in applying Big Data to machine vision,” who described big data as “a fickle, coy mistress.”

Matti Keltanen at The Guardian agrees, explaining “Why ‘lean data’ beats big data.” Writes Keltanen: “…the lightest, simplest way to achieve your data analysis goals is the best one…The dirty secret of big data is that no algorithm can tell you what’s significant, or what it means. Data then becomes another problem for you to solve. A lean data approach suggests starting with questions relevant to your business and finding ways to answer them through data, rather than sifting through countless data sets. Furthermore, purely algorithmic extraction of rules from data is prone to creating spurious connections, such as false correlations… today’s big data hype seems more concerned with indiscriminate hoarding than helping businesses make the right decisions.”
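Keltanen’s warning about “purely algorithmic extraction of rules from data” being “prone to creating spurious connections” is easy to demonstrate. The sketch below (an illustration of the general statistical point, not code from any of the cited articles) generates a hundred completely unrelated random series and then goes fishing for correlations among them; with thousands of pairs to compare, some strong-looking correlation always turns up by chance alone.

```python
import random
import statistics
from itertools import combinations

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(42)
n_series, n_points = 100, 20  # 100 unrelated "metrics", 20 observations each
series = [[random.gauss(0, 1) for _ in range(n_points)]
          for _ in range(n_series)]

# Sift through all 4,950 pairs looking for the strongest correlation.
strongest = max(abs(pearson(a, b)) for a, b in combinations(series, 2))
print(f"Strongest |r| among {n_series} independent series: {strongest:.2f}")
```

Every series here is pure noise, yet the best pair typically correlates well above 0.5 — exactly the kind of “finding” that indiscriminate sifting produces, and that no algorithm can flag as meaningless on its own.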

In “Data Skepticism,” O’Reilly Radar’s Mike Loukides adds this gem to the discussion: “The idea that there are limitations to data, even very big data, doesn’t contradict Google’s mantra that more data is better than smarter algorithms; it does mean that even when you have unlimited data, you have to be very careful about the conclusions you draw from that data. It is in conflict with the all-too-common idea that, if you have lots and lots of data, correlation is as good as causation.”

Isn’t more-data-is-better the same as correlation-is-as-good-as-causation? Or, in the words of Chris Anderson, “with enough data, the numbers speak for themselves.”

That’s much more than a mantra. It’s the big data religion, its core mystical experience: The data speak (how prescient was Larry Ellison when he re-named his company in 1982).

“Can numbers actually speak for themselves?” non-believer Kate Crawford asks in “The Hidden Biases in Big Data” on the Harvard Business Review blog, and answers: “Sadly, they can’t. Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves. We get a much richer sense of the world when we ask people the why and the how, not just the ‘how many.’”

An NPR blogger notes that “while Big Data can uncover correlations between data, it doesn’t reveal causation. Sometimes, that doesn’t really matter, but other times, it might — in ways we’re not always aware of.” He (or she) also quotes The New York Times’ Steve Lohr, who quotes Albert Einstein: “Not everything that counts can be counted, and not everything that can be counted counts.”

Speaking of Einstein (“imagination is more important than knowledge”), E. O. Wilson in The Wall Street Journal takes the discussion to a whole new level. While he doesn’t specifically mention big data, Wilson (in “great scientists don’t need math”) makes an important distinction between using only mathematics and using one’s imagination or intuition: “I have a professional secret to share: Many of the most successful scientists in the world today are mathematically no more than semiliterate… Fortunately, exceptional mathematical fluency is required in only a few disciplines, such as particle physics, astrophysics and information theory. Far more important throughout the rest of science is the ability to form concepts, during which the researcher conjures images and processes by intuition… The annals of theoretical biology are clogged with mathematical models that either can be safely ignored or, when tested, fail. Possibly no more than 10% have any lasting value. Only those linked solidly to knowledge of real living systems have much chance of being used.”

And David Brooks in The New York Times, while probing the limits of “the big data revolution,” takes the discussion to yet another level: “One limit is that correlations are actually not all that clear. A zillion things can correlate with each other, depending on how you structure the data and what you compare. To discern meaningful correlations from meaningless ones, you often have to rely on some causal hypothesis about what is leading to what. You wind up back in the land of human theorizing… Most of the advocates understand data is a tool, not a worldview. My worries mostly concentrate on the cultural impact of the big data vogue. If you adopt a mind-set that replaces the narrative with the empirical, you have problems thinking about personal responsibility and morality, which are based on causation. You wind up with a demoralized society.”

I don’t think that the big data mind-set replaces “the narrative” with the empirical. It replaces it with numbers and correlations. There is nothing wrong with a scientific mind-set, based on empirical observations, as long as people don’t mistake number-crunching for scientific inquiry or see cause-and-effect in correlations.

Kaiser Fung concludes in his summary of the recent Reinhart-Rogoff kerfuffle (“Occupational hazards in data science”) that the problem of seeing (or implying) causation in correlations is found not just in economics but also in medical research and other fields using observational data: “The usual ploy is first acknowledge that the data could not prove causality (‘we found an association between sleeping less and snoring; our data does not allow us to prove causation.’), then quietly assume that the causal link is there, and wax on the implications (‘if you want to snore less, sleep less.’)” Or, as The Atlantic’s Matthew O’Brien puts it: “R-R whisper ‘correlation’ to other economists, but say ‘causation’ to everyone else.”
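The mechanism behind Fung’s sleep-and-snoring example is usually a hidden common cause. The toy simulation below (the variables and effect sizes are invented for illustration, not drawn from any study) builds a world where an unobserved confounder drives both outcomes, while the outcomes never influence each other — yet observational data shows a clear correlation between them.

```python
import random

random.seed(7)

def simulate(n=10_000):
    """Generate (sleep, snoring) pairs driven by a hidden confounder."""
    rows = []
    for _ in range(n):
        stress = random.gauss(0, 1)                 # unobserved common cause
        sleep = -0.8 * stress + random.gauss(0, 1)  # stress reduces sleep
        snore = 0.8 * stress + random.gauss(0, 1)   # stress increases snoring
        rows.append((sleep, snore))
    return rows

def pearson(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sum((x - mx) ** 2 for x, _ in pairs) ** 0.5
    sy = sum((y - my) ** 2 for _, y in pairs) ** 0.5
    return cov / (sx * sy)

print(f"Observed correlation: {pearson(simulate()):.2f}")
```

In this model, intervening on sleep would do nothing to snoring (and vice versa), so the advice “sleep less to snore less” is exactly the unwarranted causal leap Fung describes — the correlation is real, but it belongs to the confounder.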

Whether you use small or big data, your imagination (developing theories) and integrity (following the scientific method) are what count. Correlations can count, too, in certain situations. Just don’t expect them to explain anything.

[Originally published on]