Machines vs. Models, Noise vs. Signal

An excerpt from Nassim Taleb’s forthcoming book, Antifragile, was posted yesterday on the Farnam Street blog. In “Noise and Signal,” Taleb says that “In business and economic decision-making, data causes severe side effects—data is now plentiful thanks to connectivity; and the share of spuriousness in the data increases as one gets more immersed into it. A not well discussed property of data: it is toxic in large quantities—even in moderate quantities…. the best way… to mitigate interventionism is to ration the supply of information, as naturalistically as possible. This is hard to accept in the age of the internet. It has been very hard for me to explain that the more data you get, the less you know what’s going on, and the more iatrogenics you will cause.”

In other words, big data equals big trouble. Taleb is right to warn of the dangers of blindly falling in love with data, and we are all familiar with the dangers of data-driven misdiagnosis and intervention, not just in healthcare but in policy making, education, and business decisions.

But making a sweeping statement that more data is always bad is also dangerous. Is intuition (i.e., no data) always better? Is a small amount of data always better than lots of data? Does noise rise proportionally with the volume of data?

Many data scientists today would answer with a resounding “no.” More data is always better, they argue. Their major reference point is the 2009 paper by Alon Halevy, Peter Norvig, and Fernando Pereira, “The Unreasonable Effectiveness of Data.” It describes how Google’s “trillion-word corpus with frequency counts for all sequences up to five words long” can serve as “the basis of a complete model for certain tasks—if only we knew how to extract the model from the data.” The paper demonstrates the usefulness of this corpus for language processing applications and argues for the superiority of this wealth of data over preconceived ontologies.
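To make the data-as-a-model idea concrete, here is a minimal Python sketch of the kind of lookup such a corpus enables: choosing between near-synonyms by comparing raw phrase counts rather than consulting a linguistic theory. The counts and the pick_adjective helper are invented for illustration, not Google’s actual figures or interface.

```python
# A minimal sketch of "data as model": with phrase counts from a large
# corpus, we can answer a usage question (which adjective collocates with
# a given noun?) by lookup alone, with no hand-built linguistic rules.
# These counts are invented for illustration.
bigram_counts = {
    ("strong", "tea"): 20_000,
    ("powerful", "tea"): 300,
    ("strong", "computer"): 1_200,
    ("powerful", "computer"): 25_000,
}

def pick_adjective(noun, candidates=("strong", "powerful")):
    """Return the candidate adjective the corpus pairs most often with noun."""
    return max(candidates, key=lambda adj: bigram_counts.get((adj, noun), 0))

print(pick_adjective("tea"))       # -> strong
print(pick_adjective("computer"))  # -> powerful
```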

Last October (and on other occasions), Norvig followed up with a presentation on the effectiveness of data, conveniently available on YouTube. The presentation, however, goes way beyond the paper’s modest argument for the superiority of data-as-a-model in performing “certain tasks.” It makes a much bigger claim by contrasting the data-driven approach both with the traditional scientific method (exemplified in Norvig’s presentation by Newton) of developing models and then testing them against data, and with the expert systems approach (exemplified by Edward Feigenbaum) of extracting models from experts and coding them into computer programs. Norvig argues for the statistical machine learning approach (exemplified by this year’s Turing Award winner, Judea Pearl), “where we say we don’t need an expert to tell us the theory; maybe we can gather enough data and run statistics over that and that will tell us the answer without having to have an expert involved.”

In his presentation, Norvig provides examples of the superiority of data in areas such as photo editing, reformatting video images, word sense disambiguation, and word segmentation. Another example is how Google Flu Trends “predicts the present” by beating the Centers for Disease Control by three weeks in identifying flu epidemics.
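Word segmentation shows the idea end-to-end. Below is a rough Python sketch of the standard unigram approach, a simplified version of the technique Norvig demonstrates rather than his actual code: score every possible split of an unspaced string by the product of its words’ corpus probabilities and keep the best. The tiny COUNTS table is invented for illustration; the real system draws on the trillion-word corpus.

```python
from functools import lru_cache
from math import log

# Invented unigram counts standing in for the trillion-word corpus.
COUNTS = {"choose": 500, "spain": 300, "chooses": 200, "pain": 400}
TOTAL = sum(COUNTS.values())

def logprob(word):
    """Log-probability of a word; unseen strings get a length-based penalty."""
    if word in COUNTS:
        return log(COUNTS[word] / TOTAL)
    return log(1.0 / TOTAL) - len(word) * log(10)

@lru_cache(maxsize=None)
def segment(text):
    """Return (log_score, words): the most probable split of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment(rest)
        candidates.append((logprob(first) + rest_score, (first,) + rest_words))
    return max(candidates)

print(segment("choosespain")[1])  # -> ('choose', 'spain'), not ('chooses', 'pain')
```

Note that no grammatical theory appears anywhere in the sketch: the “model” is nothing more than counts plus a scoring rule, which is exactly the point the paper makes about extracting the model from the data.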

There is no question in my mind as to the utility of this approach for “certain tasks.” But I do have one question: what do image processing, text analysis, or being speedier than the federal government have to do with Newton? How can these be presented as a “better” approach than the scientific method?

Taleb should think about whether it is the quantity of data that matters, or whether the root of the problem lies in how we approach the data available to us, large or small. Norvig should think about how far science has advanced by scientists first proposing models, and how little it may advance if we only expect models to rise from the data.