The Unreasonable Effectiveness of Small Data

“If this program is not effective it has to end. So far, I’m not convinced by what I’ve seen,” said Senator Patrick Leahy in the Senate Judiciary Committee hearing on July 31, according to the New York Times. In June, I wrote this about the NSA controversy:

Most of the discussion around the revelations about the data collection activities of the NSA has been about the threat to our civil rights and the potential damage abroad to U.S. political and business interests. Relatively little has been said, however, about the wisdom of collecting all phone call records and lots of other data in the fight against terrorism or other threats to the United States.

Faith in the power (especially the predictive power) of more data is of course a central tenet of the religion of big data and it looks like the NSA has been a willing convert. But not everybody agrees it’s the most effective course of action. For example, business analytics expert Meta Brown: “The unspoken assumption here is that possessing massive quantities of data guarantees that the government will be able to find criminals, and find them quickly, by tracing their electronic tracks. That assumption is unrealistic. Massive quantities of data add cost and complexity to every kind of analysis, often with no meaningful improvement in the results. Indeed, data quality problems and slow data processing are almost certain to arise, actually hindering the work of data analysts. It is far more productive to invest resources into thoughtful analysis of modest quantities of good quality, relevant data.”

While not commenting specifically on the work of the NSA, Kate Crawford, principal researcher at Microsoft Research and a visiting professor at the MIT Center for Civic Media, adds that “Small data, big data, and combinations of the two can be appropriate depending on the type of question being asked. It’s about applying the right tools to the right question, and being explicit in your results about the limitations of each approach.”

No small data for the NSA, which has always preferred to know it all: its earliest predecessor was already tapping into Western Union’s network in the 1920s. Earlier this week, James Bamford summed up his three decades of NSA watching by telling an “old National Security Agency joke: In God we trust. All others we monitor.” He also warned us about the future: “In a dusty corner of Utah, NSA is now completing construction of a mammoth new building, a one-million-square-foot data warehouse for storing the billions of communications it is intercepting. If the century-old custom of secret back-room deals between NSA and the telecoms is permitted to continue, all of us may digitally end up there.”

Bamford broke the story of the $2 billion construction of this Yottabyte datacenter in Wired last year. He quoted former NSA employee William Binney, who proposed a way for NSA’s data collection efforts “to distinguish between things you want and things you don’t want.” But the proposal was rejected, and Bamford concluded that the NSA stores everything it gathers and gathers as much as it can.

In his Wired story, Bamford pointed to a breakthrough in code-breaking and the building of a more powerful supercomputer as the prime motivations behind the sweeping up of more and more data. And there is more to come: the reason for storing all the data is “What can’t be broken [as in code-breaking] today may be broken tomorrow.” But Bamford may have missed the rise of the big data and machine learning experts at the NSA and the replacement of supercomputers with “commodity” servers and storage devices for the cost-efficient processing of very large sets of data, using software that was first developed by Google, then enhanced and open-sourced by other Web-native companies (and that the NSA further developed and even gave back as the open-source Accumulo; see GigaOm and Wired). The availability of new hardware, software, and people well versed in the new ways of big data answered the new post-9/11 needs and probably drove a shift in focus from deciphering encrypted data to finding non-encrypted “digital crumbs” left by and pointing to potential terrorists. “If you’re looking for a needle in the haystack, you need a haystack,” Jeremy Bash, chief of staff to Leon E. Panetta, the former C.I.A. director and defense secretary, told MSNBC.

That building a giant haystack is the way to go, and that you don’t even need to know what needle you are looking for because it will simply “emerge” from the data, is certainly what the NSA learned from big data advocates. “Now go out and gather some data, and see what it can do,” three Google researchers recommended in their influential 2009 paper, “The Unreasonable Effectiveness of Data” (PDF). That the paper dealt with a very specific domain (language processing) and argued only for the superiority of Google’s trillion-word corpus over pre-conceived ontologies did not deter big data advocates from claiming the superiority of “data-as-a-model” (i.e., don’t use models, let the data speak) in all other domains, even claiming it is transforming science (forget about making hypotheses). The broad impact of these claims was evident last week when a Wall Street Journal editorial defending the NSA declared “The effectiveness of data-mining is proportional to the size of the sample, so the NSA must sweep broadly to learn what is normal and refine the deviations.” Size matters, end of story.

The Wall Street Journal also reported that the NSA has tried, failed, and tried again to follow this “more data is better” philosophy until it saw success in 2010 with a program for detecting the location of IEDs in Afghanistan. “Analysts discovered that the system’s analysis improved when more information was added,” we are told. Whatever the magnitude of the improvement was, it could not, in my opinion, have justified this reaction from a former U.S. counterterrorism official, as reported by the Journal: “It’s the ultimate correlation tool… It is literally being able to predict the future.” But if you want to believe that some success in a specific, narrow task indicates you can predict the future everywhere else, you proceed to collect all the data you can collect because you assume eventually it will tell you whatever you want to know and even what you don’t know that you don’t know.

The New York Times mentioned another strand of influence on the NSA in the early 2000s: “When American analysts hunting terrorists sought new ways to comb through the troves of [data]… they turned to Silicon Valley computer experts who had developed complex equations to thwart Russian mobsters intent on credit card fraud.” Rachel Schutt, Senior Research Scientist at Johnson Research Labs, brought up this venerable and fairly successful example of data mining when I asked her (via email) about the NSA: “If they are building something like the equivalent of a fraud detection system for a credit card company, or some sort of suspicious activity detection system, then that needs to be running on all data streaming into the system. If they didn’t let all calls go through the fraud detection system, then they’ll miss fraud. This would be like a credit card company not saving all transactions or observing all transactions.”

Schutt also explained why the NSA task of identifying specific individuals is different from the population-level work of traditional statistics: “Our understanding of statistical modeling is different when it comes to user-level data. It used to be we thought in terms of sampling in order to make inferences about the entire population. But with user-level data, we want to know about every individual. For a specific individual, we might want to sample from their phone calls if we discover we don’t need to keep it all (though how can you be sure?). It could be we only take snapshots or aggregates for that individual over time and that is sufficient to know they are not a terror threat with some level of confidence.”
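
A toy sketch may make Schutt’s distinction concrete. Everything in it is an assumption made up for illustration (the 100,000 users, their daily call counts, the sample size of 1,000); it says nothing about how the NSA or any real system works. It only shows that a small random sample can answer a population-level question quite well while saying nothing about most individuals.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: daily call counts for 100,000 made-up users.
population = {user_id: random.randint(0, 40) for user_id in range(100_000)}

# Population-level question: what is the average number of calls per day?
# A random sample of 1,000 users is enough for a close estimate.
sample_ids = random.sample(range(100_000), 1_000)
sample_mean = statistics.mean(population[u] for u in sample_ids)
true_mean = statistics.mean(population.values())

# Individual-level question: what did one particular user do?
# The sample only covers the users who happened to be drawn.
covered = len(set(sample_ids)) / len(population)

print(f"true mean:   {true_mean:.2f}")
print(f"sample mean: {sample_mean:.2f}")
print(f"share of individuals the sample can say anything about: {covered:.0%}")
```

The sample estimate lands close to the true average, yet 99 percent of the individuals are simply absent from it. That is the sense in which user-level questions push toward keeping records on everyone, and why Schutt’s caveat about snapshots and aggregates per individual matters.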

I tend to be in the same camp as the advocates of “The Unreasonable Effectiveness of Small Data” (you heard it here first) but I’m ready to be convinced that there may be certain situations where the increased investment in collecting (and cleaning) the data may result in better analysis. There is a difference, however, between businesses falling for exaggerated or misleading claims about big data (after all, it’s—for the most part—their own data and dime) and the government doing the same. Especially when what the government is doing is shrouded in secrecy and can result in deadly consequences.

Forbes’ Bruce Upbin reported on two TED talks, one describing drones as a new mode of transportation making overcrowded cities more livable, the other warning about drones that “will someday soon (unless we do something about it) make lethal decisions beyond our accountability.” Similarly, big data (or making decisions based on data analysis) can be either a force for good or a dangerous development. To make sure it is not the latter, especially when our government is concerned, we need to examine and re-examine some of the claims about what it can do and, specifically, whether more data is always better.

U.S. Director of National Intelligence James Clapper points to a couple of cases (CNN counts half a dozen) in which intercepted communications were critical in terror investigations. No need to share with the public the specifics–a more honest and transparent account would be to tell us periodically the number of terror threats foiled specifically because of the NSA’s vast data collection (we usually find out when the system fails, as in the recent Boston Marathon bombing). President Obama said that “one of the things that we’re going to have to discuss and debate is how… we [are] striking this balance between the need to keep the American people safe and our concerns about privacy, because there are some trade-offs involved.” I believe we need first to debate the incremental benefit, if any, of collecting more data and the relative merits of the NSA’s big data programs vs. other ways of finding needles in a haystack and going after them.

[Originally published on Forbes.com]

Last updated on August 7th, 2013.