On Data Janitors, Engineers, and Statistics

A Data Janitor

Big Data Borat tweeted recently that “Data Science is 99% preparation, 1% misinterpretation.” Commenting on the 99% part, Cloudera’s Josh Wills says: “I’m a data janitor. That’s the sexiest job of the 21st century. It’s very flattering, but it’s also a little baffling.” Kaggle, the data-science-as-sport startup, takes care of the “1% misinterpretation” part by providing a matchmaking service between the sexiest of the sexy data janitors and the organizations requiring their hard-to-find skills. It charges $300 per hour for the service, of which $200 goes to the data janitor (at least in the case of Shashi Godbole, quoted in a Technology Review article). Kaggle justifies its mark-up by delivering “the best 0.5% of the 95,988 data scientists who compete in data mining competitions,” that is, the top of its data science league table, which ranks data scientists by their performance in Kaggle’s competitions and presumably reflects sound interpretation and top-notch productivity.

Kaggle’s co-founder Anthony Goldbloom tells The Atlantic’s Thomas Goetz that the ranking also represents a solution to a “market failure” in assessing the skills and relevant experience of the new breed of data scientists: “Kaggle represents a new sort of labor market, one where skills have been bifurcated from credentials.” Others see this as the creation of a new, $300 per hour, guild. In “Data Scientists Don’t Scale,” ZDnet’s Andrew Brust says that “’Data scientist’ is a title designed to be exclusive, standoffish and protective of a lucrative guild… The solution… isn’t legions of new data scientists. Instead, we need self-service tools that empower smart and tenacious business people to perform Big Data analysis themselves.”

The self-service tools are not here yet (and it’s debatable whether any tool could fully replace a data scientist), so executives at Opera Solutions offer free advice on how to hire and keep data scientists, based on their experience recruiting “more than 230 leading machine-learning scientists.” In short, handle with care: “Scientists prefer to work in a community of like-minded individuals with whom they can collaborate, learn from and feel comfortable with. In addition, they want to know that they’re a valuable and important group within the organization. And companies that already have a critical mass of scientists find it easier to attract more.” If you want to join the community of like-minded individuals and “make the leap from [a] guy who knows data to data scientist,” Derrick Harris at GigaOM distills advice from Hortonworks, Netflix, and Orbitz. Harris reports that Chris Pouliot, director of algorithms and analytics at Netflix, follows the Goldbloom Bifurcation Thesis: “One thing Pouliot warned about is an over-reliance on what’s on your résumé. Right off the bat, for example, he’ll test the heck out the skills or knowledge that someone claims to ensure they really know it. Having a Stanford degree and work experience at Google don’t necessarily make someone a shoo-in, either.”

For those outside the data science-heavy orbit, with or without a Stanford degree, a comprehensive guide to understanding and working with data scientists is provided by Tom Davenport and Jinho Kim in their new book Keeping Up with the Quants: Your Guide to Understanding and Using Analytics. They write: “All organizations in all industries will need to make sense of the onslaught of data. They’ll need people who can do the detailed analysis of it—these people go by different names, but are quants, and this book is not meant for them. And they’ll need people who can make good decisions and take actions based on these results—this is who we are writing to, the non-analysts, nonquantitative people in organizations who have to work with and make decisions based on quantitative data and analysis.” In other words, the people who are not going to wait for “self-service tools” and are not going to let data scientists get away with “trust me.”

Blindly trusting the “quants” on Wall Street was revealed five years ago to be a really bad idea, and as a result the steady flow of quants to Wall Street changed course. Chris Wiggins, professor of applied mathematics at Columbia, describes in the IEEE Spectrum podcast “Is Data Science Your Next Career?” how “Data Science” replaced “Wall Street” as the two words for graduates with quantitative analysis skills: “…that economy from 2001 to 2008, that sort of rapidly and consistently growing sort of Madoff-like economy… was a time when many of the students that I taught went down to Wall Street after they graduated from Columbia, and that was sort of the dominant narrative about what you do with quantitative training in New York City at that time. Things have changed dramatically in the last five years, and they’re changing faster and faster all the time… People started thinking more seriously about their opportunities post-Columbia, and more and more of those people every year started going into start-ups.” There were fewer job opportunities on Wall Street, and it was no longer cool to be a quant.

But five years is a long, long time in our fast-paced world and an eternity on Wall Street. So the quants are back, only now they are called data scientists. In Advanced Trading, “the community for innovative trading,” Ivy Schmerken writes in “Welcome to the New Quants: Data Scientists”: “Buy side firms are constantly looking for new ways to analyze data that can yield alpha, so don’t be surprised if the role of data scientist emerges at hedge funds and traditional asset managers.” There are two important differences between the old and new breeds of data analysts, according to the article. Quants develop models and use them to predict the future, but data scientists mine data in the hunt for “interesting patterns.” And quants hand over their research to a software developer who writes code so it can be used in trading; data scientists perform both functions, eliminating a dangerous area “where bugs are introduced between the original idea and the implementation.” Are these differences going to make a difference?

Not if the education of the new quants is only about number crunching and the “science” of data. On this subject, it’s always a pleasure to quote Douglas Merrill at length:

Given enough data, everything is statistically significant. Few people are taught to examine not only the significance but also the size of the coefficient, which tells you how big an impact that variable has overall. Simple statistics fails us in the big data context. If one blindly follows statistics 101, the wrong answers follow. “Scientist” isn’t enough, you need some amount of intuition to separate the significant from the important. This is where having the right skills comes in: You need to understand the math (or algorithms, or psychology), that’s the barrier to entry. But you also need to have the broader knowledge to enable you to separate significant from important… Don’t look for people who speak data science. Look for people who love data, and have their own—unique—way of looking at it.
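
Merrill’s distinction between significant and important can be illustrated with a small sketch. The Python below is not from Merrill. It assumes a hypothetical difference of 0.01 standard deviations between two equally sized groups and a simple two-sample z-test with a known standard deviation. The function names and the numbers are illustrative assumptions only.

```python
import math


def two_sample_z_test(mean_diff, sd, n_per_group):
    """Return (z, two-sided p-value) for a difference between two group means.

    Assumes both groups share the same known standard deviation `sd` and the
    same size `n_per_group` (a simplification made for illustration).
    """
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # Two-sided tail probability of a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def effect_size(mean_diff, sd):
    """Standardized effect size: the difference measured in standard deviations."""
    return mean_diff / sd


# Hypothetical numbers: a difference of 0.01 on a scale whose SD is 1.0.
MEAN_DIFF = 0.01
SD = 1.0


def summarize(sample_sizes=(100, 10_000, 1_000_000, 10_000_000)):
    """Compute effect size, z, and p-value for each per-group sample size."""
    rows = []
    for n in sample_sizes:
        z, p = two_sample_z_test(MEAN_DIFF, SD, n)
        rows.append((n, effect_size(MEAN_DIFF, SD), z, p))
    return rows


if __name__ == "__main__":
    for n, d, z, p in summarize():
        print(f"n={n:>10,}  effect size={d:.2f}  z={z:6.2f}  p={p:.3g}")
```

Under these assumptions the effect size stays at 0.01 for every sample size, while the p-value falls from well above 0.05 at 100 observations per group to far below it at ten million. The difference becomes statistically significant without becoming any more important.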

While Merrill insists on adding “art” to “science,” Mike Driscoll emphasizes the importance of engineering. In “Let’s praise data engineers,” Driscoll writes:

…if data is the new oil, and data scientists are its petrochemical high priests, who are the oil riggers?  Who are the roughnecks doing the dirty work to get data pipelines flowing, unpacking bytes, transforming formats, loading databases? A stark but recurring reality in the business world is this: when it comes to working with data, statistics and mathematics are rarely the rate-limiting elements in moving the needle of value.  Most firms’ unwashed masses of data sit far lower on Maslow’s hierarchy, at the level of basic nurture and shelter.  What is needed for this data isn’t philosophy, religion, or science— what’s needed is basic, scalable infrastructure. It’s the data engineers who can build this infrastructure, and they represent the true talent shortage of Silicon Valley and beyond.  Their unsexy but critical skills include crafting Hadoop pipelines, programming of job schedulers, and parsing broad classes of data— timestamps, currencies, lat & long coordinates—which are the screws, bolts, and ball bearings in the industrial age of data.

For his part, Larry Wasserman worries about statistics being left out. In “Data Science: The End of Statistics?” he writes:

The very fact that people can talk about data science without even realizing there is a field already devoted to the analysis of data — a field called statistics — is alarming. I like what Karl Broman says:

When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math.

If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.

Well put.

Maybe I am just pessimistic and am just imagining that statistics is getting left out. Perhaps, but I don’t think so. It’s my impression that the attention and resources are going mainly to Computer Science. Not that I have anything against CS of course, but it is a tragedy if Statistics gets left out of this data revolution.

Finally, in Foreign Policy, Kate Crawford advocates adding to data science the expertise of social scientists and combining “big data approaches with small data studies.” Crawford neatly summarizes why the debate over what exactly this new discipline and profession of data science is, and the need for all of us to understand what data scientists do, are both significant and important:

Given the immense amount of information collected about us every day—including Facebook clicks, GPS data, health-care prescriptions, and Netflix queues—we must decide sooner rather than later whom we can trust with that information, and for what purpose. We can’t escape the fact that data is never neutral and that it’s difficult to anonymize. But we can draw on expertise across different fields in order to better recognize biases, gaps, and assumptions, and to rise to the new challenges to privacy and fairness.

[Originally published on Forbes.com]