The Data Scientist Will Be Replaced By Tools

We have only just started to use the term “data scientist,” and the demise of this new profession is already being predicted? Well, at least it’s not one more “rise of the machines” prophecy; it’s the provocative title of a proposed panel for the upcoming SXSW.

The organizer of the panel, Scott Hendrickson of Gnip, has provided a useful run-down of some of the arguments for and against the possible disappearance of data scientists. Supporting the proposition are the current scarcity of data science talent and a slew of startups providing “data science as a service.” As an example of the opposition to the “democratization of algorithms,” Hendrickson quotes Cathy (Mathbabe) O’Neil, who wrote recently that “if your model fails, you want to be able to figure out why it failed. The only way to do that is to know how it works to begin with. Even if it worked in a given situation, when you train on slightly different data you might run into something that throws it for a loop, and you’d better be able to figure out what that is.” In other words, machines will never have the deep understanding of the tools of data science that is required to practice data science.

One of the proposed panelists for the proposed panel, John Myles White of Princeton University (the other panelists are James Dixon of Pentaho and Yael Garten of LinkedIn), has already given us a sneak preview of his take on the matter: “The most plausible answer to the question seems to be: ‘data scientists will have portions of their job automated, but their work will be much less automated than one might hope. Although we might hope to replace knowledge workers with algorithms, this will not happen as soon as some would like to claim’… While we can—and will—develop better tools for data analysis in the coming years, we will not do nearly as much as we hope to obviate the need for sound judgment, domain expertise and hard work.”

So here we go again: human judgment vs. automation. This time the debate is in the context of something called “data science.” It may be useful for the panel to ponder the term itself: what is it that data scientists do, science or engineering?

If it’s science, they need to take into consideration the arguments of fellow data scientists such as Google’s Peter Norvig, who argues that statistical machine learning is a new scientific paradigm with which “we don’t need an expert to tell us the theory; maybe we can gather enough data and run statistics over that and that will tell us the answer without having to have an expert involved.” (See also Schmidt and Lipson on automating science and the head of the National Science Foundation’s Computer and Information Science and Engineering Directorate on data-driven science.) If science can be automated (and I don’t think it will be), that is also the future of data science.

If it’s engineering, they need to take into consideration what engineering skills and tasks have—and have not—been automated in the past. There are about 4 to 5 million engineers and computer scientists employed today in the US—are they all going to be automated out of existence?

I think data science is a very exciting new profession, rising out of (mostly) computer science and statistics. But, like computer science and statistics, it’s not science, and data scientists are not going to be replaced by tools any more than their fellow engineers will be.