Big Data Debates: Individuals Vs. Teams

Gregory Piatetsky recently ran a poll on his popular KDnuggets website where he asked his readers to vote for the preferred way to build data science capabilities in their organizations. The poll was prompted by the strong reaction to a post by Michael Mout in which he advised employers not to advertise for “data scientists” but rather to hire computer scientists, statisticians, and database administrators and combine them into a data science team.

Piatetsky’s respondents were equally split between those favoring “Seek and train versatile Data Scientists that have all (or most) of the needed skills” and those opting for “Build a data science team where each member mainly focuses on one skill.” Some of the respondents favoring “individuals” were thinking about the limited resources of smaller companies. A more substantive argument for individuals was the fear that specialists are not “well-rounded” professionals.

Being well-rounded is a great attribute in any profession, but it could be a stretch when it comes to the multiplicity of skills needed in the data science process, especially if you consider skills beyond the ones enumerated by Mout. In the HBR Blog, Brad Brown and Brian Henstorf recommend putting together a data science team with “Technical and Data Specialists,” “Analysts and Data Scientists,” and “Business Analytics and Solutions Specialists.” Similarly, Dawen Peng lists the following skills on the Think Operations Research blog: business consulting, analysis and modeling, communications and visualization, data engineering, and programming.

The debate about data scientists as individuals with multiple skills and practical experiences vs. multi-disciplinary teams is very much related to the perceived dearth of data scientists. I say “perceived,” because it is impossible (even for data scientists) to estimate the supply and demand of a new, ill-defined profession.

The 2011 McKinsey Global Institute report on big data, frequently cited as the authoritative source for the not-enough-data-scientists claim, actually talks about the United States facing “a shortage of 140,000 to 190,000 people with deep analytical skills.” Even if you assume that “deep analytical skills” (defined by McKinsey as “people with advanced training in statistics and/or machine learning”) is an adequate definition of a data scientist (a term McKinsey did not use), you should read the full report and the relevant appendix to understand McKinsey’s definitions, assumptions (e.g. “companies across the economy fully adopt big data techniques by 2018”), and estimates employed to get to such a seemingly precise assessment of the supply and demand imbalance.

McKinsey should be lauded for being the first to publish a comprehensive (and publicly available) assessment of the big data term, for taking a stab at estimating important future implications of this term, and most important, for telling us a lot (not all) about how it arrived at the numbers. But no good deed goes underutilized and the shortage of data scientists, now and in the future, is now an undisputed fact.

Let me be clear that I’m simply pointing out that hype begets “facts” based on misquoted research reports (it must be admitted that McKinsey’s executive summary could have been a bit more detailed and less sound bite-prone, but I guess “executive” stands for “buyer beware”). I have no doubt that demand for people with “deep analytical skills” has been around for quite some time in our data-saturated world, as has been demonstrated by the job placement success of master’s in business analytics programs such as North Carolina State’s. This particular program, where all 84 graduates in 2012 had job offers, was established in 2007, long before we were told about the dearth of data scientists in the age of big data. Whatever educational programs “people with deep analytical skills” graduated from in the past, whether business analytics, operations research, statistics, or any other discipline (or work experience) that trained them in data mining skills, they did not go unemployed for long, especially if they had practical experience in the use of data analysis tools (SAS and SPSS were developed in the 1960s).

The use of “data scientists,” the new designation for “people with deep analytical skills,” along with the associated “facts” about supply and demand imbalances, is driven by new educational programs (those replacing “business analytics” with “data science”) and by vendors selling new tools to manage and mine the ever-increasing pile of data at our disposal. The vendors, especially those that focus on streamlining or automating the data analysis process, have an understandable interest in diverting the debate about data scientists from individuals vs. teams to the conclusion that you can’t have either because data scientists are nowhere to be found and you’d better replace them with tools that automate some or all of their work.

It’s very refreshing then to read what Joe Hellerstein, co-founder of Trifacta, a company that is trying to automate the cleaning of data, the most arduous and boring part of the data analysis process, has to say about data science: “anyone promising to automate away the need for people in data analysis is engaging in pointless hubris. Data analysis is a process that fundamentally revolves around people, not just technology: people who can understand the links between business problems and relevant data, forming hypotheses and interpreting the resulting numbers. At bottom, data science—like all science—is a creative human activity. Take away the science, and all you have is data.”

Hellerstein believes that the solution to the dearth of what he calls “triply-skilled data scientists” is “for technology to work synergistically with human analysts to make data science a more productive and broadly accessible job.” He agrees that combining people with different skills into a team is a “sensible approach,” but dismisses it as “too expensive” and “infeasible for many organizations in a market in which skilled data people of all stripes are rare.” While defending the need for humans and their ingenuity, he argues for using tools to automate and simplify at least some of the work of the data scientist.

Michael Cavaretta, data scientist at Ford, begs to differ: “While there have been a lot of vendors who say, ‘You don’t need a data scientist, just use our software,’… it’s gonna be a while before we get to that stage where it’s really taken over by the software itself.” Instead, he relies on the team approach: “You don’t have to look for these unicorns, these people that are incredibly difficult to find and you have to pay them incredible amounts of money. The idea that you can build a team that has all these components in there is something that has been really exciting to me, because we’ve been able to go out and kind of be strategic about some of the people that we bring in, but then also look internal to our organization and supplement that team with some internal resources that have really worked out well.”

Defining the role of the data scientist at LinkedIn about five years ago, DJ Patil also opted for the team approach but on a larger scale, including in the team people with skills other than programming and statistics: “It’s important that our data team wasn’t comprised solely of mathematicians and other ‘data people.’ It’s a fully integrated product group that includes people working in design, web development, engineering, product marketing, and operations. They all understand and work with data, and I consider them all data scientists. We intentionally kept the distinction between different roles in the group blurry.”

In my opinion, this is exactly the strength of teams and the reason to go with the team approach even if we had an ample supply of data scientists, or “people with deep analytical skills,” or whatever we call them. A team comprised of members with different skills and experiences is likely to make better decisions than either individuals on their own or teams populated with people with the same type of skills and experiences.

This is not only because different perspectives based on different experiences and expertise contribute to a better understanding of any situation and the potential consequences of actions. The different points of view also help, in well-managed teams where debate is encouraged, to minimize or cancel out the personal biases of team members.

Which brings me back to McAfee’s Law, which I discussed in the previous post: “As the amount of data goes up, the importance of human judgment should go down.” Maybe it should be revised to say: As the amount of data goes up, the human judgment of well-managed, highly interdisciplinary teams is more important than ever.

[Originally published on]