Data Scientists: The Definition of Sexy

I put “sexy” in the title because I’m told that the words in the title make all the difference in getting noticed on the Web. That has certainly proven true for the Harvard Business Review after it included the word “sexiest” in the title of a recent article. It even got the attention, probably for the first time ever, of Geekosystem, a website devoted to geeks:

The Harvard Business Review, a noted authority on “things that are sexy,” has declared “Data Scientist” to be the sexiest career of the 21st century. The article reflects the burgeoning mystique of the new and pocket protector friendly gig, which we have to assume narrowly edged out things like “Chippendales dancer” and “calendar firefighter” on its way to being named the sexiest of all possible careers. Because if there’s one thing that gives a job an indefinable allure, it is everybody else being kind of unsure what it is you really do — a quality that data scientists damn near embody.

Whether or not employers know what data scientists do, they have been using the term “data scientist” in job descriptions in rapidly growing numbers over the past two years, as Indeed.com’s data demonstrates.

So we have the data to show data scientists are hot. But what is sexy about them? To paraphrase a great letter-writer, What then are the data scientists, these new men and women of industry? Are they scientists? Engineers? Programmers? A new breed of business decision-makers and innovators? I think I have an answer to these questions, but to find out my definition of a “data scientist,” stay with me to the end of this post.

What the Indeed.com chart doesn’t show is this new profession’s long period of gestation in academia, where it has never been established as a new academic discipline. And that long period of academic incubation may have a lot to do with the practice of data science today.

Consider the following a brief history of how data (science) became sexy. The term “data science” (together with “Datalogy”) was first suggested by Peter Naur in the late 1960s as a substitute for “computer science,” was used by the International Federation of Classification Societies (its members conduct “research in problems of classification, data analysis, and systems for ordering knowledge”) in the mid-1990s, and was proposed by William S. Cleveland in 2001 as a new academic discipline, extending the field of statistics to incorporate “advances in computing with data.” The Data Science Journal and The Journal of Data Science were launched in 2002 and 2003, respectively, and in 2005, the National Science Board published “Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century,” defining data scientists as “the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection.”

By the mid-2000s, data science had started to move away from the halls of academia and from being defined as a laundry list of other disciplines. An example of moving a half-step away from academia towards establishing a new profession was what Troy Sadkowsky did in 2009. He worked in Australia (like many others around the world) in an academic setting but as a “scientific programmer,” developing applications in support of large-scale, “big data” scientific research. Sadkowsky learned about the term “data scientist” from “Harnessing the Power of Digital Data for Science and Society,” a January 2009 report of the Interagency Working Group on Digital Data, thought it described best what he was doing, and wanted to establish an online community of similar professionals. In June 2009, he created the data scientists group on LinkedIn as a companion to his website, datasceintists.com (which later became datascientists.net).

But the wholesale move from academia to industry had already happened the year before in the U.S., with the development of big data technologies by Web-based companies and their emerging need for quantitative analysts to mine and make sense of all the data they were collecting. Before, the “quants” who did not want to pursue an academic career would go work on Wall Street. But in 2008 this was not such an alluring option anymore. D.J. Patil, today the data scientist-in-residence at Greylock Partners, wrote last year: “Starting in 2008, Jeff Hammerbacher and I sat down to share our experiences building the data and analytics groups at Facebook [Hammerbacher] and LinkedIn [Patil]. In many ways, that meeting was the start of data science as a distinct professional specialization… the focus of our teams was to work on data applications that would have an immediate and massive impact on the business. The term that seemed to fit best was data scientist: those who use both data and science to create something new.”

There is no reason to doubt Patil’s account and it is possibly supported by the small 2008 blip in the Indeed.com chart. But when was the term “data scientist” first used publicly in reference to a non-academic, non-research science-related position? I believe the first documented use was in June 2009 in a blog post titled “Rise of the Data Scientist” by Nathan Yau, a PhD candidate in statistics: “As we’ve all read by now, Google’s chief economist Hal Varian commented in January [2009] that the next sexy job in the next 10 years would be statisticians. Obviously, I whole-heartedly agree. Heck, I’d go a step further and say they’re sexy now – mentally and physically. However, if you went on to read the rest of Varian’s interview, you’d know that by statisticians, he actually meant it as a general title for someone who is able to extract information from large datasets and then present something of use to non-data experts… We’re seeing data scientists – people who can do it all – emerge from the rest of the pack.”

A month earlier, Mike Driscoll, a data scientist and entrepreneur (Dataspora and Metamarkets), wrote in “The Three Sexy Skills of Data Geeks”: “…with the Age of Data upon us, those who can model, munge, and visually communicate data — call us statisticians or data geeks — are a hot commodity.” Other data scientists and observers of the data scene followed with a discussion of the skills required of “data scientists,” e.g., Kenneth Cukier, Mike Loukides, Hilary Mason, Chris Wiggins, and Drew Conway. But even as recently as the March 2012 Strata Conference, where the hot and sexy gather to talk data, the Guardian’s DataBlog could still report on how attendees gave different answers to the question “what is a data scientist?” (My favorite is Monica Rogati’s: “it’s Columbus meet Columbo – starry eyed explorers and skeptical detectives”).

Now to the most recent in-depth discussion of what data scientists actually do and, most important, how they approach their work. In “Data Scientist: The Sexiest Job of the 21st Century,” Tom Davenport and the aforementioned D.J. Patil tell us about Jonathan Goldman, the data scientist who came up with the “people you may know” feature on LinkedIn: “He began forming theories, testing hunches, and finding patterns that allowed him to predict whose networks a given profile would land in.” Based on this and other observations of data scientists, Davenport and Patil generalize about the way they work:

“What data scientists do is make discoveries while swimming in data… [their] dominant trait is intense curiosity—a desire to go beneath the surface of a problem, find the questions at its heart, and distill them into a very clear set of hypotheses that can be tested. This often entails the associative thinking that characterizes the most creative scientists in any field…. perhaps it’s becoming clear that the word ‘scientist’ fits this emerging role… their greatest opportunity to add value is not in creating reports or presentations for senior executives but in innovating with customer-facing products and processes.”

Still, Davenport and Patil don’t provide a concise definition of a data scientist, so let me offer one based on their observations: A data scientist is an engineer who employs the scientific method and applies data-discovery tools to find new insights in data. The scientific method—the formulation of a hypothesis, the testing, the careful design of experiments, the verification by others—is something they take from their knowledge of statistics and their training in scientific disciplines. The application (and tweaking) of tools comes from their engineering, or more specifically, computer science and programming background. The best data scientists are product and process innovators and sometimes, developers of new data-discovery tools.

That’s the definition of sexy.