I’m in the process of researching the origin and evolution of data science as a discipline and a profession. Here are the milestones that I have picked up so far, tracking the evolution of the term “data science,” attempts to define it, and some related developments. I would greatly appreciate any pointers to additional key milestones (events, publications, etc.).
1974 Peter Naur publishes Concise Survey of Computer Methods in Sweden and the United States. The book is a survey of contemporary data processing methods that are used in a wide range of applications. It is organized around the concept of data as defined in the IFIP Guide to Concepts and Terms in Data Processing, which defines data as “a representation of facts or ideas in a formalized manner capable of being communicated or manipulated by some process.” The Preface to the book tells the reader that a course plan was presented at the IFIP Congress in 1968, titled “Datalogy, the science of data and of data processes and its place in education,” and that in the text of the book, “the term ‘data science’ has been used freely.” Naur offers the following definition of data science: “The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.”
1977 The International Association for Statistical Computing (IASC) is founded as a Section of the International Statistical Institute (ISI). “It is the mission of the IASC to link traditional statistical methodology, modern computer technology, and the knowledge of domain experts in order to convert data into information and knowledge.”
1989 Gregory Piatetsky-Shapiro organizes and chairs the first Knowledge Discovery in Databases (KDD) workshop. In 1995, it became the annual ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
1996 Members of the International Federation of Classification Societies (IFCS) meet in Tokyo for their biennial conference. For the first time, the term “data science” is included in the title of the conference (“Data science, classification, and related methods”). The IFCS was founded in 1985 by six country- and language-specific classification societies, one of which, The Classification Society, was founded in 1964. The aim of these classification societies has been to support the study of “the principle and practice of classification in a wide range of disciplines” (CS), “research in problems of classification, data analysis, and systems for ordering knowledge” (IFCS), and the “study of classification and clustering (including systematic methods of creating classifications from data) and related statistical and data analytic methods” (CSNA bylaws). The classification societies have variously used the terms data analysis, data mining, and data science in their publications.
1997 The journal Data Mining and Knowledge Discovery is launched: “Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high performance computing. KDD is concerned with issues of scalability, the multi-step knowledge discovery process for extracting useful patterns and models from raw data stores (including data cleaning and noise modelling), and issues of making discovered patterns understandable.”
2001 William S. Cleveland (then at Bell Labs, now at the Department of Statistics at Purdue University) publishes “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.” It is a plan “to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called ‘data science.’” The plan “sets out six technical areas for a university department”: Multidisciplinary Investigations, Models and Methods for Data, Computing with Data, Pedagogy, Tool Evaluation, and Theory. Cleveland puts the proposed new discipline in the context of computer science and the contemporary work on data mining: “…the benefit to the data analyst has been limited, because the knowledge among computer scientists about how to think of and approach the analysis of data is limited, just as the knowledge of computing environments by statisticians is limited. A merger of knowledge bases would produce a powerful force for innovation. This suggests that statisticians should look to computing for knowledge today just as data science looked to mathematics in the past. … departments of data science should contain faculty members who devote their careers to advances in computing with data and who form partnership with computer scientists.”
April 2002 The Data Science Journal is launched, publishing papers on “the management of data and databases in Science and Technology. The scope of the Journal includes descriptions of data systems, their publication on the internet, applications and legal issues.” The journal is published by the Committee on Data for Science and Technology (CODATA) of the International Council for Science (ICSU).
January 2003 The Journal of Data Science is launched: “By ‘Data Science’ we mean almost everything that has something to do with data: Collecting, analyzing, modeling…… yet the most important part is its applications — all sorts of applications. This journal is devoted to applications of statistical methods at large…. The Journal of Data Science will provide a platform for all data workers to present their views and exchange ideas.”
September 2005 The National Science Board publishes “Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century.” One of the recommendations of the report reads: “The NSF, working in partnership with collection managers and the community at large, should act to develop and mature the career path for data scientists and to ensure that the research enterprise includes a sufficient number of high-quality data scientists.” The report defines data scientists as “the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection.”
July 2008 The JISC publishes the final report of a study it commissioned to “examine and make recommendations on the role and career development of data scientists and the associated supply of specialist data curation skills to the research community.” The study’s final report, “The Skills, Role & Career Structure of Data Scientists & Curators: Assessment of Current Practice & Future Needs,” defines data scientists as “people who work where the research is carried out – or, in the case of data centre personnel, in close collaboration with the creators of the data – and may be involved in creative enquiry and analysis, enabling others to work with digital data, and developments in data base technology.”
January 2009 Harnessing the Power of Digital Data for Science and Society is published. This report of the Interagency Working Group on Digital Data to the Committee on Science of the National Science and Technology Council states that “The nation needs to identify and promote the emergence of new disciplines and specialists expert in addressing the complex and dynamic challenges of digital preservation, sustained access, reuse and repurposing of data. Many disciplines are seeing the emergence of a new type of data science and management expert, accomplished in the computer, information, and data sciences arenas and in another domain science. These individuals are key to the current and future success of the scientific enterprise. However, these individuals often receive little recognition for their contributions and have limited career paths. Critical challenges in achieving our strategic vision include providing an effective pipeline of data professionals to ensure that the needs and opportunities of the future can be met and providing these professionals with appropriate rewards and recognition.” The report discusses the emergence of “new information disciplines” and lists a few examples:
- Digital Curators: experts knowledgeable of and with responsibility for the content of digital collection(s);
- Digital Archivists: experts competent to appraise, acquire, authenticate, preserve, and provide access to records in digital form; and
- Data Scientists: information and computer scientists, database and software engineers and programmers, disciplinary experts, expert annotators, and others who are crucial to the successful management of a digital data collection.
May 2009 Mike Driscoll writes in “The Three Sexy Skills of Data Geeks”: “…with the Age of Data upon us, those who can model, munge, and visually communicate data — call us statisticians or data geeks — are a hot commodity.” [Driscoll will follow up with The Seven Secrets of Successful Data Scientists in August 2010]
June 2009 Nathan Yau writes in “Rise of the Data Scientist”: “As we’ve all read by now, Google’s chief economist Hal Varian commented in January that the next sexy job in the next 10 years would be statisticians. Obviously, I whole-heartedly agree. Heck, I’d go a step further and say they’re sexy now – mentally and physically. However, if you went on to read the rest of Varian’s interview, you’d know that by statisticians, he actually meant it as a general title for someone who is able to extract information from large datasets and then present something of use to non-data experts… [Ben] Fry… argues for an entirely new field that combines the skills and talents from often disjoint areas of expertise… [computer science; mathematics, statistics, and data mining; graphic design; infovis and human-computer interaction]. And after two years of highlighting visualization on FlowingData, it seems collaborations between the fields are growing more common, but more importantly, computational information design edges closer to reality. We’re seeing data scientists – people who can do it all – emerge from the rest of the pack.”
[Update] February 2010 Kenneth Cukier writes in “Data, data everywhere: A special report on managing information”: “… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.”
June 2010 Mike Loukides writes in “What is Data Science?”: “Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: ‘here’s a lot of data, what can you make from it?’”
September 2010 Hilary Mason and Chris Wiggins write in “A Taxonomy of Data Science”: “…we thought it would be useful to propose one possible taxonomy… of what a data scientist does, in roughly chronological order: Obtain, Scrub, Explore, Model, and iNterpret…. Data science is clearly a blend of the hackers’ arts… statistics and machine learning… and the expertise in mathematics and the domain of the data for the analysis to be interpretable… It requires creative decisions and open-mindedness in a scientific context.”
September 2010 Drew Conway writes in “The Data Science Venn Diagram”: “…one needs to learn a lot as they aspire to become a fully competent data scientist. Unfortunately, simply enumerating texts and tutorials does not untangle the knots. Therefore, in an effort to simplify the discussion, and add my own thoughts to what is already a crowded market of ideas, I present the Data Science Venn Diagram… hacking skills, math and stats knowledge, and substantive expertise.”
May 2011 Pete Warden writes in “Why the term ‘data science’ is flawed but useful”: “There is no widely accepted boundary for what’s inside and outside of data science’s scope. Is it just a faddish rebranding of statistics? I don’t think so, but I also don’t have a full definition. I believe that the recent abundance of data has sparked something new in the world, and when I look around I see people with shared characteristics who don’t fit into traditional categories. These people tend to work beyond the narrow specialties that dominate the corporate and institutional world, handling everything from finding the data, processing it at scale, visualizing it and writing it up as a story. They also seem to start by looking at what the data can tell them, and then picking interesting threads to follow, rather than the traditional scientist’s approach of choosing the problem first and then finding data to shed light on it.”
May 2011 David Smith writes in “‘Data Science’: What’s in a name?”: “The terms ‘Data Science’ and ‘Data Scientist’ have only been in common usage for a little over a year, but they’ve really taken off since then: many companies are now hiring for ‘data scientists’, and entire conferences are run under the name of ‘data science’. But despite the widespread adoption, some have resisted the change from the more traditional terms like ‘statistician’ or ‘quant’ or ‘data analyst’…. I think ‘Data Science’ better describes what we actually do: a combination of computer hacking, data analysis, and problem solving.”
September 2011 Harlan Harris writes in “Data Science, Moore’s Law, and Moneyball”: “‘Data Science’ is defined as what ‘Data Scientists’ do. What Data Scientists do has been very well covered, and it runs the gamut from data collection and munging, through application of statistics and machine learning and related techniques, to interpretation, communication, and visualization of the results. Who Data Scientists are may be the more fundamental question… I tend to like the idea that Data Science is defined by its practitioners, that it’s a career path rather than a category of activities. In my conversations with people, it seems that people who consider themselves Data Scientists typically have eclectic career paths, that might in some ways seem not to make much sense.”
September 2011 DJ Patil writes in “Building Data Science Teams”: “Starting in 2008, Jeff Hammerbacher (@hackingdata) and I sat down to share our experiences building the data and analytics groups at Facebook and LinkedIn. In many ways, that meeting was the start of data science as a distinct professional specialization…. we realized that as our organizations grew, we both had to figure out what to call the people on our teams. ‘Business analyst’ seemed too limiting. ‘Data analyst’ was a contender, but we felt that title might limit what people could do. After all, many of the people on our teams had deep engineering expertise. ‘Research scientist’ was a reasonable job title used by companies like Sun, HP, Xerox, Yahoo, and IBM. However, we felt that most research scientists worked on projects that were futuristic and abstract, and the work was done in labs that were isolated from the product development teams. It might take years for lab research to affect key products, if it ever did. Instead, the focus of our teams was to work on data applications that would have an immediate and massive impact on the business. The term that seemed to fit best was data scientist: those who use both data and science to create something new.”
Update: Gregory Piatetsky-Shapiro posted a great discussion of the journey from data mining to big data. Note that in this timeline, I tried to focus on specific mentions of “data science” and attempts to define it.
See also: A Very Short History of Big Data