A Very Short History of Data Science

I’m in the process of researching the origin and evolution of data science as a discipline and a profession. Here are the milestones that I have picked up so far, tracking the evolution of the term “data science,” attempts to define it, and some related developments.  I would greatly appreciate any pointers to additional key milestones (events, publications, etc.).

[An updated version of this timeline is at Forbes.com]

1974 Peter Naur publishes Concise Survey of Computer Methods in Sweden and the United States. The book is a survey of contemporary data processing methods that are used in a wide range of applications. It is organized around the concept of data as defined in the IFIP Guide to Concepts and Terms in Data Processing, which defines data as “a representation of facts or ideas in a formalized manner capable of being communicated or manipulated by some process.” The Preface to the book tells the reader that a course plan was presented at the IFIP Congress in 1968, titled “Datalogy, the science of data and of data processes and its place in education,” and that in the text of the book, “the term ‘data science’ has been used freely.” Naur offers the following definition of data science: “The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.”

1977 The International Association for Statistical Computing (IASC) was founded as a Section of the ISI. “It is the mission of the IASC to link traditional statistical methodology, modern computer technology, and the knowledge of domain experts in order to convert data into information and knowledge.”

1989 Gregory Piatetsky-Shapiro organizes and chairs the first Knowledge Discovery in Databases (KDD) workshop. In 1995, it became the annual ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).

1996 Members of the International Federation of Classification Societies (IFCS) meet in Tokyo for their biennial conference. For the first time, the term “data science” is included in the title of the conference (“Data science, classification, and related methods”). The IFCS was founded in 1985 by six country- and language-specific classification societies, one of which, The Classification Society, was founded in 1964. The aim of these classification societies has been to support the study of “the principle and practice of classification in a wide range of disciplines” (CS), “research in problems of classification, data analysis, and systems for ordering knowledge” (IFCS), and the “study of classification and clustering (including systematic methods of creating classifications from data) and related statistical and data analytic methods” (CSNA bylaws). The classification societies have variously used the terms data analysis, data mining, and data science in their publications.

1997 Launch of the journal Data Mining and Knowledge Discovery: “Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high performance computing. KDD is concerned with issues of scalability, the multi-step knowledge discovery process for extracting useful patterns and models from raw data stores (including data cleaning and noise modelling), and issues of making discovered patterns understandable.”

2001 William S. Cleveland  (then at Bell Labs, now at the Department of Statistics at Purdue University) publishes “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.” It is a plan “to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called ‘data science.’” The plan “sets out six technical areas for a university department”: Multidisciplinary Investigations, Models and Methods for Data, Computing with Data, Pedagogy, Tool Evaluation, and Theory. Cleveland puts the proposed new discipline in the context of computer science and the contemporary work on data mining: “…the benefit to the data analyst has been limited, because the knowledge among computer scientists about how to think of and approach the analysis of data is limited, just as the knowledge of computing environments by statisticians is limited. A merger of knowledge bases would produce a powerful force for innovation. This suggests that statisticians should look to computing for knowledge today just as data science looked to mathematics in the past. … departments of data science should contain faculty members who devote their careers to advances in computing with data and who form partnership with computer scientists.”

April 2002 The Data Science Journal is launched, publishing papers on “the management of data and databases in Science and Technology. The scope of the Journal includes descriptions of data systems, their publication on the internet, applications and legal issues.” The journal is published by the Committee on Data for Science and Technology (CODATA) of the International Council for Science (ICSU).

January 2003 The Journal of Data Science is launched: “By ‘Data Science’ we mean almost everything that has something to do with data: Collecting, analyzing, modeling…… yet the most important part is its applications — all sorts of applications. This journal is devoted to applications of statistical methods at large…. The Journal of Data Science will provide a platform for all data workers to present their views and exchange ideas.”

September 2005 The National Science Board publishes “Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century.” One of the recommendations of the report reads: “The NSF, working in partnership with collection managers and the community at large, should act to develop and mature the career path for data scientists and to ensure that the research enterprise includes a sufficient number of high-quality data scientists.” The report defines data scientists as “the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection.”

July 2008 The JISC publishes the final report of a study it commissioned to “examine and make recommendations on the role and career development of data scientists and the associated supply of specialist data curation skills to the research community.” The study’s final report, “The Skills, Role & Career Structure of Data Scientists & Curators: Assessment of Current Practice & Future Needs,” defines data scientists as “people who work where the research is carried out – or, in the case of data centre personnel, in close collaboration with the creators of the data – and may be involved in creative enquiry and analysis, enabling others to work with digital data, and developments in data base technology.”

January 2009 Harnessing the Power of Digital Data for Science and Society is published. This report of the Interagency Working Group on Digital Data to the Committee on Science of the National Science and Technology Council states that “The nation needs to identify and promote the emergence of new disciplines and specialists expert in addressing the complex and dynamic challenges of digital preservation, sustained access, reuse and repurposing of data. Many disciplines are seeing the emergence of a new type of data science and management expert, accomplished in the computer, information, and data sciences arenas and in another domain science. These individuals are key to the current and future success of the scientific enterprise. However, these individuals often receive little recognition for their contributions and have limited career paths. Critical challenges in achieving our strategic vision include providing an effective pipeline of data professionals to ensure that the needs and opportunities of the future can be met and providing these professionals with appropriate rewards and recognition.” The report discusses the emergence of “new information disciplines” and lists a few examples:

  • Digital Curators: experts knowledgeable of and with responsibility for the content of digital collection(s);
  • Digital Archivists: experts competent to appraise, acquire, authenticate, preserve, and provide access to records in digital form; and
  • Data Scientists: information and computer scientists, database and software engineers and programmers, disciplinary experts, expert annotators, and others who are crucial to the successful management of a digital data collection.

May 2009 Mike Driscoll writes in “The Three Sexy Skills of Data Geeks”: “…with the Age of Data upon us, those who can model, munge, and visually communicate data — call us statisticians or data geeks — are a hot commodity.” [Driscoll will follow up with The Seven Secrets of Successful Data Scientists in August 2010]

June 2009 Nathan Yau writes in “Rise of the Data Scientist”:  “As we’ve all read by now, Google’s chief economist Hal Varian commented in January that the next sexy job in the next 10 years would be statisticians. Obviously, I whole-heartedly agree. Heck, I’d go a step further and say they’re sexy now – mentally and physically. However, if you went on to read the rest of Varian’s interview, you’d know that by statisticians, he actually meant it as a general title for someone who is able to extract information from large datasets and then present something of use to non-data experts… [Ben] Fry… argues for an entirely new field that combines the skills and talents from often disjoint areas of expertise… [computer science; mathematics, statistics, and data mining; graphic design; infovis and human-computer interaction]. And after two years of highlighting visualization on FlowingData, it seems collaborations between the fields are growing more common, but more importantly, computational information design edges closer to reality. We’re seeing data scientists – people who can do it all – emerge from the rest of the pack.”

June 2009 Troy Sadkowsky creates the data scientists group on LinkedIn as a companion to his website, datasceintists.com (which later became datascientists.net).

[Update] February 2010 Kenneth Cukier writes in “Data, data everywhere: A special report on managing information”: “… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.”

June 2010 Mike Loukides writes in “What is Data Science?”:  “Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: ‘here’s a lot of data, what can you make from it?’”

September 2010  Hilary Mason and Chris Wiggins write in “A Taxonomy of Data Science”:  “…we thought it would be useful to propose one possible taxonomy… of what a data scientist does, in roughly chronological order: Obtain, Scrub, Explore, Model, and iNterpret…. Data science is clearly a blend of the hackers’ arts… statistics and machine learning… and the expertise in mathematics and the domain of the data for the analysis to be interpretable… It requires creative decisions and open-mindedness in a scientific context.”

September 2010 Drew Conway writes in “The Data Science Venn Diagram”:  “…one needs to learn a lot as they aspire to become a fully competent data scientist. Unfortunately, simply enumerating texts and tutorials does not untangle the knots. Therefore, in an effort to simplify the discussion, and add my own thoughts to what is already a crowded market of ideas, I present the Data Science Venn Diagram… hacking skills, math and stats knowledge, and substantive expertise.”

May 2011  Pete Warden writes in “Why the term ‘data science’ is flawed but useful”: “There is no widely accepted boundary for what’s inside and outside of data science’s scope. Is it just a faddish rebranding of statistics? I don’t think so, but I also don’t have a full definition. I believe that the recent abundance of data has sparked something new in the world, and when I look around I see people with shared characteristics who don’t fit into traditional categories. These people tend to work beyond the narrow specialties that dominate the corporate and institutional world, handling everything from finding the data, processing it at scale, visualizing it and writing it up as a story. They also seem to start by looking at what the data can tell them, and then picking interesting threads to follow, rather than the traditional scientist’s approach of choosing the problem first and then finding data to shed light on it.”

May 2011 David Smith writes in “‘Data Science’: What’s in a name?”: “The terms ‘Data Science’ and ‘Data Scientist’ have only been in common usage for a little over a year, but they’ve really taken off since then: many companies are now hiring for ‘data scientists’, and entire conferences are run under the name of ‘data science’. But despite the widespread adoption, some have resisted the change from the more traditional terms like ‘statistician’ or ‘quant’ or ‘data analyst’…. I think ‘Data Science’ better describes what we actually do: a combination of computer hacking, data analysis, and problem solving.”

September 2011 Harlan Harris writes in “Data Science, Moore’s Law, and Moneyball”: “‘Data Science’ is defined as what ‘Data Scientists’ do. What Data Scientists do has been very well covered, and it runs the gamut from data collection and munging, through application of statistics and machine learning and related techniques, to interpretation, communication, and visualization of the results. Who Data Scientists are may be the more fundamental question… I tend to like the idea that Data Science is defined by its practitioners, that it’s a career path rather than a category of activities. In my conversations with people, it seems that people who consider themselves Data Scientists typically have eclectic career paths, that might in some ways seem not to make much sense.”

September 2011 DJ Patil writes in “Building Data Science Teams”: “Starting in 2008, Jeff Hammerbacher (@hackingdata) and I sat down to share our experiences building the data and analytics groups at Facebook and LinkedIn. In many ways, that meeting was the start of data science as a distinct professional specialization…. we realized that as our organizations grew, we both had to figure out what to call the people on our teams. ‘Business analyst’ seemed too limiting. ‘Data analyst’ was a contender, but we felt that title might limit what people could do. After all, many of the people on our teams had deep engineering expertise. ‘Research scientist’ was a reasonable job title used by companies like Sun, HP, Xerox, Yahoo, and IBM. However, we felt that most research scientists worked on projects that were futuristic and abstract, and the work was done in labs that were isolated from the product development teams. It might take years for lab research to affect key products, if it ever did. Instead, the focus of our teams was to work on data applications that would have an immediate and massive impact on the business. The term that seemed to fit best was data scientist: those who use both data and science to create something new.”

Update: Gregory Piatetsky-Shapiro posted a great discussion of the journey from data mining to big data. Note that in this timeline, I tried to focus on specific mentions of “data science” and attempts to define it.

See also: A Very Short History of Big Data

About GilPress

I launched the Big Data conversation; writing, research, marketing services; http://whatsthebigdata.com/ & http://infostory.com/

52 Responses to A Very Short History of Data Science

  1. Cosimo Accoto says:

    Great post. I would suggest adding to the book list: “Taming the Big Data” (2012, April) http://www.amazon.com/Taming-Data-Tidal-Wave-Opportunities/dp/1118208781/ref=sr_1_9?ie=UTF8&qid=1335517797&sr=8-9 and the forthcoming “Big Data Big Analytics” (2012) http://www.amazon.com/Big-Data-Analytics-Intelligence-Businesses/dp/111814760X/ref=sr_1_1?s=books&ie=UTF8&qid=1335517928&sr=1-1 and “Big Data Analytics” (2012) http://www.amazon.com/Big-Data-Analytics-Turning-Business/dp/1118147596/ref=pd_bxgy_b_img_b and the Forrester, McKinsey research papers recently published ;-)

  2. S. Reddy says:

    In the early 70s, my father worked at the Meteorological department of India. His job was to collect the historical weather data, apply statistical modeling to the data, and derive patterns from that data. The goal was to predict long term weather patterns (droughts, etc.) so that the government can plan accordingly. A lot of the modeling was done on ancient computers, and he used to bring home punch-cards in 1972.

    He then worked in various other countries doing similar analysis on long term weather data. To me, this looks just like the definition of a “data scientist”. He has a background in statistics, published many papers, wrote a few books, etc.

    Looks like the “data science” field was an actual job in meteorology way before it became a cool job in the age of big data. And climate and weather data collection produced a lot of data!

    • GilPress says:

      Collection and analysis of weather data is indeed a great place to start looking at the emergence of big data. In the United States, I can point to a specific, pre-computer, date: November 1, 1870, when the U.S. Weather Bureau made its first meteorological observations using 24 locations that provided reports via telegraph.

  3. Richard Lee says:

    A very good piece of work here. There have no doubt been data scientists for many centuries now, they just did not know it. I remember working at NOAA’s National Climatic Data Center where they had weather records going back to Ben Franklin, Davy Crockett and others. Many explorers were data gatherers and scientists without realizing it. This notion of “big data” is merely a blip along that timeline.

  4. Read related article on “Six keywords characterizing milestones in the history of analytic engineering: from 1988 to 2033” at http://bit.ly/z0Uvd0.

  5. Saqib Khan says:

    Nice research. Most of us never learn the whole history and evolution of data science. Good effort, indeed.

  7. johnse11 says:

    The concept of micromarketing was promulgated in the late 1980’s and referred to applying computers to segment markets down to local market level, and even to individual consumers, and targeting offers accordingly.

    1. Whitehead, John. The Need to Rethink Analysis, Marketing Week, 14th November 1988.
    2. Whitehead, John. Paying Attention to Detail, Marketing, 22nd February 1990.

  8. Dear Gil, I think a mention of the father of the Relational Database and how IBM played a role, is fairly significant:
    1970 – RELATIONAL DATABASES. IBM scientist Ted Codd published a paper introducing the concept of relational databases. It calls for information stored within a computer to be arranged in easy-to-interpret tables so that nontechnical users can access and manage large amounts of data. Most database structures in use today are based on the IBM concept of relational databases.

    Respectfully yours,

    • Jeff Drobman says:

      Thank you. At least someone remembers all the early work done on “DBMS” data mgt systems by chiefly IBM, and later Oracle. This dates back to the 1960s. I believe Edgar Codd published his major paper on a “relational algebra” for processing databases in 1969.

  9. Andy Brice says:

    How far back do you want to go? You might want to research the pioneering work of Francis Galton on surveys and statistics:

  10. Eric Genesky says:

    Very cool history of data science; would you mind if I republish it on DZone.com? We’ve got a readership of developers who would really appreciate some expertise on the subjects of big data / data science, and I think they’d get something out of this. Just shoot me an email, and thanks for the good read.

  12. dwolcott says:

    Although a niche area it is amazing to read about the data needs of CERN, one of the pioneers in the need for large data storage. A great history of how even back in 1958 when they got their first computer they didn’t have enough storage and it is still a problem they have in 2011. A great read for those interested (Go through all the links and get some great info and pics):

  13. Mikael Huss says:

    Reblogged this on Follow the Data and commented:
    The word “datalogy” (mentioned in the beginning) is still used in Sweden; I used to teach courses in it!

  14. vcjha says:

    Reblogged this on Vcjha's Blog and commented:
    One of the best blogs about the history of data crunching..:)

  17. Eric says:

    Have you looked into the history of library science at all? Academic libraries were the first to put large amounts of data onto the Internet (before it was public), and they essentially pioneered IR and built some of the earliest of what we’d now call search engines. Libraries were building these systems back in the 1960s and were among the very first organizations to build specialized computer systems for managing data.

  18. Bill Cernansky says:

    How close are we to Asimov’s “psychohistory”, where the behavior of large numbers of people can be predicted through data such as these?

  20. Bob C. says:

    Actually, you could have started from a much earlier time frame. A number of excellent papers were written on the topic (or related topics) back in the 1950s and 1960s. Having worked with disk systems and file organizations back in the early 60s, I read a number of papers and a few books on data organization. Unfortunately, I don’t remember titles and authors that far back. A couple of years ago, I donated some material I had kept to the Museum of Computer History in Mountain View (San Jose), CA. I presume you have already been in contact with them. It may surprise you to find some of the thinking back then was far ahead of its time (at least implementation-wise).

    • GilPress says:

      Bob, I have searched the IEEE and ACM digital libraries and it looks like the earliest mention of “data science” (and with a different meaning than today’s term) is from the 1960s. I’m sure that there were very prescient observations about data analysis in the 1950s and if you find the specific references, please share them.

  21. Luke chen says:

    A great review on the history. Thanks.
    If you don’t mind, I would like to share this article on some Chinese sites.
    Please email me if you are uncomfortable with that. Best wishes!

  26. Curtis says:

    Enjoyed this post – have you run into any histories of the use of data in policy making or legislating? Or of how the use of data (and big data) changed how (process-wise) policy decisions and/or legislation are made?

  32. Stanley Loh says:

    Congratulations for this great study.
    My contributions:
    1) Database research faces the information explosion
    Henry F. Korth and Abraham Silberschatz
    2) The Fourth Paradigm: Data-Intensive Scientific Discovery
    Jim Gray

  33. krexer says:

    I suggest:
    — 1997 Berry & Linoff — “Data Mining Techniques for Marketing, Sales, & Customer Support”
    — 2007 Davenport & Harris — “Competing on Analytics” (for explaining to business leaders why they should care about data science)
    — Leo Breiman’s early work on CART (1984), and his later work on other algorithms (http://en.wikipedia.org/wiki/Leo_Breiman)
    — Jerome Friedman’s early work on CART (1984), and his later work on other algorithms (http://en.wikipedia.org/wiki/Jerome_H._Friedman)
    — Maybe some mention of Vapnik’s 1995 work on Support Vector Machines (http://en.wikipedia.org/wiki/Support_vector_machine)
    — Certainly some other early IBM work (but I don’t have references). Might also mention some of the recent IBM Watson work.

  40. Ilya Geller says:

    Language has its Internal parsing and statistics.
    For instance, there are two sentences:
    a) ‘Fire!’
    b) ‘In this amazing city of Rome some people sometimes may cry in agony: ‘Fire!’’
    Evidently, the phrase ‘Fire!’ has different importance in the two sentences, given the extra information in the second. This distinction is reflected in the phrase weights: the first has 1, the second 0.12; the greater weight signifies stronger emotional ‘acuteness’.

    First you need to parse obtaining phrases from clauses, for sentences and paragraphs. Next, you calculate Internal statistics, weights; where the weight refers to the frequency that a context phrase occurs in relation to other context phrases.
    After that, you index each word from each phrase by dictionary, annotate it by subtexts.
    Your text is structured.

    There is no Big Data or its problem. Think.
