Yun Xiong is an Associate Professor of Computer Science and the Associate Director of the Center for Data Science and Dataology at Fudan University, Shanghai, China. She received her Ph.D. in Computer and Software Theory from Fudan University in 2008. Her research interests include dataology and data science, data mining, big data analysis, and the development of effective and efficient data analysis techniques for various applications, including finance, economics, insurance, bioinformatics, and sociology. The following is an edited version of our recent email exchange.
How has data science developed in China?
Our research center has been a main force in advancing data science research in China. We held three international workshops on data science and dataology from 2010 to 2012 and published our first monograph, Dataology and Data Science, in 2009. The center has seven units: the Data Resource Service Office, the Dataology and Data Science Research Lab, the New Economy Development Strategy Research Lab, the Bio-Medical Data Research Lab, the Brain Informatics Research Group, the Intelligent Transportation Data Research Group, and the Financial Data Research Group. The Center for Data Science and Dataology was established in 2007, continuing the work of the Data Mining Group, which was established in 1999.
Could you give me examples of research projects at the Dataology and Data Science Research Center?
We do research in a variety of fields. For example, we have studied data mining techniques for gene sequencing, intelligent transportation systems, computer viruses, and the stock market. Our main ongoing research project is the study of the foundational theory of data science.
What is your definition of data science?
Professor Yangyong Zhu and I define data science as a science of data in cyberspace. In our view, data science has two key dimensions. One is to provide a novel research method, which we call Scientific Research Method with Data, for the natural and social sciences; the other is to research the phenomena and laws of datanature.
What do you mean by datanature?
Datanature means all data in cyberspace, including data that reflects nature and human behavior, and data without direct references in reality, such as computer viruses, some network games, and junk data.
In your paper with Professor Yangyong Zhu, you talk about “the second data explosion.” What do you mean by that?
The inventions of papermaking and printing brought about the first data explosion. The inventions of computers (especially the Internet and the World Wide Web) and storage devices brought about the second data explosion.
How is data science different from previous methods of data mining and data analysis?
Both data mining and data analysis are data-related technologies which cannot be regarded as ‘science’. In data science, there is a strong emphasis on theories, in addition to technologies. The goal of data science is to study the phenomena and laws of datanature. For example, what is the data explosion? Today, we hear a lot about big data, but what will happen in datanature in the future? What is the ontology in datanature?
The main research topics in dataology and data science include: (1) the foundational theory of dataology and data science; (2) the methods of data experiment and logical reasoning; (3) the theories and methods of domain dataology, e.g., behavior dataology, biological dataology, brain dataology, and financial dataology; and (4) the methods and technologies for utilizing and exploiting data as a resource (similar to oil and minerals).
Which specific application of data science are you personally interested in?
Scientific Research Method with Data. As the amount of data in different scientific fields increases, many scientific problems cannot be solved using traditional methods, and we must explore new approaches to dealing with scientific data.
What kinds of new directions and breakthroughs in data science do you expect in the next five years?
We will see further exploration of the fundamental theories and innovative methodologies of data. This will not be a short-term effort, but an important task for the next half-century or more. In the next five years, there will be breakthroughs in the measurement of data, data algebra, data similarity, data cyclopedia, truth in data, etc.