Data scientists are data junkies—when they see a new data set they are just naturally excited and can’t wait to explore.
Mingsheng Hong is Chief Data Scientist at Hadapt, a Boston-based startup that offers an analytical platform that integrates structured and unstructured data in one cloud-optimized system. Before joining Hadapt, Mingsheng was Field CTO for Vertica. He holds a Ph.D. in Computer Science from Cornell and a BSc in Computer Science from Fudan University. Mingsheng is president of NECINA and is active in St. Baldrick’s Foundation, a volunteer-driven charity that funds research to find cures for childhood cancers. I talked to Mingsheng just before he shaved his head, a visual indicator and act of solidarity expected from successful St. Baldrick’s fundraisers.
As a graduate student, were you thinking of an academic career?
At Cornell, I explored both academic and private industry career tracks. I love research and innovation, and discovered my passion for explaining ideas to people from various backgrounds and getting them excited about these ideas. While that aligns with a more academic track, in the end I decided the private sector was a better fit for me. I’m driven by the challenge of taking an idea and carrying it end-to-end, from idea to product development to sales. During graduate school, I had the opportunity to visit Microsoft for a few summers, and I got a lot of exposure to database R&D and came away with a good feel for the industry. My research work there was commercialized in SQL Server 2008 and 2012, which was very exciting.
The tech evangelism part of my role in the industry – explaining complex ideas to the market and customers in a clear and compelling way – keeps me fulfilled.
What has your industry experience taught you about the essence of what is now called “data science”?
Data scientists look at existing data sets and often propose new ways of analyzing the data that the owners of the data haven’t yet thought about. Often, people don’t even realize they have monetization opportunities hidden in the data.
Let me give you an example. Telecom operators collect large amounts of data on their subscribers. For each network subscriber, a Call Detail Record (CDR) captures call information: who calls whom, for how long, from which number, etc. In the case of large telecom players, that’s a lot of data. The primary use case for CDR is to generate correct monthly billing. While billing is very important, it doesn’t help telecom operators make more money. They just need to make sure they do it every month and do it correctly.
But CDR could essentially be used as a social network with lots of data on how subscribers are connected with each other and how often they talk to one another. Therefore, it is possible to apply a lot of cutting-edge social network and social gaming-type analysis to the data. For example, for each subscriber, you can understand how many of the people she talks to are also subscribers to your network. Once you get a sense of the density of the network (how many people each subscriber talks to), you can take the analysis a step further and find out if there’s any correlation between people who leave your network (known as “churn”) and the percentage of on-network friends they have. You may need to use a machine learning model to help you understand how the churn rate relates to various other indicators. You can then use that model to build a predictor for which users are likely to leave. And once you have the predictor, you can reach out to them before they leave and incentivize at least some of them to stay. You can also provide incentives to your subscribers to promote your network to their friends who subscribe to another telecom network.
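To make the on-network analysis described above concrete, here is a minimal sketch in Python. It is an illustration only, not Hadapt's or any operator's actual method. The record layout (caller, callee, duration), the subscriber names, the churn labels, and the 0.5 threshold are all made-up assumptions, and a simple two-bucket comparison stands in for the machine learning model mentioned in the interview.

```python
from collections import defaultdict

# Hypothetical toy CDRs: (caller, callee, duration_seconds).
# The field layout and all names are assumptions for illustration.
cdrs = [
    ("alice", "bob", 120),
    ("alice", "carol", 60),
    ("alice", "xavier", 30),
    ("bob", "alice", 200),
    ("bob", "yvonne", 45),
    ("carol", "xavier", 300),
    ("carol", "yvonne", 90),
    ("dave", "zed", 15),
]

subscribers = {"alice", "bob", "carol", "dave"}  # people on our network
churned = {"carol", "dave"}                      # made-up churn labels


def on_network_fraction(cdrs, subscribers):
    """For each subscriber, the share of distinct contacts who are also subscribers."""
    contacts = defaultdict(set)
    for caller, callee, _duration in cdrs:
        # Treat a call as an undirected "talks to" relationship.
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return {
        s: sum(c in subscribers for c in contacts[s]) / len(contacts[s])
        for s in subscribers
        if contacts[s]  # skip subscribers with no recorded calls
    }


def churn_rate_by_bucket(fractions, churned, threshold=0.5):
    """Compare churn rates for subscribers below vs. at or above the threshold."""
    buckets = {"low": [], "high": []}
    for subscriber, fraction in fractions.items():
        name = "high" if fraction >= threshold else "low"
        buckets[name].append(subscriber in churned)
    return {
        name: (sum(flags) / len(flags) if flags else None)
        for name, flags in buckets.items()
    }


fractions = on_network_fraction(cdrs, subscribers)
rates = churn_rate_by_bucket(fractions, churned)
print(fractions)
print(rates)
```

On real data, the bucket comparison would give way to a proper statistical test or a trained classifier, and the per-subscriber fraction would be one feature among many.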
These examples illustrate an important dimension of big data: companies are finding hidden gems in the data that they are collecting anyway.
That’s exactly right. One can start leveraging big data by looking around to see what types of data sets are already available and thinking about how to best monetize them.
What skills do you think a data scientist should have?
It’s a combination of hard skills and soft skills. Hard skills include basic math and statistical understanding, good coding skills, and finally, domain expertise. People working for software companies as data scientists are like product managers and need to have a good understanding of how big data is impacting various industries—that’s the type of domain knowledge that’s needed.
Soft skills include, first and foremost, communication skills. Data scientists need to bring customers onboard. “Customers” can be your clients or your management, whomever you need to convince. Another soft skill is business acumen: being able to understand, in a short timeframe, the pain points or opportunities at hand. Business acumen helps you prioritize and focus your efforts on which data sets to analyze and for what purpose. Finally, a data scientist is typically a curious person. Some people use the term “data junkie”: when data scientists see a new data set, they are just naturally excited and can’t wait to explore. It could be just a bunch of random numbers, but they feel they can make some sense out of it.
Given that there is not much training and certification available right now, what would be your advice to college students? What kind of skills should they invest in if they want to become data scientists?
I would recommend starting early. Whether you are at school or you are a young professional, try to find data sets to work with. For example, you could start playing with US census data—it doesn’t need to be a huge data set. While an important skill for a data scientist is knowing how to scale up data analysis, that’s not the first thing you need to tackle.
To me, the data variety dimension of big data is more interesting for someone who wants to become a data scientist. You could look at some clickstream data (e.g. Google Analytics) and textual data (e.g. Wikipedia, census), and just start analyzing that with tools that require no programming skills, like Excel or Tableau. And then you could write some Python scripts, SQL queries or MapReduce jobs, with a focus on getting comfortable with transforming the data and visualizing it in interesting ways to be able to tell a good story.
So you are advising hands-on experience as a first step?
Yes. There are books you could read to learn tools like R and MapReduce, but short tutorials and hands-on exploration are the best way to get started. Data science has some theory behind it (e.g. algorithmic understanding, machine learning knowledge), but it also involves a lot of practical, hands-on engineering, or “munging,” of data sets.
Yes, data science for the most part deals with practical problems that are solved with engineering methods. What would you advise as a first step for companies that want to build their data science capabilities?
As a company, you typically can’t afford to spend a lot of resources training people, and finding people with these skills today is hard. The best thing for companies to do is to find great tools that are easy to use so they don’t need a large army of data scientists. That’s what the tool vendors, including Hadapt, are creating. How do you make big data really easy? It’s about streamlining and automating the workflows to minimize manual labor. As an example, a single big data platform that unifies structured and unstructured data analysis minimizes the administration overhead of deploying and maintaining the computational infrastructure, while delivering huge performance advantages at the same time. In contrast, if the end user is required to deploy, connect and maintain multiple backend systems that form a big data solution, it could be labor intensive, error prone, and slow performing.
For a company trying to assemble a team of data scientists, do you think it should hire people with all of the necessary skills or put together a team of specialists?
I think it is reasonable to expect that data science will be a team effort, based on team members with complementary skills. Data science is a relatively new term, and unless you are a Facebook or a LinkedIn, where you are able to hire a team of data scientists, a data science team usually consists of one or two data scientists and many other people. It’s like a football team. You have one quarterback and many other functions. That being said, if you have one or two people with all of the required data science skills, it eliminates a lot of communication overhead and everything goes faster.
What’s the ideal place in the organization for the data science team?
In web companies, data scientists typically provide direct input on the questions to ask and the analysis that needs to be done. In software vendor companies, data scientists have two roles. On a tactical level, they help close deals by providing pre-sales support, either to acquire new customers or to upsell to existing ones. This is where data scientists find new use cases to help customers monetize better or solve their pain points. This makes the sales process much more effective than just competing on price or offering an extra feature. They also work post-sales, where they may be involved in prototyping the new process. On a strategic level, the data scientist collects customers’ requirements but also takes thought leadership whenever possible to drive product direction. This is somewhat ironic, because a data scientist is expected to do his or her job in a data-driven way, but that person also needs to be a visionary and think one step ahead of the market in order to develop an innovative product. In that role, the data scientist is the new product manager.
Are there any unique challenges in managing data scientists?
As a data scientist I’m probably biased, but I would advise giving the data scientist as much freedom as you can… Let them go and explore different data sets and analytic methods and see what they come up with. Data science is one of those jobs where productivity comes from creativity and proactivity rather than from very specific instructions. The manager could provide general direction on business priorities and be a great cheerleader.
What about the future of data science? Where do you see progress being made in the near future?
One interesting area is data visualization, where a key focus today is how to tell a good story. It’s about packaging and selling the insights you’ve found. But also, think about using data visualization to help you get to the insights faster, through an iterative exploration workflow where the end users interact with a big data backend system in a visualized way. A rough analogy would be WYSIWYG. You do something, and you see immediate feedback on the screen—this way you can continuously tweak your data analysis/transformation routine. Right now, a lot of programming is often needed in this iterative process, and I believe there’s great potential to create tools that will increase data scientists’ productivity.
What does a data junkie like you do outside of work? Play with data?
One of my passions outside of my daily job is volunteering at a local non-profit organization, NECINA (New England Chinese Information and Networking Association). It involves career and leadership development, entrepreneurship, and helping members build great companies. We’d like to think of our startup incubator program as a non-profit version of Y Combinator. NECINA also serves as a bridge connecting high tech communities in the US and Asia.
I also have been passionate about giving back to individuals and society. Success to me is about making a difference in people’s lives. To quote a great mentor of mine, Chris Lynch, “Two things matter in life: what you do for a living and who you spend time with.” It’s great to do meaningful things with friends, and in the case of St. Baldrick’s, I’m part of the Big Data Boston group, where together we have raised over $50,000 for kids’ cancer research in the last month or so. That’s one way to contribute to society, have fun, and work as a team.