To Do Data Science, You Need a Team of Specialists
Currently the Chief Scientist at PayPal, Mok Oh came on board when eBay acquired WHERE, where he was Chief Innovation Officer. Prior to WHERE, Mok founded EveryScape, a data visualization company. The following is an edited transcript of our recent phone conversation.
How do you define a data scientist?
Data scientists answer the key question of what to do with the huge amount of information being created, and there’s no one description that fits all. On my team at PayPal, we have people with Ph.D.s in statistics, in machine learning, in natural language processing, in information retrieval, and in data mining. There’s also a bunch of really amazing and smart engineers who pretty much can build anything. I don’t think there’s one definition of a data scientist, a person who can do all of those things. You need a team of people that can work together to get something useful out of the raw data.
So the trick is to manage a team where people bring their own unique set of skills and expertise.
That’s exactly right. Take for example unstructured data. We are trying to understand what impact social media has on buying behavior. Typically, social media is very unstructured and it’s hard to extract sentiment or get valuable information out of it. You need people that can do a lot of data mining, some machine learning and natural language processing. But if you want to create a real-time data product that is smart enough to buy low and sell high, then you want some wonderful engineer who can just optimize the crap out of it so the response time is ten milliseconds. I would not ask a Ph.D., myself included, to do this work. My point is: nobody can be good at everything. I have very few generalists on my team. It’s a conglomeration of specialists.
How should the data science team work with other groups, for example IT?
Ultimately, if you want to scale a data product, you need to collaborate with other groups and work very closely with IT and the data management teams and other relevant teams in the company. But for me, this is very binary. Even in a relatively big company like PayPal, we need to have very small and nimble teams. When it comes to the data science team, our work is all about trying to efficiently prove or disprove a hypothesis. In order to get there efficiently, my general philosophy is that once it’s proven or gotten to a point where it’s proven enough, then we can start working with other groups. But until then, we need to have the freedom to fail as fast as we can. That’s why we’re having this data science ecosystem with a bunch of specialists creating a proof-of-concept that can prove or disprove a hypothesis. When I said binary I meant that first you’ve got to go very small and learn as fast as you can. When you think that the learning has reached some sort of critical mass, then you go and involve other people.
Who comes up with the hypotheses and the questions?
One source of new questions is the business analysts. They understand very keenly what needs to happen for the company to succeed. Another source is the data scientist who may say “hey, here’s the data, let’s figure stuff out, see what emerges.” I think there needs to be a balance between the two. Sometimes we don’t know what the right questions to ask are. Even if you’re smart and you’ve been in this business for the longest time, you don’t know what you don’t know. Sometimes when we just look at the data, we see very interesting questions emerging. I think it’s an 80/20 rule where 20% is let’s just see what emerges and the other 80% is the important questions that we know today and that people with experience in the business understand very well.
What are the skills you look for when you are recruiting data scientists for your team?
For now, if you want to be a data scientist, you need to know how to code. I think that’s unfortunate. The most important thing for a data scientist is knowing what questions to ask and how to go about answering them. But even if you can ask the right question, if you can’t dive down into the data, extract data, find it, clean it… if you can’t do any of these things, it’s going to be fairly difficult at this point. Understanding how to code, understanding the language, is going to be important. Also important is understanding the math behind it. I don’t think you need a Ph.D. for that. In our team, some of the most productive people have a bachelor’s or master’s degree. Being comfortable with code and math and machine learning is what I will be looking for, knowledge of the languages needed to tell what’s in the data.
What about data visualization?
Great question. It’s always been and still is very dear and important for what I’m doing, since it’s a vehicle used to tell a “nonlinear” story that we understand naturally. It’s another important language data scientists should be comfortable speaking.
I sort of guessed from your bio that you actually wanted to be an architect…
At Oberlin College, where I studied computer science, art history and studio art, I kept wondering if there’s a way to replicate Kandinsky or Jackson Pollock in a machine, so that this kind of expressiveness can come out. Can you do it with an algorithm? I’m not so sure we can, but I still think it’s a very interesting question. Language, whether visual language or other languages (e.g., machine language), is how we express meaning, including the meaning of the data we explore.
I tried to continue to pursue these interests at the University of Pennsylvania, where I enrolled in both the master of architecture program and the computer science program. But I quickly found out that these were vastly different worlds. The school of engineering was just across the street from the school of architecture, a road I was crossing a number of times a day. It felt like going from one world to another. In one world you were sitting by yourself in the dark, coding and coding, and all of a sudden 12 hours had gone by… Crossing the street, you found yourself talking with beautiful women about beautiful things, engaging in lots of theoretical discussions. I finally decided that being a professional architect was not what I wanted.
You also decided not to stay in academia after you got your Ph.D. in computer graphics from MIT.
At MIT I had a great time–the computer graphics group had just gotten started. But I didn’t want to stay in this environment because I like to create things that are useful. I wasn’t entirely sure if the academic system was optimized towards usefulness or towards publishing. I’m not saying academics cannot create something useful–and it’s a different world today because capital is available for academics to start new ventures. But not at that time; I didn’t want to do research, I wanted to create something for people to use. Maybe that was the pull of the architect in me, wanting to build something.
You talked about being able to “speak” different languages. What else do you see as required skills for data scientists?
The other piece is sort of the x-factor. It’s very important for people to be able to communicate and work as a team. There are so many people who are wonderful and brilliant, I would even say that they are a dime a dozen. But to succeed as a data scientist, you have to be able to sell your idea, you have to be able to work well with the rest of the team, and you have to be able to bring out the best in others. This is especially important when you have a team of specialists working on very different things, and who typically would be more introverted than not.
Where do you see data science in three to five years?
I divide it into two domains: There’s the data and there’s the science. In terms of the science piece, I’m not sure, at a fundamental level, that there has been anything new in the last 30 or 40 years. Processors got faster, memories got bigger, things got better, faster, cheaper. But on a fundamental level, I don’t think there’s anything new. So the next step-function for science will probably happen somewhere in the academic world; it’s a type of question that people need to eat, sleep, and breathe 24×7 for decades.
The other way to think about the future is to think about the data. Right now, data sucks. Everybody’s spending so much time just cleansing the data. Going forward, for example, we need to make sure that any data from any sensor out there can easily make sense, that raw data can actually be turned into information fast. It’s going to take time and maybe it will take a standards organization to work it out. For example, any kind of information created by cameras–let’s make it a standard, call it MP4… a standard way to store the data, a standard way for accessing it. In terms of data, there are also the privacy and security issues that need to be sorted out. That’s a philosophical-cultural-legal debate that needs to happen. We’re just scratching the surface now.
And where are you going with your own work?
The overarching question for our team at PayPal is how do we leverage data science to help shoppers and enhance online and offline transactions, from buying groceries to something with potentially larger social impact such as giving, while at the same time preserving the privacy that our customers trust us to protect. The question could be how do we make sure that the friction in giving is minimized or how do we make sure cross-border giving is optimized. Questions like these are important and interesting and drive the work of our data scientists.