Data Scientists Spend Most of Their Time Cleaning Data



A new survey of data scientists found that they spend most of their time massaging rather than mining or modeling data. Still, most are happy with having the sexiest job of the 21st century. The survey of about 80 data scientists was conducted for the second year in a row by CrowdFlower, provider of a “data enrichment” platform for data scientists. Here are the highlights:

Data preparation accounts for about 80% of the work of data scientists

Data scientists spend 60% of their time cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend 79%, or roughly 80%, of their time preparing and managing data for analysis.

76% of data scientists view data preparation as the least enjoyable part of their work

57% of data scientists regard cleaning and organizing data as the least enjoyable part of their work and 19% say this about collecting data sets.
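The two headline figures are simple sums of the survey shares quoted above. As a quick check, here is a minimal sketch of the arithmetic; the dictionary layout and the function name are my own assumptions for illustration and are not taken from the CrowdFlower report.

```python
# Shares quoted in the survey summary above, in percent.
# The dictionary layout and names are assumptions made for this sketch.
time_spent = {
    "cleaning and organizing data": 60,
    "collecting data sets": 19,
}
least_enjoyable = {
    "cleaning and organizing data": 57,
    "collecting data sets": 19,
}


def preparation_share(shares):
    """Sum the percentage shares of the data preparation tasks."""
    return sum(shares.values())


prep_time = preparation_share(time_spent)
prep_disliked = preparation_share(least_enjoyable)

print(f"Time spent on data preparation: {prep_time}% (reported as about 80%)")
print(f"Data preparation named least enjoyable: {prep_disliked}%")
```

So the "about 80%" in the first headline is 79% rounded up, while the 76% in the second headline is the exact sum.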

These findings are yet another confirmation of a very widely known and lamented fact of the data scientist’s work experience. In 2009, data scientist Mike Driscoll popularized the term “data munging,” describing the “painful process of cleaning, parsing, and proofing one’s data” as one of the three sexy skills of data geeks. In 2013, Josh Wills (then director of Data Science at Cloudera, now Director of Data Engineering at Slack) told Technology Review “I’m a data janitor. That’s the sexiest job of the 21st century. It’s very flattering, but it’s also a little baffling.” And Big Data Borat tweeted that “Data Science is 99% preparation, 1% misinterpretation.”

Given that the median annual U.S. base salary of hard-to-find, much-in-demand data scientists was $104,000 last year, a number of startups have focused on automating this essential but boring task. In his 2016 Big Data Landscape, Matt Turck lists a number of them in the “data transformation” box, along with companies (such as CrowdFlower) that are addressing this need with crowdsourcing. Both groups appear in the “infrastructure” section.

Investment in solutions to messy data will continue: IDC has predicted that through 2020, spending on self-service visual discovery and data preparation tools will grow 2.5x faster than spending on traditional IT-controlled tools for similar functionality. Following the same trend, Forrester predicted that in 2016, machine learning will begin to replace manual “data wrangling” (another endearing term like “data munging”) and data governance dirty work, and that vendors will market these solutions as a way to make data ingestion, preparation, and discovery quicker.

Indeed, 55% of the respondents to the CrowdFlower survey agreed with Forrester, predicting that over the next year machine learning will be (or will continue to be) of significant importance to their companies and their departments.

Other findings:

35% of data scientists gave their job the highest mark possible.

Only 14% of data scientists felt they were being held back by their tools.

What data scientists want most is more support and direction from their management or executive team (27%).

Finally, CrowdFlower looked at nearly 4,000 data science job postings on LinkedIn to find out what skills organizations wanted from their new hires. Last year they found that the skills most in demand were programming and coding. This year, they looked for the more specific data science tools mentioned in job postings.

Here are the Top 10 in-demand skills for data scientists:


Skill        % of jobs with skill

SQL          56%
Hadoop       49%
Python       39%
Java         36%
R            32%
Hive         31%
MapReduce    22%
NoSQL        18%
Pig          16%
SAS          16%

I’m sure it is relatively easy for employers to test prospective data scientists for their proficiency in any of the above tools and data platforms. But how do they test for their efficiency in removing commas?
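For readers who have never had the pleasure, this is roughly what "removing commas" looks like in practice. It is only a sketch: the function name and the sample values are made up for illustration, and it assumes US-style formatting, where commas are thousands separators and the period is the decimal point.

```python
def remove_commas(raw):
    """Convert a string such as '$104,000' or '1,234.50' to a float.

    Assumes US-style formatting (comma = thousands separator).
    Returns None for empty or unparseable values instead of raising.
    """
    cleaned = raw.strip().replace(",", "").lstrip("$")
    if not cleaned:
        return None
    try:
        return float(cleaned)
    except ValueError:
        return None


# Made-up sample values of the kind found in a messy spreadsheet column.
raw_values = ["$104,000", " 98,500 ", "", "n/a", "1,234.50"]
cleaned_values = [remove_commas(value) for value in raw_values]
print(cleaned_values)
```

Returning None rather than raising an error is a design choice: it lets a whole column be processed in one pass and leaves the decision about missing values for later. Data that uses the comma as a decimal mark would need a different rule.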

Originally published on

About GilPress

I launched the Big Data conversation; writing, research, marketing services; &

23 Responses to Data Scientists Spend Most of Their Time Cleaning Data

  1. Spend no time cleaning data and more time using it for business insights with Maestro:


  2. And for good reason! The data needed for analysis is largely coming from transactional (OLTP) systems, and is unstructured or normalized. But in order to be effectively analyzed using most tools it must first be flattened (denormalized), often into a star schema data model to yield high performance. Couple this with the need to remove bad data, re-name codes, fix hierarchies and calculate missing fields, and you end up with a ton of work that must be completed before any analysis can begin. Keep in mind that the data sources are constantly changing and updating, so this work of preparing the data is an ongoing exercise, not a one-time thing. This is why many companies opt to develop a data warehouse, thus automating the data preparation work and providing clean, transformed data to analysts and data scientists. It isn’t as sexy as talking about big data or distributed data discovery, but good “old-fashioned” data staging is still the answer to this problem.


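The staging work described in the second response above (flattening normalized tables, removing bad data, renaming codes, and calculating missing fields) can be sketched in a few lines. This is a minimal illustration and not a data warehouse design: every table name, column name, code mapping, and row below is invented, and the output is one flat table instead of a full star schema.

```python
# All table names, column names, and rows below are invented for illustration.
customers = {
    1: {"name": "Acme", "region_code": "NE"},
    2: {"name": "Globex", "region_code": "SW"},
}
orders = [
    {"order_id": 10, "customer_id": 1, "quantity": 3, "unit_price": 20.0, "total": None},
    {"order_id": 11, "customer_id": 2, "quantity": 1, "unit_price": 15.0, "total": 15.0},
    # Bad data: refers to a customer that does not exist.
    {"order_id": 12, "customer_id": 9, "quantity": 2, "unit_price": 5.0, "total": 10.0},
    # Bad data: a negative quantity.
    {"order_id": 13, "customer_id": 1, "quantity": -4, "unit_price": 20.0, "total": None},
]
REGION_NAMES = {"NE": "Northeast", "SW": "Southwest"}


def stage_orders(orders, customers):
    """Flatten orders and customers into one analysis-ready list of rows."""
    flat_rows = []
    for order in orders:
        customer = customers.get(order["customer_id"])
        # Remove bad data: orphaned orders and non-positive quantities.
        if customer is None or order["quantity"] <= 0:
            continue
        # Calculate the missing field from the fields that are present.
        total = order["total"]
        if total is None:
            total = order["quantity"] * order["unit_price"]
        flat_rows.append({
            "order_id": order["order_id"],
            # Denormalize: copy the customer attributes onto the order row.
            "customer": customer["name"],
            # Rename codes to readable labels.
            "region": REGION_NAMES.get(customer["region_code"], "Unknown"),
            "quantity": order["quantity"],
            "total": total,
        })
    return flat_rows


staged = stage_orders(orders, customers)
for row in staged:
    print(row)
```

Because the source tables keep changing, a function like this would have to be rerun on every refresh, which is the commenter's argument for automating the preparation work in a warehouse.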
