Do you recall how many times you’ve read articles titled “This is what a Data Scientist does” or “Differences between a Data Scientist and a Data Analyst”? Such articles usually come with colourful (and sometimes funnily shaped) Venn diagrams that arbitrarily present the overlap between data professions and the distribution of activities (e.g. ML modelling, data storage, data visualisation) across data roles. That’s usually fine for the average reader who wants a rough, high-level overview of the various data roles, but do we, as data professionals, know in detail what other data roles involve? Wouldn’t it be great if we could extract all this (and more) information directly from actual data and get more detailed and less biased results?
The present analysis uses data collected from the 2019 Kaggle ML & DS Survey and attempts to build a profile around 6 key data roles, shedding some light on their activities and preferences and debunking a few urban myths.
- Does a Data Engineer spend time doing Computer Vision?
- Is Machine Learning something that a Data Analyst does?
- Which data professions use linear regression more than any other ML algorithm? (spoiler alert: ALL of them)
- Is anyone still reading academic publications or do we all just learn from blogs?
Let’s find out!
The analysis focuses on 6 data roles: Data Scientist, Data Analyst, Research Scientist, Business Analyst, Data Engineer and Statistician. A key component of each role’s profile is the data-related activities professionals practice as part of their job. Based on the data provided, 7 key areas are taken into consideration:
- Data Analysis – Analysing and understanding data to influence product or business decisions
- Data Visualisation – Using data visualisation libraries and tools on a regular basis
- Data Infrastructure – Building and/or running the data infrastructure that the company/organisation is using
- Applied Machine Learning – Building or iterating over ML models to improve existing products/workflows or applying ML on new problems
- Machine Learning Research – Doing research that advances the state of the art of ML
- Computer Vision – Using computer vision methods on a regular basis
- Natural Language Processing – Using NLP methods on a regular basis
Information about most data activities can be extracted directly from the answers provided to the question “Select any activities that make up an important part of your role at work”. However, some activities are inferred indirectly from answers to related questions. For example, if someone’s answer to the question “Which categories of computer vision methods do you use on a regular basis?” is anything but None, then one can infer that this individual is practising Computer Vision, at least to some extent.
- It is important to highlight that the radar charts that follow do not indicate how skilled people in each profession are. They show the proportion of people in each role who practise each data activity.
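The inference rule and the proportions behind the radar charts can be sketched roughly as follows. This is a minimal illustration on a toy table, not the code used for the analysis: the column names (`role`, `cv_methods`, `activities`), the list-based storage of multi-select answers and the choice to count an unanswered question as "not practising" are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for survey responses. Column names and answer encoding are
# hypothetical and do not reflect the real survey file layout.
responses = pd.DataFrame({
    "role": ["Data Scientist", "Data Scientist", "Data Engineer", "Data Engineer"],
    # Multi-select answers stored as lists of selected options.
    "cv_methods": [["Image classification"], ["None"], [], ["None"]],
    "activities": [
        ["Analyze and understand data", "Build ML prototypes"],
        ["Analyze and understand data"],
        ["Build and/or run the data infrastructure"],
        [],
    ],
})


def practises_cv(selected):
    """True if at least one selected option is something other than 'None'.

    An empty selection (question left unanswered) counts as False here,
    which is an assumption of this sketch.
    """
    return any(option != "None" for option in selected)


# Indirect inference: any answer but "None" implies the activity is practised.
responses["computer_vision"] = responses["cv_methods"].apply(practises_cv)

# Direct extraction: the activity was ticked in the "activities" question.
responses["data_infrastructure"] = responses["activities"].apply(
    lambda selected: "Build and/or run the data infrastructure" in selected
)

# Proportion of respondents in each role practising each activity.
# The mean of a boolean column is the share of True values.
shares = responses.groupby("role")[["computer_vision", "data_infrastructure"]].mean()
print(shares)
```

Each row of `shares` corresponds to one role and each column to one axis of that role's radar chart, so a value of 0.5 reads as "half of the respondents in this role practise the activity", not as a skill level.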
Profiles are additionally complemented by information about:
- Salaries – How much each profession earns per year (focusing on US salaries).
- Education / Learning – Academic degrees, online learning platforms and media sources.
- Tools and Algorithm Preferences – Algorithms, programming languages and other tools.
Data Scientists are the highest paid group. Given that they are very active across all 7 categories, this should be no surprise.
More than 70% of Data Scientists do applied Machine Learning, but just over 20% claim to do groundbreaking ML research. Another piece of evidence that you don’t need a PhD to join the club.
Almost 40% build or run Data Infrastructure in their organisations. Is this the best way to use a Data Scientist’s time? Maybe management needs to understand the importance of hiring Data Engineers.