The Data Science Workflow

Carlos Salas

Portfolio Manager and Data Scientist

Data science and machine learning form a fast-growing industry with a slow-growing skills base. In fact, the data skills shortage is costing the UK economy £2 billion a year, yet most organisations are unsure how to find and structure a data science team. Join Carlos Salas in this video as he explores the data science workflow and machine learning algorithms.

The Data Science Workflow

11 mins 34 secs

Overview

Western economies are currently finding it difficult to close the skills gap in data-intensive workforce niches such as data science and machine learning. The most important job roles in this space include business analysts, data scientists, software developers, data engineers and DevOps engineers. The easiest way to fully understand each team member's role is to first understand the whole data science workflow feedback loop, which can be broken down generically into five stages: problem definition, data analysis, model research, model prototype and model deployment.

Key learning objectives:

  • Identify data science job roles

  • Understand the data science workflow

Summary

What’s the problem with finding talent in data science? 

A shortage of data skills has become prevalent across many western countries as economies struggle to close the gap in data-intensive niches such as data science and machine learning. The UK is a clear example of this phenomenon: a government report published in 2021 estimated that between 178,000 and 234,000 data roles were yet to be filled, and an analysis conducted in 2018 found that data-skills shortages were already costing the UK economy £2 billion a year.

What are the most important jobs in data science?

Although there are many roles within a data science workflow, the most important are:

Business Analysts. Business analysts try to narrow the gap between IT and business by identifying how data can be linked to actionable business insights.

Data Scientists. Data Scientists gather and analyse information from databases and application programming interfaces (APIs) in order to explore the data, create visualisations and train machine learning models that extract insights for business decision-makers.

Software Developers. Software Developers are the link between Data Scientists and Data Engineers and their main role is to develop production versions of the models developed by Data Scientists. In other words, Software Developers play an important role in making the internally-developed models scalable.

Data Engineers. Data Engineers develop, maintain, test, and evaluate big data solutions within the organisation. They create data pipelines, big data platforms, and data integrations into databases, data warehouses, and data lakes, working with both on-premise and cloud technologies.

DevOps Engineers. DevOps Engineers rely on a combination of people, processes, and technology to deliver machine learning and software solutions in a robust, scalable, reliable, and automated way.

What are the data science workflow stages? 

1. Problem Definition: Business analysts, and sometimes data scientists, ask questions about the business in order to work out which problems need solving.

2. Data Analysis: Data scientists conduct Exploratory Data Analysis (EDA), data transformation and feature selection to prepare inputs for machine learning models.

3. Model Research: Data scientists develop and test multiple machine learning models to describe or predict the data, producing a tool that can answer the questions posed earlier in a systematic manner.

4. Model Prototype: Software developers use data scientist feedback to build a production prototype that will make use of the machine learning model on a regular basis.

5. Model Deployment: Software developers, data engineers and DevOps engineers collaborate to efficiently deploy a machine learning model prototype.

What is involved in Data Analysis and Model Research, the most important and time-consuming stages?

1. Data Analysis

Exploratory Data Analysis. This consists of understanding the data via visualisation, descriptive statistics and statistical inference tests, using multiple techniques such as univariate analysis, multivariate analysis, correlation analysis and normality tests, among others.
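
As a rough illustration (not taken from the video), these EDA steps might look as follows in Python with pandas and SciPy; the file name returns.csv and the column asset_return are hypothetical:

    import pandas as pd
    from scipy import stats

    # Hypothetical dataset; any tabular dataset works here.
    df = pd.read_csv("returns.csv")

    # Univariate analysis: per-column summary statistics.
    print(df.describe())

    # Correlation analysis: pairwise correlations between numeric columns.
    print(df.corr(numeric_only=True))

    # Normality test (Shapiro-Wilk) on a single hypothetical column.
    stat, p_value = stats.shapiro(df["asset_return"])
    print(f"Shapiro-Wilk p-value: {p_value:.4f}")  # small p suggests non-normality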

Data transformation. This means identifying, ex ante, the requirements of the chosen machine learning models and applying multiple transformations so that the data can be more easily digested by those models. Common data transformation methodologies include rescaling, standardisation, and normalisation.
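
A minimal sketch of those three transformations, assuming scikit-learn's preprocessing module and a toy feature matrix:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

    # Toy feature matrix: two features on very different scales.
    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

    # Rescaling: map each feature into the [0, 1] range.
    X_rescaled = MinMaxScaler().fit_transform(X)

    # Standardisation: zero mean and unit variance per feature.
    X_standardised = StandardScaler().fit_transform(X)

    # Normalisation: scale each row (sample) to unit norm.
    X_normalised = Normalizer().fit_transform(X)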

Feature engineering. This describes the process of selecting, manipulating, and transforming the raw data into features that can be used in the machine learning model in order to improve the model’s performance and robustness.
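
For instance, a sketch of deriving model-ready features from a hypothetical raw transactions table (the column names are invented for illustration):

    import numpy as np
    import pandas as pd

    # Hypothetical raw table with a timestamp and a transaction amount.
    raw = pd.DataFrame({
        "timestamp": pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-04"]),
        "amount": [120.0, 95.5, 210.0],
    })

    # Engineer features the model can digest more easily.
    features = pd.DataFrame({
        "day_of_week": raw["timestamp"].dt.dayofweek,                      # calendar feature
        "log_amount": np.log1p(raw["amount"]),                             # tame skewness
        "amount_3d_mean": raw["amount"].rolling(3, min_periods=1).mean(),  # rolling aggregate
    })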

Feature selection. This is the process of trimming down the number of features to improve the performance and robustness of the model. Feature selection can be executed using techniques such as the following (a code sketch appears after the list):

- Mean Decrease Impurity (MDI), based on using in-sample data and a tree-based classifier

- Mean Decrease Accuracy (MDA), based on out-of-sample data and any type of classifier algorithm
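
A minimal sketch of both techniques, assuming scikit-learn (MDI via a random forest's impurity-based importances, MDA via permutation importance on held-out data); the synthetic dataset is purely illustrative:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a real dataset.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # MDI: impurity-based importances from a tree ensemble, computed in-sample.
    forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    mdi_importances = forest.feature_importances_

    # MDA: permutation importance on out-of-sample data; works with any classifier.
    mda = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
    mda_importances = mda.importances_mean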

2. Model Research

Model selection. This is carried out by understanding the problem at hand. The data scientist splits the dataset into training and test data before proceeding to the subsequent steps. This train-test split is essential to avoid creating models that only perform well under very specific circumstances.
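
A minimal sketch of that split, assuming scikit-learn and a synthetic dataset:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the project's dataset.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

    # Hold out 20% of the samples as a final, untouched test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )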

Cross-validation. This stage consists of training the model with a portion of the in-sample data while the robustness of the model is confirmed using validation data. Some machine learning models contain specific parameters, or hyperparameters, that require calibration via cross-validation. 
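
Hyperparameter calibration via cross-validation might look like the sketch below, using scikit-learn's GridSearchCV; the SVM and its parameter grid are arbitrary choices for illustration:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Calibrate the SVM's hyperparameters with 5-fold cross-validation,
    # touching only the training (in-sample) data.
    param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_, round(search.best_score_, 3))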

Generalisation performance. This is the last stage, where the data scientist selects the best cross-validated model and tests it using data from the test sample in order to understand whether or not the model generalises out-of-sample.
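
Extending the cross-validation sketch above, the final out-of-sample check might look like this (again on illustrative synthetic data):

    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5).fit(X_train, y_train)

    # Score the best cross-validated model exactly once on the held-out test set.
    test_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))
    print(f"Out-of-sample accuracy: {test_accuracy:.3f}")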

Carlos Salas

Carlos Salas is a professional investor passionate about the lifelong development of an investment process that blends man and machine. Over the last 15 years, he has worked in investment roles for firms such as Santander AM, BNP Paribas, Jefferies, and LCAM. He is currently pursuing three careers simultaneously - as an investment manager, consultant and lecturer.
