Training and fine-tuning machine learning algorithms require vast quantities of data; however, not all data is good data. Thoroughly preparing datasets prior to model training can improve algorithm performance and reduce data costs while also ensuring ethical decision-making. This guide describes how to prepare data for machine learning by breaking down the process into five steps and discussing tools that can help.

How to prepare data for machine learning: A step-by-step guide

1

Data collection

The first step is collecting data to train the model. This data can come from a wide variety of sources, both internal (such as sales records or customer experience data) and external (such as websites and academic data repositories). Data for machine learning can be structured (organized in tables and spreadsheets), but it’s often unstructured, such as raw text from chat or call transcripts, images, and videos, or semi-structured, such as XML and JSON files. All three types are typically stored in a data lake because it can hold files in their original format.

2

Data cleaning

Next, data needs to be cleaned up, which involves identifying and correcting any missing, duplicate, irrelevant, or incorrect data. Other issues to look for include imbalanced datasets that contain a significantly higher number of data points in one class than another and statistical outliers that could skew results.
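As a rough illustration of these checks, the sketch below removes exact duplicates, fills missing values, flags outliers, and counts class labels using only the Python standard library. The toy records, the median fill, and the 1.5 × IQR outlier rule are assumptions chosen for the example, not a prescribed method; real pipelines usually rely on a dataframe library.

```python
from collections import Counter
from statistics import median, quantiles

# Hypothetical toy records: (customer_id, age, label). None marks a missing value.
raw_rows = [
    (1, 34, "churn"),
    (2, None, "stay"),
    (2, None, "stay"),   # exact duplicate of the previous row
    (3, 29, "stay"),
    (4, 31, "stay"),
    (5, 240, "stay"),    # implausible age, likely a data-entry error
    (6, 38, "stay"),
]

def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def fill_missing_age(rows):
    """Replace missing ages with the median of the observed ages."""
    observed = [age for _, age, _ in rows if age is not None]
    fallback = median(observed)
    return [(cid, fallback if age is None else age, label)
            for cid, age, label in rows]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

deduped = drop_duplicates(raw_rows)
filled = fill_missing_age(deduped)
outliers = iqr_outliers([age for _, age, _ in filled])
label_counts = Counter(label for _, _, label in filled)  # reveals class imbalance

print(outliers)
print(label_counts)
```

Whether a flagged outlier should be dropped, corrected, or kept depends on the dataset; the code only surfaces candidates for review.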

This is also the time to detect and remove any personally identifiable information (PII) and other sensitive data that could pose security or privacy risks. Screening data for toxicity (like offensive language) or biases against particular groups is also recommended at this stage. Tools such as Granica Screen can automatically identify PII, bias, and toxicity in training data to help streamline the cleansing process.
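To make the idea of PII removal concrete, here is a minimal sketch that masks email addresses and phone numbers with regular expressions. The two patterns and the `redact_pii` helper are assumptions written for this example; they are far narrower than what a dedicated tool covers and do not represent how any particular product works.

```python
import re

# Deliberately simple patterns, assumed for illustration. Real PII detection
# needs much broader coverage (names, addresses, IDs, locale-specific formats).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact_pii(text):
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

transcript = "Call me at 555-123-4567 or write to jane.doe@example.com."
redacted = redact_pii(transcript)
print(redacted)
```

Replacing matches with placeholder tokens, rather than deleting them, keeps the sentence structure intact for downstream text models.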

Explore an interactive demo of Granica Screen to see industry-leading PII, bias, and toxicity discovery capabilities in action.

3

Data transformation

After cleaning, data must be transformed into a format that’s usable by machine learning algorithms. Data teams use a variety of transformation techniques, including those listed in the table below.

Machine Learning Data Transformation Techniques

| Technique | Description |
| --- | --- |
| Dimensionality reduction | Reduces the number of variables while retaining only the information relevant to solving a particular problem. |
| Discretization | Transforms continuous variables, such as temperature and height, into discrete categories, such as “hot” or “tall.” |
| Encoding | Converts categorical information, such as hair colors or car models, into numerical data that the ML model can read. |
| Log transformation | Applies a logarithmic function to dataset values to reduce the influence of outliers and heavily skewed data. |
| Normalization | Standardizes the distribution of variables within a dataset to balance their importance to the ML algorithm. |
| Scaling | Standardizes the range of variables within a dataset so the algorithm considers them all with equal importance. |
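Several of these techniques fit in a few lines each. The sketch below shows one plain-Python version of log transformation, scaling (min-max), normalization (read here as z-score standardization, which is one common interpretation), encoding (one-hot), and discretization. All function names, bin edges, and labels are assumptions made for the example; dimensionality reduction is left out because it needs linear algebra support.

```python
import math
from bisect import bisect_right
from statistics import mean, pstdev

def log_transform(values):
    """log(1 + x) compresses large values; assumes non-negative inputs."""
    return [math.log1p(v) for v in values]

def min_max_scale(values):
    """Scaling: map values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(categories):
    """Encoding: one binary column per distinct category, in sorted order."""
    vocab = sorted(set(categories))
    return [[1 if c == v else 0 for v in vocab] for c in categories]

def discretize(values, edges, labels):
    """Discretization: edges split the number line into len(edges) + 1 bins."""
    return [labels[bisect_right(edges, v)] for v in values]

temperatures = [10, 20, 30]
print(min_max_scale(temperatures))
print(discretize(temperatures, edges=[15, 25], labels=["cold", "mild", "hot"]))
print(one_hot(["red", "blond", "red"]))
```

In practice the scaling parameters (minimum, maximum, mean, standard deviation) are computed on the training data only and then reused on validation and test data, so no information leaks between the sets.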

4

Data annotation

Data annotation involves labeling data with the features the model needs to recognize in a format that the algorithm understands. An example would be labeling all of the obstacles in a dashcam video used to train an autonomous driving algorithm.
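For the dashcam example, an annotation might be stored as one record per video frame with a bounding box for each obstacle. The schema below is hypothetical (the `BoundingBox` and `FrameAnnotation` names, fields, and sample values are all invented for illustration); real projects generally follow whatever format their labeling tool and training framework expect.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class BoundingBox:
    label: str    # class name, e.g. "pedestrian"
    x: int        # left edge in pixels
    y: int        # top edge in pixels
    width: int    # box width in pixels
    height: int   # box height in pixels

@dataclass
class FrameAnnotation:
    video_id: str
    frame_index: int
    boxes: list   # list of BoundingBox objects

frame = FrameAnnotation(
    video_id="dashcam_0001",
    frame_index=120,
    boxes=[
        BoundingBox("pedestrian", 412, 230, 48, 130),
        BoundingBox("traffic_cone", 640, 410, 30, 45),
    ],
)

# asdict() converts nested dataclasses, so the record serializes cleanly.
record = json.dumps(asdict(frame))
print(record)
```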

5

Data splitting

Finally, data is split into subsets – typically a training dataset that the algorithm learns from, a validation dataset used to evaluate model performance during tuning, and a testing dataset used for a final check of performance before the model goes into production. The three sets should not overlap, so teams can validate and test model performance using information the algorithm hasn’t encountered before.
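A three-way split can be sketched as a seeded shuffle followed by two cuts. The 80/10/10 proportions and the `train_val_test_split` helper are assumptions for illustration; appropriate ratios depend on dataset size and the problem at hand.

```python
import random

def train_val_test_split(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a copy of items and cut it into three disjoint subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```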

Data teams use a variety of approaches to splitting datasets, including:

  • Stratified splitting = Grouping data by class label and then randomly sampling from each group so that every subset keeps roughly the same class proportions as the full dataset
  • Time series splitting = Preserving the chronological order of time series data by segmenting data into fragments representing different time periods
  • K-Fold cross-validation = Dividing datasets into “k” equally sized folds to enable multiple rounds of training and validation
  • Random sampling = Splitting datasets randomly, an approach used on very large datasets with balanced distributions
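The first three approaches can be sketched as index-generating helpers. These are simplified, assumed implementations: the time series version is a single chronological holdout rather than multiple period-based segments, and the k-fold version does not shuffle. Libraries such as scikit-learn provide more complete splitters.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.25, seed=0):
    """Return (train_idx, test_idx) with each class split at roughly test_frac."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * test_frac)
        test_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def time_series_split(n_samples, test_frac=0.2):
    """Chronological holdout: train on the earliest rows, test on the latest."""
    cut = n_samples - round(n_samples * test_frac)
    return list(range(cut)), list(range(cut, n_samples))

labels = ["a"] * 8 + ["b"] * 4
train_idx, test_idx = stratified_split(labels)
folds = list(k_fold_indices(10, 3))
ts_train, ts_test = time_series_split(10)
```

Returning indices instead of copies of the data lets the same helpers work with any row-addressable dataset.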

Cost-effective data management for ML and AI

The five steps described above outline the general process for preparing data so that finished models meet performance expectations. Thorough data preparation can also have a favorable side effect: lower costs.

Effectively cleaning and curating data before training can help reduce data lake storage and transfer expenses without reducing model quality. Tools like Granica also use highly compute-efficient algorithms to make the cleansing and splitting process less expensive in cloud data lake environments.

For more information about how to prepare data for machine learning while optimizing costs, contact Granica today.

Post by Granica
December 11, 2024