Data Preparation for AI
Data Processing Workflow Overview
In this topic, you will review the data preparation workflow and why it is important. An ML model learns patterns from historical data to make informed predictions or decisions on new, unseen data. ML model effectiveness depends on the availability of reliable data, so an effective data preparation process is essential for generating reliable and accurate predictions. Data preparation consists of multiple steps, such as collecting, preprocessing, storing, and retrieving data. Inaccurate, incomplete, inconsistent, or biased data will lead to faulty model predictions, potentially resulting in incorrect conclusions, poor business decisions, and negative impacts on customer experience. Improved data preparation techniques that focus on efficient data labeling, augmentation, and management while maintaining a fixed model architecture have significantly enhanced model outcomes.
Data preparation transforms raw data into a more structured and usable data set that uncovers embedded data patterns. These techniques remove the irrelevant noise and identify the significant patterns and insights that are important for addressing your problem.
The data annotation team must ensure that high-quality, carefully cleaned, labeled, and curated data sets are available for ML scientists to train their models. Data acquisition and preprocessing are iterative and continuous processes. Automating them as much as possible is good practice. It is also important to encourage collaboration between data engineers, data scientists, and domain experts to better understand data needs and challenges.
Even though each AI data preparation workflow is unique, the overall steps can be summarized as follows:
Collect and store data
Preprocess data
Label the data
Transform the data (feature engineering)
Note: The data transformation process, also known as feature engineering, involves selecting, transforming, and creating variables (features) from raw data to enable efficient model learning. ML features are measurable properties or attributes of the observed objects, which are represented as input data samples. Features can be represented as a number or category.
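For illustration, the tiny, hypothetical data set below (the column names are assumptions, not from the course) shows how each row is one input sample and each column is a feature, with both numeric and categorical types:

```python
import pandas as pd

# Hypothetical input samples: each row is one observation,
# each column is a feature the model can learn from.
samples = pd.DataFrame({
    "tenure_months": [3, 24, 11],                    # numeric feature
    "monthly_spend": [19.99, 54.50, 32.00],          # numeric feature
    "plan_type": ["basic", "premium", "standard"],   # categorical feature
})
print(samples.dtypes)
```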

AI model complexity depends on the type of problem it is addressing. Programming frameworks define the libraries and techniques that are available to data scientists but have no impact on model accuracy. Infrastructure capabilities and capacities, such as graphics processing units (GPUs), CPUs, storage, and networking resources, influence the speed of model training and decision-making, but not the accuracy.
Data Processing Workflow Phases
There are several common techniques that are used in different phases of the data preparation workflow. Phases include collecting raw data, preprocessing, labeling, and transforming data to make it usable for training your models.
Collect and Store Data
You can acquire external data sets, such as online user behavior statistics, or even generate them with AI techniques. AI-generated data provides vast data sets cost-effectively but can lack realism and might result in inaccurate predictions.
Three types of data are used for training AI models:
Structured text-based data (stored in relational databases and data warehouses)
Semi-structured data, like text or documents (stored in NoSQL [not only Structured Query Language] databases and data lakes)
Unstructured text, images, audio, and video recordings (stored in data lakes)
Data comes in various formats, which can be classified as follows:
Text-based (human-readable): This data is ideal for configurations, small data sets, and data interchange between systems where human readability is important.
Binary (machine-readable): This data is suited for larger data sets, complex data structures, and scenarios where efficiency and performance are critical. Binary formats are often used in databases, scientific computing, and high-performance data processing applications.
Note: The optimal data type and format depends on the goals that you want to achieve and the structure of your input data.
The following table provides examples of popular data formats.
|Format|Binary or Text|Structure or Description|Usage|
|---|---|---|---|
|JavaScript Object Notation (JSON)|Text|Semistructured|Configuration files|
|XML|Text|Semistructured|Web services, document storage, and exchange|
|Comma-separated values (CSV)|Text|Structured|Data processing tools and libraries|
|Parquet|Binary|Column-oriented|Apache Spark, Amazon Redshift|
|Avro|Binary|Row-oriented|Hadoop|
|Protobuf|Binary|Extensible mechanism for serializing structured data|TensorFlow|
|Pickle|Binary|Python-specific binary serialization format for serializing and deserializing Python objects; useful for saving ML models, intermediate results, and custom data structures|Scikit-Learn|
|Open Neural Network Exchange (ONNX)|Binary|Open-source format for representing ML models, designed to facilitate the transfer of models between various frameworks|PyTorch, TensorFlow|
Note: Data serialization is the process of converting data structures or objects into a format that can be easily stored, transmitted, and reconstructed later. Serialization is essential for data exchange between various systems, particularly when they are written in various programming languages or run on various platforms.
The data structure can be either row- or column-oriented. The row-oriented format is optimized for accessing the specific data sample (with all its features), whereas the column-oriented format is optimized for accessing a single feature (across all the samples).
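As a minimal sketch of how these formats are used in practice (assuming pandas is available and, for the Parquet call, a Parquet engine such as pyarrow is installed), the same small data set can be serialized to text-based row-oriented formats, a binary column-oriented format, and a Python-specific pickle:

```python
import pickle
import pandas as pd

# Small sample data set with one categorical and two numeric features.
df = pd.DataFrame({
    "device": ["router", "switch", "router"],
    "cpu_load": [0.42, 0.77, 0.51],
    "errors": [0, 3, 1],
})

# Text-based, human-readable formats: good for small data sets and interchange.
df.to_csv("samples.csv", index=False)
df.to_json("samples.json", orient="records")

# Binary, column-oriented format: efficient for large analytical workloads
# (assumes a Parquet engine such as pyarrow is installed).
df.to_parquet("samples.parquet")

# Python-specific binary serialization: convenient for intermediate results,
# but only readable from Python.
with open("samples.pkl", "wb") as f:
    pickle.dump(df, f)

# Reading the column-oriented file back can load only the columns you ask for.
subset = pd.read_parquet("samples.parquet", columns=["cpu_load"])
print(subset.head())
```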
Data can be user-generated (such as text and images) or system-generated (such as infrastructure and application logs). There are two distinct types of data:
Historical data: This data is stored in storage systems such as databases and data warehouses. This data is usually processed periodically in batch jobs using tools such as MapReduce or Spark. Batch processing is used to compute static features, which rarely change.
Streaming data: Real-time transport systems such as Apache Kafka or RabbitMQ process this type of data. Stream processing is used to compute dynamic features, which change more often.
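As a minimal, hypothetical illustration of this distinction (the column names and aggregation choices are assumptions for illustration only), a static feature can be recomputed periodically over the full history, while a dynamic feature is recomputed over a short, recent window as new events arrive:

```python
import pandas as pd

# Hypothetical event log: one row per transaction.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 25.0, 5.0, 7.5, 40.0],
    "ts": pd.to_datetime([
        "2024-05-01", "2024-05-03", "2024-05-01", "2024-05-02", "2024-05-04",
    ]),
})

# Static feature: lifetime average amount per user, recomputed in a periodic batch job.
static_avg = events.groupby("user_id")["amount"].mean().rename("lifetime_avg_amount")

# Dynamic feature: spend over the last two days per user, which changes as new events arrive.
recent = events[events["ts"] >= events["ts"].max() - pd.Timedelta(days=2)]
dynamic_sum = recent.groupby("user_id")["amount"].sum().rename("recent_2d_spend")

print(static_avg)
print(dynamic_sum)
```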
Note: It is useful to record the source of each data sample and its labels, which is known as data lineage. Data lineage helps identify potential biases in your data and troubleshoot your models.
Data Preprocessing
Data preprocessing includes sampling, cleaning, formatting, and transforming data. Preprocessing is done to make the data usable for the learning (model training) stage. Preprocessing typically involves some of the following data transformations (several of them are combined in the sketch that follows this list):
Sampling data: Sampling is the process of choosing data from the data set that will be used for training. Various data splits are typically created for training, validation, and testing. There are two sampling families: nonprobability sampling (selection is not based on any probability criteria) and random sampling.
Cleaning data: Cleaning involves deleting the unused or invalid data.
Transforming the format: This process changes the data structure to fit your application's needs.
Handling missing data: Empty fields are usually replaced with an average value, interpolated, or deleted.
Handling outliers: Rare data points that deviate significantly from the rest of the data set must be identified and deleted.
Handling data errors: Invalid data must be transformed or deleted.
Dealing with imbalanced data: A data set is imbalanced when it contains significantly more data points for one class than for the others, which leads to biased models. This process handles class imbalance using some form of resampling technique or synthetically generated data. Alternatively, you could make your algorithm more robust to imbalanced data.
Augmenting data: If you do not have enough data, you can use some data augmentation techniques (perturbation, label-preserving data transformation) on existing data to increase the amount of training data.
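The sketch below combines several of these steps on a made-up data set using pandas and scikit-learn (the column names, thresholds, and split sizes are illustrative assumptions): imputing a missing value, removing an outlier, rebalancing classes by resampling, and creating stratified training and test splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical raw data with a missing value, an outlier, and a class imbalance.
df = pd.DataFrame({
    "latency_ms": [12.0, 15.0, None, 14.0, 980.0, 13.0, 16.0, 11.0],
    "label":      [0,    0,    0,    0,    0,     0,    1,    1],
})

# Handle missing data: fill the empty field with a typical value
# (the median is robust to the outlier still present in the column).
df["latency_ms"] = df["latency_ms"].fillna(df["latency_ms"].median())

# Handle outliers: drop points far outside the interquartile range.
q1, q3 = df["latency_ms"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["latency_ms"] >= q1 - 1.5 * iqr) & (df["latency_ms"] <= q3 + 1.5 * iqr)]

# Deal with imbalanced data: upsample the minority class to match the majority.
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
df = pd.concat([majority, minority_up])

# Sample the data: create stratified training and test splits.
train, test = train_test_split(df, test_size=0.25, stratify=df["label"], random_state=42)
print(len(train), "training samples,", len(test), "test samples")
```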
Data Labeling
Most of the AI/ML techniques that are used in practice belong to the class of supervised learning techniques. Supervised learning relies on labeled data: the model must learn from examples with known labels to correctly recognize patterns in the data. Data labeling assigns meaningful tags or annotations to raw data. The labels typically define the target output that the model should predict. Labels provide context to data to help with the ML model-learning process. A team of dedicated annotators usually labels this data manually.
Some tasks can avoid manual labeling by using natural labels that are already present in the data. Natural labels occur in tasks where the application automatically evaluates the model's predictions. Examples include a recommendation system (the system knows whether you accepted the recommendation) and Google Maps (the model estimates the time of arrival for a certain route, and when you arrive, the actual travel time shows how accurate the prediction was).
Because getting high-quality labels is not easy, there are several techniques to generate or enhance existing sets of labels to improve the model quality (a weak supervision sketch follows the list):
Weak supervision: This technique uses simple and efficient rules to generate labels.
Semi-supervision: This technique uses structural assumptions to generate labels.
Transfer learning: The model is already trained for other tasks and reused as a starting point.
Active learning: This technique increases learning efficiency by choosing which data samples to label, prioritizing the samples that are most useful for your model.
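For example, weak supervision can be approximated with simple labeling functions, as in the hypothetical sketch below (the keyword rules and label names are assumptions for illustration; production systems typically combine and denoise many such rules with a dedicated framework):

```python
# Hypothetical weak supervision: simple rules ("labeling functions") assign noisy
# labels to unlabeled support tickets instead of manual annotation.

UNLABELED_TICKETS = [
    "Router keeps rebooting after firmware upgrade",
    "How do I change my billing address?",
    "Switch port flapping, packet loss observed",
]

def label_by_keywords(text: str) -> str:
    """Return a weak label based on simple keyword rules; 'unknown' if no rule fires."""
    technical_terms = ("router", "switch", "firmware", "packet")
    billing_terms = ("billing", "invoice", "payment")
    lowered = text.lower()
    if any(term in lowered for term in technical_terms):
        return "technical_issue"
    if any(term in lowered for term in billing_terms):
        return "billing_question"
    return "unknown"

weak_labels = [(ticket, label_by_keywords(ticket)) for ticket in UNLABELED_TICKETS]
for ticket, label in weak_labels:
    print(f"{label:17s} <- {ticket}")
```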
Feature Engineering (Data Transformation)
Feature engineering requires expertise in the specific field to extract relevant features from raw data. Data scientists transform raw data into the specific features needed to accomplish specific tasks. The extracted features feed the model training process and enable accurate models. Feature engineering involves the following tasks (several of them are combined in the sketch that follows this list):
Scaling: This task normalizes data points to similar ranges.
Discretization: This task turns a continuous value into a discrete value.
Encoding: This task transforms categorical data into numerical representations.
Decomposition: This task breaks down complex data into simpler, more manageable components.
Aggregation: This task combines multiple data features into one.
Dimensionality reduction: This task transforms a data set to use fewer features (dimensions) while preserving as much of the information in the data as possible.
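The sketch below demonstrates a few of these tasks with scikit-learn on a made-up data set (the column names, bin counts, and component counts are illustrative assumptions): scaling numeric features, discretizing a continuous value, encoding a category, combining features, and reducing dimensionality.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer, OneHotEncoder
from sklearn.decomposition import PCA

# Hypothetical raw features: two numeric columns and one categorical column.
df = pd.DataFrame({
    "age": [23, 45, 31, 58, 37],
    "income": [32_000, 81_000, 47_500, 96_000, 52_000],
    "segment": ["consumer", "enterprise", "consumer", "enterprise", "smb"],
})

# Scaling: bring numeric features to a similar range (zero mean, unit variance).
scaled = StandardScaler().fit_transform(df[["age", "income"]])

# Discretization: turn the continuous income into a small number of ordered bins.
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
income_bins = discretizer.fit_transform(df[["income"]])

# Encoding: turn the categorical segment into numeric one-hot columns.
segment_onehot = OneHotEncoder().fit_transform(df[["segment"]]).toarray()

# Aggregation / assembly: combine the engineered features into one matrix.
features = np.hstack([scaled, income_bins, segment_onehot])

# Dimensionality reduction: keep fewer dimensions while preserving most of the variance.
reduced = PCA(n_components=2).fit_transform(features)
print(reduced.shape)  # (5, 2)
```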