Two of the most important pillars in Data Science and Machine Learning, which support the ability to extract meaningful insights and knowledge from data, are: 1) how representative the data are of the phenomenon being measured and 2) the quality level of those data. How well the data represent a phenomenon is determined exclusively during the data collection process, and is characterized by factors such as the type of measurements being taken and how the sensors performing these measurements are placed. After the raw data are collected, it is common practice to run an ETL (Extract, Transform, Load) process over them: extracting them from the different sources, transforming them according to storage or application needs, and loading them into the end target (where they are made available for consumption).

From a Data Science perspective, the ‘Transform’ step of ETL is the most important one regarding data quality. In this context it is commonly referred to as Data Pre-Processing because, from a data scientist's point of view, this is usually when the data are first made available to them, and it corresponds to the first step in Machine Learning pipelines. Ensuring data quality is critical: if the data do not have sufficient quality after the Data Pre-Processing step for value to be extracted from them, then all the following steps of the pipeline are mostly inconsequential, a property informally known as ‘garbage in, garbage out’.


Data will have a high level of quality if they are accurate, complete and consistent:

  • Accuracy: Data accuracy refers to how closely data values match the expected ones; inaccurate data contain erroneous values that differ from the expected. Factors that affect data accuracy include: 1) errors produced during data input and/or data transmission, 2) incorrect formats of the data columns, and 3) data duplication.
  • Completeness: Data completeness refers to the absence of missing values in the data features. Factors that affect data completeness include: 1) data unavailability (which can be caused by a temporary problem in connectivity to the data warehouse or by a failure to collect the data at the source), 2) deletion of data that was initially considered irrelevant, and 3) an inadequate data cleaning process.
  • Consistency: Data consistency refers to self-consistency (the same data stored at different locations match exactly), believability (all data contents are equally trusted by users) and interpretability (the meaning of all data is equally easy to understand by users), among other aspects.
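As a rough illustration of how some of these properties could be measured, the sketch below computes a completeness ratio for one field and counts exact duplicate records (one of the accuracy factors listed above). The records, field names and helper functions are hypothetical and use only the Python standard library; a real project would more likely rely on a data-frame library.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "city": "Lisbon"},
    {"id": 2, "city": None},
    {"id": 1, "city": "Lisbon"},   # exact duplicate of the first record
    {"id": 3, "city": "Porto"},
]

def completeness(rows, field):
    """Fraction of rows that have a non-missing value for `field`."""
    return sum(row.get(field) is not None for row in rows) / len(rows)

def duplicate_count(rows):
    """Number of rows that exactly repeat an earlier row."""
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            duplicates += 1
        seen.add(key)
    return duplicates

city_completeness = completeness(records, "city")
n_duplicates = duplicate_count(records)
```

Metrics like these only cover the measurable part of data quality; aspects such as believability and interpretability still need human judgement.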


All the properties above may be achieved by applying data pre-processing techniques to raw data. These techniques are not exclusive to Data Science: a considerable overlap exists with Data Engineering regarding 1) where and when they should be applied to the data (e.g.: near the source or near the end target), 2) which ones and in which order they should be applied, and 3) who is responsible for applying them (e.g.: the data scientist or the data engineer).


There is no consensus on how to group the different data pre-processing techniques, but data cleaning, feature transformation, feature learning and data augmentation are commonly used categories:


Data Cleaning: Data cleaning consists of detecting and correcting (or possibly removing) corrupt or incomplete records in the data. A wide variety of issues is possible, namely missing values, duplicates or near-duplicates, outliers or other noise, inconsistencies such as out-of-range values, and incorrect data types. Specific techniques have been developed for each of them; for the more frequent and impactful ones, professional tools also exist.
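A minimal sketch of a few of these cleaning steps is shown below, assuming a hypothetical list of records with an "age" field that arrives as text. It removes exact duplicates, converts the field to an integer, treats out-of-range values as missing, and fills missing values with the median. The valid range and the choice of median imputation are illustrative assumptions, not a recommendation for every dataset.

```python
from statistics import median

# Hypothetical raw records: ages arrive as strings, some missing or invalid.
raw_records = [
    {"id": 1, "age": "34"},
    {"id": 2, "age": None},    # missing value
    {"id": 3, "age": "29"},
    {"id": 3, "age": "29"},    # exact duplicate
    {"id": 4, "age": "-5"},    # out-of-range value
    {"id": 5, "age": "41"},
]

def clean(records, min_age=0, max_age=120):
    """Deduplicate, fix types, drop out-of-range values, impute the median."""
    seen, rows = set(), []
    for rec in records:
        key = (rec["id"], rec["age"])
        if key in seen:                      # skip exact duplicates
            continue
        seen.add(key)
        age = int(rec["age"]) if rec["age"] is not None else None  # str -> int
        if age is not None and not (min_age <= age <= max_age):
            age = None                       # treat out-of-range as missing
        rows.append({"id": rec["id"], "age": age})
    # Impute missing ages with the median of the valid ones
    # (assumes at least one valid age is present).
    fill = median(r["age"] for r in rows if r["age"] is not None)
    return [{"id": r["id"], "age": fill if r["age"] is None else r["age"]}
            for r in rows]

cleaned = clean(raw_records)
```

Whether an invalid value should be imputed, corrected or removed altogether depends on the data and on how the downstream model will use it.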


Feature Transformation: Feature transformation (in the context of Data Science) is particularly relevant as part of Machine Learning pipelines, as it specifically addresses the characteristics and intrinsic limitations of many Machine Learning algorithms, both in terms of data variability and feature types. Examples of feature transformations include scaling and normalization, selection, bucketing, and encoding.
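The transformations named above can be sketched in a few lines of plain Python. The functions and sample values below are hypothetical, and the scaling helpers assume the feature is not constant; in practice a library such as scikit-learn would typically provide these operations.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and unit variance (assumes non-zero spread)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def bucketize(value, edges):
    """Index of the first bucket whose upper edge is greater than value."""
    for index, edge in enumerate(edges):
        if value < edge:
            return index
    return len(edges)

def one_hot(values):
    """Encode categorical values as 0/1 vectors, one column per category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]

ages = [20, 30, 40, 60]
colours = ["red", "blue", "red"]

scaled_ages = min_max_scale(ages)                     # values in [0, 1]
standardized_ages = standardize(ages)                 # zero mean, unit variance
age_buckets = [bucketize(a, [30, 50]) for a in ages]  # <30, 30-49, >=50
categories, encoded_colours = one_hot(colours)
```

Scaling matters mostly for algorithms that are sensitive to feature magnitudes, while encoding is needed for algorithms that only accept numerical inputs.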


Feature Learning: The main purpose of feature learning (also known as feature engineering) is to represent the data in such a way that it is easier for Machine Learning algorithms to obtain a good model from them. Traditionally, feature learning has been a manual process that derived features from business insights and visualization of the data. Nowadays, it is common practice to use (deep) neural networks for feature learning, as they are capable of detecting complex patterns in millions of data samples far more efficiently than any human ever could, and their hidden layers intrinsically represent the new features being learned.


Data Augmentation: Data augmentation addresses the issue of insufficient data. There may be a lack of data overall (a problem common in Computer Vision tasks, which is usually solved through data synthesis) but sometimes the issue is that the data are unbalanced (i.e.: some important events are very rare and therefore they are only represented by a few data points). To tackle this, it is possible to undersample the dominant classes or to oversample the minority classes.
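The two resampling strategies can be sketched as follows, assuming the simplest variant of each: random oversampling (duplicating minority samples with replacement) and random undersampling (keeping a random subset of the dominant class). The function names, the class labels and the example data are hypothetical; more elaborate methods that synthesize new samples are not shown.

```python
import random
from collections import Counter

def _group_by_class(samples, labels):
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return groups

def random_oversample(samples, labels, seed=0):
    """Duplicate random minority samples until all classes match the largest."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = max(len(items) for items in groups.values())
    out_samples, out_labels = [], []
    for label, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels

def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = min(len(items) for items in groups.values())
    out_samples, out_labels = [], []
    for label, items in groups.items():
        out_samples.extend(rng.sample(items, target))
        out_labels.extend([label] * target)
    return out_samples, out_labels

# Hypothetical imbalanced dataset: 8 common events and 2 rare ones.
samples = list(range(10))
labels = ["common"] * 8 + ["rare"] * 2

over_samples, over_labels = random_oversample(samples, labels)
under_samples, under_labels = random_undersample(samples, labels)
over_counts = Counter(over_labels)
under_counts = Counter(under_labels)
```

Oversampling keeps all the original information but repeats rare samples, whereas undersampling discards data from the dominant class; resampling is usually applied to the training split only, so that evaluation still reflects the real class distribution.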


All the data pre-processing techniques mentioned above were presented in a very summarized way. Each of them is filled with nuances, and the best way to apply them depends greatly on the data being analysed.


José Portêlo

Lead Machine Learning Engineer