Data Pre-Processing techniques exist to ensure data quality, and they can be grouped in different ways. There is no consensus on the best grouping, but common categories include data cleaning, feature transformation, feature learning and data augmentation. In this article we will focus on feature transformation.

In a Statistics context, data transformation is the application of a function to each element in a dataset, usually with the purpose of ‘linearizing’ the data and thus making them usable for linear regression modelling. In a Data Science context, data (or feature) transformation includes the previous definition but also many more techniques, most of them motivated by intrinsic limitations of many Machine Learning algorithms, such as a limited capacity to weight features with very different ranges equally and difficulty in directly handling categorical features. Since all these techniques work better on clean data (e.g.: without outliers or missing values), performing a good data cleaning process beforehand is extremely important. Examples of feature transformation methods include feature scaling and normalization, feature selection, feature bucketing, and feature encoding:


Scaling and normalization: Feature scaling and normalization refer to adjusting the values of a data feature so that they fit within a pre-defined range (e.g.: the [-1, 1] interval) and/or follow a pre-defined distribution (e.g.: the normal distribution), while keeping most other properties (e.g.: ordering) unchanged. This is important to avoid numerical instabilities, speed up algorithmic convergence, and ensure all features contribute proportionally to their true impact on the output during model training. Common approaches include min-max normalization and Z-score normalization.
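Both approaches mentioned above can be sketched in a few lines of plain Python (the feature values below are just illustrative toy data):

```python
from statistics import mean, pstdev

x = [2.0, 4.0, 6.0, 8.0, 10.0]  # toy feature values

# Min-max normalization: rescale each value to the [0, 1] interval.
lo, hi = min(x), max(x)
x_minmax = [(v - lo) / (hi - lo) for v in x]

# Z-score normalization: subtract the mean, divide by the
# (population) standard deviation, yielding zero mean and unit variance.
mu, sigma = mean(x), pstdev(x)
x_zscore = [(v - mu) / sigma for v in x]

print(x_minmax)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(x_zscore)
```

Note that both transformations preserve the ordering of the values, as mentioned above; in practice, library implementations (e.g.: scikit-learn's scalers) also remember the fitted parameters so the same transformation can be applied to new data.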


Selection: Feature selection is the process of picking a subset of relevant features for model training that ideally provide results similar to the ones that would be obtained when using all features. Reducing the number of features is important not only from a business perspective (e.g.: improved model interpretability) but also from a Machine Learning perspective (e.g.: shorter model training times, enhanced generalization and reduced overfitting). Examples of feature selection methods include information gain analysis and backward feature elimination.
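As an illustration of the first method mentioned above, here is a minimal sketch of information gain for a categorical feature against a class label, computed as the reduction in entropy of the labels after splitting by the feature's values (function names and toy data are ours, not a standard API):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy reduction of `labels` when split by `feature` values."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# A feature that perfectly predicts the label has maximal gain (1 bit here);
# one that is independent of the label has zero gain.
print(information_gain(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 1.0
print(information_gain(["x", "y", "x", "y"], [0, 0, 1, 1]))  # 0.0
```

Ranking features by this score and keeping only the top ones is a simple filter-style selection strategy; backward feature elimination instead starts from all features and repeatedly drops the one whose removal hurts model performance the least.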


Bucketing: Feature bucketing is a form of quantization that consists of mapping continuous values onto a set of pre-defined discrete values, or of grouping together several values close to each other into a single one. In the first case, the original value range is split into small intervals (i.e.: ‘buckets’) and the original continuous values that fall within a given bucket are replaced by the discrete value that represents that bucket. The second case can be seen either as a generalization of the first to intrinsically multidimensional data (e.g.: images) or as a way to adjust the granularity of the data (e.g.: computing daily averages from hourly data). Feature bucketing is useful for reducing both the impact of noise and the total amount of data to be processed, at the cost of some loss of information. It can be applied to both numeric and categorical features.
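Both cases described above can be sketched in plain Python. Equal-width binning illustrates the first case (each value is replaced by the midpoint of its bucket); the daily-average example illustrates the second. The function name and toy values are ours:

```python
def bucketize(values, lo, hi, n_buckets):
    """Equal-width binning: replace each value by its bucket's midpoint."""
    width = (hi - lo) / n_buckets
    out = []
    for v in values:
        i = min(int((v - lo) / width), n_buckets - 1)  # clamp the top edge
        out.append(lo + (i + 0.5) * width)
    return out

# Two buckets over [0, 1]: values map to midpoints 0.25 or 0.75.
buckets = bucketize([0.1, 0.4, 0.5, 0.9], 0.0, 1.0, 2)
print(buckets)  # [0.25, 0.25, 0.75, 0.75]

# Coarser granularity: daily averages from hourly readings (24 per day).
hourly = list(range(48))  # two days of fake hourly data
daily = [sum(hourly[i:i + 24]) / 24 for i in range(0, len(hourly), 24)]
print(daily)  # [11.5, 35.5]
```

Note how both transformations shrink the set of distinct values, trading information for noise reduction and less data, exactly the trade-off described above.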


Encoding: Feature encoding consists of transforming non-numerical data (e.g.: categorical features) into numeric features that contain the same information. This is a crucial step before using many Machine Learning algorithms, as all but a few exceptions (e.g.: Decision Trees) can only process numeric inputs. The most common method for feature encoding is one-hot encoding, in which a categorical variable that can take N distinct values is replaced by N binary variables, where for each original value one variable has value 1 and all others have value 0 (e.g.: a variable that can take values dog/cat/mouse is replaced by 3 binary variables whose concatenated representation can be 001, 010 or 100, respectively). Other feature encoding methods include ordinal encoding and frequency encoding. The first is useful when a categorical feature has an intrinsic ordering (e.g.: cold/warm/hot); the second is useful when the occurrence frequency of each value taken by the categorical feature is important. Feature encoding can lead to an explosion in the number of features, meaning that dimensionality reduction methods are usually required afterwards.
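The three encoding methods mentioned above can be sketched in plain Python (the toy feature values and the alphabetical category ordering are ours; libraries such as scikit-learn or pandas provide production-ready equivalents):

```python
from collections import Counter

values = ["dog", "cat", "mouse", "dog", "dog"]

# One-hot: N distinct values become N binary indicator columns.
categories = sorted(set(values))  # ['cat', 'dog', 'mouse']
one_hot = [[int(v == c) for c in categories] for v in values]
print(one_hot[0])  # 'dog' -> [0, 1, 0]

# Ordinal: only sensible for features with an intrinsic ordering.
order = {"cold": 0, "warm": 1, "hot": 2}
ordinal = [order[v] for v in ["cold", "hot", "warm"]]
print(ordinal)  # [0, 2, 1]

# Frequency: replace each value by its occurrence count.
counts = Counter(values)
frequency = [counts[v] for v in values]
print(frequency)  # [3, 1, 1, 3, 3]
```

The one-hot example makes the feature explosion mentioned above concrete: a single 3-valued variable already becomes 3 columns, so a categorical feature with thousands of distinct values (e.g.: a ZIP code) would add thousands of columns.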


José Portêlo
Lead Machine Learning Engineer