Data Pre-Processing techniques exist to ensure data quality and they can be grouped in different ways. There is no consensus on the best way to create these groups, although examples of such groups usually include data cleaning, feature transformation, feature learning and data augmentation. In this article we will focus on feature learning.
It is important to start by clarifying the difference between feature learning and feature transformation. The two are distinct in a fundamental aspect: while the objective of feature transformation is to change the data so that they no longer conflict with the intrinsic limitations of Machine Learning algorithms, the objective of feature learning is to modify the data so that it is easier for Machine Learning algorithms to extract relevant information from them. This modification usually consists of either replacing existing features with new ones or adding new features to the data.
Feature learning is an important step in the data pre-processing pipeline because it provides a priori guidance to Machine Learning algorithms (e.g.: highlighting relevant patterns from existing data), allowing them to converge faster and to reach better solutions. The current meaning of feature learning includes both the manual and automated processes for generating features from the data. The original meaning of feature learning included only the automated processes, as the previously existing manual process was known as feature engineering. However, the development and increased use of automatic feature-generation methods (especially the ones based on Deep Neural Networks) have contributed to the usage of feature learning as an umbrella term for both feature engineering and feature learning. Nevertheless, although automatic processes have replaced the manual ones for some types of features, both are still needed as they complement each other.
Manual feature engineering: Feature engineering is the process of using domain knowledge (e.g.: business insights, usually accumulated throughout years of experience) and data visualization to extract features from the data. In the case of evaluating creditworthiness, one such generated feature would state whether or not an individual has defaulted on a debt. When dealing with timestamp data, it is also common to generate features using a calendar as reference (e.g.: weekdays vs. weekends; holidays vs. workdays; split the timestamp into month, week, day, hour). Feature engineering also includes the generation of features from external data sources (i.e.: not related to the original data sources corresponding to the problem at hand), such as including meteorological data (e.g.: hourly average temperature, daily amount of rain).
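The calendar-based features mentioned above can be sketched as follows. This is a minimal illustration using only the Python standard library; the holiday list and the helper name are hypothetical, chosen for the example.

```python
from datetime import datetime

# Illustrative holiday list as (month, day) pairs -- an assumption for
# this sketch, not a real calendar.
HOLIDAYS = {(12, 25), (1, 1)}

def calendar_features(ts: datetime) -> dict:
    """Derive calendar-based features from a single timestamp."""
    return {
        "month": ts.month,
        "week": ts.isocalendar()[1],        # ISO week number
        "day": ts.day,
        "hour": ts.hour,
        "is_weekend": ts.weekday() >= 5,    # Saturday=5, Sunday=6
        "is_holiday": (ts.month, ts.day) in HOLIDAYS,
    }

feats = calendar_features(datetime(2023, 12, 25, 14, 30))
```

In a real pipeline, such a function would be applied to every row of a timestamp column, turning one raw value into several features a model can exploit directly.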
Although some manual feature engineering processes can be easily and efficiently automated, this is not true in all cases. A common example is recasting a non-linear relationship between a feature and the target label as a linear dependency. Automated feature learning processes consider this possibility by applying non-linear functions to the different features, but, due to combinatorial explosion, this is usually limited to low-degree polynomials. For instance, if a set of features has a complex non-linear relationship with the target label, a blind automated feature learning process may not accurately represent it. However, this relationship may become obvious if domain knowledge is used and a hand-crafted feature is generated for it.
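A small sketch of this limitation, using only the standard library: the target depends on the ratio of two features, which a degree-2 polynomial expansion (x1, x2, x1*x2, squares) cannot express, while a hand-crafted ratio feature captures it exactly. The data is synthetic and the setup is purely illustrative.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

random.seed(0)
x1 = [random.uniform(1, 10) for _ in range(500)]
x2 = [random.uniform(1, 10) for _ in range(500)]
# Target depends on the ratio x1/x2 (assumed domain knowledge).
y = [a / b for a, b in zip(x1, x2)]

# Low-degree polynomial features, as a blind automated process would try:
auto = {
    "x1": pearson(x1, y),
    "x2": pearson(x2, y),
    "x1*x2": pearson([a * b for a, b in zip(x1, x2)], y),
}
# Hand-crafted feature built from domain knowledge:
crafted = pearson([a / b for a, b in zip(x1, x2)], y)
```

Here `crafted` correlates perfectly with the target, while none of the low-degree polynomial features come close; a model fed only the automated features would have to work much harder to recover the relationship.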
Automated feature learning: Early automated feature learning processes include computing simple statistics (e.g.: max, mean, sum, count) when row-level data aggregation makes sense (e.g.: set of items purchased by each customer) and applying dimensionality reduction methods (e.g.: Principal Component Analysis). Recent automated feature learning processes rely mainly on Deep Neural Networks (e.g.: auto-encoders, convolutional neural networks), as their hidden layers intrinsically represent the new features being learned. For example, in a convolutional neural network for face recognition, the initial hidden layers represent simple visual features (like straight lines and edges), the middle hidden layers represent more intricate features (such as skin and hair textures), and the final hidden layers represent complex shapes (such as eyes and noses).
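The row-level aggregation idea above can be sketched in a few lines: group transaction-level records by customer, then compute the simple statistics mentioned (max, mean, sum, count) as customer-level features. The transaction data is made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transaction-level data: (customer_id, purchase_amount).
transactions = [
    ("alice", 30.0), ("bob", 12.5), ("alice", 45.0),
    ("bob", 7.5), ("alice", 15.0),
]

# Group row-level records by customer.
grouped = defaultdict(list)
for customer, amount in transactions:
    grouped[customer].append(amount)

# Simple aggregate statistics become customer-level features.
customer_features = {
    customer: {
        "max": max(amounts),
        "mean": mean(amounts),
        "sum": sum(amounts),
        "count": len(amounts),
    }
    for customer, amounts in grouped.items()
}
```

Each customer now has a fixed-length feature vector regardless of how many transactions they made, which is exactly what most Machine Learning algorithms require as input.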
The main advantage of automated feature learning processes is that they are much easier to perform than their manual counterparts, and no domain knowledge is required for generating hundreds (or even thousands) of potentially useful features. Nevertheless, some Machine Learning algorithms perform poorly when given too many input features, so it is recommended to evaluate beforehand the actual usefulness of each feature being generated.
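One simple way to carry out that evaluation is a correlation-based filter: keep only the generated features whose correlation with the target exceeds a threshold. The feature values, names, and threshold below are all hypothetical, chosen for illustration; in practice fancier criteria (e.g.: mutual information) are often used.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

# Hypothetical generated features and target label.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
generated = {
    "f_useful": [1.1, 2.0, 2.9, 4.2, 5.0],       # tracks the target
    "f_noise": [3.0, 1.0, 4.0, 1.0, 5.0],        # unrelated
    "f_constantish": [2.0, 2.1, 2.0, 2.1, 2.0],  # nearly constant
}

# Keep only features sufficiently correlated with the target
# (threshold is an illustrative choice).
THRESHOLD = 0.5
selected = [name for name, vals in generated.items()
            if abs(pearson(vals, target)) > THRESHOLD]
```

Filtering the generated features this way keeps the input dimensionality manageable before the data ever reaches a model that is sensitive to irrelevant features.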