Data Pre-Processing techniques exist to ensure data quality, and they can be grouped in different ways. There is no consensus on the best way to define these groups, although they usually include data cleaning, feature transformation, feature learning and data augmentation. In this article we will focus on data augmentation.
It is a well-established fact that the larger the amount of data available for a given problem, the more likely it is that a model can be trained to solve it well. This can easily be observed in most Computer Vision, Natural Language Processing and Speech Processing tasks. There are undoubtedly technical difficulties in efficiently storing and processing large amounts of data, but here we will focus on the effects and consequences of the intrinsic difficulties of the data acquisition process itself, which make it difficult (and sometimes impossible) to collect as much useful data as desired. The term useful relates to the ability to collect enough data representing each possible pattern or event of interest: only if a model can be trained to correctly identify all these patterns or events can its input data be considered useful.
Only in very few scenarios is the amount of available data similarly balanced across all patterns of interest (i.e., the data classes). It is much more frequent that a small subset of classes represents almost the entirety of the available data, due to the intrinsic nature of the phenomenon being analysed (e.g.: the number of valid credit card transactions vs. fraudulent ones, the number of healthy CAT scan images vs. those with signs of cancer). If the class imbalance is not too pronounced, then adjusting the class weights (or hyperparameters with an equivalent impact) in the Machine Learning algorithm, or modifying the model evaluation metrics, helps reduce the undesired impact of imbalanced data on the obtained results. For the remaining (and far more common) situations, data augmentation methods are usually the way to go. Examples of such methods include undersampling, oversampling and data synthesis:
Undersampling: In the Data Science context, undersampling means lowering the number of samples of the most represented data classes, thus reducing the imbalance between them and the least represented ones. The need for undersampling is usually linked to practical reasons (e.g.: the resource costs associated with handling large amounts of data), making it less frequently employed than oversampling. Examples of undersampling techniques include random undersampling, cluster centroids and Tomek links.
Oversampling: In the Data Science context, oversampling means increasing the number of samples of the least represented data classes, thus reducing the imbalance between them and the most represented ones. Since the new samples are obtained by duplicating or interpolating between existing ones, they fall within the convex hull defined by the original data (i.e.: the smallest convex set that contains the original data points). Examples of oversampling techniques include random oversampling and SMOTE.
Undersampling (of the dominant classes) and oversampling (of the minority classes) are usually performed together, since this balances the opposite effects of the two techniques on the overall dataset; the sketch below illustrates the combination.
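As a rough illustration of how such a combination might look in practice, the sketch below first oversamples a toy imbalanced dataset with SMOTE and then randomly undersamples the majority class using the imbalanced-learn package; the toy dataset, the 95%/5% class split and the chosen sampling ratios are illustrative assumptions, not recommended values.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Toy dataset with a 95% / 5% class split, standing in for e.g. valid vs.
# fraudulent credit card transactions.
X, y = make_classification(n_samples=10_000, n_classes=2,
                           weights=[0.95, 0.05], random_state=42)
print("original:    ", Counter(y))

# SMOTE interpolates between a minority sample and one of its nearest minority
# neighbours, so the synthetic points stay within the convex hull of the
# minority class. Here the minority class is grown to 50% of the majority size.
X_over, y_over = SMOTE(sampling_strategy=0.5, random_state=42).fit_resample(X, y)
print("oversampled: ", Counter(y_over))

# Random undersampling then discards majority samples until the
# minority/majority ratio reaches 0.8, so the majority class does not have to
# be cut as aggressively as it would without the oversampling step.
X_bal, y_bal = RandomUnderSampler(sampling_strategy=0.8,
                                  random_state=42).fit_resample(X_over, y_over)
print("rebalanced:  ", Counter(y_bal))
```

Printing the class counts before and after each step is a quick sanity check that the chosen ratios behave as expected.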
Data synthesis: Similarly to oversampling, data synthesis aims to increase the number of samples of the least represented data classes or to add feature variability to the training data. However, this is the term commonly used when complex data sources (e.g.: images, speech, text) or high-dimensional sparse data need to be analysed and processed, as more elaborate transformations (e.g.: elastic deformations, colour modification, noise injection, random erasing) are required. Given the nature of most of these transformations, samples produced through data synthesis can fall outside the convex hull defined by the original data; a sketch of two such transformations follows below. If the issue of low data variability persists after applying these techniques, Generative Adversarial Networks can be used as a more advanced solution.
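To make these transformations more concrete, the sketch below applies two of them (noise injection and random erasing) to a placeholder image using only NumPy; the function names, noise level and patch-size limits are illustrative assumptions rather than a prescribed recipe.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def inject_noise(image: np.ndarray, sigma: float = 10.0) -> np.ndarray:
    """Noise injection: add zero-mean Gaussian noise to a uint8 image."""
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def random_erase(image: np.ndarray, max_fraction: float = 0.3) -> np.ndarray:
    """Random erasing: overwrite a randomly placed rectangular patch with noise."""
    erased = image.copy()
    h, w = image.shape[:2]
    eh = rng.integers(1, max(2, int(h * max_fraction)))
    ew = rng.integers(1, max(2, int(w * max_fraction)))
    top = rng.integers(0, h - eh + 1)
    left = rng.integers(0, w - ew + 1)
    erased[top:top + eh, left:left + ew] = rng.integers(
        0, 256, size=(eh, ew) + image.shape[2:])
    return erased

# Synthesise a handful of new variants of a (placeholder) minority-class image.
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
augmented = [random_erase(inject_noise(image)) for _ in range(5)]
```

In practice, libraries such as torchvision or albumentations already provide these and many other image transformations out of the box.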
José Portêlo
Lead Machine Learning Engineer