A task-oriented approach to Machine Learning allows us to divide it into three categories: supervised learning (the data are labelled according to some criteria), unsupervised learning (no labels exist for the data), and reinforcement learning (no dataset is directly available, only an interactive environment). In this article we will focus on supervised learning.
In supervised machine learning, the term ‘supervised’ comes from the fact that, during model training, the machine has access to the correct answers (i.e., the labels) for the questions (i.e., the data features) being posed to it. This means that each time the machine iterates over the dataset, it receives feedback on what the model is doing right and what it is doing wrong, expressed as the output of a loss function. The machine uses this feedback to successively tune the model, converging towards a minimum of the loss function. Once a (local) minimum is reached, the training process stops and the model is ready to use.
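To make this feedback loop concrete, here is a minimal sketch, assuming a toy linear model trained by gradient descent on a mean-squared-error loss (the data and hyperparameters are invented for illustration):

```python
import numpy as np

# Toy dataset: features x, labels y following a noisy linear rule.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 100)

w, b = 0.0, 0.0        # model parameters, to be tuned
learning_rate = 0.5

for epoch in range(200):
    y_pred = w * x + b                    # the model's current answers
    loss = np.mean((y_pred - y) ** 2)     # feedback: mean squared error
    # The gradients of the loss indicate what the model is doing wrong...
    grad_w = 2 * np.mean((y_pred - y) * x)
    grad_b = 2 * np.mean(y_pred - y)
    # ...and the parameters are successively tuned to reduce the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f}, final loss={loss:.4f}")
```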
The choice of algorithm for solving any supervised learning problem depends on a number of different factors, as each algorithm has its own strengths and weaknesses and none of them is the best for solving all problems (a direct consequence of the ‘no free lunch theorem’). Some of these factors have varying levels of impact depending on the choice of algorithm, while others are intrinsic to the data:
- Bias-variance trade-off: Bias relates to incorrect assumptions the algorithm makes about the data, causing it to ignore relevant relations between the features and the desired output, which leads to underfitting. Variance relates to sensitivity to small changes (e.g.: noise) in the data features, which leads to overfitting. The sweet spot is an algorithm that produces a model which both correctly detects the patterns in the training data (low bias) and generalizes well to new data (low variance), although this is often impossible to achieve in practice.
- Algorithm complexity vs. data complexity: The complexity of the function learned by the algorithm should match the complexity of the data. If the data are ‘simple’, then a model with high bias and low variance will learn all relevant patterns from a small amount of data. If the data are ‘complex’, then a model with low bias and high variance, together with a large amount of data, is required for all relevant patterns to be correctly detected.
- Dimensionality of the data: If the data have a large number of features, then the algorithm may have a hard time finding the relevant ones. Although models with high bias and low variance tend to perform better in this scenario, it is usually preferable to remove irrelevant features or to apply dimensionality reduction methods to the data.
- Heterogeneous data: If the data contain features of different types (e.g.: discrete, continuous, categorical) and/or with different ranges, many algorithms may not work properly or at all (e.g.: Support Vector Machines, Logistic Regression, Neural Networks), meaning that the data should first go through some preprocessing steps (see the sketch after this list). However, algorithms such as Decision Trees are naturally robust to heterogeneous data.
- Redundant data: If the data contain highly correlated features, then some algorithms (especially linear and distance-based ones) will produce poor results due to numerical instability. Possible ways to minimize this are to detect the correlated features (keeping only one from each correlated group) or to introduce regularization into the training process.
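To illustrate the preprocessing mentioned above for heterogeneous data, here is a minimal sketch using scikit-learn (the feature names and values are invented for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical dataset mixing continuous, discrete and categorical features.
df = pd.DataFrame({
    "age": [25, 40, 31, 58],                     # continuous
    "visits": [1, 7, 3, 2],                      # discrete
    "colour": ["red", "blue", "red", "green"],   # categorical
})
labels = [0, 1, 0, 1]

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["age", "visits"]),  # align numeric ranges
    ("encode", OneHotEncoder(), ["colour"]),         # encode categories
])

# Scale-sensitive algorithms such as Logistic Regression then receive
# homogeneous, comparable features.
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(df, labels)
```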
The impact of these factors can be limited (but not completely removed) by choosing appropriate values for the algorithm hyperparameters. Hyperparameters are inputs defined by the user that control some aspects of the algorithm's behaviour. Possible approaches for finding the best hyperparameter values include performing an exhaustive grid search over a predefined set (which works best when some insight into reasonable value ranges is available) or an early-stopping approach such as successive halving (many random hyperparameter combinations are evaluated at first, but only the models being trained with the most promising ones are kept until the end of the process). In order to achieve the best bias-variance trade-off, the data are usually divided into three subsets: training, validation and testing (80-10-10 and 70-15-15 ratios are common). The training set is the main dataset, used to fit the algorithm parameters. The validation set is used to evaluate the model's fit to the training set and to tune the hyperparameters. The testing set serves as an independent evaluation of the best model, as it contains data that are unseen from the model's perspective.
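A minimal sketch of such a split combined with an exhaustive grid search, assuming a Support Vector Machine and an arbitrary hyperparameter set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# 80-10-10 split: the testing set is held out until the very end.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Exhaustive grid search over a predefined hyperparameter set.
best_model, best_score = None, -1.0
for C in [0.1, 1.0, 10.0]:
    for gamma in ["scale", 0.01]:
        model = SVC(C=C, gamma=gamma).fit(X_train, y_train)
        score = model.score(X_val, y_val)  # validation set tunes hyperparameters
        if score > best_score:
            best_model, best_score = model, score

# The testing set provides an independent evaluation of the best model.
print("accuracy on unseen data:", best_model.score(X_test, y_test))
```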
The two main tasks associated with supervised learning are classification and regression:
Classification: A brief definition of classification is predicting an object's category, i.e., a discrete value (for example, separating boxes by colour). It can be divided into binary classification (two-class problems) and multi-class classification. Binary classification corresponds to determining whether the data belong to a class or not (e.g.: is a particular mushroom edible or not). Multi-class classification corresponds to identifying the correct class from a pre-defined set of values (e.g.: what is the colour of a box in a picture). Training a single multi-class classifier usually requires a complex algorithm and large amounts of data in order for all differentiating patterns to be learned by the model. An alternative is to consider many simpler one-versus-all models, one per class, and make a decision based on their combined output, as sketched below. Each of these simple models performs binary classification and requires less data to train.
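A minimal sketch of the one-versus-all strategy, using scikit-learn's OneVsRestClassifier on the classic three-class Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # three-class problem
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One binary classifier is trained per class; the final prediction is the
# class whose classifier produces the most confident output.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)
print("accuracy:", ovr.score(X_test, y_test))
```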
Metrics commonly used to evaluate a classification model's performance include accuracy, F-measure and ROC curves. Classification metrics measure how well the model is able to predict the correct label from the data. Examples of real-life classification use-cases include spam filtering (binary), fraud detection (binary) and sentiment analysis (multi-class).
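For illustration, these metrics can be computed as follows (the labels and scores below are made up for a binary problem):

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                     # ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                     # hard label predictions
y_scores = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]   # predicted probabilities

print("accuracy :", accuracy_score(y_true, y_pred))
print("F-measure:", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_scores))  # area under the ROC curve
```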
Regression: A brief definition of regression is predicting a numeric (continuous) value (for example, predicting a box's length). It works by attempting to predict the label (the dependent variable) using the remaining data features (the independent variables). One of the main difficulties in correctly training a regression model is having data that respect its underlying assumptions. These assumptions are: 1) the data are representative of the population at large, 2) the independent variables are measured with no error, 3) the variance of the errors is the same across all values of the independent variables (homoscedasticity), 4) the independent variables are uncorrelated with one another. If these assumptions are not met, then the data should be pre-processed before training the model; otherwise, there is no guarantee of the quality of the model.
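A minimal regression sketch on synthetic data, including a quick check of assumption 4 via the feature correlation matrix:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))  # independent variables (features)
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 4.0 + rng.normal(0, 0.5, 200)  # dependent variable (label)

# Quick multicollinearity check (assumption 4): off-diagonal values close to
# |1| would call for dropping a feature or adding regularization.
print(np.corrcoef(X, rowvar=False))

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
```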
Metrics commonly used to evaluate a regression model's performance include root-mean-square error (RMSE) and the coefficient of determination (R²). Regression metrics measure the closeness of fit between the model and the data. Examples of real-life regression use-cases include stock price forecasts, sales volume analysis and medical diagnosis.
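Computed on hypothetical true values and predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.1, 4.8, 7.2, 9.9])
y_pred = np.array([3.0, 5.1, 6.8, 10.2])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root-mean-square error
print("RMSE:", rmse, "R^2:", r2_score(y_true, y_pred))
```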
José Portêlo
Lead Machine Learning Engineer