There has been considerable hype in many business areas lately, with ‘data evangelists’ selling Machine Learning (ML) as a Holy Grail and promising that it will allow companies not only to solve most of their data-related problems but also to achieve “exponential growth in the next 3 to 5 years”. Adding to these false expectations, concepts such as Big Data, Data Science and Machine Learning are sometimes mixed together and presented as one single big notion, increasing the confusion and the lack of consensus on the actual borders between the three. This post attempts to clear up that confusion by offering a simple visual representation of the relationship between them, focusing on Machine Learning.

Going straight to the point: the main (and only) goal of ML is to automatically predict results from the available data. If a given problem cannot be stated as an instance of this, then it probably is not an ML problem to begin with. Machine Learning sits on three pillars: (Big) Data, Variables and Algorithms. Each of them is critical for achieving this goal.

The three pillars of Machine Learning (ML).
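
To make the goal concrete before diving into the pillars, here is a minimal sketch in Python with scikit-learn: learn a mapping from historical data, then use it to predict the outcome for unseen data. The data and the meaning of the columns are made up purely for illustration.

```python
# A minimal sketch of the ML goal: learn from available data, then predict.
# The data and column meanings below are hypothetical, purely for illustration.
from sklearn.linear_model import LogisticRegression

X_train = [[25, 1], [40, 0], [35, 1], [50, 0]]  # e.g. [age, is_subscriber]
y_train = [1, 0, 1, 0]                          # e.g. clicked (1) or not (0)

model = LogisticRegression()
model.fit(X_train, y_train)  # learn a mapping from observations to outcomes

# The whole point: automatically predict the result for a new observation.
print(model.predict([[30, 1]]))
```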

(Big) Data: Data are the cornerstone of all learning. They can come from a variety of sources, including tables, text, images, audio and sensors. The more diverse the data are, the better the resulting predictions are likely to be. There are two main ways to collect data: manually or automatically.

Despite its higher cost, manual data collection is preferable to automatic collection, as data quality (including the quality of the labelling) is usually the dominant factor in the quality of the results produced. However, collecting high-quality data is extremely challenging. It is no coincidence that even the few companies that reveal their algorithms rarely reveal their datasets.
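
To illustrate what data-quality problems look like in practice, the hedged sketch below runs two basic checks on a toy pandas DataFrame (the column names and values are hypothetical): missing inputs, a common symptom of automatic collection, and conflicting labels, a common symptom of poor labelling.

```python
# A sketch of two basic data-quality checks; all names and data are made up.
import pandas as pd

df = pd.DataFrame({
    "text":  ["good product", "bad service", "good product", None],
    "label": ["positive", "negative", "negative", "positive"],
})

# Missing values: often a symptom of automatic collection gone wrong.
print(df.isna().sum())

# Conflicting labels: the same input annotated with different labels.
conflicts = df.dropna().groupby("text")["label"].nunique()
print(conflicts[conflicts > 1])
```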

Variables: In the ML context, variables are commonly known as features. They are the result of transforming the raw data into a representation that can be used as input to ML algorithms. If the data are stored in tables (structured data), the features are simply the columns of those tables. In the case of unstructured data such as text or images, a complex and time-consuming feature extraction process is required. Example features include word frequencies in a text and edge contours in an image. Additionally, features can be created from other features, such as isolating the day of the week and the hour from a timestamp.
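
The sketch below illustrates both examples with standard tooling, assuming pandas and scikit-learn are available and using made-up data: word frequencies extracted from raw text, and day-of-week and hour features derived from timestamps.

```python
# A sketch of feature extraction from unstructured text and from timestamps.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Text -> word-frequency features (one column per word in the vocabulary).
texts = ["the cat sat", "the dog barked", "the cat barked"]
vectorizer = CountVectorizer()
word_counts = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())  # the derived feature names
print(word_counts.toarray())               # word frequencies per document

# Features created from other features: day of week and hour from a timestamp.
ts = pd.Series(pd.to_datetime(["2023-05-01 09:30", "2023-05-06 18:00"]))
print(ts.dt.dayofweek)  # 0 = Monday ... 6 = Sunday
print(ts.dt.hour)
```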

The main difficulty of this step is selecting features that actually help a computer predict the right result. Although the human brain is highly specialized in extracting features from data, the importance a person attributes to any given feature is often subjective; a computer can only evaluate features objectively. These two views are often not sufficiently aligned.
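
One way to see what evaluating features objectively means: score each feature by its statistical relationship with the target. The sketch below uses scikit-learn's mutual information estimator on a synthetic dataset in which only one of the two features carries any signal.

```python
# A sketch of objective feature scoring on synthetic data.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
informative = rng.integers(0, 2, n)    # genuinely related to the target
noise = rng.normal(size=n)             # carries no signal at all
X = np.column_stack([informative, noise])

flip = rng.random(n) < 0.1             # the target follows the informative
y = np.where(flip, 1 - informative, informative)  # feature 90% of the time

scores = mutual_info_classif(X, y, random_state=0)
print(scores)  # the informative feature should score clearly higher
```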

Algorithms: Most problems can be solved in different ways, meaning that the choice of training algorithm can affect both the accuracy and the performance of the resulting model. Algorithms are arguably the most visible part of ML, but not the most important one: no choice of algorithm will produce good results if the data lack quality, a phenomenon known as ‘garbage in – garbage out’.
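
As a quick illustration of the first point, the hedged sketch below trains two different algorithms on exactly the same data and compares their test accuracy; scikit-learn's bundled digits dataset is used purely for convenience.

```python
# Same data, two algorithms, potentially different accuracy.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=5000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```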

José Portêlo

Lead Machine Learning Engineer

#MACHINELEARNING #DATASCIENCE #BIGDATA