- Posted by João Rodrigues
- On August 28, 2018
This summer, from the 10th to the 14th of September, a group of students will come together at Redglue’s offices to participate in the first edition of the Red Summer Machine Learning Lab. In this summer lab, the participants will use Data Science tools to solve a Credit Card Fraud Classification problem, working from a dataset of credit card transactions. To develop the solution, they will be challenged to use the Azure Machine Learning Studio, a cloud-based, GUI-driven integrated development environment for developing and deploying Machine Learning solutions.
It is important that credit card companies are able to recognize and detect fraudulent credit card transactions in order to protect their customers.
The dataset that the participants will work with contains transactions made with credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
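To make the imbalance concrete, here is a small sketch that recomputes the fraud rate from the two counts quoted above. It also shows the accuracy of a trivial classifier that labels every transaction as non-fraud, which is one reason plain accuracy tends to be a poor yardstick on data like this. The function names are ours, chosen for illustration only.

```python
# Counts quoted in the dataset description above.
N_TRANSACTIONS = 284_807
N_FRAUDS = 492


def fraud_rate(n_frauds: int, n_total: int) -> float:
    """Fraction of transactions in the positive (fraud) class."""
    return n_frauds / n_total


def all_negative_accuracy(n_frauds: int, n_total: int) -> float:
    """Accuracy of a classifier that predicts 'non-fraud' for everything."""
    return (n_total - n_frauds) / n_total


print(f"Fraud rate: {fraud_rate(N_FRAUDS, N_TRANSACTIONS):.4%}")
print(f"All-negative accuracy: {all_negative_accuracy(N_FRAUDS, N_TRANSACTIONS):.4%}")
```

A model can therefore score above 99.8% accuracy without catching a single fraud, so metrics such as precision and recall are usually more informative here.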
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature ‘Amount’ is the transaction amount, which can be used for example-dependent cost-sensitive learning. Feature ‘Class’ is the response variable: it takes the value 1 in case of fraud and 0 otherwise.
In this lab, the participants will be challenged to work with the Azure Machine Learning Studio, a collaborative drag-and-drop tool to build, test and deploy predictive analytics solutions. Azure ML Studio also publishes the developed models as web services that can be easily consumed by custom apps or other tools.
Azure Machine Learning Studio can easily be integrated in data pipelines, making it a good option to integrate with a client’s system in order to offer Machine Learning solutions.
The tool has powerful Data Science capabilities, with drag-and-drop modules for data cleaning/preparation, several ML algorithms, automatic model evaluation and more.
The participants will have to develop a Machine Learning solution that is able to provide analytics about the dataset (Exploratory Data Analysis), apply a Machine Learning algorithm capable of successfully predicting and recognizing fraudulent credit card transactions and evaluate the solution. The solution will then be provided as a web service capable of analyzing and predicting unseen cases.
On the first day of our Red Summer ML Lab, the participants learned how to use the Azure Machine Learning Studio, by creating a simple credit fraud Classification experiment. Their experience was seamless and they had no issues importing a dataset, choosing different classifiers, training and evaluating them. They found the ability to have such a visual Machine Learning pipeline with the ability to train and visualize the results very interesting.
Then, we switched to Jupyter Notebooks, in order to work in a different environment. Using Python, they loaded the credit fraud dataset and started to work on it.
The first task was to develop a classifier evaluation. They learned about different evaluation metrics and chose the ones they wanted to represent. With this they were ready to implement different classifiers, work on the dataset and see the results.
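As an illustration of what such a classifier evaluation can look like, here is a minimal sketch in plain Python that computes precision, recall and F1 from true and predicted labels. It is not the participants' code: the function names and the toy labels are made up for this example, and it assumes binary labels where 1 means fraud. In practice, scikit-learn offers equivalents such as `precision_score` and `recall_score`.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = fraud, 0 = non-fraud."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:  # truth == 1 and pred == 0
            fn += 1
    return tp, fp, tn, fn


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (0.0 when undefined)."""
    tp, fp, _tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))
```

Recall measures how many of the real frauds were caught, while precision measures how many of the flagged transactions were actually fraudulent; which one matters more depends on the cost of a missed fraud versus a false alarm.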
On the second day, the participants learned how to analyse the dataset and draw conclusions from it. They arrived at different conclusions, but the main points were that the dataset was highly imbalanced (only 0.17% of Fraud cases), that the features had different scales and that there were some outliers in the data.
To fix these issues, they started by implementing different classifiers and then started to work on the scaling of the features, evaluating the impact of these changes on the classifiers.
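For readers who want to see what feature scaling means in code, here is a minimal sketch of two common options, using only the Python standard library. We do not know exactly which scalers each team used, so treat this as an assumption-laden illustration: `standardize` rescales a feature to zero mean and unit variance, and `robust_scale` uses the median and interquartile range, which is less sensitive to outliers. The sample values are invented.

```python
from statistics import fmean, median, pstdev, quantiles


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mean) / std for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range.

    Needs at least two values, because statistics.quantiles requires that.
    """
    q1, _q2, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    center = median(values)
    if iqr == 0:
        return [0.0 for _ in values]
    return [(v - center) / iqr for v in values]


# Invented transaction amounts with one obvious outlier.
amounts = [1.0, 2.0, 3.0, 4.0, 100.0]
print(standardize(amounts))
print(robust_scale(amounts))
```

Whichever scaler is used, its statistics should be computed on the training data only and then reused on the test data, so that no information leaks from the test set.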
Over the following days, they will work on the imbalance of the dataset and choose the best approaches, porting them to the Azure ML Studio environment.
On the third day, the participants started working on one of the main issues of our dataset: its class imbalance. To solve this issue, they researched several over- and undersampling methods, such as SMOTE, Tomek Links and Random Sampling. They applied various methods and combinations thereof, testing them and evaluating the results.
After implementing the techniques, they immediately noticed that the results of their classifiers got much better, since the imbalance was such a core issue of the dataset. Following an iterative approach, they picked the best techniques and applied them in Azure Machine Learning Studio.
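The simplest of the resampling ideas mentioned above, random undersampling, can be sketched in a few lines of plain Python. This is an illustrative version written for this post, not the participants' implementation: the function name, the `seed` parameter and the toy data are our own. SMOTE and Tomek Links are more involved; the imbalanced-learn library provides ready-made implementations of them.

```python
import random


def random_undersample(rows, labels, seed=0):
    """Drop random majority-class rows until both classes have the same size.

    Assumes binary labels (0 and 1). Returns new, shuffled lists and leaves
    the inputs untouched.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data: 95 non-fraud rows and 5 fraud rows (invented for illustration).
toy_rows = list(range(100))
toy_labels = [0] * 95 + [1] * 5
balanced_rows, balanced_labels = random_undersample(toy_rows, toy_labels, seed=42)
print(len(balanced_rows), sum(balanced_labels))
```

Resampling should only be applied to the training split. The test set needs to keep the original class proportions, otherwise the evaluation no longer reflects real-world conditions.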
On the fourth day, the goal was to apply different ways of evaluating and optimizing the classifiers and also to implement dimensionality reduction techniques in order to visualize the dataset. To achieve these goals, the participants learned about cross-validation and hyperparameter optimization and implemented both. They then evaluated their classifiers and chose the solutions that performed best under cross-validation with the optimized parameters. Hyperparameter optimization proved to be a computationally expensive process, so some participants chose to run it on Azure ML Studio with a hyperparameter sweep, which offers the option of optimizing for a specific evaluation metric (Recall or Precision, for example).
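To give an idea of how cross-validation works on an imbalanced dataset, here is a minimal sketch of a stratified k-fold split in plain Python. Stratification is our assumption here (the teams may have used other splitting schemes); it keeps roughly the same share of fraud cases in every fold. The function name and parameters are illustrative, and scikit-learn's `StratifiedKFold` is the usual choice in real projects.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k stratified folds."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Shuffle each class, then deal its rows out to the folds round-robin,
    # so every fold gets a near-equal share of each class.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    # Each fold serves once as the test set; the rest form the training set.
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Toy labels: 10 fraud cases among 100 rows (invented for illustration).
toy_labels = [1] * 10 + [0] * 90
for train_idx, test_idx in stratified_kfold_indices(toy_labels, k=5):
    n_fraud = sum(toy_labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), n_fraud)
```

A hyperparameter search repeats this whole loop once per candidate setting, which is why it becomes expensive so quickly.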
There was also time to apply dimensionality reduction techniques, for which they learned about Principal Component Analysis and t-SNE. However, these were mainly used to reduce the dataset to 2 dimensions in order to visualize it and see whether the Fraud and Non-Fraud cases were organized in clusters.
On the final day of the Red Summer ML Lab, the final reports were shown and we watched some very compelling presentations, where the different teams explained their approaches and showed their best results. All the solutions were different and it was very interesting to see how the teams leveraged what they learned and the tools they had to develop their solutions.
Some participants focused more on using the Azure Machine Learning Studio capabilities, taking advantage of its fast and intuitive prototyping and ease of development of Machine Learning models, while others chose to work on their Python skills to develop their solutions.
Overall, the presentations were very good and insightful, with the teams explaining the decisions they made and backing them up with visually appealing results. You can find all the presentations on YouTube, in the links below.
The feedback we got from the participants was also very positive. We hope that they enjoyed the experience and learned some valuable Machine Learning and Data Science skills. In the end, we gave them all some Amazon Gift Cards!
YouTube videos of the presentations:
Thanks for reading and following!