In the previous chapter of this blog post series, we discussed how generative models can be used to oversample imbalanced data, highlighting the differences with respect to more traditional methods such as SMOTE. This post covers a practical example, applying the two oversampling approaches to two separate datasets.
Description of the datasets and models
We use two binary classification datasets from Kaggle: the first comes from an insurance use case and the second from a fraud detection one. Both datasets show a similar degree of imbalance, with the minority class accounting for between 5% and 10% of the samples. A 10% imbalance is not considered extreme; however, it provides an excellent test bed for oversampling methods.
Both datasets are quite representative of real-life use cases, as they contain both numerical and categorical columns. The main difference between the two lies in their dimensionality: with its 500 columns, the second dataset is much larger than the first.
Despite this difference in dimensionality, the modelling approach was the same for both datasets. We built a pipeline that first applies a preprocessing step to handle the mixed numerical and categorical columns and then trains an XGBoost classifier on the preprocessed data.
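As a rough illustration, such a pipeline could be built along the following lines with scikit-learn and XGBoost. The `ColumnTransformer` preprocessor, the column lists, and the hyperparameters are assumptions made for this sketch, not necessarily the exact setup used in the experiments.

```python
# Minimal sketch of the modelling pipeline: a preprocessor for mixed
# numerical/categorical columns followed by an XGBoost classifier.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier


def build_pipeline(numerical_cols, categorical_cols):
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), numerical_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ]
    )
    return Pipeline(
        steps=[
            ("preprocessor", preprocessor),
            ("classifier", XGBClassifier(eval_metric="logloss")),
        ]
    )
```

The same pipeline can then be fitted with `pipeline.fit(X_train, y_train)` on either the original or the oversampled training data.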
We kept a hold-out dataset aside for testing. For each dataset, we trained a separate pipeline using:
- The original imbalanced dataset
- A dataset obtained by adding additional minority examples using SMOTE
- A dataset obtained by adding additional minority examples synthesised using a Variational AutoEncoder (VAE)
Both oversampling strategies added 5% extra minority samples to the original dataset.
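For the SMOTE variant, the oversampling step could look like the sketch below, based on imbalanced-learn. The interpretation of "5% extra minority samples" as 5% of the original dataset size, and the helper function, are our assumptions.

```python
# Sketch: add synthetic minority rows with SMOTE so that the minority class
# grows by roughly 5% of the original dataset size (assumption about the setup).
from collections import Counter

from imblearn.over_sampling import SMOTE


def oversample_with_smote(X, y, extra_fraction=0.05, random_state=0):
    counts = Counter(y)
    minority_label = min(counts, key=counts.get)
    target_minority = counts[minority_label] + int(extra_fraction * len(y))
    smote = SMOTE(
        sampling_strategy={minority_label: target_minority},
        random_state=random_state,
    )
    return smote.fit_resample(X, y)
```

Note that plain SMOTE operates on numerical features only; with mixed numerical and categorical columns, the data would need to be encoded first, or a variant such as SMOTENC used instead.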
Results
The following table shows the results obtained for the first dataset. In this case, adding a few per cent of extra minority examples increased the recall on the hold-out set, both when using SMOTE and when using a VAE to generate the synthetic points. The gain obtained with the VAE was slightly higher than that obtained with SMOTE. However, it must be pointed out that applying SMOTE is much easier, as it does not require training a generative model.
| Model | Accuracy | Precision | Recall | ROC AUC score |
|---|---|---|---|---|
| Original | 84.4% | 53.0% | 39.0% | 0.892 |
| SMOTE | 84.4% | 52.3% | 43.1% | 0.891 |
| VAE | 84.8% | 53.9% | 44.7% | 0.894 |
Hold-out test metrics obtained using different oversampling techniques for the first dataset.
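For reference, metrics such as those in the table above can be computed with scikit-learn along these lines. This is a sketch assuming a fitted pipeline and the default 0.5 decision threshold for the hard predictions.

```python
# Sketch: compute accuracy, precision, recall and ROC AUC on the hold-out set.
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score


def evaluate(model, X_test, y_test):
    y_pred = model.predict(X_test)               # hard labels (0.5 threshold)
    y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_proba),
    }
```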
The results for the second dataset, shown in the following table, presented a different situation.
| Model | Accuracy | Precision | Recall | ROC AUC score |
|---|---|---|---|---|
| Original | 97.3% | 92.0% | 72.5% | 0.972 |
| SMOTE | 96.0% | 73.8% | 77.8% | 0.960 |
| VAE | 97.3% | 91.0% | 75.1% | 0.972 |
In this case, the synthetic points generated with SMOTE caused a deterioration of the test-set performance, primarily due to a drop in precision. The points generated with the VAE, on the other hand, helped marginally increase the model recall. This can be attributed to the fact that SMOTE generates synthetic points using the heuristic approach described in our previous blog post, interpolating between neighbouring minority examples: such an approach becomes less reliable as the dimensionality of the dataset increases and distances between points become less meaningful.
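To make the VAE approach more concrete, a minimal sketch of VAE-based oversampling is given below: the model is fitted on the (preprocessed, numerical) minority rows only, and new minority examples are obtained by decoding samples drawn from the latent prior. The PyTorch implementation, architecture, dimensions and training loop are illustrative assumptions, not the exact model used for these experiments.

```python
# Sketch: oversampling the minority class with a Variational AutoEncoder.
# Fit the VAE on minority rows only, then decode samples from the latent prior.
import torch
import torch.nn as nn


class TabularVAE(nn.Module):
    def __init__(self, n_features, latent_dim=8, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        return self.decoder(z), mu, logvar


def vae_loss(x_hat, x, mu, logvar):
    # Reconstruction error plus KL divergence to the standard normal prior.
    recon = nn.functional.mse_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl


def oversample_with_vae(x_minority, n_new, epochs=200, lr=1e-3):
    """Train on preprocessed minority rows (a float tensor) and decode prior samples."""
    model = TabularVAE(x_minority.shape[1])
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimiser.zero_grad()
        x_hat, mu, logvar = model(x_minority)
        vae_loss(x_hat, x_minority, mu, logvar).backward()
        optimiser.step()
    with torch.no_grad():
        z = torch.randn(n_new, model.mu.out_features)
        return model.decoder(z)
```

The decoded rows live in the preprocessed numerical feature space; handling categorical columns properly would require an extra step, such as rounding the one-hot blocks or using an explicit categorical output layer.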
Discussion of the results
The examples presented in the previous section showed how oversampling the minority class can help boost performance on imbalanced datasets. We saw how a technique like SMOTE can be an excellent solution when the dataset dimensionality is not too high; however, it can lead to worse metrics on more complex problems, as the synthetic points it generates are not realistic enough. Using a generative model, on the other hand, is a more robust solution, as the quality of the synthetic examples tends to be higher, although generative models introduce a computational overhead which might be unnecessary for smaller datasets. One last remark is that the examples presented in this post are not representative of situations of extreme imbalance: stay tuned for more tests with more pronounced imbalance problems!