In the previous chapter of this blog post series, we discussed how generative models can be used to oversample imbalanced data, highlighting the differences with respect to more traditional methods such as SMOTE. This post covers a practical example, applying the two oversampling approaches to two separate datasets.
Description of the datasets and models
We use two binary classification datasets from Kaggle: the first comes from an insurance use case and the second from a fraud detection one. Both datasets show a similar degree of imbalance, with the minority class accounting for between 5% and 10% of the samples. A 10% imbalance is not considered extreme; however, it provides an excellent test bed for oversampling methods.
Both datasets are quite representative of real-life use cases, as they contain both numerical and categorical columns. The main difference between the two lies in their dimensionality: with its 500 columns, the second dataset is much larger than the first.
Despite this difference in dimensionality, the modelling approach was the same for both datasets. We built a pipeline that first applies a preprocessing step to handle the mixed numerical and categorical columns and then trains an XGBoost classifier on the preprocessed data.
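As a rough illustration, such a pipeline could be built along the following lines with scikit-learn and XGBoost. The `ColumnTransformer` preprocessor, the column lists, and the hyperparameters are assumptions made for this sketch, not necessarily the exact setup used in the experiments.

```python
# Minimal sketch of the modelling pipeline: a preprocessor for mixed
# numerical/categorical columns followed by an XGBoost classifier.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier


def build_pipeline(numerical_cols, categorical_cols):
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), numerical_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ]
    )
    return Pipeline(
        steps=[
            ("preprocessor", preprocessor),
            ("classifier", XGBClassifier(eval_metric="logloss")),
        ]
    )
```

The same pipeline can then be fitted with `pipeline.fit(X_train, y_train)` on either the original or the oversampled training data.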
We kept a hold-out dataset aside for testing. For each dataset, we trained a separate pipeline using:
- The original imbalanced dataset
- A dataset obtained by adding additional minority examples using SMOTE
- A dataset obtained by adding additional minority examples synthesised using a Variational AutoEncoder (VAE)
Both oversampling strategies added 5% extra minority samples to the original dataset.
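For the SMOTE variant, the oversampling step could look like the sketch below, based on imbalanced-learn. The interpretation of "5% extra minority samples" as 5% of the original dataset size, and the helper function, are our assumptions.

```python
# Sketch: add synthetic minority rows with SMOTE so that the minority class
# grows by roughly 5% of the original dataset size (assumption about the setup).
from collections import Counter

from imblearn.over_sampling import SMOTE


def oversample_with_smote(X, y, extra_fraction=0.05, random_state=0):
    counts = Counter(y)
    minority_label = min(counts, key=counts.get)
    target_minority = counts[minority_label] + int(extra_fraction * len(y))
    smote = SMOTE(
        sampling_strategy={minority_label: target_minority},
        random_state=random_state,
    )
    return smote.fit_resample(X, y)
```

Note that plain SMOTE operates on numerical features only; with mixed numerical and categorical columns, the data would need to be encoded first, or a variant such as SMOTENC used instead.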
Results
The following table shows the results obtained for the first dataset. In this case, adding a few per cent of extra minority examples increased the recall on the hold-out set, both when using SMOTE and when using a VAE to generate the synthetic points. The gain obtained with the VAE was slightly higher than that obtained with SMOTE. However, it must be pointed out that applying SMOTE is much easier, as it does not require training a generative model.
| Model | Accuracy | Precision | Recall | ROC AUC score |
|---|---|---|---|---|
| Original | 84.4% | 53.0% | 39.0% | 0.892 |
| SMOTE | 84.4% | 52.3% | 43.1% | 0.891 |
| VAE | 84.8% | 53.9% | 44.7% | 0.894 |
Hold-out test metrics obtained using different oversampling techniques for the first dataset.
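For reference, metrics such as those in the table above can be computed with scikit-learn along these lines. This is a sketch assuming a fitted pipeline and the default 0.5 decision threshold for the hard predictions.

```python
# Sketch: compute accuracy, precision, recall and ROC AUC on the hold-out set.
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score


def evaluate(model, X_test, y_test):
    y_pred = model.predict(X_test)               # hard labels (0.5 threshold)
    y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_proba),
    }
```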
The results for the second dataset, shown in the following table, presented a different situation.
| Model | Accuracy | Precision | Recall | ROC AUC score |
|---|---|---|---|---|
| Original | 97.3% | 92.0% | 72.5% | 0.972 |
| SMOTE | 96.0% | 73.8% | 77.8% | 0.960 |
| VAE | 97.3% | 91.0% | 75.1% | 0.972 |
In this case, the synthetic points generated with SMOTE caused a deterioration of the test-set performance, primarily due to a drop in precision. The points generated with the VAE, on the other hand, helped marginally increase the model recall. This can be attributed to the fact that SMOTE generates synthetic points using the heuristic approach described in our previous blog post, interpolating between neighbouring minority examples: such an approach becomes less reliable as the dimensionality of the dataset increases and distances between points become less meaningful.
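To make the VAE approach more concrete, a minimal sketch of VAE-based oversampling is given below: the model is fitted on the (preprocessed, numerical) minority rows only, and new minority examples are obtained by decoding samples drawn from the latent prior. The PyTorch implementation, architecture, dimensions and training loop are illustrative assumptions, not the exact model used for these experiments.

```python
# Sketch: oversampling the minority class with a Variational AutoEncoder.
# Fit the VAE on minority rows only, then decode samples from the latent prior.
import torch
import torch.nn as nn


class TabularVAE(nn.Module):
    def __init__(self, n_features, latent_dim=8, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        return self.decoder(z), mu, logvar


def vae_loss(x_hat, x, mu, logvar):
    # Reconstruction error plus KL divergence to the standard normal prior.
    recon = nn.functional.mse_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl


def oversample_with_vae(x_minority, n_new, epochs=200, lr=1e-3):
    """Train on preprocessed minority rows (a float tensor) and decode prior samples."""
    model = TabularVAE(x_minority.shape[1])
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimiser.zero_grad()
        x_hat, mu, logvar = model(x_minority)
        vae_loss(x_hat, x_minority, mu, logvar).backward()
        optimiser.step()
    with torch.no_grad():
        z = torch.randn(n_new, model.mu.out_features)
        return model.decoder(z)
```

The decoded rows live in the preprocessed numerical feature space; handling categorical columns properly would require an extra step, such as rounding the one-hot blocks or using an explicit categorical output layer.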
Discussion of the results
The examples presented in the previous section showed how oversampling the minority class can help boost performance on imbalanced datasets. We saw how a technique like SMOTE can be an excellent solution when the dataset dimensionality is not too high; however, it can lead to worse metrics on more complex problems, as the synthetic points it generates are not realistic enough. Using a generative model, on the other hand, is a more robust solution, as the quality of the synthetic examples tends to be higher, although generative models introduce a computational overhead which might be unnecessary for smaller datasets. One last remark is that the examples presented in this post are not representative of situations of extreme imbalance: stay tuned for more tests with more pronounced imbalance problems!