Synthetic data for data augmentation

Improving fraud detection models with synthetic data

STARTING POINT

37%

of mortgage requests were manually reviewed

110k€

potential median loss per fraud (ACFE report)

RESULT

+15%

Recall

Fraud detection is crucial in finance, and machine learning helps automate this process by flagging suspicious transactions for expert review. However, class imbalance—where fraudulent cases are far fewer than legitimate ones—makes it difficult to train accurate models, often increasing manual workload.

For this use case, Precision and Recall are the key metrics. Precision is the share of flagged cases that are actually fraudulent, while Recall is the share of fraudulent cases that the model catches, so a high Recall means that little fraud goes undetected. BearingPoint, a leading consultancy in ML-based fraud detection, faced these challenges daily. Through field testing, we demonstrated how synthetic data can improve model performance while reducing manual verification efforts.
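To make the two metrics concrete, here is a minimal sketch that computes them from predicted and true labels. The labels are toy values invented for illustration, not data from the BearingPoint project, and the helper name `precision_recall` is hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate case flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels (assumption, for illustration only): 1 = fraud, 0 = legitimate
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Every false positive adds a case to the manual review queue, while every false negative is a fraud that slips through, which is why both numbers matter here.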

By integrating Clearbox AI’s Libraries, BearingPoint effectively addressed class imbalance, enhancing fraud detection accuracy for one of their clients and making AI-driven automation more efficient.

Challenge

How can class imbalance in fraud detection datasets be mitigated when dealing with complex data pipelines?

Solution

Our product was used to generate synthetic examples of fraudulent cases.

Result

BearingPoint was able to train high-performing models on the augmented data, translating to higher recall and lower fraud detection workloads.

The challenge

Several techniques can be adopted to mitigate class imbalance. Oversampling, for example, consists of creating synthetic minority examples to rebalance the original dataset. SMOTE, one of the most popular oversampling techniques, has proven useful in many applications. The problem arises when the dimensionality of the dataset (the number of features) increases. This is often the case in fraud detection, where we want to make use of as much information as possible. In this setting, the synthetic examples generated by SMOTE become increasingly unrealistic. It is therefore necessary to use alternative methods, for example ones based on generative models.
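The core idea behind SMOTE can be sketched in a few lines: pick a minority example, pick one of its nearest minority neighbours, and create a new point somewhere on the segment between them. The snippet below is a simplified illustration of that interpolation step using only the Python standard library. It is not the reference SMOTE implementation and not the method used in this project; the function name `smote_like` and the toy points are assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (Euclidean distance)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two numeric features (assumption, for illustration)
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies on a straight line between two existing examples, the approach relies on distances being meaningful. With many features, and especially with categorical ones, that assumption weakens, which is one way to see why the generated examples can drift away from realistic cases.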

The solution

BearingPoint installed our Enterprise Solution on the infrastructure of one of their clients, a retail bank. They connected it to a relational database containing transaction histories and used our tool to quantify class imbalance and find the best data augmentation strategy. Finally, they generated an enriched dataset containing the original client records plus a number of synthetic fraudulent examples, and used it to train a machine learning model based on boosted trees.
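As a rough illustration of what quantifying class imbalance can involve, the sketch below computes the fraud share and the imbalance ratio of a label column, then estimates how many synthetic fraud examples would be needed to reach a chosen fraud share. It is not the Clearbox AI tool or its API. The function names, the toy label counts, and the 20% target are all assumptions made for the example.

```python
import math
from collections import Counter


def imbalance_summary(labels, positive=1):
    """Summarise how unbalanced a binary label column is."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = len(labels) - n_pos
    return {
        "n_fraud": n_pos,
        "n_legit": n_neg,
        "fraud_share": n_pos / len(labels),
        "imbalance_ratio": n_neg / n_pos if n_pos else float("inf"),
    }


def synthetic_needed(n_pos, n_neg, target_share):
    """Synthetic positives needed so positives reach target_share of the data.

    Solves (n_pos + s) / (n_pos + n_neg + s) = target_share for s.
    """
    needed = (target_share * (n_pos + n_neg) - n_pos) / (1 - target_share)
    # round() guards against floating-point noise before taking the ceiling
    return max(0, math.ceil(round(needed, 9)))


# Toy dataset (assumption): 20 fraud cases among 1,000 records
labels = [1] * 20 + [0] * 980
summary = imbalance_summary(labels)
extra = synthetic_needed(summary["n_fraud"], summary["n_legit"], target_share=0.2)
print(summary, extra)
```

In practice the right target share is something to tune against validation Precision and Recall rather than fix in advance.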

The result

Access to the augmented dataset allowed BearingPoint to considerably reduce the number of false flags. As we demonstrated, the model trained on augmented data showed a Recall improvement of 15% (+12% with respect to the best combination of under/oversampling). This translates directly into more efficient and cost-effective fraud detection workflows. These results can be extended to other use cases, both within and outside the financial sector. The technology can be applied to any sector that needs large amounts of data to improve its processes, for example insurance, energy, telco, urban mobility, retail, and healthcare.
