
Augmenting topic detection models with synthetic data

Topic detection models are essential for identifying key themes in customer feedback and can help large organizations quickly respond to user needs. When dealing with insurance-related customer reviews, companies often face imbalanced datasets, with some topics (e.g., policy cancellations) being far less common than others (e.g., general inquiries). This imbalance makes it difficult to train robust language models, as minority classes are underrepresented.

This use case highlights our work with a large insurance company, showing how synthetic data can be used to oversample minority classes and improve model training while providing more accurate insights into customer sentiment.

Challenge
Customer reviews of insurance policies are short and colloquial, and their topics are highly imbalanced, which makes them hard to classify.
Solution
We used the Clearbox Synthetic Enterprise Solution to generate additional data for oversampling minority classes and improving overall model performance.
Result
Models trained on augmented datasets showed improved classification metrics and enabled more accurate insights into customer feedback.

The challenge

In the insurance industry, extracting actionable insights from customer reviews is essential for policy improvements, retention, and satisfaction. However, analyzing these reviews posed challenges: they were often short and colloquial, making NLP modeling complex; certain complaint types were rare, leading to imbalanced data; and they sometimes contained sensitive information, requiring robust privacy measures. As a result, topic detection models struggled to identify minority topics, causing niche customer issues to go unnoticed. This led to suboptimal policy decisions and customer dissatisfaction.

The solution

We partnered with the insurance company to enhance their text analysis using Clearbox's libraries.

We began with data profiling, analyzing review distribution, detecting imbalanced classes, and assessing data quality. To address the data gaps, we generated synthetic text samples that mimicked underrepresented topics, preserving privacy while enriching the dataset.
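The profiling step can be illustrated with a minimal sketch that counts reviews per topic and flags underrepresented ones. It uses only the Python standard library and is not the Clearbox tooling; the function name `profile_topics`, the 10% threshold, and the topic labels are assumptions made for illustration.

```python
from collections import Counter


def profile_topics(labels, minority_share=0.10):
    """Summarize the topic distribution and flag underrepresented topics.

    `minority_share` is an illustrative threshold, not a value from the project.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    # Share of the dataset covered by each topic.
    shares = {topic: n / total for topic, n in counts.items()}
    # Topics whose share falls below the threshold are oversampling candidates.
    minority = sorted(t for t, s in shares.items() if s < minority_share)
    # Ratio between the largest and smallest class sizes.
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return {
        "counts": dict(counts),
        "shares": shares,
        "minority": minority,
        "imbalance_ratio": imbalance_ratio,
    }


# Hypothetical labels for 100 reviews (invented for this example).
labels = (
    ["general_inquiry"] * 80
    + ["claims"] * 15
    + ["policy_cancellation"] * 5
)
report = profile_topics(labels)
print(report["minority"], report["imbalance_ratio"])
```

A report like this gives a concrete target for the augmentation step: the flagged topics are the ones that need additional samples.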

To improve model performance, we oversampled minority classes, blending real and synthetic data to create a more balanced training set. We then trained and fine-tuned a state-of-the-art NLP model, comparing results before and after augmentation.
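As a rough sketch of the blending step, the function below tops up each minority topic with synthetic reviews until it matches the largest class. It assumes the synthetic texts have already been generated and grouped by topic; `blend_to_balance`, the `synthetic_pool` structure, and the sample data are hypothetical and do not reflect the Clearbox API.

```python
import random
from collections import Counter


def blend_to_balance(real, synthetic_pool, target=None, seed=0):
    """Add synthetic samples to minority topics until each reaches `target`.

    real: list of (text, label) pairs from the original dataset.
    synthetic_pool: dict mapping label -> list of pre-generated synthetic texts.
    target: desired samples per topic; defaults to the size of the largest topic.
    """
    rng = random.Random(seed)
    counts = Counter(label for _, label in real)
    if target is None:
        target = max(counts.values())
    augmented = list(real)
    for label, n in counts.items():
        deficit = target - n
        pool = synthetic_pool.get(label, [])
        if deficit <= 0 or not pool:
            continue
        # Sample without replacement so no synthetic review is duplicated.
        picked = rng.sample(pool, min(deficit, len(pool)))
        augmented.extend((text, label) for text in picked)
    rng.shuffle(augmented)
    return augmented


# Hypothetical data, invented for this example.
real = [(f"general review {i}", "general_inquiry") for i in range(6)]
real += [(f"cancellation review {i}", "policy_cancellation") for i in range(2)]
synthetic_pool = {
    "policy_cancellation": [f"synthetic cancellation {i}" for i in range(5)]
}

train_set = blend_to_balance(real, synthetic_pool)
balanced_counts = Counter(label for _, label in train_set)
print(balanced_counts)
```

In a setup like this, augmentation would normally be applied to the training split only, so that the before and after comparison is evaluated on real reviews.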

The result

Integrating synthetic data into training significantly improved the model’s performance. Recall and precision rose for minority topics, so rare customer concerns were detected more reliably. The model also generalized better, with less overfitting, which gave the company a more complete picture of customer feedback.
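To make the per-topic comparison concrete, here is a small sketch that computes precision and recall for each topic from true and predicted labels. The labels and predictions are toy values invented for illustration; they are not results from the project, and `per_class_precision_recall` is a hypothetical helper.

```python
def per_class_precision_recall(y_true, y_pred):
    """Return {label: {"precision": ..., "recall": ...}} for every label seen."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Define the metric as 0.0 when its denominator is zero.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        metrics[label] = {"precision": precision, "recall": recall}
    return metrics


# Toy evaluation set: 4 majority reviews and 2 minority reviews.
y_true = ["general"] * 4 + ["cancellation"] * 2
# Toy predictions from a model that ignores the minority topic.
pred_before = ["general"] * 6
# Toy predictions from a model that recognizes the minority topic.
pred_after = ["general"] * 3 + ["cancellation"] * 3

before = per_class_precision_recall(y_true, pred_before)
after = per_class_precision_recall(y_true, pred_after)
print(before["cancellation"], after["cancellation"])
```

Reporting metrics per topic rather than as a single accuracy figure matters here, because a model can score well overall while missing every review in a rare topic.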

By using privacy-safe synthetic data, the company remained compliant while improving accuracy. This approach enabled them to better understand customer feedback, proactively address concerns, and drive innovation in their insurance offerings.
