Augmenting topic detection models with synthetic data
Topic detection models are essential for identifying key themes in customer feedback and can help large organizations quickly respond to user needs. When dealing with insurance-related customer reviews, companies often face imbalanced datasets, with some topics (e.g., policy cancellations) being far less common than others (e.g., general inquiries). This imbalance makes it difficult to train robust language models, as minority classes are underrepresented.
This use case highlights our work with a large insurance company, showing how synthetic data can be used to oversample minority classes and improve model training while providing more accurate insights into customer sentiment.
The challenge
In the insurance industry, extracting actionable insights from customer reviews is essential for policy improvements, retention, and satisfaction. However, analyzing these reviews posed challenges: they were often short and colloquial, making NLP modeling complex; certain complaint types were rare, leading to imbalanced data; and they sometimes contained sensitive information, requiring robust privacy measures. As a result, topic detection models struggled to identify minority topics, causing niche customer issues to go unnoticed. This led to suboptimal policy decisions and customer dissatisfaction.
The solution
We partnered with the insurance company to enhance their text analysis using Clearbox's libraries.
We began with data profiling: analyzing the distribution of reviews across topics, detecting imbalanced classes, and assessing data quality. To fill the gaps this revealed, we generated synthetic text samples that mimicked the underrepresented topics, preserving privacy while enriching the dataset.
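The class-imbalance part of the profiling step can be illustrated with a minimal sketch. It is not the project's actual code and does not use Clearbox's libraries: the function name `profile_topics`, the 10% threshold, and the topic labels are all assumptions chosen for illustration.

```python
from collections import Counter


def profile_topics(labels, minority_share=0.10):
    """Return each topic's share of the dataset and the topics below the threshold.

    `minority_share` is an illustrative cut-off, not a value from the project.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {topic: n / total for topic, n in counts.items()}
    minority = sorted(t for t, s in shares.items() if s < minority_share)
    return shares, minority


# Hypothetical label distribution, made up for this example.
labels = (
    ["general_inquiry"] * 70
    + ["claims"] * 25
    + ["policy_cancellation"] * 5
)
shares, minority = profile_topics(labels)
print(minority)
```

Topics flagged this way are the candidates for synthetic generation.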
To improve model performance, we oversampled minority classes, blending real and synthetic data to create a more balanced training set. We then trained and fine-tuned a state-of-the-art NLP model, comparing results before and after augmentation.
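As a rough sketch of how real and synthetic samples might be blended, the function below tops up each underrepresented topic with synthetic texts until it matches the largest class. The function name `blend_real_and_synthetic`, the data layout, and the example reviews are assumptions for illustration; the generation of the synthetic texts themselves is out of scope here.

```python
from collections import Counter


def blend_real_and_synthetic(real, synthetic, target=None):
    """Top up minority classes with synthetic samples.

    real:      list of (text, label) pairs
    synthetic: dict mapping label -> list of synthetic texts
    target:    desired samples per class (defaults to the largest real class)
    Returns a list of (text, label, source) triples so provenance stays visible.
    """
    counts = Counter(label for _, label in real)
    if target is None:
        target = max(counts.values())
    blended = [(text, label, "real") for text, label in real]
    for label, n in counts.items():
        needed = max(0, target - n)
        # Use at most `needed` synthetic texts; fewer if the pool is small.
        for text in synthetic.get(label, [])[:needed]:
            blended.append((text, label, "synthetic"))
    return blended


# Hypothetical reviews, made up for this example.
real = [
    ("How do I update my address?", "general_inquiry"),
    ("Where can I find my policy number?", "general_inquiry"),
    ("What does my plan cover?", "general_inquiry"),
    ("Is there a mobile app?", "general_inquiry"),
    ("I want to cancel my policy.", "policy_cancellation"),
]
synthetic = {
    "policy_cancellation": [
        "Please end my coverage next month.",
        "How do I terminate my contract?",
        "Cancelling because the premium went up.",
        "I'd like to stop my insurance.",
        "Cancel my plan, I found a better offer.",
    ]
}
blended = blend_real_and_synthetic(real, synthetic)
print(Counter(label for _, label, _ in blended))
```

Keeping the `source` field makes it straightforward to hold out a real-only evaluation set, so that the before and after comparison is measured on genuine reviews.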
The result
Integrating synthetic data into training significantly improved the model's performance. Recall and precision both rose for minority topics, so rare customer concerns were detected more reliably. The model also generalized better, with less overfitting, which made the resulting customer insights more dependable.
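Per-topic precision and recall, the metrics mentioned above, can be computed as in the following minimal sketch. The function name `per_class_precision_recall` is an assumption, and the labels and predictions are toy values made up to exercise the function; they are not results from the project.

```python
def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label seen in either list."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Fall back to 0.0 when a denominator is empty.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[label] = (precision, recall)
    return report


# Toy data only: one cancellation review is misread as a general inquiry.
y_true = ["general", "general", "general", "cancellation", "cancellation"]
y_pred = ["general", "general", "general", "general", "cancellation"]
report = per_class_precision_recall(y_true, y_pred)
print(report)
```

Running the same function on predictions from the model trained before and after augmentation, against the same held-out set, gives the per-topic comparison described in this section.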
By using privacy-safe synthetic data, the company remained compliant while improving accuracy. This approach enabled them to better understand customer feedback, proactively address concerns, and drive innovation in their insurance offerings.