Resources
That's Fresh! Newsletter
Read a selection of our past issues.
- NumPy 2.0 is almost out! And: Our new data preprocessor with Polars | Interview with S2E at Italy Insurance Forum (June 5, 2024)
- What a month for new LLMs! And: Datacamp webinar with Shalini (May 22, 2024)
- GenAI true value lies beyond operational enhancements. And: The Future of Data Protection | New updates about AI Act (April 24, 2024)
- What are 1-bit Large Language Models? And: Linkedin Live about AI Act | Mastercard's Country Manager interviewed our CEO (March 6, 2024)
- LLaMAntino - Effective Text Generation in Italian. And: Creating train and test datasets | Use case: Detecting money muling with the help of synthetic data (February 21, 2024)
- The NY Times sues OpenAI and Microsoft. And: Can AI work with little data? | La Stampa: AI means development (January 10, 2024)
- Synthetic Data 101. And: Why synthetic data? | New project with Poste Italiane (November 8, 2023)
- How easy is it for LLM to infer sensitive information? And: Why is data sharing important? | Our new partnership with S2E (October 25, 2023)
- Have you heard of Pythia? And: Data augmentation tutorial | Did you say AI apocalypse? (August 30, 2023)
- Google's answer to ChatGPT. And: Generating synthetic data within relational databases. Let's meet at WAICF! (February 8, 2023)
- Understanding ChatGPT better. And: How to deal with imbalanced data. More about our product (December 14, 2022)
- A curated list of failed ML projects. And: How to build a data strategy. Clearbox AI and Bearing Point partnership. (November 16, 2022)
- Our open source library is now on GitHub. And: Clearbox AI on Cybernews. (June 22, 2022)
- Discovering Dagster. And: Quantifying privacy risks. Use case: a synthetic data sandbox to freely share data. (June 8, 2022)
- Can interaction data be fully anonymized? And: Synthetic Data for privacy preservation: understanding privacy risks. Discover our Enterprise solution. (April 6, 2022)
- What are GFlow nets? And: Improve models with Synthetic Data. Use case: augment financial time series. (March 16, 2022)
- The European Commission selected us for Women TechEU pilot project! And: What is Synthetic Data. The new Synthetic Data platform. (March 09, 2022)
- The EDPS on Synthetic Data. And: From raw to good quality data. Changelogs: now you can upload unlabeled datasets. (February 23, 2022)
- 2022 Gartner's Technology Trends. And: How to harness the power of AI in companies. Changelogs: new metrics available for your synthetic dataset. (February 09, 2022)
FROM THE AI WORLD
Synthetic Data is becoming increasingly mainstream. Gartner predicts that 60% of AI models will use Synthetic Data in some form by 2025. In its recent market trends report and hype cycle, it also reiterated the importance of the generative AI that fuels synthetic data generation.
Many advantages of Synthetic Data are driving this trend, including data augmentation and cost-effective, safe data procurement, but the main focus these days seems to be on its role in privacy preservation and enhancement. Last week the office of the European Data Protection Supervisor weighed in on the topic in the opinion piece you can find in the box below.
The advantages of Synthetic Data stated in the article mainly concern its use for privacy-preserving AI model development and as a privacy-enhancing method for sharing personal data. The article also issues a note of caution about the risk of re-identification of personal data, the challenges of anonymisation, and other unforeseen risks.
The article reflects a general opinion on the advantages and challenges of synthetic datasets, but it fails to discuss how the challenges can be mitigated using a risk-based approach, and it downplays the advantages.
Quantifying the risks can help users of Synthetic Data make smart choices when dealing with the utility vs privacy conundrum. One such approach is differential privacy. According to NIST: "A differentially private synthetic dataset looks like the original dataset - it has the same schema and attempts to maintain properties of the original dataset (e.g., correlations between attributes) - but it provides a provable privacy guarantee for individuals in the original dataset."
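To give a feel for what a "provable privacy guarantee" means in practice, here is a minimal Python sketch of the Laplace mechanism, one of the basic building blocks of differential privacy. It is only an illustration, not how any particular synthetic data generator works: the function names `laplace_noise` and `dp_count`, the toy ages, and the epsilon values are all our own assumptions. A smaller epsilon means more noise and stronger privacy; a larger epsilon means a more accurate but less private answer.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-mean Laplace distribution.

    The difference of two independent exponential variables with the
    same rate 1/scale follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records matching `predicate`.

    Adding or removing one individual changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    ages = [23, 35, 41, 52, 67, 29, 58, 44]  # made-up toy data
    for eps in (0.1, 1.0, 10.0):
        noisy = dp_count(ages, lambda a: a >= 40, eps, rng)
        print(f"epsilon={eps}: noisy count = {noisy:.2f} (true count = 5)")
```

The utility vs privacy trade-off shows up directly in the single parameter epsilon, which is what makes the risk quantifiable. Note that every additional query on the same data consumes more of the privacy budget, which this sketch does not track.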
Since Synthetic Data will continue to play a major role in AI and analytics, we think that a risk quantification approach to privacy will help companies capitalise on good-quality synthetic data to accelerate responsible innovation.
The EDPS on Synthetic Data
What is Synthetic Data? What are its benefits and risks? The European Data Protection Supervisor recently published a piece that addresses these questions.
CLEARBOX AI
New: upload unlabeled datasets
For tabular data, you can now analyse and generate Synthetic Data for unlabeled datasets, i.e. datasets that do not (yet) have a target column.
BLOGPOST
From raw to good quality data
Find out the most frequent problems with raw data and the techniques to mitigate them in the second chapter of our Data Preparation guide.