Clearbox AI Model Assessment

Our Model Assessment streamlines the machine learning development pipeline by analyzing models automatically: it highlights strengths and limitations, detects undesired behaviour and assesses generalizability.

Based on our proprietary technology

The core of our technology is a variational autoencoder, which generates the synthetic data used in the assessment process and selects the most relevant information for model owners to evaluate.
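To make the mechanism concrete, here is a minimal sketch of how a variational autoencoder produces synthetic records: encode an input to the parameters of a latent Gaussian, sample a latent code with the reparameterization trick, then decode it back to feature space. The dimensions and the untrained random weights are purely illustrative assumptions, not Clearbox's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative assumptions only).
n_features, n_latent = 4, 2

# Random weights stand in for a fitted encoder/decoder.
W_mu = rng.normal(size=(n_features, n_latent))
W_logvar = rng.normal(size=(n_features, n_latent))
W_dec = rng.normal(size=(n_latent, n_features))

def encode(x):
    """Map input rows to the parameters of their latent Gaussians."""
    return x @ W_mu, x @ W_logvar

def sample_latent(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Map latent codes back to feature space (synthetic records)."""
    return z @ W_dec

x = rng.normal(size=(5, n_features))          # a few "real" records
mu, logvar = encode(x)
synthetic = decode(sample_latent(mu, logvar))  # same shape as the input
```

Sampling different latent codes from the same encoded distribution yields new, plausible records, which is what makes the latent space useful for both synthesis and similarity-based analysis.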

Try our Demo

Explore our AI Model Assessment with different datasets and models.

Technical Whitepaper

Download our Technical Whitepaper to find out more about our technology.

Data drift and outliers detection


Data scientists can use our Model Assessment to quantify data drift and to explain outliers in the model's training and assessment datasets. This highlights how ready a model is to operate in a production environment.
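As an illustrative proxy (not Clearbox's generative-model-based method), drift on a single feature can be quantified with a two-sample Kolmogorov-Smirnov test, and outliers flagged by their distance from the training distribution. The feature columns below are synthetic stand-ins:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

# Hypothetical feature column at training time vs. in production.
train_col = rng.normal(loc=0.0, scale=1.0, size=2000)
prod_col = rng.normal(loc=0.6, scale=1.0, size=2000)  # shifted: drift

# Kolmogorov-Smirnov test: a small p-value flags a distribution shift.
stat, p_value = ks_2samp(train_col, prod_col)
drifted = p_value < 0.01

# Simple outlier flag: production points far from the training mean.
z = np.abs((prod_col - train_col.mean()) / train_col.std())
outliers = prod_col[z > 3.0]

print(f"KS statistic={stat:.3f}, drift detected: {drifted}")
print(f"{outliers.size} outlier(s) flagged")
```

In practice the same idea extends to multivariate drift by comparing distributions in a learned latent space rather than one raw feature at a time.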

Model limitations and robustness check


Our tool looks for model limitations by dividing the assessment set into subgroups based on the similarities learnt by the generative model. Model performance can vary considerably across subgroups, and this clustering gives a better overview of potential model limitations and of the presence of irreducible uncertainty. Overall model robustness is also tested by feeding the model new input perturbations produced by the generative model.
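A minimal sketch of both ideas, under simplifying assumptions: the subgroups are found here with k-means on raw features (Clearbox clusters on similarities learnt by the generative model), and the perturbations are plain Gaussian noise rather than generative-model samples. The dataset and model are synthetic stand-ins:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Synthetic assessment set: two well-separated feature clusters.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
y = np.concatenate([np.zeros(200), np.ones(200)]).astype(int)

model = LogisticRegression().fit(X, y)

# Split the assessment set into subgroups and score each one:
# per-subgroup accuracy exposes where the model underperforms.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for k in range(2):
    mask = labels == k
    print(f"subgroup {k}: n={mask.sum()}, "
          f"accuracy={model.score(X[mask], y[mask]):.2f}")

# Robustness check: do small input perturbations flip predictions?
X_pert = X + rng.normal(0, 0.1, X.shape)
flip_rate = np.mean(model.predict(X) != model.predict(X_pert))
print(f"prediction flip rate under perturbation: {flip_rate:.3f}")
```

A high flip rate, or a subgroup whose accuracy lags far behind the others, is the kind of signal the assessment surfaces for further inspection.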

Interpretability assessment


Being able to interpret the output of a model is a crucial part of any assessment procedure. Additional information about the model's decision process helps the model owner not only understand mistakes but also detect undesired behaviour and explain model bias. Our tool automatically selects the most representative points in the assessment set, whose explanations are then evaluated by a human.
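One simple way to pick representative points, sketched here as an illustration rather than Clearbox's actual selection logic, is to cluster the assessment set and take the record nearest each cluster centroid as the point whose explanation a human should review:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(3)

# Hypothetical assessment set (rows = records that could be explained).
X = rng.normal(size=(300, 5))

# Cluster the set, then take the record closest to each centroid as a
# "most representative" point for human review of its explanation.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
rep_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)

print("indices selected for human review:", sorted(rep_idx.tolist()))
```

Reviewing one representative per region of the data gives broad coverage of the model's behaviour at a fraction of the cost of explaining every record.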
