Synthetic Data Generation Methods
Our approach to generating synthetic data from sensitive data in Trusted Research Environments (TREs)
There are many different techniques for generating synthetic data, typically using either statistical approaches or machine learning models to create new synthetic samples based on real data. Machine learning models are useful for modelling the complex relationships and characteristics of real datasets; however, this can come with greater disclosure risks. For most uses of synthetic data in TREs, we do not need to capture this level of complexity and can instead use statistical techniques.

The following details the statistical techniques used to generate synthetic data from population-level summary statistics, at three different fidelity levels:
Level 1 Structural - uses only datatypes and value ranges to generate synthetic data that looks structurally similar to the real data, but contains none of the statistical properties found in the real data.
Level 2 Statistical - adds distribution types and summary statistics so that synthetic data can be sampled from marginal distributions, providing some statistical similarity, but does not capture relationships between variables.
Level 3 Correlated - adds a correlation matrix to capture relationships between variables by sampling from multivariate distributions.
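The difference between Levels 2 and 3 can be sketched as follows. This is a minimal illustration, not the implementation: the variable names, summary statistics, and the choice of normal marginals are all hypothetical, assuming two numeric variables with a known correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical summary statistics for two variables (e.g. age, income).
means = np.array([45.0, 30000.0])
stds = np.array([12.0, 8000.0])

# Level 2 (Statistical): sample each variable independently from its
# marginal distribution -- relationships between variables are lost.
level2 = rng.normal(loc=means, scale=stds, size=(1000, 2))

# Level 3 (Correlated): build a covariance matrix from the correlation
# matrix and sample from a multivariate normal, so the relationship
# between the two variables is preserved.
corr = np.array([[1.0, 0.6],
                 [0.6, 1.0]])
cov = np.outer(stds, stds) * corr
level3 = rng.multivariate_normal(mean=means, cov=cov, size=1000)
```

Measuring the sample correlation of each output shows the effect: the Level 2 columns are uncorrelated, while the Level 3 columns recover a correlation near 0.6.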

The process is split into two parts: processing and generation. This split means TREs can export the processed results through their usual disclosure control methods before any synthetic data is generated, making the outputs easier to assess. (Add about threshold)
Each level requires progressively more information to be processed from the real data: Level 1 needs only the datatype and value range of each variable; Level 2 additionally needs each variable's best-fit distribution type and summary statistics; Level 3 additionally needs the correlation matrix between variables.
For each variable, the best-fit distribution is identified and its parameters calculated. The following details the types of distributions and parameters fitted.
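One common way to identify a best-fit distribution is to fit each candidate by maximum likelihood and compare goodness of fit. The sketch below uses the Kolmogorov-Smirnov statistic for that comparison; the candidate set and the example data are assumptions for illustration, not the method prescribed by this document.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical real column (in practice this step runs inside the TRE).
real = rng.normal(loc=50.0, scale=5.0, size=2000)

# Candidate distributions to try; the best fit is the one with the
# lowest Kolmogorov-Smirnov statistic against the real data.
candidates = {"norm": stats.norm, "expon": stats.expon, "uniform": stats.uniform}

results = {}
for name, dist in candidates.items():
    params = dist.fit(real)                      # maximum-likelihood fit
    ks_stat, _ = stats.kstest(real, name, args=params)
    results[name] = (ks_stat, params)

best = min(results, key=lambda name: results[name][0])
```

Only the chosen distribution name and its fitted parameters would need to leave the TRE, which keeps the exported results small and straightforward to check.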


Normal Distribution
Parameters:
loc : The mean, or centre, of the distribution
scale : The standard deviation, which measures the spread of the distribution
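The two parameters above follow the scipy.stats naming convention, where a normal distribution can be fitted and then sampled from. A minimal sketch, assuming a numeric column drawn from a roughly normal distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical real column.
real = rng.normal(loc=10.0, scale=2.0, size=5000)

# Fit returns (loc, scale): the mean and standard deviation.
loc, scale = stats.norm.fit(real)

# Generate synthetic values from the fitted marginal distribution.
synthetic = stats.norm.rvs(loc=loc, scale=scale, size=5000, random_state=3)
```

The synthetic column then shares the real column's centre and spread without containing any of its actual values.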