Synthetic Data Generation Methods

Our approach to generating synthetic data from sensitive data in Trusted Research Environments (TREs)

There are many techniques for generating synthetic data, typically using either statistical approaches or machine learning models to create new synthetic samples based on real data. Machine learning models are useful for modelling complex relationships and characteristics of real datasets; however, this can come with greater disclosure risks. For most uses of synthetic data in TREs, we don't need to capture this level of complexity and can instead use statistical techniques.

comparison.png

The following describes the statistical techniques used to generate synthetic data from population-level summary statistics, at three different fidelity levels:

Level 1 Structural - uses only datatypes and value ranges to generate synthetic data that looks structurally similar to the real data but contains none of its statistical properties.

Level 2 Statistical - adds distribution types and summary statistics so that synthetic data can be sampled from marginal distributions. This provides some statistical similarity but does not capture relationships between variables.

Level 3 Correlated - adds a correlation matrix so that relationships between variables are captured by sampling from multivariate distributions.
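The three levels can be illustrated with a minimal sketch. It assumes NumPy, two numeric columns, and normal marginals. The metadata layout, column names, values and function names are all hypothetical and are not the project's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_rows = 1000

# Hypothetical metadata for two numeric columns (illustrative values only).
meta = {
    "age":    {"dtype": "int",   "min": 18,   "max": 90,    "mean": 52.0, "std": 15.0},
    "weight": {"dtype": "float", "min": 40.0, "max": 150.0, "mean": 78.0, "std": 14.0},
}
# Assumed correlation matrix, in the same column order as meta.
corr = np.array([[1.0, 0.4],
                 [0.4, 1.0]])


def level1_structural(meta, n, rng):
    """Uniform draws within each column's range: right types and ranges, no statistics."""
    out = {}
    for name, m in meta.items():
        if m["dtype"] == "int":
            out[name] = rng.integers(m["min"], m["max"], size=n, endpoint=True)
        else:
            out[name] = rng.uniform(m["min"], m["max"], size=n)
    return out


def level2_statistical(meta, n, rng):
    """Independent draws from each column's marginal (assumed normal here).

    Simplification: values are not clipped to the range or cast back to int.
    """
    return {name: rng.normal(m["mean"], m["std"], size=n) for name, m in meta.items()}


def level3_correlated(meta, corr, n, rng):
    """Joint draw from a multivariate normal built from std devs and the correlation matrix."""
    names = list(meta)
    means = np.array([meta[k]["mean"] for k in names])
    stds = np.array([meta[k]["std"] for k in names])
    cov = corr * np.outer(stds, stds)  # covariance = correlation scaled by std devs
    samples = rng.multivariate_normal(means, cov, size=n)
    return {k: samples[:, i] for i, k in enumerate(names)}


structural = level1_structural(meta, n_rows, rng)
statistical = level2_statistical(meta, n_rows, rng)
correlated = level3_correlated(meta, corr, n_rows, rng)
```

Only the Level 3 output should show a correlation between the two columns close to the value in the matrix. Levels 1 and 2 draw each column independently.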

diagram.png

The process is split into two parts: processing and generation. This lets TREs export the processed results through their usual disclosure control methods before any synthetic data is generated, which makes the output easier to assess. (Add about threshold)
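The split between processing and generation might look roughly like the sketch below. It uses only the Python standard library and assumes a JSON export and normal marginals. The function names `process` and `generate`, the summary fields and the sample values are hypothetical.

```python
import json
import random
import statistics


def process(real_columns):
    """Processing step: runs inside the TRE and reduces real data to summary statistics."""
    summary = {}
    for name, values in real_columns.items():
        summary[name] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "std": statistics.stdev(values),
        }
    # The JSON string is what would go through disclosure control and be exported.
    return json.dumps(summary)


def generate(summary_json, n_rows, seed=0):
    """Generation step: sees only the exported summary, never the real rows."""
    rng = random.Random(seed)
    summary = json.loads(summary_json)
    return {
        name: [rng.gauss(s["mean"], s["std"]) for _ in range(n_rows)]
        for name, s in summary.items()
    }


real = {"age": [34, 51, 47, 62, 29, 58, 45, 70]}  # made-up values for illustration
exported = process(real)
synthetic = generate(exported, n_rows=5)
```

Because `generate` takes only the exported summary as input, the real rows cannot reach the synthetic output by any route other than the statistics that were reviewed.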

Explain what's needed / processed for each level

The best-fit distribution is identified and its parameters are calculated. The types of distributions and the parameters fitted are detailed below.
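One way to pick a best-fit distribution is sketched below, assuming SciPy. The source does not say which candidate families or which goodness-of-fit criterion are used, so the candidate list and the use of the Kolmogorov-Smirnov statistic are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Candidate families are an assumption; the source does not list which ones are tried.
CANDIDATES = {"norm": stats.norm, "expon": stats.expon, "uniform": stats.uniform}


def best_fit(values):
    """Fit each candidate and keep the one with the lowest KS statistic."""
    best = None
    for name, dist in CANDIDATES.items():
        params = dist.fit(values)  # for these three families: (loc, scale)
        ks = stats.kstest(values, dist.cdf, args=params).statistic
        if best is None or ks < best[2]:
            best = (name, params, ks)
    return best


# Illustrative data drawn from a known normal distribution.
rng = np.random.default_rng(0)
sample = rng.normal(loc=50.0, scale=10.0, size=2000)
name, params, ks = best_fit(sample)
```

Only the chosen family name and its fitted parameters would need to be stored for the generation step. The sample itself is not kept.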

normal_animation_varying_loc.gif
normal_animation_varying_scale (1).gif

Normal Distribution

Parameters:

loc: The mean, or centre, of the distribution

scale: The standard deviation, which measures the spread of the distribution
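The effect of each parameter can be checked with a short sketch. It assumes the `loc` and `scale` names follow the SciPy convention, which the source does not state, and the parameter values are arbitrary.

```python
from scipy import stats

standard = stats.norm(loc=0.0, scale=1.0)
shifted = stats.norm(loc=2.0, scale=1.0)  # same spread, centre moved to 2
wide = stats.norm(loc=0.0, scale=3.0)     # same centre, three times the spread

# Changing loc moves the peak without changing its height.
peak_standard = standard.pdf(0.0)
peak_shifted = shifted.pdf(2.0)

# Increasing scale widens the curve and lowers the peak by the same factor.
peak_wide = wide.pdf(0.0)
```

Here `peak_shifted` equals `peak_standard`, while `peak_wide` is one third of it.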
