Page 1: Workshop on Synthetic Data, 1st December 2014, ONS, Titchfield.
Page 2:

What is synthpop?

A tool for producing synthetic versions of microdata that contain confidential information, so that they can safely be released to users for exploratory analysis and for preparing code.

Administrative Data Research Centre - Scotland | Beata Nowok | 1 December 2014

Page 3:

Real (input)

Sex     Age  Education             Marital status  Income  Life satisfaction
FEMALE  57   VOCATIONAL/GRAMMAR    MARRIED         800     PLEASED
MALE    41   SECONDARY             UNMARRIED       1500    MIXED
FEMALE  18   VOCATIONAL/GRAMMAR    UNMARRIED       NA      PLEASED
FEMALE  78   PRIMARY/NO EDUCATION  WIDOWED         900     MIXED
FEMALE  54   VOCATIONAL/GRAMMAR    MARRIED         1500    MOSTLY SATISFIED
MALE    20   SECONDARY             UNMARRIED       -8      PLEASED
FEMALE  39   SECONDARY             MARRIED         2000    MOSTLY SATISFIED
MALE    39   SECONDARY             MARRIED         1197    MIXED
FEMALE  38   VOCATIONAL/GRAMMAR    MARRIED         NA      MOSTLY DISSATISFIED
FEMALE  73   VOCATIONAL/GRAMMAR    WIDOWED         1700    PLEASED
FEMALE  54   SECONDARY             WIDOWED         2000    MOSTLY SATISFIED
MALE    30   VOCATIONAL/GRAMMAR    UNMARRIED       900     MOSTLY SATISFIED
MALE    68   SECONDARY             MARRIED         -8      DELIGHTED
MALE    61   PRIMARY/NO EDUCATION  MARRIED         -8      MIXED

Synthetic (output)

Sex     Age  Education             Marital status  Income  Life satisfaction
MALE    81   PRIMARY/NO EDUCATION  MARRIED         2100    PLEASED
MALE    54   VOCATIONAL/GRAMMAR    MARRIED         1700    PLEASED
FEMALE  32   VOCATIONAL/GRAMMAR    DIVORCED        870     MIXED
FEMALE  98   PRIMARY/NO EDUCATION  MARRIED         800     MOSTLY DISSATISFIED
FEMALE  50   PRIMARY/NO EDUCATION  MARRIED         NA      MOSTLY SATISFIED
FEMALE  37   VOCATIONAL/GRAMMAR    MARRIED         158     PLEASED
MALE    28   VOCATIONAL/GRAMMAR    NA              1500    MOSTLY SATISFIED
FEMALE  62   PRIMARY/NO EDUCATION  MARRIED         830     MOSTLY SATISFIED
MALE    78   PRIMARY/NO EDUCATION  MARRIED         NA      PLEASED
FEMALE  29   SECONDARY             MARRIED         580     MOSTLY SATISFIED
MALE    59   PRIMARY/NO EDUCATION  MARRIED         1300    MOSTLY SATISFIED
MALE    41   SECONDARY             UNMARRIED       1500    MIXED
MALE    18   SECONDARY             UNMARRIED       -8      PLEASED
FEMALE  73   PRIMARY/NO EDUCATION  WIDOWED         1350    MOSTLY SATISFIED

Data that look (structurally) like original data but contain artificial units only

Page 4:

Data that behave (statistically) like original data

Page 5:

synthpop package

Generating synthetic versions of sensitive microdata for statistical disclosure control

http://cran.r-project.org/package=synthpop

Page 8:

Generating synthetic data

Sequentially replacing original data values with synthetic values generated from conditional probability distributions:

Y_j ~ (Y_0, Y_1, ..., Y_{j-1})

[Diagram: real and synthetic data sets, with "fit" and "draw" steps]
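The fit and draw steps can be illustrated with a small conceptual sketch. It is written in Python rather than R and is not synthpop code: the function names (`fit_marginal`, `fit_normal_by_group`, `synthesise`) are hypothetical, and the simple within-group normal model is an assumption chosen for brevity. The data are the Sex and Age columns of the real (input) table from the slides.

```python
import random
import statistics


def fit_marginal(values):
    """Fit for the first variable: with no predictors, resample observed values."""
    return lambda rng, _previous: rng.choice(values)


def fit_normal_by_group(groups, values):
    """Fit for a later variable: a normal model within each level of the
    preceding variable, estimated from the real data only."""
    by_group = {}
    for group, value in zip(groups, values):
        by_group.setdefault(group, []).append(value)
    params = {g: (statistics.mean(v), statistics.stdev(v))
              for g, v in by_group.items()}

    def draw(rng, previous):
        mean, sd = params[previous]
        return rng.gauss(mean, sd)

    return draw


def synthesise(sex, age, n, seed=0):
    """Sequential synthesis of two variables: Y0 (sex), then Y1 (age) given Y0."""
    rng = random.Random(seed)
    draw_sex = fit_marginal(sex)              # fit Y0 on the real data
    draw_age = fit_normal_by_group(sex, age)  # fit Y1 | Y0 on the real data
    rows = []
    for _ in range(n):
        s = draw_sex(rng, None)               # draw synthetic Y0
        a = draw_age(rng, s)                  # draw Y1 given the synthetic Y0
        rows.append((s, round(a)))
    return rows


real_sex = ["FEMALE", "MALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "FEMALE",
            "MALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "MALE", "MALE"]
real_age = [57, 41, 18, 78, 54, 20, 39, 39, 38, 73, 54, 30, 68, 61]

synthetic = synthesise(real_sex, real_age, n=len(real_sex), seed=1)
```

Each model is fitted to the real data, but the predictors used when drawing are the already synthesised values, which is what makes the replacement sequential. A plain normal draw can produce implausible ages; real tools use more careful models for that reason.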

Page 9:

Generating synthetic data

[Diagram: real data, syn(), synthetic data]

Page 10:

Overview of synthpop functions

[Diagram: real data, syn(), synthetic data]

read.real(), write.syn()
sdc()
descriptive: compare.synds(), summary.synds()
models: compare.fit.synds(), glm.synds(), summary.fit.synds()

Page 12:

syn() & common data problems

Missing-data codes: contNA
    categorical variables: additional factor level(s)
    continuous variables: specified by contNA and modelled separately
Semi-continuous variables: semicont
Restricted values (interrelationships between variables): rules & rvalues
Linear constraints: denom
Non-negativity / non-normality: method set to "lognorm", "sqrtnorm" or "cubertnorm"
Deterministic relations: method set to "~I(…)"
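Two of the ideas above, modelling missing-data codes separately and modelling on a log scale to keep values positive, can be sketched as follows. This is a conceptual Python illustration under stated assumptions, not the package's implementation: `synthesise_income` is a hypothetical name, `None` stands in for R's `NA`, and the log-normal draw only follows the spirit of the "lognorm" option. The input is the Income column of the real (input) table, where NA and -8 are treated as missing-data codes.

```python
import math
import random
import statistics


def synthesise_income(values, n, missing_codes=(None, -8), seed=0):
    """Synthesise a continuous variable that contains missing-data codes."""
    rng = random.Random(seed)

    # Step 1: treat the missing-data pattern as its own categorical variable
    # (each code is a category, everything else is "observed").
    pattern = [v if v in missing_codes else "observed" for v in values]

    # Step 2: fit a normal model to the logs of the non-missing values only,
    # so the missing-data codes do not distort the fitted distribution.
    logs = [math.log(v) for v in values if v not in missing_codes]
    mu, sigma = statistics.mean(logs), statistics.stdev(logs)

    synthetic = []
    for _ in range(n):
        state = rng.choice(pattern)
        if state == "observed":
            # Back-transforming from the log scale keeps the draw positive.
            synthetic.append(rng.lognormvariate(mu, sigma))
        else:
            synthetic.append(state)
    return synthetic


real_income = [800, 1500, None, 900, 1500, -8, 2000, 1197, None, 1700,
               2000, 900, -8, -8]

synthetic_income = synthesise_income(real_income, n=50, seed=1)
```

Keeping the codes out of the numeric model matters here because a value such as -8 would otherwise pull the fitted mean down and could not be log-transformed at all.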


Page 13:

sdc() & statistical disclosure control

Data labelling: label
Removing replicated uniques: rm.replicated.uniques
Bottom- and top-coding: recode.vars, bottom.top.coding, recode.exclude
syn(): smoothing, minbucket

sdc(syn.obj, real, label = "false data", rm.replicated.uniques = TRUE,
    recode.vars = c("age", "income"),
    bottom.top.coding = list(c(NA, 85), c(NA, 1500)))
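The three measures in the call above (labelling, removing replicated uniques, top-coding) can be mimicked in a short Python sketch. It is not the sdc() implementation: `sdc_sketch` and its helpers are hypothetical names, the order of the steps is an assumption, and a replicated unique is taken to mean a record that is unique in the synthetic data and matches a unique record in the real data. The rows are a subset of the tables shown in the slides.

```python
from collections import Counter

COLUMNS = ("sex", "age", "education", "marital", "income", "life_sat")


def remove_replicated_uniques(synthetic, real):
    """Drop synthetic records that are unique and replicate a unique real record."""
    real_counts = Counter(real)
    syn_counts = Counter(synthetic)
    return [row for row in synthetic
            if not (syn_counts[row] == 1 and real_counts[row] == 1)]


def top_code(rows, column, upper):
    """Replace values above `upper` with `upper`; missing values (None) are kept."""
    idx = COLUMNS.index(column)
    coded = []
    for row in rows:
        value = row[idx]
        if value is not None and value > upper:
            row = row[:idx] + (upper,) + row[idx + 1:]
        coded.append(row)
    return coded


def sdc_sketch(synthetic, real, label, top_codes):
    rows = remove_replicated_uniques(synthetic, real)
    for column, upper in top_codes.items():
        rows = top_code(rows, column, upper)
    return [(label,) + row for row in rows]   # prepend the label to every record


real = [
    ("FEMALE", 57, "VOCATIONAL/GRAMMAR", "MARRIED", 800, "PLEASED"),
    ("MALE", 41, "SECONDARY", "UNMARRIED", 1500, "MIXED"),
]
synthetic = [
    ("MALE", 81, "PRIMARY/NO EDUCATION", "MARRIED", 2100, "PLEASED"),
    ("FEMALE", 98, "PRIMARY/NO EDUCATION", "MARRIED", 800, "MOSTLY DISSATISFIED"),
    ("MALE", 41, "SECONDARY", "UNMARRIED", 1500, "MIXED"),
    ("MALE", 18, "SECONDARY", "UNMARRIED", -8, "PLEASED"),
]

protected = sdc_sketch(synthetic, real, label="false data",
                       top_codes={"age": 85, "income": 1500})
```

On this subset the sketch drops the record that replicates a real one, caps the age of 98 at 85 and the income of 2100 at 1500, and leaves the -8 code untouched because it lies below the upper bound.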

Page 14:

Real (input), Synthetic (output) and the result of sdc()

The Real (input) and Synthetic (output) tables are the same as on Page 3. After sdc(), every record carries the label "false data", ages are top-coded at 85, incomes are top-coded at 1500, and the replicated unique (MALE 41 SECONDARY UNMARRIED 1500 MIXED) is removed:

(label)     Sex     Age  Education             Marital status  Income  Life satisfaction
false data  MALE    81   PRIMARY/NO EDUCATION  MARRIED         1500    PLEASED
false data  MALE    54   VOCATIONAL/GRAMMAR    MARRIED         1500    PLEASED
false data  FEMALE  32   VOCATIONAL/GRAMMAR    DIVORCED        870     MIXED
false data  FEMALE  85   PRIMARY/NO EDUCATION  MARRIED         800     MOSTLY DISSATISFIED
false data  FEMALE  50   PRIMARY/NO EDUCATION  MARRIED         NA      MOSTLY SATISFIED
false data  FEMALE  37   VOCATIONAL/GRAMMAR    MARRIED         158     PLEASED
false data  MALE    28   VOCATIONAL/GRAMMAR    NA              1500    MOSTLY SATISFIED
false data  FEMALE  62   PRIMARY/NO EDUCATION  MARRIED         830     MOSTLY SATISFIED
false data  MALE    78   PRIMARY/NO EDUCATION  MARRIED         NA      PLEASED
false data  FEMALE  29   SECONDARY             MARRIED         580     MOSTLY SATISFIED
false data  MALE    59   PRIMARY/NO EDUCATION  MARRIED         1300    MOSTLY SATISFIED
false data  MALE    18   SECONDARY             UNMARRIED       -8      PLEASED
false data  FEMALE  73   PRIMARY/NO EDUCATION  WIDOWED         1350    MOSTLY SATISFIED


Page 16:

synthpop: future developments

Disclosure control
Providing sufficient disclosure protection
    Disclosure control measures
    Watermarking
    Partially synthetic data

Data synthesis
Handling various data types, data structures and real data problems
    Stratified synthesis
    Value bounds
    Multiple event data
    Household and other hierarchical data
    Complex survey design
    Small geographic areas

Package usability
Making synthpop flexible and accessible to a wider range of users
    A graphical user interface (GUI)
    Dealing with computational limitations
    Support for LSs projects
    Training workshops

Quality of synthetic data
Measuring and improving analytical validity
    Tests of synthesising approaches (parametric vs CART models)
    CART extensions
    Case studies for ADRC-S projects
    Guidelines for best practice

http://cran.r-project.org/package=synthpop