Synthetic data generation for anonymization purposes. Application on the Norwegian Survey on living conditions/EHIS JOHAN HELDAL AND DIANA-CRISTINA IANCU STATISTICS NORWAY, DEPARTMENT OF METHODOLOGY AND DATA COLLECTION JOINT UNECE/EUROSTAT WORK SESSION ON STATISTICAL DATA CONFIDENTIALITY 29-31 OCTOBER 2019, THE HAGUE
Intro
• In some cases de-identification does not offer sufficient protection
• How to merge synthetic data based on surveys with other data sources, such as registers?
• An ideal solution:
◦ preserves data confidentiality
◦ allows for quality controls
◦ provides the same opportunities for data analysis as the non-anonymized data would
◦ enables international reporting to institutions
◦ provides the possibility to adjust data after a longer period
◦ can be reused with minimal adaptations
• Potential solution: create “satellite” datasets with register data that could be merged with the survey data as
needed, based on a key
Synthetic data generation
• Our solution: eliminate the key and match the “satellite” files through statistical matching
• We propose a model-free method that relies on statistical matching to replace the register information
corresponding to individuals from the original sample with register information corresponding to other similar
individuals
• Draw a “register sample” → Add register variables → Match the sample to the original survey and replace the
register data
Data
• Norwegian Survey on living conditions/EHIS (European Health Interview Survey) from 2015
• Comprehensive survey covering several topics
• Multiple uses for both aggregated results and microdata
• Conducted on a representative sample of individuals aged 16 and above
• The sample is divided into 19 strata, corresponding to the 19 counties in Norway
• Survey sample: 14,000 potential respondents for the entire country (700 individuals per county, except for
Oslo, with 1,400 individuals)
• Gross sample: 13,748 individuals
• Net sample: 8,164 individuals
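The sample figures above can be checked with a few lines of arithmetic. This is only a sketch: the response rate is not stated on the slide and is derived here on the assumption that the net sample counts completed interviews out of the gross sample.

```python
# Design figures quoted on the slide
counties = 19
per_county = 700   # every county except Oslo
oslo = 1400

# 18 counties with 700 individuals each, plus Oslo with 1,400
planned = (counties - 1) * per_county + oslo
print(planned)  # 14000

# Assumption: net sample = completed interviews among the gross sample
gross, net = 13748, 8164
response_rate = round(100 * net / gross, 1)
print(response_rate)  # 59.4
```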
Data
• Data collected before the interview:
◦ residence municipality of the respondent
◦ composition of the household
◦ name and address of the employer of each household member
◦ respondent’s occupation
• Data added after the interview is conducted:
◦ education
◦ income
◦ whether the respondent lives in a densely or sparsely populated area
◦ more detailed demographic information for the household and each family member, such as country of birth
and immigrant background
Method
• Step 1. Drawing a “register sample”
◦ We draw a sample of 42,000 individuals from the population register
◦ Survey respondents are not excluded prior to drawing the sample
• Step 2. Adding register variables
◦ Variables that would normally be linked to the Norwegian Survey on living conditions/EHIS
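Steps 1 and 2 can be sketched as follows. Everything in this block is hypothetical: the toy population of 100,000 persons, the `income_register` lookup and the choice of a simple random sample without replacement are assumptions for illustration, since the slides do not say how the sample is drawn or which register variables are attached.

```python
import random

random.seed(2019)  # fixed seed so the sketch is reproducible

# Hypothetical population register: person id -> basic demographics
population = {
    pid: {
        "county": random.randint(1, 19),
        "gender": random.choice(["F", "M"]),
        "age": random.randint(16, 95),
    }
    for pid in range(100_000)
}

# Hypothetical register holding one variable to be linked (income)
income_register = {pid: random.randint(0, 1_000_000) for pid in population}

# Step 1: draw the "register sample" (assumed: simple random sampling
# without replacement; survey respondents are not excluded beforehand)
sample_ids = random.sample(sorted(population), 42_000)

# Step 2: add register variables; the person id is used for the lookup
# only and is not kept in the resulting records
register_sample = [
    {**population[pid], "income": income_register[pid]}
    for pid in sample_ids
]

print(len(register_sample))  # 42000
```

Dropping the identifier after enrichment mirrors the idea of eliminating the key, so that the later link to the survey can only be made by statistical matching.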
Method
• Step 3. Performing the statistical matching
◦ Match the enriched “register sample” to the original survey sample
◦ We follow the procedure outlined in D’Orazio (2016) for the statistical matching, consisting of 5 steps:
1) choosing the target variables
2) identifying the common variables
3) choosing the matching variables
4) applying a statistical matching method
5) evaluating the results
Method
• Step 3.1. Choosing the target variables
◦ Variables chosen for each of the two data sources
◦ The “register sample” can be adjusted by adding or removing register variables according to the needs of the
synthetic dataset users
• Step 3.2. Choosing the common variables
◦ Assess definitions, accuracy, frequency distributions of the common variables
• Step 3.3. Choosing the matching variables
◦ Choose only relevant variables
◦ Apply the principle of parsimony
◦ In order to preserve the structure of the survey sample, we use county, gender, age and household size as
matching variables
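Assessing the frequency distributions of a common variable in the two sources can be sketched with a few standard similarity measures for categorical distributions. The choice of measures (dissimilarity index, overlap, Bhattacharyya coefficient, Hellinger's distance) and the toy gender counts are assumptions for illustration, not results from the survey.

```python
from collections import Counter
from math import sqrt

def proportions(values, categories):
    """Relative frequency of each category, in the given order."""
    counts = Counter(values)
    n = len(values)
    return [counts[c] / n for c in categories]

def compare(survey_values, register_values):
    """Similarity measures between two categorical distributions."""
    categories = sorted(set(survey_values) | set(register_values))
    p = proportions(survey_values, categories)
    q = proportions(register_values, categories)
    dissimilarity = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    bhattacharyya = sum(sqrt(a * b) for a, b in zip(p, q))
    return {
        "dissimilarity": dissimilarity,       # 0 = identical distributions
        "overlap": 1 - dissimilarity,         # 1 = identical distributions
        "bhattacharyya": bhattacharyya,       # 1 = identical distributions
        "hellinger": sqrt(max(0.0, 1 - bhattacharyya)),  # 0 = identical
    }

# Made-up example data, not taken from the survey
survey_gender = ["F"] * 52 + ["M"] * 48
register_gender = ["F"] * 50 + ["M"] * 50
result = compare(survey_gender, register_gender)
print(result)
```

A variable whose distributions differ markedly between the two sources would be a poor candidate as a matching variable.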
Method
[Figure: Distribution of gender in the EHIS survey sample compared with the “register sample”]
[Figure: Distribution of age in the EHIS survey sample compared with the “register sample”]
[Figure: Distribution of household size in the EHIS survey sample compared with the “register sample”]
Method
• Step 3.4. Applying a statistical matching method

| Version | Method | How the donor is chosen | Record can be selected as donor multiple times | Donation classes |
| --- | --- | --- | --- | --- |
| “RND1” | Random hot deck statistical matching | For every individual in the original survey sample, a donor is randomly selected from the donor dataset (“register sample”); only one donor is chosen randomly | Yes | County, Gender |
| “RND2” | Random hot deck statistical matching | Only one donor is chosen randomly among the 20 closest neighbors in terms of age and household size | Yes | County, Gender |
| “NN” | Nearest neighbor distance hot deck statistical matching | The closest donor to each record in the original survey is selected from the donor dataset, according to a distance computed on a subset of common variables; no constraints | Yes | County, Gender |
| “NNC” | Nearest neighbor distance hot deck statistical matching | Same distance-based selection as “NN”, but constrained | No | County, Gender |
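The two random hot deck versions can be sketched as below. This is a minimal illustration under stated assumptions: the record layout, the function names `rnd1` and `rnd2`, the Manhattan distance on age and household size, and the toy donor data are all hypothetical; the slides do not specify the distance function.

```python
import random
from collections import defaultdict

def donation_classes(donors):
    """Group donors by donation class (county, gender)."""
    classes = defaultdict(list)
    for d in donors:
        classes[(d["county"], d["gender"])].append(d)
    return classes

def rnd1(recipient, classes, rng):
    """RND1: one donor drawn at random within the donation class."""
    return rng.choice(classes[(recipient["county"], recipient["gender"])])

def rnd2(recipient, classes, rng, k=20):
    """RND2: one donor drawn at random among the k closest donors
    (assumed distance: |age difference| + |household size difference|)."""
    pool = classes[(recipient["county"], recipient["gender"])]
    def dist(d):
        return (abs(d["age"] - recipient["age"])
                + abs(d["hh_size"] - recipient["hh_size"]))
    return rng.choice(sorted(pool, key=dist)[:k])

# Toy donor dataset standing in for the enriched "register sample"
rng = random.Random(1)
donors = [
    {"county": c, "gender": g, "age": rng.randint(16, 90),
     "hh_size": rng.randint(1, 6), "income": rng.randint(0, 900_000)}
    for c in (1, 2) for g in ("F", "M") for _ in range(50)
]
classes = donation_classes(donors)

recipient = {"county": 1, "gender": "F", "age": 40, "hh_size": 2}
d1 = rnd1(recipient, classes, rng)
d2 = rnd2(recipient, classes, rng)

# The survey record keeps its own answers and receives the donor's
# register variable in place of its original register information
synthetic = {**recipient, "income": d2["income"]}
```

Both functions sample with replacement across recipients, so the same register record can serve as donor more than once. An empty donation class would raise an `IndexError` here and would need explicit handling in practice.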
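The difference between the unconstrained (“NN”) and constrained (“NNC”) distance hot deck can be sketched on a toy example. The distance function, the record layout and the function names are assumptions. The constrained version shown here is a simple greedy rule, named as such: a full constrained hot deck would typically minimise the total matching distance by solving an assignment problem, which this sketch does not do.

```python
def distance(r, d):
    """Assumed distance: |age difference| + |household size difference|."""
    return abs(r["age"] - d["age"]) + abs(r["hh_size"] - d["hh_size"])

def same_class(r, d):
    """Donation classes are defined by county and gender."""
    return (r["county"], r["gender"]) == (d["county"], d["gender"])

def nn(recipients, donors):
    """Unconstrained: index of the closest donor for each recipient;
    a donor may be used several times."""
    result = []
    for r in recipients:
        candidates = [j for j, d in enumerate(donors) if same_class(r, d)]
        result.append(min(candidates, key=lambda j: distance(r, donors[j])))
    return result

def nnc_greedy(recipients, donors):
    """Constrained (greedy simplification): each donor is used at most
    once; pairs are accepted in order of increasing distance."""
    pairs = sorted(
        (distance(r, d), i, j)
        for i, r in enumerate(recipients)
        for j, d in enumerate(donors)
        if same_class(r, d)
    )
    assigned, used = {}, set()
    for _, i, j in pairs:
        if i not in assigned and j not in used:
            assigned[i] = j
            used.add(j)
    return [assigned.get(i) for i in range(len(recipients))]

donors = [
    {"county": 1, "gender": "F", "age": 30, "hh_size": 2},
    {"county": 1, "gender": "F", "age": 35, "hh_size": 2},
    {"county": 1, "gender": "M", "age": 31, "hh_size": 2},  # other class
]
recipients = [
    {"county": 1, "gender": "F", "age": 30, "hh_size": 2},
    {"county": 1, "gender": "F", "age": 31, "hh_size": 2},
]

print(nn(recipients, donors))          # [0, 0]
print(nnc_greedy(recipients, donors))  # [0, 1]
```

In the unconstrained run both recipients receive donor 0, while the constrained run forces the second recipient onto the next closest unused donor in the same donation class. The constraint requires at least as many donors as recipients in every donation class.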