Confidentiality Protection of Social Science Micro Data: Synthetic Data and Related Methods John M. Abowd Cornell University and Census Bureau January 30, 2006 UCLA Institute for Digital Research and Education Presentation
Page 1:

Confidentiality Protection of Social Science Micro Data: Synthetic Data and Related Methods

John M. Abowd
Cornell University and Census Bureau

January 30, 2006
UCLA Institute for Digital Research and Education
Presentation

Page 2:

Acknowledgements

• Many current and past LEHD staff and senior research fellows contributed to the development of the LEHD infrastructure system and the Quarterly Workforce Indicators. Kevin McKinney, Bryce Stephens and Lars Vilhuber were particularly responsible for the confidentiality protection system.

• Fredrik Andersson and Marc Roemer at LEHD did the data analysis and implementation of the On the Map package. John Carpenter of Excensus, Inc. developed the mapping application.

• Gary Benedetto, Lisa Dragoset, Martha Stinson and Bryan Ricchetti did the synthesis programming for the SIPP-PUF application.

Page 3:

Overview

• What is the problem?

• What are synthetic data?

• How can the research community benefit from synthetic data?

• The NSF-ITR synthetic data grant

• The Census Bureau’s synthetic data and related products:
– QWI Online
– On the Map
– The new SIPP-SSA-IRS Public Use File

• Tools

Page 4:

Information Release and Data Protection are Competing Objectives

• Statisticians call this the Risk-Utility tradeoff

• Economists prefer to distinguish between technological trade-offs and preference trade-offs

• Information release and data protection are technological tradeoffs

Page 5:

A Simple Example of the Technological Trade-off

• There are two outputs: information released and data protection

• Consider a census with sampling as the release technology

• The production possibility frontier (PPF) measures the amount of information that must be sacrificed to get additional protection

• The information measure is Shannon’s H (or the Kullback-Leibler divergence between the census and the sample)

• The protection measure is the maximum probability of an exact disclosure
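The information measure named above can be illustrated with a minimal sketch. It assumes a categorical census variable, simple random sampling as the release technology, and the divergence taken from the sample shares to the census shares; the slides do not state the direction or any normalization, and the names `shares`, `kl_divergence`, and `sample_release` are hypothetical.

```python
import math
import random
from collections import Counter


def shares(values):
    """Empirical category shares of a list of categorical values."""
    counts = Counter(values)
    n = len(values)
    return {category: count / n for category, count in counts.items()}


def kl_divergence(p, q):
    """Kullback-Leibler divergence sum_c p(c) * log(p(c) / q(c)), in nats.

    Assumes q(c) > 0 wherever p(c) > 0, which holds when p comes from
    a sample of the population that q describes.
    """
    return sum(pc * math.log(pc / q[c]) for c, pc in p.items() if pc > 0)


def sample_release(census, rate, seed=0):
    """Release a simple random sample of the census (assumed technology)."""
    rng = random.Random(seed)
    k = max(1, round(rate * len(census)))
    return rng.sample(census, k)


# Hypothetical census of 100 records with three categories.
census = ["A"] * 60 + ["B"] * 30 + ["C"] * 10
for rate in (0.1, 0.5, 1.0):
    released = sample_release(census, rate)
    divergence = kl_divergence(shares(released), shares(census))
    print(rate, round(divergence, 4))
```

Releasing the whole census (rate 1.0) gives zero divergence and no protection from sampling; lower rates trade information for protection.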

Page 6:

Information Gain-Protection PPF

[Figure: information (vertical axis, 0.00 to 2.00) plotted against protection (horizontal axis, 0.00 to 1.00)]

Page 7:

Marginal Cost of Protection

[Figure: price (vertical axis, 0.00 to 3.00) plotted against protection (horizontal axis, 0.00 to 1.00)]

Page 8:

What Are Synthetic Data?

• Public use micro data products that reproduce essential features of confidential micro data products

• Essential features include:
– Univariate distributions overall and in subpopulations
– Multivariate relations among the variables

Page 9:

Some History

• Original fully synthetic data idea was due to Rubin (JOS, 1993)
– Synthesize the Decennial Census long form responses for the short form households, then release samples that do not include any actual long form records

• Original partially synthetic data idea was due to Little (JOS, 1993)
– Synthesize the sensitive values on the public use file

• Critical refinement (Fienberg, 1994)
– Use a parametric posterior predictive distribution (instead of a Bayesian bootstrap) to do the sampling

• Other authors, particularly Raghunathan, Reiter, Rubin, Abowd, Woodcock
– Partially synthetic data with missing data (Reiter)
– Sequential Regression Multivariate Imputation (Raghunathan, Reiter, and Rubin; Abowd and Woodcock)

Page 10:

How Can You Preserve Confidentiality and Multivariate Relations?

• Fundamental trade-off:
– better protection v. better data quality

• Protection results from summarizing the data with a complicated multivariate distribution, then sampling that distribution instead of the original data

• The synthetic data are not any respondent’s actual data

• But, for some techniques, it may still be possible to re-identify the source record in the confidential data

• New techniques address this problem
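The "summarize, then sample" idea above can be sketched for two continuous variables. This is a simplified stand-in: it plugs point estimates into a bivariate normal rather than drawing from a full posterior predictive distribution, and the function names and the simulated "confidential" data are hypothetical.

```python
import math
import random


def fit_bivariate_normal(records):
    """Summarize (x, y) records by means, variances, and covariance."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in records) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in records) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in records) / (n - 1)
    return mean_x, mean_y, var_x, var_y, cov_xy


def sample_synthetic(params, n, rng):
    """Draw n synthetic records from the fitted distribution."""
    mean_x, mean_y, var_x, var_y, cov_xy = params
    # Cholesky factor of the 2x2 covariance matrix.
    l11 = math.sqrt(var_x)
    l21 = cov_xy / l11
    l22 = math.sqrt(max(var_y - l21 ** 2, 0.0))
    synthetic = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        synthetic.append((mean_x + l11 * z1, mean_y + l21 * z1 + l22 * z2))
    return synthetic


# Simulated stand-in for confidential micro data (not real data).
data_rng = random.Random(1)
confidential = []
for _ in range(500):
    x = data_rng.gauss(10.0, 2.0)
    confidential.append((x, 2.0 * x + data_rng.gauss(0.0, 1.0)))

params = fit_bivariate_normal(confidential)
synthetic = sample_synthetic(params, 500, random.Random(2))
```

No synthetic record is copied from a respondent, yet the covariance between the two variables is carried over through the fitted distribution.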

Page 11:

How Can the Research Community Benefit from Synthetic Data?

• Sophisticated research users must help develop the synthesizers in order to promote and improve analytic validity

• Many more users will have access to the information because there is a public use micro data product.

Page 12:

The Research – Synthetic Data Feedback Cycle

[Diagram: a cycle linking Scientific Modeling, Data Synthesis, Confidentiality Protection, and Analytic Validity]

Page 13:

The Multi-layer System

Basic confidential data
– Fundamental product of virtually all Census programs
– Leads to the publication of public-use products (summary data, micro data, narrative data)

Gold-standard confidential data
– Edited, documented and archived research versions of confidential data
– Used in internal Census research and at Research Data Centers

Page 14:

More Layers

Partially-synthetic micro data
– Preserves the record structure or sampling frame of the gold standard micro data
– Replaces the data elements with synthetic values sampled from an appropriate probability model

Fully-synthetic micro data
– Uses only the population or record linkage structure of the gold standard micro data
– Generates synthetic entities and data elements from appropriate probability models

Page 15:

The NSF Information Technologies Research Grant

• A program that encourages innovative, high-payoff IT research and education

• Our grant proposal cited the many research studies and data products created by previous NSF support for the Research Data Center network and the Longitudinal Employer-Household Dynamics Program

Page 16:

What Is It?

• $2.9 million 3-year grant to the RDC network (Cornell is the coordinating institution)

• Provides core support for scientific activities at the RDCs

• To develop public use, analytically valid synthetic data from many of the RDC-accessible data sets

• To facilitate collaboration with RDC projects that help design and test these products

Page 17:

The Quarterly Workforce Indicators

• QWI was the LEHD Program’s first public use data product

• QWI Online

• Detailed labor force information by sub-state geography, detailed industry, ownership class, sex and age group.

Page 18:

The Confidentiality Protection System

• All QWI protections are done by noise infusion of the micro-data

• All micro-data items are distorted by at least a minimal percentage and at most a maximal percentage

• Only the distorted items are used in the production of the release product
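A minimal sketch of multiplicative noise infusion along these lines follows. Several details are assumptions rather than the production system: the distortion magnitude is drawn uniformly between the two bounds, one factor is drawn per unit and held fixed across periods, and the bounds 0.05 and 0.15 and all names are hypothetical.

```python
import random


def draw_distortion_factor(min_pct, max_pct, rng):
    """Factor that moves a value by at least min_pct and at most max_pct.

    Bounds are fractions (0.05 means 5 percent). The uniform magnitude
    and the coin-flip sign are illustrative assumptions.
    """
    magnitude = rng.uniform(min_pct, max_pct)
    sign = rng.choice((-1.0, 1.0))
    return 1.0 + sign * magnitude


def infuse_noise(series_by_unit, min_pct, max_pct, seed=0):
    """Distort every item of each unit with that unit's single factor."""
    rng = random.Random(seed)
    distorted = {}
    for unit, series in series_by_unit.items():
        factor = draw_distortion_factor(min_pct, max_pct, rng)
        distorted[unit] = [factor * value for value in series]
    return distorted


# Hypothetical quarterly employment for two establishments.
true_employment = {
    "establishment_1": [100.0, 110.0, 121.0],
    "establishment_2": [8.0, 8.0, 6.0],
}
released = infuse_noise(true_employment, min_pct=0.05, max_pct=0.15)
```

Because a unit keeps the same factor in every period under this assumption, each level is distorted while period-to-period ratios within the unit are unchanged.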

Page 19:

Protection and Validity Principles

• Cells with few businesses contributing or with few individuals contributing have been distorted in the cross-section but not the time-series

• Bias in the cross-section is controlled and random; no analyst knows its sign

• More information

Page 20:

Theoretical Distribution of the QWI Distortion Factor

Page 21:

Theoretical Distribution of the QWI Distortion Factor

Page 22:

Actual Confidentiality Protection Distortion: Employment, Beginning-of-Quarter

Page 23:

Table 8: Distribution of Error in First Order Serial Correlation

Page 24:

Graph: Distribution of Error in First Order Serial Correlation

Page 25:

Enhancements

• The current product has suppressions for cells too small to protect by noise infusion

• The enhanced product replaces these suppressions with synthetic data

Page 26:

Percentage of Data Items in County Level Release File

Percentage of Data Items in QWI County-level Release File NAICS IL 2001:1-2004:1
Employment (Beginning-of-quarter)

                               Sector   Sub-sector   Industry Group
Released                        86.45        75.43            70.08
  Not significantly distorted   70.06        58.96            57.22
  Significantly distorted       16.39        16.47            12.86
Suppressed                      13.54        24.57            29.91

Released                       100.00       100.00           100.00
  Not significantly distorted   70.07        58.96            57.23
  Significantly distorted*      29.93        41.04            42.77
Suppressed                       0.00         0.00             0.00

*approximate

Page 27:

Beginning of Period Employment in NAICS Sector 62

[Figure: "Beginning Period Employment in Naics Sector 52, Men and Women Ages 19-21"; counts (vertical axis, 0 to 4.5) by quarter from 2001-1 to 2004-1 (horizontal axis, Time); series: Current QWI and Improved QWI]

Page 28:

Full Quarter New Hires in NAICS4 3259

[Figure: "Full-Quarter New Hires in Naics4 3259, Women Aged 55-64"; counts (vertical axis, 0 to 14) by quarter from 2001-1 to 2004-1 (horizontal axis, Time); series: Current QWI and Improved QWI]

Page 29:

The Census Bureau’s First Public Use Synthetic Data Application

• LEHD On-the-map application

• Shows commuting patterns at the Census Block level with characteristics of the origin and destination block groups

• Origin block data are synthetic
– Sampled from the posterior predictive distribution of origin blocks and origin characteristics given destination block, destination block characteristics.

• On-the-map

Page 30:

Where people living in the selected area (Mobile’s neighboring communities of Daphne and Fairhope) work

Source: “On the Map” beta application, Longitudinal Employer-Household Dynamics Program, U.S. Census Bureau September 23, 2005

DRAFT – Beta Test Document Only

Page 31:

Where people working in the selected area (downtown Mobile) live

Source: “On the Map” beta application, Longitudinal Employer-Household Dynamics Program, U.S. Census Bureau September 23, 2005

DRAFT – Beta Test Document Only

Page 32:

Synthetic Data Model

• $y_{ijk}$ are the counts for residence block i, work place block j and characteristics k.

• Characteristics are age groups, earnings groups, industry (NAICS sector), ownership sector.

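$p(y_{\cdot jk} \mid \pi_{\cdot|jk}) \propto \prod_{i=1}^{I} \pi_{i|jk}^{y_{ijk}}$

$\pi_{\cdot|jk} \mid y_{\cdot jk} \sim \mathrm{Dirichlet}(\alpha_{1|jk} + y_{1jk}, \ldots, \alpha_{I|jk} + y_{Ijk})$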

Page 33:

Complications

• Informative prior “shape”

• Prior “sample size”

• Work place counts must be compatible with the protection system used by Quarterly Workforce Indicators (QWI)
– Dynamically consistent noise infusion

Page 34:

Residence Block (i)   Work Block (j)             Total
                      W1    W2    …     WJ
R1                     2     5    …     …           50
R2                     3          …     …          400
R3                                …     …           50
R4                          90    …     …          200
R5                                …     …          100
R6                                …     …           20
R7                                …     …           20
R8                                …     …           20
R9                                …     …           40
R10                               …     …          100
Total                  5    95    …     …         1000

Page 35:

Residence   Prior distribution    Likelihood             Posterior   Posterior       Synthetic
Block (i)   (Aggregated Work      (Original Work         Expected    Probabilities   Data
            Block Distribution)   Block Distribution)    Counts
R1          0.050                 0.400                   2.350      0.196           1
R2          0.400                 0.600                   5.800      0.483           2
R3          0.050                                         0.350      0.029
R4          0.200                                         1.400      0.117
R5          0.100                                         0.700      0.058           1
R6          0.020                                         0.140      0.012
R7          0.020                                         0.140      0.012
R8          0.020                                         0.140      0.012           1
R9          0.040                                         0.280      0.023
R10         0.100                                         0.700      0.058           1
Total       1.000                 1.000                  12.000      1.000           6

Work block: 5
Prior: 7
QWI estimate: 6
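The arithmetic in this worked example can be checked with a short sketch. It assumes that each posterior expected count equals the prior share times the prior sample size (7) plus the observed count (2 and 3 workers in R1 and R2, matching likelihoods 0.400 and 0.600 of 5 workers), and that synthetic residences are drawn by first sampling probabilities from the Dirichlet posterior and then sampling as many workers as the QWI estimate (6). The function names are hypothetical.

```python
import random

PRIOR_SHARES = [0.050, 0.400, 0.050, 0.200, 0.100,
                0.020, 0.020, 0.020, 0.040, 0.100]   # R1..R10
OBSERVED_COUNTS = [2, 3, 0, 0, 0, 0, 0, 0, 0, 0]     # work block of 5 workers
PRIOR_SAMPLE_SIZE = 7
QWI_ESTIMATE = 6


def posterior_parameters(prior_shares, counts, prior_n):
    """Dirichlet posterior parameters: prior pseudo-counts plus data."""
    return [prior_n * share + y for share, y in zip(prior_shares, counts)]


def draw_dirichlet(alphas, rng):
    """Sample from a Dirichlet by normalizing independent gamma draws."""
    gammas = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def synthesize_residences(alphas, n_workers, rng):
    """Draw probabilities from the posterior, then residence blocks."""
    probs = draw_dirichlet(alphas, rng)
    draws = rng.choices(range(len(alphas)), weights=probs, k=n_workers)
    counts = [0] * len(alphas)
    for block in draws:
        counts[block] += 1
    return counts


alphas = posterior_parameters(PRIOR_SHARES, OBSERVED_COUNTS, PRIOR_SAMPLE_SIZE)
posterior_probs = [a / sum(alphas) for a in alphas]
synthetic_counts = synthesize_residences(alphas, QWI_ESTIMATE, random.Random(0))
```

Under these assumptions `alphas` matches the "Posterior Expected Counts" column and `posterior_probs` the "Posterior Probabilities" column; the synthetic counts vary with the random seed, so they will not generally equal the column shown on the slide.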

Page 36:

Analytic Validity

• Assess the bias

• Assess the incremental variation

Page 37:

Census   (1)       (2) Average commute   (3) Average commute     (4) Difference   (5) Standard deviation
Tract    Workers   distance in true      distance in synthetic   in miles         across 10 implicates
                   data (in miles)       data (in miles)                          over (1)
1        6,747     17.9                  17.9                     0.0             0.019
2        4,535     14.6                  14.8                     0.1             0.013
3        2,251     18.5                  19.3                     0.9             0.018
4        1,932     12.0                  13.2                     1.3             0.043
5        1,996     15.0                  15.0                    -0.1             0.028
6        2,135     14.3                  15.7                     1.3             0.036
7        1,809     12.8                  13.9                     1.1             0.036
8        2,004      8.5                   8.5                     0.0             0.039
9        1,515     11.8                  12.1                     0.3             0.021
10       1,365     21.1                  23.2                     2.0             0.040
11       1,233     16.3                  17.4                     1.1             0.031
12         879     15.1                  16.8                     1.8             0.067
13         811     11.3                  11.3                     0.0             0.072
14         634     10.4                  10.4                    -0.1             0.051
15         618      9.6                   9.6                     0.0             0.046
16         526     11.4                  10.1                    -1.3             0.088
17         531     17.1                  18.4                     1.3             0.045
18         541     14.4                  14.5                     0.2             0.063
19         378     15.0                  14.4                    -0.6             0.069
20         372      7.7                   7.2                    -0.5             0.069
21         138      7.8                   8.1                     0.3             0.064
Total   32,951

Page 38:

Size weighted average of absolute difference in commute distance in confidential and synthetic data

Population in Work Block Mean P20 P40 P60 P80

1-5 9.33 4.72 5.72 8.98 13.95

6-10 5.89 1.88 3.13 4.69 8.71

11-20 3.82 1.76 2.42 2.68 4.24

21-50 3.34 1.19 1.76 2.27 3.58

51-100 2.21 0.69 1.55 1.44 2.36

101-250 1.38 0.40 0.65 1.92 2.12

250-500 0.96 0.16 0.38 0.72 1.64

501-high 0.27 0.05 0.12 0.15 0.13

Page 39:

Confidentiality Protection

• The reclassification index is a measure of how many workers were geographically relocated by the synthetic data.

$\sum_{i=1}^{I} \mathrm{abs}(\tilde{y}_{ij} - y_{ij}) / y_j$
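A minimal sketch of this index for a single work block follows. It assumes the denominator is the total confidential count in the work block and that the true and synthetic counts are listed over the same residence blocks; the function name is hypothetical.

```python
def reclassification_index(true_counts, synthetic_counts):
    """Sum of absolute count differences across residence blocks,
    divided by the work block's total confidential count (assumed)."""
    if len(true_counts) != len(synthetic_counts):
        raise ValueError("count vectors must cover the same residence blocks")
    total = sum(true_counts)
    if total == 0:
        raise ValueError("work block has no workers in the confidential data")
    moved = sum(abs(s - t) for t, s in zip(true_counts, synthetic_counts))
    return moved / total


# Hypothetical work block: 5 workers in the confidential data.
index = reclassification_index([2, 3, 0, 0], [1, 2, 1, 1])
```

Identical count vectors give an index of 0; larger values mean that more workers were placed in a different residence block by the synthesizer.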

Page 40:

Panel 1: Reclassification Index for County Residence Patterns

Population in Work Block Mean P25 P50 P75

1-5 0.59 0.00 0.50 1.00

6-10 0.46 0.29 0.42 0.57

11-20 0.35 0.23 0.35 0.44

21-50 0.24 0.17 0.21 0.33

51-100 0.19 0.12 0.16 0.24

101-250 0.11 0.08 0.11 0.14

250-500 0.10 0.04 0.09 0.12

501-high 0.06 0.03 0.04 0.08

Panel 2: Reclassification Index for Census Tract Residence Patterns

Population in Work Block Mean P25 P50 P75

1-5 0.72 0.50 0.67 1.00

6-10 0.49 0.33 0.50 0.63

11-20 0.46 0.33 0.43 0.58

21-50 0.35 0.27 0.33 0.42

51-100 0.29 0.24 0.28 0.33

101-250 0.22 0.16 0.22 0.27

250-500 0.18 0.11 0.17 0.25

501-high 0.14 0.11 0.14 0.17

Panel 3: Reclassification Index for Block Residence Patterns

Population in Work Block Mean P25 P50 P75

1-5 0.85 0.50 1.00 1.00

6-10 0.57 0.42 0.57 0.67

11-20 0.47 0.35 0.47 0.57

21-50 0.37 0.29 0.35 0.43

51-100 0.33 0.28 0.31 0.39

101-250 0.26 0.22 0.25 0.32

250-500 0.25 0.23 0.26 0.27

501-high 0.21 0.19 0.20 0.22

Page 41:

SIPP-SSA-IRS Public Use File

• Links IRS detailed earnings records and Social Security benefit data to public use SIPP data

• Basic confidential data: SIPP (1990-1993, 1996); W-2 earnings data; SSA benefit data

• Gold standard: completely linked, edited version of the data with variables drawn from all of the sources

• Partially-synthetic data: created using the record structure of the existing SIPP panels with all data elements synthesized using Bayesian bootstrap and sequential regression multivariate imputation methods

Page 42:

Multiple Imputation Confidentiality Protection

• Denote confidential data by Y and disclosable data by X.

• Both Y and X may contain missing data, so that Y = (Yobs , Ymis) and X = (Xobs, Xmis).

• Assume database can be represented by joint density p(Y,X,θ).

Page 43:

Sequential Regression Multivariate Imputation Method

• Synthetic data values $\tilde{Y}$ are draws from the posterior predictive density:

$p(\tilde{Y} \mid Y_{obs}, X_{obs}) = \int p(\tilde{Y} \mid Y_{obs}, X_{obs}, \theta)\, p(\theta \mid Y_{obs}, X_{obs})\, d\theta$

• In practice, use a two-step procedure:
1) draw m completed datasets using SRMI (imputes values for all missing data)
2) draw r synthetic datasets for each completed dataset from the predictive density given the completed data.
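The nesting of the two steps can be sketched with a single confidential variable and one disclosable predictor. This is a one-variable stand-in, not SRMI itself: it assumes a normal linear regression with a noninformative prior, and all function names and data values are hypothetical.

```python
import math
import random


def draw_regression_parameters(xs, ys, rng):
    """One posterior draw of (x_bar, intercept, slope, sigma2) for a
    normal linear regression of y on centered x, noninformative prior."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    sse = sum((y - y_bar - slope_hat * (x - x_bar)) ** 2 for x, y in zip(xs, ys))
    # sigma2 = SSE / chi-square(n - 2); chi-square(v) is Gamma(v / 2, scale 2).
    sigma2 = sse / rng.gammavariate((n - 2) / 2.0, 2.0)
    intercept = rng.gauss(y_bar, math.sqrt(sigma2 / n))
    slope = rng.gauss(slope_hat, math.sqrt(sigma2 / sxx))
    return x_bar, intercept, slope, sigma2


def draw_predictive(xs_new, params, rng):
    """Draw new y values given one parameter draw."""
    x_bar, intercept, slope, sigma2 = params
    sd = math.sqrt(sigma2)
    return [intercept + slope * (x - x_bar) + rng.gauss(0.0, sd) for x in xs_new]


def two_step_synthesis(xs, ys, m, r, rng):
    """Step 1: m completed datasets. Step 2: r synthetic datasets each."""
    obs_x = [x for x, y in zip(xs, ys) if y is not None]
    obs_y = [y for y in ys if y is not None]
    implicates = []
    for _ in range(m):
        params = draw_regression_parameters(obs_x, obs_y, rng)
        completed = [y if y is not None else draw_predictive([x], params, rng)[0]
                     for x, y in zip(xs, ys)]
        for _ in range(r):
            params_c = draw_regression_parameters(xs, completed, rng)
            implicates.append(draw_predictive(xs, params_c, rng))
    return implicates


# Hypothetical data: None marks a missing confidential value.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0]
implicates = two_step_synthesis(xs, ys, m=2, r=3, rng=random.Random(0))
```

Each parameter draw followed by a predictive draw is one sample from the integral above, so the procedure yields m × r synthetic implicates of the confidential variable.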

Page 44:

Confidentiality Protection

• Protection is based on the inability of PUF users to re-identify the SIPP record upon which the PUF record is based.

• This prevents wholesale addition of SIPP data to the IRS and SSA data in the PUF

• Goal: re-identification of SIPP records from the PUF should result in true matches and false matches with equal probability

Page 45:

Disclosure Analysis

• Uses probabilistic record linking

• Each synthetic implicate is matched to the gold standard

• All unsynthesized variables are used as blocking variables

• Different matching variable sets are used
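The bookkeeping behind such an analysis can be sketched as follows. The sketch swaps in a simple nearest-neighbor match on squared distance for probabilistic record linking, so it shows only how blocking and the true/false/non-match tally fit together; the record layout, the `id` field used to score matches, and the `max_distance` threshold are hypothetical.

```python
from collections import defaultdict


def link_and_classify(gold, synthetic, block_keys, match_keys, max_distance=None):
    """Match each synthetic record to its closest gold-standard record
    within the same block and tally true, false, and non-matches."""
    blocks = defaultdict(list)
    for record in gold:
        blocks[tuple(record[k] for k in block_keys)].append(record)

    tally = {"true": 0, "false": 0, "non": 0}
    for syn in synthetic:
        candidates = blocks.get(tuple(syn[k] for k in block_keys), [])
        if not candidates:
            tally["non"] += 1
            continue

        def distance(gold_record, syn=syn):
            return sum((gold_record[k] - syn[k]) ** 2 for k in match_keys)

        best = min(candidates, key=distance)
        if max_distance is not None and distance(best) > max_distance:
            tally["non"] += 1
        elif best["id"] == syn["id"]:
            tally["true"] += 1
        else:
            tally["false"] += 1
    return tally


# Hypothetical records: "sex" is unsynthesized (blocking), "earnings" is synthesized.
gold = [
    {"id": 1, "sex": "F", "earnings": 30.0},
    {"id": 2, "sex": "F", "earnings": 50.0},
    {"id": 3, "sex": "M", "earnings": 40.0},
]
synthetic = [
    {"id": 1, "sex": "F", "earnings": 48.0},
    {"id": 2, "sex": "F", "earnings": 52.0},
    {"id": 3, "sex": "M", "earnings": 10.0},
]
tally = link_and_classify(gold, synthetic, ["sex"], ["earnings"])
```

The stated goal of equal true and false match probabilities corresponds to the "true" and "false" tallies being about the same size among linked records.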

Page 46:

Percentage of Non-matches, False Matches, and True Matches
Totals for Group1-Group7

[Figure: chart of percentages (0% to 100%) by cell size category (1 to 9, 11 to 61, 65 to 208, 242 to 501, 607 to 918, 1149 to 3966, 4095 to 31256); series: False Matches, True Matches, Non-matches. Data labels by category, one row per series: 70, 710, 2273, 5586, 12463, 35935, 108205; 10, 64, 264, 665, 1363, 6292, 25121; 3, 55, 179, 462, 947, 3782, 20746]

Page 47:

Testing Analytic Validity

• Run analyses on each synthetic implicate.
– Average coefficients
– Combine standard errors using formulae that take account of average variance of estimates (within implicate variance) and differences in variance across estimates (between implicate variance).

• Run analyses on gold standard data.

• Compare average synthetic coefficient and standard error to the same quantities for the gold standard.

• Analytic validity is measured by the overlap in the coverage of the synthetic and gold standard confidence intervals for a parameter.
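The steps above can be sketched as follows. Two choices are assumptions, since the slides do not give the formulae: the total variance uses one combining rule from the partially synthetic data literature (average within-implicate variance plus between-implicate variance divided by the number of implicates), which ignores the separate missing-data stage, and the overlap measure averages the shared length as a fraction of each interval. Function names are hypothetical.

```python
import statistics


def combine_implicates(estimates, variances):
    """Combine one coefficient across m synthetic implicates.

    Returns the average coefficient, the within-implicate variance,
    the between-implicate variance, and an assumed total variance
    u_bar + b / m.
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)
    u_bar = statistics.fmean(variances)
    b = statistics.variance(estimates)
    return q_bar, u_bar, b, u_bar + b / m


def interval_overlap(synthetic_ci, gold_ci):
    """Average share of each interval covered by their intersection."""
    (low_s, high_s), (low_g, high_g) = synthetic_ci, gold_ci
    low, high = max(low_s, low_g), min(high_s, high_g)
    if high <= low:
        return 0.0
    shared = high - low
    return 0.5 * (shared / (high_s - low_s) + shared / (high_g - low_g))


# Hypothetical coefficient estimates and squared standard errors.
q_bar, u_bar, b, total_variance = combine_implicates(
    [1.0, 2.0, 3.0], [0.5, 0.5, 0.5]
)
overlap = interval_overlap((0.0, 2.0), (1.0, 3.0))
```

An overlap of 1 means the two intervals coincide and 0 means they are disjoint. With the Intercept intervals reported in Table 3, (7.0765, 9.7979) and (7.7770, 7.9269), this particular measure comes to about 0.53.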

Page 48:

Log Annual Earnings Amount

Table 3: Log SER Earnings in 2000 for all individuals with positive earnings in this year

Independent Variables  Avg. Coeff.           Confidence Interval                         Tratio                DF
                       Synthetic  Completed  Synthetic             Completed             Synthetic  Completed  Synthetic  Completed
Intercept              8.4372     7.8519     9.7979, 7.0765        7.9269, 7.7770        14.8082    200.2576   2.8961     6.6134
college_only           0.7618     0.7154     1.0119, 0.5118        0.8012, 0.6295        55.6671    18.4204    0.6308     3.5572
disab_nowork           -0.8474    -0.8595    0.2248, -1.9196       -0.6510, -1.0679      -1.8829    -9.2725    2.9124     3.3828
divorced               -0.1131    -0.1327    0.0108, -0.2371       -0.1139, -0.1516      -2.2564    -11.6721   2.6819     121.5500
femaleblack            -0.4383    -0.4969    -0.3716, -0.5050      -0.4627, -0.5311      -14.0983   -26.7033   3.9099     8.8353
femalewhite            -0.4384    -0.4359    -0.3645, -0.5122      -0.4225, -0.4492      -14.6838   -56.1920   2.6776     21.7965
graduate               0.8479     0.8293     0.9778, 0.7180        0.9179, 0.7407        75.3638    20.4927    0.7458     3.6625
highschool_only        0.2767     0.2344     0.3208, 0.2326        0.2638, 0.2049        19.5239    15.2380    1.8294     6.5118
hispanic               0.0317     0.0784     0.0706, -0.0072       0.0950, 0.0617        3.2362     7.7969     1.4043     129.2977
maleblack              -0.2369    -0.3163    -0.1868, -0.2870      -0.2915, -0.3411      -10.0306   -21.3101   4.0804     58.8818
married                -0.1007    -0.1027    0.0403, -0.2418       -0.0879, -0.1175      -1.7267    -11.6605   2.8145     41.8875
ser_totyears2_2000     -0.008388  -0.013910  0.004399, -0.021174   -0.012314, -0.015506  -1.5559    -17.1739   2.9438     5.5757
ser_totyears3_2000     0.000213   0.000346   0.000585, -0.000160   0.000387, 0.000306    1.3521     16.2383    2.9609     6.9507
ser_totyears4_2000     -0.000002  -0.000003  0.000001, -0.000006   -0.000003, -0.000004  -1.3938    -17.2223   2.9741     8.3544
ser_totyears_2000      0.170864   0.270394   0.358130, -0.016402   0.295549, 0.245240    2.1708     22.0735    2.9221     4.6084
somecollege            0.449987   0.407803   0.925822, -0.025847   0.470358, 0.345247    23.9657    14.2315    0.5699     3.6969
widowed                -0.7116    -0.5511    0.2288, -1.6520       -0.4862, -0.6160      -1.8221    -15.2869   2.8415     10.7620

Page 49:

Log Annual Benefit Amount

Table 4: Log Monthly Benefit Amount in December, 2001 for individuals 62 and older in 2000

Independent Variables  Avg. Coeff.           Confidence Interval                         Tratio                DF
                       Synthetic  Completed  Synthetic             Completed             Synthetic  Completed  Synthetic  Completed
Intercept              5.2392     5.4950     5.3505, 5.1280        5.9115, 5.0784        110.6907   30.3282    3.0110     3.1841
age_first_benefit      0.0163     0.0119     0.0188, 0.0139        0.0190, 0.0048        28.2442    3.8653     1.3292     3.1370
birthdate              0.0000     0.0000     0.0000, 0.0000        0.0000, 0.0000        -3.4365    -7.5960    3.4749     17.3979
black_nonhisp          -0.1381    -0.1485    -0.0999, -0.1763      -0.1338, -0.1632      -7.2851    -16.5883   5.0031     808.8552
college_only           0.1737     0.1620     0.2139, 0.1335        0.1786, 0.1455        8.5725     16.2717    5.3961     97.3222
disab_nowork           -0.0641    -0.0610    -0.0339, -0.0942      -0.0375, -0.0844      -4.3107    -4.7709    4.8339     8.9261
divorced               0.0508     0.0995     0.0800, 0.0216        0.1238, 0.0753        2.9333     6.7859     37.7125    197.0986
graduate               0.1953     0.1752     0.2383, 0.1522        0.1971, 0.1533        9.1176     13.9156    5.0602     16.7543
highschool_only        0.0922     0.0903     0.1076, 0.0769        0.1092, 0.0713        10.9656    9.0304     9.3056     6.9655
hispanic               -0.1035    -0.1601    -0.0848, -0.1223      -0.1307, -0.1895      -9.7891    -10.0587   12.9980    8.5134
male                   0.3544     0.3321     0.3648, 0.3440        0.3402, 0.3239        58.0446    66.8254    29.5818    3339.4344
married                0.0063     0.0682     0.0581, -0.0455       0.0902, 0.0462        0.2367     5.1394     5.7554     101.4151
somecollege            0.1429     0.1372     0.1602, 0.1257        0.1539, 0.1205        14.5992    14.4611    14.1280    13.8189
widowed                0.2047     0.2517     0.2714, 0.1379        0.2775, 0.2260        6.2358     16.4587    4.8026     37.9058

Page 50:

Tools

• NSF sponsored supercomputer

• Virtual RDC

• Cornell INFO 747

Page 51:

The NSF-sponsored Supercomputer on the RDC Network

• NSF01 is a 64-processor (384GB memory) supercomputer

• Installed and optimized for complex data synthesizing and simulation

• Projects related to the ITR grant have access and priority

Page 52:

The Virtual RDC

• Virtual RDC (news server)

• The virtual RDC environment contains multiple servers that closely approximate an RDC compute server (e.g., NSF01)

• Disclosure-proofed metadata and synthetic data

• Now fully operational

• Any current or potential RDC user can have an account

Page 53:

Cornell Information Science 747

• INFO 747

• Course available to any potential RDC user, on DVD and via internet feed

• Training for using RDC-based data products

• Training for creating and testing synthetic data

Page 54:

Conclusions

• An important and challenging area that social scientists must be part of

• Use of confidential data collected by a public agency carries with it an obligation to disseminate enough data to permit scientific discourse

• Synthetic data are an important tool for this dissemination