Assessing Disclosure Risk and Analytical Validity for the SIPP-SSA-IRS Public Use File Beta Version 4.1 John M. Abowd U.S. Census Bureau and Cornell University CNSTAT Panel on the Census Bureau’s Dynamics of Economic Well-being System January 26, 2007
Assessing Disclosure Risk and Analytical Validity for the SIPP-SSA-IRS Public Use File Beta Version 4.1
John M. Abowd
U.S. Census Bureau and Cornell University
CNSTAT Panel on the Census Bureau’s Dynamics of Economic Well-being System
January 26, 2007
Background
• Longstanding goal of the Census Bureau
– Statutory mandate to provide survey data used to study critical policy issues
– Focus of a longstanding internal Census Bureau survey improvement project that is part of the LEHD Program
– This is the first Title 13/Chapter 5 predominant purpose for using IRS data
• Treasury Regulation Change, February 2001 (final regulation February 2003)
– New W-2 items authorized: SSN, EIN, Box 1, Box 3, Box 13, number of quarters, 1099R
• Creation of a public use data set that integrates survey and administrative data is the other predominant Title 13/Chapter 5 purpose
Team and Sponsorship
• The project was conducted by a team of researchers from the
Census Bureau, IRS, Social Security Administration, and a
consortium of university partners
• Main financial support provided by the Census Bureau, Social
Security Administration, and the National Science Foundation
• Primary design decisions made by an inter-agency team led by
Martha Stinson at the Census Bureau and with the participation
of SSA, IRS, the Congressional Budget Office, and the Joint
Committee on Taxation
Acknowledgements: Research Team
• Martha Stinson (Census Bureau), project manager
• Gary Benedetto, Lisa Dragoset, Sam Hawala, Bryan Ricchetti (Census Bureau)
• Karen Masken (IRS)
• Simon Woodcock (Simon Fraser University), Jerry Reiter (Duke University), Josep Domingo-Ferrer (University of Rovira and Virgili), Vicenc Torra (University of Barcelona), Lars Vilhuber (Cornell University and Census Bureau), consultants
Acknowledgements I: Agencies
• Kenneth Prewitt, C. Louis Kincannon, Hermann Habermann,
Paula Schneider, Nancy Gordon, Frederick Knickerbocker,
Cynthia Clark, Howard Hogan, and Thomas Mesenbourg,
senior management Census Bureau
• Susan Grad, Howard Iams, and Paul van de Water, senior
management SSA
• Mark Mazur and Nicholas Greenia, IRS senior management
and IRS/SOI Census Bureau disclosure liaison
• Daniel Newlon, NSF project officer
Acknowledgements II: Agencies
• Chet Bowie, Al Tupek, Barry Sessamen, Dan Weinberg, Ron Prevost,
Jeremy Wu, division and program management Census Bureau
• Brian Greenberg, Dawn Haynes, SSA technical support, contract
management, and disclosure officers
• Patricia Doyle, Judith Eargle and Nancy Bates, Census Bureau SIPP
research direction
• Charlene Leggieri and Sally Obenski, Census Bureau administrative
records management
• Laura Zayatz, Census Bureau statistical disclosure research direction
• John Sabelhaus, Congressional Budget Office research direction
Conceptual Framework
• Link all SIPP panels from the 1990s
– Five panels: 1990, 1991, 1992, 1993, 1996
• Link to IRS data
– Summary Earnings Records (FICA taxable earnings 1937-1950, and 1951-2003 annual)
– Detailed Earnings Record (job level data, uncapped, 1978-2003 annual)
– History Update System, 831 file (all available historical data through 2002)
• Create product that prevents individuals from being re-identified in the current public use SIPP files
Major Design Decisions
• Limit number of SIPP variables included
• Target national retirement and disability research communities
• Investigate disclosure avoidance methods to protect both survey and administrative data
• But, note that a re-identification in the current SIPP public use files is not a disclosure since those files have also been subjected to extensive disclosure avoidance procedures
• Very high hurdle
Latest Versions
• Gold Standard confidential file at release 4.0
– All confidential data (person-level), all sources
• Beta Public Use File 4.1
– All person-level SIPP, IRS variables from the Gold Standard
Version 4.0
– Benefit and type of benefit measures for initial SSA benefit (if any),
benefit and type of benefit as of April 1, 2000
– Consistent panel weight for civilian, non-institutional population as
of April 1, 2000 (synthesized on each implicate)
– Four missing data implicates with four synthetic implicates each (16
implicates total)
Summary of Discussion Today
• A tour of the methods used to complete and synthesize the SIPP-PUF
• Some disclosure avoidance results
• Selected analytical validity results
Multiple Imputation Confidentiality Protection History
• Rubin (1993): treat unsampled individuals in population as missing the survey data, impute missing values (synthetic population), sample and release (fully synthetic data)
• Little (1993): treat sensitive values as missing, impute and release imputed values (partially synthetic data)
• Fienberg (1994): parametric Bayesian procedure eliminated the use of any actual values in synthetic data
• Raghunathan, Reiter, and Rubin (2003): adapted the Sequential Regression Multivariate Imputation method to synthetic data
• Reiter (2004): Inference-valid combination of multiple imputation for missing and synthetic data
• Abowd and Woodcock (2001): Applied SRMI to confidentiality protection of longitudinally linked employer-employee synthetic micro-data
• Synthetic data values are draws from the posterior predictive density:
$$p(\tilde{Y} \mid Y^{obs}, X^{obs}) = \int p(\tilde{Y} \mid Y^{obs}, X^{obs}, \theta)\, p(\theta \mid Y^{obs}, X^{obs})\, d\theta$$
• In practice, use a two-step procedure:
1) complete the missing data using SRMI
2) draw synthetic data from the predictive density given the completed data
• Repeating the procedure yields multiple synthetic data implicates
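The two-step procedure can be illustrated with a minimal Python sketch. It assumes a toy univariate normal model with the standard noninformative prior, which is far simpler than the models used for the SIPP-SSA-IRS file; the function names (`draw_posterior_params`, `complete`, `synthesize`) and the sample data are hypothetical and chosen only for illustration.

```python
import random
import statistics


def draw_posterior_params(values, rng):
    """Draw (mu, sigma2) from the posterior of a normal model under the
    noninformative prior p(mu, sigma2) proportional to 1/sigma2."""
    n = len(values)
    ybar = statistics.fmean(values)
    s2 = statistics.variance(values)  # sample variance, n - 1 denominator
    # sigma2 | y ~ (n - 1) * s2 / chi2_{n-1}; chi2_k is Gamma(shape=k/2, scale=2)
    chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
    sigma2 = (n - 1) * s2 / chi2
    # mu | sigma2, y ~ N(ybar, sigma2 / n)
    mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
    return mu, sigma2


def complete(values, rng):
    """Step 1: replace missing entries (None) with posterior predictive draws
    that condition on the observed entries only."""
    observed = [v for v in values if v is not None]
    mu, sigma2 = draw_posterior_params(observed, rng)
    sd = sigma2 ** 0.5
    return [v if v is not None else rng.gauss(mu, sd) for v in values]


def synthesize(completed, rng):
    """Step 2: replace every value with a posterior predictive draw that
    conditions on the completed data."""
    mu, sigma2 = draw_posterior_params(completed, rng)
    sd = sigma2 ** 0.5
    return [rng.gauss(mu, sd) for _ in completed]


rng = random.Random(0)
y = [1.0, None, 3.0, 2.5, None, 4.0]  # made-up data with two missing values
completed = complete(y, rng)
# Repeating step 2 yields several synthetic implicates per completed data set.
implicates = [synthesize(completed, rng) for _ in range(4)]
```

Each call draws the parameters afresh before drawing the data, which is what makes the result a draw from the posterior predictive density and not from a plug-in estimate.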
SRMI Method Details
• Specifying the joint density p(Y,X,θ) is unrealistic in most applications
• Instead, approximate the joint density by a sequence of conditional densities defined by generalized linear models
• Synthetic values $\tilde{y}_k$ of some variable $y_k \in Y$ are draws from:
$$p_k(\tilde{y}_k \mid Y^m, X^m) = \int p_k(\tilde{y}_k \mid Y^m, X^m, \theta_k)\, p_k(\theta_k \mid Y^m, X^m)\, d\theta_k$$
where $Y^m, X^m$ are completed data, and densities $p_k$ are defined by an appropriate generalized linear model and prior, a Dirichlet-multinomial model, or a Bayesian Bootstrap
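As a rough illustration of approximating a joint density by a sequence of conditional densities, the Python sketch below cycles through two continuous variables, each modeled by a normal linear regression on the other with a flat prior. This is an assumption-laden toy, not the production SRMI code: the function names, the two-variable setup, the mean-based starting values, and the data are all hypothetical. The default of nine passes mirrors the iteration count the slides give for completing missing data.

```python
import random
from statistics import fmean


def draw_conditional(y, x, rng):
    """Fit y on x by simple linear regression, draw (sigma2, intercept, slope)
    from their posterior under a flat prior, and return a function that draws
    y given a new x from the resulting predictive density."""
    n = len(y)
    xbar, ybar = fmean(x), fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    sse = sum((yi - ybar - slope_hat * (xi - xbar)) ** 2 for xi, yi in zip(x, y))
    # sigma2 | data ~ SSE / chi2_{n-2}; chi2_k is Gamma(shape=k/2, scale=2)
    sigma2 = sse / rng.gammavariate((n - 2) / 2.0, 2.0)
    # With x centered, intercept and slope are independent a posteriori.
    alpha = rng.gauss(ybar, (sigma2 / n) ** 0.5)
    beta = rng.gauss(slope_hat, (sigma2 / sxx) ** 0.5)
    sd = sigma2 ** 0.5
    return lambda x_new: rng.gauss(alpha + beta * (x_new - xbar), sd)


def impute_pass(target, other, missing, rng):
    """Re-impute the missing entries of `target` given current values of `other`,
    fitting the regression on rows where `target` was actually observed."""
    rows = [i for i, m in enumerate(missing) if not m]
    draw = draw_conditional([target[i] for i in rows], [other[i] for i in rows], rng)
    return [draw(other[i]) if m else target[i] for i, m in enumerate(missing)]


def srmi_two_vars(a, b, rng, iterations=9):
    """Complete two variables by cycling through their conditional models."""
    miss_a = [v is None for v in a]
    miss_b = [v is None for v in b]
    mean_a = fmean(v for v in a if v is not None)
    mean_b = fmean(v for v in b if v is not None)
    # Crude starting values: fill each gap with the observed mean.
    a_cur = [mean_a if m else v for v, m in zip(a, miss_a)]
    b_cur = [mean_b if m else v for v, m in zip(b, miss_b)]
    for _ in range(iterations):
        a_cur = impute_pass(a_cur, b_cur, miss_a, rng)
        b_cur = impute_pass(b_cur, a_cur, miss_b, rng)
    return a_cur, b_cur


rng = random.Random(1)
a = [1.0, 2.1, None, 4.2, 5.0, None, 7.1]   # made-up data
b = [2.0, None, 6.1, 8.3, 9.8, 12.2, None]  # made-up data
a_done, b_done = srmi_two_vars(a, b, rng)
```

The sketch needs at least three observed rows per variable and some variation in the predictor; a real implementation would also handle discrete variables with the other model families the slide lists.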
Maintaining Relationships in the Underlying Data
• Define a multilevel parent-child tree to describe the exact relationships
in the data
• Variables at the root of this tree should have values for all individuals,
completed and synthesized first (but as a function of all data)
• Child variables only completed or synthesized when appropriate given
the parent variable
• For missing data, iterate nine times to complete all missing data,
sample 4 implicates
• For synthetic data, condition on values from the completed data,
sample 4 implicates per completed implicate
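The parent-child gating and the nested implicate structure can be sketched in Python as follows. The variable names (`ever_married`, `age_first_marriage`), the two-level tree, and the placeholder `draw` function are hypothetical stand-ins for the real variables and models; only the 4 x 4 structure and the rule that a child is filled only when its parent allows it come from the slides.

```python
import random

# Hypothetical tree: each variable maps to None (root) or to
# (parent variable, parent value for which the child applies).
TREE = {
    "ever_married": None,
    "age_first_marriage": ("ever_married", True),
}
ORDER = ["ever_married", "age_first_marriage"]  # parents before children


def draw(var, rng):
    """Placeholder for a model-based draw; a real system would condition on data."""
    if var == "ever_married":
        return rng.random() < 0.6
    return rng.randint(18, 40)


def fill(record, rng, replace_all):
    """Walk the tree from the root. With replace_all=False only missing values
    are completed; with replace_all=True every applicable value is synthesized."""
    out = dict(record)
    for var in ORDER:
        parent = TREE[var]
        if parent is not None and out[parent[0]] != parent[1]:
            out[var] = None  # child is structurally not applicable
            continue
        if replace_all or out.get(var) is None:
            out[var] = draw(var, rng)
    return out


def make_implicates(records, seed=0, m=4, r=4):
    """Return {(i, j): data set} with m completed implicates and
    r synthetic implicates per completed implicate."""
    rng = random.Random(seed)
    result = {}
    for i in range(1, m + 1):
        completed = [fill(rec, rng, replace_all=False) for rec in records]
        for j in range(1, r + 1):
            result[(i, j)] = [fill(rec, rng, replace_all=True) for rec in completed]
    return result


records = [
    {"ever_married": True, "age_first_marriage": None},
    {"ever_married": None, "age_first_marriage": None},
]
implicates = make_implicates(records)
```

Keying each data set by the pair (completed implicate, synthetic implicate) keeps the nesting explicit, which matters when variance from the two stages is combined later.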
Maintaining Multivariate Distributions
• Automated creation and management of stratifying (grouping) variables and conditioning variables
• Bayesian bootstrap procedure for sets of related discrete variables estimated using the automated grouping
• SRMI procedure for most continuous variables using automated grouping, conditioning variable management, Bayesian model selection
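A Bayesian bootstrap for a set of related discrete variables within stratifying groups might look like the Python sketch below. It assumes Rubin's formulation, in which donor weights are a Dirichlet(1, ..., 1) draw, and it resamples whole tuples so that relationships among the related variables are preserved. The function names, the grouping variable, and the records are hypothetical; the automated creation of groups described in the slides is not shown.

```python
import random
from collections import defaultdict


def bayesian_bootstrap(donors, k, rng):
    """Draw k values from `donors` using Dirichlet(1, ..., 1) weights,
    generated as normalized Exp(1) draws."""
    gammas = [rng.expovariate(1.0) for _ in donors]
    total = sum(gammas)
    weights = [g / total for g in gammas]
    return rng.choices(donors, weights=weights, k=k)


def synthesize_by_group(records, group_key, target_vars, rng):
    """Within each group, replace the tuple of target variables on every record
    with a tuple resampled jointly from that group's donors."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)
    out = []
    for members in groups.values():
        donors = [tuple(m[v] for v in target_vars) for m in members]
        draws = bayesian_bootstrap(donors, len(members), rng)
        for member, drawn in zip(members, draws):
            new = dict(member)
            new.update(zip(target_vars, drawn))
            out.append(new)
    return out


rng = random.Random(2)
records = [  # made-up records; "panel" plays the role of a grouping variable
    {"panel": 1990, "sex": "F", "educ": "HS"},
    {"panel": 1990, "sex": "M", "educ": "BA"},
    {"panel": 1990, "sex": "F", "educ": "BA"},
    {"panel": 1996, "sex": "M", "educ": "HS"},
    {"panel": 1996, "sex": "F", "educ": "MA"},
]
synthetic = synthesize_by_group(records, "panel", ["sex", "educ"], rng)
```

Because the draws are existing tuples from the same group, this sketch never produces a combination that was not observed in that group.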
Maintaining Univariate Distributions
• Automated management of sets of related continuous variables (e.g., earnings histories)
• Within stratifying groups, automated management of a non-parametric transform with inverse transform to preserve the univariate distribution of all continuous variables within group
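One way to picture a non-parametric transform with an inverse transform is a rank-based normal-score mapping, sketched below in Python. This is an assumed stand-in, since the slides do not specify the transform: values are mapped to standard normal scores by rank, a model would then operate on the normal scale, and modeled values are mapped back through the empirical quantiles of the original variable. The function names and data are hypothetical, and in practice the mapping would be applied separately within each stratifying group.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def to_normal_scores(values):
    """Map each value to the standard normal quantile of its mid-rank,
    (rank + 0.5) / n with ranks starting at 0."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = _STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def from_normal_scores(scores, reference):
    """Map normal-scale values back to the empirical quantiles of `reference`,
    so the output follows the reference variable's univariate distribution."""
    ref = sorted(reference)
    n = len(ref)
    out = []
    for z in scores:
        p = _STD_NORMAL.cdf(z)
        idx = min(int(p * n), n - 1)  # index of the empirical quantile
        out.append(ref[idx])
    return out


earnings = [10.0, 250.0, 30.0, 5000.0, 75.0]  # made-up skewed data
z = to_normal_scores(earnings)
round_trip = from_normal_scores(z, earnings)
tails_and_middle = from_normal_scores([-3.0, 0.0, 3.0], earnings)
```

Here `round_trip` reproduces `earnings`, and `tails_and_middle` is `[10.0, 75.0, 5000.0]`. Note that this simple inverse returns actual observed values, so a disclosure-limiting implementation would presumably smooth or interpolate the quantile function instead.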
SRMI Example: Date of Birth
• Link administrative birth date (more accurate)
• Take birth date from Bayesian bootstrap link of couple administrative records when SSN is not available
• Formulate grouping and control variable lists and hierarchy (two sets)