Sharing Confidential Data
George Alter, University of Michigan
Disclosure: Risk & Harm
• What do we promise when we conduct research about people?
  – That benefits (usually to society) outweigh the risk of harm (usually to individuals)
  – That we will protect confidentiality
• Why is confidentiality so important?
  – Because people may reveal information to us that could cause them harm
  – Examples: criminal activity, antisocial activity, medical conditions...
Who Are We Afraid of?
• Parents trying to find out if their child had an abortion or uses drugs
• A spouse seeking hidden income or evidence of infidelity in a divorce
• Insurance companies seeking to eliminate risky individuals
• Other criminals and nuisances
• NSA, CIA, FBI, KGB, SABOT, SBL, SMERSH, KAOS, etc...
What Are We Afraid of...
• Direct identifiers
  – Inadvertent release of unnecessary information (name, phone number, SSN...)
  – Direct identifiers required for analysis (location, genetic characteristics, ...)
• Indirect identifiers
  – Characteristics that identify a subject when combined (sex, race, age, education, occupation)
Deductive Disclosure
• A combination of characteristics could allow an intruder to re-identify an individual in a survey “deductively,” even if direct identifiers are removed.
• Re-identification depends on
  – Knowing someone in the survey
  – Matching cases to an external database
Deductive Disclosure
• Contextual data increases the risk of disclosure
  – Some attributes can be known by an outsider (age, race)
  – Individuals are more identifiable in smaller populations
• The more specific the geography, the more attention must be paid to disclosure risk.
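The risk of deductive disclosure from combined indirect identifiers can be illustrated with a simple uniqueness check. This is a minimal sketch with invented records and column names; any combination of quasi-identifiers shared by fewer than k respondents marks a re-identifiable case:

```python
from collections import Counter

# Hypothetical survey records; sex, race, age band, and education
# serve as the quasi-identifiers (all values are illustrative).
records = [
    {"sex": "F", "race": "B", "age": "30-39", "edu": "PhD"},
    {"sex": "F", "race": "B", "age": "30-39", "edu": "PhD"},
    {"sex": "M", "race": "W", "age": "40-49", "edu": "HS"},
    {"sex": "M", "race": "W", "age": "40-49", "edu": "HS"},
    {"sex": "F", "race": "A", "age": "70-79", "edu": "MD"},  # unique combination
]

def k_anonymity_violations(rows, quasi_ids, k=2):
    """Return quasi-identifier combinations shared by fewer than k rows."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return {combo: n for combo, n in counts.items() if n < k}

risky = k_anonymity_violations(records, ["sex", "race", "age", "edu"])
print(risky)  # the 70-79 female respondent appears only once -> re-identifiable
```

An intruder who knows one person with that rare combination can match them to the record deductively, which is why such cases are grouped, swapped, or suppressed.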
Contextual Data in Social Science Research
• Geographic context
  – Neighborhood characteristics, economic conditions, health services, distance to resources, etc.
• Institutional context
  – School
  – Hospital
  – Prison
Current Survey Designs Increase the Risks of Disclosing Subjects’ Identities
• Geographically referenced data
• Longitudinal data
• Multi-level data:
  – Student, teacher, school, school district
  – Patient, clinic, community
Protecting Confidential Data
• Safe data: modify the data to reduce the risk of re-identification
• Safe projects: review research designs
• Safe settings: physical isolation and secure technologies
• Safe people: training and data use agreements
• Safe outputs: results are reviewed before being released to researchers
Safe Data
Disclosure risks can be reduced by:
• Multiple sites rather than single locations
• Keeping sampling locations secret
  – Releasing characteristics of contexts without providing locations
• Oversampling rare characteristics
Safe Data
Data masking techniques:
• Grouping values
• Top-coding
• Aggregating geographic areas
• Swapping values
• Suppressing unique cases
• Sampling within a larger data collection
• Adding “noise”
• Replacing real data with synthetic data
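A few of these masking techniques can be sketched in code. This is an illustrative example with invented values, not a production routine; real masking must be tuned to the dataset and its analytic uses:

```python
import random

ages = [23, 35, 41, 52, 67, 90]  # 90 is an identifiable outlier

def top_code(values, cap):
    """Top-coding: replace values above a cap with the cap itself."""
    return [min(v, cap) for v in values]

def group(values, width):
    """Grouping: report the lower bound of a band instead of the exact value."""
    return [(v // width) * width for v in values]

def add_noise(values, scale, rng=random.Random(0)):
    """Noise addition: perturb each value by a small random amount."""
    return [v + rng.gauss(0, scale) for v in values]

print(top_code(ages, 80))  # [23, 35, 41, 52, 67, 80]
print(group(ages, 10))     # [20, 30, 40, 50, 60, 90]
```

Each transformation trades some analytic precision for a lower chance that an outlier, such as the 90-year-old above, can be matched to a real person.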
Safe Projects
• Research plans are reviewed before access is approved
• Levels of project review:
  1. Does the research plan require confidential data?
  2. Would the research plan identify individual subjects?
  3. Is the research scientifically sound? Does it “serve the public good”?
• Scientific review requires standards and expertise
Safe Settings
• Data protection plans
• Remote submission and execution
• Virtual data enclave
• Physical enclave
Data Protection Plans should address risks:
• unauthorized use of an account on a computer
• computer break-in by exploiting a vulnerability
• hijacking of a computer by malware or botware
• interception of network traffic between computers
• loss of a computer or media
• theft of a computer or media
• eavesdropping on electronic output on a computer screen
• unauthorized viewing of paper output
We often focus too much on technology and not enough on risk.
Safe Settings
Improving Data Security Plans
• Problems
  – PIs lack technical expertise
  – Requirements are inconsistent and confusing
  – Monitoring compliance is expensive
• An alternative: institution-level data security protocols
  – Tiered guidelines for different levels of risk
  – Focus on mitigating risks, not specifying technologies
  – Certification of researchers
  – Institutional oversight
Safe Settings
• Remote submission and execution
  – User submits program code or scripts, which are executed in a controlled environment
• Virtual data enclave
  – Remote desktop technology prevents moving data to the user’s local computer
  – Requires a data use agreement
• Physical enclave
  – Users must travel to the data
Safe Settings
Virtual Data Enclave
The Virtual Data Enclave (VDE) provides remote access to quantitative data in a secure environment.
Safe People
• Data use agreements
• Training
Safe People
• Parts of a data use agreement at ICPSR
  – Research plan
  – IRB approval
  – Data protection plan
  – Behavior rules
  – Security pledge
  – Institutional signature
[Diagram: flow of confidential data. Informed consent governs the interview. Data flow from the data producer to the data archive under a data dissemination agreement, and from the archive to the researcher under a data use agreement signed by the researcher’s institution, which includes a research plan, IRB approval, and a data protection plan.]
Data Use Agreement: Behavior Rules
To avoid inadvertent disclosure of persons, families, households, neighborhoods, schools, or health services, use the following guidelines in the release of statistics derived from the dataset:
1. In no table should all cases in any row or column be found in a single cell.
2. In no case should the total for a row or column of a cross-tabulation be fewer than ten.
3. In no case should a quantity figure be based on fewer than ten cases.
4. In no case should a quantity figure be published if one case contributes more than 60 percent of the amount.
5. In no case should data on an identifiable case, or any of the kinds of data listed in preceding items 1–3, be derivable through subtraction or other calculation from the combination of tables released.
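Rules 1 and 2 can be checked mechanically before a table is released. The sketch below is a hypothetical helper, not ICPSR software; it scans a cross-tabulation (a list of rows of counts) for single-cell concentration and small row/column totals:

```python
def table_violations(table, min_total=10):
    """Check a cross-tabulation against the release rules:
    rule 1: no row or column may have all of its cases in a single cell;
    rule 2: no row or column total may be fewer than min_total."""
    problems = []
    cols = list(zip(*table))
    for i, row in enumerate(table):
        if sum(row) > 0 and max(row) == sum(row):
            problems.append(f"row {i}: all cases in one cell")
        if sum(row) < min_total:
            problems.append(f"row {i}: total {sum(row)} < {min_total}")
    for j, col in enumerate(cols):
        if sum(col) > 0 and max(col) == sum(col):
            problems.append(f"col {j}: all cases in one cell")
        if sum(col) < min_total:
            problems.append(f"col {j}: total {sum(col)} < {min_total}")
    return problems

# A 2x2 table whose first row is both concentrated and too small,
# and whose second column is concentrated in one cell
print(table_violations([[7, 0], [12, 15]]))
```

Rules 4 and 5 (dominance and subtraction attacks) need additional context, such as the underlying quantities and the full set of released tables, so they usually require manual review.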
Data Use Agreement
The Recipient Institution will treat allegations, by NAHDAP/ICPSR or other parties, of violations of this agreement as allegations of violations of its policies and procedures on scientific integrity and misconduct. If the allegations are confirmed, the Recipient Institution will treat the violations as it would violations of the explicit terms of its policies on scientific integrity and misconduct.
Problems with DUAs
• DUAs are issued by project
  – Every PI gets a new DUA, even if the institution has already signed one for someone else
• Language and conditions in DUAs are not standard
  – Frequent negotiations and lawyering
Reducing the Costs of DUAs
• Institution-wide agreements
  – One agreement per institution, not per project
  – A designated “data steward” adds qualified researchers to the agreement
  – Example: Databrary Agreement
    • Covers informed consent, data sharing, and data use
    • Researcher certification covering multiple datasets
Disclosure: Graph with Extreme Values
[Figure: box plots of age, one for respondents arrested in the last year (“yes”) and one for those who were not (“no”), with outliers labeled by case number.]
Data were collected for a sample of 104 people in a county. Among the variables collected were age, gender, and whether the person was arrested within the last year. The potential identifiability represented by outlying values is compounded here by an unusual combination that could probably be identified using public records for a county in the U.S.: someone approximately 90 years old was arrested in the sample. Including extreme values is a disclosure risk when combined with other variables in the dataset.
N: 104 · min age: 12 · max age: 95 · mean age: 51 · std dev: 15 · % female: 5.2 · % arrested: 5.8
Safe People: Disclosure Risk Online Tutorial

Safe Outputs
• Controlled environments allow review of outputs
  – Remote submission and execution
  – Virtual data enclaves
  – Physical enclaves
• Disclosure checks may be automated, but manual review is usually necessary
Weighing Costs and Benefits
• Data protection has costs
  – Modifying data affects analysis
  – Access restrictions impose burdens on researchers
• Protection measures should be proportional to risks
  – Probability that an individual can be (re-)identified
  – Severity of harm resulting from re-identification
Gradient of Risk & Restriction
[Chart: access restrictions arrayed by severity of harm (vertical axis) and probability of disclosure (horizontal axis).]
• Tiny risk → web access: simple data, minimal harm & very low chance of disclosure
• Some risk → data use agreement: complex data, low harm & low probability of disclosure
• Moderate risk → strong DUA & technology rules: complex data, moderate harm & re-identifiable with difficulty
• High risk → enclosed data center: high severity of harm & highly identifiable
Thank you
George Alter
University of Michigan
[email protected]
Secure Multi-Party Computing
What if databases (Database 1, Database 2) could send data to a trusted third party, who would compute statistics? MPC does this without the third party.
Encryption
Average Income
Three people have true salaries S1, S2, S3, which they never reveal. Each party i generates random numbers Rij (sent from i to j) and reports their salary plus the random numbers they sent, minus those they received:

  X1 = S1 + (R12 + R13) – (R21 + R31)
  X2 = S2 + (R21 + R23) – (R12 + R32)
  X3 = S3 + (R31 + R32) – (R13 + R23)
  ————————————————————
  X1 + X2 + X3 = S1 + S2 + S3

Every Rij appears once with a plus sign and once with a minus sign, so the masks cancel in the sum while each Xi alone reveals nothing about Si.

Example from Daniel Goroff, Alfred P. Sloan Foundation
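The masked-sum protocol above can be sketched directly. This is an illustrative simulation (all parties run in one process, salaries are invented); a real deployment would exchange the masks over secure channels:

```python
import random

def secure_sum(salaries, rng=random.Random()):
    """Each party i sends a random mask R[i][j] to every other party j,
    then publishes X_i = S_i + (masks sent) - (masks received).
    The masks cancel, so sum(X) == sum(S) while no S_i is revealed."""
    n = len(salaries)
    R = [[rng.randrange(10**6) if i != j else 0 for j in range(n)]
         for i in range(n)]
    X = [salaries[i]
         + sum(R[i][j] for j in range(n))   # masks party i sent out
         - sum(R[j][i] for j in range(n))   # masks party i received
         for i in range(n)]
    return sum(X)

salaries = [50_000, 72_000, 64_000]
total = secure_sum(salaries)
print(total / len(salaries))  # average income: 62000.0
```

Each published Xi looks random on its own, yet the total, and hence the average, is exact.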
Homomorphic Encryption
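As a concrete illustration of homomorphic encryption, here is a toy sketch of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts, so a third party can aggregate data it cannot read. The tiny primes are for demonstration only and are completely insecure:

```python
import math
import random

def keygen(p=61, q=53):
    """Toy Paillier key generation (insecure textbook parameters)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1
    # mu = (L(g^lam mod n^2))^-1 mod n, where L(x) = (x - 1) // n
    x = pow(g, lam, n * n)
    mu = pow((x - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

pk, sk = keygen()
n = pk[0]
c = (encrypt(pk, 12) * encrypt(pk, 30)) % (n * n)  # add under encryption
print(decrypt(pk, sk, c))  # 42
```

The key property is E(a) · E(b) mod n² = E(a + b mod n), which is what lets an untrusted aggregator compute a sum of encrypted values without ever decrypting the inputs.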