Transcript
1. Perfectly Anonymous Data is Perfectly Useless Data Micah
Altman Director of Research--MIT Libraries Prepared for NISO-RDA
Symposium Privacy Implications of Research Data Denver, CO Sept
2016
2. DISCLAIMER These opinions are my own; they are not the opinions of MIT, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators. Secondary disclaimer: "It's tough to make predictions, especially about the future!" -- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. DeMille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
3. Collaborators & Co-Conspirators
- Privacy Tools for Sharing Research Data Team (Salil Vadhan, P.I.) http://privacytools.seas.harvard.edu/people
- Research Support
  - Supported in part by NSF grant CNS-123723
  - Supported in part by the Sloan Foundation
4. Related Work
Main project:
- Privacy Tools for Sharing Research Data http://privacytools.seas.harvard.edu/
Related publications:
- Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072, 2016.
- Wood A, Airoldi E, Altman M, de Montjoye Y, Gasser U, O'Brien D, Vadhan S. Privacy Tools project response to Common Rule Notice of Proposed Rule Making. Comments on Regulations.gov, 2016. (Copy available here: http://informatics.mit.edu/publications/privacy-tools-project-response-common-rule-notice-proposed-rule-making)
- Vayena E, Gasser U, Wood A, O'Brien D, Altman M. Elements of a New Ethical and Regulatory Framework for Big Data Research. Washington and Lee Law Review 72(3): 420-442, 2016.
- Altman M, Capps C, Prevost R. Location Confidentiality and Official Surveys. Social Science Research Network [Internet], 2016.
Slides and reprints available from: informatics.mit.edu
5. Today's Perspectives & Provocations
- What is information privacy?
- Privacy, utility, and use
- Calibrating protection and risk
6. What is Information Privacy?
7. Some Definitions
- Privacy: control over the extent and circumstances of sharing.
- Confidentiality: control over disclosure of information.
- Identifiability: the potential for learning about individuals based on their inclusion in a dataset.
- Sensitivity: the potential for harm if information is disclosed and used to learn about individuals.
8. Privacy is not Anonymization
- Anonymization / deidentification / PII are legal concepts
- May have no rigorous formal definition
- Definitions vary by law and may include:
  - Presence of specific attributes (e.g., PII, HIPAA identifiers)
  - Feasibility of record linkage ...
  - Evaluation of the knowledge of the data publisher (e.g., no actual knowledge, readily ascertainable)
- Legal definitions do not match statistical/scientific concepts
9. Privacy is not Information Security
Security properties: confidentiality (secrecy), integrity, availability, authenticity, non-repudiation.
- Information security maintains many security properties within an information system
- Failures of information secrecy may also violate privacy
- Information security controls access; it does not control computation, use, or inference
- Information security functions only within an information system; privacy depends on what happens to information before, during, and after it is in an information system
10. Privacy aims to Prevent Harm from Disclosure
Harms may fall on: data subjects, vulnerable groups, institutions, and society.
11. Public Data isn't Always
- The "Doesn't Stay in Vegas" problem -- information shared locally can be found anywhere
- Digitization and APIs make aggregation easy
- Many public records were not practically public until now
- Creates new business models
- Provokes sector-by-sector legislative reaction
More Information: Sweeney, Latanya. "Discrimination in online ad delivery." Queue 11.3 (2013): 10. Wikipedia: mugshot databases.
12. Privacy, Utility, and Use (Cases)
13. No Free Lunch for Privacy
Any data analysis that is useful* leaks some measurable private information.
14. Scientific Trend: More Sophisticated Reidentification Concepts
- Record linkage ("Where's Waldo"): match a real person to a precise record in a database. Examples: direct identifiers. Caveats: satisfies compliance for specific laws, but not generally; substantial potential for harm remains.
- Indistinguishability ("hidden in the crowd"): individuals can be linked only to a cluster of records (of known size). Examples: k-anonymity, attribute disclosure. Caveats: potential for substantial harms may remain; must specify what external information is observable; need diversity for sensitive attributes.
- Limited adversarial learning ("confidentiality guaranteed"): formally bounds the total learning about any individual that occurs from a data release. Examples: differential privacy, zero-knowledge proofs. Caveats: challenging to implement; often requires interactive systems.
More Information:
Willenborg, Leon, and Ton De Waal. Elements of Statistical Disclosure Control. Vol. 155. Springer Science & Business Media, 2012.
Fung, Benjamin, et al. "Privacy-preserving data publishing: A survey of recent developments." ACM Computing Surveys (CSUR) 42.4 (2010): 14.
Dwork, Cynthia. "Differential privacy: A survey of results." International Conference on Theory and Applications of Models of Computation. Springer Berlin Heidelberg, 2008.
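The "hidden in the crowd" notion can be made concrete. As a minimal sketch (my own illustration, not code from the talk; the records and quasi-identifier names are hypothetical), the k of k-anonymity is simply the size of the smallest group of records sharing the same quasi-identifier values:

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the k of k-anonymity: the size of the smallest group of
    records that share the same values on the quasi-identifiers."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Hypothetical records: birth year and 3-digit ZIP prefix as quasi-identifiers.
records = [
    {"birth_year": 1961, "zip3": "021", "flavor": "Raspberry"},
    {"birth_year": 1961, "zip3": "021", "flavor": "Pistachio"},
    {"birth_year": 1972, "zip3": "940", "flavor": "Chocolate"},
    {"birth_year": 1972, "zip3": "940", "flavor": "Hazelnut"},
    {"birth_year": 1989, "zip3": "021", "flavor": "Peach"},
]
print(k_anonymity(records, ["birth_year", "zip3"]))  # 1: the 1989 record is unique
```

A dataset is k-anonymous only for a stated set of quasi-identifiers; as the slide's caveats note, even a large k does not protect a sensitive attribute if everyone in a group shares it.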
15. What use is it?
- Utility is defined broadly as the analytical value of the data
- There is no universal operational definition of utility
- There is a structural tradeoff between privacy and utility: you can't simultaneously maximize both
- However, new methods can sometimes do better than traditional anonymization on both fronts.
16. Some Approaches to Measuring Utility
- Statistical: precision, bias, variance, statistical inference
- Semantic: truthfulness, completeness, consistency, interpretability
- Computational: entropy, information complexity
Image sources:
https://commons.wikimedia.org/wiki/File:Solna_Brick_wall_Stretcher_bond_variation1.jpg
https://commons.wikimedia.org/wiki/File:Archery_Target_80cm.svg
17. Challenges to Measuring Utility
- Utility may not track quality or value
- Quality depends on intended use
- Value depends on intended and unintended uses, and on markets
- Utility at one level doesn't match utility at another level: truthful records vs. accurate aggregates
(Figure: levels of analysis -- measure, record, aggregate, model; quantity vs. data.)
18. Traditional Confidentiality-Enhancing Data Publishing
(Figure: redacted records, e.g. "* Jones * * 1961 021*", feeding into published outputs.)
Modal practice: "The correlation between X and Y was large and statistically significant."
Published outputs: summary statistics; contingency tables; public-use sample microdata; information visualization.
19. Just some data?

Name      SSN    Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
A. Jones  12341  01011961   02145    M       Raspberry           0
B. Jones  12342  02021961   02138    M       Pistachio           0
C. Jones  12343  11111972   94043    M       Chocolate           0
D. Jones  12344  12121972   94043    M       Hazelnut            0
E. Jones  12345  03251972   94041    F       Lemon               0
F. Jones  12346  03251972   02127    F       Lemon               1
G. Jones  12347  08081989   02138    F       Peach               1
H. Smith  12348  01011973   63200    F       Lime                2
I. Smith  12349  02021973   63300    M       Mango               4
J. Smith  12350  02021973   63400    M       Coconut             16
K. Smith  12351  03031974   64500    M       Frog                32
L. Smith  12352  04041974   64600    M       Vanilla             64
M. Smith  12353  04041974   64700    F       Pumpkin             128
N. Smith  12354  04041974   64800    F       Allergic            256
20. What's wrong with this picture?
(The same table as slide 19, with annotations overlaid.) The annotations label the columns as identifiers, private identifiers, and sensitive attributes. Callouts on individual cells: "Unexpected response?"; "Mass. resident"; "FERPA too?"; "Californian"; "Twins, separated at birth?"
21. Help, help, I'm being suppressed

Name       SSN    Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
[Name 1]   12341  *1961      021*     M       Raspberry           .1
[Name 2]   12342  *1961      021*     M       Pistachio           -.1
[Name 3]   12343  *1972      940*     M       Chocolate           0
[Name 4]   12344  *1972      940*     M       Hazelnut            0
[Name 5]   12345  *1972      940*     F       Lemon               .6
[Name 6]   12346  *1972      021*     F       Lemon               .6
[Name 7]   12347  *1989      021*     *       Peach               64.6
[Name 8]   12348  *1973      632*     F       Lime                3
[Name 9]   12349  *1973      633*     M       Mango               3
[Name 10]  12350  *1973      634*     M       Coconut             37.2
[Name 11]  12351  *1974      645*     M       *                   37.2
[Name 12]  12352  *1974      646*     M       Vanilla             37.2
[Name 13]  12353  *1974      647*     F       *                   64.4
[Name 14]  12354  *1974      648*     F       Allergic            256

Techniques labeled on the slide: redaction; variable made synthetic; global recode; local suppression; aggregation + perturbation.
Traditional static suppression:
- Data reduction: observation; measure; cell
- Perturbation: microaggregation; rule-based data swapping; adding noise
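The suppression and recoding operations illustrated in the table above can be sketched in a few lines. This is a hypothetical illustration (the function names and thresholds are my own), not production disclosure-control code:

```python
import random

def global_recode(birthdate):
    """Global recode: coarsen an MMDDYYYY date to just the year, e.g. "03251972" -> "*1972"."""
    return "*" + birthdate[-4:]

def local_suppress(values, min_count=2):
    """Local suppression: replace values too rare to hide in a crowd with "*"."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return [v if counts[v] >= min_count else "*" for v in values]

def perturb(x, scale=0.5, rng=random):
    """Perturbation: add random noise to a numeric cell."""
    return x + rng.gauss(0, scale)

print(global_recode("03251972"))                   # *1972
print(local_suppress(["Lemon", "Lemon", "Frog"]))  # ['Lemon', 'Lemon', '*']
```

Each operation trades a different slice of utility for protection, which is what the next slide's matrix summarizes.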
23. How does Anonymization Affect Utility?
(Matrix: utility dimension × level of analysis -- measure, record, aggregate, model.)
- Bias: generalization; k-anonymity; cell suppression; record deletion (+ any measure bias); column deletion (+ any aggregation bias); synthetic data; (N/A at some levels)
- Precision: cell perturbation (all methods increase variance)
- Truthfulness: cell perturbation; microaggregation; swapping; synthetic data (bias is analogous to truthfulness, see above)
- Completeness: column suppression; record suppression (cell suppression affects completeness & interpretability; record incompleteness increases bias)
- Consistency: cell perturbation; microaggregation
- Interpretability: cell perturbation; microaggregation; cell suppression
All anonymization methods affect variance, informational complexity, and entropy.
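The bias and variance entries above can be made concrete with a quick numeric sketch (using the crime counts from the earlier toy table; the topcode threshold and noise scale are arbitrary choices of mine): topcoding biases the mean downward, while added noise leaves the mean unbiased at the cost of per-cell variance.

```python
import statistics, random

counts = [0, 0, 0, 0, 0, 1, 1, 2, 4, 16, 32, 64, 128, 256]  # from the toy table

# Topcoding (capping extreme values) biases the mean downward.
topcoded = [min(c, 32) for c in counts]
bias = statistics.mean(topcoded) - statistics.mean(counts)
print(round(bias, 2))  # -25.14: the mean is badly understated

# Additive noise is mean-unbiased in expectation, but each cell's
# variance grows by scale**2.
rng = random.Random(0)
scale = 5.0
noisy_means = [statistics.mean(c + rng.gauss(0, scale) for c in counts)
               for _ in range(1000)]
print(round(statistics.mean(noisy_means), 1))  # close to the true mean, 36.0
```

This is the structural point of the matrix: which utility dimension degrades depends on which operation is applied and at which level the analyst is working.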
24. Special Challenges to Research Uses
- Computational replication / process verification / historical verification
- Reuse, reanalysis, extension
- Data integration across separate sources
25. Data Reuse
- Data integration
  - Pooling measures (vertical partitioning)
  - Pooling subjects (horizontal partitioning)
  - Pooling studies (meta-analysis)
  - Pooling time (longitudinal integration)
  - Follow-ups/recontacts (adding waves, future data collection)
- Data reuse
  - Research extension / reanalysis / critique
  - New model / analysis / estimation
  - Extraction of new signals (measures)
26. Data Integration and Privacy Challenges
Aim: integrate evidence across multiple independent databases.
- Horizontal integration is possible if the databases are known to be different sample observations of the same population
- Most privacy protections impede:
  - Vertical (multi-measure) integration
  - Longitudinal integration
  - Full meta-analysis
- Anonymization impedes follow-up studies
- Mitigating approaches:
  - Tiered access
  - Third-party escrow
  - Artificial linking identifiers (can leak information; not robust to errors in identifying information; challenging to coordinate across institutions)
  - Secure multiparty computation
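Of these mitigations, secure multiparty computation is the least familiar. A minimal sketch (my own toy illustration using additive secret sharing, not a hardened protocol) shows how institutions can pool a total without any of them revealing its individual count:

```python
import random

PRIME = 2**61 - 1  # all arithmetic is modulo a large prime

def share(secret, n_parties, rng):
    """Split `secret` into n random additive shares that sum to it mod PRIME."""
    shares = [rng.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

# Three hypothetical institutions each hold a private count.
rng = random.Random()
counts = [120, 45, 310]
all_shares = [share(c, 3, rng) for c in counts]

# Each party sums the one share it received from every institution;
# combining the partial sums reveals only the total, never any input.
partials = [sum(col) % PRIME for col in zip(*all_shares)]
total = sum(partials) % PRIME
print(total)  # 475
```

Any single share (or partial sum) is uniformly random and so carries no information about an individual institution's count; only the final recombination is meaningful. Real deployments add authenticated channels and protection against dishonest parties.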
27. Reuse and Privacy Challenges
Aim: critique previous results, extend models/methods, ask new questions, possibly far in the future.
- Methods that impede data integration partially impede critique, extension, and new questions
- Synthetic data approaches and suppressing measures can impede new questions
- Privacy protections that affect interpretability are a challenge to meaningful long-term access
- Generalization, synthetic data approaches, and aggregation impede extraction of new measures for new questions
- Mitigating approaches:
  - Broad consent
  - Tiered access
28. Replication/Verification
- Some common characteristics
  - Uses the original data, model, methods, and implementation
  - Corresponds to those used when results were published
  - Often archived as replication packages
- Calibration
  - Verify understanding of software, method, and model before extending
  - Often a preliminary to reuse
- Check research integrity
  - Recreate, as closely as possible, the process actually followed by an author in producing previously published results
  - Detect substantial omissions in methods, unsophisticated data manipulation, misleading precision in publication
- Retrospective process quality evaluation
  - Evaluate process methods in practice
  - Detect: transcription errors, numerical issues, software bugs
- Historical/legal verification
  - Verify that some phenomenon/relation existed/did not exist in the data
  - Verify a historical aggregate
  - Did some individual have/not have a particular property?
29. Replication, Verification, and Privacy Challenges
- Verify a particular process and results from a prior publication/statement
  - Affected by variance, bias, interpretability, completeness
- Calibrate prior to extension or critique
  - Affected by bias, variance, interpretability
- Assess process/method robustness
  - Affected by precision, variance, bias
- Verify historical statements about records
  - Affected by truthfulness, completeness
- Mitigating approaches:
  - Tiered access
  - Consent for replication
  - Privacy-aware publishing: publish research only on post-protected data; avoid publishing results with unnecessary precision; document uncertainty
  - Replication servers
  - Process logs + audit trails
30. Big Data Research and Privacy Challenges
- Big data can be rich, messy & surprising
  - The "blog problem": pseudonymous communication used for topic mining can be linked through stylometric analysis
- Observable behavior leaves unique fingerprints
  - The "GIS problem": location trails are individualistic, externally observable, and difficult to mask
- Traditional anonymization methods can destroy utility
  - The "Netflix problem": many people may have unique long-tail behavior
More Information:
Novak J, Raghavan P, Tomkins A. "Anti-aliasing on the Web." Proceedings of the 13th International Conference on World Wide Web, 2004.
Narayanan, Arvind, and Vitaly Shmatikov. "Robust de-anonymization of large sparse datasets." 2008 IEEE Symposium on Security and Privacy (SP 2008). IEEE, 2008.
De Montjoye, Yves-Alexandre, et al. "Unique in the crowd: The privacy bounds of human mobility." Scientific Reports 3 (2013).
31. Computational Methods Beyond Anonymization
- Controlling access: virtual data enclaves
- Controlling computation: secure multiparty computation; functional encryption; homomorphic encryption; blockchain
- Controlling inference: differential privacy
- Restricting use: executable policy languages
More Information:
Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072, 2016.
Altman M, Capps C, Prevost R. Location Confidentiality and Official Surveys. Social Science Research Network [Internet], 2016.
32. Caution: No Current Method Cures Algorithmic Discrimination
- Big-data algorithms may pick up unanticipated relationships in the data
- Algorithms that incorporate human behavior may amplify human biases
- Excluding sensitive inputs is not enough
- Check whether classifications have distributional bias
- Check whether classification error is biased
More Information:
Sweeney, Latanya. "Discrimination in online ad delivery." Queue 11.3 (2013): 10.
Larson, et al. "How We Analyzed the COMPAS Recidivism Algorithm." https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
33. Bad News/Good News for Big Data and Privacy
- Bad news: de-identification is not enough -- the accumulation of many information releases may compose into substantial disclosure and/or lead to algorithmic discrimination
- Good news: large numbers are your friend -- new methods such as differential privacy can provide strong disclosure protection and good estimates for large samples
- Bad news: traditional anonymization primitives (e.g., local suppression, topcoding) can add variance, require reweighting, and bias answers
- Good news: some new cryptography-based methods (secure multiparty computation, some differentially private algorithms) are unbiased and directly interpretable
- Bad news: traditional methods focus explicitly on limiting data release
- Good news: there are new frameworks for limiting inferences (differential privacy), computation (secure multiparty computation), and uses (policy languages)
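The "large numbers are your friend" point can be sketched with the Laplace mechanism, a basic building block of differential privacy. This toy implementation and its parameter choices are mine, not a vetted library (real deployments should use an audited implementation): the noise needed to protect any one record shrinks as 1/n, so estimates on large samples stay accurate.

```python
import random

def dp_mean(values, epsilon, lo, hi, rng):
    """Release a mean via the Laplace mechanism: clamp values to [lo, hi],
    then add Laplace noise scaled to the mean's sensitivity, (hi - lo)/n."""
    n = len(values)
    clamped = [min(max(v, lo), hi) for v in values]
    scale = (hi - lo) / (n * epsilon)
    # A Laplace draw is the difference of two exponential draws.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return sum(clamped) / n + noise

rng = random.Random(0)
for n in (100, 10_000, 1_000_000):
    data = [rng.uniform(0, 1) for _ in range(n)]
    err = abs(dp_mean(data, 0.5, 0, 1, rng) - sum(data) / n)
    print(n, err)  # error shrinks roughly as 1/n
```

The same epsilon buys far better accuracy at n = 1,000,000 than at n = 100, which is why differential privacy pairs naturally with large administrative and "big data" collections.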
34. Bake Privacy In
- Fair Information Practice:
  - Notice/awareness
  - Choice/consent
  - Access/participation (verification, accuracy, correction)
  - Integrity/security
  - Enforcement/redress (self-regulation; private remedies; government enforcement)
- Privacy by Design:
  - Proactive not reactive; preventative not remedial
  - Privacy as the default setting
  - Privacy embedded into design
  - Full functionality: positive-sum, not zero-sum
  - End-to-end security: full lifecycle protection
  - Visibility and transparency: keep it open
  - Respect for user privacy: keep it user-centric
- OECD Principles:
  - Collection limitation
  - Data quality
  - Purpose specification
  - Use limitation
  - Security safeguards
  - Openness
  - Individual participation
  - Accountability
35. Notice and Consent, by Itself, Does not Scale
- Most people believe they have lost control of their private information
- Many individuals do not understand how and what is shared, or the potential consequences
- Most people encounter thousands of terms of use every year
- Because big data is rich, messy, and surprising, future uses are more difficult to anticipate
More Information:
State of Privacy in America, 2016, Pew Fact Tank: http://www.pewresearch.org/fact-tank/2016/01/20/the-state-of-privacy-in-america/
Big Data and Privacy: A Technological Perspective, PCAST, 2014. https://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdf
McDonald, Aleecia M., and Lorrie Faith Cranor. "The cost of reading privacy policies." ISJLP 4 (2008): 543.
36. Lifecycle approach to data management } Review of uses,
threats, and vulnerabilities as information is used over time }
Select appropriate controls at each stage
37. Catalog of Privacy Controls
Procedural, technical, educational, economic, and legal means for enhancing privacy at each stage of the information lifecycle. Example stage -- Access/Release:
- Procedural: access controls; consent; expert panels; individual privacy settings; presumption of openness vs. privacy; purpose specification; registration; restrictions on use by the data controller; risk assessments
- Economic: access/use fees (for data controller or subjects); property rights assignment
- Educational: data asset registers; notice; transparency
- Legal: integrity and accuracy requirements; data use agreements (contracts with data recipients) / terms of service
- Technical: authentication; computable policy; differential privacy; encryption (incl. functional, homomorphic); interactive query systems; secure multiparty computation
38. Calibrating Controls
Illustrating how to choose privacy controls that are consistent with the uses, threats, and vulnerabilities at each lifecycle stage.
More Information:
Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072, 2016.
39. Principles of a Modern Approach to Information Privacy & Confidentiality
- Calibrating privacy and security controls to the intended uses and privacy risks associated with the data
- When conceptualizing informational risks, considering not just reidentification risks but also inference risks, or the potential for others to learn about individuals from the inclusion of their information in the data
- Addressing informational risks using a combination of privacy and security controls rather than relying on a single control such as consent or deidentification
- Anticipating, regulating, monitoring, and reviewing interactions with data across all stages of the lifecycle (including the post-access stages), as risks and methods will evolve over time
40. Recommended Readings
- Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072, 2016.
- Vayena E, Gasser U, Wood A, O'Brien D, Altman M. Elements of a New Ethical and Regulatory Framework for Big Data Research. Washington and Lee Law Review 72(3): 420-442, 2016.
41. Creative Commons License
This work by Micah Altman is licensed under the Creative Commons Attribution-ShareAlike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.