Perfectly Anonymous Data is Perfectly Useless Data
Micah Altman, Director of Research, MIT Libraries
Prepared for the NISO-RDA Symposium: Privacy Implications of Research Data, Denver, CO, Sept 2016


Jan 22, 2018



Transcript
  1. Perfectly Anonymous Data is Perfectly Useless Data. Micah Altman, Director of Research, MIT Libraries. Prepared for the NISO-RDA Symposium: Privacy Implications of Research Data, Denver, CO, Sept 2016.
  2. DISCLAIMER: These opinions are my own; they are not the opinions of MIT, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators. Secondary disclaimer: "It's tough to make predictions, especially about the future!" -- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disraeli, Freeman Dyson, Cecil B. DeMille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
  3. Collaborators & Co-Conspirators
     - Privacy Tools for Sharing Research Data Team (Salil Vadhan, P.I.), http://privacytools.seas.harvard.edu/people
     - Research support: supported in part by NSF grant CNS-123723 and in part by the Sloan Foundation
  4. Related Work
     Main project:
     - Privacy Tools for Sharing Research Data, http://privacytools.seas.harvard.edu/
     Related publications:
     - Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072. 2016.
     - Wood A, Airoldi E, Altman M, de Montandre Y, Gasser U, O'Brien D, Vadhan S. Privacy Tools Project Response to Common Rule Notice of Proposed Rule Making. Comments on Regulations.gov. 2016. (Copy available here: http://informatics.mit.edu/publications/privacy-tools-project-response-common-rule-notice-proposed-rule-making)
     - Vayena E, Gasser U, Wood A, O'Brien D, Altman M. Elements of a New Ethical and Regulatory Framework for Big Data Research. Washington and Lee Law Review. 2016;72(3):420-442.
     - Altman M, Capps C, Prevost R. Location Confidentiality and Official Surveys. Social Science Research Network. 2016.
     Slides and reprints available from: informatics.mit.edu
  5. Today's Perspectives & Provocations
     - What is information privacy?
     - Privacy, utility, and use
     - Calibrating protection and risk
  6. What is Information Privacy?
  7. Some Definitions
     - Privacy: control over the extent and circumstances of sharing
     - Confidentiality: control over disclosure of information
     - Identifiability: the potential for learning about individuals based on their inclusion in a dataset
     - Sensitivity: the potential for harm if information is disclosed and used to learn about individuals
  8. Privacy is not Anonymization
     - Anonymization / de-identification / PII are legal concepts
     - There may be no rigorous formal definition
     - The definition varies by law, and may include:
       - Presence of specific attributes (e.g., PII, HIPAA identifiers)
       - Feasibility of record linkage
       - Evaluation of the knowledge of the data publisher (e.g., "no actual knowledge", "readily ascertainable")
     - Legal definitions do not match statistical/scientific concepts
  9. Privacy is not Information Security
     Information security maintains many security properties within an information system: confidentiality (secrecy), integrity, availability, authenticity, non-repudiation.
     - Failures of information secrecy may also violate privacy
     - Information security controls access; it does not control computation, use, or inference
     - Information security functions only within an information system; privacy depends on what happens to information before, during, and after it is in an information system
  10. Privacy Aims to Prevent Harm from Disclosure -- to data subjects, vulnerable groups, institutions, and society.
  11. Public Data Isn't Always
     - The "Doesn't Stay in Vegas" problem -- information shared locally can be found anywhere
     - Digitization and APIs make aggregation easy
     - Many public records were not practically public until now
     - Creates new business models
     - Provokes sector-by-sector legislative reaction
     More information: Sweeney, Latanya. "Discrimination in Online Ad Delivery." Queue 11.3 (2013): 10; Wikipedia, mugshot database.
  12. Privacy, Utility, and Use (Cases)
  13. No-Free-Lunch for Privacy: any data analysis that is useful* leaks some measurable private information.
  14. Scientific Trend: More Sophisticated Reidentification Concepts
     - Record linkage ("where's Waldo"): match a real person to a precise record in a database. Examples: direct identifiers. Caveats: satisfies compliance for specific laws, but not generally; substantial potential for harm remains.
     - Indistinguishability ("hidden in the crowd"): individuals can be linked only to a cluster of records (of known size). Examples: k-anonymity, attribute disclosure. Caveats: potential for substantial harms may remain; must specify what external information is observable; need diversity for sensitive attributes.
     - Limited adversarial learning ("confidentiality guaranteed"): formally bounds the total learning about any individual that occurs from a data release. Examples: differential privacy, zero-knowledge proofs. Caveats: challenging to implement; often requires interactive systems.
     More information:
     - Willenborg, Leon, and Ton De Waal. Elements of Statistical Disclosure Control. Vol. 155. Springer Science & Business Media, 2012.
     - Fung, Benjamin, et al. "Privacy-Preserving Data Publishing: A Survey of Recent Developments." ACM Computing Surveys (CSUR) 42.4 (2010): 14.
     - Dwork, Cynthia. "Differential Privacy: A Survey of Results." International Conference on Theory and Applications of Models of Computation. Springer Berlin Heidelberg, 2008.
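The "hidden in the crowd" notion above can be made concrete in a few lines. The sketch below is my own illustration, not part of the deck: it computes the k of k-anonymity for a toy table, treating birth year and 3-digit ZIP prefix as the assumed quasi-identifiers.

```python
from collections import Counter

def k_anonymity(rows, quasi_ids):
    """Size of the smallest group of rows sharing the same quasi-identifier
    values; the table is k-anonymous for exactly this k."""
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return min(groups.values())

# Toy records, echoing the generalized table that appears later in the deck.
records = [
    {"year": "1961", "zip3": "021", "flavor": "Raspberry"},
    {"year": "1961", "zip3": "021", "flavor": "Pistachio"},
    {"year": "1972", "zip3": "940", "flavor": "Chocolate"},
    {"year": "1972", "zip3": "940", "flavor": "Hazelnut"},
]
k = k_anonymity(records, ["year", "zip3"])  # every (year, zip3) group has 2 rows
```

Raising k by coarsening the quasi-identifiers further is exactly the global-recode strategy discussed later in the deck.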
  15. What Use Is It?
     - Utility is defined broadly as the analytical value of the data
     - There is no universal operational definition of utility
     - There is a structural tradeoff between privacy and utility: we can't simultaneously maximize both
     - However, new methods can sometimes do better than traditional anonymization on both fronts
  16. Some Approaches to Measuring Utility
     - Statistical inference: precision, bias, variance
     - Semantic: truthfulness, completeness, consistency, interpretability
     - Computational: entropy, information complexity
     Image sources: https://commons.wikimedia.org/wiki/File:Solna_Brick_wall_Stretcher_bond_variation1.jpg; https://commons.wikimedia.org/wiki/File:Archery_Target_80cm.svg
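As a small illustration of the statistical-inference measures (bias and variance), this sketch -- my own, not from the slides, with invented names throughout -- uses Monte Carlo simulation to estimate the bias and variance that cell-level Gaussian noise (a perturbation control) introduces into a sample mean.

```python
import random

random.seed(2016)  # reproducible illustration

def noisy_mean_utility(values, noise_sd, trials=2000):
    """Monte Carlo estimate of the bias and variance introduced in the
    sample mean when independent Gaussian noise is added to each cell."""
    n = len(values)
    true_mean = sum(values) / n
    estimates = []
    for _ in range(trials):
        perturbed = [v + random.gauss(0.0, noise_sd) for v in values]
        estimates.append(sum(perturbed) / n)
    mean_est = sum(estimates) / trials
    bias = mean_est - true_mean           # ~0: unbiased noise addition
    variance = sum((e - mean_est) ** 2 for e in estimates) / trials
    return bias, variance                 # variance ~ noise_sd**2 / n

bias, variance = noisy_mean_utility(list(range(10)), noise_sd=1.0)
```

This captures the deck's later "good news": unbiased perturbation preserves aggregate utility at a known, quantifiable cost in variance.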
  17. Challenges to Measuring Utility
     - Utility may not track quality, or value
     - Quality depends on intended use
     - Value depends on intended and unintended uses, and on markets
     - Utility at one level doesn't match utility at other levels: truthful records do not guarantee accurate aggregates
     [Figure: levels of measurement -- record vs. aggregate, data vs. model]
  18. Traditional Confidentiality-Enhancing Data Publishing
     Modal practice for published outputs:
     - "The correlation between X and Y was large and statistically significant"
     - Summary statistics
     - Contingency tables
     - Public-use sample microdata
     - Information visualization
     [Figure: partially redacted records, e.g. "* Jones * * 1961 021*"]
  19. Just some data?
     | Name     | SSN   | Birthdate | Zipcode | Gender | Favorite Ice Cream | # of crimes committed |
     |----------|-------|-----------|---------|--------|--------------------|-----------------------|
     | A. Jones | 12341 | 01011961  | 02145   | M      | Raspberry          | 0   |
     | B. Jones | 12342 | 02021961  | 02138   | M      | Pistachio          | 0   |
     | C. Jones | 12343 | 11111972  | 94043   | M      | Chocolate          | 0   |
     | D. Jones | 12344 | 12121972  | 94043   | M      | Hazelnut           | 0   |
     | E. Jones | 12345 | 03251972  | 94041   | F      | Lemon              | 0   |
     | F. Jones | 12346 | 03251972  | 02127   | F      | Lemon              | 1   |
     | G. Jones | 12347 | 08081989  | 02138   | F      | Peach              | 1   |
     | H. Smith | 12348 | 01011973  | 63200   | F      | Lime               | 2   |
     | I. Smith | 12349 | 02021973  | 63300   | M      | Mango              | 4   |
     | J. Smith | 12350 | 02021973  | 63400   | M      | Coconut            | 16  |
     | K. Smith | 12351 | 03031974  | 64500   | M      | Frog               | 32  |
     | L. Smith | 12352 | 04041974  | 64600   | M      | Vanilla            | 64  |
     | M. Smith | 12353 | 04041974  | 64700   | F      | Pumpkin            | 128 |
     | N. Smith | 12354 | 04041974  | 64800   | F      | Allergic           | 256 |
  20. What's wrong with this picture? The same table, with columns annotated as identifiers, private identifiers, and sensitive attributes, and with callouts: unexpected response?; Mass. resident -- FERPA too?; Californian twins, separated at birth?
  21. "Help, help, I'm being suppressed"
     | Name      | SSN   | Birthdate | Zipcode | Gender | Favorite Ice Cream | # of crimes committed |
     |-----------|-------|-----------|---------|--------|--------------------|------|
     | [Name 1]  | 12341 | *1961     | 021*    | M      | Raspberry          | .1   |
     | [Name 2]  | 12342 | *1961     | 021*    | M      | Pistachio          | -.1  |
     | [Name 3]  | 12343 | *1972     | 940*    | M      | Chocolate          | 0    |
     | [Name 4]  | 12344 | *1972     | 940*    | M      | Hazelnut           | 0    |
     | [Name 5]  | 12345 | *1972     | 940*    | F      | Lemon              | .6   |
     | [Name 6]  | 12346 | *1972     | 021*    | F      | Lemon              | .6   |
     | [Name 7]  | 12347 | *1989     | 021*    | *      | Peach              | 64.6 |
     | [Name 8]  | 12348 | *1973     | 632*    | F      | Lime               | 3    |
     | [Name 9]  | 12349 | *1973     | 633*    | M      | Mango              | 3    |
     | [Name 10] | 12350 | *1973     | 634*    | M      | Coconut            | 37.2 |
     | [Name 11] | 12351 | *1974     | 645*    | M      | *                  | 37.2 |
     | [Name 12] | 12352 | *1974     | 646*    | M      | Vanilla            | 37.2 |
     | [Name 13] | 12353 | *1974     | 647*    | F      | *                  | 64.4 |
     | [Name 14] | 12354 | *1974     | 648*    | F      | Allergic           | 256  |
     (Callouts: redaction; synthetic variable; global recode; local suppression; aggregation + perturbation)
     Traditional static suppression:
     - Data reduction: observation, measure, cell
     - Perturbation: microaggregation, rule-based data swapping, adding noise
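The suppression methods listed on this slide are easy to sketch in code. The following is illustrative only (field names follow the toy table in slides 19-21): a global recode of birthdate and ZIP, followed by local suppression of sensitive values that occur fewer than a threshold number of times.

```python
from collections import Counter

def recode_and_suppress(records, min_count=2):
    """Global recode (birthdate -> year only, ZIP -> 3-digit prefix), then
    local suppression of sensitive values appearing < min_count times."""
    recoded = [
        {
            **r,
            "birthdate": "*" + r["birthdate"][-4:],   # MMDDYYYY -> *YYYY
            "zipcode": r["zipcode"][:3] + "*",        # 02145 -> 021*
        }
        for r in records
    ]
    counts = Counter(r["flavor"] for r in recoded)
    for r in recoded:
        if counts[r["flavor"]] < min_count:
            r["flavor"] = "*"                         # locally suppress rare cell
    return recoded

rows = [
    {"birthdate": "03251972", "zipcode": "94041", "flavor": "Lemon"},
    {"birthdate": "03251972", "zipcode": "02127", "flavor": "Lemon"},
    {"birthdate": "03031974", "zipcode": "64500", "flavor": "Frog"},
]
out = recode_and_suppress(rows)
# "Frog" is unique, so it is suppressed; birthdates keep only the year.
```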
  22. Perfect Privacy*
     [Table: every cell replaced by "*"]
     * Global tabular suppression
  23. How Does Anonymization Affect Utility?
     | Measure          | Record | Aggregate | Model |
     |------------------|--------|-----------|-------|
     | Bias             | Generalization; k-anon; cell suppression | Record deletion + any measure bias; column deletion + any aggregation bias | Synthetic |
     | Precision        | Cell perturbation | (All methods increase variance) | |
     | Truthfulness     | Cell perturbation; microaggregation; swapping; synthetic | Cell perturbation; microaggregation; swapping | (Bias is analogous to truthfulness -- see above) |
     | Completeness     | Column suppression; record suppression (cell suppression affects completeness & interpretability) | (Record incompleteness increases bias) | (N/A) |
     | Consistency      | Cell perturbation; microaggregation | Cell perturbation | (N/A) |
     | Interpretability | Cell perturbation; microaggregation | Cell perturbation; cell suppression; microaggregation | |
     All anonymization methods affect variance, informational complexity, and entropy.
  24. Special Challenges to Research Uses
     - Computational replication / process verification / historical verification
     - Reuse, reanalysis, extension
     - Data integration across separate sources
  25. Data Reuse
     - Data integration
       - Pooling measures (vertical partitioning)
       - Pooling subjects (horizontal partitioning)
       - Pooling studies (meta-analysis)
       - Pooling time (longitudinal integration)
       - Follow-ups/recontacts (adding waves, future data collection)
     - Data reuse
       - Research extension / reanalysis / critique
       - New model / analysis / estimation
       - Extraction of new signals (measures)
  26. Data Integration and Privacy Challenges
     Aim: integrate evidence across multiple independent databases.
     - Horizontal integration is possible if the databases are known to hold different sample observations of the same population
     - Most privacy protections impede:
       - Vertical (multi-measure) integration
       - Longitudinal integration
       - Full meta-analysis
     - Anonymization impedes follow-up studies
     - Mitigating approaches:
       - Tiered access
       - Third-party escrow
       - Artificial linking identifiers (can leak information; not robust to errors in identifying information; challenging to coordinate across institutions)
       - Secure multiparty computation
  27. Reuse and Privacy Challenges
     Aim: critique previous results, extend models/methods, ask new questions -- possibly far in the future.
     - Methods that impede data integration partially impede critique, extension, and new questions
     - Synthetic data approaches and suppressing measures can impede new questions
     - Privacy protections that affect interpretability are a challenge to meaningful long-term access
     - Generalization, synthetic data approaches, and aggregation impede extraction of new measures for new questions
     - Mitigating approaches: broad consent; tiered access
  28. Replication/Verification
     - Some common characteristics
       - Uses the original data, model, methods, and implementation
       - Corresponds to those used when results were published
       - Often archived as replication packages
     - Calibration
       - Verify understanding of software, method, and model before extending
       - Often a preliminary to reuse
     - Check research integrity
       - Recreate, as closely as possible, the process actually followed by an author in producing previously published results
       - Detect substantial omissions in methods, unsophisticated data manipulation, misleading precision in publication
     - Retrospective process quality evaluation
       - Evaluate process methods in practice
       - Detect transcription errors, numerical issues, software bugs
     - Historical/legal verification
       - Verify that some phenomenon/relation existed or did not exist in the data
       - Verify a historical aggregate
       - Did some individual have, or not have, a particular property?
  29. Replication, Verification and Privacy Challenges
     - Verify a particular process and results from a prior publication/statement: affected by variance, bias, interpretability, completeness
     - Calibrate prior to extension or critique: affected by bias, variance, interpretability
     - Assess process/method robustness: affected by precision, variance, bias
     - Verify historical statements about records: affected by truthfulness, completeness
     - Mitigating approaches:
       - Tiered access
       - Consent for replication
       - Privacy-aware publishing: publish research only on post-protected data; avoid publishing results with unnecessary precision; document uncertainty
       - Replication servers
       - Process logs + audit trails
  30. Big Data Research and Privacy Challenges
     - Big data can be rich, messy & surprising
       - The "Blog" problem: pseudonymous communication used for topic mining can be linked through stylometric analysis
     - Observable behavior leaves unique fingerprints
       - The "GIS" problem: location trails are individualistic, externally observable, and difficult to mask
     - Traditional anonymization methods can destroy utility
       - The "Netflix" problem: many people may have unique long-tail behavior
     More information:
     - Novak, J., P. Raghavan, and A. Tomkins. "Anti-Aliasing on the Web." Proceedings of the 13th International Conference on World Wide Web, 2004.
     - Narayanan, Arvind, and Vitaly Shmatikov. "Robust De-anonymization of Large Sparse Datasets." 2008 IEEE Symposium on Security and Privacy (SP 2008). IEEE, 2008.
     - De Montjoye, Yves-Alexandre, et al. "Unique in the Crowd: The Privacy Bounds of Human Mobility." Scientific Reports 3 (2013).
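The "GIS problem" can be quantified in the style of De Montjoye et al.: how many location points make a trace a unique fingerprint? A toy sketch (my own illustration, with made-up location traces, not from the slides):

```python
from collections import Counter

def fraction_unique(traces, n_points):
    """Fraction of individuals whose first n_points locations form a
    fingerprint shared by no other trace in the dataset."""
    prints = Counter(tuple(t[:n_points]) for t in traces)
    unique = sum(1 for t in traces if prints[tuple(t[:n_points])] == 1)
    return unique / len(traces)

traces = [
    ["home", "cafe", "office"],
    ["home", "cafe", "gym"],
    ["home", "park", "office"],
    ["home", "cafe", "office"],   # duplicate of the first trace
]
# One point ("home") singles out nobody; three points single out half
# of this tiny population -- even without any identifiers attached.
```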
  31. Computational Methods Beyond Anonymization
     - Controlling access: virtual data enclaves
     - Controlling computation: secure multiparty computation; functional encryption; homomorphic encryption; blockchain
     - Controlling inference: differential privacy
     - Restricting use: executable policy languages
     More information:
     - Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072. 2016.
     - Altman M, Capps C, Prevost R. Location Confidentiality and Official Surveys. Social Science Research Network. 2016.
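To give the flavor of "controlling computation," here is a minimal additive secret-sharing sketch -- my illustration only; real secure multiparty computation protocols are far more involved. Three data holders jointly compute a total count without any of them revealing its own value, because each individual share is a uniformly random number.

```python
import random

MOD = 2**31  # arithmetic is done modulo a fixed modulus

def share(value, n_parties):
    """Split value into n additive shares mod MOD; any n-1 shares
    together reveal nothing about value."""
    shares = [random.randrange(MOD) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MOD)
    return shares

# Each of three data holders splits its private count among three
# compute parties; each party sums the shares it holds, and only the
# grand total is ever reconstructed.
secrets = [12, 7, 30]
all_shares = [share(s, 3) for s in secrets]
per_party_sums = [sum(col) % MOD for col in zip(*all_shares)]
total = sum(per_party_sums) % MOD   # equals 12 + 7 + 30, i.e. 49
```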
  32. Caution: No Current Method Cures Algorithmic Discrimination
     - Big data algorithms may pick up unanticipated relationships in data
     - Algorithms that incorporate human behavior may amplify human biases
     - Excluding sensitive input is not enough
     - Check whether classifications have distributional bias
     - Check whether classification error is biased
     More information:
     - Sweeney, Latanya. "Discrimination in Online Ad Delivery." Queue 11.3 (2013): 10.
     - Larson, et al. "How We Analyzed the COMPAS Recidivism Algorithm." https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
  33. Bad News/Good News for Big Data and Privacy
     - Bad news: de-identification is not enough -- the accumulation of many information releases may compose into substantial disclosure, and/or lead to algorithmic discrimination
     - Good news: large numbers are your friend -- new methods such as differential privacy can provide strong disclosure protection and good estimates for large samples
     - Bad news: traditional anonymization primitives (e.g., local suppression, topcoding) can add variance, require reweighting, and bias answers
     - Good news: some new cryptography-based methods (multiparty secure computation, some differentially private algorithms) are unbiased and directly interpretable
     - Bad news: traditional methods focus explicitly on limiting data release
     - Good news: there are new frameworks for limiting inferences (differential privacy), computation (multiparty secure computation), and uses (policy languages)
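The "large numbers are your friend" point can be seen in the Laplace mechanism, the basic building block of differential privacy. This is a sketch, not a production implementation (which would need careful random-number and floating-point handling): a counting query has sensitivity 1, so Laplace noise of scale 1/epsilon suffices, and the difference of two Exponential(epsilon) draws is exactly Laplace(0, 1/epsilon).

```python
import random

def dp_count(true_count, epsilon):
    """Differentially private count via the Laplace mechanism:
    add noise with scale 1/epsilon for a sensitivity-1 query."""
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# For large samples the noise is relatively tiny: at epsilon = 0.1 the noise
# standard deviation is sqrt(2)/0.1, about 14, which is negligible against a
# true count of 100,000 -- strong protection with good estimates.
noisy = dp_count(100_000, epsilon=0.1)
```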
  34. Bake Privacy In
     - Fair Information Practice:
       - Notice/awareness
       - Choice/consent
       - Access/participation (verification, accuracy, correction)
       - Integrity/security
       - Enforcement/redress (self-regulation; private remedies; government enforcement)
     - Privacy by Design:
       - Proactive not reactive; preventative not remedial
       - Privacy as the default setting
       - Privacy embedded into design
       - Full functionality: positive-sum, not zero-sum
       - End-to-end security: full lifecycle protection
       - Visibility and transparency: keep it open
       - Respect for user privacy: keep it user-centric
     - OECD Principles:
       - Collection limitation
       - Data quality
       - Purpose specification
       - Use limitation
       - Security safeguards
       - Openness
       - Individual participation
       - Accountability
  35. Notice and Consent, by Itself, Does Not Scale
     - Most people believe they have lost control of their private information
     - Many individuals do not understand how and what is shared, or the potential consequences
     - Most people encounter thousands of terms of use every year
     - Because big data is rich, messy, and surprising, future uses are more difficult to anticipate
     More information:
     - "The State of Privacy in America," Pew Fact Tank, 2016: http://www.pewresearch.org/fact-tank/2016/01/20/the-state-of-privacy-in-america/
     - Big Data and Privacy: A Technological Perspective, PCAST, 2014: https://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdf
     - McDonald, Aleecia M., and Lorrie Faith Cranor. "The Cost of Reading Privacy Policies." ISJLP 4 (2008): 543.
  36. Lifecycle Approach to Data Management
     - Review uses, threats, and vulnerabilities as information is used over time
     - Select appropriate controls at each stage
  37. Catalog of Privacy Controls
     Procedural, technical, educational, economic, and legal means for enhancing privacy at each stage of the information lifecycle. For the access/release stage, for example:
     - Procedural: access controls; consent; expert panels; individual privacy settings; presumption of openness vs. privacy; purpose specification; registration; restrictions on use by data controller; risk assessments
     - Economic: access/use fees (for data controller or subjects); property rights assignment
     - Educational: data asset registers; notice; transparency
     - Legal: integrity and accuracy requirements; data use agreements (contract with data recipient) / terms of service
     - Technical: authentication; computable policy; differential privacy; encryption (incl. functional, homomorphic); interactive query systems; secure multiparty computation
  38. Calibrating Controls
     Illustrating how to choose privacy controls that are consistent with the uses, threats, and vulnerabilities at each lifecycle stage.
     More information: Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072. 2016.
  39. Principles of a Modern Approach to Information Privacy & Confidentiality
     - Calibrate privacy and security controls to the intended uses and privacy risks associated with the data
     - When conceptualizing informational risks, consider not just reidentification risks but also inference risks -- the potential for others to learn about individuals from the inclusion of their information in the data
     - Address informational risks using a combination of privacy and security controls, rather than relying on a single control such as consent or de-identification
     - Anticipate, regulate, monitor, and review interactions with data across all stages of the lifecycle (including the post-access stages), as risks and methods will evolve over time
  40. Recommended Readings
     - Altman M, Wood A, O'Brien D, Vadhan S, Gasser U. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal 30(3): 1967-2072. 2016.
     - Vayena E, Gasser U, Wood A, O'Brien D, Altman M. Elements of a New Ethical and Regulatory Framework for Big Data Research. Washington and Lee Law Review. 2016;72(3):420-442.
  41. Creative Commons License
     This work by Micah Altman is licensed under the Creative Commons Attribution-ShareAlike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.