Assessing Disclosure Risk via Record Linkage
Josep Domingo-Ferrer, Sara Ricci and Jordi Soria-Comas
Universitat Rovira i Virgili, Dept. of Computer Engineering and Mathematics
UNESCO Chair in Data Privacy, Av. Països Catalans 26, 43007 Tarragona, Catalonia
October 5, 2015
Why anonymization?
Private information is routinely collected and stored.
Google
Hospitals
Universities
Problem
Privacy and Utility
Statistical disclosure control
Statistical disclosure control methods protect the privacy of the individual subjects whose answers constitute the original data set. Two main approaches exist:
Utility-first: Priority is given to preserving certain utility properties. Disclosure risk is assessed a posteriori.
Privacy-first: A privacy model is adopted to specify privacy guarantees before anonymization. Utility is assessed a posteriori.
Note
We propose an a posteriori disclosure risk analysis that simulates attacks via record linkage.
Record linkage
Standard record linkage
[Diagram: Data 1 linked to External information]
The data protector needs to make assumptions on the attacker’s background knowledge (external non-de-identified data sets available, attributes that can be used for linkage, etc.).
Record linkage mainly focuses on identity disclosure.
Record linkage
Our approach
[Diagram: Original data linked to Anonymized data]
We assume a maximum-knowledge attacker.
Attribute disclosure can be assessed.
The attacker can assess the accuracy of any record linkage he wishes to claim.
The protector can use the methodology to tune the anonymization level so that the attacker can claim no linkage.
Permutation distance
The permutation distance measures the dissimilarity between two records.
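The transcript does not reproduce the formula for this distance. The following is a minimal sketch under an explicit assumption: the distance between a record of X and a record of Y is taken to be the largest absolute difference between the ranks of their attribute values, with ranks computed per attribute within each data set. The helper names `ranks` and `permutation_distance` and the toy values are illustrative only.

```python
def ranks(column):
    """Rank of each value within its column (0 = smallest; ties share the lowest rank)."""
    ordered = sorted(column)
    return [ordered.index(v) for v in column]


def permutation_distance(x_ranks, y_ranks):
    """Assumed definition: maximum absolute rank difference over all attributes."""
    return max(abs(rx - ry) for rx, ry in zip(x_ranks, y_ranks))


# Toy data: three records, two numerical attributes (made-up values).
X = [(10, 200), (20, 100), (30, 300)]
Y = [(12, 310), (31, 190), (19, 90)]

# Convert every attribute (column) to ranks, then transpose back to records.
X_ranks = list(zip(*(ranks(col) for col in zip(*X))))
Y_ranks = list(zip(*(ranks(col) for col in zip(*Y))))

# Distance between the first original record and the first anonymized record.
print(permutation_distance(X_ranks[0], Y_ranks[0]))
```

Working on ranks rather than raw values makes the measure insensitive to the scale of each attribute, which is one plausible reason to prefer it for linkage.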
Maximum-knowledge attacker model
Axiom (Kerckhoffs’s principle)
A cryptosystem should be secure even if everything about the system, except the key, is public knowledge.
In anonymization, it can be applied in two different ways:
the attacker knows both the original and the anonymized data set, but not the linkage between anonymized and original records (re-identification disclosure).
the attacker knows all the original data set except one attribute, and all the anonymized data set (attribute disclosure).
A Relative Measure of Disclosure Risk
Definition
For each record $x \in X$ we define its linked record $y_x \in Y$ as one of the anonymized records in $Y$ at shortest distance from $x$.
Let $M_{X,Y}(x, y_x)$ be a function measuring the amount of masking between $x$ and $y_x$, which will be:
the linkage distance (re-identification disclosure);
$M_{X,Y}(x, y_x) = |\mathrm{rank}_{X^m}(x) - \mathrm{rank}_{Y^m}(y_x)|$ (attribute disclosure).
Given two different anonymizations $Y_1$ and $Y_2$ of $X$, we compare $\mathrm{dist}_M(X, Y_1)$ and $\mathrm{dist}_M(X, Y_2)$.
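To make the relative comparison concrete, here is a small sketch of the re-identification case, where M is the linkage distance. It assumes a stand-in record distance (largest per-attribute absolute difference) rather than the permutation distance used in the talk, and the data sets `Y1` and `Y2` are made up for illustration.

```python
def distance(a, b):
    """Stand-in record distance (an assumption; the talk uses the permutation distance)."""
    return max(abs(u - v) for u, v in zip(a, b))


def linked_record(x, Y):
    """One of the records of Y at shortest distance from x (the first one on ties)."""
    return min(Y, key=lambda y: distance(x, y))


def dist_M(X, Y):
    """Sorted values of M(x, y_x) over all x in X, with M taken as the linkage distance."""
    return sorted(distance(x, linked_record(x, Y)) for x in X)


X = [(1, 10), (4, 40), (7, 70)]
Y1 = [(1, 11), (4, 39), (8, 70)]   # hypothetical light masking of X
Y2 = [(3, 25), (9, 55), (2, 90)]   # hypothetical heavier masking of X

print(dist_M(X, Y1))
print(dist_M(X, Y2))
```

In this toy case every linkage distance is larger for `Y2` than for `Y1`, so `Y2` would be judged the more heavily masked of the two relative to `X`.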
Record linkage
Algorithm (Disclosure risk assessment via record linkage)
Require: Original data set X.
Require: Anonymized data set Y.
dist ← distribution of linkage distances between X and Y.
dist′ ← distribution of distances of a non-disclosive linkage.
return comparison of dist and dist′.
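The algorithm can be sketched in a few lines. Everything beyond the algorithm's own steps is an assumption made for illustration: the stand-in distance `abs_distance`, the `reference_pair` argument (a pair of data sets whose linkage is non-disclosive), and the use of a median gap as the comparison, since the slide only says that dist and dist′ are compared.

```python
from statistics import median


def linkage_distances(A, B, distance):
    """For each record of A, the distance to its closest record in B."""
    return [min(distance(a, b) for b in B) for a in A]


def assess_disclosure_risk(X, Y, reference_pair, distance):
    """Compare linkage distances of (X, Y) with those of a non-disclosive pair.

    reference_pair is a hypothetical argument: a pair (A, B) of data sets
    whose linkage reveals nothing about the subjects.
    """
    dist = sorted(linkage_distances(X, Y, distance))
    dist_prime = sorted(linkage_distances(*reference_pair, distance))
    # Assumed summary: how much closer real links are than reference links.
    return dist, dist_prime, median(dist_prime) - median(dist)


def abs_distance(a, b):
    """Stand-in distance for the toy example."""
    return max(abs(u - v) for u, v in zip(a, b))


X = [(1,), (5,), (9,)]
Y = [(2,), (5,), (10,)]
reference = ([(3,), (7,), (12,)], Y)  # made-up non-disclosive data linked to Y

dist, dist_prime, gap = assess_disclosure_risk(X, Y, reference, abs_distance)
print(dist, dist_prime, gap)
```

A gap close to zero would suggest that linking X to Y is no more accurate than a non-disclosive linkage; a large positive gap would suggest that Y still reveals which original record each anonymized record comes from.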
X:
A   B
a1  b1
a2  b2

DX:
A   B
a1  b1
a1  b2
a2  b1
a2  b2

We call dictionary an artificial data set that includes all the possible records formed by combining the attribute values of the original data set.
Note (Dictionary Linkage)
In the dictionary linkage test we compare the distribution of linkage distances between X and Y to the distribution of linkage distances between DX and Y.
Note (Linkage to Permuted Data Set)
In this test we compare the distribution of linkage distances between X and Y to the distribution of linkage distances between X and Y′, where Y′ is a data set of the same dimension as X and with the same attributes as Y, but with the values of each attribute randomly permuted and reassigned to records.

Y:
A    B
a1   b1
a2   b2
...  ...
an   bn

Y′:
A      B
aσ(1)  bρ(1)
aσ(2)  bρ(2)
...    ...
aσ(n)  bρ(n)
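Building Y′ amounts to shuffling each attribute column of Y with its own independent permutation (σ for A, ρ for B in the tables above). A minimal sketch, where the helper name `permuted_copy` and the toy records are illustrative:

```python
import random


def permuted_copy(Y, rng):
    """Return Y' where every attribute of Y is shuffled independently of the others."""
    columns = [list(column) for column in zip(*Y)]
    for column in columns:
        rng.shuffle(column)  # a fresh random permutation for each attribute
    return list(zip(*columns))


Y = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
Y_prime = permuted_copy(Y, random.Random(0))  # seeded only to make the run repeatable
print(Y_prime)
```

Each attribute of Y′ keeps exactly the values it has in Y, so marginal distributions are preserved while the association between attributes within a record is broken.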
Attribute linkage
The Attribute Disclosure Test is based on attribute linkage. Let X be the original data set with m attributes.
Note (Attribute Disclosure Test)
The attacker knows the attributes $A_1, \ldots, A_{m-1}$ of X and his goal is to determine the value of $A_m$ as accurately as possible.
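One way to simulate this attack is sketched below. It assumes that the attacker links each original record to the anonymized record closest on the m−1 known attributes, reads off that record's value of the last attribute as the guess, and that the error is scored with the rank-difference measure defined earlier in the slides. The stand-in distance and the toy data are assumptions, not taken from the talk.

```python
def rank(column, value):
    """Rank of value within column: the number of entries strictly smaller than it."""
    return sum(1 for v in column if v < value)


def known_distance(a, b):
    """Stand-in distance on the known attributes (an assumption)."""
    return max(abs(u - v) for u, v in zip(a, b))


def attribute_disclosure_gaps(X, Y):
    """Rank gap on the last attribute between each x and its linked record in Y."""
    x_last = [x[-1] for x in X]
    y_last = [y[-1] for y in Y]
    gaps = []
    for x in X:
        # Link using only the attributes the attacker knows (all but the last).
        y_x = min(Y, key=lambda y: known_distance(x[:-1], y[:-1]))
        gaps.append(abs(rank(x_last, x[-1]) - rank(y_last, y_x[-1])))
    return gaps


X = [(1, 10, 100), (2, 20, 300), (3, 30, 200)]
Y = [(2, 21, 310), (3, 29, 90), (1, 11, 190)]

print(attribute_disclosure_gaps(X, Y))
```

Small gaps mean the attacker's guess lands near the true rank of the unknown attribute, which indicates a higher attribute disclosure risk.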
Experimental results: dictionary linkage test
[Figure, left] Distribution of linkage distances between X and Y and distribution of linkage distances between DX and Y.
[Figure, right] Same as the left plot, but replacing X by a random permutation Xσ and DX by DXσ.
Linkage to permuted data set and attribute linkage
[Figure panels: Linkage to permuted data set; Attribute linkage]
Correlations
[Figure panels: Noise addition; Differential privacy]
Solid curve: distance between attribute correlation matrices of X and Y.
Dashed curve: minimum linkage distance between X and Y.
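The slides do not say which distance between correlation matrices is plotted. As a rough sketch only, the code below assumes Pearson correlations and the mean absolute difference between matrix entries; the function names and toy data are illustrative.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equally long columns (undefined for constant columns)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    spread_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    spread_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (spread_u * spread_v)


def correlation_matrix(data):
    """Matrix of pairwise correlations between the attributes (columns) of data."""
    columns = list(zip(*data))
    return [[pearson(c1, c2) for c2 in columns] for c1 in columns]


def matrix_distance(P, Q):
    """Assumed metric: mean absolute difference between corresponding entries."""
    cells = [abs(p - q) for row_p, row_q in zip(P, Q) for p, q in zip(row_p, row_q)]
    return sum(cells) / len(cells)


X = [(1, 2), (2, 4), (3, 6), (4, 8)]   # attributes perfectly correlated
Y = [(1, 8), (2, 6), (3, 4), (4, 2)]   # attributes perfectly anti-correlated

print(matrix_distance(correlation_matrix(X), correlation_matrix(Y)))
```

A distance of this kind serves as a utility indicator: the more the anonymization distorts the correlations between attributes, the larger the value.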
Conclusions and further research
We have proposed a general method for disclosure risk assessment based on record linkage by a maximum-knowledge attacker.
We have presented three specific record linkage tests: two focus on re-identification disclosure risk and one focuses on attribute disclosure risk.
Achieving perfect anonymization requires huge noise, and hence causes a lot of utility damage.
Our empirical results show that the amount of noise needed for safe anonymization is proportional to the dependency between the attributes of the original data set (the more independent the attributes, the less noise is needed).
As future research, we will use different distances to see which one is more representative in the assessment of disclosure risk.