Data Anonymization (1)
Jan 08, 2016
Outline
• Problem concepts
• Algorithms on the domain generalization hierarchy
• Algorithms on numerical data
The Massachusetts Governor Privacy Breach
Medical data:
• Name • SSN • Visit Date • Diagnosis • Procedure • Medication • Total Charge
Voter list:
• Name • Address • Date Registered • Party Affiliation • Date Last Voted
Quasi-identifier shared by both tables:
• Zip
• Birth date
• Sex
• The Governor of MA was uniquely identified by linking Zip Code, Birth Date, and Sex across the two tables, so his name was linked to his diagnosis.
• These three attributes uniquely identify 87% of the US population.
(Sweeney, IJUFKS 2002)
Definitions
• Table: columns are attributes, rows are records.
• Quasi-identifier (QI): a set of attributes that can potentially be used to identify individuals.
• k-anonymity: every QI value combination that appears in the table appears at least k times.
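The k-anonymity condition above can be checked directly by counting QI combinations. A minimal sketch, with an illustrative table and column choice (not from the lecture):

```python
from collections import Counter

def is_k_anonymous(records, qi_indices, k):
    """Return True iff every QI value combination occurs at least k times."""
    counts = Counter(tuple(r[i] for i in qi_indices) for r in records)
    return all(c >= k for c in counts.values())

table = [
    ("0213*", "196*", "M"),
    ("0213*", "196*", "M"),
    ("0214*", "197*", "F"),
    ("0214*", "197*", "F"),
]
print(is_k_anonymous(table, qi_indices=(0, 1, 2), k=2))  # True
```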
Basic techniques
• Generalization: replace a value with a less specific one, e.g. Zip {02138, 02139} → 0213*.
• Domain generalization hierarchy (DGH): a chain of increasingly general domains A0 → A1 → … → An, e.g. {02138, 02139} → 0213* → 021* → 02* → 0**. This hierarchy is a tree structure.
• Suppression: remove the record entirely.
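One step up the zip-code DGH can be sketched as masking trailing digits. This assumes a fixed-length, star-padded convention (one of several possible conventions):

```python
def generalize(value, level):
    """One DGH step per level: replace the last `level` characters with '*'.
    Assumed star-padded convention, for illustration only."""
    if level == 0:
        return value
    return value[:-level] + "*" * level

print(generalize("02138", 1))  # '0213*'
print(generalize("02139", 1))  # '0213*' -- both zips fall into the same group
```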
Balance
Better privacy guarantee ↔ lower data utility.
Many generalization schemes satisfy a given k-anonymity specification; we want the one that minimally distorts the table, in order to maximize data utility.
• Suppression is required when we cannot find a k-anonymous group for a record.
Criteria
• Minimal generalization: the least generalization that satisfies the k-anonymity specification.
• Minimal table distortion: among minimal generalizations, the one with minimal utility loss; precision is used to evaluate the loss [Sweeney's papers].
• Application-specific utility may also be the criterion.
Complexity
Finding the optimal generalization is NP-hard [Bayardo, ICDE05], so all proposed algorithms are approximate.
Shared feature across solutions: they always satisfy the k-anonymity specification; records that cannot be grouped are suppressed.
The differences lie in the utility loss/cost function: Sweeney's precision metric, the discernibility and classification metrics, and the information-privacy metric.
Algorithms
Assume the domain generalization hierarchy is given. Goals: efficiency and utility maximization.
Metrics to be optimized
Two cost metrics we want to minimize [Bayardo, ICDE05]:
• Discernibility: each record is charged the number of records in its k-anonymous group.
• Classification: when the dataset has a class-label column and we want to preserve the classification model, each group is charged the number of its records in minority classes.
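The discernibility cost above can be sketched as follows. This assumes the common convention that records in groups smaller than k are suppressed and charged the whole table size; the group labels are illustrative:

```python
from collections import Counter

def discernibility_penalty(group_labels, k, table_size):
    """Each record in a QI group of size s >= k costs s; records in groups
    smaller than k are treated as suppressed and cost the table size."""
    cost = 0
    for size in Counter(group_labels).values():
        cost += size * size if size >= k else size * table_size
    return cost

# Three records share one generalized QI, one record is left alone:
print(discernibility_penalty(["g1", "g1", "g1", "g2"], k=2, table_size=4))  # 13
```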
Information-privacy metric
A combination of information loss and anonymity gain [Wang, ICDE04].
Information loss (when the dataset has class labels): entropy measures the impurity of the labels in a set S:
    info(S) = -Σ_i p_i log p_i, where p_i is the fraction of records in S with class label i.
For a generalization G: {c1, c2, …, cn} → p that merges the child groups R_c1, …, R_cn into the parent group R_p:
    I(G) = info(R_p) - Σ_i (|R_ci| / |R_p|) · info(R_ci)
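The entropy and information-loss formulas above can be computed directly. A minimal sketch (base-2 logarithm assumed; the example labels are illustrative):

```python
import math
from collections import Counter

def info(labels):
    """Entropy -sum_i p_i log2 p_i of a multiset of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_loss(child_label_lists):
    """I(G) for merging child groups R_c1..R_cn into the parent R_p:
    parent entropy minus the size-weighted average of child entropies."""
    parent = [lab for group in child_label_lists for lab in group]
    n = len(parent)
    return info(parent) - sum(len(g) / n * info(g) for g in child_label_lists)

# Merging two pure groups with opposite labels loses one full bit:
print(information_loss([["+", "+"], ["-", "-"]]))  # 1.0
```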
Anonymity gain
A(VID): the number of records sharing the VID; A^G(VID): the count after applying generalization G. A^G(VID) >= A(VID): a generalization improves or does not change A(VID).
    P(G) = x - A(VID), where x = A^G(VID) if A^G(VID) <= k, and x = k otherwise.
Once k-anonymity is satisfied, further generalization of the VID gains nothing.
Information-privacy combined metric: IP = information loss / anonymity gain = I(G) / P(G).
We want to minimize IP. If P(G) == 0, use I(G) alone. Either a small I(G) or a large P(G) reduces IP; if the P(G) values are equal, pick the generalization with minimum I(G).
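Putting the two pieces together, the IP score (with the P(G) == 0 fallback described above) can be sketched as:

```python
def anonymity_gain(a_vid, ag_vid, k):
    """P(G) = x - A(VID), with x = A^G(VID) capped at k: growing a VID
    group beyond k yields no further privacy under k-anonymity."""
    return min(ag_vid, k) - a_vid

def information_privacy(i_g, p_g):
    """IP = I(G) / P(G); fall back to I(G) alone when P(G) == 0."""
    return i_g if p_g == 0 else i_g / p_g

print(anonymity_gain(a_vid=2, ag_vid=10, k=5))  # 3 (capped at k=5)
print(information_privacy(i_g=1.0, p_g=2))      # 0.5
```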
Domain-hierarchy based algorithms
Sweeney's algorithm, Bayardo's tree-pruning algorithm, and Wang's top-down and bottom-up algorithms. They are all dimension-by-dimension methods.
Multidimensional techniques
What about categorical data? Categories are mapped to numbers (numerized), as in Bayardo's [ICDE05] paper. Does the ordering of the categories matter? (No research on that.)
Numerical data
k-anonymization becomes an n-dimensional space-partitioning problem, so many existing techniques can be applied: single-dimensional vs. multidimensional partitioning.
The evolution of approaches:
• Categorical data with a domain hierarchy [Sweeney; top-down/bottom-up]
• Numerized categories, single-dimensional [Bayardo, ICDE05]
• Numerized/numerical, multidimensional [Mondrian, spatial indexing, …]
Method 1: Mondrian
Numerize categorical data, then apply a top-down partitioning process (Step 1, Step 2.1, Step 2.2 in the figure). A cut is allowable only if both resulting halves still contain at least k records.
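The top-down process can be sketched as a recursive median split. This is a simplified version of Mondrian, assumed for illustration (the published algorithm normalizes ranges and handles ties more carefully):

```python
def mondrian(points, k):
    """Greedy top-down Mondrian-style partitioning (simplified sketch).
    A cut is allowable only when both halves keep at least k points;
    each step splits the widest dimension at the median."""
    if len(points) < 2 * k:  # no allowable cut left
        return [points]
    dims = range(len(points[0]))
    # pick the dimension with the widest value range
    d = max(dims, key=lambda i: max(p[i] for p in points) - min(p[i] for p in points))
    pts = sorted(points, key=lambda p: p[d])
    mid = len(pts) // 2
    return mondrian(pts[:mid], k) + mondrian(pts[mid:], k)

groups = mondrian([(age, age % 7) for age in range(20, 60)], k=5)
print(all(len(g) >= 5 for g in groups))  # True
```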
Method 2: spatial indexing
Multidimensional spatial techniques: the kd-tree (similar to the Mondrian algorithm), and the R-tree and its variations (R-tree, R+-tree).
[Figure: R-tree structure with a leaf layer and an upper layer]
Compacting bounds
Example: uncompacted: age [1-80], salary [10k-100k]; compacted: age [20-40], salary [10k-50k].
The original Mondrian does not consider compacting bounds; for the R+-tree, it is done automatically, so information is better preserved.
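Compacting a group's bounds is just shrinking the published ranges to the minimum bounding box of its records. A minimal sketch, using numbers close to the example above:

```python
def compact_bounds(group):
    """Shrink a group's published ranges to the minimum bounding box of
    the records it actually contains (the R+-tree does this implicitly)."""
    dims = range(len(group[0]))
    return [(min(p[d] for p in group), max(p[d] for p in group)) for d in dims]

# Records with age 20-40 and salary 10k-50k, inside a partition
# that nominally spans age 1-80 and salary 10k-100k:
print(compact_bounds([(20, 10000), (33, 45000), (40, 50000)]))
# [(20, 40), (10000, 50000)]
```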
Benefits of using the R+-tree
• Scalable: originally designed for indexing large disk-based data.
• Multi-granularity k-anonymity via the tree layers.
• Better performance and better quality.
[Figure: performance comparison against Mondrian]
Utility Metrics
• Discernibility penalty.
• KL divergence: describes the difference between a pair of distributions, here the anonymized vs. the original data distribution.
• Certainty penalty: for table T with records t and m attributes, where t.A_i is the generalized range of attribute A_i in record t and T.A_i is the total range of A_i in T, sum |t.A_i| / |T.A_i| over all records and attributes.
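The certainty penalty above can be computed directly from the generalized ranges. A minimal sketch (record and domain values are illustrative):

```python
def certainty_penalty(generalized_records, domain_ranges):
    """Sum over records t and attributes A_i of |t.A_i| / |T.A_i|:
    generalized interval width relative to the attribute's total range."""
    total = 0.0
    for record in generalized_records:
        for (lo, hi), (dom_lo, dom_hi) in zip(record, domain_ranges):
            total += (hi - lo) / (dom_hi - dom_lo)
    return total

# One record generalized to age [20, 40] and salary [10k, 50k],
# with domains age [0, 80] and salary [10k, 100k]:
print(certainty_penalty([[(20, 40), (10000, 50000)]],
                        [(0, 80), (10000, 100000)]))  # 0.25 + 4/9 ~= 0.694
```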
Other issues: sparse high-dimensional data
Transactional data can be viewed as a boolean matrix; see "On the anonymization of sparse high-dimensional data" [ICDE08]. This relates to the clustering problem for transactional data: the above paper uses matrix-based clustering; item-based clustering is an open question (?).
Other issues
• Effect of numerizing categorical data: the ordering of categories may have some impact on quality.
• General-purpose utility metrics vs. special task-oriented utility metrics.
• Attacks on the k-anonymity definition.