Data Anonymization (1)
Jan 08, 2016
Outline
• Problem concepts
• Algorithms on the domain generalization hierarchy
• Algorithms on numerical data
The Massachusetts Governor Privacy Breach
Medical data:
• Name • SSN • Visit Date • Diagnosis • Procedure • Medication • Total Charge
Voter list:
• Name • Address • Date Registered • Party Affiliation • Date Last Voted
Quasi-identifier shared by both tables:
• Zip
• Birth date
• Sex
• The Governor of MA was uniquely identified by linking Zip Code, Birth Date, and Sex across the two tables, so his name was linked to his diagnosis.
• These three attributes uniquely identify 87% of the US population.
(Sweeney, IJUFKS 2002)
Definitions
• Table: columns are attributes, rows are records.
• Quasi-identifier (QI): a set of attributes that can potentially be used to identify individuals.
• k-anonymity: every QI value combination that appears in the table appears at least k times.
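The k-anonymity condition above can be checked directly by counting QI combinations. A minimal sketch, with an illustrative table and column choice (not from the lecture):

```python
from collections import Counter

def is_k_anonymous(records, qi_indices, k):
    """Return True iff every QI value combination occurs at least k times."""
    counts = Counter(tuple(r[i] for i in qi_indices) for r in records)
    return all(c >= k for c in counts.values())

table = [
    ("0213*", "196*", "M"),
    ("0213*", "196*", "M"),
    ("0214*", "197*", "F"),
    ("0214*", "197*", "F"),
]
print(is_k_anonymous(table, qi_indices=(0, 1, 2), k=2))  # True
```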
Basic techniques
• Generalization: replace a value with a less specific one, e.g. Zip {02138, 02139} → 0213*.
• Domain generalization hierarchy (DGH): a chain of increasingly general domains A0 → A1 → … → An, e.g. {02138, 02139} → 0213* → 021* → 02* → 0**. This hierarchy is a tree structure.
• Suppression: remove the record entirely.
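One step up the zip-code DGH can be sketched as masking trailing digits. This assumes a fixed-length, star-padded convention (one of several possible conventions):

```python
def generalize(value, level):
    """One DGH step per level: replace the last `level` characters with '*'.
    Assumed star-padded convention, for illustration only."""
    if level == 0:
        return value
    return value[:-level] + "*" * level

print(generalize("02138", 1))  # '0213*'
print(generalize("02139", 1))  # '0213*' -- both zips fall into the same group
```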
Balance
Better privacy guarantee ↔ lower data utility.
Many generalization schemes satisfy a given k-anonymity specification; we want the one that minimally distorts the table, in order to maximize data utility.
• Suppression is required when we cannot find a k-anonymous group for a record.
Criteria
• Minimal generalization: the least generalization that satisfies the k-anonymity specification.
• Minimal table distortion: among minimal generalizations, the one with minimal utility loss; precision is used to evaluate the loss [Sweeney's papers].
• Application-specific utility may also be the criterion.
Complexity
Finding the optimal generalization is NP-hard [Bayardo, ICDE05], so all proposed algorithms are approximate.
Shared feature across solutions: they always satisfy the k-anonymity specification; records that cannot be grouped are suppressed.
The differences lie in the utility loss/cost function: Sweeney's precision metric, the discernibility and classification metrics, and the information-privacy metric.
Algorithms
Assume the domain generalization hierarchy is given. Goals: efficiency and utility maximization.
Metrics to be optimized
Two cost metrics we want to minimize [Bayardo, ICDE05]:
• Discernibility: each record is charged the number of records in its k-anonymous group.
• Classification: when the dataset has a class-label column and we want to preserve the classification model, each group is charged the number of its records in minority classes.
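The discernibility cost above can be sketched as follows. This assumes the common convention that records in groups smaller than k are suppressed and charged the whole table size; the group labels are illustrative:

```python
from collections import Counter

def discernibility_penalty(group_labels, k, table_size):
    """Each record in a QI group of size s >= k costs s; records in groups
    smaller than k are treated as suppressed and cost the table size."""
    cost = 0
    for size in Counter(group_labels).values():
        cost += size * size if size >= k else size * table_size
    return cost

# Three records share one generalized QI, one record is left alone:
print(discernibility_penalty(["g1", "g1", "g1", "g2"], k=2, table_size=4))  # 13
```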
Information-privacy metric
A combination of information loss and anonymity gain [Wang, ICDE04].
Information loss (when the dataset has class labels): entropy measures the impurity of the labels in a set S:
    info(S) = -Σ_i p_i log p_i, where p_i is the fraction of records in S with class label i.
For a generalization G: {c1, c2, …, cn} → p that merges the child groups R_c1, …, R_cn into the parent group R_p:
    I(G) = info(R_p) - Σ_i (|R_ci| / |R_p|) · info(R_ci)
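The entropy and information-loss formulas above can be computed directly. A minimal sketch (base-2 logarithm assumed; the example labels are illustrative):

```python
import math
from collections import Counter

def info(labels):
    """Entropy -sum_i p_i log2 p_i of a multiset of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_loss(child_label_lists):
    """I(G) for merging child groups R_c1..R_cn into the parent R_p:
    parent entropy minus the size-weighted average of child entropies."""
    parent = [lab for group in child_label_lists for lab in group]
    n = len(parent)
    return info(parent) - sum(len(g) / n * info(g) for g in child_label_lists)

# Merging two pure groups with opposite labels loses one full bit:
print(information_loss([["+", "+"], ["-", "-"]]))  # 1.0
```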
Anonymity gain
A(VID): the number of records sharing the VID; A^G(VID): the count after applying generalization G. A^G(VID) >= A(VID): a generalization improves or does not change A(VID).
    P(G) = x - A(VID), where x = A^G(VID) if A^G(VID) <= k, and x = k otherwise.
Once k-anonymity is satisfied, further generalization of the VID gains nothing.
Information-privacy combined metric: IP = information loss / anonymity gain = I(G) / P(G).
We want to minimize IP. If P(G) == 0, use I(G) alone. Either a small I(G) or a large P(G) reduces IP; if the P(G) values are equal, pick the generalization with minimum I(G).
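Putting the two pieces together, the IP score (with the P(G) == 0 fallback described above) can be sketched as:

```python
def anonymity_gain(a_vid, ag_vid, k):
    """P(G) = x - A(VID), with x = A^G(VID) capped at k: growing a VID
    group beyond k yields no further privacy under k-anonymity."""
    return min(ag_vid, k) - a_vid

def information_privacy(i_g, p_g):
    """IP = I(G) / P(G); fall back to I(G) alone when P(G) == 0."""
    return i_g if p_g == 0 else i_g / p_g

print(anonymity_gain(a_vid=2, ag_vid=10, k=5))  # 3 (capped at k=5)
print(information_privacy(i_g=1.0, p_g=2))      # 0.5
```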
Domain-hierarchy based algorithms
Sweeney's algorithm, Bayardo's tree-pruning algorithm, and Wang's top-down and bottom-up algorithms. They are all dimension-by-dimension methods.
Multidimensional techniques
What about categorical data? Categories are mapped to numbers (numerized), as in Bayardo's [ICDE05] paper. Does the ordering of the categories matter? (No research on that.)
Numerical data
k-anonymization becomes an n-dimensional space-partitioning problem, so many existing techniques can be applied: single-dimensional vs. multidimensional partitioning.
The evolution of approaches:
• Categorical data with a domain hierarchy [Sweeney; top-down/bottom-up]
• Numerized categories, single-dimensional [Bayardo, ICDE05]
• Numerized/numerical, multidimensional [Mondrian, spatial indexing, …]
Method 1: Mondrian
Numerize categorical data, then apply a top-down partitioning process (Step 1, Step 2.1, Step 2.2 in the figure). A cut is allowable only if both resulting halves still contain at least k records.
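The top-down process can be sketched as a recursive median split. This is a simplified version of Mondrian, assumed for illustration (the published algorithm normalizes ranges and handles ties more carefully):

```python
def mondrian(points, k):
    """Greedy top-down Mondrian-style partitioning (simplified sketch).
    A cut is allowable only when both halves keep at least k points;
    each step splits the widest dimension at the median."""
    if len(points) < 2 * k:  # no allowable cut left
        return [points]
    dims = range(len(points[0]))
    # pick the dimension with the widest value range
    d = max(dims, key=lambda i: max(p[i] for p in points) - min(p[i] for p in points))
    pts = sorted(points, key=lambda p: p[d])
    mid = len(pts) // 2
    return mondrian(pts[:mid], k) + mondrian(pts[mid:], k)

groups = mondrian([(age, age % 7) for age in range(20, 60)], k=5)
print(all(len(g) >= 5 for g in groups))  # True
```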
Method 2: spatial indexing
Multidimensional spatial techniques: the kd-tree (similar to the Mondrian algorithm), and the R-tree and its variations (R-tree, R+-tree).
[Figure: R-tree structure with a leaf layer and an upper layer]
Compacting bounds
Example: uncompacted: age [1-80], salary [10k-100k]; compacted: age [20-40], salary [10k-50k].
The original Mondrian does not consider compacting bounds; for the R+-tree, it is done automatically, so information is better preserved.
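Compacting a group's bounds is just shrinking the published ranges to the minimum bounding box of its records. A minimal sketch, using numbers close to the example above:

```python
def compact_bounds(group):
    """Shrink a group's published ranges to the minimum bounding box of
    the records it actually contains (the R+-tree does this implicitly)."""
    dims = range(len(group[0]))
    return [(min(p[d] for p in group), max(p[d] for p in group)) for d in dims]

# Records with age 20-40 and salary 10k-50k, inside a partition
# that nominally spans age 1-80 and salary 10k-100k:
print(compact_bounds([(20, 10000), (33, 45000), (40, 50000)]))
# [(20, 40), (10000, 50000)]
```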
Benefits of using the R+-tree
• Scalable: originally designed for indexing large disk-based data.
• Multi-granularity k-anonymity via the tree layers.
• Better performance and better quality.
[Figure: performance comparison against Mondrian]
Utility Metrics
• Discernibility penalty.
• KL divergence: describes the difference between a pair of distributions, here the anonymized vs. the original data distribution.
• Certainty penalty: for table T with records t and m attributes, where t.A_i is the generalized range of attribute A_i in record t and T.A_i is the total range of A_i in T, sum |t.A_i| / |T.A_i| over all records and attributes.
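The certainty penalty above can be computed directly from the generalized ranges. A minimal sketch (record and domain values are illustrative):

```python
def certainty_penalty(generalized_records, domain_ranges):
    """Sum over records t and attributes A_i of |t.A_i| / |T.A_i|:
    generalized interval width relative to the attribute's total range."""
    total = 0.0
    for record in generalized_records:
        for (lo, hi), (dom_lo, dom_hi) in zip(record, domain_ranges):
            total += (hi - lo) / (dom_hi - dom_lo)
    return total

# One record generalized to age [20, 40] and salary [10k, 50k],
# with domains age [0, 80] and salary [10k, 100k]:
print(certainty_penalty([[(20, 40), (10000, 50000)]],
                        [(0, 80), (10000, 100000)]))  # 0.25 + 4/9 ~= 0.694
```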
Other issues: sparse high-dimensional data
Transactional data can be viewed as a boolean matrix; see "On the anonymization of sparse high-dimensional data" [ICDE08]. This relates to the clustering problem for transactional data: the above paper uses matrix-based clustering; item-based clustering is an open question (?).
Other issues
• Effect of numerizing categorical data: the ordering of categories may have some impact on quality.
• General-purpose utility metrics vs. special task-oriented utility metrics.
• Attacks on the k-anonymity definition.