Anonymizing Sequential Releases ACM SIGKDD 2006 Benjamin C. M. Fung Simon Fraser University [email protected] Ke Wang Simon Fraser University [email protected].

Anonymizing Sequential Releases

ACM SIGKDD 2006

Benjamin C. M. Fung

Simon Fraser University

[email protected]

Ke Wang

Simon Fraser University

[email protected]


Motivation: Sequential Releases

• Previous works address single release only.

• Data are released in multiple shots.

• An organization makes a new release when:
  – New information becomes available.
  – A tailored view is needed for each data-sharing purpose.
  – Sensitive information and identifying information are released separately.

• Related releases sharpen the identification of individuals by a global quasi-identifier.

T2: Previous Release

Pid  Job       Disease
1    Banker    Cancer
2    Banker    Cancer
3    Clerk     HIV
4    Driver    Cancer
5    Engineer  HIV

T1: Current Release

Pid  Name   Job       Class
1    Alice  Banker    c1
2    Alice  Banker    c1
3    Bob    Clerk     c2
4    Bob    Driver    c3
5    Cathy  Engineer  c4

The join on T1.Job = T2.Job

Pid  Name   Job       Disease  Class
1    Alice  Banker    Cancer   c1
2    Alice  Banker    Cancer   c1
3    Bob    Clerk     HIV      c2
4    Bob    Driver    Cancer   c3
5    Cathy  Engineer  HIV      c4
-    Alice  Banker    Cancer   c1
-    Alice  Banker    Cancer   c1

We do not want Name to be linked to Disease in the join of the two releases.

The join sharpens identification: {Bob, HIV} has group size 1. The only joined record containing both values is:

3 Bob Clerk HIV c2

The join weakens identification: {Alice, Cancer} has group size 4.

Lossy join: a way to combat the join attack.

The join enables inferences across tables: Alice → Cancer has 100% confidence.
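The three observations about the join can be checked with a few lines of Python. This is only an illustrative sketch: the rows are copied from the T1 and T2 tables of the example, while the variable names and the helper `confidence` are assumptions introduced here, not part of the original slides.

```python
from collections import Counter

# Toy releases from the example (Pid is omitted because the attacker joins on Job).
T2 = [("Banker", "Cancer"), ("Banker", "Cancer"), ("Clerk", "HIV"),
      ("Driver", "Cancer"), ("Engineer", "HIV")]
T1 = [("Alice", "Banker", "c1"), ("Alice", "Banker", "c1"), ("Bob", "Clerk", "c2"),
      ("Bob", "Driver", "c3"), ("Cathy", "Engineer", "c4")]

# Equi-join on T1.Job = T2.Job.
joined = [(name, job, disease, cls)
          for (name, job, cls) in T1
          for (job2, disease) in T2
          if job == job2]

pair_counts = Counter((name, disease) for name, _, disease, _ in joined)
name_counts = Counter(name for name, _, _, _ in joined)

def confidence(name, disease):
    """Fraction of joined records with this name that carry this disease (hypothetical helper)."""
    return pair_counts[(name, disease)] / name_counts[name]

print(len(joined))                        # 7 joined records (the Banker rows multiply)
print(pair_counts[("Bob", "HIV")])        # 1: the join sharpens identification
print(pair_counts[("Alice", "Cancer")])   # 4: the lossy join weakens identification
print(confidence("Alice", "Cancer"))      # 1.0: inference across tables
```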

Related Work

• k-anonymity [SS98, FWY05, BA05, LDR05, WYC04, WLFW06]
  – Quasi-identifier (QID): a set of identifying attributes in the table. If some record is linked to an external source by a QID value, so are at least k-1 other records.
  – The database is made anonymous to itself.
  – In sequential releases, the database must be made anonymous to the combination of all releases thus far.

Related Work

• l-diversity [MGK06]
  – Ensures that sensitive values are “well-represented” in each QID group, measured by entropy.

• Confidence limiting [WFY05, WFY06]: qid → s, confidence < h, where qid is a value on QID and s is a sensitive value.

Related Work

• View releases
  – e.g., T1 and T2 are two views; both can be modified before the release, which leaves more room for satisfying privacy and information requirements.
  – [MW04, DP05] measure information disclosure of a view set wrt a secret view.
  – [YWJ05, KG06] detect privacy violation by a view set over a base table.
  – They measure or detect violations, but do not remove them.

Sequential Release

• Sequential release:
  – Current release T1. Previous release T2.
  – T1 was unknown when T2 was released.
  – T2, once released, cannot be modified when T1 is released.

• Solution #1: k-anonymize all attributes in T1.
  – Excessive distortion.

• Solution #2: generalize T1 based on T2.
  – Monotonically distorts the later release.

• Solution #3: release a “complete” cohort of all potential releases anonymized at one time.
  – Requires predicting all future releases.


Intuition of Our Approach

• A lossy join hides the true join relationship to cripple a global QID.

• Generalize the current release T1 so that the join with the previous release T2 becomes lossy enough to disorient the attacker.

• Two general notions of privacy: (X,Y)-anonymity and (X,Y)-linkability, where X and Y are sets of attributes.

(X,Y)-Privacy

• k-anonymity: # of distinct records for each QID value ≥ k.

• (X,Y)-anonymity: # of distinct Y values for each X value ≥ k.

• (X,Y)-linkability: the maximum confidence that a record contains y given that it contains x is ≤ k, where (x,y) are values on X and Y.

• These notions generalize k-anonymity [SS98] and confidence limiting [WFY05, WFY06].

Example: (X,Y)-Anonymity

Pid  Job     Zip  PoB     Test
1    Banker  123  Canada  HIV
1    Banker  123  Canada  Diabetes
1    Banker  123  Canada  Eye
2    Clerk   456  Japan   HIV
2    Clerk   456  Japan   Diabetes
2    Clerk   456  Japan   Eye
2    Clerk   456  Japan   Heart

• QID = {Job, Zip, PoB} is not a key.
• k-anonymity fails to ensure that each value on QID is linked to at least k distinct patients.

Example: (X,Y)-Anonymity

• With (X,Y)-anonymity,
  – specify the anonymity wrt patients by letting X = {Job, Zip, PoB} and Y = Pid.
  – Each X group must be linked to at least k distinct values on Pid.

• If X = {Job, Zip, PoB} and Y = Test, each X group is required to be linked to at least k distinct tests.
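A minimal sketch of the (X,Y)-anonymity check on the patient table from this example follows. The function name `anonymity` and the column map `COLS` are assumptions made for illustration; the function returns the smallest number of distinct Y values linked to any X value, which must be at least k for the requirement to hold.

```python
from collections import defaultdict

# Patient table from the example: (Pid, Job, Zip, PoB, Test).
ROWS = [
    (1, "Banker", "123", "Canada", "HIV"),
    (1, "Banker", "123", "Canada", "Diabetes"),
    (1, "Banker", "123", "Canada", "Eye"),
    (2, "Clerk", "456", "Japan", "HIV"),
    (2, "Clerk", "456", "Japan", "Diabetes"),
    (2, "Clerk", "456", "Japan", "Eye"),
    (2, "Clerk", "456", "Japan", "Heart"),
]
COLS = {"Pid": 0, "Job": 1, "Zip": 2, "PoB": 3, "Test": 4}

def anonymity(rows, x_attrs, y_attrs):
    """Minimum, over all X values, of the number of distinct Y values linked to it."""
    linked = defaultdict(set)
    for row in rows:
        x = tuple(row[COLS[a]] for a in x_attrs)
        y = tuple(row[COLS[a]] for a in y_attrs)
        linked[x].add(y)
    return min(len(ys) for ys in linked.values())

qid = ["Job", "Zip", "PoB"]
print(anonymity(ROWS, qid, ["Pid"]))   # 1: each QID value identifies a single patient
print(anonymity(ROWS, qid, ["Test"]))  # 3: each QID value is linked to at least 3 tests
```

Counting records instead of patients would report at least 3 records per QID value, which is why plain k-anonymity is misleading when QID is not a key.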

Example: (X,Y)-Linkability

Pid  Job     Zip  PoB     Test
4    Banker  123  Canada  Diabetes

• {Banker, 123, Canada} → HIV (75% confidence).
• With Y = Test, (X,Y)-linkability states that no test can be inferred from a value on X with a confidence higher than a given threshold.
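The confidence in this example can be computed as below. This is a sketch under a stated assumption: only the Diabetes record survives in the table above, so the three HIV records used here are hypothetical rows chosen to be consistent with the stated 75% confidence. The function name `linkability` is likewise introduced for illustration.

```python
from collections import Counter

def linkability(rows, x_of, y_of):
    """Maximum confidence of inferring a Y value from an X value."""
    x_counts = Counter(x_of(r) for r in rows)
    xy_counts = Counter((x_of(r), y_of(r)) for r in rows)
    return max(n / x_counts[x] for (x, _), n in xy_counts.items())

# Hypothetical group: records 1-3 are assumed, record 4 is the one shown in the example.
ROWS = [
    (1, "Banker", "123", "Canada", "HIV"),
    (2, "Banker", "123", "Canada", "HIV"),
    (3, "Banker", "123", "Canada", "HIV"),
    (4, "Banker", "123", "Canada", "Diabetes"),
]

conf = linkability(ROWS, x_of=lambda r: r[1:4], y_of=lambda r: r[4])
print(conf)  # 0.75, which violates any threshold below 75%
```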


Problem Statement

• The data holder has previously released T2 and wants to release T1, where T2 and T1 are projections of the same underlying table.

• Want to ensure (X,Y)-privacy on the join of T1 and T2.

• Sequential anonymization is to generalize T1 on X ∩ att(T1) so that the join of T1 and T2 preserves the (X,Y)-privacy and T1 remains as useful as possible.

Generalization / Specialization

• Each generalization replaces all child values with the parent value.
  – A cut contains exactly one value on every root-to-leaf path.

• Each specialization v → {v1,…,vc} replaces the value v in every record containing v with the child value vi that is consistent with the original domain value in the record.

Taxonomy tree for Job:

ANY
  Professional
    Engineer
    Lawyer
  Admin
    Banker
    Clerk
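The specialization step can be sketched on the Job taxonomy as follows. The representation of a record as an (original value, current generalized value) pair and the names `TAXONOMY`, `leaves` and `specialize` are assumptions made for this illustration.

```python
# Job taxonomy from the slide: parent -> children.
TAXONOMY = {
    "ANY": ["Professional", "Admin"],
    "Professional": ["Engineer", "Lawyer"],
    "Admin": ["Banker", "Clerk"],
}

def leaves(value):
    """Set of original domain values covered by a taxonomy value."""
    children = TAXONOMY.get(value, [])
    if not children:
        return {value}
    covered = set()
    for child in children:
        covered |= leaves(child)
    return covered

def specialize(records, v):
    """Apply v -> {v1,...,vc}: replace v by the child consistent with the original value."""
    result = []
    for original, current in records:
        if current == v:
            current = next(c for c in TAXONOMY[v] if original in leaves(c))
        result.append((original, current))
    return result

records = [(job, "ANY") for job in ["Engineer", "Banker", "Clerk"]]
step1 = specialize(records, "ANY")    # cut becomes {Professional, Admin}
step2 = specialize(step1, "Admin")    # cut becomes {Professional, Banker, Clerk}
print(step2)
```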

Generalization / Specialization

• An interval of a continuous attribute is split on-the-fly to maximize information utility.
  – e.g., age [30-40) → [30-37), [37-40)
  – The split at 37 maximizes the information gain.

• A taxonomy tree is dynamically grown for each continuous (non-join) attribute.


Match Function

• Given T1 and T2, the attacker may apply prior knowledge to match the records in T1 and T2.

• So, the data holder applies such prior knowledge for matching:

– schema information of T1 and T2.

– taxonomies for attributes.

– the inclusion-exclusion principle described next.

Match Function

• Let t1 ∈ T1 and t2 ∈ T2.

• Consistency Predicate: t1.A matches t2.A if they are on the same generalization path for attribute A.
  – e.g., Male matches Single Male.

• Inconsistency Predicate: t1.A matches t2.B only if t1.A and t2.B are not semantically inconsistent.
  – Excludes impossible matches.
  – e.g., Male and Pregnant are semantically inconsistent, and so are Married Male and 6 Month Pregnant.
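The two predicates can be sketched as follows. The taxonomy fragment in `PARENT` and the declared pair in `INCONSISTENT` are hypothetical, built only from the values named in the examples above; propagating an inconsistency from a value to its descendants is an assumption of this sketch.

```python
# Hypothetical taxonomy fragment: child -> parent.
PARENT = {
    "Single Male": "Male",
    "Married Male": "Male",
    "Male": "ANY",
    "6 Month Pregnant": "Pregnant",
}
# Hypothetical declared semantic inconsistencies between values of two attributes.
INCONSISTENT = {("Male", "Pregnant")}

def path_to_root(value):
    """The value followed by all of its ancestors."""
    path = [value]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def consistent(a, b):
    """Consistency predicate: a and b lie on the same generalization path."""
    return a in path_to_root(b) or b in path_to_root(a)

def not_inconsistent(a, b):
    """Inconsistency predicate: no value on either path is declared inconsistent with the other."""
    return not any((p, q) in INCONSISTENT or (q, p) in INCONSISTENT
                   for p in path_to_root(a) for q in path_to_root(b))

print(consistent("Male", "Single Male"))                     # True
print(consistent("Single Male", "Married Male"))             # False
print(not_inconsistent("Married Male", "6 Month Pregnant"))  # False: impossible match
```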

Algorithm Overview

Top-Down Specialization for Sequential Anonymization

Input: T1, T2, an (X,Y)-privacy requirement, a taxonomy tree for each attribute in X1, where X1 = X ∩ att(T1).
Output: a generalized T1 satisfying the privacy requirement.

1. generalize every value of Aj to ANYj where Aj ∈ X1;
2. while there is a valid candidate in ∪Cutj do
3.     find the winner w of highest Score(w) from ∪Cutj;
4.     specialize w on T1 and remove w from ∪Cutj;
5.     update Score(v) and the valid status for all v in ∪Cutj;
6. end while
7. output the generalized T1 and ∪Cutj;
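The loop above can be illustrated with a deliberately simplified Python sketch. It is not the authors' implementation. It assumes a single generalized attribute (Job), takes the requirement to be (X,Y)-anonymity with X = {Job, Disease} and Y = Pid, recomputes the join from scratch in every iteration, and picks the valid candidate with the smallest privacy loss instead of the highest Score. The tables are hypothetical toy data, and all function names are introduced here.

```python
from collections import defaultdict

TAXONOMY = {
    "ANY": ["Professional", "Admin"],
    "Professional": ["Engineer", "Lawyer"],
    "Admin": ["Banker", "Clerk"],
}

def leaves(v):
    """Original Job values covered by taxonomy value v."""
    kids = TAXONOMY.get(v, [])
    return {v} if not kids else set().union(*(leaves(c) for c in kids))

def anonymity(t1, t2, cut):
    """Minimum number of distinct Pids linked to any (generalized Job, Disease) value in the join."""
    linked = defaultdict(set)
    for pid, job in t1:
        g = next(v for v in cut if job in leaves(v))   # generalize Job to the cut
        for job2, disease in t2:
            if job2 in leaves(g):                      # match along the generalization path
                linked[(g, disease)].add(pid)
    return min(len(pids) for pids in linked.values())

def top_down_specialize(t1, t2, k):
    cut = {"ANY"}                                      # step 1: most general state
    while True:
        before = anonymity(t1, t2, cut)
        valid = []
        for v in sorted(cut):
            if v in TAXONOMY:                          # v still has children
                new_cut = (cut - {v}) | set(TAXONOMY[v])
                after = anonymity(t1, t2, new_cut)
                if after >= k:                         # specialization keeps the requirement
                    valid.append((before - after, v, new_cut))
        if not valid:                                  # no valid candidate left
            return cut
        # Simplified winner: smallest privacy loss, ties broken by name.
        _, _, cut = min(valid, key=lambda c: (c[0], c[1]))

# Hypothetical toy releases: T1 = (Pid, Job), T2 = (Job, Disease).
T1 = [(1, "Banker"), (2, "Banker"), (3, "Clerk"), (4, "Engineer"), (5, "Lawyer"), (6, "Clerk")]
T2 = [("Banker", "Cancer"), ("Banker", "Cancer"), ("Clerk", "HIV"),
      ("Engineer", "HIV"), ("Lawyer", "Cancer"), ("Clerk", "HIV")]

print(sorted(top_down_specialize(T1, T2, k=2)))  # ['Banker', 'Clerk', 'Professional']
```

With k = 2, Admin can be specialized to Banker and Clerk, but Professional cannot, because Engineer would then be linked to a single Pid. Rejoining the tables in every iteration is exactly the cost the Challenges slide warns about, so this sketch is only suitable for toy inputs.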

Monotonic Privacy

• Theorem 1: On a single table, (X,Y)-privacy is anti-monotone wrt specialization on X.
  – If violated, it remains violated after a specialization.

• AY(X) is non-increasing wrt specialization on X.
  – A specialization on X always reduces the set of records that contain an X value and, therefore, reduces the set of Y values that co-occur with an X value.

• LY(X) is non-decreasing wrt specialization on X.
  – A specialization v → {v1,…,vc} transforms a value x on X to the specialized values x1,…,xc on X.
  – If ly(xi) < ly(x) for some xi, there must exist some xj such that ly(xj) > ly(x), because ly(x) is a weighted average of ly(x1),…,ly(xc).
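The argument for LY(X) can be written out as a short calculation. The counting notation is an assumption introduced here: a(x) is the number of records containing x, a(x,y) is the number containing both x and y, and ly(x) = a(x,y)/a(x). On a single table, the specialization splits the records containing x among x1,…,xc, so the counts add up.

```latex
\[
l_y(x) \;=\; \frac{a(x,y)}{a(x)}
       \;=\; \frac{\sum_{i=1}^{c} a(x_i,y)}{\sum_{i=1}^{c} a(x_i)}
       \;=\; \sum_{i=1}^{c} \frac{a(x_i)}{a(x)}\, l_y(x_i),
\qquad \sum_{i=1}^{c} \frac{a(x_i)}{a(x)} = 1 .
\]
\[
\text{Hence } \max_{i}\, l_y(x_i) \;\ge\; l_y(x).
\]
```

Since the largest specialized confidence is never below the original one, the maximum confidence over all values cannot decrease.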

Monotonic Privacy

• On the join of T1 and T2, in general, (X,Y)-anonymity is not anti-monotone wrt a specialization on X ∩ att(T1).
  – Specializing T1 may create dangling records.

• Two tables are population-related if every record in each table has at least one matching record in the other table, i.e., there is no dangling record.

• Lemma 1: If T1 and T2 are population-related, AY(X) is non-increasing wrt specialization on X ∩ att(T1).

Monotonic Privacy

• Lemma 2: If Y contains attributes from T1 or T2, but not from both, LY(X) does not decrease after specialization of T1 on the attributes X ∩ att(T1).

• Theorem 2: If T1 and T2 are projections of the same underlying table, then (X,Y)-anonymity and (X,Y)-linkability on the join of T1 and T2 are anti-monotone wrt specialization of T1 on X ∩ att(T1).

Score Metric

• Score(v) evaluates the “goodness” of a specialization v for preserving privacy and information.

• Each specialization v gains some information and loses some privacy. We maximize

    Score(v) = InfoGain(v) / (PrivLoss(v) + 1)

• InfoGain(v) is measured on T1.
• PrivLoss(v) is measured on the join of T1 and T2.

Information Gain

• If T1 is released for classification on a specified class column, InfoGain(v) could be the reduction of the class entropy:

    InfoGain(v) = E(T1[v]) - Σi (|T1[vi]| / |T1[v]|) E(T1[vi])

  where E(·) denotes the class entropy of a set of records.

• T1[v] denotes the set of generalized records in T1 that contain v before the specialization.

• T1[vi] denotes the set of records in T1 that contain vi after the specialization.

• Alternatively, InfoGain(v) could be based on a notion of distortion.
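The entropy-based variant can be sketched as below, assuming the standard definition of information gain as the parent entropy minus the size-weighted entropy of the child partitions. The function names and the toy class labels are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Class entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(parent_labels, child_label_groups):
    """Entropy of T1[v] minus the size-weighted entropy of the partitions T1[vi]."""
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy(g) for g in child_label_groups if g)
    return entropy(parent_labels) - weighted

# Hypothetical class labels of the records containing v, and their split by child value.
parent = ["c1", "c1", "c2", "c2"]
perfect_split = [["c1", "c1"], ["c2", "c2"]]
useless_split = [["c1", "c2"], ["c1", "c2"]]

print(info_gain(parent, perfect_split))  # 1.0: the specialization separates the classes
print(info_gain(parent, useless_split))  # 0.0: no reduction in class entropy
```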

Privacy Loss

• PrivLoss(v) is measured by the decrease of AY(X) or the increase of LY(X) due to the specialization of v:

    AY(X) - AY(Xv) for (X,Y)-anonymity
    LY(Xv) - LY(X) for (X,Y)-linkability

  where X and Xv represent the attributes before and after specializing v, respectively.

Challenges

1. Each specialization on w affects the matching of the join and, thus, the privacy checking.
   • It is too expensive to rejoin the two tables for each specialization.

2. Materializing the join is impractical.
   • A lossy join can be very large.

Our solution: incrementally maintain some count statistics to update Score(v) without executing the join.

Data Structure

• Expensive operations on specializing w:
  – accessing the records in T1 containing w;
  – matching the records in T1 with the records in T2.

• X1 = X ∩ att(T1) and X2 = X ∩ att(T2).

• J1 and J2 denote the join attributes in T1 and T2.

Data Structure

• Tree1: partitions T1 records by the attributes X1 and J1-X1, in that order, one level per attribute.
  – Link[v] links up all nodes for v at the attribute level of v.

• Tree2: partitions T2 records by the attributes J2 and X2-J2, in that order.
  – Tree2 is static.

• Probe the matching partitions in Tree2.
  – Match the last |J1| attributes in a partition in Tree1 with the first |J2| attributes in Tree2.

Analysis

• On specializing w, Link[w] provides direct access to the records involved in T1.
• Tree2 provides direct access to the matching partitions in T2.
• Matching is performed at the partition level, not at the record level.
• The cost of each iteration has two parts:
  1. Specialize the affected partitions on Link[w].
  2. Update the score and status of candidates using count statistics.
• Each record in T1 is accessed at most |X ∩ att(T1)| × h times, where h is the maximum height of the taxonomies.
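Partition-level matching can be illustrated with the sketch below. It is a simplification under stated assumptions: flat dictionaries of partition counts stand in for Tree1 and Tree2, matching on the join attributes is plain equality rather than taxonomy-aware, and the function names are introduced here. The point it shows is that join group sizes follow from products of partition counts, so the join itself is never materialized.

```python
from collections import Counter

def partition_counts(records, attrs):
    """Number of records in each partition, keyed by the values on attrs."""
    return Counter(tuple(r[a] for a in attrs) for r in records)

def join_group_sizes(t1, t2, x1, j, x2):
    """Size of each join group over (X1, J, X2), computed from partition counts only."""
    p1 = partition_counts(t1, x1 + j)      # T1 partitions: X1 first, join attributes last
    p2 = partition_counts(t2, j + x2)      # T2 partitions: join attributes first, then X2
    sizes = Counter()
    for key1, n1 in p1.items():
        for key2, n2 in p2.items():
            # Match the last |J| attributes of key1 with the first |J| attributes of key2.
            if key1[len(x1):] == key2[:len(j)]:
                sizes[key1 + key2[len(j):]] += n1 * n2
    return sizes

# Toy releases from the motivating example.
T1 = [{"Name": "Alice", "Job": "Banker"}, {"Name": "Alice", "Job": "Banker"},
      {"Name": "Bob", "Job": "Clerk"}, {"Name": "Bob", "Job": "Driver"},
      {"Name": "Cathy", "Job": "Engineer"}]
T2 = [{"Job": "Banker", "Disease": "Cancer"}, {"Job": "Banker", "Disease": "Cancer"},
      {"Job": "Clerk", "Disease": "HIV"}, {"Job": "Driver", "Disease": "Cancer"},
      {"Job": "Engineer", "Disease": "HIV"}]

sizes = join_group_sizes(T1, T2, x1=["Name"], j=["Job"], x2=["Disease"])
print(sizes[("Alice", "Banker", "Cancer")])  # 4, from 2 x 2 partition counts
print(sum(sizes.values()))                   # 7 joined records in total
```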

Empirical Study

• The Adult data set: 45,222 records.
• Two versions of (T1,T2):
• Set A (categorical attributes only)
  – T1 contains the Class attribute, the 3 categorical attributes and the 3 join attributes.
  – T2 contains the 2 categorical attributes and the 3 join attributes.
• Set B (both categorical and continuous)
  – T1 contains the additional 6 continuous attributes from Taxation Department.

Schema for Set A

Department         Attribute            # of Leaves  # of Levels
Taxation (T1)      Education (E)        16           5
                   Occupation (O)       14           3
                   Work-class (W)       8            5
Common (T1 & T2)   Marital-status (M)   7            4
                   Relationship (Ra)    6            3
                   Sex (S)              2            2
Immigration (T2)   Native-country (Nc)  40           5
                   Race (Ra)            5            3

• T1 contains the Class attribute

Empirical Study

• Classification metric
  – Classification error on the generalized testing set of T1.

• Distortion metric [SS98]
  – Categorical: 1 unit of distortion for each generalization.
  – Continuous: Suppose v is generalized to interval [a-b). Unit of distortion = (b-a)/(f2-f1), where [f1,f2) is the full range of the attribute.
  – Normalize total distortion by the number of records.
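The distortion metric can be sketched as follows. Reading "1 unit of distortion for each generalization" as one unit per step up the taxonomy is an interpretation made for this sketch, and the function names, the taxonomy fragment and the attribute range are hypothetical.

```python
def continuous_distortion(a, b, f1, f2):
    """Distortion of generalizing a value to [a, b) when the full range is [f1, f2)."""
    return (b - a) / (f2 - f1)

def categorical_distortion(original, generalized, parent):
    """One unit per generalization step from the original value up to the generalized value."""
    steps, value = 0, original
    while value != generalized:
        value = parent[value]   # raises KeyError if generalized is not an ancestor
        steps += 1
    return steps

def normalized_distortion(units_per_record):
    """Total distortion divided by the number of records."""
    return sum(units_per_record) / len(units_per_record)

# Hypothetical taxonomy fragment (child -> parent) and a hypothetical age range [0, 100).
PARENT = {"Engineer": "Professional", "Professional": "ANY"}

d_job = categorical_distortion("Engineer", "ANY", PARENT)   # 2 steps
d_age = continuous_distortion(30, 37, 0, 100)               # 7 / 100
print(d_job, d_age)
print(normalized_distortion([d_job, 0, 1, 1]))              # 4 units over 4 records
```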

(X,Y)-Anonymity

• TopN attributes: most important for classification.
  – Chosen by successively removing the top attribute in a decision tree.

• Join attributes are the Top3 attributes.
  – If not important, simply remove them.

• X contains
  – TopN attributes in T1 for a specified N (to ensure that the generalization is performed on important attributes),
  – all join attributes,
  – all attributes in T2 (to ensure X is global).

Distortion of (X,Y)-anonymity

• Ki is a key in Ti.
• XYD: produced by our method with Y = K1.
• KAD: produced by k-anonymity on T1 with QID = att(T1).

[Figure: panels for Set A and Set B]

Classification error of (X,Y)-anonymity

• XYE: produced by our method with Y = K1.
• XYE(row): produced by our method with Y = {K1,K2}.
• BLE: produced by the unmodified data.
• KAE: produced by k-anonymity on T1 with QID = att(T1).
• RJE: produced by removing all join attributes from T1.

[Figure: panels for Set A and Set B]

(X,Y)-Linkability

• Y contains the TopN attributes.
  – If not important, simply remove them.

• X contains the rest of the attributes in T1 and T2, except T2.Ra and T2.Nc because otherwise no privacy requirement can be satisfied.

• Focus on the classification error because the distortion due to (X,Y)-linkability is not comparable with the distortion due to k-anonymity.

Classification error of (X,Y)-linkability

• XYE: produced by our method with Y = TopN.
• BLE: produced by the unmodified data.
• RJE: produced by removing all join attributes from T1.
• RSE: produced by removing all attributes in Y from T1.

[Figure: panels for Set A and Set B]

Scalability

[Figure: panels for (X,Y)-anonymity (k=40) and (X,Y)-linkability (k=90%)]

Conclusion

• Previous works on k-anonymization focused on a single release of data.
• Studied the sequential anonymization problem.
• Extended the privacy notion to this model.
• Introduced lossy join as a way to hide the join relationship among releases.
• Addressed computational challenges due to the large size of the lossy join.
• Extendable to more than one previously released table T2,…,Tp.

References

[BA05] R. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In IEEE ICDE, pages 217-228, 2005.

[DP05] A. Deutsch and Y. Papakonstantinou. Privacy in database publishing. In ICDT, 2005.

[FWY05] B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In IEEE ICDE, pages 205-216, April 2005.

[KG06] D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. In ACM SIGMOD, Chicago, IL, June 2006.

[LDR05] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In ACM SIGMOD, 2005.

[MGK06] A. Machanavajjhala, J. Gehrke, and D. Kifer. l-diversity: Privacy beyond k-anonymity. In IEEE ICDE, 2006.

[MW04] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS, 2004.

[SS98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In IEEE Symposium on Research in Security and Privacy, May 1998.

[WFY05] K. Wang, B. C. M. Fung, and P. S. Yu. Template-based privacy preservation in classification problems. In IEEE ICDM, pages 466-473, November 2005.

[WFY06] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's confidence: An alternative to k-anonymization. Knowledge and Information Systems: An International Journal, 2006.

[WYC04] K. Wang, P. S. Yu, and S. Chakraborty. Bottom-up generalization: A data mining solution to privacy protection. In IEEE ICDM, November 2004.

[WLFW06] R. C. W. Wong, J. Li, A. W. C. Fu, and K. Wang. (α,k)-anonymity: An enhanced k-anonymity model for privacy preserving data publishing. In ACM SIGKDD, 2006.

[YWJ05] C. Yao, X. S. Wang, and S. Jajodia. Checking for k-anonymity violation by views. In VLDB, 2005.
