On the Anonymization of Sparse High-Dimensional Data

Gabriel Ghinita (1), Yufei Tao (2), Panos Kalnis (1)

(1) National University of Singapore, {ghinitag,kalnis}@comp.nus.edu.sg
(2) Chinese University of Hong Kong, [email protected]

Dec 20, 2015

Page 1

On the Anonymization of Sparse High-Dimensional Data

Gabriel Ghinita (1), Yufei Tao (2), Panos Kalnis (1)

(1) National University of Singapore, {ghinitag,kalnis}@comp.nus.edu.sg

(2) Chinese University of Hong Kong, [email protected]

Page 2

Publishing Transaction Data

Retail chain-owned shopping cart data

Infer consumer spending patterns

Correlations among purchased items

e.g., 90% of cereal buyers also buy milk

What about privacy?

Page 3

Privacy Threat

[Figure labels: Quasi-identifying Items; Sensitive Items]

Page 4

Privacy Paradigm

ℓ-diversity: prevent association between quasi-identifier and sensitive attributes

Create groups of transactions

Frequency of a sensitive attribute (SA) value in a group < 1/p

Objective

Enforce privacy

Preserve correlations among items

Challenge: high data dimensionality
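A minimal sketch of the grouping condition on this slide, assuming each transaction is a Python set of item names and the frequency of a sensitive value is the fraction of transactions in the group that contain it. The function name `satisfies_privacy` and the sample items are illustrative and do not come from the slides.

```python
from collections import Counter


def satisfies_privacy(group, sensitive_items, p):
    """Return True if every sensitive item has frequency < 1/p in the group."""
    counts = Counter()
    for transaction in group:
        # Count each sensitive item at most once per transaction.
        counts.update(set(transaction) & sensitive_items)
    size = len(group)
    # count / size < 1/p  is equivalent to  count * p < size (no division needed).
    return all(count * p < size for count in counts.values())


# Hypothetical data: three shopping carts, one of them with a sensitive item.
group = [
    {"wine", "meat", "pregnancy_test"},
    {"wine", "cream"},
    {"meat", "cream"},
]
sensitive = {"pregnancy_test", "viagra"}

print(satisfies_privacy(group, sensitive, p=2))  # frequency 1/3 is below 1/2
print(satisfies_privacy(group, sensitive, p=3))  # frequency 1/3 is not below 1/3
```

The comparison is strict because the slide writes "< 1/p"; replacing `<` with `<=` would give the non-strict variant of the same check.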

Page 5

Data Re-organization

Band Matrix Organization

PRESERVES CORRELATIONS!

Page 6

Published Data

Summary of Sensitive Items

Page 7

Contributions

Novel data representation

Preserves correlation among items

Efficient heuristic for group formation

Linear time in the data size

Supports multiple sensitive items

Page 8

State-of-the-art: Mondrian [FWR06]

Generalization-based data-space partitioning similar to k-d-trees

Split recursively as long as the privacy condition still holds

constrained global recoding

k = 2

[FWR06] K. LeFevre et al. Mondrian Multidimensional k-Anonymity. Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006.

[Figure: partitioning example with axes Age (ticks 20, 40, 60) and Weight (ticks 40, 60, 80, 100)]

GENERALIZATION + HIGH DIMENSIONALITY = UNACCEPTABLE INFORMATION LOSS
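The k-d-tree-like partitioning on this slide can be sketched as a recursive median split that is accepted only while both halves keep at least k records. This is a simplified illustration, not the implementation from [FWR06]: the choice of the widest dimension, the median rule, and the names `mondrian` and `generalize` are assumptions, and the (Age, Weight) points are made up.

```python
def mondrian(points, k):
    """Split points (tuples of numbers) recursively; return a list of groups."""
    if len(points) >= 2 * k:
        dims = range(len(points[0]))

        def spread(d):
            values = [q[d] for q in points]
            return max(values) - min(values)

        # Try the dimension with the widest value range first.
        for d in sorted(dims, key=spread, reverse=True):
            ordered = sorted(points, key=lambda q: q[d])
            median = ordered[len(ordered) // 2][d]
            left = [q for q in ordered if q[d] < median]
            right = [q for q in ordered if q[d] >= median]
            # The cut is allowed only if both sides still hold k records.
            if len(left) >= k and len(right) >= k:
                return mondrian(left, k) + mondrian(right, k)
    # No allowable cut: this partition becomes one anonymized group.
    return [points]


def generalize(group):
    """Replace exact values by the [min, max] range of the group per dimension."""
    dims = range(len(group[0]))
    return [(min(q[d] for q in group), max(q[d] for q in group)) for d in dims]


# Hypothetical (Age, Weight) records, k = 2 as on the slide.
points = [(22, 45), (28, 70), (35, 50), (44, 85), (52, 60), (58, 95)]
groups = mondrian(points, k=2)
for g in groups:
    print(generalize(g), g)
```

Every record in a group is published with the group's ranges instead of its exact values. With many dimensions the ranges have to cover every attribute at once, which is where the information loss named on the slide comes from.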

Page 9

State-of-the-art: Anatomy [XT06]

Permutation-based method

Discloses exact QID values

Original table

Age  ZipCode  Disease
42   52000    Ulcer
47   43000    Pneumonia
51   32000    Flu
55   27000    Gastritis
62   41000    Dyspepsia
67   55000    Dyspepsia

“Anatomized” table

Age  ZipCode
42   52000
47   43000
51   32000
62   41000
55   27000
67   55000

Disease
Ulcer(1)
Pneumonia(1)
Flu(1)
Dyspepsia(1)
Gastritis(1)
Dyspepsia(1)

|G|! permutations

RANDOM GROUP FORMATION DOES NOT PRESERVE CORRELATIONS

[XT06] X. Xiao and Y. Tao. Anatomy: Simple and Effective Privacy Preservation. Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), 2006.
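A rough sketch of the permutation-based idea, using the six records from the slide: exact QID values are published with a group id, and the sensitive values are published separately as per-group counts. The group ids, the group size of 2, the fixed random seed, and the function name `anatomize` are assumptions for illustration. The random shuffle mirrors the "random group formation" remark only, and no diversity requirement is enforced here.

```python
import random
from collections import Counter


def anatomize(records, group_size, seed=0):
    """records: list of (qid_tuple, sensitive_value) pairs.

    Returns (qid_table, sensitive_table):
      qid_table       -> [(qid_tuple, group_id), ...]      exact QID values
      sensitive_table -> [(group_id, value, count), ...]   per-group summary
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)  # random group formation (ignores correlations)

    qid_table, sensitive_table = [], []
    starts = range(0, len(shuffled), group_size)
    for group_id, start in enumerate(starts, start=1):
        group = shuffled[start:start + group_size]
        qid_table.extend((qid, group_id) for qid, _ in group)
        summary = Counter(value for _, value in group)
        for value, count in sorted(summary.items()):
            sensitive_table.append((group_id, value, count))
    return qid_table, sensitive_table


records = [
    ((42, 52000), "Ulcer"),
    ((47, 43000), "Pneumonia"),
    ((51, 32000), "Flu"),
    ((55, 27000), "Gastritis"),
    ((62, 41000), "Dyspepsia"),
    ((67, 55000), "Dyspepsia"),
]
qid_table, sensitive_table = anatomize(records, group_size=2)
for row in qid_table:
    print(row)
for row in sensitive_table:
    print(row)
```

Within a group of size |G|, any of the |G|! assignments of sensitive values to QID rows is consistent with the published tables, which is the source of the privacy guarantee.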

Page 10

Band Matrix Representation

Bandwidth = U + L + 1 (U = upper bandwidth, L = lower bandwidth)

Minimizing bandwidth is NP-hard
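To make the bandwidth formula concrete, the sketch below computes U, L, and U + L + 1 for a small 0/1 matrix, taking U and L as the largest distance of a non-zero entry above and below the main diagonal. The function name `bandwidth` and the two sample matrices are illustrative.

```python
def bandwidth(matrix):
    """matrix: list of rows with 0/1 entries. Returns (U, L, U + L + 1)."""
    upper = lower = 0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value:
                upper = max(upper, j - i)  # distance above the main diagonal
                lower = max(lower, i - j)  # distance below the main diagonal
    return upper, lower, upper + lower + 1


banded = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
scattered = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
]
print(bandwidth(banded))     # (1, 1, 3)
print(bandwidth(scattered))  # (3, 3, 7)
```

Both matrices have the same size, but only the first keeps its non-zero entries close to the diagonal, which is what a band matrix organization aims for.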

Page 11

Reverse Cuthill-McKee (RCM)

Heuristic for bandwidth minimization

Solves the corresponding graph labeling problem

Permutes rows and columns

Complexity: O(N * D * log D)

N = number of matrix rows (# transactions)

D = maximum degree of any vertex
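A compact, pure-Python sketch of the RCM ordering on an undirected graph: breadth-first traversal from a low-degree vertex, neighbours visited in order of increasing degree, and the final order reversed. It assumes the graph has already been derived from the matrix (that step is not shown on the slide), and the names `reverse_cuthill_mckee` and `half_bandwidth` as well as the sample graph are illustrative.

```python
from collections import deque


def reverse_cuthill_mckee(adjacency):
    """adjacency: dict vertex -> set of neighbours. Returns vertices in RCM order."""
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    visited, order = set(), []
    # Cover every connected component, starting from low-degree vertices.
    for start in sorted(adjacency, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Sorting the neighbours by degree is the "D log D" part per vertex.
            for w in sorted(adjacency[v] - visited, key=lambda w: (degree[w], w)):
                visited.add(w)
                queue.append(w)
    return order[::-1]  # the "reverse" in Reverse Cuthill-McKee


def half_bandwidth(adjacency, order):
    """Largest label distance between two adjacent vertices under `order`."""
    pos = {v: i for i, v in enumerate(order)}
    return max(abs(pos[u] - pos[v]) for u in adjacency for v in adjacency[u])


# A path 0 - 3 - 1 - 4 - 2 whose labels are scrambled.
graph = {0: {3}, 1: {3, 4}, 2: {4}, 3: {0, 1}, 4: {1, 2}}
order = reverse_cuthill_mckee(graph)
print(order)                                   # [2, 4, 1, 3, 0]
print(half_bandwidth(graph, [0, 1, 2, 3, 4]))  # 3 with the original labels
print(half_bandwidth(graph, order))            # 1 after relabeling
```

For real workloads, SciPy offers `scipy.sparse.csgraph.reverse_cuthill_mckee`, which operates on sparse matrices directly.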

Page 12

Group Formation

Correlation-aware Anonymization of High-Dimensional Data (CAHD)

Use the order given by RCM

Consecutive transactions are highly correlated

O(pN) complexity
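The idea of grouping along the RCM order can be illustrated with a simplified greedy sketch. This is not the authors' CAHD algorithm: it only shows how each transaction with a sensitive item can be grouped with its nearest ungrouped neighbours in the order, while refusing neighbours that repeat a sensitive item already in the group. The names `form_groups`, `leftover_ok`, and `group_size` are hypothetical, the input is assumed to satisfy the bound as a whole, and no attempt is made to reach the O(pN) cost stated on the slide.

```python
def leftover_ok(rest, sens, group_size):
    """Check that each sensitive item is rare enough among the remaining rows."""
    counts = {}
    for i in rest:
        for item in sens[i]:
            counts[item] = counts.get(item, 0) + 1
    return all(c * group_size <= len(rest) for c in counts.values())


def form_groups(ordered, sensitive_items, group_size):
    """ordered: transactions (sets of items) in band-matrix order.

    Returns groups as sorted lists of indices into `ordered`.
    """
    n = len(ordered)
    sens = [t & sensitive_items for t in ordered]
    ungrouped = set(range(n))
    groups = []
    for i in range(n):
        if i not in ungrouped or not sens[i]:
            continue
        group, used = [i], set(sens[i])
        lo, hi = i - 1, i + 1
        # Expand outwards: closer rows in the order are assumed more correlated.
        while len(group) < group_size and (lo >= 0 or hi < n):
            for j in (lo, hi):
                if (0 <= j < n and j in ungrouped and len(group) < group_size
                        and not (sens[j] & used)):
                    group.append(j)
                    used |= sens[j]
            lo, hi = lo - 1, hi + 1
        rest = ungrouped - set(group)
        if len(group) < group_size or not leftover_ok(rest, sens, group_size):
            break  # stop here; all rows still ungrouped form one last group
        groups.append(sorted(group))
        ungrouped = rest
    if ungrouped:
        groups.append(sorted(ungrouped))
    return groups


# Hypothetical transactions, already in band-matrix order.
ordered = [
    {"a", "b"}, {"a", "b", "s1"}, {"b", "c"}, {"c", "d", "s2"},
    {"c", "d"}, {"d", "e", "s1"}, {"e", "f"}, {"e", "f"},
]
groups = form_groups(ordered, {"s1", "s2"}, group_size=2)
print(groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

With `group_size` members and each sensitive item appearing at most once, the frequency of any sensitive item in a group is at most 1/`group_size`. Choosing `group_size = p + 1` keeps it strictly below 1/p, as the privacy slide requires.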

Page 13

Group Formation

Page 14

Experimental Evaluation

Page 15

RCM Visualization

Page 16

Experimental Setting

BMS dataset

Compare with hybrid PermMondrian (PM)

Combines Mondrian with Anatomy

Query Workload

Reconstruction Error

Page 17

Reconstruction Error vs p

Page 18

Execution Time

Page 19

Conclusions

Anonymizing transaction data

High dimensionality

Preserving correlation

Future work

Different encodings for data representation

Enhance correlation among consecutive rows