PDM Workshop April 8, 2006 Deriving Private Information from Perturbed Data Using IQR-based Approach Songtao Guo UNC Charlotte Xintao Wu UNC Charlotte Yingjiu Li Singapore Management Univ

Dec 16, 2015

Page 1

PDM Workshop April 8, 2006

Deriving Private Information from Perturbed Data Using IQR-based Approach

Songtao Guo UNC Charlotte Xintao Wu UNC Charlotte Yingjiu Li Singapore Management Univ

Page 2

Source: http://www.privacyinternational.org/issues/foia/foia-laws.jpg

Page 3

Source: http://www.privacyinternational.org/survey/dpmap.jpg

• US: HIPAA for health care; California State Bill 1386; Gramm-Leach-Bliley Act for financial institutions; COPPA for children's online privacy; etc.

• Canada: PIPEDA 2000

• European Union: Directive 95/46/EC

Page 4

Mining vs. Privacy

• Data mining
  - The goal is to derive summary results (e.g., classification, clustering, association rules) from the data distribution

• Individual privacy
  - Individual values in the database must not be disclosed, or at least attackers must not be able to derive a close estimate

• Privacy Preserving Data Mining (PPDM)
  - How can we "perturb" data so that we can still build a good data mining model (data utility) while preserving each individual's privacy at the record level (privacy)?

Page 5

Our Focus

     SSN   Name   Zip     Age   Sex   Balance   …   Income   Interest Paid
1    ***   ***    28223   20    M     10k       …   85k      2k
2    ***   ***    28223   30    F     15k       …   70k      18k
3    ***   ***    28262   20    M     50k       …   120k     35k
.    .     .      .       .     .     .         …   .        .
n    ***   ***    28223   20    M     80k       …   110k     15k

• Zip, Age, Sex: k-anonymity, L-diversity, SDC etc.

• Balance, Income, Interest Paid: focus in this talk

Page 6

Additive Noise based PPDM

• Distribution reconstruction
  - AS method, Agrawal and Srikant, SIGMOD 00
  - EM method, Agrawal and Aggarwal, PODS 01

• Individual value reconstruction
  - Spectral Filtering (SF), Kargupta et al., ICDM 03
  - PCA, Huang, Du and Chen, SIGMOD 05

Page 7

Additive Randomization (Y = X + R)

[Figure: each record (e.g., 50 | 40K | ..., 30 | 70K | ...) passes through a randomizer that adds a random number to each value, giving records such as 65 | 20K | ... and 25 | 60K | .... Example: Alice's age 30 becomes 65 (30 + 35). From the randomized records, the distributions of Age and Salary are reconstructed and fed to a classification algorithm, which produces the model.]

• R. Agrawal and R. Srikant, SIGMOD 00
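A minimal sketch of the additive randomization step Y = X + R, using only the Python standard library. The uniform noise range of ±35, the age range and the sample size are illustrative assumptions, not values taken from the talk (35 only echoes the "30 becomes 65" example).

```python
import random
import statistics

def randomize(values, noise_range, rng):
    """Return Y = X + R, with R drawn uniformly from [-noise_range, noise_range]."""
    return [x + rng.uniform(-noise_range, noise_range) for x in values]

rng = random.Random(0)                                  # fixed seed, illustration only
ages = [rng.randint(20, 60) for _ in range(10_000)]     # hypothetical original ages X
perturbed = randomize(ages, 35, rng)                    # released values Y

# Individual values can move by up to 35, but zero-mean noise
# leaves the sample mean almost unchanged.
mean_shift = abs(statistics.fmean(perturbed) - statistics.fmean(ages))
print(f"mean shift: {mean_shift:.3f}")
```

The point of the sketch is the asymmetry the talk builds on: each record is heavily distorted, while aggregate characteristics stay close to the original ones.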

Page 8

Distribution Reconstruction

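• Algorithm

  f_X^0 := Uniform distribution
  j := 0 // Iteration number
  repeat

$$f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y((x_i + y_i) - a)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y((x_i + y_i) - z)\, f_X^{j}(z)\, dz}$$

    j := j+1
  until (stopping criterion met)

  Here x_i + y_i is the i-th perturbed value and f_Y is the noise density (notation of Agrawal and Srikant).

• Converges to the maximum likelihood estimate (Agrawal and Aggarwal, PODS 01)

[Figure: Number of People vs. Age (axis marks 20 and 60, counts from 0 to 1200) for the Original, Randomized, and Reconstructed distributions]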
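The iterative reconstruction can be sketched on a discrete grid as follows. This is a simplified illustration, not the authors' implementation: the integral is replaced by a sum over grid points, the stopping criterion is a fixed iteration count, and the data, grid and noise range are hypothetical.

```python
import random

def reconstruct(observed, noise_pdf, grid, iterations=50):
    """Iterative Bayesian estimate of the original distribution on a grid.

    observed  : perturbed values w_i (original value plus noise)
    noise_pdf : density of the additive noise
    grid      : equally spaced candidate original values a_1..a_m
    Returns one probability per grid point.
    """
    m = len(grid)
    fx = [1.0 / m] * m                                   # start from the uniform distribution
    # likelihood[i][k] = noise_pdf(w_i - a_k), computed once
    likelihood = [[noise_pdf(w - a) for a in grid] for w in observed]
    for _ in range(iterations):
        new_fx = [0.0] * m
        for row in likelihood:
            denom = sum(l * p for l, p in zip(row, fx))  # discretized integral
            if denom == 0.0:
                continue                                 # record not explained by the grid
            for k in range(m):
                new_fx[k] += row[k] * fx[k] / denom      # posterior mass of a_k given w_i
        total = sum(new_fx)
        fx = [v / total for v in new_fx]                 # average over contributing records
    return fx

rng = random.Random(1)
noise = lambda r: 1 / 40 if -20 <= r <= 20 else 0.0      # uniform noise on [-20, 20]
original = [rng.gauss(40, 5) for _ in range(400)]        # hypothetical original values
observed = [x + rng.uniform(-20, 20) for x in original]
grid = [float(a) for a in range(0, 81, 2)]
estimate = reconstruct(observed, noise, grid)
print(max(zip(estimate, grid)))                          # most probable grid point
```

Only the perturbed values and the noise density are used, which is exactly the information available to a data miner or an attacker.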

Page 9

Individual Reconstruction

• Spectral Filtering technique (Kargupta et al., ICDM 03)
  - Apply EVD to the covariance of the perturbed data U_p: $\Sigma_{U_p} = Q \Lambda Q^T$
  - Using the covariance of V, extract the first k principal components: λ1 ≥ λ2 ≥ ··· ≥ λk ≥ λe, where e1, e2, ···, ek are the corresponding eigenvectors
  - Q_k = [e1 e2 ··· ek] forms an orthonormal basis of a subspace X
  - Find the orthogonal projection onto X: $P = Q_k Q_k^T$
  - Estimate the data as $\hat{U} = U_p P$

• PCA technique (Huang, Du and Chen, SIGMOD 05)
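The projection step can be illustrated with a rank-one case (k = 1) in pure Python. This is a sketch under simplifying assumptions: the leading eigenvector is found by power iteration instead of a full eigenvalue decomposition, k is fixed by hand rather than derived from the noise covariance, and the data are synthetic.

```python
import random

def covariance(rows):
    """Return column means, centered rows, and the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return means, centered, cov

def leading_eigenvector(cov, steps=200):
    """Power iteration: unit eigenvector of the largest eigenvalue."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(steps):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def filter_rank_one(perturbed):
    """Project centered records onto e1, i.e. apply P = e1 e1^T, then add the means back."""
    means, centered, cov = covariance(perturbed)
    e1 = leading_eigenvector(cov)
    estimate = []
    for c in centered:
        score = sum(ci * ei for ci, ei in zip(c, e1))
        estimate.append([m + score * ei for m, ei in zip(means, e1)])
    return estimate

def frobenius_error(a, b):
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) ** 0.5

rng = random.Random(2)
latent = [rng.gauss(0, 10) for _ in range(500)]          # one hidden factor
original = [[50 + t, 80 + 2 * t, 20 - t] for t in latent]  # three correlated attributes
perturbed = [[x + rng.gauss(0, 3) for x in row] for row in original]
estimate = filter_rank_one(perturbed)
print(frobenius_error(perturbed, original), frobenius_error(estimate, original))
```

Because the original attributes are strongly correlated, most of the signal lies along one direction, and projecting onto it discards the noise in the remaining directions.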

Page 10

Motivation

• The goal of randomization-based perturbation
  - To hide the sensitive data by randomly modifying the data values with additive noise
  - To keep the aggregate characteristics or distribution unchanged or recoverable

• Do those aggregate characteristics or distributions contain confidential (private) information that snoopers could exploit to derive an individual's sensitive data?

Page 11

Our Scenario

• A single party (the data holder) holds a collection of original individual data

     Balance   …   Income   Interest Paid
1    10k       …   85k      2k
2    15k       …   70k      18k
3    50k       …   120k     35k
.    .         …   .        .
n    80k       …   110k     15k

• Each individual data value is associated with one privacy interval, set by
  - privacy policies
  - corporate agreements

• The data holder can use the data or release it to a third party for analysis, but must not disclose any individual value within its privacy interval

Page 12

Inter-Quantile Range (IQR)

• The Inter-Quantile Range $[x_{\alpha_1}, x_{\alpha_2}]$ is defined by $P(x_{\alpha_1} \le x \le x_{\alpha_2}) \ge c\%$, where $c = \alpha_2 - \alpha_1$ denotes the confidence.

• The IQR measures the spread and variability of the variable. Hence attackers can use it to estimate the range of each individual value.

• IQR we used: $[x_{(1-c)/2}, x_{(1+c)/2}]$

[Figure: distribution with the quantiles $x_{\alpha_1}$ and $x_{\alpha_2}$ marked at levels α1 and α2]
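As a small illustration, the symmetric range at confidence c can be estimated from a sample as below. The linear interpolation between order statistics is an assumption of this sketch; the talk does not say how its quantiles were computed.

```python
def inter_quantile_range(values, c):
    """Return the sample quantiles at levels (1 - c)/2 and (1 + c)/2."""
    data = sorted(values)

    def quantile(alpha):
        pos = alpha * (len(data) - 1)          # fractional index into the sorted data
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        return data[lo] + (pos - lo) * (data[hi] - data[lo])

    return quantile((1 - c) / 2), quantile((1 + c) / 2)

# Hypothetical sample: the integers 0..100, with a 95% range.
low, high = inter_quantile_range(range(101), 0.95)
print(low, high)
```

An attacker would apply the same computation to the perturbed or reconstructed distribution rather than to the original data.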

Page 13

Comparison with other Privacy definition

• Interval privacy (Agrawal and Srikant, SIGMOD 00)
  - If the original value can be estimated with c% confidence to lie in the interval [a, b], then the interval width (b − a) defines the amount of privacy at the c% confidence level

• Mutual information (Agrawal and Aggarwal, PODS 01)

• Reconstruction privacy (Rizvi and Haritsa, VLDB 02)

• ρ1-to-ρ2 privacy breach (Evfimievski et al., PODS 03)

Page 14

Disclosure Measure

• Individual's privacy interval: $[u_i^l, u_i^u]$

• Attacker's estimated range: $[x_{(1-c)/2}, x_{(1+c)/2}]$

• Measure similarity:

$$d_i = \frac{\left|[u_i^l, u_i^u] \cap [x_{(1-c)/2}, x_{(1+c)/2}]\right|}{\left|[u_i^l, u_i^u] \cup [x_{(1-c)/2}, x_{(1+c)/2}]\right|}, \qquad D = \frac{1}{n} \sum_{i=1}^{n} d_i$$

• A point is completely disclosed if its estimated range
  - contains the original value
  - falls fully within the pre-specified privacy interval
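A sketch of the disclosure measure, assuming the similarity d_i is the length of the intersection of the two intervals divided by the length of their union (the operators are not legible in this transcript, so this reading is an assumption). The function names and the sample intervals are hypothetical.

```python
def overlap_ratio(privacy, estimate):
    """d_i: intersection length over union length of two closed intervals."""
    (pl, pu), (el, eu) = privacy, estimate
    inter = max(0.0, min(pu, eu) - max(pl, el))
    union = (pu - pl) + (eu - el) - inter
    return inter / union if union > 0 else 0.0

def completely_disclosed(privacy, estimate, original):
    """True if the estimated range contains the original value and lies inside the privacy interval."""
    (pl, pu), (el, eu) = privacy, estimate
    return el <= original <= eu and pl <= el and eu <= pu

def disclosure(privacy_intervals, estimates):
    """D: average of d_i over all individuals."""
    ds = [overlap_ratio(p, e) for p, e in zip(privacy_intervals, estimates)]
    return sum(ds) / len(ds)

privacy_intervals = [(8.0, 12.0), (15.0, 25.0)]    # hypothetical privacy intervals
estimates = [(9.0, 11.0), (30.0, 40.0)]            # hypothetical attacker ranges
D = disclosure(privacy_intervals, estimates)
print(D)
```

In this toy input the first individual is completely disclosed (the range [9, 11] contains the value 10 and sits inside [8, 12]), while the second range misses its privacy interval entirely.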

Page 15

Empirical Evaluation

• Data sets:
  - Bank: 5 attributes (Home Equity, Stock/Bonds, Liabilities, Savings, CDs), 50,000 tuples
  - Signal: 35 correlated features (sinusoidal, square, triangle, normal distributions), 30,000 tuples

• Pre-specified individual privacy intervals: [u_i(1 − p), u_i(1 + p)], where p is varied

Page 16

IQR from Reconstructed Dist. Using AS with Uniform noise

• IQR direct inference (perturbed)
• IQR with AS inference (reconstructed)
• IQR ideal inference (original)

• Uniform noise: [-125, 125]
• Bank data set
• Attribute: Stock/Bonds
• 95% IQR
• Information loss for AS: 14.6%

[Figure: ratio of complete disclosure points]

Page 17

IQR from Reconstructed Dist. Using AS with Uniform noise

                 Disclosed points (%)                 D
Interval p (%)   direct   IQR ideal   IQR with AS     ideal   AS
35               13.9     21.2        3.5             0.605   0.663
40               16.0     32.5        15.1            0.660   0.698
45               17.9     43.0        29.6            0.712   0.746
50               19.8     52.9        41.8            0.763   0.796
55               22.0     62.9        53.2            0.814   0.844
60               23.9     72.9        63.4            0.864   0.889
65               26.0     83.3        73.5            0.916   0.932
70               28.0     94.3        83.7            0.972   0.977
75               29.9     99.9        94.5            0.999   0.999
80               32.0     100         100             1       1

$$d_i = \frac{\left|[u_i^l, u_i^u] \cap [x_{(1-c)/2}, x_{(1+c)/2}]\right|}{\left|[u_i^l, u_i^u] \cup [x_{(1-c)/2}, x_{(1+c)/2}]\right|}, \qquad D = \frac{1}{n} \sum_{i=1}^{n} d_i$$

Page 18

AS vs. SF with Gaussian Noise

• Gaussian noise N(0,8)
• Signal data set
• Feature 2 (sinusoidally distributed)
• 95% IQR
• Information loss for AS: 32.9%
• Information loss for SF: 47.0%

Page 19

Disclosure vs. noise

• Uniform noise with varied range

• Bank data set
• Attribute: Stock/Bonds
• 95% IQR

Page 20

Extend to Multivariate Cases

• In practice, the distribution of multiple numerical attributes is often modeled by one multivariate normal distribution, N(μ, Σ)

• The ellipsoid $\{z : (z - \mu)' \Sigma^{-1} (z - \mu) \le \chi_p^2(\alpha)\}$ contains a fixed percentage, (1 − α)100%, of data values.

• The projection of this ellipsoid on axis $z_i$ has bound:

$$\left[\mu_i - \sqrt{\chi_p^2(\alpha)\,\sigma_{ii}},\; \mu_i + \sqrt{\chi_p^2(\alpha)\,\sigma_{ii}}\right]$$

[Figure: ellipsoid in the (Z1, Z2) plane]
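The projection bound can be computed directly for the two-attribute case (p = 2), where the upper-α chi-square quantile has the closed form −2 ln α, so no statistics library is needed. The means and covariance matrix below are hypothetical.

```python
import math

def projection_bounds(mu, sigma, alpha):
    """Per-axis bounds of the (1 - alpha) ellipsoid for p = 2 attributes.

    With 2 degrees of freedom, the chi-square value exceeded with
    probability alpha is -2 * ln(alpha).
    """
    chi2 = -2.0 * math.log(alpha)
    return [(m - math.sqrt(chi2 * sigma[i][i]), m + math.sqrt(chi2 * sigma[i][i]))
            for i, m in enumerate(mu)]

mu = [50.0, 100.0]                          # hypothetical attribute means
sigma = [[25.0, 30.0], [30.0, 100.0]]       # hypothetical covariance matrix
bounds = projection_bounds(mu, sigma, alpha=0.05)
for low, high in bounds:
    print(f"[{low:.2f}, {high:.2f}]")
```

Only the diagonal entries of Σ enter the per-axis bound; the off-diagonal covariance shapes the ellipsoid but not the extent of its projection on each axis.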

Page 21

Related Work

• Rotation based approach: Y = RX
  - When R is an orthonormal matrix ($R R^T = I$), the following are preserved:
    Vector length: |Rx| = |x|
    Euclidean distance: |Rx − Ry| = |x − y|
    Inner product: <Rx, Ry> = <x, y>
  - Popular classifiers and clustering methods are invariant to this perturbation.

• K. Liu, H. Kargupta, et al. Random projection based multiplicative data perturbation for privacy preserving distributed data mining. TKDE 2006.

• K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. ICDM 2005.

Page 22

Is Y=RX Secure?

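     Bal   income   …   IntP
1    10k   85k      …   2k
2    15k   70k      …   18k
3    50k   120k     …   35k
4    45k   23k      …   134k
.    .     .        …   .
N    80k   110k     …   15k

Y = RX, where each column of X is one record (Bal, income, IntP) and $R R^T = R^T R = I$:

$$
\underbrace{\begin{pmatrix} 0.3333 & 0.6667 & 0.6667 \\ -0.6667 & 0.6667 & -0.3333 \\ -0.6667 & -0.3333 & 0.6667 \end{pmatrix}}_{R}
\underbrace{\begin{pmatrix} 10 & 15 & 50 & 45 & 80 \\ 85 & 70 & 120 & 23 & 110 \\ 2 & 18 & 35 & 134 & 15 \end{pmatrix}}_{X}
=
\underbrace{\begin{pmatrix} 61.33 & 63.67 & 120.00 & 119.67 & 110.00 \\ 49.33 & 30.67 & 35.00 & -59.33 & 15.00 \\ -33.67 & -21.33 & -50.00 & 51.67 & -80.00 \end{pmatrix}}_{Y}
$$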
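The invariance properties of an orthonormal R can be checked numerically. The sketch below assumes the entries 0.3333 and 0.6667 shown for R are roundings of 1/3 and 2/3, and uses the first two records (in thousands) as test vectors.

```python
import math

# Assumed exact form of the rotation matrix (1/3 and 2/3 instead of 0.3333 and 0.6667).
R = [[1 / 3, 2 / 3, 2 / 3],
     [-2 / 3, 2 / 3, -1 / 3],
     [-2 / 3, -1 / 3, 2 / 3]]

def rotate(v):
    """Return R v for a 3-dimensional record v."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

x = [10.0, 85.0, 2.0]     # record 1: balance, income, interest paid
y = [15.0, 70.0, 18.0]    # record 2
rx, ry = rotate(x), rotate(y)

diff = [p - q for p, q in zip(x, y)]
rdiff = [p - q for p, q in zip(rx, ry)]
print(norm(x), norm(rx))          # vector length is preserved
print(norm(diff), norm(rdiff))    # Euclidean distance is preserved
print(dot(x, y), dot(rx, ry))     # inner product is preserved
```

These preserved quantities are what keep distance-based classifiers and clustering methods working on Y, and they are also structure an attacker can try to exploit.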

Page 23

Our Preliminary Results

• Even Y = RX + E is NOT secure when some a-priori knowledge is available to attackers.

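Y = RX + E, where R can be any random matrix:

$$
\underbrace{\begin{pmatrix} 4.751 & 2.429 & 2.282 \\ 1.156 & 4.457 & 0.093 \\ 3.034 & 3.811 & 4.107 \end{pmatrix}}_{R}
\underbrace{\begin{pmatrix} 10 & 15 & 50 & 45 & 80 \\ 85 & 70 & 120 & 23 & 110 \\ 2 & 18 & 35 & 134 & 15 \end{pmatrix}}_{X}
+
\underbrace{\begin{pmatrix} 7.334 & 4.199 & 9.199 & 6.208 & 9.048 \\ 3.759 & 7.537 & 8.447 & 7.313 & 5.692 \\ 0.099 & 7.939 & 3.678 & 1.939 & 6.318 \end{pmatrix}}_{E}
=
\underbrace{\begin{pmatrix} 265.95 & 286.63 & 475.68 & 581.71 & 520.53 \\ 394.30 & 338.49 & 569.58 & 174.22 & 277.79 \\ 362.55 & 394.11 & 665.37 & 776.46 & 463.08 \end{pmatrix}}_{Y}
$$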

Page 24

A-priori Knowledge ICA Based Attack

• Privacy can be breached when a small subset of the original data X , is available to attackers

     Bal   income   …   IntP
1    10k   85k      …   2k
2    15k   70k      …   18k
3    50k   120k     …   35k
4    45k   23k      …   134k
.    .     .        …   .
N    80k   110k     …   15k

Page 25

Summary

• The reconstructed distribution can be exploited by attackers to derive sensitive individual information.

• We presented a simple IQR-based attack method.

• More complex and effective attack methods exist.
  - More research is needed on attack methods from the attacker's point of view.

Page 26

Acknowledgement

• NSF Grants: CCR-0310974, IIS-0546027

• Personnel: Xintao Wu, Songtao Guo, Ling Guo

• More info: http://www.cs.uncc.edu/~xwu/, [email protected]

Page 27

Questions?

Thank you!

Page 28

Information Loss

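• Distribution level

$$I(f_X, f'_X) = \frac{1}{2}\, E\left[\int_{\Omega_X} \left|f_X(x) - f'_X(x)\right| dx\right]$$

• Individual value level

$$ae(U, \hat{U}) = \|U - \hat{U}\|_F, \qquad re(U, \hat{U}) = \frac{\|U - \hat{U}\|_F}{\|U\|_F}$$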
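Both levels of information loss can be sketched in a few lines. The assumptions here: the individual-level measures are the Frobenius norm of the reconstruction error (absolute) and its ratio to the norm of the original data (relative), and the distribution-level measure is evaluated on two histograms over the same bins as half the sum of absolute differences, without the outer expectation. The function names and sample values are hypothetical.

```python
def frobenius(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return sum(v * v for row in matrix for v in row) ** 0.5

def absolute_error(u, u_hat):
    """ae: Frobenius norm of U minus its estimate."""
    diff = [[a - b for a, b in zip(ru, rh)] for ru, rh in zip(u, u_hat)]
    return frobenius(diff)

def relative_error(u, u_hat):
    """re: absolute error divided by the Frobenius norm of U."""
    return absolute_error(u, u_hat) / frobenius(u)

def distribution_loss(f, f_hat):
    """Half the L1 distance between two histograms over the same bins."""
    return 0.5 * sum(abs(p - q) for p, q in zip(f, f_hat))

U = [[3.0, 4.0], [0.0, 0.0]]            # hypothetical original data
U_hat = [[3.0, 1.0], [4.0, 0.0]]        # hypothetical reconstruction
print(absolute_error(U, U_hat), relative_error(U, U_hat))
print(distribution_loss([0.5, 0.5], [1.0, 0.0]))
```

The distribution-level value lies between 0 (identical histograms) and 1 (disjoint support), which is why it can be reported as a percentage.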

Page 29

National Laws

• US
  - HIPAA for health care
    Passed August 21, 1996
    Sets the lowest bar; the states are welcome to enact more stringent rules
  - California State Bill 1386
  - Gramm-Leach-Bliley Act of 1999 for financial institutions
  - COPPA for children's online privacy
  - etc.

• Canada
  - PIPEDA 2000 (Personal Information Protection and Electronic Documents Act)
    Effective from Jan 2004

• European Union (Directive 95/46/EC)
  - Passed by the European Parliament in Oct 95 and effective from Oct 98
  - Provides guidelines for member state legislation
  - Forbids sharing data with states that do not protect privacy

Page 30

ICA Direct Attack?

• Can we get X when only Y is available? It seems Independent Component Analysis can help.

Y = R X + E General Linear Perturbation Model

X = A S + N ICA Model

Page 31

ICA

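Linear mixing process (observed = mixing matrix × source):

$$\begin{pmatrix} x_1(t) \\ \vdots \\ x_n(t) \end{pmatrix} = \begin{pmatrix} A_{11} & \cdots & A_{1m} \\ \vdots & \ddots & \vdots \\ A_{n1} & \cdots & A_{nm} \end{pmatrix} \begin{pmatrix} s_1(t) \\ \vdots \\ s_m(t) \end{pmatrix}$$

Separation process (separated = demixing matrix × observed):

$$\begin{pmatrix} y_1(t) \\ \vdots \\ y_m(t) \end{pmatrix} = \begin{pmatrix} W_{11} & \cdots & W_{1n} \\ \vdots & \ddots & \vdots \\ W_{m1} & \cdots & W_{mn} \end{pmatrix} \begin{pmatrix} x_1(t) \\ \vdots \\ x_n(t) \end{pmatrix}$$

[Diagram: the separated signals are checked for independence through a cost function, which is optimized to update the demixing matrix]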

Page 32

Restriction of ICA

• Restrictions:
  - All the components s_i should be independent
  - They must be non-Gaussian, with the possible exception of one component
  - The number of observed linear mixtures m must be at least as large as the number of independent components n
  - The matrix A must be of full column rank

• Can we apply ICA directly? No
  - The attributes of X are correlated
  - More than one attribute may have a Gaussian distribution

Y = RX + E (general linear perturbation model)

X = AS + N (ICA model)

Page 33

A-priori Knowledge based ICA (AK-ICA) Attack

Page 34

Correctness of AK-ICA

• We prove that J exists such that $\hat{X} = A_{\tilde{x}} J S_y$ is an estimate of X
  - J represents the connection between the distributions of $S_{\tilde{x}}$ and $S_y$