PDM Workshop April 8, 2006 Deriving Private Information from Perturbed Data Using IQR-based Approach Songtao Guo UNC Charlotte Xintao Wu UNC Charlotte.

Post on 16-Dec-2015


PDM Workshop April 8, 2006

Deriving Private Information from Perturbed Data Using IQR-based Approach

Songtao Guo, UNC Charlotte
Xintao Wu, UNC Charlotte
Yingjiu Li, Singapore Management Univ


Source: http://www.privacyinternational.org/issues/foia/foia-laws.jpg


Source: http://www.privacyinternational.org/survey/dpmap.jpg

• HIPAA for health care
• California State Bill 1386
• Gramm-Leach-Bliley Act for financial institutions
• COPPA for children's online privacy, etc.
• PIPEDA 2000
• European Union (Directive 95/46/EC)


Mining vs. Privacy

• Data mining: the goal is to derive summary results (e.g., classification, clustering, association rules) from the data distribution.
• Individual privacy: individual values in the database must not be disclosed, or at least attackers must not be able to derive a close estimate of them.
• Privacy Preserving Data Mining (PPDM): how do we "perturb" data so that we can still build a good data mining model (data utility) while preserving each individual's privacy at the record level (privacy)?


Our Focus

     SSN  Name  Zip    Age  Sex  Balance  …  Income  Interest Paid
1    ***  ***   28223  20   M    10k      …  85k     2k
2    ***  ***   28223  30   F    15k      …  70k     18k
3    ***  ***   28262  20   M    50k      …  120k    35k
.    .    .     .      .    .    .        …  .       .
n    ***  ***   28223  20   M    80k      …  110k    15k

Annotations on the table: "Focus in this talk"; "k-anonymity, L-diversity, SDC etc."


Additive Noise based PPDM

• Distribution reconstruction
  AS method: Agrawal and Srikant, SIGMOD 00
  EM method: Agrawal and Aggarwal, PODS 01
• Individual value reconstruction
  Spectral Filtering (SF): Kargupta et al., ICDM 03
  PCA: Huang, Du and Chen, SIGMOD 05


Additive Randomization (Y = X +R )

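[Figure: original records (50 | 40K | ..., 30 | 70K | ...) pass through a Randomizer and become perturbed records (65 | 20K | ..., 25 | 60K | ...). The distributions of Age and Salary are reconstructed from the perturbed records and fed to a Classification Algorithm, which builds the Model. Example: a random number is added to Age, so Alice's age 30 becomes 65 (30+35).]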

• R.Agrawal and R.Srikant SIGMOD 00


Distribution Reconstruction

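• Algorithm

f_X^0 := Uniform distribution
j := 0 // Iteration number
repeat
    f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y((x_i + y_i) - a) \, f_X^j(a)}{\int_{-\infty}^{\infty} f_Y((x_i + y_i) - z) \, f_X^j(z) \, dz}
    j := j+1
until (stopping criterion met)

Here x_i + y_i is the i-th perturbed value and f_Y is the density of the added noise (the notation of Agrawal and Srikant).

• Converges to the maximum likelihood estimate (Agrawal and Aggarwal, PODS 01)

[Figure: Number of People versus Age (20 to 60), comparing the Original, Randomized, and Reconstructed distributions.]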
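The iterative update on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the domain of X is discretized to a grid (so the integral becomes a sum), the noise is assumed Gaussian with a known standard deviation, and the stopping rule, the function names, and the sample ages are all hypothetical.

```python
import math
import random


def gaussian_pdf(r, sigma):
    """Density of zero-mean Gaussian noise with standard deviation sigma."""
    return math.exp(-0.5 * (r / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def reconstruct_distribution(perturbed, grid, sigma, max_iter=100, tol=1e-6):
    """Estimate the probability of each grid value from perturbed data.

    Discretized form of the update: the integral over the domain of X
    is replaced by a sum over the grid points (an assumption of this sketch).
    """
    n, m = len(perturbed), len(grid)
    # Noise likelihoods f_Y(w_i - a_k); they stay fixed across iterations.
    like = [[gaussian_pdf(w - a, sigma) for a in grid] for w in perturbed]
    f = [1.0 / m] * m  # f_X^0 := uniform distribution
    for _ in range(max_iter):
        new_f = [0.0] * m
        for row in like:
            denom = sum(lik * p for lik, p in zip(row, f))
            if denom > 0.0:
                for k in range(m):
                    new_f[k] += row[k] * f[k] / denom
        new_f = [v / n for v in new_f]
        change = sum(abs(a - b) for a, b in zip(new_f, f))
        f = new_f
        if change < tol:  # stopping criterion chosen for this sketch
            break
    return f


random.seed(0)
# Hypothetical ages: 300 people aged 25-35 and 200 people aged 55-60.
ages = [random.choice([25, 30, 35]) for _ in range(300)]
ages += [random.choice([55, 60]) for _ in range(200)]
sigma = 5.0
perturbed = [x + random.gauss(0.0, sigma) for x in ages]
grid = list(range(20, 70, 5))  # candidate ages 20, 25, ..., 65
estimate = reconstruct_distribution(perturbed, grid, sigma)
young_mass = sum(p for a, p in zip(grid, estimate) if a <= 40)
print([round(p, 3) for p in estimate], round(young_mass, 3))
```

The estimate is a probability vector over the grid. In this toy setting most of its mass should return to the two age groups, even though no individual perturbed value reveals them directly.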


Individual Reconstruction

• Spectral Filtering Technique (Kargupta et al. ICDM03), where U_p = U + V is the perturbed data, U the original data, and V the noise
  Apply EVD: U_p^T U_p = Q \Lambda Q^T
  Using the covariance of V, extract the first k principal components:
  \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k \ge \lambda_e, and e_1, e_2, \ldots, e_k are the corresponding eigenvectors
  Q_k = [e_1 \; e_2 \; \cdots \; e_k] forms an orthonormal basis of a subspace X
  Find the orthogonal projection onto X: P = Q_k Q_k^T
  Estimate data as \hat{U} = U_p P
• PCA Technique, Huang, Du and Chen, SIGMOD 05
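The projection step can be illustrated with NumPy. This is a minimal sketch, not Kargupta et al.'s full method: it assumes the number of retained components k is already known (the original technique chooses it from the noise covariance), it centers the data before the eigendecomposition, and the function name and the synthetic rank-1 data set are hypothetical.

```python
import numpy as np


def spectral_filter(perturbed, k):
    """Project perturbed records (one per row) onto the top-k eigenvectors."""
    mean = perturbed.mean(axis=0)
    centered = perturbed - mean
    cov = centered.T @ centered / (len(perturbed) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    q_k = eigvecs[:, top]                    # Q_k: d x k orthonormal basis
    proj = q_k @ q_k.T                       # P = Q_k Q_k^T
    return centered @ proj + mean            # estimate of the original data


rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 1))
# Hypothetical, strongly correlated original data (rank 1, five attributes).
original = latent @ np.array([[3.0, 2.0, 1.0, 2.0, 3.0]])
noise = rng.normal(scale=1.0, size=original.shape)
perturbed = original + noise
estimate = spectral_filter(perturbed, k=1)
err_perturbed = np.linalg.norm(perturbed - original)
err_estimate = np.linalg.norm(estimate - original)
print(err_perturbed, err_estimate)
```

Because the original attributes are correlated while the noise is spread evenly over all directions, discarding the minor components removes most of the noise. This is why the filtered estimate lies closer to the original data than the perturbed data does.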


Motivation

• The goal of randomization-based perturbation:
  To hide the sensitive data by randomly modifying the data values with additive noise
  To keep the aggregate characteristics or distribution unchanged or recoverable
• Do those aggregate characteristics or distributions contain confidential information that snoopers may exploit to derive an individual's sensitive data (private information)?


Our Scenario

• A single party (data holder) holds a collection of original individual data

     Balance  …  Income  Interest Paid
1    10k      …  85k     2k
2    15k      …  70k     18k
3    50k      …  120k    35k
.    .        …  .       .
n    80k      …  110k    15k

• Each individual data value is associated with one privacy interval (privacy policies, corporate agreements)
• The data holder can use the data or release it to a third party for analysis; however, the holder is required not to disclose any individual data within its privacy interval


Inter-Quantile Range (IQR)

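• The Inter-Quantile Range [x_{\alpha_1}, x_{\alpha_2}] is defined by P(x_{\alpha_1} \le x \le x_{\alpha_2}) \ge c, where c = \alpha_2 - \alpha_1 denotes the confidence (e.g., c = 0.95 for a 95% IQR).
• The IQR measures the spread and variability of the variable. Hence attackers can use it to estimate the range of each individual value.
• IQR we used: [x_{(1-c)/2}, x_{(1+c)/2}]

[Figure: distribution with the quantiles x_{\alpha_1} and x_{\alpha_2} and the probabilities \alpha_1 and \alpha_2 marked.]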
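As a small illustration, the symmetric range [x_{(1-c)/2}, x_{(1+c)/2}] can be computed from a sample as below. The linear-interpolation quantile rule, the function names, and the sample values are assumptions of this sketch; the slides do not say which quantile estimator was used.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of already sorted data, 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac


def inter_quantile_range(values, c):
    """Return the symmetric range [x_{(1-c)/2}, x_{(1+c)/2}] at confidence c."""
    ordered = sorted(values)
    return quantile(ordered, (1.0 - c) / 2.0), quantile(ordered, (1.0 + c) / 2.0)


balances = list(range(0, 201))  # hypothetical attribute values 0, 1, ..., 200
low, high = inter_quantile_range(balances, 0.95)
print(low, high)  # approximately 5 and 195
```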


Comparison with Other Privacy Definitions

• Interval privacy (Agrawal and Srikant, SIGMOD00): if the original value can be estimated with c% confidence to lie in the interval [a, b], then the interval width (b-a) defines the amount of privacy at the c% confidence level
• Mutual information (Agrawal and Aggarwal, PODS01)
• Reconstruction privacy (Rizvi & Haritsa, VLDB02)
• ρ1-to-ρ2 privacy breach (Evfimievski et al. PODS03)


Disclosure Measure

d_i = \frac{\left| [u_i^l, u_i^u] \cap [x_{(1-c)/2}, x_{(1+c)/2}] \right|}{\left| [u_i^l, u_i^u] \cup [x_{(1-c)/2}, x_{(1+c)/2}] \right|}

D = \frac{1}{n} \sum_{i=1}^{n} d_i

[u_i^l, u_i^u]: individual's privacy interval
[x_{(1-c)/2}, x_{(1+c)/2}]: attacker's estimated range
d_i measures the similarity of the two intervals.

A point is completely disclosed if its estimated range
• contains the original value
• fully falls within the pre-specified privacy interval
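A hedged sketch of these measures follows. The set operators in the slide's formula did not survive extraction, so this code assumes d_i is the length of the intersection of the two intervals divided by the length of their union. That reading fits the "Measure Similarity" label but is an assumption. The function names and the two example records are hypothetical.

```python
def overlap_length(a, b):
    """Length of the intersection of two closed intervals (low, high)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def disclosure(privacy, estimate):
    """d_i, ASSUMED to be intersection length over union length."""
    inter = overlap_length(privacy, estimate)
    union = (privacy[1] - privacy[0]) + (estimate[1] - estimate[0]) - inter
    return inter / union if union > 0 else 0.0


def completely_disclosed(value, privacy, estimate):
    """True if the estimate contains the value and lies inside the privacy interval."""
    contains_value = estimate[0] <= value <= estimate[1]
    inside_privacy = privacy[0] <= estimate[0] and estimate[1] <= privacy[1]
    return contains_value and inside_privacy


# Hypothetical records: (original value, privacy interval, attacker's range).
records = [
    (100.0, (80.0, 120.0), (90.0, 110.0)),
    (50.0, (40.0, 60.0), (55.0, 75.0)),
]
scores = [disclosure(p, e) for _, p, e in records]
D = sum(scores) / len(scores)  # D = (1/n) * sum of d_i
flags = [completely_disclosed(x, p, e) for x, p, e in records]
print(scores, D, flags)
```

In the first record the estimated range sits inside the privacy interval and contains the true value, so the point counts as completely disclosed. In the second the range misses the true value, so it does not.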


Empirical Evaluation

• Data sets:
  Bank: 5 attributes (Home Equity, Stock/Bonds, Liabilities, Savings, CDs), 50,000 tuples
  Signal: 35 correlated features (sinusoidal, square, triangle, normal distributions), 30,000 tuples
• Pre-specified individual's privacy intervals: [u_i(1-p), u_i(1+p)], where p is varied


IQR from Reconstructed Dist. Using AS with Uniform noise

• IQR direct inference: from the perturbed data
• IQR with AS inference: from the reconstructed distribution
• IQR ideal inference: from the original data

• Uniform noise: [-125,125]
• Bank data set
• Attribute: Stock/Bonds
• 95% IQR
• Information loss for AS: 14.6%

[Figure: ratio of complete disclosure points]


IQR from Reconstructed Dist. Using AS with Uniform noise

Interval p (%)   Disclosed points (%)               D
                 direct   IQR ideal   IQR with AS   ideal   AS
35               13.9     21.2        3.5           0.605   0.663
40               16.0     32.5        15.1          0.660   0.698
45               17.9     43.0        29.6          0.712   0.746
50               19.8     52.9        41.8          0.763   0.796
55               22.0     62.9        53.2          0.814   0.844
60               23.9     72.9        63.4          0.864   0.889
65               26.0     83.3        73.5          0.916   0.932
70               28.0     94.3        83.7          0.972   0.977
75               29.9     99.9        94.5          0.999   0.999
80               32.0     100         100           1       1

D is the average of the per-record disclosure measures d_i, as defined under "Disclosure Measure".


AS vs. SF with Gaussian Noise

• Gaussian noise N(0,8)
• Signal data set
• Feature 2 (sinusoidally distributed)
• 95% IQR
• Information loss for AS: 32.9%
• Information loss for SF: 47.0%


Disclosure vs. noise

• Uniform noise with varied range
• Bank data set
• Attribute: Stock/Bonds
• 95% IQR


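Extend to Multivariate Cases

• In practice, the distribution of multiple numerical attributes is often modeled by one multivariate normal distribution, N(\mu, \Sigma)
• The ellipsoid \{ z : (z - \mu)' \Sigma^{-1} (z - \mu) \le \chi_p^2(\alpha) \} contains a fixed percentage, (1 - \alpha) 100%, of data values.
• The projection of this ellipsoid on axis z_i has bound:

\left[ \mu_i - \sqrt{\chi_p^2(\alpha) \, \sigma_{ii}}, \; \mu_i + \sqrt{\chi_p^2(\alpha) \, \sigma_{ii}} \right]

[Figure: ellipsoid in the (Z1, Z2) plane and its projections on the two axes.]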
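The per-axis bound can be worked through numerically for the bivariate case. The sketch below is restricted to p = 2 because the upper-α chi-square quantile then has the closed form -2 ln(α), which avoids any statistics library. The mean vector, covariance matrix, and function name are hypothetical.

```python
import math


def projection_bounds_2d(mu, sigma, alpha):
    """Bounds of the (1 - alpha) ellipse of N(mu, sigma) projected on each axis."""
    # For p = 2 the upper-alpha chi-square quantile equals -2 ln(alpha).
    chi2 = -2.0 * math.log(alpha)
    bounds = []
    for i, mean in enumerate(mu):
        half_width = math.sqrt(chi2 * sigma[i][i])
        bounds.append((mean - half_width, mean + half_width))
    return bounds


# Hypothetical mean vector and covariance matrix for two attributes.
mu = [50.0, 100.0]
sigma = [[100.0, 60.0], [60.0, 400.0]]
bounds = projection_bounds_2d(mu, sigma, alpha=0.05)
print(bounds)
```

Only the diagonal entries σ_ii enter the projection bound; the off-diagonal covariance shapes the ellipse but not its shadow on each axis.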


Related Work

• Rotation based approach: Y = RX
  When R is an orthonormal matrix (R R^T = I):
    Vector length: |Rx| = |x|
    Euclidean distance: |Rx - Ry| = |x - y|
    Inner product: <Rx, Ry> = <x, y>
  Popular classifiers and clustering methods are invariant to this perturbation.

K. Liu, H. Kargupta, et al. Random projection based multiplicative data perturbation for privacy preserving distributed data mining. TKDE 2006.

K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. ICDM 2005.
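The three invariance properties can be checked numerically. The sketch below uses an orthonormal matrix with entries ±1/3 and ±2/3 (the exact-fraction form of the rotation shown on the "Is Y=RX Secure?" slide) and two records taken from the example table. The helper function names are hypothetical.

```python
import math

# Orthonormal matrix: rows are mutually orthogonal unit vectors.
R = [
    [1 / 3, 2 / 3, 2 / 3],
    [-2 / 3, 2 / 3, -1 / 3],
    [-2 / 3, -1 / 3, 2 / 3],
]


def rotate(matrix, vec):
    """Matrix-vector product R x."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def diff(a, b):
    return [p - q for p, q in zip(a, b)]


x = [10.0, 85.0, 2.0]   # record 1: Bal, income, IntP (in thousands)
y = [15.0, 70.0, 18.0]  # record 2
rx, ry = rotate(R, x), rotate(R, y)

print(norm(x), norm(rx))                      # vector length preserved
print(norm(diff(x, y)), norm(diff(rx, ry)))   # Euclidean distance preserved
print(dot(x, y), dot(rx, ry))                 # inner product preserved
```

Because distances and inner products are unchanged, distance-based or inner-product-based learners behave the same on RX as on X.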


Is Y=RX Secure?

Y = R X, where R R^T = R^T R = I

R =
   0.3333   0.6667   0.6667
  -0.6667   0.6667  -0.3333
  -0.6667  -0.3333   0.6667

X =
   10   15   50   45   80
   85   70  120   23  110
    2   18   35  134   15

Y =
   61.33   63.67  110.00  119.67   63.33
   49.33   30.67   55.00  -59.33  -31.67
  -33.67  -21.33  -30.00   51.67  -51.67

Each column of X is one record of the table below (values in thousands).

     Bal   income  …  IntP
1    10k   85k     …  2k
2    15k   70k     …  18k
3    50k   120k    …  35k
4    45k   23k     …  134k
.    .     .       …  .
N    80k   110k    …  15k


Our Preliminary Results

• Even Y = RX + E is NOT secure when some a-priori knowledge is available to attackers.

Y = R X + E, where R can be any random matrix

R =
  4.751  2.429  2.282
  1.156  4.457  0.093
  3.034  3.811  4.107

X =
   10   15   50   45   80
   85   70  120   23  110
    2   18   35  134   15

E =
  7.334  4.199  9.199  6.208  9.048
  3.759  7.537  8.447  7.313  5.692
  0.099  7.939  3.678  1.939  6.318

Y =
  265.95  286.63  475.68  581.71  520.53
  394.30  338.49  569.58  174.22  277.79
  362.55  394.11  665.37  776.46  463.08


A-priori Knowledge ICA Based Attack

• Privacy can be breached when a small subset of the original data X is available to attackers

     Bal   income  …  IntP
1    10k   85k     …  2k
2    15k   70k     …  18k
3    50k   120k    …  35k
4    45k   23k     …  134k
.    .     .       …  .
N    80k   110k    …  15k


Summary

• The reconstructed distribution can be exploited by attackers to derive sensitive individual information.
• We present a simple IQR-based attack method.
• More complex and effective attack methods exist. More research is needed on attack methods from the attacker's point of view.


Acknowledgement

• NSF Grants: CCR-0310974, IIS-0546027
• Personnel: Xintao Wu, Songtao Guo, Ling Guo
• More info: http://www.cs.uncc.edu/~xwu/, xwu@uncc.edu


Questions?

Thank you!


Information Loss

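• Distribution level

I(f_X, f'_X) = \frac{1}{2} E\left[ \int_{\Omega_X} \left| f_X(x) - f'_X(x) \right| dx \right]

• Individual value level

ae(U, \hat{U}) = \| U - \hat{U} \|_F

re(U, \hat{U}) = \frac{\| U - \hat{U} \|_F}{\| U \|_F}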
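These measures are straightforward to compute, as the following sketch shows. It assumes the densities are given as histograms with a common bin width, and it evaluates the distribution-level loss for a single reconstruction instead of taking the expectation over many randomizations. All names and numbers are hypothetical.

```python
import math


def absolute_error(u, u_hat):
    """Frobenius norm of U - U_hat, with matrices given as lists of rows."""
    return math.sqrt(sum((a - b) ** 2
                         for row_u, row_h in zip(u, u_hat)
                         for a, b in zip(row_u, row_h)))


def relative_error(u, u_hat):
    """Frobenius norm of U - U_hat divided by the Frobenius norm of U."""
    norm_u = math.sqrt(sum(a * a for row in u for a in row))
    return absolute_error(u, u_hat) / norm_u


def distribution_loss(f, f_hat, bin_width):
    """Half the integrated absolute difference of two histogram densities."""
    return 0.5 * sum(abs(p - q) for p, q in zip(f, f_hat)) * bin_width


U = [[1.0, 2.0], [3.0, 4.0]]
U_hat = [[1.0, 2.0], [3.0, 6.0]]
ae = absolute_error(U, U_hat)
re = relative_error(U, U_hat)

f = [0.05, 0.05]        # original density over two bins of width 10
f_hat = [0.025, 0.075]  # reconstructed density over the same bins
loss = distribution_loss(f, f_hat, bin_width=10.0)
print(ae, re, loss)
```

The distribution-level loss lies between 0 (identical densities) and 1 (densities with disjoint support), which is how the percentages quoted in the experiments can be read.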


National Laws

• US
  HIPAA for health care: passed August 21, 1996; sets the lowest bar, and the states are welcome to enact more stringent rules
  California State Bill 1386
  Gramm-Leach-Bliley Act of 1999 for financial institutions
  COPPA for children's online privacy, etc.
• Canada
  PIPEDA 2000: Personal Information Protection and Electronic Documents Act, effective from Jan 2004
• European Union (Directive 95/46/EC)
  Passed by the European Parliament in Oct 95 and effective from Oct 98
  Provides guidelines for member state legislation
  Forbids sharing data with states that do not protect privacy


ICA Direct Attack?

• Can we get X when only Y is available? It seems Independent Component Analysis can help.

General Linear Perturbation Model: Y = R X + E
ICA Model: X = A S + N


ICA

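Linear Mixing Process

\begin{pmatrix} x_1(t) \\ \vdots \\ x_n(t) \end{pmatrix} = \begin{pmatrix} A_{11} & \cdots & A_{1m} \\ \vdots & \ddots & \vdots \\ A_{n1} & \cdots & A_{nm} \end{pmatrix} \begin{pmatrix} s_1(t) \\ \vdots \\ s_m(t) \end{pmatrix}

(Observed = Mixing Matrix × Source)

Separation Process

\begin{pmatrix} y_1(t) \\ \vdots \\ y_m(t) \end{pmatrix} = \begin{pmatrix} W_{11} & \cdots & W_{1n} \\ \vdots & \ddots & \vdots \\ W_{m1} & \cdots & W_{mn} \end{pmatrix} \begin{pmatrix} x_1(t) \\ \vdots \\ x_n(t) \end{pmatrix}

(Separated = Demixing Matrix × Observed)

Independent? The demixing matrix is optimized under a cost function that measures the independence of the separated components.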


Restriction of ICA

• Restrictions:
  All the components s_i should be independent
  They must be non-Gaussian, with the possible exception of one component
  The number of observed linear mixtures m must be at least as large as the number of independent components n
  The matrix A must be of full column rank
• Can we apply ICA directly? No
  Correlations exist among the attributes of X
  More than one attribute may have a Gaussian distribution

Y = RX + E
X = AS + N


A-priori Knowledge based ICA (AK-ICA) Attack


Correctness of AK-ICA

• We prove that J exists such that

\hat{X} = A_{\tilde{x}} J S_y

J represents the connection between the distributions of S_{\tilde{X}} and S_Y
