Privacy-Preserving Data Mining - Yale University

Privacy-Preserving Data Mining

Rebecca Wright
Computer Science Department
Stevens Institute of Technology

www.cs.stevens.edu/~rwright

PORTIA Site Visit
12 May 2005

The Data Revolution

• The current data revolution is fueled by the perceived, actual, and potential usefulness of the data.

• Most electronic and physical activities leave some kind of data trail. These trails can provide useful information to various parties.

• However, there are also concerns about appropriate handling and use of sensitive information.

• Privacy-preserving methods of data handling seek to provide sufficient privacy as well as sufficient utility.

Advantages of Privacy Protection

• protection of personal information

• protection of proprietary or sensitive information

• enables collaboration between different data owners (since they may be more willing or able to collaborate if they need not reveal their information)

• compliance with legislative policies

Overview

• Introduction

• Primitives

• Higher-level protocols

– Distributed data mining

– Publishable data

– Coping with massiveness

– Beyond privacy-preserving data mining

• Implementation and experimentation

• Lessons learned, conclusions

Models for Distributed Data Mining, I

• Horizontally Partitioned • Vertically Partitioned

[Figure: a horizontally partitioned database split by rows among P1, P2, and P3, and a vertically partitioned database split by columns between P1 and P2.]
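The two partitioning models can be illustrated with a toy table. This is a minimal sketch, not taken from the slides: the records, attribute names, and the helper functions `horizontal_partition` and `vertical_partition` are all hypothetical, and the shared `id` key in the vertical case is an assumption about how parties would line up their records.

```python
# Toy table: each record maps attribute -> value (made-up data).
records = [
    {"id": 1, "age": 34, "zip": "07030", "diagnosis": "flu"},
    {"id": 2, "age": 51, "zip": "06520", "diagnosis": "asthma"},
    {"id": 3, "age": 29, "zip": "07030", "diagnosis": "flu"},
    {"id": 4, "age": 45, "zip": "10027", "diagnosis": "none"},
]


def horizontal_partition(rows, n_parties):
    """Split by rows: party p holds every n_parties-th record with all attributes."""
    return [rows[p::n_parties] for p in range(n_parties)]


def vertical_partition(rows, attribute_groups, key="id"):
    """Split by columns: each party holds all records, restricted to its own
    attributes plus a shared key (assumed) used to align records."""
    return [
        [{a: r[a] for a in (key, *group)} for r in rows]
        for group in attribute_groups
    ]


h_parts = horizontal_partition(records, 3)  # P1, P2, P3
v_parts = vertical_partition(records, [("age", "zip"), ("diagnosis",)])  # P1, P2
```

In the horizontal case every party sees complete records for a subset of individuals; in the vertical case every party sees all individuals but only some attributes.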

Models for Distributed Data Mining, II

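• Fully Distributed • Client/Server(s)

[Figure: in the fully distributed model, parties P1, P2, P3, ..., Pn-1, Pn each hold part of the data. In the client/server(s) model, a CLIENT wishes to compute on the servers’ data, and each of the SERVER(S) holds a database.]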

Cryptography vs. Randomization

[Figure: trade-off among inefficiency, privacy loss, and inaccuracy, with the randomization approach and the cryptographic approach placed relative to these three costs.]
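As one concrete instance of the randomization side of this trade-off, the sketch below uses classic randomized response. It is an illustration chosen here, not a method named on the slide; the function names, the truth probability 0.75, and the synthetic population are assumptions. Each respondent's report is noisy (some privacy), and the aggregate estimate is only approximately correct (some inaccuracy).

```python
import random


def randomize(bit, p, rng):
    """Report the true bit with probability p, otherwise the flipped bit."""
    return bit if rng.random() < p else 1 - bit


def estimate_fraction(reports, p):
    """Estimate the true fraction f of 1s from noisy reports.

    E[report] = p*f + (1-p)*(1-f) = (1-p) + f*(2p-1), solved for f.
    Requires p != 0.5.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)


rng = random.Random(0)                      # fixed seed for repeatability
true_bits = [1] * 3000 + [0] * 7000         # synthetic population, f = 0.3
reports = [randomize(b, 0.75, rng) for b in true_bits]
estimate = estimate_fraction(reports, 0.75)
```

Moving `p` toward 0.5 hides individual answers better but inflates the variance of the estimate, which is the privacy versus accuracy tension the figure depicts.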

Secure Multiparty Computation

• Allows n players to privately compute a function f of their inputs.

• Overhead is polynomial in size of inputs and complexity of f [Yao86, GMW87, BGW88, CCD88, ...]

• In theory, can solve any private distributed data mining problem. In practice, not efficient for large data.

[Figure: players P1, P2, ..., Pn.]
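For a very simple f, the sum of the players' inputs, the idea can be sketched with additive secret sharing. This is an illustration only, not one of the general constructions cited above; it assumes honest-but-curious players, private pairwise channels, and a public modulus larger than any possible sum. The names `share`, `secure_sum`, and `MODULUS` are hypothetical.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus, larger than any possible sum


def share(value, n_players):
    """Split value into n additive shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_players - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(inputs):
    """Simulate all players in one process; returns the sum of the inputs."""
    n = len(inputs)
    # Player i deals share j of its input to player j.
    dealt = [share(x, n) for x in inputs]
    # Player j adds the shares it received and publishes only that partial sum.
    partial = [sum(dealt[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # Anyone can combine the published partial sums into the total.
    return sum(partial) % MODULUS


total = secure_sum([12, 7, 30])
```

Each published partial sum is uniformly distributed on its own, so under the stated assumptions a player learns the total but not another player's individual input.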

Primitives for PPDM

• Common tools include secret sharing, homomorphic encryption, secure scalar product, secure set intersection, secure sums, and other statistics.

• PORTIA work:

– [BGN05]: homomorphic encryption of 2-DNF formulas (arbitrary additions, one multiplication), based on bilinear maps. (P)

– [AMP04]: Medians, kth ranked element. (P)

– [FNP04]: set intersection and cardinality.
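Of the common tools listed above, secret sharing is the easiest to show in a few lines. The following is a minimal sketch of Shamir's threshold scheme over a prime field; the field size and the function names `make_shares` and `reconstruct` are illustrative choices, and the slides do not say which secret-sharing scheme the PORTIA protocols use.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the field size is an illustrative choice


def make_shares(secret, threshold, n_shares):
    """Evaluate a random polynomial of degree threshold-1 whose constant
    term is the secret, at the points x = 1..n_shares."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def poly(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, poly(x)) for x in range(1, n_shares + 1)]


def reconstruct(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


shares = make_shares(1234, threshold=3, n_shares=5)
```

Any `threshold` shares recover the secret, for example `reconstruct(shares[:3])`, while fewer shares are consistent with every possible secret.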

Higher-Level Protocols

• [LP00]: private protocol for ln x and x ln x

• Various protocols to search remote, encrypted, or access controlled data (e.g. for keywords, items in common): [BBA03(P), Goh03, FNP04, BCOP04, ABG+05(P), EFH#]

• [YZW05]: frequency mining protocol. (P)

Data Mining Models

• [WY04,YW05]: privacy-preserving construction of Bayesian networks from vertically partitioned data.

• [YZW05]: classification from frequency mining in fully distributed model (naïve Bayes classification, decision trees, and association rule mining). (P)

• [JW#]: privacy-preserving k-means clustering for arbitrarily partitioned data. (In vertically partitioned case, similar to two-party [VC03].)

• [AST05]: privacy-preserving computation of multidimensional aggregates on vertically or horizontally partitioned data using randomization.

Privacy-Preserving Bayes Networks [WY04,YW05]

Goal: Cooperatively learn Bayesian network structure on the combination of DBA and DBB, ideally without either party learning anything except the Bayesian network itself.

[Figure: Alice holds DBA; Bob holds DBB.]

K2 Algorithm for BN Learning

• Determining the best BN structure for a given data set is NP-hard, so heuristics are used in practice.

• The K2 algorithm [CH92] is a widely used BN structure-learning algorithm, which we use as the starting point for our solution.

• Considers nodes in sequence. Adds new parent that most increases a score function f, up to a maximum number of parents per node.

$$ f(i, \pi(i)) = \prod \frac{\alpha_0!\,\alpha_1!}{(\alpha_0 + \alpha_1 + 1)!} $$
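The score can be evaluated in the clear (with no privacy protection) as follows. This sketch assumes binary variables, that α0 and α1 count the records in which node i takes value 0 or 1 under one configuration of its parents, and that the product runs over those configurations; the slide does not spell out the index, so that reading is an assumption. The function name `log_k2_score` and the toy data are hypothetical. The logarithm is used to avoid overflow in the factorials.

```python
from collections import Counter
from math import lgamma, log


def log_k2_score(data, i, parents):
    """ln f(i, parents) for binary data: the sum over observed parent
    configurations of ln(a0! * a1! / (a0 + a1 + 1)!)."""
    counts = Counter((tuple(row[p] for p in parents), row[i]) for row in data)
    configs = {cfg for cfg, _ in counts}
    total = 0.0
    for cfg in configs:
        a0, a1 = counts[(cfg, 0)], counts[(cfg, 1)]
        # lgamma(n + 1) equals ln(n!)
        total += lgamma(a0 + 1) + lgamma(a1 + 1) - lgamma(a0 + a1 + 2)
    return total


# Toy data over three binary variables: column 2 copies column 0,
# and column 1 is unrelated to both.
data = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)] * 5

score_no_parent = log_k2_score(data, 2, [])
score_parent_0 = log_k2_score(data, 2, [0])
score_parent_1 = log_k2_score(data, 2, [1])
```

On this toy data the informative parent (column 0) scores highest and the uninformative parent (column 1) scores lowest, which is the comparison a greedy K2 step would use to pick the next parent.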

Our Solution: Approximate Score

Modified score function: approximates the same relative ordering, and lends itself well to private computation.

• Apply natural log to f and use Stirling’s approximation

• Drop constant factor and bounded term. Result is:

$$ g(i, \pi(i)) = \sum \left( \tfrac{1}{2}\,(\ln\alpha_0 + \ln\alpha_1 - \ln t) + (\alpha_0 \ln\alpha_0 + \alpha_1 \ln\alpha_1 - t \ln t) \right) $$

where t = α0 + α1 + 1
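A quick numerical check, computed in the clear, shows why g preserves the relative ordering of ln f. Applying Stirling's approximation ln n! ≈ n ln n − n + ½ ln(2πn) to each factorial leaves, per term, the constant 1 + ½ ln(2π) plus a small error, so the two scores differ by nearly the same amount for every candidate. The sketch assumes strictly positive counts (ln 0 is undefined); the function names and the candidate counts are hypothetical.

```python
from math import lgamma, log, pi


def log_f_term(a0, a1):
    """Exact ln(a0! * a1! / (a0 + a1 + 1)!) for one term of the product."""
    return lgamma(a0 + 1) + lgamma(a1 + 1) - lgamma(a0 + a1 + 2)


def g_term(a0, a1):
    """The corresponding term of the approximate score g (needs a0, a1 > 0)."""
    t = a0 + a1 + 1
    return (0.5 * (log(a0) + log(a1) - log(t))
            + (a0 * log(a0) + a1 * log(a1) - t * log(t)))


DROPPED_CONSTANT = 1 + 0.5 * log(2 * pi)  # what Stirling leaves behind per term

candidates = [(40, 10), (25, 25), (45, 5)]  # hypothetical (a0, a1) counts
rank_exact = sorted(candidates, key=lambda c: log_f_term(*c))
rank_approx = sorted(candidates, key=lambda c: g_term(*c))
gaps = [log_f_term(*c) - g_term(*c) for c in candidates]
```

Here `rank_exact` and `rank_approx` agree, and every entry of `gaps` is close to `DROPPED_CONSTANT`. The benefit for private computation is that g needs only ln x and x ln x evaluations rather than factorials.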

Our Solution: Components

Sub-protocols used:

• Privacy-preserving scalar product protocol: based on homomorphic encryption

• Privacy-preserving computation of α-parameters: uses scalar product

• Privacy-preserving score computation: uses α-parameters, [LP00] protocols for ln x and x ln x

• Privacy-preserving score comparison: uses [Yao86]

All intermediate values (scores and parameters) are protected using secret sharing. [YW05] improves on [MSK04] for parameter computation.
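To make the first component concrete, here is a minimal sketch of a scalar product protocol built on an additively homomorphic cryptosystem, with the result left as two additive shares. It is a generic textbook-style construction using a toy Paillier implementation, not necessarily the protocol of [WY04,YW05]; the key size is far too small for real use, both parties are simulated in one process, and every name in the code is hypothetical.

```python
import secrets
from math import gcd

# Toy Paillier key (Alice's). Two Mersenne primes: illustrative only, insecure.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // gcd(P - 1, Q - 1)  # lcm(P-1, Q-1)
MU = pow(LAM, -1, N)                          # modular inverse (Python 3.8+)


def encrypt(m):
    """Enc(m) = (1 + m*N) * r^N mod N^2, using generator g = N + 1."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if gcd(r, N) == 1:
            break
    return ((1 + (m % N) * N) * pow(r, N, N2)) % N2


def decrypt(c):
    """Dec(c) = L(c^LAM mod N^2) * MU mod N, where L(u) = (u - 1) // N."""
    return ((pow(c, LAM, N2) - 1) // N) * MU % N


def scalar_product_shares(x, y):
    """Alice holds x, Bob holds y (non-negative integers).
    Returns (share_alice, share_bob) with share_alice + share_bob = x.y mod N."""
    # Alice -> Bob: component-wise encryptions of x under Alice's key.
    enc_x = [encrypt(xi) for xi in x]
    # Bob: homomorphically computes Enc(x.y - r) for a random mask r.
    r = secrets.randbelow(N)
    acc = encrypt(-r)
    for c, yi in zip(enc_x, y):
        acc = (acc * pow(c, yi, N2)) % N2  # Enc(a)^k = Enc(k*a); product adds
    # Bob -> Alice: acc. Alice decrypts her share; Bob keeps r as his.
    return decrypt(acc), r


# Indicator vectors: 1 where a record matches each party's attribute pattern.
share_a, share_b = scalar_product_shares([1, 0, 1, 1], [1, 1, 0, 1])
```

With indicator vectors as inputs, `(share_a + share_b) % N` is the number of records matching both patterns, which is the kind of count an α-parameter represents; neither share alone reveals it.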


Publishable Data

• Goal: Modify data before publishing so that results have good privacy and good utility.

– Some situations favor one more than the other.

– May prevent some things from being learned at all.

• [DN04]: Extends privacy definitions of [EGS03,DN03] relating a priori and a posteriori knowledge, and provides solutions in a moderated publishing model.

• [CDMSW04]: provide quantifiable definitions of privacy and utility. One’s privacy is guaranteed to the extent that one does not stand out from others.

Publishable Data: k-Anonymity

• Modify database before publishing so that the quasi-identifier of every record in the database is identical to that of at least k – 1 other records [Swe02, MW04].

• [AFK+05]: optimal k-anonymization is NP-hard even if the data values are ternary. Presents efficient approximation algorithms for k-anonymization. (P)

• [ZYW05]: in two formulations, presents solutions for a data publisher to learn a k-anonymized version of a fully distributed database without learning anything else. (P)
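The k-anonymity condition itself is easy to check on a centralized table. The sketch below is a plain, non-private check with a simple suppression step; it is not the approximation algorithm of [AFK+05] or the distributed protocol of [ZYW05]. The table, attribute names, and helper functions are hypothetical.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifier, k):
    """True if every combination of quasi-identifier values that appears
    in the table appears in at least k records."""
    groups = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return all(count >= k for count in groups.values())


def suppress(rows, attribute, keep):
    """Return a copy in which `attribute` keeps its first `keep` characters
    and the remaining characters are replaced by '*'."""
    return [
        {**r, attribute: r[attribute][:keep] + "*" * (len(r[attribute]) - keep)}
        for r in rows
    ]


raw = [
    {"zip": "07030", "age": "30-39", "diagnosis": "flu"},
    {"zip": "07031", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "06520", "age": "40-49", "diagnosis": "flu"},
    {"zip": "06521", "age": "40-49", "diagnosis": "none"},
]
qi = ("zip", "age")
published = suppress(raw, "zip", keep=4)  # "07030" becomes "0703*"
```

The raw table is not 2-anonymous on (zip, age) because every zip code is unique; after suppressing the last digit, each quasi-identifier value is shared by two records, so the published table is 2-anonymous but not 3-anonymous.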

Coping with Massiveness

• Data mining on massive data sets is an important field in its own right.

• It is also privacy-relevant, because:

– Massive data sets are likely to be distributed and multiply owned.

– Efficiency improvements are needed in order to have any hope of adding overhead for privacy.

• [FKMSZ05]: Stream algorithms for massive graphs (P)

• [DKM04]: Approximate massive-matrix computations (P)

Beyond Privacy-Preserving Data Mining

• [JW#]: Extends private inference control of [WS04] to work with more complex query functions. Client learns query result if and only if inference rule is met (and learns nothing else).

• [KMN05]: Simulatable auditing to ensure that query denials do not leak information. (P)

• [ABG+04]: P4P: Paranoid Platform for Privacy Preferences. Mechanism for ensuring released data is usable only for allowed tasks. (P)

Enforce policies about what kind of queries or computations on data are allowed.


Implementation and Experimentation

• secure scalar product protocol [SWY04]

• MySQL private information retrieval (PIR) [BBFS#]

• Fairplay: a system implementing Yao’s two-party secure function evaluation [MNPS04]

• Bayesian network implementation [KRFW#] (D)

• secure computation of surveys using Fairplay and use for Taulbee survey [FPRS04] (P,D)

Survey Software [FPRS04] (P,D)

• User-friendly, open-source, free implementation using Fairplay [MNPS04], suitable for use with CRA’s Taulbee salary survey. Not adopted.

• CRA’s reasons:

– Need for data cleaning, multiyear comparisons, unanticipated use

– “Perhaps most member departments will trust us.”

• Provost Offices’ reasons:

– No legal basis for using this privacy-preserving protocol on data that we otherwise don’t disclose

– Correctness and security claims are hard and expensive to assess, despite open-source implementation.

– All-or-none adoption by Ivy+ peer group. Can’t make decision unilaterally.

Future Directions in Experimentation

• Combine these and others into a general-purpose privacy-preserving data mining experimental platform. Useful for:

– fast prototyping of new protocols

– efficiency, accuracy comparisons of different approaches

• Experiment with real data and real uses.

– need to find a user community that has explicitly expressed interest, and that could potentially accomplish something via PPDM that it currently cannot accomplish.

– [Scha04]: genetics researchers may form such a community

Other Future Directions

• Preprocessing of data for PPDM.

• Privacy-preserving data solutions that use both randomization and cryptography in order to gain some of the advantages of both.

• Policies for privacy-preserving data mining: languages, reconciliation, and enforcement.

• Incentive-compatible privacy-preserving data mining.

Conclusions

• Increasing use of computers and networks has led to a proliferation of sensitive data.

• Without proper precautions, this data could be misused.

• Many technologies exist for supporting proper data handling, but much work remains, and some barriers must be overcome in order for them to be deployed.

• Cryptography is a useful component, but not the whole solution.

• Technology, policy, and education must work together.
