Privacy Preserving Data Mining: Challenges

Data Mining Technologies for Digital Libraries & Web Information Systems

Ramakrishnan Srikant

Talk Outline

Taxonomy Integration (WWW 2001, with R. Agrawal)

Searching with Numbers

Privacy-Preserving Data Mining

Taxonomy Integration

B2B electronics portal: 2000 categories, 200K datasheets

[Figure: Master Catalog with ICs subcategories DSP, Mem., and Logic holding products a to f; New Catalog with ICs subcategories Cat1 and Cat2 holding products x, y, z, w.]

Taxonomy Integration (2)

After integration:

[Figure: Master Catalog ICs subcategories DSP, Mem., and Logic now hold products a to f together with x, y, z, w.]

Goal

Use affinity information in new catalog.
– Products in same category are similar.

Accuracy boost depends on match between two categorizations.

Problem Statement

Given
– master categorization M: categories C_1, C_2, …, C_n; set of documents in each category
– new categorization N: categories S_1, S_2, …, S_n; set of documents in each category

Find the category in M for each document in N
– Standard Alg: Estimate Pr(C_i | d)
– Enhanced Alg: Estimate Pr(C_i | d, S)

Naive Bayes Classifier

Estimate probability of document d belonging to class C_i

\[ \Pr(C_i \mid d) = \frac{\Pr(C_i)\,\Pr(d \mid C_i)}{\Pr(d)} \]

Where

\[ \Pr(C_i) = \frac{\text{Number of documents in } C_i}{\text{Total number of documents}} \]

\[ \Pr(d \mid C_i) = \prod_{t \in d} \Pr(t \mid C_i) \]

\[ \Pr(t \mid C_i) = \frac{\text{\# of occurrences of } t \text{ in } C_i}{\text{Total words in } C_i} \]
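The estimates above can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: the function names `train` and `posterior`, the toy documents, and the add-one smoothing (added so that unseen words do not zero out the product) are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (category, list_of_words). Returns the counts the estimates need."""
    doc_counts = Counter()               # number of documents in each C_i
    word_counts = defaultdict(Counter)   # occurrences of each term t in each C_i
    vocab = set()
    for cat, words in docs:
        doc_counts[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab

def posterior(model, words):
    """Return Pr(C_i | d) for every category, normalized so the values sum to 1."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for cat in doc_counts:
        total_words = sum(word_counts[cat].values())
        log_p = math.log(doc_counts[cat] / total_docs)        # Pr(C_i)
        for t in words:
            # Pr(t | C_i); the +1 / +|vocab| smoothing is an assumption.
            log_p += math.log((word_counts[cat][t] + 1) / (total_words + len(vocab)))
        log_scores[cat] = log_p
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())             # plays the role of Pr(d)
    return {c: v / z for c, v in unnorm.items()}

# Hypothetical toy data, loosely modeled on the catalog example.
toy_docs = [("DSP", ["filter", "signal", "mhz"]),
            ("Mem", ["sdram", "mb", "memory"]),
            ("Mem", ["flash", "mb"])]
toy_model = train(toy_docs)
print(posterior(toy_model, ["mb", "memory"]))
```

Working in log space avoids underflow when a document has many terms; the final normalization stands in for dividing by Pr(d).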

Enhanced Naïve Bayes

Standard:

\[ \Pr(C_i \mid d) \propto \Pr(C_i)\,\Pr(d \mid C_i) \]

Enhanced:

\[ \Pr(C_i \mid d, S) \propto \Pr(C_i \mid S)\,\Pr(d \mid C_i) \]

How do we estimate Pr(C_i | S)?
– Apply standard Naïve Bayes to get number of documents in S that are classified into C_i
– Incorporate weight w reflecting match between two taxonomies. Only affect classification of borderline documents.
– For w = 0, default to standard classifier.

Enhanced Naïve Bayes (2)

\[ \Pr(C_i \mid S) \propto |C_i| \times (\text{\# docs in } S \text{ predicted to be in } C_i)^{w} \]

\[ \Pr(C_i \mid S) = \frac{|C_i| \times (\text{\# docs in } S \text{ predicted to be in } C_i)^{w}}{\sum_j |C_j| \times (\text{\# docs in } S \text{ predicted to be in } C_j)^{w}} \]

Use tuning set to determine w.
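As a rough sketch of the weighted estimate of Pr(C_i | S), assuming the category sizes and the per-category prediction counts are already available as dictionaries (the function name `enhanced_prior` and the toy numbers are hypothetical):

```python
def enhanced_prior(master_sizes, predicted_counts, w):
    """Estimate Pr(C_i | S) for each master category C_i.

    master_sizes:     {C_i: |C_i|}, number of documents in each master category
    predicted_counts: {C_i: number of docs in S that standard Naive Bayes puts in C_i}
    w:                weight reflecting how well the two taxonomies match
    """
    weights = {c: size * (predicted_counts.get(c, 0) ** w)
               for c, size in master_sizes.items()}
    z = sum(weights.values())
    if z == 0:
        raise ValueError("no category received any weight")
    return {c: v / z for c, v in weights.items()}

# Hypothetical numbers: two equally sized categories, S split 1 to 3.
sizes = {"Computer Peripheral": 100, "Digital Camera": 100}
counts = {"Computer Peripheral": 1, "Digital Camera": 3}
print(enhanced_prior(sizes, counts, w=0))   # equal priors: the standard classifier
print(enhanced_prior(sizes, counts, w=1))   # priors pulled toward Digital Camera
```

With w = 0 every count is raised to the power zero, so the estimate reduces to |C_i| / Σ_j |C_j|, which is the standard prior Pr(C_i). Larger w lets the affinity information in S dominate.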

Intuition behind Algorithm

Standard Algorithm:

      Computer Peripheral    Digital Camera
P1    20%                    80%
P2    40%                    60%
P3    60%                    40%

Enhanced Algorithm:

      Computer Peripheral    Digital Camera
P1    15%                    85%
P2    30%                    70%
P3    45%                    55%

Electronic Parts Dataset

[Figure: Accuracy Improvement on Pangea Data. x-axis: Weight (1 to 200); y-axis: Accuracy (60 to 100); series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]

1150 categories; 37,000 documents

Yahoo & OpenDirectory

5 slices of the hierarchy: Autos, Movies, Outdoors, Photography, Software
– Typical match: 69%, 15%, 3%, 3%, 1%, ….

Merging Yahoo into OpenDirectory
– 30% fewer errors (14.1% absolute difference in accuracy)

Merging OpenDirectory into Yahoo
– 26% fewer errors (14.3% absolute difference)

Summary

New algorithm for taxonomy integration.
– Exploits affinity information in the new (source) taxonomy categorizations.
– Can do substantially better, and never does significantly worse than standard Naïve Bayes.

Open Problems: SVM, Decision Tree, ...


Taxonomy Integration

Searching with Numbers (WWW 2002, with R. Agrawal)

Privacy-Preserving Data Mining

Motivation

A large fraction of the useful web consists of specification documents.
– <attribute name, value> pairs embedded in text.

Examples:
– Data sheets for electronic parts.
– Classified ads.
– Product catalogs.

Search Engines treat Numbers as Strings

Search for 6798.32 (lunar nutation cycle)
– Returns 2 pages on Google
– However, search for 6798.320 yielded no page on Google (and all other search engines)

Current search technology is inadequate for retrieving specification documents.

Data Extraction is hard

Synonyms for attribute names and units.
– "lb" and "pounds", but no "lbs" or "pound".

Attribute names are often missing.
– No "Speed", just "MHz Pentium III"
– No "Memory", just "MB SDRAM"

• 850 MHz Intel Pentium III

• 192 MB RAM

• 15 GB Hard Disk

• DVD Recorder: Included;

• Windows Me

• 14.1 inch display

• 8.0 pounds

Searching with Numbers

[Figure: documents such as "IBM ThinkPad 750 MHz Pentium 3, 196 MB DRAM, …" and "Dell Computer 700 MHz Celeron, 256 MB SDRAM, …", and database records such as "IBM ThinkPad (750 MHz, 196 MB)" and "Dell (700 MHz, 256 MB)", are searched with number-only queries such as "800 200" and "800 200 3 lb".]

Reflectivity

If we get a close match on numbers, how likely is it that we have correctly matched attribute names?
– Likelihood ≈ Non-reflectivity (of data)

Non-overlapping attributes → Non-reflective.
– Memory: 64 - 512 Mb, Disk: 10 - 40 Gb

Correlations or Clustering → Low reflectivity.
– Memory: 64 - 512 Mb, Disk: 10 - 100 Gb

Reflectivity: Examples

[Figure: three scatter plots, each with both axes running from 0 to 50, titled Non-Reflective, Low Reflectivity, and High Reflectivity.]

Reflectivity: Definition

Let
– D: dataset
– n_i: co-ordinates of point x_i
– reflections(x_i): permutations of n_i
– η(n_i): # of points within distance r of n_i
– ρ(n_i): # of reflections within distance r of n_i

\[ \text{Non-Reflectivity} = \frac{1}{|D|} \sum_{x_i \in D} \frac{\eta(n_i)}{\rho(n_i)} \]

Algorithm

How to compute match score (rank) of a document for a given query?

How to limit the number of documents for which the match score is computed?

Match Score of a Document

Select k numbers from D yielding minimum distance between Q and D.

Relative distance for each term:

\[ f(q_i, n_j) = \frac{|q_i - n_j|}{|q_i + \varepsilon|} \]

L_p norm (Euclidean distance for p = 2) to combine term distances:

\[ F(Q, D) = \Big( \sum_{i=1}^{k} f(q_i, n_{j_i})^{p} \Big)^{1/p} \]

Bipartite Graph Matching

Map problem to Bipartite Graph Matching
– k source nodes: corr. to query numbers
– m target nodes: corr. to document numbers
– An edge from each source to k nearest targets.

Assign weight f(q_i, n_j)^p to the edge (q_i, n_j).

[Figure: Query numbers 20 and 60; Doc numbers 10, 25, 75. Edge weights: 20 to 10 is .5, 20 to 25 is .25, 60 to 25 is .58, 60 to 75 is .25.]
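The match score can be illustrated with a small Python sketch. It is not the bipartite matching formulation from the slide: for brevity it tries every assignment of document numbers to query numbers, which is only practical for tiny inputs. The names `term_distance` and `match_score` and the default value of ε are assumptions.

```python
from itertools import permutations

def term_distance(q, n, eps=1e-9):
    """Relative distance f(q, n) = |q - n| / |q + eps| for one query term."""
    return abs(q - n) / abs(q + eps)

def match_score(query, doc_numbers, p=2.0):
    """F(Q, D): smallest L_p combination of term distances over all ways of
    assigning a distinct document number to each query number (brute force)."""
    k = len(query)
    if len(doc_numbers) < k:
        return float("inf")          # not enough numbers to match every term
    best = float("inf")
    for chosen in permutations(doc_numbers, k):
        total = sum(term_distance(q, n) ** p for q, n in zip(query, chosen))
        best = min(best, total ** (1.0 / p))
    return best

# The example from the figure: query (20, 60) against document numbers (10, 25, 75).
print(round(term_distance(60, 25), 2))            # edge weight between 60 and 25
print(match_score([20, 60], [10, 25, 75], p=1))   # best assignment: 20->25, 60->75
```

A minimum-cost bipartite matching, as on the slide, finds the same assignment without enumerating all permutations.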

Limiting the Set of Documents

Similar to the score aggregation problem [Fagin, PODS 96]

Proposed algorithm is an adaptation of the TA algorithm in [Fagin-Lotem-Naor, PODS 01]

Let n_i := number last looked at for query term q_i

Let

\[ \tau := \Big( \sum_{i=1}^{k} f(q_i, n_i)^{p} \Big)^{1/p} \]

Halt when t documents found whose distance <= τ
– τ is lower bound on distance of unseen documents

Limiting the Set of Documents

k conceptual sorted lists, one for each query term

Do round robin access to the lists. For each document found, compute its distance F(Q,D)

[Figure: two conceptual sorted lists over documents D1 to D9, with entries labeled number/distance. For query term 20: 25/.25, 10/.5, 35/.75. For query term 60: 66/.1, 75/.25, 25/.58.]
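The halting rule can be sketched as follows. This is a simplified illustration under several assumptions: the sorted lists are materialized in memory rather than kept conceptual, the match score is computed by brute force, and the names `top_documents`, `match_score`, and the toy documents are hypothetical.

```python
from itertools import permutations

def term_distance(q, n, eps=1e-9):
    return abs(q - n) / abs(q + eps)

def match_score(query, numbers, p):
    """Brute-force F(Q, D); fine for the tiny documents used here."""
    if len(numbers) < len(query):
        return float("inf")
    return min(sum(term_distance(q, n) ** p for q, n in zip(query, chosen)) ** (1.0 / p)
               for chosen in permutations(numbers, len(query)))

def top_documents(query, docs, t, p=2.0):
    """docs: {doc_id: [numbers]}. Returns the t best (score, doc_id) pairs."""
    # One sorted list per query term: (distance to that term, doc_id), nearest first.
    lists = [sorted((term_distance(q, n), doc_id)
                    for doc_id, nums in docs.items() for n in nums)
             for q in query]
    if not lists or not lists[0]:
        return []
    scores = {}                              # F(Q, D) for every document seen so far
    for depth in range(len(lists[0])):       # all lists have the same length
        last = []
        for lst in lists:                    # round-robin access
            dist, doc_id = lst[depth]
            last.append(dist)                # f(q_i, n_i) for the number last looked at
            if doc_id not in scores:
                scores[doc_id] = match_score(query, docs[doc_id], p)
        tau = sum(d ** p for d in last) ** (1.0 / p)   # lower bound for unseen docs
        best = sorted((s, d) for d, s in scores.items())[:t]
        if len(best) == t and best[-1][0] <= tau:
            return best                      # no unseen document can beat these
    return sorted((s, d) for d, s in scores.items())[:t]

toy_docs = {"D1": [10, 75], "D2": [25, 66], "D3": [35, 25], "D4": [200, 900]}
print(top_documents([20, 60], toy_docs, t=2, p=1))
```

In this toy run the loop halts before D4 is ever scored: every number of an unseen document lies further down each list, so its distance cannot be below τ.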

Empirical Results

[Figure: Precision (0 to 100) versus Query Size (1 to 5); series: DRAM, LCD, Proc, Trans, Auto, Credit, Glass, Housing, Wine.]

Empirical Results (2)

Screen Shot

Incorporating Hints

Use simple data extraction techniques to get hints, e.g.:

• 256 MB SDRAM memory
  Unit Hint: MB
  Attribute Hint: SDRAM, memory

Names/Units in query matched against Hints.

Summary

Allows querying using only numbers or numbers + hints.

Data can come from raw text (e.g. product descriptions) or databases.

End run around data extraction.
– Use simple extractor to generate hints.

Open Problems: integration with keyword search.


Taxonomy Integration

Searching with Numbers

Privacy-Preserving Data Mining
– Motivation
– Classification
– Associations

Growing Privacy Concerns

Popular Press:
– Economist: The End of Privacy (May 99)
– Time: The Death of Privacy (Aug 97)

Govt. legislation:
– European directive on privacy protection (Oct 98)
– Canadian Personal Information Protection Act (Jan 2001)

Special issue on internet privacy, CACM, Feb 99

S. Garfinkel, "Database Nation: The Death of Privacy in 21st Century", O'Reilly, Jan 2000

Privacy Concerns (2)

Surveys of web users
– 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users' attitude about online privacy, April 99)
– 82% said having privacy policy would matter (Freebies & Privacy: What net users think, July 99)

Technical Question

Fear:
– "Join" (record overlay) was the original sin.
– Data mining: new, powerful adversary?

The primary task in data mining: development of models about aggregated data.

Can we develop accurate models without access to precise information in individual data records?



– Motivation
– Private Information Retrieval
– Classification (SIGMOD 2000, with R. Agrawal)
– Associations

Web Demographics

Volvo S40 website targets people in 20s
– Are visitors in their 20s or 40s?
– Which demographic groups like/dislike the website?

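Solution Overview

[Figure: original records such as "50 | 40K | ..." and "30 | 70K | ..." each pass through a Randomizer, giving randomized records such as "65 | 20K | ..." and "25 | 60K | ...". The distributions of Age and of Salary are reconstructed from the randomized records and fed to Data Mining Algorithms, which produce the Model.]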

Reconstruction Problem

Original values x_1, x_2, ..., x_n
– from probability distribution X (unknown)

To hide these values, we use y_1, y_2, ..., y_n
– from probability distribution Y

Given
– x_1+y_1, x_2+y_2, ..., x_n+y_n
– the probability distribution of Y

Estimate the probability distribution of X.

Intuition (Reconstruct single point)

Use Bayes' rule for density functions

[Figure: Age axis from 10 to 90 with the randomized value V marked; curves show the original distribution for Age and the probabilistic estimate of the original value of V.]

Reconstructing the Distribution

Combine estimates of where point came from for all the points:
– Gives estimate of original distribution.

[Figure: Age axis from 10 to 90.]

Reconstruction: Bootstrapping

f_X^0 := Uniform distribution
j := 0 // Iteration number
repeat
    \[ f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y((x_i + y_i) - a)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y((x_i + y_i) - z)\, f_X^{j}(z)\, dz} \]   (Bayes' rule)
    j := j+1
until (stopping criterion met)

Converges to maximum likelihood estimate.
– D. Agrawal & C.C. Aggarwal, PODS 2001.
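A discretized version of this iteration can be sketched in Python. The sketch makes several assumptions that are not in the slides: the density f_X is approximated by probability mass on equal-width bins, the integral becomes a sum over bin midpoints, a fixed iteration count stands in for the stopping criterion, and the names `reconstruct` and `gaussian_pdf` are hypothetical.

```python
import math
import random

def gaussian_pdf(sigma):
    """Density f_Y of zero-mean Gaussian noise with standard deviation sigma."""
    def pdf(v):
        return math.exp(-0.5 * (v / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return pdf

def reconstruct(randomized, noise_pdf, lo, hi, bins=16, iters=30):
    """Estimate the distribution of X from the values x_i + y_i.

    Returns (bin midpoints, estimated probability mass per bin)."""
    width = (hi - lo) / bins
    mids = [lo + (b + 0.5) * width for b in range(bins)]
    fx = [1.0 / bins] * bins                      # f_X^0 := uniform distribution
    for _ in range(iters):                        # fixed count replaces the stopping test
        new = [0.0] * bins
        for w in randomized:
            # Bayes' rule: posterior over the bin that the original value came from.
            post = [noise_pdf(w - a) * fx[b] for b, a in enumerate(mids)]
            total = sum(post)
            if total == 0.0:
                continue                          # value unexplainable by any bin
            for b in range(bins):
                new[b] += post[b] / total
        norm = sum(new)
        fx = [v / norm for v in new]              # average of the per-point posteriors
    return mids, fx

# Hypothetical data: true ages uniform on [30, 40], Gaussian noise with sigma = 10.
random.seed(0)
ages = [random.uniform(30, 40) for _ in range(500)]
noisy = [x + random.gauss(0, 10) for x in ages]
mids, fx = reconstruct(noisy, gaussian_pdf(10), lo=0, hi=80)
for a, mass in zip(mids, fx):
    print(f"{a:5.1f}  {mass:.3f}")
```

Only the randomized values and the noise density enter the computation; the original ages are used here solely to generate test data.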

Seems to work well!

[Figure: Number of People (0 to 1200) versus Age (20 to 60); series: Original, Randomized, Reconstructed.]

Recap: Why is privacy preserved?

Cannot reconstruct individual values accurately.

Can only reconstruct distributions.



– Motivation
– Private Information Retrieval
– Classification
– Associations (KDD 2002, with A. Evfimievski, R. Agrawal & J. Gehrke)

Association Rules

Given:
– a set of transactions
– each transaction is a set of items

Association Rule: 30% of transactions that contain Book1 and Book5 also contain Book20; 5% of transactions contain these items.
– 30% : confidence of the rule.
– 5% : support of the rule.

Find all association rules that satisfy user-specified minimum support and minimum confidence constraints.

Can be used to generate recommendations.
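Support and confidence can be computed directly from their definitions. The sketch below uses hypothetical function names and a made-up set of 60 transactions chosen so that the rule from the slide comes out at 5% support and 30% confidence.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also
    contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base

# Made-up data: 3 transactions with all three books, 7 with only Book1 and Book5,
# and 50 unrelated ones.
transactions = ([{"Book1", "Book5", "Book20"}] * 3
                + [{"Book1", "Book5"}] * 7
                + [{"Book2"}] * 50)
print(support(transactions, {"Book1", "Book5", "Book20"}))        # support of the rule
print(confidence(transactions, {"Book1", "Book5"}, {"Book20"}))   # confidence of the rule
```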

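Recommendations Overview

[Figure: Alice and Bob hold transactions such as {Book 1, Book 11, Book 21} and {Book 5, Book 25}. Randomized versions such as {Book 1, Book 7, Book 21} and {Book 3, Book 25} go to the Recommendation Service, which performs Support Recovery, finds Associations, and returns Recommendations.]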

Private Information Retrieval

Retrieve 1 of n documents from a digital library without the library knowing which document was retrieved.

Trivial solution: Download entire library.

Can you do better?
– Yes, with multiple servers.
– Yes, with single server & computational privacy.

Problem introduced in [Chor et al, FOCS 95]

Uniform Randomization

Given a transaction,
– keep item with 20% probability,
– replace with a new random item with 80% probability.

Appears to give around 80% privacy…
– 80% chance that an item in the randomized transaction was not in the original transaction.

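Privacy Breach Example

10 M transactions of size 3 with 1000 items:

– 100,000 (1%) have {x, y, z}: all three items are kept with probability 0.2^3 = .008, giving 800 randomized transactions that contain {x, y, z} (99.99%).
– 9,900,000 (99%) have zero items from {x, y, z}: they randomize to {x, y, z} with probability 6 * (0.8/1000)^3 = 3 * 10^-9, giving .03 transactions (<< 1) (0.01%).

So a randomized transaction that contains {x, y, z} almost certainly came from an original transaction that contained {x, y, z}.

80% privacy “on average,” but not for all items!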
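The arithmetic in this example can be checked with a few lines of Python. The function name and parameter names are hypothetical, and the calculation mirrors the slide's simplified model rather than an exact analysis of the randomization operator.

```python
from math import factorial

def breach_estimate(n_transactions, n_items, size, keep_prob, frac_with_itemset):
    """Expected randomized transactions equal to a given itemset, split by origin."""
    with_set = n_transactions * frac_with_itemset    # originals that contain the itemset
    without = n_transactions - with_set              # originals with none of its items
    p_true = keep_prob ** size                       # every item kept
    # Every item replaced, and the replacements hit the itemset in some order.
    p_false = factorial(size) * ((1 - keep_prob) / n_items) ** size
    true_hits = with_set * p_true
    false_hits = without * p_false
    return true_hits, false_hits, true_hits / (true_hits + false_hits)

true_hits, false_hits, posterior = breach_estimate(
    n_transactions=10_000_000, n_items=1000, size=3,
    keep_prob=0.2, frac_with_itemset=0.01)
print(f"{true_hits:.0f} true, {false_hits:.3f} false, posterior {posterior:.4%}")
```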

Solution

“Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?”

“He grows a forest to hide it in.”

G.K. Chesterton

Insert many false items into each transaction.

Hide true itemsets among false ones.

No free lunch: Need more transactions to discover associations.

Related Work

S. Rizvi, J. Haritsa, “Privacy-Preserving Association Rule Mining”, VLDB 2002.

Protecting privacy across databases:
– Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining”, Crypto 2000.
– J. Vaidya and C.W. Clifton, “Privacy Preserving Association Rule Mining in Vertically Partitioned Data”, KDD 2002.

Summary

Have your cake and mine it too!
– Preserve privacy at the individual level, but still build accurate models.
– Can do both classification & association rules.

Open Problems: Clustering, Lower bounds on discoverability versus privacy, Faster algorithms, …

Slides available from ...

www.almaden.ibm.com/cs/people/srikant/talks.html

Backup

Lowest Discoverable Support

LDS is the support s.t., when predicted, it is 4σ away from zero.

Roughly, LDS is proportional to 1/√|T|

[Figure: LDS vs. number of transactions. x-axis: Number of transactions, millions (1, 10, 100); y-axis: LDS, % (0 to 1.2); series: 1-itemsets, 2-itemsets, 3-itemsets; |t| = 5, privacy breach level = 50%.]

LDS vs. Breach Level

[Figure: LDS, % (0 to 2.5) versus Privacy Breach Level, % (30 to 90); |t| = 5, |T| = 5 M.]

Basic 2-server Scheme

Each server returns XOR of green bits.

Client XORs bits returned by server.

Communication complexity: O(n)

[Figure: database bits numbered 1 to 8, with the bits requested from each server highlighted in green.]
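A small Python sketch of a two-server XOR scheme of this kind follows. It assumes the standard construction in which the client sends a random subset of positions to one server and the same subset with the desired position toggled to the other; the slide does not spell this out, and the function names are hypothetical.

```python
import secrets

def make_queries(n, i):
    """Build the two index sets sent to the servers for desired position i."""
    s1 = {j for j in range(n) if secrets.randbits(1)}   # uniformly random subset
    s2 = s1 ^ {i}                                       # same subset with i toggled
    return s1, s2

def server_answer(db_bits, subset):
    """Each server returns the XOR of the requested bits."""
    answer = 0
    for j in subset:
        answer ^= db_bits[j]
    return answer

def retrieve(db_bits, i):
    """Client side: XOR the two one-bit answers to recover db_bits[i]."""
    s1, s2 = make_queries(len(db_bits), i)
    return server_answer(db_bits, s1) ^ server_answer(db_bits, s2)

db = [1, 0, 1, 1, 0, 0, 1, 0]
print([retrieve(db, i) for i in range(len(db))])
```

Each server sees a uniformly random subset on its own, so neither learns i as long as the two do not collude. Every bit other than position i appears in both answers or in neither and cancels out. Describing a subset takes n bits, which matches the O(n) communication complexity.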

Sqrt(n) Algorithm

Each server returns bit-wise XOR of specified blocks.

Client XORs the 2 blocks & selects desired bits.

Each block has sqrt(n) elements => 4*sqrt(n) communication complexity.

Server computation time still O(n)

[Figure: database bits numbered 1 to 8.]

Computationally Private IR

Use pseudo-random function + mask to generate sets.

Quadratic residuosity.

Difficulty of deciding whether a small prime divides φ(m)
– m: composite integer of unknown factorization
– φ(m): Euler totient fn, i.e., # of positive integers <= m that are relatively prime to m.

Extensions

Retrieve documents (blocks), not bits.
– If n <= l, comm. complexity 4l.
– If n <= l^2/4, comm. complexity 8l.

Lower communication complexity.

Select documents using keywords.

Protect data privacy.

Preprocessing to reduce computation time.

Computationally-private information retrieval with single server.

Potential Privacy Breaches

Distribution is a spike.
– Example: Everyone is of age 40.

Some randomized values are only possible from a given range.
– Example: Add U[-50,+50] to age and get 125 ⇒ True age is ≥ 75.
– Not an issue with Gaussian.

Potential Privacy Breaches (2)

Most randomized values in a given interval come from a given interval.
– Example: 60% of the people whose randomized value is in [120,130] have their true age in [70,80].
– Implication: Higher levels of randomization will be required.

Correlations can make previous effect worse.
– Example: 80% of the people whose randomized value of age is in [120,130] and whose randomized value of income is [...] have their true age in [70,80].

Work in Statistical Databases

Provide statistical information without compromising sensitive information about individuals (surveys: AW89, Sho82)

Techniques
– Query Restriction
– Data Perturbation

Negative Results: cannot give high quality statistics and simultaneously prevent partial disclosure of individual information [AW89]

Statistical Databases: Techniques

Query Restriction
– restrict the size of query result (e.g. FEL72, DDS79)
– control overlap among successive queries (e.g. DJL79)
– suppress small data cells (e.g. CO82)

Output Perturbation
– sample result of query (e.g. Den80)
– add noise to query result (e.g. Bec80)

Data Perturbation
– replace db with sample (e.g. LST83, LCL85, Rei84)
– swap values between records (e.g. Den82)
– add noise to values (e.g. TYW84, War65)

Statistical Databases: Comparison

We do not assume original data is aggregated into a single database.

Concept of reconstructing original distribution.
– Adding noise to data values problematic without such reconstruction.