Dec 27, 2015
Privacy preserving data mining – randomized response and association rule hiding
Li Xiong
CS573 Data Privacy and Anonymity
Partial slides credit: W. Du, Syracuse University, Y. Gao, Peking University
Privacy Preserving Data Mining Techniques
Protecting sensitive raw data
  Randomization (additive noise)
  Geometric perturbation and projection (multiplicative noise)
  Randomized response technique
    Categorical data perturbation in data collection model
Protecting sensitive knowledge (knowledge hiding)
Data Collection Model
[Figure: Individual Data, Data Publisher, and Data Miner; Step 1: Data Collection; Step 2: Data Publishing]
Data cannot be shared directly because of privacy concerns
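Background: Randomized Response
Question: Do you smoke? (The true answer is “Yes”)
Biased coin: $P(\text{Head}) = \theta$, with $\theta \neq 0.5$
  Head: report the true answer (Yes)
  Tail: report the opposite answer (No)
$P'(\text{Yes}) = P(\text{Yes}) \cdot \theta + P(\text{No}) \cdot (1 - \theta)$
$P'(\text{No}) = P(\text{Yes}) \cdot (1 - \theta) + P(\text{No}) \cdot \theta$
$P(\text{Yes})$ can be solved from these equations when $\theta \neq 0.5$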
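The estimation step can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the function names, the seed, and the values of theta and the true smoking rate are assumptions chosen for the example. It substitutes P(No) = 1 - P(Yes) into the first equation and solves for P(Yes).

```python
import random

def randomize(truth: bool, theta: float, rng: random.Random) -> bool:
    """Report the true answer with probability theta, otherwise the opposite."""
    return truth if rng.random() < theta else not truth

def estimate_p_yes(reported_yes_fraction: float, theta: float) -> float:
    """Invert P'(Yes) = P(Yes)*theta + (1 - P(Yes))*(1 - theta)."""
    if theta == 0.5:
        raise ValueError("theta = 0.5 makes the reports independent of the truth")
    return (reported_yes_fraction - (1 - theta)) / (2 * theta - 1)

# Simulated survey (all values below are illustrative assumptions).
rng = random.Random(0)
theta, true_p_yes, n = 0.7, 0.3, 100_000
truths = [rng.random() < true_p_yes for _ in range(n)]
reports = [randomize(t, theta, rng) for t in truths]
p_reported = sum(reports) / n
p_estimated = estimate_p_yes(p_reported, theta)
print(round(p_reported, 3), round(p_estimated, 3))
```

The collector never sees an individual's true answer, yet the aggregate estimate approaches the true rate as the number of respondents grows.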
Decision Tree Mining using Randomized Response
Multiple attributes encoded in bits
Biased coin: $P(\text{Head}) = \theta$, with $\theta \neq 0.5$
  Head: True answer E: 110
  Tail: False answer !E: 001
Column distribution can be estimated for learning a decision tree!
Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003
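A hedged sketch of how the fraction of a bit pattern E could be estimated from such reports. It assumes each respondent sends either E (with probability theta) or its bitwise complement !E, so that P*(E) = P(E)·theta + P(!E)·(1 - theta) and P*(!E) = P(!E)·theta + P(E)·(1 - theta); solving the pair gives the estimator below. The function names and the toy data are hypothetical.

```python
from collections import Counter

def complement(bits: str) -> str:
    """Flip every bit, e.g. '110' -> '001'."""
    return "".join("1" if b == "0" else "0" for b in bits)

def estimate_true_fraction(reports: list[str], pattern: str, theta: float) -> float:
    """Estimate P(E = pattern) from randomized reports."""
    if theta == 0.5:
        raise ValueError("theta = 0.5 carries no information")
    counts = Counter(reports)
    n = len(reports)
    p_star_e = counts[pattern] / n                  # observed fraction of E
    p_star_not_e = counts[complement(pattern)] / n  # observed fraction of !E
    return (theta * p_star_e - (1 - theta) * p_star_not_e) / (2 * theta - 1)

# Toy reports: 34% '110', 16% '001', the rest another pattern.
reports = ["110"] * 34 + ["001"] * 16 + ["111"] * 50
print(estimate_true_fraction(reports, "110", theta=0.8))
```

Counts of this kind are what a decision tree learner needs to evaluate candidate splits, which is why the tree can be built without the original records.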
Accuracy of Decision tree built on randomized response
Generalization for Multi-Valued Categorical Data
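True Value: $S_i$ is reported as
  $S_i$ with probability $q_1$
  $S_{i+1}$ with probability $q_2$
  $S_{i+2}$ with probability $q_3$
  $S_{i+3}$ with probability $q_4$

$$
\begin{pmatrix} P'(s_1) \\ P'(s_2) \\ P'(s_3) \\ P'(s_4) \end{pmatrix}
=
\underbrace{\begin{pmatrix}
q_1 & q_4 & q_3 & q_2 \\
q_2 & q_1 & q_4 & q_3 \\
q_3 & q_2 & q_1 & q_4 \\
q_4 & q_3 & q_2 & q_1
\end{pmatrix}}_{M}
\begin{pmatrix} P(s_1) \\ P(s_2) \\ P(s_3) \\ P(s_4) \end{pmatrix}
$$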
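The matrix form suggests a direct estimation procedure: build M from the q values, observe P', and solve M·P = P' for P. The sketch below assumes M is invertible; the q values, the true distribution, and the small Gauss-Jordan solver are illustrative choices, not taken from the slides.

```python
def circulant_rr_matrix(q):
    """M[j][i] = P(report s_j | true s_i) = q[(j - i) mod n]."""
    n = len(q)
    return [[q[(j - i) % n] for i in range(n)] for j in range(n)]

def mat_vec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(v)] for row, v in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; P cannot be recovered")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

q = [0.7, 0.1, 0.1, 0.1]            # assumed q1..q4
M = circulant_rr_matrix(q)
p_true = [0.4, 0.3, 0.2, 0.1]       # assumed true distribution
p_reported = mat_vec(M, p_true)     # what the collector observes (in expectation)
p_estimated = solve(M, p_reported)
print([round(p, 3) for p in p_estimated])
```

In practice P' is an empirical frequency, so the recovered P is only an estimate and its error depends on how well conditioned M is.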
A Generalization
RR Matrices [Warner 65], [R.Agrawal 05], [S. Agrawal 05]
RR Matrix can be arbitrary
Can we find optimal RR matrices?
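$$
M = \begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{pmatrix}
$$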
OptRR:Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008
What is an optimal matrix?
Which of the following is better?
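$$
M_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\qquad
M_2 = \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}
$$
Privacy: M2 is better
Utility: M1 is better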
So, what is an optimal matrix?
Optimal RR Matrix
An RR matrix M is optimal if no other RR matrix is better than M in both privacy and utility (i.e., no other matrix dominates M).
Privacy Quantification
Utility Quantification
A number of privacy and utility metrics have been proposed.
  Privacy: how accurately one can estimate individual info.
  Utility: how accurately we can estimate aggregate info.
Metrics
Privacy: accuracy of estimate of individual values
Utility: difference between the original probability and the estimated probability
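One possible way to quantify the privacy side is sketched below. It is an assumption for illustration, not necessarily the metric used in the cited work: the adversary guesses the most likely true value given each reported value, and the score is the probability that this guess is correct (lower means better privacy). The prior distribution is made up.

```python
def adversary_accuracy(M, prior):
    """Probability that a maximum-a-posteriori guess of the true value is right.

    M[j][i] = P(report j | true i); prior[i] = P(true i).
    """
    n = len(prior)
    return sum(max(M[j][i] * prior[i] for i in range(n)) for j in range(n))

prior = [0.5, 0.3, 0.2]                      # assumed true distribution
M1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]       # identity: reports reveal everything
M2 = [[1 / 3] * 3 for _ in range(3)]         # uniform: reports reveal nothing

print(adversary_accuracy(M1, prior))
print(adversary_accuracy(M2, prior))
```

Under this measure the identity matrix lets the adversary guess every individual value, while the uniform matrix leaves the adversary with nothing better than guessing the most common value. The uniform matrix is also singular, so the aggregate distribution cannot be recovered from it, which is the utility cost.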
Optimization Methods
Approach 1: Weighted sum
  w1 Privacy + w2 Utility
Approach 2
  Fix Privacy, find M with the optimal Utility.
  Fix Utility, find M with the optimal Privacy.
  Challenge: Difficult to generate M with a fixed privacy or utility.
Proposed Approach: Multi-Objective Optimization
Optimization algorithm
Evolutionary Multi-Objective Optimization (EMOO)
The algorithm
  Start with a set of initial RR matrices
  Repeat the following steps in each iteration
    Mating: select two RR matrices from the pool
    Crossover: exchange several columns between the two RR matrices
    Mutation: change some values in an RR matrix
    Meet the privacy bound: filter the resultant matrices
    Evaluate the fitness value for the new RR matrices
Note: the fitness value is defined in terms of privacy and utility metrics
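The crossover and mutation steps can be sketched as below. This is a simplified illustration under stated assumptions: matrices are stored so that each column is the report distribution for one true value (columns sum to 1), each column is swapped with probability 0.5, and mutation perturbs one column and renormalizes it. The privacy-bound filter and the fitness evaluation are not shown, and all names are hypothetical.

```python
import random

def random_rr_matrix(n, rng):
    """Random matrix with M[j][i] = P(report j | true i); every column sums to 1."""
    cols = []
    for _ in range(n):
        raw = [rng.random() for _ in range(n)]
        total = sum(raw)
        cols.append([v / total for v in raw])
    return [[cols[i][j] for i in range(n)] for j in range(n)]

def crossover(M_a, M_b, rng):
    """Exchange a random subset of columns between two matrices."""
    n = len(M_a)
    child_a = [row[:] for row in M_a]
    child_b = [row[:] for row in M_b]
    for i in range(n):
        if rng.random() < 0.5:
            for j in range(n):
                child_a[j][i], child_b[j][i] = child_b[j][i], child_a[j][i]
    return child_a, child_b

def mutate(M, rng, scale=0.1):
    """Perturb one randomly chosen column, then renormalize it."""
    n = len(M)
    i = rng.randrange(n)
    col = [max(1e-9, M[j][i] + rng.uniform(-scale, scale)) for j in range(n)]
    total = sum(col)
    out = [row[:] for row in M]
    for j in range(n):
        out[j][i] = col[j] / total
    return out

rng = random.Random(1)
parent_a, parent_b = random_rr_matrix(4, rng), random_rr_matrix(4, rng)
child_a, child_b = crossover(parent_a, parent_b, rng)
mutant = mutate(child_a, rng)
```

Swapping whole columns keeps every child a valid RR matrix, which is a plausible reason for exchanging columns rather than individual entries.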
Illustration
Output of Optimization
[Figure: RR matrices M1 to M8 plotted in the objective space; axes: Privacy and Utility, each running from Worse to Better]
The optimal set is often plotted in the objective space as a Pareto front.
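Extracting the optimal set from a pool of scored matrices can be sketched as follows. The scores are made-up values with higher meaning better on both axes, and the code uses the common Pareto convention (at least as good on both objectives and not identical), which is slightly broader than the "both better" wording on the slide.

```python
def pareto_front(points):
    """points: dict name -> (privacy, utility); higher is better for both.

    Returns the names that no other point dominates.
    """
    def dominates(a, b):
        return a[0] >= b[0] and a[1] >= b[1] and a != b

    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    )

# Hypothetical (privacy, utility) scores for four RR matrices.
scores = {"M1": (0.1, 0.9), "M2": (0.9, 0.1), "M3": (0.5, 0.5), "M4": (0.4, 0.4)}
front = pareto_front(scores)
print(front)
```

Here M4 is dropped because M3 is at least as good on both objectives, while M1, M2, and M3 represent different trade-offs and all remain.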
For First attribute of Adult data
Privacy Preserving Data Mining Techniques
Protecting sensitive raw data
  Randomization (additive noise)
  Geometric perturbation and projection (multiplicative noise)
  Randomized response technique
Protecting sensitive knowledge (knowledge hiding)
  Frequent itemset and association rule hiding
  Downgrading classifier effectiveness
Frequent Itemset Mining and Association Rule Mining
Frequent itemset mining: frequent sets of items in a transaction data set
Association rules: associations between items
Frequent Itemset Mining and Association Rule Mining
First proposed by Agrawal, Imielinski, and Swami in SIGMOD 1993
  SIGMOD Test of Time Award 2003
  “This paper started a field of research. In addition to containing an innovative algorithm, its subject matter brought data mining to the attention of the database community … even led several years ago to an IBM commercial, featuring supermodels, that touted the importance of work such as that contained in this paper.”
Apriori algorithm in VLDB 1994
  #4 in the top 10 data mining algorithms in ICDM 2006
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD ’93.
Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
April 19, 2023
Basic Concepts: Frequent Patterns and Association Rules
Itemset: X = {x1, …, xk} (k-itemset)
Frequent itemset: X with at least the minimum support count
  Support count (absolute support): count of transactions containing X
Association rule: A → B with minimum support and confidence
  Support: probability that a transaction contains A ∪ B
    s = P(A ∪ B)
  Confidence: conditional probability that a transaction having A also contains B
    c = P(B | A)
Association rule mining process
  Find all frequent patterns (more costly)
  Generate strong association rules
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
Transaction-id Items bought
10 A, B, D
20 A, C, D
30 A, D, E
40 B, E, F
50 B, C, D, E, F
Illustration of Frequent Itemsets and Association Rules
Transaction-id Items bought
10 A, B, D
20 A, C, D
30 A, D, E
40 B, E, F
50 B, C, D, E, F
Frequent itemsets (minimum support count = 3)?
  {A:3, B:3, D:4, E:3, AD:3}
Association rules (minimum support = 50%, minimum confidence = 50%)?
  A → D (60%, 100%)
  D → A (60%, 75%)
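The numbers on this slide can be reproduced with a short brute-force sketch. It enumerates every candidate itemset, which is fine for five transactions but is not the Apriori algorithm; the function names are hypothetical.

```python
from itertools import combinations

transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support_count(itemset, db):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for items in db.values() if set(itemset) <= items)

def frequent_itemsets(db, min_count):
    """Brute-force enumeration of all itemsets meeting the support count."""
    items = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            count = support_count(cand, db)
            if count >= min_count:
                result["".join(cand)] = count
    return result

def rule_stats(lhs, rhs, db):
    """Return (support, confidence) of the rule lhs -> rhs."""
    both = support_count(set(lhs) | set(rhs), db)
    return both / len(db), both / support_count(lhs, db)

print(frequent_itemsets(transactions, min_count=3))
print(rule_stats("A", "D", transactions))
print(rule_stats("D", "A", transactions))
```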
SIGMOD Ph.D. Workshop IDAR’07
Association Rule Hiding: what? why??
Problem: hide sensitive association rules in data without losing non-sensitive rules
Motivations: confidential rules may have serious adverse effects
Problem statement
Given
  a database D to be released
  minimum support and confidence thresholds “MST”, “MCT”
  a set of association rules R mined from D
  a set of sensitive rules Rh ⊆ R to be hidden
Find a new database D’ such that
  the rules in Rh cannot be mined from D’
  as many of the rules in R-Rh as possible can still be mined
Solutions
Data modification approaches
  Basic idea: data sanitization D->D’
  Approaches: distortion, blocking
  Drawbacks
    Cannot control hiding effects intuitively, lots of I/O
Data reconstruction approaches
  Basic idea: knowledge sanitization D->K->D’
  Potential advantages
    Can easily control the availability of rules and control the hiding effects directly and intuitively
Distortion-based Techniques
Sample Database
A B C D
1 1 1 0
1 0 1 1
0 0 0 1
1 1 1 0
1 0 1 1
Rule A→C has:
  Support(A→C) = 80%
  Confidence(A→C) = 100%

Distortion Algorithm

Distorted Database
A B C D
1 1 1 0
1 0 0 1
0 0 0 1
1 1 1 0
1 0 0 1
Rule A→C has now:
  Support(A→C) = 40%
  Confidence(A→C) = 50%
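The effect of the distortion on rule A→C can be checked with a small sketch. The helper names are hypothetical; the two databases are the ones shown on the slide, where two 1s in column C were changed to 0.

```python
def make_rows(lines):
    """Parse rows such as '1 0 1 1' into dicts keyed by item name."""
    return [dict(zip("ABCD", map(int, line.split()))) for line in lines]

def support_confidence(rows, lhs, rhs):
    """Return (support, confidence) of the rule lhs -> rhs over binary rows."""
    n_lhs = sum(1 for r in rows if r[lhs] == 1)
    n_both = sum(1 for r in rows if r[lhs] == 1 and r[rhs] == 1)
    return n_both / len(rows), n_both / n_lhs

sample = make_rows(["1 1 1 0", "1 0 1 1", "0 0 0 1", "1 1 1 0", "1 0 1 1"])
distorted = make_rows(["1 1 1 0", "1 0 0 1", "0 0 0 1", "1 1 1 0", "1 0 0 1"])

print(support_confidence(sample, "A", "C"))
print(support_confidence(distorted, "A", "C"))
```

If the thresholds lie above the new values, the rule can no longer be mined from the released database.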
Side Effects
Rule Eliminated (Undesirable Side Effect)
  Before hiding process: rule Ri has had conf(Ri) > MCT
  After hiding process: rule Ri has now conf(Ri) < MCT
Ghost Rule (Undesirable Side Effect)
  Before hiding process: rule Ri has had conf(Ri) < MCT
  After hiding process: rule Ri has now conf(Ri) > MCT
Itemset Eliminated (Undesirable Side Effect)
  Before hiding process: large itemset I has had sup(I) > MST
  After hiding process: itemset I has now sup(I) < MST
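At the level of rule sets, these side effects can be detected by comparing what is mined before and after hiding. The sketch below is illustrative: the rule names are made up, and the extra "hiding_failures" entry (sensitive rules that are still minable) is an addition not listed on the slide.

```python
def side_effects(rules_before, rules_after, sensitive):
    """Compare rule sets mined from D and from the sanitized D'."""
    return {
        # non-sensitive rules that disappeared (rule eliminated)
        "lost": (rules_before - sensitive) - rules_after,
        # rules that were not minable before but are now (ghost rules)
        "ghost": rules_after - rules_before,
        # sensitive rules that can still be mined
        "hiding_failures": sensitive & rules_after,
    }

# Hypothetical example.
before = {"A->C", "B->A", "D->A"}
after = {"D->A", "C->D"}
sensitive = {"A->C"}
report = side_effects(before, after, sensitive)
print(report)
```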
Distortion-based Techniques
Challenges/Goals:
To minimize the undesirable Side Effects that the hiding process causes to non-sensitive rules.
To minimize the number of 1’s that must be deleted in the database.
Algorithms must be linear in time as the database increases in size.
Sensitive itemsets: ABC
Data distortion [Atallah 99]
Hardness result: The distortion problem is NP-hard
Heuristic search
  Find items to remove and transactions to remove the items from
Disclosure Limitation of Sensitive Rules, M. Atallah, A. Elmagarmid, M. Ibrahim, E. Bertino, V. Verykios, 1999
Heuristic Approach
A greedy bottom-up search through the ancestors (subsets) of the sensitive itemset for the parent with maximum support (why?)
  At the end of the search, a 1-itemset is selected
Search through the common transactions containing the item and the sensitive itemset for the transaction that affects the minimum number of 2-itemsets
Delete the selected item from the identified transaction
Results comparison
Blocking-based Techniques
Initial Database
A B C D
1 1 1 0
1 0 1 1
0 0 0 1
1 1 1 0
1 0 1 1

Blocking Algorithm

New Database
A B C D
1 1 1 0
1 0 ? 1
? 0 0 1
1 1 1 0
1 0 1 1

Support and confidence become marginal.
In New Database: 60% ≤ conf(A → C) ≤ 100%
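The confidence interval can be verified by trying every way of filling in the blocked cells. The sketch below is a brute-force check under the assumption that each "?" may be either 0 or 1; blocked cells are represented as None, and the function name is hypothetical.

```python
from itertools import product

def confidence_bounds(rows, lhs, rhs):
    """Min and max confidence of lhs -> rhs over all fillings of blocked cells.

    rows: list of dicts item -> 0, 1, or None (None marks a blocked '?').
    """
    unknowns = [(i, item) for i, r in enumerate(rows)
                for item, v in r.items() if v is None]
    confs = []
    for fill in product([0, 1], repeat=len(unknowns)):
        filled = [dict(r) for r in rows]
        for (i, item), v in zip(unknowns, fill):
            filled[i][item] = v
        n_lhs = sum(1 for r in filled if r[lhs] == 1)
        n_both = sum(1 for r in filled if r[lhs] == 1 and r[rhs] == 1)
        if n_lhs:
            confs.append(n_both / n_lhs)
    return min(confs), max(confs)

new_db = [
    {"A": 1, "B": 1, "C": 1, "D": 0},
    {"A": 1, "B": 0, "C": None, "D": 1},
    {"A": None, "B": 0, "C": 0, "D": 1},
    {"A": 1, "B": 1, "C": 1, "D": 0},
    {"A": 1, "B": 0, "C": 1, "D": 1},
]
print(confidence_bounds(new_db, "A", "C"))
```

Enumeration is exponential in the number of blocked cells, so this is only a check for small examples.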
Data reconstruction approach
1. Frequent Set Mining: D -> FS (the rules R are derived from FS)
2. Perform sanitization Algorithm: FS -> FS’ (the rules R-Rh are derived from FS’)
3. FP-tree-based Inverse Frequent Set Mining: FS’ -> FP-tree -> D’
2007-7-10 SIGMOD Ph.D. Workshop IDAR’07
The first two phases
1. Frequent set mining
  Generate all frequent itemsets FS, with their supports and support counts, from original database D
2. Perform sanitization algorithm
  Input: FS output in phase 1, R, Rh
  Output: sanitized frequent itemsets FS’
  Process
    Select hiding strategy
    Identify sensitive frequent sets
    Perform sanitization
In the best case, the sanitization algorithm ensures that exactly the non-sensitive rule set R-Rh can be derived from FS’
[Diagram: FS yields R; FS’ yields R-Rh]
Example: the first two phases
Original Database: D
TID Items
T1 ABCE
T2 ABC
T3 ABCD
T4 ABD
T5 AD
T6 ACD

σ = 4, MST = 66%, MCT = 75%

1. Frequent set mining

Frequent Itemsets: FS
A:6 100%
B:4 66%
C:4 66%
D:4 66%
AB:4 66%
AC:4 66%
AD:4 66%

Association Rules: R
rules confidence support
B → A 100% 66%
C → A 100% 66%
D → A 100% 66%

2. Perform sanitization algorithm

Frequent Itemsets: FS'
A:6 100%
C:4 66%
D:4 66%
AC:4 66%
AD:4 66%

Association Rules: R-Rh
rules confidence support
C → A 100% 66%
D → A 100% 66%
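The step from a table of frequent itemsets to its rule set can be sketched as follows, using the FS and FS' counts from this example. The sanitization step itself is not implemented here; the code only derives rules from a given itemset table, and the function names are hypothetical.

```python
from itertools import combinations

def fs_from(spec):
    """Turn {'AB': 4, ...} into {frozenset({'A', 'B'}): 4, ...}."""
    return {frozenset(k): v for k, v in spec.items()}

def rules_from_itemsets(fs, n_transactions, mct):
    """Derive rules lhs -> rhs with confidence >= mct from frequent itemsets.

    Returns {(lhs, rhs): (confidence, support)}.
    """
    rules = {}
    for itemset, count in fs.items():
        for k in range(1, len(itemset)):
            for lhs_items in combinations(sorted(itemset), k):
                lhs = frozenset(lhs_items)
                if lhs not in fs:
                    continue
                conf = count / fs[lhs]
                if conf >= mct:
                    key = ("".join(sorted(lhs)), "".join(sorted(itemset - lhs)))
                    rules[key] = (conf, count / n_transactions)
    return rules

FS = fs_from({"A": 6, "B": 4, "C": 4, "D": 4, "AB": 4, "AC": 4, "AD": 4})
FS_sanitized = fs_from({"A": 6, "C": 4, "D": 4, "AC": 4, "AD": 4})

R = rules_from_itemsets(FS, n_transactions=6, mct=0.75)
R_minus_Rh = rules_from_itemsets(FS_sanitized, n_transactions=6, mct=0.75)
print(sorted(R))
print(sorted(R_minus_Rh))
```

Removing B and AB from the table removes the rule B → A and leaves the other two rules with unchanged confidence and support.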
Open research questions
Optimal solution
Itemsets sanitization
  The support and confidence of the rules in R-Rh should remain unchanged as much as possible
Integrating data protection and knowledge (rule) protection
Coming up
Cryptographic protocols for privacy preserving distributed data mining
Classification of current algorithms
Data modification
  Data-Distortion
    Hide rules: Algo1a, Algo1b, Algo2a, WSDA, PDA
    Hide large itemsets: Algo2b, Algo2c, Naïve, MinFIA, MaxFIA, IGA, RRA, RA, SWA, Border-Based, Integer-Programming, Sanitization-Matrix
  Data-Blocking
    Hide rules: CR, CR2
    Hide large itemsets: GIH
Data reconstruction
  Hide large itemsets: CIILM
Weight-based Sorting Distortion Algorithm (WSDA) [Pontikakis 03]
High Level Description:
Input:
  Initial Database
  Set of Sensitive Rules
  Safety Margin (for example 10%)
Output:
  Sanitized Database
  Sensitive Rules no longer hold in the Database
WSDA Algorithm
High Level Description:
1st step:
  Retrieve the set of transactions which support sensitive rule R_S
  For each sensitive rule R_S find the number N_1 of transactions in which one item that supports the rule will be deleted
2nd step:
  For each rule R_i in the Database with common items with R_S compute a weight w that denotes how strong R_i is
  For each transaction that supports R_S compute a priority P_i that denotes how many strong rules this transaction supports
3rd step:
  Sort the N_1 transactions in ascending order according to their priority value P_i
4th step:
  For the first N_1 transactions hide an item that is contained in R_S
5th step:
  Update confidence and support values for other rules in the database
Discussion
Sanitization algorithm
  Compared with early popular data sanitization: performs sanitization directly on knowledge level of data
Inverse frequent set mining algorithm
  Deals with frequent items and infrequent items separately: more efficient, a large number of outputs
Proposed Solution
Our solution provides the user with a knowledge-level window in which to perform sanitization conveniently, and generates a number of secure databases