Preserving Privacy in Data Preparation for
Association Rule Mining
Nan Zhang, Shengquan Wang, and Wei Zhao, Fellow, IEEE
Abstract
We address the privacy preserving association rule mining problem in a distributed system with one data miner
and multiple data providers, each of which holds one transaction. The literature has tacitly assumed that randomization
is the only effective approach to preserving privacy in such circumstances. We challenge this assumption by introducing
a scheme based on algebraic techniques in the data preparation phase. Compared to previous approaches, our new
scheme identifies association rules more accurately while disclosing less private information. Furthermore, our new
scheme can be readily integrated as a middleware with existing systems.
Index Terms
Data mining; clustering, classification, and association rules; privacy; singular value decomposition.
I. INTRODUCTION
The goal of data mining is to extract interesting knowledge from large amounts of data [1]. Traditional
data mining algorithms deal with centralized data. Recently, a number of applications on the Internet have
led to a need for mining distributed data. In these circumstances, a privacy concern arises from the distributed
The authors are with the Department of Computer Science, Texas A&M University, College Station, TX 77840. E-mail: {nzhang, swang,
zhao}@cs.tamu.edu.
A preliminary version of this paper is to be presented at the 8th European Conference on Principles and Practice of Knowledge Discovery
in Databases (PKDD), September 2004.
data providers. In this paper, we address issues related to the production of accurate data mining results
while preserving the private information in the data being mined.
We will focus on association rule mining, which will be briefly reviewed in the next section. Since
Agrawal, Imielinski, and Swami addressed this problem in [2], association rule mining has been an active
research area due to its wide applications and the challenges it presents. Many algorithms have been
proposed and analyzed [3]–[5]. However, few of them have addressed the issue of privacy protection.
We can classify privacy preserving association rule mining systems into two classes based on their
infrastructures, named Server-to-Server (S2S) and Client-to-Server (C2S), respectively. In the first category
(S2S), data are distributed across several autonomous entities (servers) [6], [7]. Each server holds a private
database that contains numerous data points (i.e., transactions). The servers collaborate with each other
to identify association rules spanning multiple databases. Since usually only a few servers are involved
in a system (e.g., fewer than 10), the problem can be modeled as a variation of the secure multi-party
computation [8], which has been extensively studied in cryptography [9].
In the second category (C2S), a system consists of one data miner (server) and a large number of data
providers (clients) [10], [11]. Each data provider holds only one transaction. Association rule mining is
performed by the data miner on the aggregate transactions provided by the data providers. An online survey
is a typical example of this kind of system, as it can be modeled as consisting of one data miner
(i.e., the survey analyzer) and thousands of data providers (i.e., the survey respondents). To ensure the
effectiveness of the survey results (e.g., to block multiple votes from a unique IP), the identity of the data
providers cannot be hidden from the data miner. Thus, privacy is of particular concern in this kind of
system. In fact, there has been wide media coverage of the public debate on protecting privacy in online
surveys [12].
Both S2S and C2S systems have wide applications. Nevertheless, we will focus on studying C2S
systems in this paper. Several studies have been carried out on privacy preserving association rule mining
in C2S systems. Most of them have tacitly assumed that an effective approach to preserving privacy is
randomization. If we consider data mining as a two-phase process, consisting of the data preparation
phase and the data mining phase, the randomization approach is involved in both phases. We challenge
this assumption by introducing a new scheme that operates only in the data preparation phase.
Fig. 1. Two-phase data mining
Our new scheme integrates algebraic techniques with random noise perturbation. It has the following
important features that distinguish it from previous approaches:
• Our scheme is easy to implement and flexible. Our scheme is involved only in the data preparation
phase and does not require a support recovery procedure in the data mining phase. Thus, our new
scheme is transparent to the data mining process. It can be readily integrated as a middleware with
existing systems.
• Our scheme can identify association rules more accurately while disclosing less private information.
Roughly speaking, our scheme selects the most useful features for association rule mining from the
original data and transmits only these features to the data miner. Our simulation data show that, for
the same level of accuracy, our system discloses about five times less private information than the
randomization approach.
• We allow explicit negotiation between the data providers and the data miner in terms of the tradeoff
between the accuracy of data mining results and the privacy of data providers. Instead of following
the rules set by the data miner, a data provider can play a role in determining the tradeoff between
accuracy and privacy. This is an important feature because people have a wide variety of attitudes
towards privacy. According to survey results in [13], net users' attitudes towards privacy can be
divided into three categories: 7% privacy fundamentalists, who are extremely concerned about the use
of their private data; 56% pragmatic majority, who are generally willing to provide data if privacy
protection measures are offered; and 27% marginally concerned, who barely care about privacy.
The negotiation feature in our scheme can help the data miner to collaborate with both hard-core
privacy fundamentalists and people comfortable with limited privacy disclosure.
The rest of this paper is organized as follows: In Section II, we briefly review previous approaches. We
present our models and introduce our new scheme in Section III. The communication protocol of our
new scheme and its basic components are also provided in this section. In Section IV, we present the
theoretical analysis of the tradeoff between accuracy and privacy in our scheme. The simulation results are
presented in Section V. In Section VI, we present the experimental results on the performance evaluation
of our scheme. Implementation and overhead issues are discussed in Section VII, followed by a final
remark in Section VIII.
II. APPROACHES
In this section, we first overview the definition of association rule mining. Then, we introduce our
models of the data miners. Based on the model, we review the randomization approach, which has been
widely used in the literature of privacy preserving data mining. We will address the problems associated
with the randomization approach, which motivates us to design a new privacy preserving scheme.
A. Association Rule Mining
A motivating example for association rule mining is a survey of hobbies. Each survey respondent
chooses an arbitrary number of hobbies from five options: football, soccer, beauty, video games, and PC
games. These options are called items. For a survey respondent, the answer to the survey is called a
transaction. As we can see, a transaction is a set of items (e.g., t = {football, soccer}). Given a number
of transactions, association rule mining finds interesting correlations (relationships) between items [1]. For
example, suppose that there are 5,000 survey respondents. The survey analyzer finds that within the 1,000
respondents who choose video games as their hobbies, there are 800 respondents who also choose PC
games as their hobbies. Based on the survey results, the survey analyzer may infer an association rule
that can be represented as follows.
video games ⇒ PC games [support = 16%, confidence = 80%]. (1)
The support of this rule is the percentage of the respondents who choose both video games and PC
games. Roughly speaking, the confidence is the probability that a respondent who chooses video games
also chooses PC games. Both support and confidence are measures of the validity and trustworthiness of
the association rules. Technically speaking, the main task of association rule mining is to find all frequent
itemsets, which are itemsets that have support larger than a threshold determined by the data miner.
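The support and confidence computations above can be sketched in a few lines of Python; the five transactions below are made up for illustration only.

```python
# A toy computation of support and confidence for the "survey of
# hobbies" example; these transactions are invented for illustration.
transactions = [
    {"football", "soccer"},
    {"video games", "PC games"},
    {"video games", "PC games", "soccer"},
    {"video games"},
    {"football"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent in t | antecedent in t)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

# Rule {video games} => {PC games}:
s = support({"video games", "PC games"}, transactions)       # 0.4
c = confidence({"video games"}, {"PC games"}, transactions)  # 2/3
```

Here two of the five transactions contain both items, and two of the three transactions containing video games also contain PC games, mirroring the support/confidence pattern of rule (1) on a smaller scale.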
B. Model of Data Miners
Due to the privacy concerns introduced into the system, we classify the data miners into two categories.
One category is legal data miners. These data miners always act legally in that they only perform regular
data mining tasks and would never intentionally invade the privacy of the data providers. The other
category is illegal data miners. These data miners would purposely compromise the privacy of the data
providers.
Like adversaries in distributed systems, illegal data miners come in many forms. In most forms, their
behavior is restricted in that they cannot arbitrarily deviate from the protocol. In this paper, we focus on a particular
sub-class of illegal miners. That is, in our system, illegal data miners are honest but curious: they follow
proper protocols (i.e., they are honest), but they analyze all intermediate communications and received
transactions (i.e., they are curious) to discover private information [9]. Even though it is a relaxation from
the Byzantine behavior, this kind of honest but curious (nevertheless illegal) behavior is common and has
been widely used as the adversary model in the literature.
C. Randomization Approach
To prevent invasion of privacy due to the existence of illegal data miners, countermeasures must be
implemented in the data mining system. We briefly review the randomization approach, which is currently
used to preserve privacy in association rule mining.
Based on the randomization approach, the entire data mining process is a two-step process. The first
step is in the data preparation phase. In this step, a data provider first applies the randomization algorithm
to the transaction it holds. Then, the data provider transmits the randomized transaction to the data
miner. In previous studies, several randomization algorithms have been proposed, including the cut-and-paste
operator [10] and the MASK operator [11]. For example, when the cut-and-paste operator is used, the
data provider first randomly chooses an integer j as the number of items that occur in both the original
transaction t and the randomized transaction R(t). After that, the data provider randomly chooses j items
from t and places these items into R(t). Then, for every item a ∉ t, the data provider tosses a coin
with probability ρ to place a into R(t).
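The behavior just described can be sketched as follows. This is a simplified illustration only: the actual cut-and-paste operator of [10] draws j from a specific distribution governed by a cutoff parameter, whereas here j is drawn uniformly.

```python
import random

def cut_and_paste(t, all_items, rho, rng=random):
    """Simplified sketch of the cut-and-paste randomization above.

    t         -- the original transaction (a set of items)
    all_items -- the full item universe I
    rho       -- probability of inserting each item a not in t
    Note: j is drawn uniformly here for illustration; the operator
    in [10] uses a different distribution with a cutoff parameter.
    """
    j = rng.randint(0, len(t))          # number of "real" items to keep
    kept = rng.sample(sorted(t), j)     # choose j items of t for R(t)
    r = set(kept)
    for a in all_items - t:             # coin toss for every item not in t
        if rng.random() < rho:
            r.add(a)
    return r
```

With ρ = 0 the output contains only real items, and with ρ = 1 every item outside t is inserted, which makes the two extremes easy to check.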
In the second step, the data miner performs association rule mining on the aggregate data. With the
randomization approach, the data miner must first employ a support recovery algorithm that attempts to
reconstruct the support of candidate itemsets.
Also in the second step, an illegal data miner may invade privacy by using a private data recovery
algorithm on the randomized transactions supplied by the data providers.
Figure 2 depicts the privacy preserving association rule mining system with the randomization approach.
Clearly, any such system should be measured by its capability of both generating accurate association
rules and preventing invasion of privacy.
D. Problems of the Randomization Approach
While the randomization approach is intuitive, researchers have recently identified some problems of
the randomization approach as follows.
Fig. 2. The randomization approach
• In [10], the authors remarked that if the cut-and-paste operator is applied to a transaction with 10
or more items, it is difficult, if not impossible, for the data provider to contribute to the association
rule mining with its privacy preserved. Furthermore, large itemsets have exceedingly high variances
on the recovered support. Similar problems exist with other randomization operators, as they share
a similar scheme for randomizing the original data.
An approach was proposed in [10] to solve this problem. In the approach, all data providers that
hold transactions with 10 or more items do not transfer the randomized transaction to the data miner.
Unfortunately, this approach prevents many frequent itemsets that contain 4 or more items from being
discovered by the data miner.
• In [14], the authors showed that the spectral properties of randomized data could help curious
data miners to separate noise from private data. Based on random matrix theory, they proposed
a filtering method to reconstruct private data from the randomized data set. They demonstrated that
the randomization approach preserves very little privacy in many cases. Although their work is based
on the randomization approach for privacy preserving data classification, we believe that the similarity
between the randomization operators used in association rule mining and in data classification makes the
problem inherent in the randomization approach.
• The randomization approach also suffers in efficiency. Since the privacy preserving mechanism is
not restricted to the data preparation phase, it puts a heavy load on (legal) data miners at run time
(because of the support recovery) [15]. It is shown that the cost of mining a randomized data set is
well within an order of magnitude with respect to that of mining the original data set.
We explore the reasons behind these problems as given below.
• We note that previous randomization approaches are transaction-invariant. That is, the same perturbation
algorithm is applied to all data. Since more items in the original transaction always result in
more “real” items being included in the randomized transaction, privacy protection for transactions
of large size (e.g., |t| > 10) is doomed to failure.
• Previous randomization approaches are item-invariant. All items in the original transaction have the
same probability of being included in the perturbed transaction. No specific operation is performed
to preserve the correlation between different items. Thus, many “real” items in the perturbed
transactions may not appear in any frequent itemset. That is, the disclosure of these items does not
contribute to the mining of association rules.
We remark that the transaction-invariant and item-invariant properties are inherent in the randomization
approach. The reason is that in a system using the randomization approach, the communication is one-way:
from the data providers to the data miner. As such, a data provider cannot obtain any specific guidance
on the perturbation of its transaction from the (legal) data miner, nor can the data providers learn the
correlation between the items. Thus, a data provider has no choice but to use a transaction-invariant and
item-invariant approach.
This observation motivates us to develop a new scheme that allows two-way communication between
the data miner and the data providers. The two-way communication helps preserve privacy while not
incurring too much overhead. Thereby, we significantly improve the performance in terms of accuracy,
privacy, and efficiency. We describe the new scheme in the next section.
III. COMMUNICATION PROTOCOL AND RELATED COMPONENTS
In this section, we introduce our new scheme including the communication protocol and its basic
components.
A. Description of Our New Scheme
Fig. 3. Our new scheme
Figure 3 depicts the infrastructure of a system using our new scheme. Our scheme has two key
components: perturbation guidance (PG) on the data miner (server) side and perturbation on the data provider
(client) side. Compared to the randomization approach, our scheme does not have the support recovery
component. Instead, the association rule mining is performed on the perturbed transactions (R(t)) directly.
Thus, our scheme is restricted to the data preparation stage and does not put a heavy load on the data
miner by recovering the support at run time.
Our scheme is a three-step process. In the first step, the data miner negotiates a perturbation level
k with each data provider. The larger k is, the more contribution R(t) will make to the association rule
mining task. The smaller k is, the more private information is preserved. Thus, a privacy fundamentalist
can choose a small k to preserve its privacy, while a privacy-unconcerned data provider can choose a large
k to contribute to the association rule mining.
The second step is to transmit the perturbed transactions from the data providers to the data miner.
Since each data provider arrives at a different time (e.g., different survey respondents take the survey at
different times), this step can be considered an iterative process. In each stage, the data miner dispatches
a reference (perturbation guidance) Vk to a data provider Pi. Here, Vk depends on the perturbation level
k that is negotiated by the data miner and the data provider Pi in the first step. Based on the received Vk,
the perturbation component of Pi computes the perturbed transaction R(t) from its original transaction
t. Then, Pi transmits R(t) to the perturbation guidance (PG) component of the data miner. The PG
component then updates Vk based on R(t) and forwards R(t) to the association rule mining process. A
curious data miner can also obtain R(t). In this case, the curious data miner uses a private data recovery
algorithm to compromise privacy in R(t).
In the third step, the perturbed transactions received by the data miner are used by the association rule
mining process. Association rules are identified and delivered to the data miner.
The key here is to properly design Vk so that correct guidance can be given to the data providers on how
to perturb the transactions. In our scheme, we let Vk be an algebraic quantity derived from the perturbed
transactions received so far. The details of computing Vk will be presented as a basic component
of our scheme.
B. Notions of Transactions
Before presenting the details of the communication protocol and its basic components, we first introduce
some notions of the data set. Let I be a set of n items (i.e., I = {a1, . . . , an}). Suppose that there are
m data providers in the system. Each data provider Ci holds a transaction ti, which is a subset of I . We
represent the data set by an m × n matrix T = [a1, . . . , an] = [t1; . . . ; tm]¹. An example of the transaction
matrix T is shown in Table I. Let 〈T 〉ij be the element of T with indices i and j. The element 〈T 〉ij
indicates whether item aj is in transaction ti. For example, if transaction t1 contains items a1 and a2, the
first row of the matrix has 〈T 〉1,1 = 〈T 〉1,2 = 1.
TABLE I
AN EXAMPLE OF A TRANSACTION MATRIX

       a1   a2   · · ·   an
t1      1    1   · · ·    0
...    ...  ...  . . .   ...
tm      1    0   · · ·    1
An itemset B ⊆ I is an h-itemset if and only if B contains h items (i.e., |B| = h). The support of B
is the percentage of the transactions in the data set that contain B. That is,

supp(B) = |{t ∈ T | B ⊆ t}| / m.     (2)

An h-itemset B is frequent if supp(B) ≥ min supp, where min supp is a predefined minimum threshold
of support. Referring to the “survey of hobbies” example in Section II, {video games, PC games}
is a frequent 2-itemset with support 0.16. The set of frequent h-itemsets is denoted by Lh.
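Eq. (2) and the definition of Lh can be checked with a brute-force enumeration. The sketch below is exponential in the number of items and is meant only to make the definitions concrete; practical miners use Apriori-style pruning [3]–[5].

```python
from itertools import combinations

def frequent_itemsets(transactions, h, min_supp):
    """Enumerate all frequent h-itemsets L_h per Eq. (2):
    supp(B) = |{t in T : B is a subset of t}| / m.
    Brute force over all size-h candidates; illustration only."""
    m = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for cand in combinations(items, h):
        b = set(cand)
        supp = sum(b <= t for t in transactions) / m
        if supp >= min_supp:
            result[cand] = supp
    return result
```

For instance, over the four toy transactions in the test below, {a, b} is the only frequent 2-itemset at min supp = 0.5.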
C. The Communication Protocol
We now describe the communication protocol of our new scheme. The negotiation between the data
miner and the data providers is shown in Protocol 1. On the data miner (server) side, two threads
perform the operations in Protocol 2 and Protocol 3 iteratively after the negotiation. A data provider
performs the operations in Protocol 4 to perturb and transmit its transaction to the data miner.
¹Here ti and ai are used somewhat ambiguously. In the context of association rule mining, ti is a transaction and ai is an item. In the
context of matrices, ti represents a row vector of T and ai represents a column vector of T.
Protocol 1 Negotiation
NM1. Based on the SVD of T∗ (T∗ = U∗Σ∗V∗′), the data miner calculates S = 〈Σ∗〉²_{11} + · · · + 〈Σ∗〉²_{nn};
NM2. Find the smallest k ∈ [1, n] such that ∑_{i=1}^{k} 〈Σ∗〉²_{ii} ≥ µ · S;
NM3. The data miner dispatches k to registered data providers;
NP1. For a data provider Ci,
  if Ci receives k ≤ Kt (Kt is the threshold of truncation level set by Ci) then
    Ci sends a ready message to the data miner;
  end if
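Step NM2 of Protocol 1 reduces to a cumulative-sum scan over the squared singular values; a minimal sketch:

```python
def truncation_level(squared_singular_values, mu):
    """Step NM2 of Protocol 1: the smallest k such that the sum of the
    k largest squared singular values reaches mu * S, where S is the
    sum of all of them."""
    s = sorted(squared_singular_values, reverse=True)
    total = sum(s)
    running = 0.0
    for k, val in enumerate(s, start=1):
        running += val
        if running >= mu * total:
            return k
    return len(s)
```

For example, for squared singular values {9, 4, 1, 1} and µ = 0.85, the target is 12.75, so k = 2 is the smallest level that suffices.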
Protocol 2 Thread of registering data providers
R1. Negotiate on the truncation level k with a data provider;
R2. Wait for a ready message from a data provider;
R3. Upon receiving the ready message from a data provider,
  • Register the data provider;
  • Send the data provider the current Vk;
R4. Go to Step R1;
D. Basic Components
There are three key components in the communication protocol of our scheme: (a) the method of
computing Vk, (b) the perturbation function R(·), and (c) the negotiation on truncation level k.
1) Computation of Vk: Recall that Vk carries information from the data miner to data providers on how
to perturb the original transactions to preserve privacy. In our scheme, Vk is an estimate of the eigenvectors
of A = T′T (i.e., the right singular vectors of T). The justification that Vk yields accurate mining
results while preserving privacy is presented in Appendix I.
As we are considering dynamic cases where the perturbed transactions are dynamically fed to the data
miner, the data miner keeps a copy of all received (perturbed) transactions and updates it when a new
perturbed transaction is received. Assume that the initial set of received (perturbed) transactions T ∗ is
Protocol 3 Thread of receiving transactions
T1. Wait for a perturbed transaction R(t) from a data provider;
T2. Upon receiving the transaction from a registered data provider,
  • Update Vk based on the recently received perturbed transaction;
  • Deregister the data provider;
T3. Go to Step T1;
Protocol 4 Transaction perturbation
P1. Negotiate on the truncation level k with the data miner;
P2. Send the data miner a ready message indicating that this provider is ready to contribute to the mining
process;
P3. Wait for a message that contains Vk from the data miner;
P4. Upon receiving the message from the data miner,
  • Compute R(t) based on t and Vk;
P5. Transfer R(t) to the data miner;
empty². Every time a perturbed transaction R(t) is received, T∗ is updated by appending R(t)
to the bottom of T∗. Thus, T∗ is the matrix of currently received (perturbed) transactions. We derive Vk
from T∗.
In particular, the computation of Vk is done in the following steps. Using singular value decomposition
(SVD) [16], we decompose T ∗ as (3), where Σ∗ = diag(s1, . . . , sn) is a diagonal matrix with s1 ≥ · · · ≥ sn.
T ∗ = U∗Σ∗V ∗′. (3)
The values si² are the eigenvalues of A∗ = T∗′T∗. V∗ is an n × n unitary matrix composed of the
eigenvectors of A∗.
Vk is composed of the k eigenvectors of A∗ that are associated with the largest k eigenvalues of A∗. If
²T∗ may also contain some transactions provided by privacy-careless data providers.
V ∗ = [v1, . . . , vn], we have
Vk = [v1, . . . , vk]. (4)
Thus, we call Vk the k-truncation of V∗. Several incremental algorithms have been proposed to update
Vk when a perturbed transaction is received by the data miner [17], [18]. The computing cost of updating
Vk is addressed in Sect. VII.
As we will see in Sect. IV, k plays a critical role in balancing accuracy and privacy. We will also show
that by using Vk in conjunction with R(·), which is to be discussed next, we can achieve both desired
accuracy and privacy protection.
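Since Vk consists of the leading eigenvectors of A∗ = T∗′T∗, the k = 1 case can be illustrated with plain power iteration. This from-scratch sketch is for intuition only; a production system would use an SVD routine or the incremental update algorithms of [17], [18].

```python
import math

def top_right_singular_vector(T, iters=200):
    """Power iteration on A* = T*'T* to approximate v1, the right
    singular vector of T* with the largest singular value (i.e., V_1,
    the 1-truncation of V*). T is a list of 0/1 rows."""
    n = len(T[0])
    # Form the n x n symmetric matrix A = T'T.
    A = [[sum(row[i] * row[j] for row in T) for j in range(n)]
         for i in range(n)]
    v = [1.0 / math.sqrt(n)] * n        # arbitrary unit starting vector
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]       # renormalize each iteration
    return v
```

On a tiny transaction matrix where items a1 and a2 co-occur and a3 is isolated, the dominant eigenvector concentrates on the correlated columns and vanishes on the isolated one.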
2) Perturbation function: Recall that once a data provider receives Vk from the data miner, the data
provider applies a perturbation function R(·) to its transaction t. The result is a perturbed transaction
R(t) that will be transmitted to the data miner. The computation of R(t) is defined as follows. First, for
a given Vk, the transaction t is transformed into the vector t = tVkV′k, whose elements may
not be integers. Algorithm 5 is employed to round t to 0 or 1. In this algorithm, ρt is a pre-defined
parameter. Finally, for the completeness of our work, we introduce an optional procedure to enhance the
privacy preserving capability of the system, which is shown in Algorithm 6. A data provider may use
this algorithm to insert additional noise into R(t). In the algorithm, ρm is a parameter determined by the
data provider. The higher ρm is, the more noise is inserted into the perturbed transaction. We remark that
this procedure is optional and is only needed by privacy fundamentalists. An example of the perturbation
process is provided in Appendix II.
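The projection tVkV′k followed by the rounding of Algorithm 5 can be sketched in pure Python (small inputs only; real deployments would use a linear algebra library):

```python
def perturb(t, Vk, rho_t):
    """Sketch of R(t): project t onto the subspace spanned by the k
    columns of Vk (t <- t Vk Vk'), then round per Algorithm 5.

    t     -- 0/1 list of length n (the original transaction)
    Vk    -- n x k matrix (list of n rows), the k-truncation of V*
    rho_t -- rounding parameter: entries >= 1 - rho_t become 1
    """
    n, k = len(Vk), len(Vk[0])
    # y = t Vk, a length-k vector of coordinates in the subspace
    y = [sum(t[i] * Vk[i][j] for i in range(n)) for j in range(k)]
    # t_bar = y Vk', the projection back into item space
    t_bar = [sum(y[j] * Vk[i][j] for j in range(k)) for i in range(n)]
    return [1 if x >= 1 - rho_t else 0 for x in t_bar]
```

With Vk spanning the first two coordinate axes, any weight on the third item is projected away and rounded to 0, which is exactly how unwanted items get suppressed.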
3) Negotiation on truncation level: In order to retain enough information for association rule mining
after the transformation from t to R(t), a textbook heuristic is to make the sum of the k eigenvalues
associated with the retained eigenvectors of A∗ larger than 85% of the sum of the eigenvalues of A∗ (i.e.,
µ = 85%) [16], [19]. The perturbation level k is usually large at the beginning but soon decreases and
stabilizes at a fairly small value (e.g., less than 1% of n). Thus, in the theoretical analysis of our scheme, we consider
the perturbation level k as a predetermined parameter rather than a variable updated throughout the data
Algorithm 5 Mapping
Let 〈t〉i be the element of vector t with index i. Similar notations apply to other vectors.
for every element 〈t〉i in t do
if 〈t〉i ≥ 1 − ρt then
〈R(t)〉i = 1
else
〈R(t)〉i = 0
end if
end for
Algorithm 6 Random-noise perturbation
for every item ai ∉ t do
Choose a real number j uniformly at random on [0, 1]
if j ≥ 1 − ρm then
〈R(t)〉i = 1
end if
end for
preparation process.
We have described the communication protocol of our scheme and its key components. We now discuss
the accuracy and privacy measures of our scheme.
IV. ANALYSIS ON ACCURACY AND PRIVACY
In this section, we analyze our new scheme. We define measures for accuracy and privacy and derive
their bounds, in order to provide guidelines for the tradeoff between these two measures and hence help
system managers set parameters. We also show the simulation and experimental results of our scheme
on real datasets. For the simplicity of our discussion, we do not consider the optional random noise
insertion procedure in Algorithm 6.
A. Accuracy Measure
An accuracy measure should reflect the capability of the system to correctly identify association
rules in a given dataset. We define the accuracy measure as the error in the support of frequent itemsets.
This is because the main task of association rule mining is to identify frequent itemsets with support
larger than a threshold min supp. There are two kinds of errors: a) false drops, which are unidentified
frequent itemsets, and b) false positives, which are itemsets incorrectly identified as frequent.
We now formally define our accuracy measure. Given itemset Ij , let supp(Ij) and supp′(Ij) be the
support of Ij in the original transactions T and the perturbed transactions R(T ), respectively. Recall that
the set of frequent h-itemsets in T is Lh. We define the errors on false drops and false positives as follows.
Definition 4.1: Given itemset size h, the error on false drops, ρ_1^h, and the error on false positives, ρ_2^h,
are defined as

ρ_1^h = max_{Ij ∈ Lh} ( supp(Ij) − supp′(Ij) ),     (5)
ρ_2^h = max_{Ij ∉ Lh} ( supp′(Ij) − supp(Ij) ).     (6)
We define our accuracy measure, degree of accuracy, as the maximum value of ρ_1^h and ρ_2^h over all sizes of
itemsets.
Definition 4.2: The degree of accuracy in a privacy preserving association rule mining system is defined
as γ = max_{h≥1} max(ρ_1^h, ρ_2^h).
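Definitions 4.1 and 4.2 can be computed directly from two support tables. The dictionaries in the test below are hypothetical inputs, and the sketch evaluates the errors over the candidate itemsets present in the original table:

```python
def degree_of_accuracy(supp_orig, supp_pert, min_supp):
    """Compute gamma per Definitions 4.1/4.2 over a common candidate
    set (the keys of supp_orig).

    supp_orig -- dict: itemset (frozenset) -> support in T
    supp_pert -- dict: itemset (frozenset) -> support in R(T)
    """
    # False drops: support lost on itemsets that are frequent in T.
    rho1 = max((supp_orig[b] - supp_pert.get(b, 0.0)
                for b in supp_orig if supp_orig[b] >= min_supp),
               default=0.0)
    # False positives: support gained on itemsets infrequent in T.
    rho2 = max((supp_pert.get(b, 0.0) - supp_orig[b]
                for b in supp_orig if supp_orig[b] < min_supp),
               default=0.0)
    return max(rho1, rho2)
```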
Based on the definition, we derive an upper bound on the accuracy measure.
Theorem 4.3: In our system, the degree of accuracy γ satisfies

γ ≤ (σ²_{k+1} / m) · ( 1 + max{ (1 − (1 − ρt)²) / (1 − ρt)² , (1 − ρt) / ρt } ),     (7)

where σ²_{k+1} is the (k+1)th largest eigenvalue of A = T′T. In particular, when ρt = (3 − √5)/2, γ reaches
its lowest upper bound, γ ≤ 2.618 σ²_{k+1} / m.
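The constant 2.618 can be verified numerically: at ρt = (3 − √5)/2 the two terms inside the max of (7) coincide at the golden ratio φ ≈ 1.618, so the bracketed factor equals 1 + φ ≈ 2.618, and any other ρt gives a larger factor.

```python
import math

def bound_factor(rho_t):
    """The bracketed factor in Eq. (7):
    1 + max{ (1-(1-rho_t)^2)/(1-rho_t)^2 , (1-rho_t)/rho_t }."""
    q = (1 - rho_t) ** 2
    return 1 + max((1 - q) / q, (1 - rho_t) / rho_t)

rho_star = (3 - math.sqrt(5)) / 2   # ~0.382, the minimizing choice
# At rho_star both terms equal the golden ratio (1 + sqrt(5)) / 2,
# so bound_factor(rho_star) = 1 + 1.618... = 2.618...
```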
The proof of Theorem 4.3 can be found in Appendix III. Our bound on accuracy measure is fairly small
when the number of transactions (m) is sufficiently large. This is usually the case in reality. Actually, our
scheme tends to enlarge the support of frequent itemsets and reduce the support of infrequent itemsets.
Thus, the upper bound is not always tight. We can observe this trend from the experimental results, which
will be presented in Section VI.
B. Privacy Measure
In our system, the data miner cannot deterministically infer the original transaction t from a perturbed
transaction R(t), because VkV′k is a singular matrix with determinant det(VkV′k) = 0 (i.e., it does not
have an inverse). To measure the probability that an item in t can be identified from R(t), we need
a privacy measure.
According to survey results in [12], a data provider (e.g., a net user) always has a strong desire to filter out
“unwanted” data (i.e., data that do not contribute to the mining of association rules) before transmitting its private
data to the data miner. Given transaction t, an item ai in t is unwanted if ai does not appear in any frequent
itemset (i.e., ∀h ≥ 1, ai ∉ {Lh | Lh ⊆ t}). That is, the disclosure of ai (i.e., ai ∈ R(t)) does not contribute
to the mining of association rules. We measure privacy by the probability that an “unwanted” item is
included in the perturbed transaction. Formally speaking, our privacy measure, named level of privacy, is
defined as follows.
Definition 4.4: Given transaction t, an item ai ∈ t is unwanted if and only if there does not exist any
frequent itemset I ⊆ t such that ai ∈ I. We define the level of privacy as the probability that an unwanted
item in t is included in the perturbed transaction R(t). That is, the level of privacy is defined as

δ = Pr{ai ∈ R(t) | ai is unwanted in t}.     (8)
As we can see from the definition, a higher level of privacy results in a higher probability of privacy
invasion. With an approximation, we derive an upper bound on the level of privacy in our scheme.
Theorem 4.5: With properly set parameters, the level of privacy in our system is bounded by

δ ≲ 1 − √( (σ²_{k+1} + · · · + σ²_n) / (σ²_1 + · · · + σ²_n) ),     (9)

where σ²_i is the ith largest eigenvalue of A = T′T.
The proof of Theorem 4.5 can be found in Appendix IV. Because of the uncertainty in the rounding-off
operation in Algorithm 5, this bound is not tight.
As we can see from Theorem 4.3 and Theorem 4.5, there is a tradeoff between accuracy and privacy in
our system. The upper bound on the degree of accuracy is proportional to σ²_{k+1}. The level of privacy is a
decreasing function of σ²_{k+1}. The larger the perturbation level k is, the more unwanted items are disclosed
to the data miner, and the more frequent itemsets can be correctly identified.
V. SIMULATION RESULTS
In this section, we present the simulation results of our scheme on a randomly generated dataset. The
experimental results on a real dataset will be presented in the next section.
The randomly generated dataset consists of 2,000 transactions. There are 20 items in the dataset, and
each transaction is a subset of the 20 items. We represent the dataset by a 2,000 × 20 matrix T. The
first 19 columns (items) of T , a1, . . . , a19, are independently generated. The 20th column is set to be the
same as the 10th column (i.e., a10 = a20). The first item a1 has a probability of 0.6 to be included in a
transaction. All other columns have a probability of 0.1 to be included in a transaction. Thus, the expected
support of a1 is 0.6. The expected support of every other item is 0.1. Since a10 = a20, the expected
support of {a10, a20} is 0.1. We set min supp = 0.09 as the threshold (lower bound) on the support of a
frequent itemset.
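A dataset of this shape can be regenerated along the following lines. The seed is arbitrary, so the realized supports will differ slightly from those reported in (10):

```python
import random

def generate_dataset(m=2000, seed=42):
    """Generate a synthetic dataset like the one in Section V:
    item a1 appears with probability 0.6, items a2..a19 with
    probability 0.1 each, and a20 is a copy of a10, planting the
    frequent 2-itemset {a10, a20}. The seed here is arbitrary."""
    rng = random.Random(seed)
    T = []
    for _ in range(m):
        row = [1 if rng.random() < 0.6 else 0]                      # a1
        row += [1 if rng.random() < 0.1 else 0 for _ in range(18)]  # a2..a19
        row.append(row[9])                                          # a20 = a10
        T.append(row)
    return T
```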
In the randomly generated dataset, the support of {a1} and {a10, a20} are listed as follows.
support of {a1} = 0.6085, support of {a10, a20} = 0.0985. (10)
As we can see, these two itemsets have support much higher than the other 1-itemsets and 2-itemsets,
respectively. The left part of Figure 4 shows the original dataset. In the original dataset, the items appear
4,929 times in total, including 1,217 occurrences of a1, 197 of a10, 197 of a20, and 3,318 of the other
items.
In our simulation, we set the parameters as µ = 0.85 (in Protocol 1), ρt = 0.8 (in Algorithm 5),
and ρm = 0 (in Algorithm 6). The data miner updates Vk after every 10 transactions are received. The
truncation level k calculated from Protocol 1 is listed as follows. The right part of Fig. 4 shows the