Secure Mining of Association Rules in Horizontally Distributed Databases
Tamir Tassa
Abstract: We propose a protocol for secure mining of association rules in horizontally distributed databases. The current leading protocol is that of Kantarcioglu and Clifton [18]. Our protocol, like theirs, is based on the Fast Distributed Mining (FDM) algorithm of Cheung et al. [8], which is an unsecured distributed version of the Apriori algorithm. The main ingredients in our protocol are two novel secure multi-party algorithms: one that computes the union of private subsets that each of the interacting players holds, and another that tests the inclusion of an element held by one player in a subset held by another. Our protocol offers enhanced privacy with respect to the protocol in [18]. In addition, it is simpler and is significantly more efficient in terms of communication rounds, communication cost and computational cost.
Index Terms: Privacy preserving data mining, distributed computation, frequent item sets, association rules
1 INTRODUCTION
We study here the problem of secure mining of association rules in horizontally partitioned databases. In that setting, there are several sites (or players) that hold homogeneous databases, i.e., databases that share the same schema but hold information on different entities. The goal is to find all association rules with support at least s and confidence at least c, for some given minimal support size s and confidence level c, that hold in the unified database, while minimizing the information disclosed about the private databases held by those players. The information that we would like to protect in this context is not only individual transactions in the different databases, but also more global information such as what association rules are supported locally in each of those databases.
That goal defines a problem of secure multi-party computation. In such problems, there are M players that hold private inputs, $x_1, \ldots, x_M$, and they wish to securely compute $y = f(x_1, \ldots, x_M)$ for some public function f. If there existed a trusted third party, the players could surrender their inputs to him and he would perform the function evaluation and send them the resulting output. In the absence of such a trusted third party, the players need a protocol that they can run on their own in order to arrive at the required output y. Such a protocol is considered perfectly secure if no player can learn from his view of the protocol more than what he would have learnt in the idealized setting where the computation is carried out by a trusted third party. Yao [32] was the first to propose a generic solution for this problem in the case of two players. Other generic solutions, for the multi-party case, were later proposed in [3], [5], [15].
In our problem, the inputs are the partial databases, and the required output is the list of association rules that hold in the unified database with support and confidence no smaller than the given thresholds s and c, respectively. As the above mentioned generic solutions rely upon a description of the function f as a Boolean circuit, they can be applied only to small inputs and functions which are realizable by simple circuits. In more complex settings, such as ours, other methods are required for carrying out this computation. In such cases, some relaxations of the notion of perfect security might be inevitable when looking for practical protocols, provided that the excess information is deemed benign (see examples of such protocols in, e.g., [18], [28], [29], [31], [34]).
Kantarcioglu and Clifton studied that problem in [18] and devised a protocol for its solution. The main part of the protocol is a sub-protocol for the secure computation of the union of private subsets that are held by the different players. (The private subset of a given player, as we explain below, includes the item sets that are s-frequent in his partial database.) That is the most costly part of the protocol and its implementation relies upon cryptographic primitives such as commutative encryption, oblivious transfer, and hash functions. This is also the only part in the protocol in which the players may extract from their view of the protocol information on other databases, beyond what is implied by the final output and their own input. While such leakage of information renders the protocol not perfectly secure, the perimeter of the excess information is explicitly bounded in [18] and it is argued there that such information leakage is innocuous, whence acceptable from a practical point of view.
Herein we propose an alternative protocol for the secure computation of the union of private subsets. The proposed protocol improves upon that in [18] in terms of simplicity and efficiency as well as privacy. In particular, our protocol does not depend on commutative encryption and oblivious transfer (which simplifies it significantly and contributes towards much reduced communication and computational costs). While our solution is still not perfectly secure, it leaks excess information only to a small number (three) of possible coalitions, unlike the protocol of [18] that discloses information also to some single players. In addition, we claim that the excess information that our protocol may leak is less sensitive than the excess information leaked by the protocol of [18].
The author is with the Department of Mathematics and Computer Science, The Open University, 1 University Road, Raanana 43537, Israel.
Manuscript received 18 Feb. 2012; revised 23 May 2012; accepted 4 Mar. 2013; date of publication 6 Mar. 2013; date of current version 18 Mar. 2014. Recommended for acceptance by Y. Koren. Digital Object Identifier no. 10.1109/TKDE.2013.41
IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 4, April 2014
1041-4347 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
The protocol that we propose here computes a parameterized family of functions, which we call threshold functions, in which the two extreme cases correspond to the problems of computing the union and intersection of private subsets. Those are in fact general-purpose protocols that can be used in other contexts as well. Another problem of secure multi-party computation that we solve here as part of our discussion is the set inclusion problem; namely, the problem where Alice holds a private subset of some ground set, and Bob holds an element in the ground set, and they wish to determine whether Bob's element is within Alice's subset, without revealing to either of them information about the other party's input beyond the above described inclusion.
1.1 Preliminaries
1.1.1 Definitions and Notations
Let D be a transaction database. As in [18], we view D as a binary matrix of N rows and L columns, where each row is a transaction over some set of items, $A = \{a_1, \ldots, a_L\}$, and each column represents one of the items in A. (In other words, the $(i, j)$th entry of D equals 1 if the ith transaction includes the item $a_j$, and 0 otherwise.) The database D is partitioned horizontally between M players, denoted $P_1, \ldots, P_M$. Player $P_m$ holds the partial database $D_m$ that contains $N_m = |D_m|$ of the transactions in D, $1 \le m \le M$. The unified database is $D = D_1 \cup \cdots \cup D_M$, and it includes $N := \sum_{m=1}^{M} N_m$ transactions.
An item set X is a subset of A. Its global support, $\mathrm{supp}(X)$, is the number of transactions in D that contain it. Its local support, $\mathrm{supp}_m(X)$, is the number of transactions in $D_m$ that contain it. Clearly, $\mathrm{supp}(X) = \sum_{m=1}^{M} \mathrm{supp}_m(X)$. Let s be a real number between 0 and 1 that stands for a required support threshold. An item set X is called s-frequent if $\mathrm{supp}(X) \ge sN$. It is called locally s-frequent at $D_m$ if $\mathrm{supp}_m(X) \ge sN_m$.
For each $1 \le k \le L$, let $F_s^k$ denote the set of all k-item sets (namely, item sets of size k) that are s-frequent, and $F_s^{k,m}$ be the set of all k-item sets that are locally s-frequent at $D_m$, $1 \le m \le M$. Our main computational goal is to find, for a given threshold support $0 < s \le 1$, the set of all s-frequent item sets, $F_s := \bigcup_{k=1}^{L} F_s^k$. We may then continue to find all $(s, c)$-association rules, i.e., all association rules of support at least sN and confidence at least c. (Recall that if X and Y are two disjoint subsets of A, the support of the corresponding association rule $X \Rightarrow Y$ is $\mathrm{supp}(X \cup Y)$ and its confidence is $\mathrm{supp}(X \cup Y)/\mathrm{supp}(X)$.)
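The definitions of support and confidence translate directly into a few lines of code. The following is a minimal Python sketch; the function names `supp` and `rule_stats` and the toy database are hypothetical and serve illustration only.

```python
from fractions import Fraction

def supp(db, itemset):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)

def rule_stats(db, X, Y):
    """Support and confidence of the rule X => Y for disjoint X and Y.

    Assumes supp(db, X) > 0; otherwise the confidence is undefined.
    """
    assert not (X & Y), "X and Y must be disjoint"
    s_xy = supp(db, X | Y)
    return s_xy, Fraction(s_xy, supp(db, X))

# Hypothetical toy database of four transactions over the items {1, 2, 3}.
toy_db = [frozenset(t) for t in ({1, 2}, {1, 2, 3}, {1, 3}, {2, 3})]
support, confidence = rule_stats(toy_db, frozenset({1}), frozenset({2}))
```

Exact fractions are used so that comparisons against a confidence threshold c are not affected by floating-point rounding.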
1.1.2 The Fast Distributed Mining Algorithm
The protocol of [18], as well as ours, is based on the Fast Distributed Mining (FDM) algorithm of Cheung et al. [8], which is an unsecured distributed version of the Apriori algorithm. Its main idea is that any s-frequent item set must be also locally s-frequent in at least one of the sites. Hence, in order to find all globally s-frequent item sets, each player reveals his locally s-frequent item sets and then the players check each of them to see if they are s-frequent also globally. The FDM algorithm proceeds as follows:
1. Initialization: It is assumed that the players have already jointly calculated $F_s^{k-1}$. The goal is to proceed and calculate $F_s^k$.
2. Candidate Sets Generation: Each player $P_m$ computes the set of all $(k-1)$-item sets that are locally frequent in his site and also globally frequent; namely, $P_m$ computes the set $F_s^{k-1,m} \cap F_s^{k-1}$. He then applies on that set the Apriori algorithm in order to generate the set $B_s^{k,m}$ of candidate k-item sets.
3. Local Pruning: For each $X \in B_s^{k,m}$, $P_m$ computes $\mathrm{supp}_m(X)$. He then retains only those item sets that are locally s-frequent. We denote this collection of item sets by $C_s^{k,m}$.
4. Unifying the candidate item sets: Each player broadcasts his $C_s^{k,m}$ and then all players compute $C_s^k := \bigcup_{m=1}^{M} C_s^{k,m}$.
5. Computing local supports: All players compute the local supports of all item sets in $C_s^k$.
6. Broadcast mining results: Each player broadcasts the local supports that he computed. From that, everyone can compute the global support of every item set in $C_s^k$. Finally, $F_s^k$ is the subset of $C_s^k$ that consists of all globally s-frequent k-item sets.
In the first iteration, when $k = 1$, the set $C_s^{1,m}$ that the mth player computes (Steps 2-3) is just $F_s^{1,m}$, namely, the set of single items that are s-frequent in $D_m$. The complete FDM algorithm starts by finding all single items that are globally s-frequent. It then proceeds to find all 2-item sets that are globally s-frequent, and so forth, until it finds the longest globally s-frequent item sets. If the length of such item sets is K, then in the $(K+1)$th iteration of the FDM it will find no $(K+1)$-item sets that are globally s-frequent, in which case it terminates.
1.1.3 A Running Example
Let D be a database of $N = 18$ item sets over a set of $L = 5$ items, $A = \{1, 2, 3, 4, 5\}$. It is partitioned between $M = 3$ players, and the corresponding partial databases are:
$D_1 = \{12, 12345, 124, 1245, 14, 145, 235, 24, 24\}$,
$D_2 = \{1234, 134, 23, 234, 2345\}$,
$D_3 = \{1234, 124, 134, 23\}$.
For example, $D_1$ includes $N_1 = 9$ transactions, the third of which (in lexicographic order) consists of three items: 1, 2 and 4. Setting $s = 1/3$, an item set is s-frequent in D if it is supported by at least $6 = sN$ of its transactions. In this case,
$F_s^1 = \{1, 2, 3, 4\}$, $F_s^2 = \{12, 14, 23, 24, 34\}$, $F_s^3 = \{124\}$, $F_s^4 = F_s^5 = \emptyset$,
and $F_s = F_s^1 \cup F_s^2 \cup F_s^3$. For example, the item set 34 is indeed globally s-frequent since it is contained in 7 transactions of D. However, it is locally s-frequent only in $D_2$ and $D_3$.
In the first round of the FDM algorithm, the three players compute the sets $C_s^{1,m}$ of all 1-item sets that are locally frequent at their partial databases:
$C_s^{1,1} = \{1, 2, 4, 5\}$, $C_s^{1,2} = \{1, 2, 3, 4\}$, $C_s^{1,3} = \{1, 2, 3, 4\}$.
Hence, $C_s^1 = \{1, 2, 3, 4, 5\}$. Consequently, all 1-item sets have to be checked for being globally frequent; that check reveals that the subset of globally s-frequent 1-item sets is $F_s^1 = \{1, 2, 3, 4\}$.
In the second round, the candidate item sets are:
$C_s^{2,1} = \{12, 14, 24\}$, $C_s^{2,2} = \{13, 14, 23, 24, 34\}$, $C_s^{2,3} = \{12, 13, 14, 23, 24, 34\}$.
(Note that 15, 25, 45 are locally s-frequent at $D_1$ but they are not included in $C_s^{2,1}$ since 5 was already found to be globally infrequent.) Hence, $C_s^2 = \{12, 13, 14, 23, 24, 34\}$. Then, after verifying global frequency, we are left with $F_s^2 = \{12, 14, 23, 24, 34\}$.
In the third round, the candidate item sets are:
$C_s^{3,1} = \{124\}$, $C_s^{3,2} = \{234\}$, $C_s^{3,3} = \{124\}$.
So, $C_s^3 = \{124, 234\}$ and, then, $F_s^3 = \{124\}$. There are no more frequent item sets.
1.2 Overview and Organization of the Paper
The FDM algorithm violates privacy in two stages: In Step 4, where the players broadcast the item sets that are locally frequent in their private databases, and in Step 6, where they broadcast the sizes of the local supports of candidate item sets. Kantarcioglu and Clifton [18] proposed secure implementations of those two steps. Our improvement is with regard to the secure implementation of Step 4, which is the more costly stage of the protocol, and the one in which the protocol of [18] leaks excess information. In Section 2 we describe Kantarcioglu and Clifton's secure implementation of Step 4. We then describe our alternative implementation and proceed to analyze the two implementations in terms of privacy and efficiency and compare them. We show that our protocol offers better privacy and that it is simpler and is significantly more efficient in terms of communication rounds, communication cost and computational cost.
In Sections 3 and 4 we discuss the implementation of the two remaining steps of the distributed protocol: The identification of those candidate item sets that are globally s-frequent, and then the derivation of all $(s, c)$-association rules. In Section 5 we briefly describe an alternative protocol, already considered in [9], [18], which offers full security at enhanced costs. Section 6 describes our experimental evaluation, which illustrates the significant advantages of our protocol in terms of communication and computational costs. Section 7 includes a review of related work. We conclude the paper in Section 8.
Like in [18], we assume that the players are semi-honest; namely, they follow the protocol but try to extract as much information as possible from their own view. (See [17], [26], [34] for a discussion and justification of that assumption.) We too, like [18], assume that $M > 2$. (The case $M = 2$ is discussed in [18, Section 5]; the conclusion is that the problem of secure computation of frequent item sets and association rules in the two-party case is unlikely to be of any use.)
2 SECURE COMPUTATION OF ALL LOCALLY FREQUENT ITEM SETS
Here we discuss the secure implementation of Step 4 in the FDM algorithm, namely, the secure computation of the union $C_s^k = \bigcup_{m=1}^{M} C_s^{k,m}$. We describe the protocol of [18] (Section 2.1) and then our protocol (Sections 2.2 and 2.3). We analyze the privacy of the two protocols in Section 2.4, their communication cost in Section 2.5, and their computational cost in Section 2.6.
2.1 The Protocol of Kantarcioglu and Clifton for the Secure Computation of All Locally Frequent Item Sets
2.1.1 Overview
Protocol 1 is the protocol that was suggested by Kantarcioglu and Clifton [18] for computing the unified list of all locally frequent item sets, $C_s^k = \bigcup_{m=1}^{M} C_s^{k,m}$, without disclosing either the sizes of the subsets $C_s^{k,m}$ or their contents. The protocol is applied when the players already know $F_s^{k-1}$, the set of all $(k-1)$-item sets that are globally s-frequent, and they wish to proceed and compute $F_s^k$. We refer to it hereinafter as Protocol UNIFI-KC (Unifying lists of locally Frequent Item sets, Kantarcioglu and Clifton).
The input that each player $P_m$ has at the beginning of Protocol UNIFI-KC is the collection $C_s^{k,m}$, as defined in Steps 2-3 of the FDM algorithm. Let $Ap(F_s^{k-1})$ denote the set of all candidate k-item sets that the Apriori algorithm generates from $F_s^{k-1}$. Then, as implied by the definition of $C_s^{k,m}$ (see Section 1.1.2), $C_s^{k,m}$, $1 \le m \le M$, are all subsets of $Ap(F_s^{k-1})$. The output of the protocol is the union $C_s^k = \bigcup_{m=1}^{M} C_s^{k,m}$. In the first iteration of this computation $k = 1$, and the players compute all s-frequent 1-item sets (here $F_s^0 = \{\emptyset\}$). In the next iteration they compute all s-frequent 2-item sets, and so forth, until the first $k \le L$ in which they find no s-frequent k-item sets.
After computing that union, the players proceed to extract from $C_s^k$ the subset $F_s^k$ that consists of all k-item sets that are globally s-frequent; this is done using the protocol that we describe later on in Section 3. Finally, by applying the above described procedure from $k = 1$ until the first value of $k \le L$ for which the resulting set $F_s^k$ is empty, the players may recover the full set $F_s := \bigcup_{k=1}^{L} F_s^k$ of all globally s-frequent item sets.
Protocol UNIFI-KC works as follows: First, each player adds to his private subset $C_s^{k,m}$ fake item sets, in order to hide its size. Then, the players jointly compute the encryption of their private subsets by applying on those subsets a commutative encryption, where each player adds, in his turn, his own layer of encryption using his private secret key. (An encryption algorithm is called commutative if $E_{K_1} \circ E_{K_2} = E_{K_2} \circ E_{K_1}$ for any pair of keys $K_1$ and $K_2$.) At the end of that stage, every item set in each subset is encrypted by all of the players; the usage of a commutative encryption scheme ensures that all item sets are, eventually, encrypted in the same manner. Then, they compute the union of those subsets in their encrypted form. Finally, they decrypt the union set and remove from it item sets which are identified as fake. We now proceed to describe the protocol in detail.
(Notation agreement: Since all protocols that we present herein involve cyclic communication rounds, the index $M + 1$ always means 1, while the index 0 always means M.)
2.1.2 Detailed Description
In Phase 0 (Steps 2-4), the players select the needed cryptographic primitives: They jointly select a commutative cipher, and each player selects a corresponding private random key. In addition, they select a hash function h to apply on all item sets prior to encryption. It is essential that h will not experience collisions on $Ap(F_s^{k-1})$ in order to make it invertible on $Ap(F_s^{k-1})$. Hence, if such collisions occur (an event of a very small probability), a different hash function must be selected. At the end, the players compute a lookup table with the hash values of all candidate item sets in $Ap(F_s^{k-1})$; that table will be used later on to find the preimage of a given hash value.
In Phase 1 (Steps 6-19), all players compute a composite encryption of the hashed sets $C_s^{k,m}$, $1 \le m \le M$. First (Steps 6-12), each player $P_m$ hashes all item sets in $C_s^{k,m}$ and then encrypts them using the key $K_m$. (Hashing is needed in order to prevent leakage of algebraic relations between item sets, see [18, Appendix].) Then, he adds to the resulting set faked item sets until its size becomes $|Ap(F_s^{k-1})|$, in order to hide the number of locally frequent item sets that he has. (Since $C_s^{k,m} \subseteq Ap(F_s^{k-1})$, the size of $C_s^{k,m}$ is bounded by $|Ap(F_s^{k-1})|$, for all $1 \le m \le M$.) We denote the resulting set by $X_m$. Then (Steps 13-19), the players start a loop of $M - 1$ cycles, where in each cycle they perform the following operation: Player $P_m$ sends a permutation of $X_m$ to the next player $P_{m+1}$; Player $P_m$ receives from $P_{m-1}$ a permutation of the set $X_{m-1}$ and then computes a new $X_m$ as $X_m = E_{K_m}(X_{m-1})$. At the end of this loop, $P_m$ holds an encryption of the hashed $C_s^{k,m+1}$ using all M keys. Due to the commutative property of the selected cipher, Player $P_m$ holds the set $\{E_M(\cdots(E_2(E_1(h(x))))\cdots) : x \in C_s^{k,m+1}\}$.
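The cyclic encryption loop of Phase 1 can be sketched as follows for the third round of the running example. This is a simplified single-process simulation under stated assumptions: the cipher is the toy modular-exponentiation cipher with hypothetical keys `KEYS`, the hash `h` is SHA-256 reduced into the cipher domain, and the padding with fake item sets and the random permutations are omitted.

```python
import hashlib

P = 2**61 - 1                 # toy prime modulus (assumption)
KEYS = [65537, 101, 257]      # hypothetical keys K_1, K_2, K_3, each coprime to P - 1

def h(itemset):
    """Hash an item set, given as a string such as "124", into [1, P - 1]."""
    digest = hashlib.sha256(itemset.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def enc(m, values):
    """Player m adds his own encryption layer to every element of the set."""
    return {pow(v, KEYS[m], P) for v in values}

C = [{"124"}, {"234"}, {"124"}]     # the private sets of the three players
M = len(C)

# Steps 6-12 (without fake item sets): hash, then encrypt with one's own key.
X = [enc(m, {h(x) for x in C[m]}) for m in range(M)]

# Steps 13-19: M - 1 cycles; each player receives the previous player's set
# and adds his own layer, so every set ends up encrypted with all M keys.
for _ in range(M - 1):
    X = [enc(m, X[(m - 1) % M]) for m in range(M)]

encrypted_union = set().union(*X)
```

With zero-based indices, `X[m]` ends up holding the fully encrypted set that originated at player `(m + 1) % M`; equal item sets yield equal ciphertexts, which is what allows the union to be formed on encrypted data.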
In Phase 2 (Steps 21-26), the players merge the lists of encrypted item sets. At the completion of this stage $P_1$ holds the union set $C_s^k = \bigcup_{m=1}^{M} C_s^{k,m}$ hashed and then encrypted by all encryption keys, together with some fake item sets that were used for the sake of hiding the sizes of the sets $C_s^{k,m}$; those fake item sets are not needed anymore and will be removed after decryption in the next phase.
The merging is done in two stages, where in the first stage the odd and even lists are merged separately. As explained in [18, Section 3.2.1], not all lists are merged at once since if they were, then the player who did the merging (say $P_1$) would be able to identify all of his own encrypted item sets (as he would get them from $P_M$) and then learn in which of the other sites they are also locally frequent.
In Phase 3 (Steps 28-34), a similar round of decryptions is initiated. At the end, the last player who performs the last decryption uses the lookup table T that was constructed in Step 4 in order to identify and remove the fake item sets and then to recover $C_s^k$. Finally, he broadcasts $C_s^k$ to all his peers.
Going back to the running example in Section 1.1.3, the set $F_s^2$, consisting of all 2-item sets that are globally s-frequent, includes the item sets $\{12, 14, 23, 24, 34\}$. Applying on it the Apriori algorithm, we find that $Ap(F_s^2) = \{124, 234\}$. Therefore, each of the three players proceeds to look for 3-item sets from $Ap(F_s^2)$ that are locally s-frequent in his partial database. Since $C_s^{3,1} = \{124\}$, $P_1$ will hash and encrypt the item set 124 and will add to it one fake item set, since $|Ap(F_s^2)| = 2$. As $C_s^{3,2} = \{234\}$ and $C_s^{3,3} = \{124\}$, also $P_2$ and $P_3$ will each use one fake item set. At the completion of the protocol, the three players will conclude that $C_s^3 = \{124, 234\}$. Then, by applying the protocol in Section 3, they will find out that only the first of these two candidate item sets is globally frequent, whence $F_s^3 = \{124\}$.
2.2 A Secure Multiparty Protocol for Computing the OR of Private Binary Vectors
Protocol UNIFI-KC securely computes the union of private subsets of some publicly known ground set ($Ap(F_s^{k-1})$). Such a problem is equivalent to the problem of computing the OR of private vectors. Indeed, if the ground set is $V = \{v_1, \ldots, v_n\}$, then any subset B of V may be described by the characteristic binary vector $b = (b(1), \ldots, b(n)) \in \mathbb{Z}_2^n$ where $b(i) = 1$ if and only if $v_i \in B$. Let $b_m$ be the binary vector that characterizes the private subset held by player $P_m$, $1 \le m \le M$. Then the union of the private subsets is described by the OR of those private vectors, $b := \bigvee_{m=1}^{M} b_m$.
Such a simple function can be evaluated securely by the generic solutions suggested in [3], [5], [15]. We present here a protocol for computing that function which is much simpler to understand and program and much more efficient than those generic solutions. It is also much simpler than Protocol UNIFI-KC and employs fewer cryptographic primitives. Our protocol (Protocol 2) computes a wider range of functions, which we call threshold functions.
Definition 2.1. Let $b_1, \ldots, b_M$ be M bits and $1 \le t \le M$ be an integer. Then
$$T_t(b_1, \ldots, b_M) = \begin{cases} 1 & \text{if } \sum_{m=1}^{M} b_m \ge t \\ 0 & \text{if } \sum_{m=1}^{M} b_m < t \end{cases}$$
[...] $M > 2$, there is no need to invoke either of the secure protocols of [13] or [12]. Indeed, as $M > 2$, the existence of other semi-honest players can be used to verify the inclusion in Eq. (4) much more easily. This is done in Protocol 3 (SETINC) which we proceed to describe next.
Protocol SETINC involves three players: $P_1$ has a vector $s = (s(1), \ldots, s(n))$ of elements in some ground set V; $P_M$, on the other hand, has a vector $Q = (Q(1), \ldots, Q(n))$ of subsets of that ground set. The required output is a vector $b = (b(1), \ldots, b(n))$ that describes the corresponding set inclusions in the following manner: $b(i) = 0$ if $s(i) \in Q(i)$ and $b(i) = 1$ if $s(i) \notin Q(i)$, $1 \le i \le n$. The computation in the protocol involves a third player $P_2$. (When Protocol SETINC is called from Protocol THRESHOLD, the ground set is $V = \mathbb{Z}_{M+1}$ and the inputs $s(i)$ and $Q(i)$ of the two players are as in Eq. (4), $1 \le i \le n$.)
The protocol starts with players $P_1$ and $P_M$ agreeing on a keyed hash function $h_K(\cdot)$ (e.g., HMAC [4]), and a corresponding secret key K (Step 1). Subsequently (Steps 2-3), $P_1$ converts his sequence of elements $s = (s(1), \ldots, s(n))$ into a sequence of corresponding signatures $s' = (s'(1), \ldots, s'(n))$, where $s'(i) = h_K(i, s(i))$, and $P_M$ performs a similar conversion on the subsets that he holds. Then, in Steps 4-5, $P_1$ sends $s'$ to $P_2$, and $P_M$ sends to $P_2$ the subsets $Q'(i)$, $1 \le i \le n$, where the elements within each subset are randomly permuted. Finally (Steps 6-7), $P_2$ performs the relevant inclusion verifications on the signature values. If he finds out that for a given $1 \le i \le n$, $s'(i) \in Q'(i)$, he may infer, with high probability, that $s(i) \in Q(i)$ (see more on that below), whence he sets $b(i) = 0$. If, on the other hand, $s'(i) \notin Q'(i)$, then, with certainty, $s(i) \notin Q(i)$, and thus he sets $b(i) = 1$.
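The message flow of Protocol SETINC described above can be sketched in Python with the standard `hmac` module. All three players are simulated in one process; the function names, the byte encoding of the pair $(i, u)$, the key and the inputs are hypothetical choices made for illustration.

```python
import hashlib
import hmac
import random

def h(key, i, u):
    """Keyed hash of the pair (i, u); the encoding of the pair is an assumption."""
    return hmac.new(key, f"{i}|{u}".encode(), hashlib.sha256).digest()

def p1_signatures(key, s):
    """Steps 2 and 4 on P1's side: sign each element together with its index."""
    return [h(key, i, u) for i, u in enumerate(s)]

def pm_signatures(key, Q):
    """Steps 3 and 5 on PM's side: sign each subset and permute it randomly."""
    out = []
    for i, subset in enumerate(Q):
        signed = [h(key, i, u) for u in subset]
        random.shuffle(signed)
        out.append(signed)
    return out

def p2_inclusions(s_sig, Q_sig):
    """Steps 6-7: b(i) = 0 if the i-th signature lies in the i-th signed subset."""
    return [0 if sig in signed else 1 for sig, signed in zip(s_sig, Q_sig)]

key = b"hypothetical shared key"    # known to P1 and PM, but not to P2
s = [2, 0, 3]                       # P1's private vector
Q = [{0, 1}, {0, 1}, {2, 3}]        # PM's private subsets
b = p2_inclusions(p1_signatures(key, s), pm_signatures(key, Q))
```

Player $P_2$ only ever sees signature values, so he can compare them for equality but cannot read off the underlying elements without the key.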
Two comments are in order:
1. If the index i had not been part of the input to the hash function (Steps 2-3), then two equal components in $P_1$'s input vector, say $s(i) = s(j)$, would have been mapped to two equal signatures, $s'(i) = s'(j)$. Hence, in that case player $P_2$ would have learnt that in $P_1$'s input vector the ith and jth components are equal. To prevent such leakage of information, we include the index i in the input to the hash function.
2. An event in which $s'(i) \in Q'(i)$ while $s(i) \notin Q(i)$ indicates a collision; specifically, it implies that there exist $u' \in Q(i)$ and $u'' \in V \setminus Q(i)$ for which $h_K(i, u') = h_K(i, u'')$. Hash functions are designed so that the probability of such collisions is negligible, whence the risk of a collision can be ignored. However, it is possible for player $P_M$ to check upfront the selected random key K in order to verify that for all $1 \le i \le n$, the sets $Q'(i) = \{h_K(i, u) : u \in Q(i)\}$ and $Q''(i) = \{h_K(i, u) : u \in V \setminus Q(i)\}$ are disjoint.
We refer hereinafter to the combination of Protocols THRESHOLD and SETINC as Protocol THRESHOLD-C; namely, it is Protocol THRESHOLD where the verifications of the inequalities in Steps 6-8, which are equivalent to the verification of the set inclusions in Eq. (4), are carried out by Protocol SETINC. Then our claims are as follows:
Theorem 2.2. Protocol THRESHOLD-C is correct, i.e., it computes the threshold function.
Proof. Protocol THRESHOLD operates correctly if the inequality verifications in Step 7 are carried out correctly, since $(s(i) + s_M(i)) \bmod (M+1)$ equals the ith component $a(i)$ in the sum vector $a = \sum_{m=1}^{M} b_m$. The inequality verification is correct if Protocol SETINC is correct. The latter protocol is indeed correct if the randomly selected key K is such that for all $1 \le i \le n$, the sets $Q'(i) = \{h_K(i, u) : u \in Q(i)\}$ and $Q''(i) = \{h_K(i, u) : u \in V \setminus Q(i)\}$ are disjoint. As discussed earlier, such a verification can be carried out upfront, and almost all selections of K are expected to pass that test. □
Theorem 2.3. Assume that the $M > 2$ players are semi-honest. Let $C \subset \{P_1, P_2, \ldots, P_M\}$ be a coalition of players.
a. If $P_2 \notin C$ and at least one of $P_1$ and $P_M$ is not in C either, then Protocol THRESHOLD-C is perfectly private with respect to C.
b. If $P_2 \in C$ but $P_1, P_M \notin C$, the protocol is computationally private with respect to C.
c. Otherwise, the coalition C includes at least two of the three players $P_1, P_2, P_M$. In such cases, it may learn the sum $a = \sum_{m=1}^{M} b_m$, but no information on the private vectors $b_m$, $1 \le m \le M$, beyond what is implied by that sum and the coalition's input vectors.
(A multiparty computation is perfectly private with respect to a subset of players if it does not enable those players to learn information on the inputs of other players beyond what is implied by the final output and their own inputs, even if they are computationally unbounded. Such a computation is computationally private if it achieves the same goal when the players are polynomially bounded.)
Proof. The view of each single player $P_\ell$ consists of $b_{m,\ell}$ for all $1 \le m \ne \ell \le M$, where $b_{m,\ell}$ is $P_\ell$'s additive share in an M-out-of-M secret sharing scheme for $P_m$'s private input $b_m$. In addition, $P_1$'s view includes $s_2, \ldots, s_{M-1}$ (Step 4 in Protocol THRESHOLD), which are additive shares in an M-out-of-M secret sharing for the sum a; and $P_2$'s view includes the signatures $s'$ and $Q'$ (Steps 4 and 5 in Protocol SETINC).
(a) If the coalition C does not include $P_2$ and, in addition, it does not include at least one of $P_1$ and $P_M$, then C's collaborative view consists of incomplete sets of additive shares in $b_1, \ldots, b_M$ and a. As the M-out-of-M secret sharing scheme is perfect, those additive shares are independent and each one is a uniformly distributed random vector in $\mathbb{Z}_{M+1}^n$. Hence, the coalition C may simulate its view during the protocol, whence the protocol is perfectly private with respect to C.
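The M-out-of-M additive sharing over $\mathbb{Z}_{M+1}$ that this argument relies on can be sketched as follows. The code is an illustration pieced together from the description in the proof (shares $b_{m,\ell}$, partial sums $s_\ell$, and the aggregate s held by $P_1$), not a transcription of Protocol THRESHOLD; the names `share` and `add` and the input vectors are hypothetical, and it assumes $M \ge 2$.

```python
import secrets

def share(vector, M):
    """Split a vector over Z_{M+1} into M additive shares that sum to it mod M + 1."""
    q = M + 1
    shares = [[secrets.randbelow(q) for _ in vector] for _ in range(M - 1)]
    last = [(v - sum(col)) % q for v, col in zip(vector, zip(*shares))]
    return shares + [last]

def add(vectors, q):
    """Component-wise sum of vectors modulo q."""
    return [sum(col) % q for col in zip(*vectors)]

b = [[1, 0], [0, 1], [1, 0]]            # private binary vectors b_1, b_2, b_3
M = len(b)
dealt = [share(b_m, M) for b_m in b]    # dealt[m][l]: the share that P_m gives to P_l

# Each player sums the shares he received (including his own share of his own vector).
s_l = [add([dealt[m][l] for m in range(M)], M + 1) for l in range(M)]

s = add(s_l[:M - 1], M + 1)             # aggregate of all partial sums except s_M
a = add([s, s_l[M - 1]], M + 1)         # s + s_M (mod M + 1) equals the sum vector
```

Since every entry of the true sum lies between 0 and M, reducing modulo $M + 1$ loses no information, while any proper subset of the shares of a vector is uniformly distributed.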
(b) If $P_2 \in C$ and $P_1, P_M \notin C$, we may repeat the same arguments as before, since $P_2$'s additional view of $s'$ and $Q'$ can also be simulated by independent and uniformly distributed random vectors; indeed, assuming that the hash function h is secure and that the key K is chosen uniformly at random, then $s'$ and $Q'$ are indistinguishable from vectors chosen uniformly at random from $H^n$ and $H^{tn}$, respectively, where H is the hash range.
However, while the coalitions discussed above in (a) cannot learn any information about the inputs of other players, even if their members are computationally unbounded, here the privacy guarantee assumes that $P_2$ is polynomially bounded. If $P_2$ is computationally unbounded, he may scan the exponential number of possible keys K in order to find what key $P_1$ and $P_M$ used. To do so, he will compute for each possible K the hashed values $h_K(i, u)$ for all $1 \le i \le n$ and $u \in V = \{0, 1, \ldots, M\}$. Then, he will check whether the signature values that he got from $P_1$ and $P_M$ (namely, $s'$ and $Q'$) are consistent with the values which he computed. After finding the true K (assuming that it is the only one that will pass the check), $P_2$ will be able to recover $s(i)$ from $s'(i)$, and $Q(i)$ from $Q'(i)$, for all $1 \le i \le n$. Since $Q(i)$ reveals $s_M(i)$ (see Eq. (4)), $P_2$ may proceed to compute $a(i) = s(i) + s_M(i)$, $1 \le i \le n$. Hence, if $P_2$ is computationally unbounded, he may be able to deduce the value of the sum a. (If during his check, $P_2$ finds more than one possible K, he will compute the vector a that corresponds to each of them, and then infer that the true a is one of those vectors.)
(c) If $P_1, P_M \in C$, then by adding s (known to $P_1$) and $s_M$ (known to $P_M$), they will get the sum a. No further information on the input vectors $b_1, \ldots, b_M$ may be deduced from the inputs of the players in such a coalition; specifically, every set of vectors $b_1, \ldots, b_M$ that is consistent with the sum a is equally likely. (Put differently, such a coalition may simulate its view by selecting at random any set of vectors $b_1, \ldots, b_M$ whose sum equals a and then generate for each such vector random additive shares.)
Coalitions C that include either $P_1$ and $P_2$ or $P_2$ and $P_M$ can also recover the vector a. Indeed, $P_2$ knows $s'$ and $Q'$, while $P_1$ or $P_M$ knows $h_K$ and K. Hence, if $P_2$ colludes with either $P_1$ or $P_M$, he may recover from $s'$ and $Q'$ the preimages s and Q. Thus, such a coalition can recover s and $s_M$, and consequently, it can recover a. As argued before, the shares available for such coalitions do not reveal any further information about the input vectors $b_1, \ldots, b_M$. □
The susceptibility of Protocol THRESHOLD-C to coalitions is not very significant for two reasons:
The entries of the sum vector a do not reveal information about specific input vectors. Namely, knowing that $a(i) = p$ only indicates that p out of the M bits $b_m(i)$, $1 \le m \le M$, equal 1, but it reveals no information regarding which of the M bits are those.
There are only three players that can collude in order to learn information beyond the intention of the protocol. Such a situation is far less severe than a situation in which any player may participate in a coalition, since if it is revealed that a collusion took place, there is a small set of suspects.
2.3 An Improved Protocol for the Secure Computation of All Locally Frequent Item Sets
As before, we denote by $F_s^{k-1}$ the set of all globally frequent $(k-1)$-item sets, and by $Ap(F_s^{k-1})$ the set of k-item sets that the Apriori algorithm generates when applied on $F_s^{k-1}$. All players can compute the set $Ap(F_s^{k-1})$ and decide on an ordering of it. (Since all item sets are subsets of $A = \{a_1, \ldots, a_L\}$, they may be viewed as binary vectors in $\{0, 1\}^L$ and, as such, they may be ordered lexicographically.) Then, since the sets of locally frequent k-item sets, $C_s^{k,m}$, $1 \le m \le M$, are subsets of $Ap(F_s^{k-1})$, they may be encoded as binary vectors of length $n_k := |Ap(F_s^{k-1})|$. The binary vector that encodes the union $C_s^k := \bigcup_{m=1}^{M} C_s^{k,m}$ is the OR of the vectors that encode the sets $C_s^{k,m}$, $1 \le m \le M$. Hence, the players can compute the union by invoking Protocol THRESHOLD-C on their binary input vectors. This approach is summarized in Protocol 4 (UNIFI).
In the running example in Section 1.1.3, $F_s^2 = \{12, 14, 23, 24, 34\}$
and $Ap(F_s^2) = \{124, 234\}$. The private sets of locally frequent item
sets are $C_s^{3,1} = \{124\}$, $C_s^{3,2} = \{234\}$, and
$C_s^{3,3} = \{124\}$. Those private sets will be encoded as
$\mathbf{b}_1 = (1,0)$, $\mathbf{b}_2 = (0,1)$, and $\mathbf{b}_3 = (1,0)$.
The OR of these vectors is $\mathbf{b} = (1,1)$ and, therefore,
$C_s^3 = \{124, 234\}$.
Comment. Replacing $T_1$ with $T_M$ in Step 2 of Protocol 4 will result in
computing the intersection of the private subsets rather than their union.
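The encoding used in the running example can be sketched in a few lines of Python. This is a plain, non-private illustration only: the helper names (`encode`, `threshold`, `decode`) are hypothetical, the threshold is evaluated in the clear rather than through Protocol THRESHOLD-C, and sorted string order stands in for whatever ordering the players agree on.

```python
def encode(local_set, candidates):
    """Encode a subset of the ordered candidate list as a 0/1 vector."""
    return [1 if c in local_set else 0 for c in candidates]

def threshold(vectors, t):
    """Plain threshold function: output bit i is 1 iff at least t inputs have a 1 at i."""
    sums = [sum(column) for column in zip(*vectors)]
    return [1 if a >= t else 0 for a in sums]

def decode(bits, candidates):
    """Map a 0/1 vector back to the item sets it selects."""
    return {c for c, bit in zip(candidates, bits) if bit}

# Running example: Ap(F_s^2) and the three private sets C_s^{3,m}.
candidates = sorted(["124", "234"])
local_sets = [{"124"}, {"234"}, {"124"}]

vectors = [encode(s, candidates) for s in local_sets]
union = decode(threshold(vectors, 1), candidates)                      # t = 1 is OR
intersection = decode(threshold(vectors, len(vectors)), candidates)    # t = M is AND
```

With threshold 1 the result is the union, and with threshold M it is the intersection, which mirrors the Comment above.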
976 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL.
26, NO. 4, APRIL 2014
2.4 Privacy
We begin by analyzing the privacy offered by Protocol UNIFI-KC. That
protocol does not respect perfect privacy since it reveals to the players
information that is not implied by their own input and the final output.
In Step 11 of Phase 1 of the protocol, each player augments the set $X_m$
by fake item sets. To avoid unnecessary hash and encryption computations,
those fake item sets are random strings in the ciphertext domain of the
chosen commutative cipher. The probability of two players selecting random
strings that will become equal at the end of Phase 1 is negligible; so is
the probability of Player $P_m$ selecting a random string that equals
$E_{K_m}(h(x))$ for a true item set $x \in Ap(F_s^{k-1})$. Hence, every
encrypted item set that appears in two different lists indicates with high
probability a true item set that is locally s-frequent in both of the
corresponding sites. Therefore, Protocol UNIFI-KC reveals the following
excess information:
1. $P_1$ may deduce for any subset of the odd players, the number of item
   sets that are locally supported by all of them.
2. $P_2$ may deduce for any subset of the even players, the number of item
   sets that are locally supported by all of them.
3. $P_1$ may deduce the number of item sets that are supported by at least
   one odd player and at least one even player.
4. If $P_1$ and $P_2$ collude, they reveal for any subset of the players
   the number of item sets that are locally supported by all of them.
As for the privacy offered by Protocol UNIFI, we consider two cases: If
there are no collusions, then, by Theorem 2.3, Protocol UNIFI offers
perfect privacy with respect to all players $P_m$, $m \neq 2$, and
computational privacy with respect to $P_2$. This is a privacy guarantee
better than that offered by Protocol UNIFI-KC, since the latter protocol
does reveal information to $P_1$ and $P_2$ even if they do not collude with
any other player.
If there are collusions, both Protocols UNIFI-KC and UNIFI allow the
colluding parties to learn forbidden information. In both cases, the number
of suspects is small: in Protocol UNIFI-KC only $P_1$ and $P_2$ may benefit
from a collusion, while in Protocol UNIFI only $P_1$, $P_2$ and $P_M$ can
extract additional information if two of them collude (see Theorem 2.3). In
Protocol UNIFI-KC, the excess information which may be extracted by $P_1$
and $P_2$ is about the number of common frequent item sets among any subset
of the players. Namely, they may learn that, say, $P_2$ and $P_3$ have many
item sets that are frequent in both of their databases (but not which item
sets), while $P_2$ and $P_4$ have very few item sets that are frequent in
both of their databases. The excess information in Protocol UNIFI is
different: If any two out of $P_1$, $P_2$ and $P_M$ collude, they can learn
the sum of all private vectors. That sum reveals for each specific item set
in $Ap(F_s^{k-1})$ the number of sites in which it is frequent, but not
which sites. Hence, while the colluding players in Protocol UNIFI-KC can
distinguish between the different players and learn about the similarity or
dissimilarity between them, Protocol UNIFI leaves the partial databases
totally indistinguishable, as the excess information that it leaks is with
regard to the item sets only.
To summarize, given that Protocol UNIFI reveals no excess information when
there are no collusions, and, in addition, when there are collusions, the
excess information still leaves the partial databases indistinguishable, it
offers enhanced privacy preservation in comparison to Protocol UNIFI-KC.
2.5 Communication Cost
Here and in the next section we analyze the communication and computational
costs of Protocols UNIFI-KC and UNIFI. In doing so, we use the following
notations and terms:
• $K = K(s)$ is the size of the longest s-frequent item set in D.
• The kth iteration refers to the iteration in which $F_s^k$ is computed
  from $F_s^{k-1}$. Both protocols have $K+1$ iterations (where in the last
  iteration, the players find that $F_s^{K+1} = \emptyset$ and consequently
  terminate the protocol).
• $n_k$ is the number of candidate item sets of size k in the kth
  iteration. $n_1 = L$ and $n_k := |Ap(F_s^{k-1})|$ for all
  $1 < k \le K+1$.
• $n := \sum_{k=1}^{K+1} n_k$.
• $\zeta_k$ is the number of k-item sets that were s-frequent in at least
  one of the sites.
• $\zeta := \sum_{k=1}^{K+1} \zeta_k$.
• B is the size in bits of representing one frequent item set. (We use the
  same notation for frequent item sets of any length for simplicity. In
  practice, the length of frequent item sets is typically a small number.)
In evaluating the communication cost, we consider three parameters: Total
number of communication rounds, total number of messages sent, and the
overall size of the messages sent. For example, in Step 15 of Protocol
UNIFI-KC, every player $P_m$ sends a message to $P_{m+1}$. Those M messages
are sent simultaneously. Hence, each time this step is executed, the counter
of communication rounds is increased by 1, the number of messages sent is
increased by M, and the total message size is increased by the sum of sizes
of those messages.
2.5.1 Communication Cost of Protocol UNIFI-KC
Let t denote the number of bits required to represent an item set. Clearly,
t must be at least $\log_2 n_k$ for all $1 \le k \le L$. However, as
Protocol UNIFI-KC hashes the item sets and then encrypts them, t should be
at least the recommended ciphertext length in commutative ciphers. RSA [25],
Pohlig-Hellman [24] and ElGamal [10] ciphers are examples of commutative
encryption schemes. As the recommended length of the modulus in all of them
is at least 1,024 bits, we take $t = 1{,}024$.
We begin by analyzing the communication costs of Protocol UNIFI-KC in each
of the $K+1$ iterations separately. Each iteration consists of four phases.
During Phase 1 of Protocol UNIFI-KC, there are $M-1$ rounds of
communication. In each such round, each of the M players sends to the next
player a message; the length of that message in the kth iteration is
$t n_k$. Hence, the communication cost of this phase in the kth iteration is
$(M-1)M$ messages of total size of $(M-1)M t n_k$ bits.
During Phase 2 of the protocol all odd players send their encrypted item
sets to $P_1$ and all even players send their encrypted item sets to $P_2$.
Then $P_2$ unifies the item sets he got and sends them to $P_1$. Hence, this
phase takes two more rounds. The communication cost, in the kth iteration,
of the first of those two rounds is $M-2$ messages of total size of
$(M-2) t n_k$. The communication cost of the second round is a single
message whose size is $\mu_1 t n_k$, where $\mu_1 \in [1, M/2]$. (The size of
the unified list will equal the lower bound in that range, i.e., $t n_k$
bits, if all lists of all even players coincide; it will equal the upper
bound of $M t n_k / 2$ if all those lists are pairwise disjoint.)
In Phase 3, a similar round of decryptions is initiated. The unified list
of all encrypted true and fake item sets may contain in the kth iteration at
least $n_k$ item sets but no more than $M n_k$ item sets. Hence, that phase
involves $M-1$ rounds with communication cost of $M-1$ messages with a total
size of $\mu_2 (M-1) t n_k$, where $\mu_2 \in [1, M]$.
Finally, in Step 34, $P_M$ broadcasts $C_s^k$ to all other players. That
step adds one more communication round with $M-1$ messages of total size of
$(M-1) \zeta_k B$, where $\zeta_k = |C_s^k|$.
Adding up the above costs over all iterations, $1 \le k \le K+1$, we find
that Protocol UNIFI-KC entails:
• $(2M+1)(K+1)$ communication rounds.
• $(M^2+2M-3)(K+1)$ messages.
• $\gamma_M t n + (M-1) \zeta B$ bits of communication, where

    $M^2 + M - 2 \le \gamma_M \le 2M^2 - \frac{M}{2} - 2$.    (5)
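As a check on these totals, the per-iteration bit counts of the four phases can be added explicitly. The derivation below uses only quantities from this section, with $\mu_1 \in [1, M/2]$ and $\mu_2 \in [1, M]$ denoting the size factors of the unified lists in Phases 2 and 3 and $\zeta_k = |C_s^k|$; it is offered as a reader's aid, not as part of the original analysis.

```latex
\begin{align*}
\text{bits}_k
  &= \underbrace{M(M-1)\,t\,n_k}_{\text{Phase 1}}
   + \underbrace{(M-2)\,t\,n_k + \mu_1\,t\,n_k}_{\text{Phase 2}}
   + \underbrace{\mu_2\,(M-1)\,t\,n_k}_{\text{Phase 3}}
   + \underbrace{(M-1)\,\zeta_k B}_{\text{Step 34}} \\
  &= \bigl(M^2 - 2 + \mu_1 + \mu_2\,(M-1)\bigr)\,t\,n_k + (M-1)\,\zeta_k B .
\end{align*}
% With \gamma_M := M^2 - 2 + \mu_1 + \mu_2 (M-1):
%   \mu_1 = 1,   \mu_2 = 1  gives  \gamma_M = M^2 + M - 2,
%   \mu_1 = M/2, \mu_2 = M  gives  \gamma_M = 2M^2 - M/2 - 2.
```

Summing over $1 \le k \le K+1$ replaces $n_k$ by $n$ and $\zeta_k$ by $\zeta$, and the two extreme choices of $\mu_1, \mu_2$ give the bounds in (5). The message count follows in the same way: $M(M-1) + (M-2) + 1 + (M-1) + (M-1) = M^2 + 2M - 3$ per iteration.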
2.5.2 Communication Cost of Protocol UNIFI
Protocol UNIFI consists of four communication rounds (in each of the
iterations): One for Step 2 of Protocol THRESHOLD that it invokes; one for
Step 4 of that protocol; one for Steps 4-5 in Protocol SETINC which is used
for the inequality verifications in Protocol THRESHOLD; and one for Step 7
in Protocol SETINC.
In the kth iteration, the length of the vectors in Protocol THRESHOLD is
$n_k$; each entry in those vectors represents a number between 0 and $M-1$,
whence it may be encoded by $\log_2 M$ bits. Therefore:
• The communication cost of Step 2 in Protocol THRESHOLD is $(M-1)M$
  messages of total size of $(M-1)M (\log_2 M) n_k$ bits. (Since each of the
  M players sends a vector of size $(\log_2 M) n_k$ bits to each of the
  other $M-1$ players.)
• The communication cost of Step 4 in Protocol THRESHOLD is $M-2$ messages
  of total size of $(M-2)(\log_2 M) n_k$ bits.
• The communication cost of Step 4 in Protocol SETINC is a single message
  of size $|h| n_k$, where $|h|$ is the size in bits of the hash function's
  output. The communication cost of Step 5 in that protocol is also a single
  message of size $|h| n_k$; indeed, when Protocol SETINC is called from
  Protocol THRESHOLD-C, the size of the sets $\Theta_i$ and $\Theta'_i$ is
  $t = 1$ (see Eq. (4)), because the OR function corresponds to $t = 1$.
• The communication cost of Step 7 in Protocol SETINC is $M-1$ messages of
  total size of $(M-1) \zeta_k B$, since $P_2$ can send to all his peers the
  actual $\zeta_k$ k-item sets that were s-frequent in at least one of the
  sites.
Adding up the above costs over all iterations, $1 \le k \le K+1$, we find
that Protocol UNIFI entails:
• $4(K+1)$ communication rounds.
• $(M^2+M-1)(K+1)$ messages.
• $(M^2-2)(\log_2 M) n + 2n|h| + (M-1) \zeta B$ bits of communication.
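The same bookkeeping can be written out for one iteration of Protocol UNIFI. The tally below restates the four itemized costs, with $|h|$ the hash output length and $\zeta_k$ the number of k-item sets that were s-frequent in at least one site; it is a reader's aid under the assumptions of this section, not an additional result.

```latex
\begin{align*}
\text{messages}_k &= M(M-1) + (M-2) + 1 + 1 + (M-1) = M^2 + M - 1, \\
\text{bits}_k
  &= \bigl[M(M-1) + (M-2)\bigr](\log_2 M)\,n_k + 2\,|h|\,n_k + (M-1)\,\zeta_k B \\
  &= (M^2 - 2)(\log_2 M)\,n_k + 2\,|h|\,n_k + (M-1)\,\zeta_k B .
\end{align*}
```

Summing over the $K+1$ iterations gives the totals stated above.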
2.5.3 Comparison
Comparing the costs of the two protocols as derived in Sections 2.5.1 and
2.5.2 we find that Protocol UNIFI reduces the number of rounds by a factor
of $(2M+1)/4$ with respect to Protocol UNIFI-KC. The number of messages in
the two protocols is roughly the same. As for the bit communication cost,
Protocol UNIFI offers a significant improvement. The improvement factor in
the bit communication cost, as offered by Protocol UNIFI with respect to
Protocol UNIFI-KC, is

    $\frac{\gamma_M t n + (M-1) \zeta B}{(M^2-2)(\log_2 M) n + 2n|h| + (M-1) \zeta B}$,    (6)

where the range of possible values of $\gamma_M$ is given in Eq. (5). The
communication cost of the fourth phase, which is $(M-1) \zeta B$ in both
protocols, may be neglected, as validated in our experimental evaluation.
The reason for this is that it depends on $\zeta$ (the overall number of
item sets that were s-frequent in at least one site), which is much smaller
than n (the overall number of Apriori-generated candidate item sets), and,
in addition, it depends on $M-1$ rather than $\Theta(M^2)$ as the other
costs. Therefore, the ratio in Eq. (6) may be approximated by

    $\frac{\gamma_M t}{(M^2-2) \log_2 M + 2|h|}$.

As discussed earlier, a plausible setting of t would be $t = 1{,}024$. A
typical value of $|h|$ is 160. Hence, for $M = 4$ we get an improvement
factor that ranges between 53 and 82, while for $M = 8$ we get an
improvement factor that ranges between 142 and 247.
2.6 Computational Cost
In Protocol UNIFI-KC each of the players needs to perform hash evaluations
as well as encryptions and decryptions. As the cost of hash evaluations is
significantly smaller than the cost of commutative encryption, we focus on
the cost of the latter operations. In Steps 8-10 of the protocol, player
$P_m$ performs $|C_s^{k,m}| \le n_k = |Ap(F_s^{k-1})|$ encryptions (in the
kth iteration). Then, in Steps 13-19, each player performs $M-1$ encryptions
of sets that include $n_k$ items. Hence, in Phase 1 in the kth iteration,
each player performs between $(M-1) n_k$ and $M n_k$ encryptions. In Phase
3, each player decrypts the set of items $EC_s^k$. $EC_s^k$ is the union of
the encrypted sets from all M players, where each of those sets has $n_k$
items, true and fake ones. Clearly, the size of $EC_s^k$ is at least $n_k$.
On the other hand, since most of the items in the M sets are expected to be
fake ones, and the probability of collisions between fake items is
negligible, it is expected that the size of $EC_s^k$ would be close to
$M n_k$. So, in all its iterations ($1 \le k \le K+1$), Protocol UNIFI-KC
requires each player to perform an overall number of close to $2Mn$ (but no
less than $Mn$) encryptions or decryptions,
where, as before, $n = \sum_{k=1}^{K+1} n_k$. Since commutative encryption
is typically based on modular exponentiation, the overall computational cost
of the protocol is $\Theta(M t^3 n)$ bit operations per player.
In Protocol THRESHOLD, which Protocol UNIFI calls, each player needs to
generate $(M-1)n$ (pseudo)random $\log_2 M$-bit numbers (Step 1). Then, each
player performs $(M-1)n$ additions of such numbers in Step 1 as well as in
Step 3. Player $P_1$ has to perform also $(M-2)n$ additions in Step 5.
Therefore, the computational cost for each player is $\Theta(M n \log_2 M)$
bit operations. In addition, Players 1 and M need to perform n hash
evaluations. Compared to a computational cost of $\Theta(M t^3 n)$ bit
operations per player, we see that Protocol UNIFI offers a significant
improvement with respect to Protocol UNIFI-KC also in terms of computational
cost.
3 IDENTIFYING THE GLOBALLY s-FREQUENT ITEM SETS
Protocols UNIFI-KC and UNIFI yield the set $C_s^k$ that consists of all
item sets that are locally s-frequent in at least one site. Those are the
k-item sets that have potential to be also globally s-frequent. In order to
reveal which of those item sets are globally s-frequent there is a need to
securely compute the support of each of those item sets. That computation
must not reveal the local support in any of the sites. Let x be one of the
candidate item sets in $C_s^k$. Then x is globally s-frequent if and only if

    $\Delta(x) := supp(x) - sN = \sum_{m=1}^{M} \left( supp_m(x) - s N_m \right) \ge 0$.    (7)
We describe here the solution that was proposed by Kantarcioglu and
Clifton. They considered two possible settings. If the required output
includes all globally s-frequent item sets, as well as the sizes of their
supports, then the values of $\Delta(x)$ can be revealed for all
$x \in C_s^k$. In such a case, those values may be computed using a secure
summation protocol (e.g., [6]), where the private addend of $P_m$ is
$supp_m(x) - s N_m$. The more interesting setting, however, is the one where
the support sizes are not part of the required output. We proceed to discuss
it.
As $|\Delta(x)| \le N$, an item set $x \in C_s^k$ is s-frequent if and only
if $\Delta(x) \bmod q \le N$, for $q = 2N+1$. The idea is to verify that
inequality by starting an implementation of the secure summation protocol of
[6] on the private inputs $\Delta_m(x) := supp_m(x) - s N_m$, modulo q. In
that protocol, all players jointly compute random additive shares of the
required sum $\Delta(x)$ and then, by sending all shares to, say, $P_1$, he
may add them and reveal the sum. If, however, $P_M$ withholds his share of
the sum, then $P_1$ will have one random share, $s_1(x)$, of $\Delta(x)$,
and $P_M$ will have a corresponding share, $s_M(x)$; namely,
$s_1(x) + s_M(x) = \Delta(x) \bmod q$. It is then proposed that the two
players execute the generic secure circuit evaluation of [32] in order to
verify whether

    $(s_1(x) + s_M(x)) \bmod q \le N$.    (8)

Those circuit evaluations may be parallelized for all $x \in C_s^k$.
We observe that inequality (8) holds if and only if

    $s_1(x) \in \Theta(x) := \{ (j - s_M(x)) \bmod q : 0 \le j \le N \}$.    (9)

As $s_1(x)$ is known only to $P_1$ while $\Theta(x)$ is known only to
$P_M$, the verification of the set inclusion in (9) can also be carried out
by means of Protocol SETINC. However, the ground set $\Omega$ in this case
is $\mathbb{Z}_q = \mathbb{Z}_{2N+1}$, which is typically a large set.
(Recall that when Protocol SETINC is invoked from UNIFI, the ground set
$\Omega$ is $\mathbb{Z}_{M+1}$, which is usually a small set.) Hence,
Protocol SETINC is not useful in this case, and, consequently, Yao's generic
protocol remains, for the moment, the protocol of choice to securely verify
inequality (8). Yao's protocol is designed for the two-party case. In our
setting, as $M > 2$, there exist additional semi-honest players. An
interesting question which arises in this context is whether the existence
of such additional semi-honest players may be used to verify inequalities
like (8), even when the modulus is large, without resorting to costly
protocols such as oblivious transfer.
4 IDENTIFYING ALL (s, c)-ASSOCIATION RULES
Once the set $F_s$ of all s-frequent item sets is found, we may proceed to
look for all (s, c)-association rules (rules with support at least sN and
confidence at least c), as described in [18]. For $X, Y \in F_s$, where
$X \cap Y = \emptyset$, the corresponding association rule $X \Rightarrow Y$
has confidence at least c if and only if $supp(X \cup Y)/supp(X) \ge c$, or,
equivalently,

    $C_{X,Y} := \sum_{m=1}^{M} \left( supp_m(X \cup Y) - c \cdot supp_m(X) \right) \ge 0$.    (10)

Since $|C_{X,Y}| \le N$, then by taking $q = 2N+1$, the players can verify
inequality (10), in parallel, for all candidate association rules, as
described in Section 3.
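For completeness, the step from the confidence ratio to inequality (10), and the bound on $|C_{X,Y}|$, can be spelled out as follows. The derivation assumes $supp(X) > 0$ and $0 \le c \le 1$, and uses the fact that global supports are sums of local supports.

```latex
\begin{align*}
\frac{supp(X \cup Y)}{supp(X)} \ge c
  &\iff supp(X \cup Y) - c \cdot supp(X) \ge 0 \\
  &\iff \sum_{m=1}^{M} \bigl( supp_m(X \cup Y) - c \cdot supp_m(X) \bigr) \ge 0 , \\[4pt]
-N \le -c \cdot supp(X) \le C_{X,Y} &\le supp(X \cup Y) \le N .
\end{align*}
```

The second line holds because $supp(Z) = \sum_{m=1}^{M} supp_m(Z)$ for every item set Z, and the last line uses $0 \le supp(X \cup Y) \le supp(X) \le N$.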
In order to derive from $F_s$ all (s, c)-association rules in an efficient
manner we rely upon the following straightforward lemma.
Lemma 4.1. If $X \Rightarrow Y$ is an (s, c)-rule and $Y' \subset Y$, then
$X \Rightarrow Y'$ is also an (s, c)-rule.
Proof. The rule $X \Rightarrow Y'$ has the required support count since
$supp(X \cup Y') \ge supp(X \cup Y) \ge sN$. It is also c-confident since
$\frac{supp(X \cup Y')}{supp(X)} \ge \frac{supp(X \cup Y)}{supp(X)} \ge c$.
Hence, it is an (s, c)-rule too. □
We first find all (s, c)-rules with 1-consequents; namely, all (s, c)-rules
$X \Rightarrow Y$ with a consequent (right hand side) Y of size 1. To that
end, we scan all item sets $Z \in F_s$ of size $|Z| \ge 2$, and for each
such item set we scan all $|Z|$ partitions $Z = X \cup Y$ where $|Y| = 1$
and $X = Z \setminus Y$. The association rule $X \Rightarrow Y$ that
corresponds to such a given partition $Z = X \cup Y$ is tested to see
whether it satisfies inequality (10). We may test all those candidate rules
in parallel and at the end we get the full list of all (s, c)-rules with
1-consequents.
We then proceed by induction; assume that we found all (s, c)-rules with
j-consequents for all $1 \le j \le \ell - 1$. To find all (s, c)-rules with
$\ell$-consequents, we rely upon Lemma 4.1. Namely, if $Z \in F_s$ and
$Z = X \cup Y$ where $X \cap Y = \emptyset$ and $|Y| = \ell$, then
$X \Rightarrow Y$ is an (s, c)-rule only if $X \Rightarrow Y'$ were found to
be (s, c)-rules for all $Y' \subset Y$. Hence, we may create all candidate
rules with $\ell$-consequents and test them against inequality (10) in
parallel.
It should be noted that in practice, one usually aims at finding
association rules of the form $X \Rightarrow Y$ where $|Y| = 1$, or at least
$|Y| \le \ell$ for some small constant $\ell$. However, the above procedure
may be continued until all candidate association rules, with no upper bounds
on the consequent size, are found.
5 A FULLY SECURE PROTOCOL
As noted in [18, Section 6], the players may dispense with the local
pruning and union computation in the FDM algorithm (Steps 2-4) and, instead,
test all candidate item sets in $Ap(F_s^{k-1})$ to see which of them are
globally s-frequent. Such a protocol is fully secure, as it reveals only the
set of globally s-frequent item sets but no further information about the
partial databases. However, as discussed in [18], such a protocol would be
much more costly since it requires each player to compute the local support
of $|Ap(F_s^{k-1})|$ item sets (in the kth round) instead of only $|C_s^k|$
item sets (where $C_s^k = \bigcup_{m=1}^{M} C_s^{k,m}$). In addition, the
players will have to execute the secure comparison protocol of [32] to
verify inequality (8) for $|Ap(F_s^{k-1})|$ rather than only $|C_s^k|$ item
sets. Both types of added operations are very costly: the time to compute
the support size depends linearly on the size of the database, while the
secure comparison protocol entails a costly oblivious transfer sub-protocol.
Since, as shown in [9], $|Ap(F_s^{k-1})|$ is much larger than $|C_s^k|$, the
added computing time in such a protocol is expected to dominate the cost of
the secure computation of the union of all locally s-frequent item sets.
Hence, the enhanced security offered by such a protocol is accompanied by
increased implementation costs.
6 EXPERIMENTAL EVALUATION
In Section 6.1 we describe the synthetic database that we used for our
experimentation. In Section 6.2 we explain how the database was split
horizontally into partial databases. In Section 6.3 we describe the
experiments that we conducted. The results are given in Section 6.4.
6.1 Synthetic Database Generation
The databases that we used in our experimental evaluation are synthetic
databases that were generated using the same techniques that were introduced
in [1] and then used also in subsequent studies such as [8], [18], [23].
Table 1 gives the parameter values that were used in generating the
synthetic database. The reader is referred to [8], [18], [23] for a
description of the synthetic generation method and the meaning of each of
those parameters. The parameter values that we used here are similar to
those used in [8], [18], [23].
6.2 Distributing the Database
Given a generated synthetic database D of N transactions and a number of
players M, we create an artificial split of D into M partial databases,
$D_m$, $1 \le m \le M$, in the following manner: For each $1 \le m \le M$ we
draw a random number $w_m$ from a normal distribution with mean 1 and
variance 0.1, where numbers outside the interval $[0.1, 1.9]$ are ignored.
Then, we normalize those numbers so that $\sum_{m=1}^{M} w_m = 1$. Finally,
we randomly split D into M partial databases of expected sizes of $w_m N$,
$1 \le m \le M$, as follows: Each transaction $t \in D$ is assigned at
random to one of the partial databases, so that $\Pr(t \in D_m) = w_m$,
$1 \le m \le M$.
6.3 Experimental Setup
We compared the performance of two secure implementations of the FDM
algorithm (Section 1.1.2). In the first implementation (denoted FDM-KC), we
executed the unification step (Step 4 in FDM) using Protocol UNIFI-KC, where
the commutative cipher was 1,024-bit RSA [25]; in the second implementation
(denoted FDM) we used our Protocol UNIFI, where the keyed-hash function was
HMAC [4]. In both implementations, we implemented Step 5 of the FDM
algorithm in the secure manner that was described in Section 3. We tested
the two implementations with respect to three measures:
1. Total computation time of the complete protocols (FDM-KC and FDM) over
   all players. That measure includes the Apriori computation time, and the
   time to identify the globally s-frequent item sets, as described in
   Section 3. (The latter two procedures are implemented in the same way in
   both Protocols FDM-KC and FDM.)
2. Total computation time of the unification protocols only (UNIFI-KC and
   UNIFI) over all players.
3. Total message size.
We ran three experiment sets, where each set tested the dependence of the
above measures on a different parameter:
• N, the number of transactions in the unified database,
• M, the number of players, and
• s, the threshold support size.
In our basic configuration, we took $N = 500{,}000$, $M = 10$, and
$s = 0.1$. In the first experiment set, we kept M and s fixed and tested
several values of N. In the second experiment set, we kept N and s fixed and
varied M. In the third set, we kept N and M fixed and varied s. The results
in each of those experiment sets are shown in Section 6.4.
All experiments were implemented in C# (.NET 4) and were executed on an
Intel(R) Core(TM) i7-2620M personal computer with a 2.7 GHz CPU, 8 GB of
RAM, and the 64-bit operating system Windows 7 Professional SP1.
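For readers who wish to set up a comparable experiment, the horizontal split of Section 6.2 could be sketched as below. The experiments themselves were implemented in C#; this Python fragment is only an illustration, the function names are hypothetical, and it assumes that "ignored" means an out-of-range draw is discarded and drawn again.

```python
import random

def draw_weights(M, rng, mean=1.0, variance=0.1, low=0.1, high=1.9):
    """Draw M weights from N(mean, variance), redrawing values outside [low, high]."""
    std = variance ** 0.5
    weights = []
    while len(weights) < M:
        w = rng.gauss(mean, std)
        if low <= w <= high:  # assumption: out-of-range draws are simply redrawn
            weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]  # normalize so the weights sum to 1

def split_database(D, M, rng):
    """Assign each transaction to partial database m with probability w_m."""
    weights = draw_weights(M, rng)
    parts = [[] for _ in range(M)]
    for transaction in D:
        m = rng.choices(range(M), weights=weights)[0]
        parts[m].append(transaction)
    return weights, parts

rng = random.Random(2014)
weights, parts = split_database(list(range(1000)), 10, rng)
```

Every transaction lands in exactly one partial database, and the expected size of the m-th part is $w_m N$.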
6.4 Experimental Results
Fig. 1 shows the values of the three measures that were listed in Section
6.3 as a function of N. In all of those experiments, the values of M and s
remained unchanged: $M = 10$ and $s = 0.1$. Fig. 2 shows the values of the
three measures as a function of M; here, $N = 500{,}000$ and $s = 0.1$.
Fig. 3 shows the values of the three measures as a function of s; here,
$N = 500{,}000$ and $M = 10$.
TABLE 1. Parameters for Generating the Synthetic Database
From the first set of experiments, we can see that N has little effect on
the runtime of the unification protocols, UNIFI-KC and UNIFI, or on the bit
communication cost. However, since the time to identify the globally
s-frequent item sets (see Section 3) does grow linearly with N, and that
procedure is carried out in the same manner in FDM-KC and FDM, the advantage
of Protocol FDM over FDM-KC in terms of runtime decreases with N. While for
$N = 100{,}000$, Protocol FDM is 22 times faster than Protocol FDM-KC, for
$N = 500{,}000$ it is five times faster. (The total computation times for
larger values of N retain the same pattern that emerges from Fig. 1; for
example, with $N = 10^6$ the total computation times for FDM-KC and FDM were
744.1 and 238.5 seconds, respectively, which gives an improvement factor of
3.1.)
The second set of experiments shows how the computation and communication
costs increase with M. In particular, the improvement factor in the bit
communication cost, as offered by Protocol UNIFI with respect to Protocol
UNIFI-KC, is in accord with our analysis in Section 2.5.3. Finally, the
third set of experiments shows that higher support thresholds entail smaller
computation and communication costs since the number of frequent item sets
decreases.
7 RELATED WORK
Previous work in privacy preserving data mining has considered two related
settings. One, in which the data owner and the data miner are two different
entities, and another, in which the data is distributed among several
parties who aim to jointly perform data mining on the unified corpus of data
that they hold.
Fig. 1. Computation and communication costs versus the number of
transactions N.
Fig. 2. Computation and communication costs versus the number of players M.
In the first setting, the goal is to protect the data records from the data
miner. Hence, the data owner aims at anonymizing the data prior to its
release. The main approach in this context is to apply data perturbation
[2], [11]. The idea is that the perturbed data can be used to infer general
trends in the data, without revealing original record information.
In the second setting, the goal is to perform data mining while protecting
the data records of each of the data owners from the other data owners. This
is a problem of secure multi-party computation. The usual approach here is
cryptographic rather than probabilistic. Lindell and Pinkas [22] showed how
to securely build an ID3 decision tree when the training set is distributed
horizontally. Lin et al. [21] discussed secure clustering using the EM
algorithm over horizontally distributed data. The problem of distributed
association rule mining was studied in [19], [31], [33] in the vertical
setting, where each party holds a different set of attributes, and in [18]
in the horizontal setting. The work of [26] also considered this problem in
the horizontal setting, but it considered large-scale systems in which, on
top of the parties that hold the data records (resources), there are also
managers, which are computers that assist the resources to decrypt messages;
another assumption made in [26] that distinguishes it from [18] and the
present study is that no collusions occur between the different network
nodes (resources or managers).
The problem of secure multiparty computation of the union of private sets
was studied in [7], [14], [20], as well as in [18]. Freedman et al. [14]
present a privacy-preserving protocol for set intersections. It may be used
to compute also set unions through set complements, since
$A \cup B = \overline{\overline{A} \cap \overline{B}}$. Kissner and Song [20]
present a method for representing sets as polynomials, and give several
privacy-preserving protocols for set operations using these representations.
They consider the threshold set union problem, which is closely related to
the threshold function (Definition 2.1). The communication overhead of the
solutions in those two works, as well as in that of [18] and in our
solutions, depends linearly on the size of the ground set. However, as the
protocols in [14], [20] use homomorphic encryption, while that of [18] uses
commutative encryption, their computational costs are significantly higher
than ours. The work of Brickell and Shmatikov [7] is an exception, as their
solution entails a communication overhead that is logarithmic in the size of
the ground set. However, they considered only the case of two players, and
the logarithmic communication overhead occurs only when the size of the
intersection of the two sets is bounded by a constant.
The problem of set inclusion can be seen as a simplified version of the
privacy-preserving keyword search. In that problem, the server holds a set
of pairs $\{(x_i, p_i)\}_{i=1}^{n}$, where $x_i$ are distinct keywords, and
the client holds a single value w. If w is one of the server's keywords,
i.e., $w = x_i$ for some $1 \le i \le n$, the client should get the
corresponding $p_i$. In case w differs from all $x_i$, the client should get
notified of that. The privacy requirements are that the server gets no
information about w and that the client gets no information about other
pairs in the server's database. This problem was solved by Freedman et al.
[13]. If we take all $p_i$ to be the empty string, then the only information
the client gets is whether or not w is in the set $\{x_1, \ldots, x_n\}$.
Hence, in that case the privacy-preserving keyword search problem reduces to
the set inclusion problem. Another solution for the set inclusion problem
was recently proposed in [30], using a protocol for oblivious polynomial
evaluation.
Fig. 3. Computation and communication costs versus the support threshold s.
8 CONCLUSION
We proposed a protocol for secure mining of association rules in
horizontally distributed databases that improves significantly upon the
current leading protocol [18] in terms of privacy and efficiency. One of the
main ingredients in our proposed protocol is a novel secure multi-party
protocol for computing the union (or intersection) of private subsets that
each of the interacting players hold. Another ingredient is a protocol that
tests the inclusion of an element held by one player in a subset held by
another. Those protocols exploit the fact that the underlying problem is of
interest only when the number of players is greater than two.
One research problem that this study suggests was described in Section 3;
namely, to devise an efficient protocol for inequality verifications that
uses the existence of a semi-honest third party. Such a protocol might make
it possible to further improve upon the communication and computational
costs of the second and third stages of the protocol of [18], as described
in Sections 3 and 4. Other research problems that this study suggests are
the application of the techniques presented here to the problem of
distributed association rule mining in the vertical setting [31], [33], the
problem of mining generalized association rules [27], and the problem of
subgroup discovery in horizontally partitioned data [16].
ACKNOWLEDGMENTS
The author thanks Keren Mendiuk for her help in conducting the experiments.
REFERENCES
[1] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules
    in Large Databases, Proc. 20th Int'l Conf. Very Large Data Bases (VLDB),
    pp. 487-499, 1994.
[2] R. Agrawal and R. Srikant, Privacy-Preserving Data Mining, Proc. ACM
    SIGMOD Conf., pp. 439-450, 2000.
[3] D. Beaver, S. Micali, and P. Rogaway, The Round Complexity of Secure
    Protocols, Proc. 22nd Ann. ACM Symp. Theory of Computing (STOC),
    pp. 503-513, 1990.
[4] M. Bellare, R. Canetti, and H. Krawczyk, Keying Hash Functions for
    Message Authentication, Proc. 16th Ann. Int'l Cryptology Conf. Advances
    in Cryptology (Crypto), pp. 1-15, 1996.
[5] A. Ben-David, N. Nisan, and B. Pinkas, FairplayMP - A System for Secure
    Multi-Party Computation, Proc. 15th ACM Conf. Computer and Comm.
    Security (CCS), pp. 257-266, 2008.
[6] J.C. Benaloh, Secret Sharing Homomorphisms: Keeping Shares of a Secret
    Secret, Proc. Advances in Cryptology (Crypto), pp. 251-260, 1986.
[7] J. Brickell and V. Shmatikov, Privacy-Preserving Graph Algorithms in the
    Semi-Honest Model, Proc. 11th Int'l Conf. Theory and Application of
    Cryptology and Information Security (ASIACRYPT), pp. 236-252, 2005.
[8] D.W.L. Cheung, J. Han, V.T.Y. Ng, A.W.C. Fu, and Y. Fu, A
FastDistributed Algorithm for Mining Association Rules, Proc.
FourthIntl Conf. Parallel and Distributed Information Systems
(PDIS),pp. 31-42, 1996.
[9] D.W.L. Cheung, V.T.Y. Ng, A.W.C. Fu, and Y. Fu, "Efficient Mining of Association Rules in Distributed Databases," IEEE Trans. Knowledge and Data Eng., vol. 8, no. 6, Dec. 1996.
[10] T. ElGamal, "A Public Key Cryptosystem and a Signature Scheme Based on Discrete Logarithms," IEEE Trans. Information Theory, vol. IT-31, no. 4, July 1985.
[11] A.V. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, "Privacy Preserving Mining of Association Rules," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 217-228, 2002.
[12] R. Fagin, M. Naor, and P. Winkler, "Comparing Information without Leaking It," Comm. ACM, vol. 39, pp. 77-85, 1996.
[13] M. Freedman, Y. Ishai, B. Pinkas, and O. Reingold, "Keyword Search and Oblivious Pseudorandom Functions," Proc. Second Int'l Conf. Theory of Cryptography (TCC), pp. 303-324, 2005.
[14] M.J. Freedman, K. Nissim, and B. Pinkas, "Efficient Private Matching and Set Intersection," Proc. Int'l Conf. Theory and Applications of Cryptographic Techniques (EUROCRYPT), pp. 1-19, 2004.
[15] O. Goldreich, S. Micali, and A. Wigderson, "How to Play Any Mental Game or a Completeness Theorem for Protocols with Honest Majority," Proc. 19th Ann. ACM Symp. Theory of Computing (STOC), pp. 218-229, 1987.
[16] H. Grosskreutz, B. Lemmen, and S. Rüping, "Secure Distributed Subgroup Discovery in Horizontally Partitioned Data," Trans. Data Privacy, vol. 4, no. 3, pp. 147-165, 2011.
[17] W. Jiang and C. Clifton, "A Secure Distributed Framework for Achieving k-Anonymity," The VLDB J., vol. 15, pp. 316-333, 2006.
[18] M. Kantarcioglu and C. Clifton, "Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data," IEEE Trans. Knowledge and Data Eng., vol. 16, no. 9, pp. 1026-1037, Sept. 2004.
[19] M. Kantarcioglu, R. Nix, and J. Vaidya, "An Efficient Approximate Protocol for Privacy-Preserving Association Rule Mining," Proc. 13th Pacific-Asia Conf. Advances in Knowledge Discovery and Data Mining (PAKDD), pp. 515-524, 2009.
[20] L. Kissner and D.X. Song, "Privacy-Preserving Set Operations," Proc. 25th Ann. Int'l Cryptology Conf. (CRYPTO), pp. 241-257, 2005.
[21] X. Lin, C. Clifton, and M.Y. Zhu, "Privacy-Preserving Clustering with Distributed EM Mixture Modeling," Knowledge and Information Systems, vol. 8, pp. 68-81, 2005.
[22] Y. Lindell and B. Pinkas, "Privacy Preserving Data Mining," Proc. Crypto, pp. 36-54, 2000.
[23] J.S. Park, M.S. Chen, and P.S. Yu, "An Effective Hash Based Algorithm for Mining Association Rules," Proc. ACM SIGMOD Conf., pp. 175-186, 1995.
[24] S.C. Pohlig and M.E. Hellman, "An Improved Algorithm for Computing Logarithms over GF(p) and Its Cryptographic Significance," IEEE Trans. Information Theory, vol. IT-24, no. 1, pp. 106-110, Jan. 1978.
[25] R.L. Rivest, A. Shamir, and L.M. Adleman, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems," Comm. ACM, vol. 21, no. 2, pp. 120-126, 1978.
[26] A. Schuster, R. Wolff, and B. Gilburd, "Privacy-Preserving Association Rule Mining in Large-Scale Distributed Systems," Proc. IEEE Int'l Symp. Cluster Computing and the Grid (CCGRID), pp. 411-418, 2004.
[27] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," Proc. Int'l Conf. Very Large Data Bases (VLDB), pp. 407-419, 1995.
[28] T. Tassa and D. Cohen, "Anonymization of Centralized and Distributed Social Networks by Sequential Clustering," IEEE Trans. Knowledge and Data Eng., vol. 25, no. 2, pp. 311-324, Feb. 2013.
[29] T. Tassa and E. Gudes, "Secure Distributed Computation of Anonymized Views of Shared Databases," Trans. Database Systems, vol. 37, article 11, 2012.
[30] T. Tassa, A. Jarrous, and J. Ben-Yaakov, "Oblivious Evaluation of Multivariate Polynomials," J. Mathematical Cryptology, vol. 7, pp. 1-29, 2013.
[31] J. Vaidya and C. Clifton, "Privacy Preserving Association Rule Mining in Vertically Partitioned Data," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 639-644, 2002.
[32] A.C. Yao, "Protocols for Secure Computation," Proc. 23rd Ann. Symp. Foundations of Computer Science (FOCS), pp. 160-164, 1982.
[33] J. Zhan, S. Matwin, and L. Chang, "Privacy Preserving Collaborative Association Rule Mining," Proc. 19th Ann. IFIP WG 11.3 Working Conf. Data and Applications Security, pp. 153-165, 2005.
[34] S. Zhong, Z. Yang, and R.N. Wright, "Privacy-Enhancing k-Anonymization of Customer Data," Proc. ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS), pp. 139-147, 2005.
Tamir Tassa received the PhD degree in applied mathematics from
Tel Aviv University in 1993. He is an associate professor at the
Department of Mathematics and Computer Science at The Open
University of Israel. Previously, he served as a lecturer and a
researcher in the School of Mathematical Sciences at Tel Aviv
University, and in the Department of Computer Science at Ben Gurion
University. During the years 1993-1996, he served as an assistant
professor of computational and applied mathematics at the
University of California, Los Angeles. His research interests
include cryptography, privacy-preserving data publishing and data
mining.