Security in Outsourced Association Rule Mining
Jan 18, 2016
Agenda
  Introduction
  Approximate randomized technique
  Encryption
  Summary and future work
Introduction
Data mining in a company
  Learn about the past activities of its customers
  Make strategic decisions
Types of data mining
  Association rule mining
  Clustering
  Classification
Association rules
“X => Y”: if a transaction contains itemset X, it will probably also contain itemset Y
Support: number of transactions that support the rule (contain both X and Y)
Confidence: proportion of the transactions containing X that also contain Y
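The two measures can be computed directly from their definitions. The following is a minimal Python sketch; the function names and the toy baskets are hypothetical and only illustrate the definitions above.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: Iterable[str]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def confidence(transactions: List[Transaction],
               x: Iterable[str], y: Iterable[str]) -> float:
    """Proportion of the transactions containing X that also contain Y."""
    supp_x = support(transactions, x)
    if supp_x == 0:
        return 0.0  # the rule is never triggered
    return support(transactions, frozenset(x) | frozenset(y)) / supp_x


# Hypothetical toy data, not taken from the slides.
baskets = [frozenset(t) for t in (
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
)]

supp_rule = support(baskets, {"milk", "bread"})      # transactions with both
conf_rule = confidence(baskets, {"milk"}, {"bread"})  # milk => bread
```

Here the rule milk => bread is supported by 2 of the 4 baskets, and 2 of the 3 baskets containing milk also contain bread.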
Performing data mining
Build an application: development cost? Time?
Buy software: does it fit the requirements? Maintenance?
Outsource
Concerns in outsourcing
Output
  Execution assurance
  Correctness
Security
  Privacy of records
  Information about the company
[Diagram labels: Company, DB, Data Miner]
Approximate randomized technique
Approximate solution
Privacy Preserving Mining of Association Rules, SIGKDD 2002
Authors: Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, Johannes Gehrke
Problem formulation
Let the set of transactions be T = {t1, t2, …, tN}
Transform T to T’ = {t’1, t’2, …, t’N}
Mine in T’
Privacy breaches
  Itemset A causes a privacy breach of level p if, for some item a in A,
  P[a in ti | A in t’i] >= p
Select-a-size randomization
For each transaction ti in T
  Let m = length of ti
  Randomly select an integer j from [0, m] (non-uniformly)
  Copy j items of ti, chosen uniformly at random, to t’i
  For every item a not in ti, add a to t’i with a given probability pm
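The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name, the `size_weights` parameter (the non-uniform distribution over j) and the use of `random.Random` are all choices made for this sketch.

```python
import random
from typing import List, Set


def select_a_size(t: Set[str], all_items: Set[str],
                  size_weights: List[float], p_m: float,
                  rng: random.Random) -> Set[str]:
    """Randomize one transaction t into t'.

    size_weights[j] is the (assumed) weight of keeping exactly j items,
    for j = 0..m where m = len(t).
    """
    m = len(t)
    if len(size_weights) != m + 1:
        raise ValueError("need one weight for each j in [0, m]")
    # Select j from [0, m] according to the non-uniform weights.
    j = rng.choices(range(m + 1), weights=size_weights, k=1)[0]
    # Copy j items of t, chosen uniformly at random.
    kept = set(rng.sample(sorted(t), j))
    # Every item not in t is added independently with probability p_m.
    noise = {a for a in sorted(all_items - t) if rng.random() < p_m}
    return kept | noise


items = {"a", "b", "c", "d", "e", "f"}
t = {"a", "b", "c"}
# Weights that always keep exactly 2 of the 3 items, and no noise.
keep_two = select_a_size(t, items, [0, 0, 1, 0], 0.0, random.Random(0))
# Weights that keep nothing, with every outside item added.
only_noise = select_a_size(t, items, [1, 0, 0, 0], 1.0, random.Random(0))
```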
Run on real data
Privacy breach level at most 50%: P[a in ti | A in t’i] <= 50%
Accuracy = # true positives / # found itemsets
Set 1
Itemset size | True itemsets | True positives | False drops | False positives | Accuracy
1            | 65            | 65             | 0           | 0               | 100%
2            | 228           | 212            | 16          | 28              | 88%
3            | 22            | 18             | 4           | 5               | 78%
Accuracy
Set 2
Itemset size | True itemsets | True positives | False drops | False positives | Accuracy
1            | 266           | 254            | 12          | 31              | 89%
2            | 217           | 195            | 22          | 45              | 81%
3            | 48            | 43             | 5           | 26              | 62%
Problems
Estimated counts of large itemsets vary
  Lower accuracy of association rules
"Beer and diapers" story
  Customers who buy diapers also tend to buy beer
  Some strange rules are hard to believe
Wrong decisions are expensive
  Supermarket: layout design
  Health center: identifying a new disease
Security concerns
Individual transactions are protected
Private association rules can be estimated by other parties
Adversary actions may be based on the association rules found
Encryption
Problem formulation
Let the set of transactions be T = {t1, t2, …, tN}
I is the entire set of items; every ti is a subset of I
Transform T to T’ = {t’1, t’2, …, t’N}
A third party mines T’ and gets AR’
Transform AR’ to AR
Architecture
[Diagram labels: DB, DB, Transformer, Association Rules, Association Rules, Mappings]
Encryption
To protect a message, simple encryption can be applied
  “GOOD DOG” can be encrypted as “PLLX XLP”
Association rule encryption
  752 => 891? Milk => Bread
Transaction encryption
  <8, 69, 153, 756>? <Cheese, Fork, Ice-cream, Clock>
Simple scheme
Encryption
  For every transaction ti
    For every item x in ti
      Add f(x) to t’i, where f is a bijective function
Decryption
  For every association rule ri
    For every item y in ri
      Replace y by f^-1(y)
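A minimal Python sketch of this simple scheme follows. The helper names and the choice of the code space 1..|I| are assumptions made for illustration; the slides only require f to be a bijection.

```python
import random
from typing import Dict, List, Set, Tuple


def make_bijection(items: Set[str], rng: random.Random) -> Dict[str, int]:
    """Random one-to-one map from items to the codes 1..|I| (assumed code space)."""
    codes = list(range(1, len(items) + 1))
    rng.shuffle(codes)
    return dict(zip(sorted(items), codes))


def encrypt_transactions(transactions: List[Set[str]],
                         f: Dict[str, int]) -> List[Set[int]]:
    """Replace every item x of every transaction by f(x)."""
    return [{f[x] for x in t} for t in transactions]


def decrypt_rule(rule: Tuple[Set[int], Set[int]],
                 f_inv: Dict[int, str]) -> Tuple[Set[str], Set[str]]:
    """Replace every code y of an encrypted rule by f^-1(y)."""
    lhs, rhs = rule
    return {f_inv[y] for y in lhs}, {f_inv[y] for y in rhs}


f = make_bijection({"milk", "bread", "eggs"}, random.Random(1))
f_inv = {code: item for item, code in f.items()}

encrypted = encrypt_transactions([{"milk", "bread"}, {"eggs"}], f)
# Suppose the miner returns the encrypted rule f(milk) => f(bread).
recovered = decrypt_rule(({f["milk"]}, {f["bread"]}), f_inv)
```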
Problems with simple encryption
Simple schemes are easy to crack
  “PLLX XLP”
    26P3 combinations, with at least one vowel
  Association rules
    # Bread > # Car
  The number of association rules and the number of large itemsets are disclosed
Solution
  Use a more complex scheme
Fake items
Probability of correctly guessing a single mapping = 1 / |I|
Randomly add some fake items to each transaction
  Decreases the above probability to 1 / (|I| + |F|)
One-to-n Mapping
Originally, we use a “one-to-one” mapping: one item maps to one item
  A → 1, B → 2, C → 3
We form a “one-to-n” mapping
  A → 1, 4, 5; B → 2; C → 3, 5
Greatly increases the number of possible mappings of an item:
  $\binom{|I|+|F|}{1} + \binom{|I|+|F|}{2} + \dots + \binom{|I|+|F|}{|F|}$
Example transformation
T = {A} {B} {C} {A, B} {A, C} {B, C} {A, B, C}
T’ = {1, 4, 5} {2} {3, 5} {1, 2, 4, 5} {1, 3, 4, 5} {2, 3, 5} {1, 2, 3, 4, 5}
Mapping f: A → {1, 4, 5}, B → {2}, C → {3, 5}
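The example can be reproduced with a few lines of Python: each transaction is replaced by the union of the code sets of its items. The function name `transform` is hypothetical; the mapping is the one from this slide.

```python
from typing import Dict, List, Set

# One-to-n mapping from the example: A -> {1, 4, 5}, B -> {2}, C -> {3, 5}
f: Dict[str, Set[int]] = {"A": {1, 4, 5}, "B": {2}, "C": {3, 5}}


def transform(t: Set[str], mapping: Dict[str, Set[int]]) -> Set[int]:
    """Union of the code sets of all items in transaction t."""
    out: Set[int] = set()
    for x in t:
        out |= mapping[x]
    return out


T: List[Set[str]] = [{"A"}, {"B"}, {"C"}, {"A", "B"},
                     {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
T_prime = [transform(t, f) for t in T]
```

Note that {A, C} becomes f(A) ∪ f(C) = {1, 3, 4, 5}, because the shared code 5 appears only once in the union.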
Limitation on the mapping f
For any item x, there must not exist items y1, y2, …, yk (all different from x) such that f(x) ⊆ f(y1) ∪ f(y2) ∪ … ∪ f(yk)
Consider an example
  A → 1, 2; B → 2, 3; C → 3, 4
  AC → 1, 2, 3, 4
  ABC → 1, 2, 3, 4
  AC and ABC become indistinguishable because f(B) ⊆ f(A) ∪ f(C)
Limitation on the mapping f
For any item x: $f(x) \setminus \bigcup_{i \in I,\ i \neq x} f(i) \neq \emptyset$
Every item must map to something unique
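This constraint is easy to check mechanically. The sketch below (the function name is hypothetical) tests whether every item keeps at least one code that no other item maps to, using the two mappings that appear in the examples.

```python
from typing import Dict, Set


def every_item_has_unique_code(f: Dict[str, Set[int]]) -> bool:
    """True if, for every item x, f(x) minus the codes of all other items is non-empty."""
    for x, codes in f.items():
        # Union of the code sets of all items other than x.
        others = set().union(*(c for y, c in f.items() if y != x))
        if not (codes - others):
            return False
    return True


# Mapping from the counter-example: f(B) is covered by f(A) and f(C).
bad = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}
# Mapping from the transformation example: 1, 2 and 3 are unique codes.
good = {"A": {1, 4, 5}, "B": {2}, "C": {3, 5}}

bad_ok = every_item_has_unique_code(bad)
good_ok = every_item_has_unique_code(good)
```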
Mapping generation – Item Extend
Initialize every item to map to something unique (the core set I’)
For every item x in IE
  Randomly pick some mappings
  Extend each picked mapping by x
Example run
Start: A → 1, B → 2, C → 3; IE = {4, 5}
Considering item 4
  Before: A → 1, B → 2, C → 3
  Pick A
  After: A → 1, 4; B → 2; C → 3
Considering item 5
  Before: A → 1, 4; B → 2; C → 3
  Pick A, C
  After: A → 1, 4, 5; B → 2; C → 3, 5
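A minimal sketch of Item Extend in Python is shown below. How the mappings are "randomly picked" is not specified in the slides, so the independent coin flip with `pick_prob` is an assumption of this sketch, as are the function and parameter names.

```python
import random
from typing import Dict, List, Set


def item_extend(items: List[str], extend_codes: List[int],
                rng: random.Random, pick_prob: float = 0.5) -> Dict[str, Set[int]]:
    """Build a one-to-n mapping: unique core codes first, then random extensions."""
    core_codes = range(1, len(items) + 1)
    if any(code in core_codes for code in extend_codes):
        raise ValueError("extension codes must not collide with core codes")
    # Step 1: every item maps to its own unique core code (the set I').
    f: Dict[str, Set[int]] = {x: {k} for k, x in enumerate(items, start=1)}
    # Step 2: each extension code is appended to a random selection of mappings.
    for code in extend_codes:
        for x in items:
            if rng.random() < pick_prob:  # assumed picking rule
                f[x].add(code)
    return f


mapping = item_extend(["A", "B", "C"], [4, 5], random.Random(7))
```

Because the core codes are never shared, the result always satisfies the requirement that every item maps to something unique, whichever mappings are picked.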
Item Extend
Every item must map to something unique
  Say 1 is unique to f(A); then suppT(A) = suppT’(1)
For a transaction t without item A
  Add a subset of the unique mapping set of f(A) to t’ with some probability
  {1, 4} is the unique mapping set of f(A), so {}, {1}, {4} or {1, 4} may be added
Mapping f: A → {1, 4, 5}, B → {2}, C → {3, 5}
Fake items again
Now, every item in t’i must be in some mapping
Randomly add some fake items from F to each transaction
Mapping f: I -> I’ U IE U F
  I’: core “unique” items
  IE: expanding items
  F: fake items
Basic transformation framework
For each transaction t
  For each item x in t
    Add f(x) to t’
  For each item i in I - t
    Add a random subset of the unique mapping set of f(i) to t’
  For each fake item f in F
    Toss a biased coin; add f to t’ if heads (the probability should differ from item to item)
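The framework can be sketched in Python as follows. This is an illustration only: the slides do not say how the random subset is drawn, so including each unique code independently with probability `p_unique` is an assumption, and all names (`transform_transaction`, `fake_probs`, `p_unique`) are hypothetical.

```python
import random
from typing import Dict, Set


def unique_mapping_set(f: Dict[str, Set[int]], x: str) -> Set[int]:
    """Codes of f(x) that no other item maps to."""
    others = set().union(*(codes for y, codes in f.items() if y != x))
    return f[x] - others


def transform_transaction(t: Set[str], f: Dict[str, Set[int]],
                          fake_probs: Dict[int, float], rng: random.Random,
                          p_unique: float = 0.2) -> Set[int]:
    """Transform one transaction t into t'."""
    t_prime: Set[int] = set()
    # 1. Add f(x) for every item x in t.
    for x in t:
        t_prime |= f[x]
    # 2. For every item not in t, add a random subset of its unique mapping set.
    for i in sorted(f):
        if i not in t:
            t_prime |= {c for c in sorted(unique_mapping_set(f, i))
                        if rng.random() < p_unique}
    # 3. Toss a biased coin per fake item; each has its own probability.
    for fake, p in sorted(fake_probs.items()):
        if rng.random() < p:
            t_prime.add(fake)
    return t_prime


f = {"A": {1, 4, 5}, "B": {2}, "C": {3, 5}}
fake_probs = {6: 0.5, 7: 0.1}  # hypothetical fake items and probabilities
noisy = transform_transaction({"B"}, f, fake_probs, random.Random(3))
plain = transform_transaction({"B"}, f, {}, random.Random(3), p_unique=0.0)
```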
Recovering association rules
Given an encrypted rule r’: X => Y in AR’
If there exist i1, i2, …, im in I such that $\bigcup_{k=1}^{m} f(i_k) = X$
and there exist j1, j2, …, jn in I such that $\bigcup_{k=1}^{n} f(j_k) = X \cup Y$
then r: {i1, i2, …, im} => {j1, j2, …, jn} – {i1, i2, …, im} is a rule in AR
Otherwise, the rule is not correct
Example
Mapping f: A → {1, 4, 5}, B → {2}, C → {3, 5}
Given
  1 => 4 (rejected)
  2 => 1, 5 (rejected)
  2 => 1, 3, 5 (rejected)
  2 => 1, 3, 4, 5 (B => AC)
  2, 3, 5 => 1, 4 (BC => A)
Decoding: {2, 3, 5} is BC and {1, 2, 3, 4, 5} is ABC
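The recovery step can be sketched in Python and checked against the example rules. The function names are hypothetical, and the decoding shortcut (collect every item whose code set is contained in the encrypted itemset) relies on the assumption that every item has a code unique to it, as required of the mapping f.

```python
from typing import Dict, Optional, Set, Tuple


def decode_itemset(codes: Set[int], f: Dict[str, Set[int]]) -> Optional[Set[str]]:
    """Items whose code sets union to exactly `codes`, or None if there are none."""
    items = {x for x, fx in f.items() if fx <= codes}
    covered = set().union(*(f[x] for x in items))
    return items if items and covered == codes else None


def recover_rule(lhs: Set[int], rhs: Set[int],
                 f: Dict[str, Set[int]]) -> Optional[Tuple[Set[str], Set[str]]]:
    """Map an encrypted rule X => Y back to a rule over real items, or reject it."""
    x_items = decode_itemset(set(lhs), f)
    xy_items = decode_itemset(set(lhs) | set(rhs), f)
    if x_items is None or xy_items is None:
        return None  # the rule is not correct
    return x_items, xy_items - x_items


f = {"A": {1, 4, 5}, "B": {2}, "C": {3, 5}}

r1 = recover_rule({1}, {4}, f)
r2 = recover_rule({2}, {1, 5}, f)
r3 = recover_rule({2}, {1, 3, 5}, f)
r4 = recover_rule({2}, {1, 3, 4, 5}, f)
r5 = recover_rule({2, 3, 5}, {1, 4}, f)
```

The first three rules are rejected, while the last two decode to B => AC and BC => A.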
Correctness
Proposition
  For any items x, y, where f is the transformation mapping
    suppT(x) = suppT’(f(x))
    suppT(x U y) = suppT’(f(x) U f(y))
  For any itemsets X, Y, where F is the transformation mapping
    suppT(X) = suppT’(F(X))
    suppT(X U Y) = suppT’(F(X) U F(Y))
No false drops and no false positives
Summary
Generation of mappings
  One-to-n mappings
  Item Extend
Transformation of transactions
  Mapping f(x)
  Subsets of unique mapping set
  Fake items
Recovering association rules
  Reverse mappings and filtering
Test run
# Items = 1k, |T| = 1k
Without transformation
  One rule
  Time: 8s
Item Extend
  147 rules
  Total time: 26s
  Mapping generation and transformation: 219ms
Future Work
Define parameters of the problem
  Size of IE
  Size of F
Give a clear measure of security
Give a clear measure of overhead
Correctness of association rules
Query execution proof
Result verification
The End
Choosing probability
A uniform distribution or any fixed distribution gives patterns that may be easily identified
Random probability distribution
  {}: 70%, {1}: 5%, {4}: 15%, {1, 4}: 20%
  Storage: needs additional storage
Algorithm for transformation
Transformation is the most costly process
Execution time is linear in the database size |T|
Should be as fast as possible
Optimization
Mapping retrieval
  For an item x, use a hash table to retrieve the mapping h(x)
Adding fake items
  First randomly determine the number of items to add (according to the probability of adding items)
  Then randomly pick that many items from the set (non-uniform distribution)
  Gives a much shorter runtime on average
Choice of mapped items
Range of mapped items: 1, 2, …, |I| + |IE| + |F| * (1 + δ)
Acceptable as long as it is not easy to identify I’, IE and F
One way is to use a random permutation of the first |I| + |IE| + |F| natural numbers
  The first |I| numbers are mapped to I’
  The next |IE| numbers are IE
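The random-permutation approach can be sketched as below. The function name is hypothetical, and assigning the numbers left over after I’ and IE to the fake items F is an assumption of this sketch; the slide only states the first two slices.

```python
import random
from typing import List, Tuple


def assign_code_pools(n_core: int, n_extend: int, n_fake: int,
                      rng: random.Random) -> Tuple[List[int], List[int], List[int]]:
    """Split a random permutation of 1..(n_core + n_extend + n_fake) into three pools."""
    codes = list(range(1, n_core + n_extend + n_fake + 1))
    rng.shuffle(codes)                           # random permutation
    core = codes[:n_core]                        # first |I| numbers: I'
    extend = codes[n_core:n_core + n_extend]     # next |IE| numbers: IE
    fake = codes[n_core + n_extend:]             # remainder: F (assumed)
    return core, extend, fake


core, extend, fake = assign_code_pools(3, 2, 2, random.Random(42))
```

Because the permutation is random, the numeric value of a code gives no hint about which pool it belongs to.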
Cut and paste randomization
One case of select-a-size randomization
  Defines the way to perform the selection of j
Given an integer Km > 0
  Randomly choose j in [0, Km]
  If j > m, set j = m
Overall input parameters
  Km
  pm
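The selection of j can be written in a couple of lines of Python. The slide says only that j is chosen randomly; drawing it uniformly from [0, Km] is an assumption of this sketch, and the function name is hypothetical.

```python
import random


def cut_and_paste_size(m: int, k_m: int, rng: random.Random) -> int:
    """Number of items to keep from a transaction of length m."""
    if k_m <= 0:
        raise ValueError("Km must be a positive integer")
    j = rng.randint(0, k_m)  # assumed uniform over 0..Km, both ends included
    return min(j, m)         # if j > m, set j = m


rng = random.Random(5)
sizes = [cut_and_paste_size(m=3, k_m=10, rng=rng) for _ in range(200)]
```

The resulting j would then be used as the number of items copied in the select-a-size procedure.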
Effects on support
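Notation: A’ means that A is absent from t
Support of A in T’ comes from
  A in t, and A is not removed
  A’ in t, and A is randomly added
Support of AB in T’ comes from
  AB in t, and neither A nor B is removed
  AB’ in t, A is kept and B is randomly added
  A’B in t, B is kept and A is randomly added
  A’B’ in t, and both A and B are randomly added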
Estimating original support
Let x be the support of A in T and y the support of A in T’
  x * P(A remains in original transaction) + (|DB| - x) * pm = y
Support of AB in T is estimated from
  Support of AB in T’
  Support of AB’ and A’B in T’
  Support of A’B’ in T’
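For a single item, the equation x * P(A remains) + (|DB| - x) * pm = y can be solved for x, giving x = (y - |DB| * pm) / (P(A remains) - pm). A minimal Python sketch, with hypothetical names and hypothetical numbers:

```python
def estimate_original_support(y: float, db_size: int,
                              p_keep: float, p_m: float) -> float:
    """Estimate the support x of an item in T from its support y in T'.

    p_keep stands for P(A remains in original transaction);
    p_m is the probability of adding an item that was absent.
    """
    if p_keep == p_m:
        raise ValueError("p_keep and p_m must differ, otherwise x cannot be recovered")
    return (y - db_size * p_m) / (p_keep - p_m)


# Hypothetical numbers: |DB| = 100, true support 40, p_keep = 0.5, p_m = 0.1.
# Expected randomized support: 40 * 0.5 + 60 * 0.1 = 26.
x_hat = estimate_original_support(y=26, db_size=100, p_keep=0.5, p_m=0.1)
```

With an observed y that deviates from its expectation, the estimate deviates too, which is the source of the accuracy loss reported earlier.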
Apriori property
Suppose m = 2 for all t in T, |T| = 10, I = {A, B}, pm = 0, j = 1
  suppT’(B) = 0, so E(suppT(B)) = 0
  suppT’(A) = 10
  suppT’(AB) = 0
  E(suppT(AB)) = suppT’(A) * 1 = 10
Apriori property
An itemset expected to be large may have a subset expected to be small
But generally the supports of subsets are not too small
Instead of using the support threshold to filter out all small candidates, use a smaller value
Apriori algorithm
Generate candidate sets
Scan database for counts
Recover the predicted support
Discard candidates with support below the candidate limit
Save for output the candidates with support >= support threshold
Apriori_gen(remaining candidates)
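The filtering step that distinguishes this variant from plain Apriori can be sketched as follows. The function name and the example supports are hypothetical, and keeping candidates whose support equals the candidate limit is an assumption about the boundary case.

```python
from typing import Dict, FrozenSet, Set, Tuple

Itemset = FrozenSet[str]


def split_candidates(est_support: Dict[Itemset, float], s_min: float,
                     candidate_limit: float) -> Tuple[Set[Itemset], Set[Itemset]]:
    """Return (remaining, output) for one Apriori level.

    remaining: candidates kept for Apriori_gen (support >= candidate limit)
    output:    candidates reported as large (support >= support threshold)
    """
    if candidate_limit > s_min:
        raise ValueError("the candidate limit should not exceed the support threshold")
    remaining = {c for c, s in est_support.items() if s >= candidate_limit}
    output = {c for c in remaining if est_support[c] >= s_min}
    return remaining, output


# Hypothetical recovered supports for three 1-itemsets.
est = {frozenset({"A"}): 12.0, frozenset({"B"}): 8.5, frozenset({"C"}): 3.0}
remaining, output = split_candidates(est, s_min=10.0, candidate_limit=8.0)
```

Here B is not reported as large, but it stays in the candidate pool so that a large superset such as AB can still be generated at the next level.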
Candidate limit
A high value
  Increases the number of false drops
  Poor correctness
A small value
  Increases the number of candidate sets
  High running time
Experiment
  Support threshold: smin
  Estimated s.d.: δ
  smin – δ is found to be a good value
Other applications
Outsourced transaction database (secure) storage
Outsourced association rule mining using data stream
Secure distributed association rule mining with third party miner
Outsourced database with association rule mining service
[Diagram labels: DB, Transformer, Association Rules, Association Rules, Mappings, Transactions, Query]