Security in Outsourced Association Rule Mining. Agenda Introduction Approximate randomized technique Encryption Summary and future work.

Security in Outsourced Association Rule Mining

Agenda

Introduction; approximate randomized technique; encryption; summary and future work

Introduction

Companies use data mining to learn about the past activities of their customers and to make strategic decisions

Types of data mining: association rule mining, clustering, classification

Association rules

“X => Y”: if a transaction contains itemset X, the transaction will probably contain itemset Y

Support: number of transactions that support the rule, i.e. contain both X and Y

Confidence: proportion of the transactions containing X that also contain Y
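The two measures above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy baskets are assumptions made for the example, not part of the slides.

```python
def support(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    supp_x = support(transactions, x)
    if supp_x == 0:
        return 0.0
    return support(transactions, set(x) | set(y)) / supp_x


# Hypothetical toy data, not taken from the slides.
baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"milk"}, {"bread"}]
milk_bread_support = support(baskets, {"milk", "bread"})
milk_to_bread_confidence = confidence(baskets, {"milk"}, {"bread"})
```

Here the support of {milk, bread} is counted over all baskets, while the confidence of milk => bread is measured only over the baskets that contain milk.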

Performing data mining

Build an application: development cost? Time?

Buy software: does it fit the requirements? Maintenance?

Outsource

Concerns in outsourcing

Output: execution assurance, correctness

Security: privacy of records, information about the company

[Figure: Company, DB, Data Miner]

Approximate randomized technique

Approximate solution

Privacy Preserving Mining of Association Rules, SIGKDD 2002. Authors: Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, Johannes Gehrke

Problem formulation

Let the set of transactions be T = {t1, t2, …, tN}

Transform T to T’ = {t’1, t’2, …, t’N} and mine in T’

Privacy breaches

Itemset A causes a privacy breach of level p if, for some item a in A,

P[a in ti | A in t’i] >= p

Select-a-size randomization

For each transaction ti in T, with m = length of ti:

1. Select an integer j from [0, m] at random (non-uniformly)

2. Copy j items of ti, chosen uniformly at random, to t’i

3. For every item a not in ti, add a to t’i with a given probability pm
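A minimal sketch of the randomization of one transaction, following the steps above. The slides do not give the non-uniform distribution of j, so it is passed in here as a hypothetical `size_weights` list; the function and parameter names are assumptions too.

```python
import random


def select_a_size(t, all_items, size_weights, p_m, rng=random):
    """Randomize one transaction t.

    size_weights[j] is the assumed relative probability of keeping j items,
    for j = 0..len(t); p_m is the probability of inserting an item not in t.
    """
    t = sorted(t)
    m = len(t)
    # Non-uniform choice of j from [0, m].
    j = rng.choices(range(m + 1), weights=size_weights, k=1)[0]
    # Copy j items of t, chosen uniformly at random.
    t_prime = set(rng.sample(t, j))
    # Each item outside t is added independently with probability p_m.
    for a in all_items:
        if a not in t and rng.random() < p_m:
            t_prime.add(a)
    return t_prime
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which helps when comparing parameter settings.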

Run on real data

Privacy breach level of at most 50%: P[a in ti | A in t’i] <= 50%

Accuracy = (# true positives) / (# found itemsets)

Set 1

Itemset size | True itemsets | True positives | False drops | False positives | Accuracy
1            | 65            | 65             | 0           | 0               | 100%
2            | 228           | 212            | 16          | 28              | 88%
3            | 22            | 18             | 4           | 5               | 78%

Accuracy

Set 2

Itemset size | True itemsets | True positives | False drops | False positives | Accuracy
1            | 266           | 254            | 12          | 31              | 89%
2            | 217           | 195            | 22          | 45              | 81%
3            | 48            | 43             | 5           | 26              | 62%
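As a check on the accuracy figures, the arithmetic can be reproduced from the reported counts. The sketch assumes that the found itemsets are the true positives plus the false positives; the variable names are illustrative.

```python
# (itemset size, true positives, false positives) as reported for the two sets.
set_1 = [(1, 65, 0), (2, 212, 28), (3, 18, 5)]
set_2 = [(1, 254, 31), (2, 195, 45), (3, 43, 26)]


def accuracy(true_positives, false_positives):
    """Accuracy = true positives / all itemsets found (assumed to be TP + FP)."""
    return true_positives / (true_positives + false_positives)


# Rounded percentages per itemset size.
accuracies_1 = {size: round(100 * accuracy(tp, fp)) for size, tp, fp in set_1}
accuracies_2 = {size: round(100 * accuracy(tp, fp)) for size, tp, fp in set_2}
```

Under that assumption the rounded values agree with the accuracy column reported for both sets.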

Problems

Estimated counts of large itemsets vary, which lowers the accuracy of the association rules

The "beer and diapers" story: customers who buy diapers also tend to buy beer. Some strange rules, like this one, are hard to believe

It is expensive to make a wrong decision. Supermarket: layout design. Health center: identifying a new disease

Security concerns

Individual transactions are protected

Private association rules can be estimated by other parties

Adversary actions may be based on the association rules found

Encryption

Problem formulation

Let the set of transactions be T = {t1, t2, …, tN}

I is the entire set of items; every ti is a subset of I

Transform T to T’ = {t’1, t’2, …, t’N}

A third party mines in T’ and gets AR’

Transform AR’ to AR

Architecture

[Figure: architecture with DB, Transformer, Mappings, Association Rules]

Encryption

To protect a message, simple encryption can be applied: “GOOD DOG” can be encrypted as “PLLX XLP”

Association rule encryption: the miner sees 752 => 891 instead of Milk => Bread

Transaction encryption: the miner sees <8, 69, 153, 756> instead of <Cheese, Fork, Ice-cream, Clock>

Simple scheme

Encryption

For every transaction ti:
  For every item x in ti: add f(x) to t’i, where f is a bijective function

Decryption

For every association rule ri:
  For every item y in ri: replace y by f^-1(y)
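A minimal sketch of the simple scheme, assuming the bijection f is stored as a dictionary. The item codes below reuse the illustrative numbers from the encryption slide; the function names are hypothetical.

```python
def encrypt_transactions(transactions, f):
    """Replace every item x of every transaction by its code f(x)."""
    return [{f[x] for x in t} for t in transactions]


def decrypt_rule(rule, f):
    """Map both sides of an encrypted rule back through the inverse of f."""
    f_inv = {code: item for item, code in f.items()}
    lhs, rhs = rule
    return ({f_inv[y] for y in lhs}, {f_inv[y] for y in rhs})


# Illustrative one-to-one mapping (an assumption, not the deck's real table).
f = {"Milk": 752, "Bread": 891, "Cheese": 8}
encrypted = encrypt_transactions([{"Milk", "Bread"}, {"Cheese"}], f)
decoded_rule = decrypt_rule(({752}, {891}), f)
```

Because f is a bijection, inverting the dictionary is enough to decode any rule the miner returns.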

Problems with simple encryption

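They are easy to crack

“PLLX XLP”: three distinct letters, so only 26P3 = 15,600 candidate mappings, and at least one of the letters must be a vowel

Association rules: item frequencies survive the mapping, e.g. # Bread > # Car

The number of association rules and the number of large itemsets are disclosed

Solution: use a more complex scheme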

Fake items

Probability of making a correct guess of a single mapping = 1 / |I|

Randomly add some fake items to each transaction

This decreases the above probability to 1 / (|I| + |F|)

One-to-n Mapping

Originally, we use a “one-to-one” mapping: one item maps to one item

A → 1; B → 2; C → 3

We form a “one-to-n” mapping

A → 1, 4, 5; B → 2; C → 3, 5

This greatly increases the number of possible mappings of an item:

$\binom{|I|+|F|}{1} + \binom{|I|+|F|}{2} + \dots + \binom{|I|+|F|}{|F|}$
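To get a feel for the size of this sum, it can be evaluated directly. The sketch below only restates the expression above with `math.comb`; the function name and the sample sizes are assumptions.

```python
from math import comb


def possible_mappings(num_items, num_fake):
    """Sum of C(|I|+|F|, k) for k = 1..|F|, as in the expression above."""
    n = num_items + num_fake
    return sum(comb(n, k) for k in range(1, num_fake + 1))


# Hypothetical sizes: |I| = 3 items and |F| = 2 fake items.
small_case = possible_mappings(3, 2)
```

For comparison, a one-to-one mapping leaves only |I| choices for the image of an item.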

Example transformation

T = {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}

T’ = {1, 4, 5}, {2}, {3, 5}, {1, 2, 4, 5}, {1, 3, 4, 5}, {2, 3, 5}, {1, 2, 3, 4, 5}

Mapping f: A → 1, 4, 5; B → 2; C → 3, 5
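The example can be replayed mechanically: each transaction becomes the union of the mapped sets of its items. This is a sketch with an assumed function name, using the mapping from the example.

```python
# One-to-n mapping from the example.
f = {"A": {1, 4, 5}, "B": {2}, "C": {3, 5}}


def transform(t, f):
    """Replace every item of t by all of its mapped items."""
    out = set()
    for x in t:
        out |= f[x]
    return out


T = [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
T_prime = [transform(t, f) for t in T]
```

Note that {A, C} maps to f(A) ∪ f(C) = {1, 3, 4, 5}, since 4 belongs to f(A).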

Limitation on the mapping f

For any item x, there must not exist items y1, y2, …, yk, all different from x, such that

$f(x) \subseteq f(y_1) \cup f(y_2) \cup \dots \cup f(y_k)$

Consider an example: A → 1, 2; B → 2, 3; C → 3, 4

Then AC → 1, 2, 3, 4 and ABC → 1, 2, 3, 4, so AC and ABC cannot be told apart

Limitation on the mapping f

For any item x:

$f(x) - \bigcup_{i \in I,\ i \neq x} f(i) \neq \emptyset$

Every item must map to something unique
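The uniqueness condition is easy to test for a candidate mapping. A minimal sketch, with assumed function names, that computes the mapped items owned by a single item and checks that no item is left without one:

```python
def unique_items(f, x):
    """Mapped items of x that appear in no other item's mapping."""
    others = set()
    for i, image in f.items():
        if i != x:
            others |= image
    return f[x] - others


def is_valid_mapping(f):
    """True if every item keeps at least one mapped item of its own."""
    return all(unique_items(f, x) for x in f)


bad = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}   # f(B) is covered by f(A) and f(C)
good = {"A": {1, 4, 5}, "B": {2}, "C": {3, 5}}
```

A mapping that passes this check also satisfies the subset condition, because an item with a mapped item of its own cannot be covered by the mappings of other items.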

Mapping generation – Item Extend

Initialize every item to map to something unique in I’

For every item x in IE: randomly pick some mappings and extend each picked mapping by x

Example run

Start: A → 1; B → 2; C → 3; IE = {4, 5}

Considering item 4: pick A

A → 1, 4; B → 2; C → 3

Considering item 5: pick A and C

A → 1, 4, 5; B → 2; C → 3, 5
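A sketch of Item Extend under two assumptions that the slides leave open: the core items are numbered 1..|I|, and each mapping is picked independently with a probability `pick_prob`. All names are hypothetical.

```python
import random


def item_extend(items, extend_items, pick_prob=0.5, rng=random):
    """Generate a one-to-n mapping with the Item Extend procedure.

    Assumptions for illustration: core items are 1..len(items), and each
    mapping is picked independently with probability pick_prob.
    """
    # Initialise: the k-th item maps to its own unique core item k.
    f = {x: {k} for k, x in enumerate(items, start=1)}
    for e in extend_items:
        for x in items:
            # Extend the picked mappings by the expanding item e.
            if rng.random() < pick_prob:
                f[x].add(e)
    return f


mapping = item_extend(["A", "B", "C"], [4, 5], pick_prob=0.5, rng=random.Random(7))
```

Since the core item of each mapping is never shared, every item keeps something unique whatever the random picks are.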

Item Extend

Every item must map to something unique, say 1 is unique to f(A)

Then suppT(A) = suppT’(1), which exposes the support of A

For a transaction t without item A, add a subset of the unique mapping set to t’ with some probability

{1, 4} is the unique mapping set in f(A), so {}, {1}, {4} or {1, 4} may be added

Mapping f: A → 1, 4, 5; B → 2; C → 3, 5

Fake items again

Now, every item in t’i must be in some mapping

Randomly add some fake items from F to each transaction

Mapping f: I → I’ ∪ IE ∪ F

I’: core “unique” items; IE: expanding items; F: fake items

Basic transformation framework

For each transaction t:
  For each item x in t: add f(x) to t’
  For each item i in I - t: add a random subset of the unique mapping set of f(i) to t’
  For each fake item in F: toss a biased coin and add the item to t’ on heads (the probability should differ from item to item)
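The framework can be sketched for a single transaction as follows. The uniform choice among subsets is only a stand-in, and `fake_probs` (fake item to coin bias) is a hypothetical parameter; the function names are assumptions as well.

```python
import random
from itertools import combinations


def unique_set(f, x):
    """Mapped items of x that no other item maps to."""
    others = set().union(*(img for i, img in f.items() if i != x))
    return f[x] - others


def transform_transaction(t, f, fake_probs, rng=random):
    """Transform one transaction t under mapping f (a sketch)."""
    t_prime = set()
    # Items present in t contribute their full mapping.
    for x in t:
        t_prime |= f[x]
    # Items absent from t contribute a random subset of their unique set.
    for i in f:
        if i not in t:
            u = sorted(unique_set(f, i))
            subsets = [set(c) for r in range(len(u) + 1) for c in combinations(u, r)]
            t_prime |= rng.choice(subsets)  # uniform choice: placeholder only
    # Each fake item is added by its own biased coin.
    for fake, p in fake_probs.items():
        if rng.random() < p:
            t_prime.add(fake)
    return t_prime
```

In practice the subset distribution would be chosen per item rather than uniformly, since a fixed distribution may leave recognisable patterns.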

Recovering association rules

Given an encrypted rule r’: X => Y in AR’

If there exist i1, i2, …, im in I such that

$\bigcup_{k=1}^{m} f(i_k) = X$

and there exist j1, j2, …, jn in I such that

$\bigcup_{k=1}^{n} f(j_k) = X \cup Y$

then r: {i1, i2, …, im} => {j1, j2, …, jn} – {i1, i2, …, im} is a rule in AR

Otherwise, the rule is not correct

Example

Mapping f: A → 1, 4, 5; B → 2; C → 3, 5

Given:
1 => 4 (rejected)
2 => 1, 5 (rejected)
2 => 1, 3, 5 (rejected)
2 => 1, 3, 4, 5 (B => AC)
2, 3, 5 => 1, 4 (BC => A)

Decoding: 2, 3, 5 → BC; 1, 2, 3, 4, 5 → ABC
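The filtering in this example can be sketched in code. It relies on the limitation that every item maps to something unique, so an encrypted itemset decodes to at most one set of original items. The function names are assumptions.

```python
def decode(code_set, f):
    """Return the original items whose mappings exactly cover code_set, else None."""
    items = {i for i, image in f.items() if image <= code_set}
    covered = set()
    for i in items:
        covered |= f[i]
    return items if items and covered == code_set else None


def recover_rule(x, y, f):
    """Decode the encrypted rule X => Y, or return None if it is rejected."""
    lhs = decode(set(x), f)
    both = decode(set(x) | set(y), f)
    if lhs is None or both is None:
        return None
    rhs = both - lhs
    return (lhs, rhs) if rhs else None


f = {"A": {1, 4, 5}, "B": {2}, "C": {3, 5}}
rule_1 = recover_rule({2}, {1, 3, 4, 5}, f)
rule_2 = recover_rule({2, 3, 5}, {1, 4}, f)
rejected = recover_rule({2}, {1, 3, 5}, f)
```

A rule is rejected as soon as either X or X ∪ Y is not an exact union of mapped sets, which is what happens when fake or partial items leak into a mined rule.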

Correctness

Proposition

For any items x and y, where f is the transformation mapping:

suppT(x) = suppT’(f(x))
suppT(x ∪ y) = suppT’(f(x) ∪ f(y))

For any itemsets X and Y, where F is the transformation mapping:

suppT(X) = suppT’(F(X))
suppT(X ∪ Y) = suppT’(F(X) ∪ F(Y))

No false drops and no false positives

Summary

Generation of mappings: one-to-n mappings, Item Extend

Transformation of transactions: mapping f(x), subsets of the unique mapping set, fake items

Recovering association rules: reverse mappings and filtering

Test run

# Items = 1k, |T| = 1k

Without transformation: one rule, time 8 s

Item Extend: 147 rules, total time 26 s; mapping generation and transformation 219 ms

Future Work

Define parameters to the problem: size of IE, size of F

Give a clear measure of security

Give a clear measure of overhead

Correctness of association rules: query execution proof, result verification

The End

Choosing probability

A uniform distribution, or any fixed distribution, gives patterns which may be easily identified

Use a random probability distribution, e.g. {}: 70%, {1}: 5%, {4}: 15%, {1, 4}: 20%

Storage: needs additional storage


Algorithm for transformation

Transformation is the most costly process

Execution time is linear in the database size |T|

It should be as fast as possible

Optimization

Mapping retrieval: for an item x, use a hash table to retrieve the mapping, h(x)

Adding fake items:

First randomly determine the number of items to add (according to the probability of adding items)

Then randomly pick that many items from the set (non-uniform distribution)

This gives a much shorter runtime on average

Choice of mapped items

Mapped items are drawn from 1, 2, …, (|I| + |IE| + |F|) * (1 + δ)

Acceptable as long as it is not easy to identify I’, IE and F

One way is to use a random permutation of the first |I| + |IE| + |F| natural numbers

The first |I| numbers are mapped to I’, the next |IE| numbers are IE
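The random-permutation idea can be sketched as below. Giving the remaining numbers to F is an assumption (the slide only names I’ and IE), and the function name is hypothetical.

```python
import random


def assign_ids(n_core, n_extend, n_fake, rng=random):
    """Split a random permutation of 1..n into core, expanding and fake ids."""
    n = n_core + n_extend + n_fake
    perm = rng.sample(range(1, n + 1), n)        # random permutation of 1..n
    core = perm[:n_core]                          # first |I| numbers -> I'
    extend = perm[n_core:n_core + n_extend]       # next |IE| numbers -> IE
    fake = perm[n_core + n_extend:]               # remainder -> F (assumed)
    return core, extend, fake


core_ids, extend_ids, fake_ids = assign_ids(3, 2, 2, rng=random.Random(42))
```

Because the three groups are slices of one shuffled sequence, the numeric value of an id says nothing about which group it belongs to.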

Cut and paste randomization

One case of select-a-size randomization, defined by the way j is selected

Given an integer Km > 0:

Randomly choose j in [0, Km]

If j > m, set j = m

Overall input parameters: Km and pm
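The selection of j for cut and paste is short enough to write out. The sketch assumes that j is drawn uniformly from 0..Km, which the slide does not state explicitly; the names are illustrative.

```python
import random


def cut_and_paste_size(m, k_m, rng=random):
    """Pick the number of items to keep from a transaction of length m."""
    j = rng.randint(0, k_m)   # assumed uniform over 0..k_m, both ends included
    return min(j, m)          # if j > m, set j = m


# Hypothetical draws for a transaction of length 3 with Km = 5.
draws = [cut_and_paste_size(3, 5, random.Random(seed)) for seed in range(20)]
```

Capping at m means that short transactions are kept whole more often than long ones.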

Effects on support

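Support of A in T’ comes from two cases:
A in t, and A was kept
A’ in t (A not in t), and A was randomly added

Support of AB in T’ comes from four cases:
AB in t, and both A and B were kept
AB’ in t, A was kept and B was randomly added
A’B in t, B was kept and A was randomly added
A’B’ in t, and both A and B were randomly added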

Estimating original support

Let x be the support of A in T and y the support of A in T’. Then

x * P(A remains in original transaction) + (|DB| - x) * pm = y

For AB: the support of AB in T relates to the supports of AB, AB’, A’B and A’B’ in T’
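For a single item, the relation between x and y can be solved for x. A minimal sketch, where `p_keep` stands for P(A remains in original transaction); the function and parameter names are assumptions.

```python
def estimate_original_support(y, db_size, p_keep, p_m):
    """Solve x * p_keep + (db_size - x) * p_m = y for x."""
    if p_keep == p_m:
        raise ValueError("p_keep and p_m must differ to recover the support")
    return (y - db_size * p_m) / (p_keep - p_m)


# Hypothetical numbers: 1000 transactions, true support 100,
# items kept with probability 0.8 and inserted with probability 0.1.
observed = 100 * 0.8 + (1000 - 100) * 0.1
estimate = estimate_original_support(observed, 1000, 0.8, 0.1)
```

In a real run y is a random count, so the value obtained is an estimate of x rather than x itself.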

Apriori property

Suppose m = 2 for all t in T, |T| = 10, I = {A, B}, pm = 0, j = 1

Support of B in T’: suppT’(B) = 0, so E(suppT(B)) = 0

suppT’(A) = 10 and suppT’(AB) = 0

E(suppT(AB)) = suppT’(A) * 1 = 10

Apriori property

An expected large itemset may have an expected small subset

But generally the supports of subsets are not too small

Instead of using the support threshold to filter out all small candidates, use a smaller value

Apriori algorithm

Generate candidate sets

Scan the database for counts

Recover the predicted support

Discard candidates with support below the candidate limit

Save for output the candidates with support >= support threshold

Apriori_gen(remaining candidates)
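The loop above can be sketched as a level-wise search with two thresholds. `estimate` is a hypothetical callable that turns a raw count in T’ into a predicted support, and this `apriori_gen` is a simple stand-in for the usual candidate generation, not the deck's implementation.

```python
from itertools import combinations


def apriori_gen(itemsets, k):
    """k-item candidates whose (k-1)-subsets all survived the previous level."""
    survivors = set(itemsets)
    items = sorted({i for s in survivors for i in s})
    return {
        frozenset(c)
        for c in combinations(items, k)
        if all(frozenset(sub) in survivors for sub in combinations(c, k - 1))
    }


def relaxed_apriori(transactions, estimate, s_min, limit):
    """Apriori with a candidate limit that is lower than the support threshold."""
    candidates = {frozenset([i]) for t in transactions for i in t}
    output, k = {}, 1
    while candidates:
        # Scan the database for counts, then recover the predicted support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        est = {c: estimate(c, n) for c, n in counts.items()}
        # Keep candidates at or above the candidate limit for the next level.
        kept = [c for c in candidates if est[c] >= limit]
        # Only candidates reaching the support threshold are reported.
        output.update({c: est[c] for c in kept if est[c] >= s_min})
        k += 1
        candidates = apriori_gen(kept, k)
    return output
```

Keeping candidates between the limit and the threshold costs extra counting, but it lets a large itemset survive even when one of its subsets has a low predicted support.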

Candidate limit

A high value: increases the number of false drops, poor correctness

A small value: increases the number of candidate sets, high running time

Experiment: with support threshold smin and estimated s.d. δ, smin – δ is found to be a good value

Other applications

Outsourced transaction database (secure) storage

Outsourced association rule mining using data stream

Secure distributed association rule mining with third party miner

Outsourced database with association rule mining service

[Figure: DB, Transformer, Mappings, Transactions, Query, Association Rules]
