IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL., NO., 2010

Data Leakage Detection

Panagiotis Papadimitriou, Member, IEEE, Hector Garcia-Molina, Member, IEEE

Abstract—We study the following problem: A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data is leaked and found in an unauthorized place (e.g., on the web or somebody’s laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. We propose data allocation strategies (across the agents) that improve the probability of identifying leakages. These methods do not rely on alterations of the released data (e.g., watermarks). In some cases we can also inject “realistic but fake” data records to further improve our chances of detecting leakage and identifying the guilty party.

Index Terms—allocation strategies, data leakage, data privacy, fake records, leakage model


1 INTRODUCTION

In the course of doing business, sometimes sensitive data must be handed over to supposedly trusted third parties. For example, a hospital may give patient records to researchers who will devise new treatments. Similarly, a company may have partnerships with other companies that require sharing customer data. Another enterprise may outsource its data processing, so data must be given to various other companies. We call the owner of the data the distributor and the supposedly trusted third parties the agents. Our goal is to detect when the distributor’s sensitive data has been leaked by agents, and if possible to identify the agent that leaked the data.

We consider applications where the original sensitive data cannot be perturbed. Perturbation is a very useful technique where the data is modified and made “less sensitive” before being handed to agents. For example, one can add random noise to certain attributes, or one can replace exact values by ranges [18]. However, in some cases it is important not to alter the original distributor’s data. For example, if an outsourcer is doing our payroll, he must have the exact salary and customer bank account numbers. If medical researchers will be treating patients (as opposed to simply computing statistics), they may need accurate data for the patients.

Traditionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases, but again, involve some modification of the original data. Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious.

In this paper we study unobtrusive techniques for detecting leakage of a set of objects or records. Specifically, we study the following scenario: After giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place. (For example, the data may be found on a web site, or may be obtained through a legal discovery process.) At this point the distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. Using an analogy with cookies stolen from a cookie jar, if we catch Freddie with a single cookie, he can argue that a friend gave him the cookie. But if we catch Freddie with 5 cookies, it will be much harder for him to argue that his hands were not in the cookie jar. If the distributor sees “enough evidence” that an agent leaked data, he may stop doing business with him, or may initiate legal proceedings.

• P. Papadimitriou is with Stanford University, Stanford, CA, 94305. E-mail: [email protected]
• H. Garcia-Molina is with Stanford University, Stanford, CA, 94305. E-mail: [email protected]

Manuscript received October 22, 2008; revised December 14, 2009; accepted December 17, 2009.

In this paper we develop a model for assessing the “guilt” of agents. We also present algorithms for distributing objects to agents, in a way that improves our chances of identifying a leaker. Finally, we also consider the option of adding “fake” objects to the distributed set. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark for the entire set, without modifying any individual members. If it turns out an agent was given one or more fake objects that were leaked, then the distributor can be more confident that agent was guilty.

We start in Section 2 by introducing our problem setup and the notation we use. In the first part of the paper, Sections 4 and 5, we present a model for calculating “guilt” probabilities in cases of data leakage. Then, in the second part, Sections 6 and 7, we present strategies for data allocation to agents. Finally, in Section 8, we evaluate the strategies in different data leakage scenarios, and check whether they indeed help us to identify a leaker.

2 PROBLEM SETUP AND NOTATION

2.1 Entities and Agents

A distributor owns a set T = {t1, . . . , tm} of valuable data objects. The distributor wants to share some of the objects with a set of agents U1, U2, . . . , Un, but does not wish the objects to be leaked to other third parties. The objects in T could be of any type and size, e.g., they could be tuples in a relation, or relations in a database.

An agent Ui receives a subset of objects Ri ⊆ T, determined either by a sample request or an explicit request:

• Sample request Ri = SAMPLE(T, mi): Any subset of mi records from T can be given to Ui.

• Explicit request Ri = EXPLICIT(T, condi): Agent Ui receives all the T objects that satisfy condi.

Example. Say T contains customer records for a given company A. Company A hires a marketing agency U1 to do an on-line survey of customers. Since any customers will do for the survey, U1 requests a sample of 1000 customer records. At the same time, company A subcontracts with agent U2 to handle billing for all California customers. Thus, U2 receives all T records that satisfy the condition “state is California.”

Although we do not discuss it here, our model can be easily extended to requests for a sample of objects that satisfy a condition (e.g., an agent wants any 100 California customer records). Also note that we do not concern ourselves with the randomness of a sample. (We assume that if a random sample is required, there are enough T records so that the to-be-presented object selection schemes can pick random records from T.)
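To make the two request types concrete, here is a minimal Python sketch; the function names and the commented usage are our own illustration, since the paper defines SAMPLE and EXPLICIT only abstractly.

import random

def sample_request(T, m_i):
    # SAMPLE(T, m_i): any subset of m_i records from T may be given to the agent.
    return set(random.sample(list(T), m_i))

def explicit_request(T, cond_i):
    # EXPLICIT(T, cond_i): all objects of T that satisfy the agent's condition.
    return {t for t in T if cond_i(t)}

# Hypothetical usage mirroring the example above: U1 samples 1000 customers,
# U2 receives all California customers.
# R1 = sample_request(T, 1000)
# R2 = explicit_request(T, lambda t: t.state == "California")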

2.2 Guilty Agents

Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has leaked. This means that some third party, called the target, has been caught in possession of S. For example, this target may be displaying S on its web site, or perhaps as part of a legal discovery process, the target turned over S to the distributor.

Since the agents U1, . . . , Un have some of the data, it is reasonable to suspect them of leaking the data. However, the agents can argue that they are innocent, and that the S data was obtained by the target through other means. For example, say one of the objects in S represents a customer X. Perhaps X is also a customer of some other company, and that company provided the data to the target. Or perhaps X can be reconstructed from various publicly available sources on the web.

Our goal is to estimate the likelihood that the leaked data came from the agents as opposed to other sources. Intuitively, the more data in S, the harder it is for the agents to argue they did not leak anything. Similarly, the “rarer” the objects, the harder it is to argue that the target obtained them through other means. Not only do we want to estimate the likelihood the agents leaked data, but we would also like to find out if one of them in particular was more likely to be the leaker. For instance, if one of the S objects was only given to agent U1, while the other objects were given to all agents, we may suspect U1 more. The model we present next captures this intuition.

We say an agent Ui is guilty if it contributes one or more objects to the target. We denote the event that agent Ui is guilty as Gi and the event that agent Ui is guilty for a given leaked set S as Gi|S. Our next step is to estimate Pr{Gi|S}, i.e., the probability that agent Ui is guilty given evidence S.

3 RELATED WORK

The guilt detection approach we present is related to the data provenance problem [3]: tracing the lineage of S objects implies essentially the detection of the guilty agents. Tutorial [4] provides a good overview on the research conducted in this field. Suggested solutions are domain specific, such as lineage tracing for data warehouses [5], and assume some prior knowledge on the way a data view is created out of data sources. Our problem formulation with objects and sets is more general and simplifies lineage tracing, since we do not consider any data transformation from Ri sets to S.

As far as the data allocation strategies are concerned, our work is mostly relevant to watermarking that is used as a means of establishing original ownership of distributed objects. Watermarks were initially used in images [16], video [8] and audio data [6], whose digital representation includes considerable redundancy. Recently, [1], [17], [10], [7] and other works have also studied mark insertion into relational data. Our approach and watermarking are similar in the sense of providing agents with some kind of receiver-identifying information. However, by its very nature, a watermark modifies the item being watermarked. If the object to be watermarked cannot be modified, then a watermark cannot be inserted. In such cases methods that attach watermarks to the distributed data are not applicable.

Finally, there are also lots of other works on mechanisms that allow only authorized users to access sensitive data through access control policies [9], [2]. Such approaches prevent, in some sense, data leakage by sharing information only with trusted parties. However, these policies are restrictive and may make it impossible to satisfy agents’ requests.

4 AGENT GUILT MODEL

To compute this Pr{Gi|S}, we need an estimate for the probability that values in S can be “guessed” by the target. For instance, say some of the objects in S are emails of individuals. We can conduct an experiment and ask a person with approximately the expertise and resources of the target to find the email of, say, 100 individuals. If this person can find, say, 90 emails, then we can reasonably guess that the probability of finding one email is 0.9. On the other hand, if the objects in question are bank account numbers, the person may only discover, say, 20, leading to an estimate of 0.2. We call this estimate pt, the probability that object t can be guessed by the target.

Probability pt is analogous to the probabilities used in designing fault-tolerant systems. That is, to estimate how likely it is that a system will be operational throughout a given period, we need the probabilities that individual components will or will not fail. A component failure in our case is the event that the target guesses an object of S. The component failure is used to compute the overall system reliability, while we use the probability of guessing to identify agents that have leaked information. The component failure probabilities are estimated based on experiments, just as we propose to estimate the pt’s. Similarly, the component probabilities are usually conservative estimates, rather than exact numbers. For example, say we use a component failure probability that is higher than the actual probability, and we design our system to provide a desired high level of reliability. Then we will know that the actual system will have at least that level of reliability, but possibly higher. In the same way, if we use pt’s that are higher than the true values, we will know that the agents will be guilty with at least the computed probabilities.

To simplify the formulas that we present in the rest of the paper, we assume that all T objects have the same pt, which we call p. Our equations can be easily generalized to diverse pt’s, though they become cumbersome to display.

Next, we make two assumptions regarding the relationship among the various leakage events. The first assumption simply states that an agent’s decision to leak an object is not related to other objects. In [14] we study a scenario where the actions for different objects are related, and we study how our results are impacted by the different independence assumptions.

Assumption 1. For all t, t′ ∈ S such that t ≠ t′, the provenance of t is independent of the provenance of t′.

The term provenance in this assumption statement refers to the source of a value t that appears in the leaked set. The source can be any of the agents who have t in their sets or the target itself (guessing).

To simplify our formulas, the following assumption states that joint events have a negligible probability. As we argue in the example below, this assumption gives us more conservative estimates for the guilt of agents, which is consistent with our goals.

Assumption 2. An object t ∈ S can only be obtained by the target in one of two ways:

• A single agent Ui leaked t from its own Ri set; or
• The target guessed (or obtained through other means) t without the help of any of the n agents.

In other words, for all t ∈ S, the event that the target guesses t and the events that agent Ui (i = 1, . . . , n) leaks object t are disjoint.

Before we present the general formula for computing the probability Pr{Gi|S} that an agent Ui is guilty, we provide a simple example. Assume that the distributor set T, the agent sets R’s and the target set S are:

T = {t1, t2, t3},  R1 = {t1, t2},  R2 = {t1, t3},  S = {t1, t2, t3}.

In this case, all three of the distributor’s objects have been leaked and appear in S. Let us first consider how the target may have obtained object t1, which was given to both agents. From Assumption 2, the target either guessed t1 or one of U1 or U2 leaked it. We know that the probability of the former event is p, so, assuming that the probability that each of the two agents leaked t1 is the same, we have the following cases:

• the target guessed t1 with probability p;
• agent U1 leaked t1 to S with probability (1 − p)/2;
• agent U2 leaked t1 to S with probability (1 − p)/2.

Similarly, we find that agent U1 leaked t2 to S withprobability 1− p since he is the only agent that has t2.

Given these values, the probability that agent U1 is not guilty, namely that U1 did not leak either object, is:

$\Pr\{\bar{G}_1 \mid S\} = \left(1 - \tfrac{1-p}{2}\right) \times \left(1 - (1-p)\right),$   (1)

and the probability that U1 is guilty is:

$\Pr\{G_1 \mid S\} = 1 - \Pr\{\bar{G}_1 \mid S\}.$   (2)

Note that if Assumption 2 did not hold, our analysis would be more complex because we would need to consider joint events, e.g., the target guesses t1, and at the same time one or two agents leak the value. In our simplified analysis we say that an agent is not guilty when the object can be guessed, regardless of whether the agent leaked the value. Since we are “not counting” instances when an agent leaks information, the simplified analysis yields conservative values (smaller probabilities).

In the general case (with our assumptions), to find the probability that an agent Ui is guilty given a set S, first we compute the probability that he leaks a single object t to S. To compute this we define the set of agents Vt = {Ui | t ∈ Ri} that have t in their data sets. Then using Assumption 2 and known probability p, we have:

$\Pr\{\text{some agent leaked } t \text{ to } S\} = 1 - p.$   (3)

Assuming that all agents that belong to Vt can leak t to S with equal probability, and using Assumption 2, we obtain:

$\Pr\{U_i \text{ leaked } t \text{ to } S\} = \begin{cases} \frac{1-p}{|V_t|} & \text{if } U_i \in V_t \\ 0 & \text{otherwise} \end{cases}$   (4)

Given that agent Ui is guilty if he leaks at least one value to S, with Assumption 1 and Equation 4 we can compute the probability Pr{Gi|S} that agent Ui is guilty:

$\Pr\{G_i \mid S\} = 1 - \prod_{t \in S \cap R_i} \left(1 - \frac{1-p}{|V_t|}\right)$   (5)

Fig. 1. Guilt probability as a function of the guessing probability p (Figure (a)) and of the overlap between S and R2 (Figures (b), (c) and (d), for p = 0.2, 0.5 and 0.9 respectively). In all scenarios it holds that R1 ∩ S = S and |S| = 16. [Plots omitted: the x-axis is p in (a) and |R2 ∩ S|/|S| (%) in (b)-(d); the y-axis is Pr{Gi} for i = 1, 2; panel (a) has |R2 ∩ S|/|S| = 0.5.]

5 GUILT MODEL ANALYSIS

In order to see how our model parameters interact and to check if the interactions match our intuition, in this section we study two simple scenarios. In each scenario we have a target that has obtained all the distributor’s objects, i.e., T = S.

5.1 Impact of Probability p

In our first scenario, T contains 16 objects: all of them are given to agent U1 and only 8 are given to a second agent U2. We calculate the probabilities Pr{G1|S} and Pr{G2|S} for p in the range [0, 1] and we present the results in Figure 1(a). The dashed line shows Pr{G1|S} and the solid line shows Pr{G2|S}.

As p approaches 0, it becomes more and more unlikely that the target guessed all 16 values. Each agent has enough of the leaked data that its individual guilt approaches 1. However, as p increases in value, the probability that U2 is guilty decreases significantly: all of U2’s 8 objects were also given to U1, so it gets harder to blame U2 for the leaks. On the other hand, U1’s probability of guilt remains close to 1 as p increases, since U1 has 8 objects not seen by the other agent. At the extreme, as p approaches 1, it is very possible that the target guessed all 16 values, so the agents’ probabilities of guilt go to 0.

5.2 Impact of Overlap between Ri and S

In this subsection we again study two agents, one receiving all the T = S data, and the second one receiving a varying fraction of the data. Figure 1(b) shows the probability of guilt for both agents, as a function of the fraction of the objects owned by U2, i.e., as a function of |R2 ∩ S|/|S|. In this case, p has a low value of 0.2, and U1 continues to have all 16 S objects. Note that in our previous scenario, U2 has 50% of the S objects.

We see that when objects are rare (p = 0.2), it does not take many leaked objects before we can say U2 is guilty with high confidence. This result matches our intuition: an agent that owns even a small number of incriminating objects is clearly suspicious.

Figures 1(c) and 1(d) show the same scenario, except for values of p equal to 0.5 and 0.9. We see clearly that the rate of increase of the guilt probability decreases as p increases. This observation again matches our intuition: As the objects become easier to guess, it takes more and more evidence of leakage (more leaked objects owned by U2) before we can have high confidence that U2 is guilty.

In [14] we study an additional scenario that shows how the sharing of S objects by agents affects the probabilities that they are guilty. The scenario conclusion matches our intuition: with more agents holding the replicated leaked data, it is harder to lay the blame on any one agent.

6 DATA ALLOCATION PROBLEM

The main focus of our paper is the data allocation problem: how can the distributor “intelligently” give data to agents in order to improve the chances of detecting a guilty agent? As illustrated in Figure 2, there are four instances of this problem we address, depending on the type of data requests made by agents and whether “fake objects” are allowed.

The two types of requests we handle were defined in Section 2: sample and explicit. Fake objects are objects generated by the distributor that are not in set T. The objects are designed to look like real objects, and are distributed to agents together with the T objects, in order to increase the chances of detecting agents that leak data. We discuss fake objects in more detail in Section 6.1 below.

As shown in Figure 2, we represent our four problem instances with the names EF̄, EF, SF̄ and SF, where E stands for explicit requests, S for sample requests, F for the use of fake objects, and F̄ for the case where fake objects are not allowed.

Fig. 2. Leakage Problem Instances. [Diagram omitted: a two-level decision tree that first splits on the type of data requests (explicit vs. sample) and then on whether fake tuples are allowed (no vs. yes), yielding the four instances EF̄, EF, SF̄ and SF.]

Note that for simplicity we are assuming that in the E problem instances, all agents make explicit requests, while in the S instances, all agents make sample requests. Our results can be extended to handle mixed cases, with some explicit and some sample requests. We provide here a small example to illustrate how mixed requests can be handled, but then do not elaborate further. Assume that we have two agents with requests R1 = EXPLICIT(T, cond1) and R2 = SAMPLE(T′, 1), where T′ = EXPLICIT(T, cond2). Further, say cond1 is “state=CA” (objects have a state field). If agent U2 has the same condition cond2 = cond1, we can create an equivalent problem with sample data requests on set T′. That is, our problem will be how to distribute the CA objects to two agents, with R1 = SAMPLE(T′, |T′|) and R2 = SAMPLE(T′, 1). If instead U2 uses condition “state=NY,” we can solve two different problems for sets T′ and T − T′. In each problem we will have only one agent. Finally, if the conditions partially overlap, R1 ∩ T′ ≠ ∅ but R1 ≠ T′, we can solve three different problems for sets R1 − T′, R1 ∩ T′ and T′ − R1.

6.1 Fake Objects

The distributor may be able to add fake objects to the distributed data in order to improve his effectiveness in detecting guilty agents. However, fake objects may impact the correctness of what agents do, so they may not always be allowable.

The idea of perturbing data to detect leakage is not new, e.g., [1]. However, in most cases, individual objects are perturbed, e.g., by adding random noise to sensitive salaries, or adding a watermark to an image. In our case, we are perturbing the set of distributor objects by adding fake elements. In some applications, fake objects may cause fewer problems than perturbing real objects. For example, say the distributed data objects are medical records and the agents are hospitals. In this case, even small modifications to the records of actual patients may be undesirable. However, the addition of some fake medical records may be acceptable, since no patient matches these records, and hence no one will ever be treated based on fake records.

Our use of fake objects is inspired by the use of “trace” records in mailing lists. In this case, company A sells to company B a mailing list to be used once (e.g., to send advertisements). Company A adds trace records that contain addresses owned by company A. Thus, each time company B uses the purchased mailing list, A receives copies of the mailing. These records are a type of fake objects that help identify improper use of data.

The distributor creates and adds fake objects to the data that he distributes to agents. We let Fi ⊆ Ri be the subset of fake objects that agent Ui receives. As discussed below, fake objects must be created carefully so that agents cannot distinguish them from real objects.

In many cases, the distributor may be limited in how many fake objects he can create. For example, objects may contain email addresses, and each fake email address may require the creation of an actual inbox (otherwise the agent may discover the object is fake). The inboxes can actually be monitored by the distributor: if email is received from someone other than the agent who was given the address, it is evidence that the address was leaked. Since creating and monitoring email accounts consumes resources, the distributor may have a limit on the number of fake objects. If there is a limit, we denote it by B fake objects.

Similarly, the distributor may want to limit the number of fake objects received by each agent, so as to not arouse suspicions and to not adversely impact the agents’ activities. Thus, we say that the distributor can send up to bi fake objects to agent Ui.

Creation. The creation of fake but real-looking objects is a non-trivial problem whose thorough investigation is beyond the scope of this paper. Here, we model the creation of a fake object for agent Ui as a black-box function CREATEFAKEOBJECT(Ri, Fi, condi) that takes as input the set of all objects Ri, the subset of fake objects Fi that Ui has received so far, and condi, and returns a new fake object. This function needs condi to produce a valid object that satisfies Ui’s condition. Set Ri is needed as input so that the created fake object is not only valid but also indistinguishable from other real objects. For example, the creation function of a fake payroll record that includes an employee rank and a salary attribute may take into account the distribution of employee ranks, the distribution of salaries, as well as the correlation between the two attributes. Ensuring that key statistics do not change by the introduction of fake objects is important if the agents will be using such statistics in their work. Finally, function CREATEFAKEOBJECT() has to be aware of the fake objects Fi added so far, again to ensure proper statistics.

The distributor can also use function CREATEFAKEOBJECT() when it wants to send the same fake object to a set of agents. In this case, the function arguments are the union of the Ri and Fi tables respectively, and the intersection of the conditions condi.

Although we do not deal with the implementation of CREATEFAKEOBJECT(), we note that there are two main design options. The function can either produce a fake object on demand every time it is called, or it can return an appropriate object from a pool of objects created in advance.
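Purely as an illustration of the interface (the paper deliberately leaves the implementation open), the sketch below follows the second design option and draws fake objects from a pool prepared in advance; the record format, the condition representation and all names are assumptions of ours.

import random

class FakeObjectPool:
    """Pool-based CREATEFAKEOBJECT(): returns pre-generated fake records.

    pool: list of fake records (dicts) prepared in advance so that their
          attribute statistics resemble the real objects in T.
    """
    def __init__(self, pool):
        self.pool = list(pool)

    def create_fake_object(self, R_i, F_i, cond_i):
        # Candidates must satisfy the agent's condition and must not duplicate
        # anything the agent already holds (real objects R_i or fakes F_i).
        candidates = [obj for obj in self.pool
                      if cond_i(obj) and obj not in R_i and obj not in F_i]
        if not candidates:
            raise RuntimeError("no suitable fake object left in the pool")
        return random.choice(candidates)

# Hypothetical usage with payroll-like records that have a 'state' attribute.
pool = FakeObjectPool([{"name": "J. Doe", "state": "CA", "salary": 58000},
                       {"name": "A. Roe", "state": "CA", "salary": 61000}])
fake = pool.create_fake_object(R_i=[], F_i=[], cond_i=lambda o: o["state"] == "CA")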

6.2 Optimization Problem

The distributor’s data allocation to agents has one constraint and one objective. The distributor’s constraint is to satisfy agents’ requests, by providing them with the number of objects they request or with all available objects that satisfy their conditions. His objective is to be able to detect an agent who leaks any portion of his data.

We consider the constraint as strict. The distributor may not deny serving an agent request as in [13] and may not provide agents with different perturbed versions of the same objects as in [1]. We consider fake object distribution as the only possible constraint relaxation.

Our detection objective is ideal and intractable. Detection would be assured only if the distributor gave no data object to any agent (Mungamuru et al. [11] discuss that to attain “perfect” privacy and security we have to sacrifice utility). We use instead the following objective: maximize the chances of detecting a guilty agent that leaks all his data objects.

We now introduce some notation to state formally the distributor’s objective. Recall that Pr{Gj|S = Ri}, or simply Pr{Gj|Ri}, is the probability that agent Uj is guilty if the distributor discovers a leaked table S that contains all Ri objects. We define the difference functions ∆(i, j) as:

$\Delta(i, j) = \Pr\{G_i \mid R_i\} - \Pr\{G_j \mid R_i\}, \quad i, j = 1, \dots, n$   (6)

Note that differences ∆ have non-negative values: given that set Ri contains all the leaked objects, agent Ui is at least as likely to be guilty as any other agent. Difference ∆(i, j) is positive for any agent Uj whose set Rj does not contain all data of S. It is zero if Ri ⊆ Rj. In this case the distributor will consider both agents Ui and Uj equally guilty, since they have both received all the leaked objects. The larger a ∆(i, j) value is, the easier it is to identify Ui as the leaking agent. Thus, we want to distribute data so that ∆ values are large:

Problem Definition. Let the distributor have data requests from n agents. The distributor wants to give tables R1, . . . , Rn to agents U1, . . . , Un, respectively, so that:

• he satisfies agents’ requests; and
• he maximizes the guilt probability differences ∆(i, j) for all i, j = 1, . . . , n and i ≠ j.

Assuming that the Ri sets satisfy the agents’ requests, we can express the problem as a multi-criterion optimization problem:

$\operatorname*{maximize}_{R_1, \dots, R_n} \; (\dots, \Delta(i, j), \dots), \quad i \neq j$   (7)

If the optimization problem has an optimal solution, that means that there exists an allocation $D^* = \{R_1^*, \dots, R_n^*\}$ such that any other feasible allocation D = {R1, . . . , Rn} yields ∆∗(i, j) ≥ ∆(i, j) for all i, j. That means that allocation D∗ allows the distributor to discern any guilty agent with higher confidence than any other allocation, since it maximizes the probability Pr{Gi|Ri} with respect to any other probability Pr{Gj|Ri} with j ≠ i.

Even if there is no optimal allocation D∗, a multi-criterion problem has Pareto optimal allocations. If $D^{po} = \{R_1^{po}, \dots, R_n^{po}\}$ is a Pareto optimal allocation, that means that there is no other allocation that yields ∆(i, j) ≥ ∆po(i, j) for all i, j. In other words, if an allocation yields ∆(i, j) ≥ ∆po(i, j) for some i, j, then there is some i′, j′ such that ∆(i′, j′) ≤ ∆po(i′, j′). The choice among all the Pareto optimal allocations implicitly selects the agent(s) we want to identify.
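To connect the objective with the guilt model, here is a small Python sketch of the difference function of Equation 6; it reuses the hypothetical guilt_probabilities() helper sketched in Section 4 and takes S = Ri, as in the definition.

def delta(i, j, R, p):
    """Difference function of Equation 6: Pr{G_i | R_i} - Pr{G_j | R_i}.

    The leaked set S is taken to be R[i], i.e., agent U_i leaked all his objects.
    R is a list of sets; p is the guessing probability.
    """
    guilt_given_Ri = guilt_probabilities(R[i], R, p)
    return guilt_given_Ri[i] - guilt_given_Ri[j]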

6.3 Objective Approximation

We can approximate the objective of Equation 7 with Equation 8, which does not depend on agents’ guilt probabilities, and therefore on p:

$\operatorname*{minimize}_{R_1, \dots, R_n} \; \left(\dots, \frac{|R_i \cap R_j|}{|R_i|}, \dots\right), \quad i \neq j$   (8)

This approximation is valid if minimizing the relative overlap |Ri ∩ Rj|/|Ri| maximizes ∆(i, j). The intuitive argument for this approximation is that the fewer leaked objects set Rj contains, the less guilty agent Uj will appear compared to Ui (since S = Ri). The example of Section 5.2 supports our approximation. In Figure 1 we see that if S = R1 the difference Pr{G1|S} − Pr{G2|S} decreases as the relative overlap |R2 ∩ S|/|S| increases. Theorem 1 shows that a solution to problem 7 yields the solution to problem 8 if each T object is allocated to the same number of agents, regardless of who these agents are. The proof of the theorem is in [14].

Theorem 1. If a distribution D = {R1, . . . , Rn} that satisfies agents’ requests minimizes |Ri ∩ Rj|/|Ri| and |Vt| = |Vt′| for all t, t′ ∈ T, then D maximizes ∆(i, j).

The approximate optimization problem still has multiple criteria and it can yield either an optimal or multiple Pareto optimal solutions. Pareto optimal solutions let us detect a guilty agent Ui with high confidence, at the expense of an inability to detect some other guilty agent or agents. Since the distributor has no a priori information about the agents’ intention to leak their data, he has no reason to bias the object allocation against a particular agent. Therefore, we can scalarize the problem objective by assigning the same weights to all vector objectives. We present two different scalar versions of our problem in Equations 9a and 9b. In the rest of the paper we will refer to objective 9a as the sum-objective and to objective 9b as the max-objective.

$\operatorname*{minimize}_{R_1, \dots, R_n} \; \sum_{i=1}^{n} \frac{1}{|R_i|} \sum_{\substack{j=1 \\ j \neq i}}^{n} |R_i \cap R_j|$   (9a)

$\operatorname*{minimize}_{R_1, \dots, R_n} \; \max_{\substack{i,j = 1, \dots, n \\ j \neq i}} \frac{|R_i \cap R_j|}{|R_i|}$   (9b)

Both scalar optimization problems yield the optimal solution of the problem of Equation 8, if such a solution exists. If there is no global optimal solution, the sum-objective yields the Pareto optimal solution that allows the distributor to detect the guilty agent, on average (over all different agents), with higher confidence than any other distribution. The max-objective yields the solution that guarantees that the distributor will detect the guilty agent with a certain confidence in the worst case. Such a guarantee may adversely impact the average performance of the distribution.
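For concreteness, a minimal Python sketch of the two scalarizations, assuming an allocation is given as a list of non-empty sets (the function names are our own):

def sum_objective(R):
    """Scalar sum-objective of Equation 9a for an allocation R (list of sets)."""
    n = len(R)
    return sum(sum(len(R[i] & R[j]) for j in range(n) if j != i) / len(R[i])
               for i in range(n))

def max_objective(R):
    """Scalar max-objective of Equation 9b for an allocation R (list of sets)."""
    n = len(R)
    return max(len(R[i] & R[j]) / len(R[i])
               for i in range(n) for j in range(n) if j != i)

# Small example (also used in Section 7.1): R1 = {t1, t2}, R2 = {t1}.
print(sum_objective([{"t1", "t2"}, {"t1"}]))  # 0.5 + 1.0 = 1.5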

7 ALLOCATION STRATEGIES

In this section we describe allocation strategies that solve exactly or approximately the scalar versions of Equation 8 for the different instances presented in Figure 2. We resort to approximate solutions in cases where it is inefficient to solve accurately the optimization problem.

In Section 7.1 we deal with problems with explicit data requests and in Section 7.2 with problems with sample data requests.

The proofs of theorems that are stated in the following sections are available in [14].

7.1 Explicit Data Requests

In problems of class EF̄ the distributor is not allowed to add fake objects to the distributed data. So, the data allocation is fully defined by the agents’ data requests. Therefore, there is nothing to optimize.

In EF problems, objective values are initialized by agents’ data requests. Say, for example, that T = {t1, t2} and there are two agents with explicit data requests such that R1 = {t1, t2} and R2 = {t1}. The value of the sum-objective is in this case:

$\sum_{i=1}^{2} \frac{1}{|R_i|} \sum_{\substack{j=1 \\ j \neq i}}^{2} |R_i \cap R_j| = \frac{1}{2} + \frac{1}{1} = 1.5.$

The distributor cannot remove or alter the R1 or R2 data to decrease the overlap R1 ∩ R2. However, say the distributor can create one fake object (B = 1) and both agents can receive one fake object (b1 = b2 = 1). In this case, the distributor can add one fake object to either R1 or R2 to increase the corresponding denominator of the summation term. Assume that the distributor creates a fake object f and he gives it to agent U1. Agent U1 has now R1 = {t1, t2, f} and F1 = {f}, and the value of the sum-objective decreases to 1/3 + 1/1 = 1.33 < 1.5.

If the distributor is able to create more fake objects, he could further improve the objective. We present in Algorithms 1 and 2 a strategy for randomly allocating fake objects. Algorithm 1 is a general “driver” that will be used by other strategies, while Algorithm 2 actually performs the random selection. We denote the combination of Algorithms 1 and 2 as e-random. We use e-random as our baseline in our comparisons with other algorithms for explicit data requests.

Algorithm 1 Allocation for Explicit Data Requests (EF)
Input: R1, . . . , Rn, cond1, . . . , condn, b1, . . . , bn, B
Output: R1, . . . , Rn, F1, . . . , Fn
1:  R ← ∅                        ▷ agents that can receive fake objects
2:  for i = 1, . . . , n do
3:      if bi > 0 then
4:          R ← R ∪ {i}
5:          Fi ← ∅
6:  while B > 0 do
7:      i ← SELECTAGENT(R, R1, . . . , Rn)
8:      f ← CREATEFAKEOBJECT(Ri, Fi, condi)
9:      Ri ← Ri ∪ {f}
10:     Fi ← Fi ∪ {f}
11:     bi ← bi − 1
12:     if bi = 0 then
13:         R ← R \ {i}
14:     B ← B − 1

Algorithm 2 Agent Selection for e-random
1:  function SELECTAGENT(R, R1, . . . , Rn)
2:      i ← select at random an agent from R
3:      return i

Algorithm 3 Agent Selection for e-optimal
1:  function SELECTAGENT(R, R1, . . . , Rn)
2:      i ← argmax_{i′: i′ ∈ R} (1/|Ri′| − 1/(|Ri′| + 1)) ∑_j |Ri′ ∩ Rj|
3:      return i

In lines 1-5, Algorithm 1 finds the agents that are eligible to receive fake objects in O(n) time. Then, in the main loop in lines 6-14, the algorithm creates one fake object in every iteration and allocates it to a random agent. The main loop takes O(B) time. Hence, the running time of the algorithm is O(n + B).

If B ≥ ∑_{i=1}^{n} bi, the algorithm minimizes every term of the objective summation by adding the maximum number bi of fake objects to every set Ri, yielding the optimal solution. Otherwise, if B < ∑_{i=1}^{n} bi (as in our example, where B = 1 < b1 + b2 = 2), the algorithm just selects at random the agents that are provided with fake objects. We return back to our example and see how the objective would change if the distributor adds fake object f to R2 instead of R1. In this case the sum-objective would be 1/2 + 1/2 = 1 < 1.33.

The reason why we got a greater improvement is that the addition of a fake object to R2 has greater impact on the corresponding summation terms, since

$\frac{1}{|R_1|} - \frac{1}{|R_1| + 1} = \frac{1}{6} \;<\; \frac{1}{|R_2|} - \frac{1}{|R_2| + 1} = \frac{1}{2}.$

The left-hand side of the inequality corresponds to the objective improvement after the addition of a fake object to R1 and the right-hand side to R2. We can use this observation to improve Algorithm 1 with the use of procedure SELECTAGENT() of Algorithm 3. We denote the combination of Algorithms 1 and 3 as e-optimal.

Algorithm 3 makes a greedy choice by selecting the agent that will yield the greatest improvement in the sum-objective. The cost of this greedy choice is O(n²) in every iteration. The overall running time of e-optimal is O(n + n²B) = O(n²B). Theorem 2 shows that this greedy approach finds an optimal distribution with respect to both optimization objectives defined in Equation 9.

Theorem 2. Algorithm e-optimal yields an object allocation that minimizes both the sum- and max-objective in problem instances of class EF.
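A compact Python sketch of the driver of Algorithm 1 together with the e-random and e-optimal agent selections is given below. The names and the treatment of CREATEFAKEOBJECT() as a caller-supplied function are our assumptions, and the greedy gain sums overlaps with the other agents only, matching the objective-improvement argument above.

import random

def allocate_fake_objects(R, conds, b, B, create_fake_object, select_agent):
    """Driver of Algorithm 1: distribute up to B fake objects to agents.

    R: list of lists, R[i] holds the objects already allocated to agent U_i
    conds: conds[i] is agent U_i's condition (passed to create_fake_object)
    b: per-agent fake-object limits; B: total fake-object budget
    """
    eligible = [i for i in range(len(R)) if b[i] > 0]
    F = [[] for _ in R]                 # fake objects per agent
    budget = list(b)
    while B > 0 and eligible:
        i = select_agent(eligible, R)
        f = create_fake_object(R[i], F[i], conds[i])
        R[i].append(f)
        F[i].append(f)
        budget[i] -= 1
        if budget[i] == 0:
            eligible.remove(i)
        B -= 1
    return R, F

def select_agent_random(eligible, R):
    """Algorithm 2 (e-random): pick an eligible agent uniformly at random."""
    return random.choice(eligible)

def select_agent_optimal(eligible, R):
    """Algorithm 3 (e-optimal): greedy choice maximizing the sum-objective drop."""
    def gain(i):
        overlap = sum(len(set(R[i]) & set(R[j])) for j in range(len(R)) if j != i)
        return (1 / len(R[i]) - 1 / (len(R[i]) + 1)) * overlap
    return max(eligible, key=gain)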

7.2 Sample Data Requests

With sample data requests, each agent Ui may receive any T subset out of $\binom{|T|}{m_i}$ different ones. Hence, there are $\prod_{i=1}^{n} \binom{|T|}{m_i}$ different object allocations. In every allocation, the distributor can permute T objects and keep the same chances of guilty agent detection. The reason is that the guilt probability depends only on which agents have received the leaked objects and not on the identity of the leaked objects. Therefore, from the distributor’s perspective, there are $\prod_{i=1}^{n} \binom{|T|}{m_i} / |T|!$ different allocations. The distributor’s problem is to pick one out so that he optimizes his objective. In [14] we formulate the problem as a non-convex QIP that is NP-hard to solve [15].

Algorithm 4 Allocation for Sample Data Requests (SF̄)
Input: m1, . . . , mn, |T|              ▷ assuming mi ≤ |T|
Output: R1, . . . , Rn
1:  a ← 0_{|T|}                        ▷ a[k]: number of agents who have received object tk
2:  R1 ← ∅, . . . , Rn ← ∅
3:  remaining ← ∑_{i=1}^{n} mi
4:  while remaining > 0 do
5:      for all i = 1, . . . , n : |Ri| < mi do
6:          k ← SELECTOBJECT(i, Ri)    ▷ may also use additional parameters
7:          Ri ← Ri ∪ {tk}
8:          a[k] ← a[k] + 1
9:          remaining ← remaining − 1

Note that the distributor can increase the number of possible allocations by adding fake objects (and increasing |T|), but the problem is essentially the same. So, in the rest of this subsection we will only deal with problems of class SF̄, but our algorithms are applicable to SF problems as well.

7.2.1 Random

An object allocation that satisfies requests and ignores the distributor’s objective is to give each agent Ui a randomly selected subset of T of size mi. We denote this algorithm as s-random and we use it as our baseline. We present s-random in two parts: Algorithm 4 is a general allocation algorithm that is used by other algorithms in this section. In line 6 of Algorithm 4 there is a call to function SELECTOBJECT(), whose implementation differentiates algorithms that rely on Algorithm 4. Algorithm 5 shows function SELECTOBJECT() for s-random.

In s-random we introduce vector a ∈ ℕ^{|T|} that shows the object sharing distribution. In particular, element a[k] shows the number of agents who receive object tk.

Algorithm s-random allocates objects to agents in a round-robin fashion. After the initialization of vector a and the sets R1, . . . , Rn in lines 1-2 of Algorithm 4, the main loop in lines 4-9 is executed while there are still data objects (remaining > 0) to be allocated to agents. In each iteration of this loop (lines 5-9) the algorithm uses function SELECTOBJECT() to find a random object to allocate to agent Ui. This loop iterates over all agents who have not received the number of data objects they have requested. The running time of the algorithm is O(τ ∑_{i=1}^{n} mi) and depends on the running time τ of the object selection function SELECTOBJECT(). In case of random selection we can have τ = O(1) by keeping in memory a set {k′ | tk′ ∉ Ri} for each agent Ui.

Algorithm 5 Object Selection for s-random
1:  function SELECTOBJECT(i, Ri)
2:      k ← select at random an element from set {k′ | tk′ ∉ Ri}
3:      return k

Algorithm 6 Object Selection for s-overlap
1:  function SELECTOBJECT(i, Ri, a)
2:      K ← {k | k = argmin_{k′} a[k′]}
3:      k ← select at random an element from set {k′ | k′ ∈ K ∧ tk′ ∉ Ri}
4:      return k

Algorithm s-random may yield a poor data allocation. Say, for example, that the distributor set T has three objects and there are three agents who request one object each. It is possible that s-random provides all three agents with the same object. Such an allocation maximizes both objectives 9a and 9b instead of minimizing them.
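The following Python sketch mirrors the driver of Algorithm 4 and the s-random selector of Algorithm 5; storing object indices instead of the objects themselves, and the function names, are simplifications of ours.

import random

def allocate_samples(m, T, select_object):
    """Driver of Algorithm 4: allocate sample requests of sizes m[i] from T.

    m: list of requested sample sizes (m[i] <= len(T))
    T: list of data objects
    select_object: strategy returning the index of the next object for agent i
    """
    a = [0] * len(T)                 # a[k]: number of agents holding object T[k]
    R = [set() for _ in m]
    remaining = sum(m)
    while remaining > 0:
        for i in range(len(m)):
            if len(R[i]) < m[i]:
                k = select_object(i, R[i], a)
                R[i].add(k)          # store object indices for simplicity
                a[k] += 1
                remaining -= 1
    return R

def select_object_random(i, R_i, a):
    """Algorithm 5 (s-random): any object the agent does not already have."""
    return random.choice([k for k in range(len(a)) if k not in R_i])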

7.2.2 Overlap Minimization

In the last example, the distributor can minimize both objectives by allocating distinct sets to all three agents. Such an optimal allocation is possible, since agents request in total fewer objects than the distributor has.

We can achieve such an allocation by using Algorithm 4 and SELECTOBJECT() of Algorithm 6. We denote the resulting algorithm as s-overlap. Using Algorithm 6, in each iteration of Algorithm 4 we provide agent Ui with an object that has been given to the smallest number of agents. So, if agents ask for fewer objects than |T|, Algorithm 6 will return in every iteration an object that no agent has received so far. Thus, every agent will receive a data set with objects that no other agent has.

The running time of Algorithm 6 is O(1) if we keep in memory the set {k | k = argmin_{k′} a[k′]}. This set contains initially all indices {1, . . . , |T|}, since a[k] = 0 for all k = 1, . . . , |T|. After every Algorithm 4 main loop iteration, we remove from this set the index of the object that we give to an agent. After |T| iterations this set becomes empty and we have to reset it again to {1, . . . , |T|}, since at this point a[k] = 1 for all k = 1, . . . , |T|. The total running time of algorithm s-overlap is thus O(∑_{i=1}^{n} mi).
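A sketch of the s-overlap selector of Algorithm 6, meant to plug into the allocate_samples() driver sketched above; the fallback branch (for the case where every least-shared object is already held by the agent) is our addition and not part of the pseudocode.

import random

def select_object_overlap(i, R_i, a):
    """Algorithm 6 (s-overlap): prefer the objects held by the fewest agents."""
    min_count = min(a)
    candidates = [k for k in range(len(a)) if a[k] == min_count and k not in R_i]
    if not candidates:
        # Fall back to the least-shared object the agent does not yet hold.
        candidates = sorted((k for k in range(len(a)) if k not in R_i),
                            key=lambda k: a[k])[:1]
    return random.choice(candidates)

# Usage with the driver above, e.g. three agents asking for one object each:
# R = allocate_samples([1, 1, 1], ["t1", "t2", "t3"], select_object_overlap)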

Let M = ∑_{i=1}^{n} mi. If M ≤ |T|, algorithm s-overlap yields disjoint data sets and is optimal for both objectives 9a and 9b. If M > |T|, it can be shown that algorithm s-overlap yields an object sharing distribution such that:

$a[k] = \begin{cases} M \div |T| + 1 & \text{for } M \bmod |T| \text{ entries of vector } a \\ M \div |T| & \text{for the rest} \end{cases}$

Theorem 3. In general, Algorithm s-overlap does not minimize the sum-objective. However, s-overlap does minimize the sum of overlaps, i.e., $\sum_{i,j=1,\, j \neq i}^{n} |R_i \cap R_j|$. If requests are all of the same size (m1 = · · · = mn), then s-overlap minimizes the sum-objective.

To illustrate that s-overlap does not minimize the sum-objective, assume that set T has four objects and there are four agents requesting samples with sizes m1 = m2 = 2 and m3 = m4 = 1. A possible data allocation from s-overlap is

R1 = {t1, t2}, R2 = {t3, t4}, R3 = {t1}, R4 = {t3}. (10)

Allocation 10 yields:

$\sum_{i=1}^{4} \frac{1}{|R_i|} \sum_{\substack{j=1 \\ j \neq i}}^{4} |R_i \cap R_j| = \frac{1}{2} + \frac{1}{2} + \frac{1}{1} + \frac{1}{1} = 3.$

With this allocation, we see that if agent U3 leaks his data we will equally suspect agents U1 and U3. Moreover, if agent U1 leaks his data we will suspect U3 with high probability, since he has half of the leaked data. The situation is similar for agents U2 and U4.

However, the following object allocation

R1 = {t1, t2}, R2 = {t1, t2}, R3 = {t3}, R4 = {t4} (11)

yields a sum-objective equal to 2/2 + 2/2 + 0 + 0 = 2 < 3, which shows that the first allocation is not optimal. With this allocation, we will equally suspect agents U1 and U2 if either of them leaks his data. However, if either U3 or U4 leaks his data we will detect him with high confidence. Hence, with the second allocation we have on average better chances of detecting a guilty agent.

7.2.3 Approximate Sum-Objective Minimization

The last example showed that we can minimize the sum-objective, and therefore increase the chances of detecting a guilty agent on average, by providing agents who have small requests with the objects shared among the fewest agents. This way, we improve our chances of detecting guilty agents with small data requests, at the expense of reducing our chances of detecting guilty agents with large data requests. However, this expense is small, since the probability to detect a guilty agent with many objects is less affected by the fact that other agents have also received his data (see Section 5.2). In [14] we provide an algorithm that implements this intuition and we denote it s-sum. Although we evaluate this algorithm in Section 8, we do not present the pseudocode here due to space limitations.

7.2.4 Approximate Max-Objective Minimization

Algorithm s-overlap is optimal for the max-objective optimization only if ∑_{i=1}^{n} mi ≤ |T|. Note also that s-sum as well as s-random ignore this objective. Say, for example, that set T contains four objects and there are 4 agents, each requesting a sample of 2 data objects. The aforementioned algorithms may produce the following data allocation:

R1 = {t1, t2}, R2 = {t1, t2}, R3 = {t3, t4} and R4 = {t3, t4}.

Although such an allocation minimizes the sum-objective, it allocates identical sets to two agent pairs.

Algorithm 7 Object Selection for s-max
1:  function SELECTOBJECT(i, R1, . . . , Rn, m1, . . . , mn)
2:      min_overlap ← 1       ▷ the minimum out of the maximum relative overlaps that the allocations of different objects to Ui yield
3:      for k ∈ {k′ | tk′ ∉ Ri} do
4:          max_rel_ov ← 0    ▷ the maximum relative overlap between Ri and any set Rj that the allocation of tk to Ui yields
5:          for all j = 1, . . . , n : j ≠ i and tk ∈ Rj do
6:              abs_ov ← |Ri ∩ Rj| + 1
7:              rel_ov ← abs_ov / min(mi, mj)
8:              max_rel_ov ← MAX(max_rel_ov, rel_ov)
9:          if max_rel_ov ≤ min_overlap then
10:             min_overlap ← max_rel_ov
11:             ret_k ← k
12:     return ret_k

Consequently, if an agent leaks his values, he will be equally guilty with an innocent agent.

To improve the worst-case behavior we present a new algorithm that builds upon Algorithm 4, which we used in s-random and s-overlap. We define a new SELECTOBJECT() procedure in Algorithm 7. We denote the new algorithm s-max. In this algorithm we allocate to an agent the object that yields the minimum increase of the maximum relative overlap among any pair of agents. If we apply s-max to the example above, after the first five main loop iterations in Algorithm 4 the Ri sets are:

R1 = {t1, t2}, R2 = {t2}, R3 = {t3} and R4 = {t4}.

In the next iteration, function SELECTOBJECT() must decide which object to allocate to agent U2. We see that only objects t3 and t4 are good candidates, since allocating t1 to U2 will yield a full overlap of R1 and R2. Function SELECTOBJECT() of s-max returns indeed t3 or t4.

The running time of SELECTOBJECT() is O(|T|n), since the external loop in lines 3-12 iterates over all objects that agent Ui has not received and the internal loop in lines 5-8 over all agents. This running time calculation implies that we keep the overlap sizes Ri ∩ Rj for all agents in a two-dimensional array that we update after every object allocation.

It can be shown that algorithm s-max is optimal for the sum-objective and the max-objective in problems where M ≤ |T|. It is also optimal for the max-objective if |T| ≤ M ≤ 2|T| or m1 = m2 = · · · = mn.
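A sketch of the s-max selector of Algorithm 7; note that it needs the full allocation R and the request sizes m, which the driver would have to pass as the "additional parameters" allowed in line 6 of Algorithm 4 (the exact signature is our assumption).

def select_object_max(i, R, m, a):
    """Algorithm 7 (s-max): pick the object whose allocation to U_i yields the
    minimum increase of the maximum relative overlap with any other agent.

    i: index of the agent being served
    R: list of sets of object indices (full current allocation)
    m: list of requested sample sizes
    a: a[k] is the number of agents currently holding object k
    """
    best_k, min_overlap = None, float("inf")
    for k in range(len(a)):
        if k in R[i]:
            continue
        max_rel_ov = 0.0  # worst relative overlap created by giving object k to U_i
        for j in range(len(R)):
            if j != i and k in R[j]:
                abs_ov = len(R[i] & R[j]) + 1
                max_rel_ov = max(max_rel_ov, abs_ov / min(m[i], m[j]))
        if max_rel_ov <= min_overlap:
            min_overlap, best_k = max_rel_ov, k
    return best_k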

8 EXPERIMENTAL RESULTS

We implemented the presented allocation algorithms in Python and we conducted experiments with simulated data leakage problems to evaluate their performance. In Section 8.1 we present the metrics we use for the algorithm evaluation, and in Sections 8.2 and 8.3 we present the evaluation for explicit data requests and sample requests respectively.

8.1 Metrics

In Section 7 we presented algorithms to optimize the problem of Equation 8, which is an approximation to the original optimization problem of Equation 7. In this section we evaluate the presented algorithms with respect to the original problem. In this way we measure not only the algorithm performance, but also implicitly evaluate how effective the approximation is.

The objectives in Equation 7 are the ∆ difference functions. Note that there are n(n − 1) objectives, since for each agent Ui there are n − 1 differences ∆(i, j) for j = 1, . . . , n and j ≠ i. We evaluate a given allocation with the following objective scalarizations as metrics:

$\Delta := \frac{\sum_{i,j=1,\dots,n,\; i \neq j} \Delta(i, j)}{n(n-1)}$   (12a)

$\min \Delta := \min_{i,j=1,\dots,n,\; i \neq j} \Delta(i, j)$   (12b)

Metric ∆ is the average of ∆(i, j) values for a given allocation and it shows how successful the guilt detection is, on average, for this allocation. For example, if ∆ = 0.4, then on average the probability Pr{Gi|Ri} for the actual guilty agent will be 0.4 higher than the probabilities of non-guilty agents. Note that this scalar version of the original problem objective is analogous to the sum-objective scalarization of the problem of Equation 8. Hence, we expect that an algorithm that is designed to minimize the sum-objective will maximize ∆.

Metric min ∆ is the minimum ∆(i, j) value and it corresponds to the case where agent Ui has leaked his data and both Ui and another agent Uj have very similar guilt probabilities. If min ∆ is small, then we will be unable to identify Ui as the leaker, versus Uj. If min ∆ is large, say 0.4, then no matter which agent leaks his data, the probability that he is guilty will be 0.4 higher than any other non-guilty agent. This metric is analogous to the max-objective scalarization of the approximate optimization problem.
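These two metrics can be computed directly from the ∆ values, e.g., with the following sketch that reuses the hypothetical delta() helper from Section 6.2:

def evaluation_metrics(R, p):
    """Average ∆ (Equation 12a) and min ∆ (Equation 12b) for an allocation R.

    R is a list of sets; p is the guessing probability used in the guilt model.
    """
    n = len(R)
    diffs = [delta(i, j, R, p) for i in range(n) for j in range(n) if i != j]
    return sum(diffs) / (n * (n - 1)), min(diffs)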

The values for these metrics that are considered acceptable will of course depend on the application. In particular, they depend on what might be considered high confidence that an agent is guilty. For instance, say that Pr{Gi|Ri} = 0.9 is enough to arouse our suspicion that agent Ui leaked data. Furthermore, say the difference between Pr{Gi|Ri} and any other Pr{Gj|Ri} is at least 0.3. In other words, the guilty agent is (0.9 − 0.6)/0.6 × 100% = 50% more likely to be guilty compared to the other agents. In this case we may be willing to take action against Ui (e.g., stop doing business with him, or prosecute him). In the rest of this section we will use value 0.3 as an example of what might be desired in ∆ values.

To calculate the guilt probabilities and ∆ differences we use throughout this section p = 0.5. Although not reported here, we experimented with other p values and observed that the relative performance of our algorithms and our main conclusions do not change. If p approaches 0, it becomes easier to find guilty agents and algorithm performance converges. On the other hand, if p approaches 1, the relative differences among algorithms grow since more evidence is needed to find an agent guilty.

8.2 Explicit Requests

In the first place, the goal of these experiments was to see whether fake objects in the distributed data sets yield significant improvement in our chances of detecting a guilty agent. In the second place, we wanted to evaluate our e-optimal algorithm relative to a random allocation.

We focus on scenarios with a few objects that are shared among multiple agents. These are the most interesting scenarios, since object sharing makes it difficult to distinguish a guilty from non-guilty agents. Scenarios with more objects to distribute or scenarios with objects shared among fewer agents are obviously easier to handle. As far as scenarios with many objects to distribute and many overlapping agent requests are concerned, they are similar to the scenarios we study, since we can map them to the distribution of many small subsets.

In our scenarios we have a set of |T| = 10 objects for which there are requests by n = 10 different agents. We assume that each agent requests 8 particular objects out of these 10. Hence, each object is shared on average among ∑_{i=1}^{n} |Ri| / |T| = 8 agents. Such scenarios yield very similar agent guilt probabilities and it is important to add fake objects.

similar agent guilt probabilities and it is important toadd fake objects. We generated a random scenario thatyielded ∆ = 0.073 and min ∆ = 0.35 and we appliedthe algorithms e-random and e-optimal to distribute fakeobjects to the agents (See [14] for other randomly gener-ated scenarios with the same parameters). We varied thenumber B of distributed fake objects from 2 to 20 andfor each value of B we ran both algorithms to allocatethe fake objects to agents. We ran e-optimal once foreach value of B, since it is a deterministic algorithm.Algorithm e-random is randomized and we ran it 10times for each value of B. The results we present arethe average over the 10 runs.

Figure 3(a) shows how fake object allocation can affect ∆. There are three curves in the plot. The solid curve is constant and shows the ∆ value for an allocation without fake objects (totally defined by the agents' requests). The other two curves correspond to algorithms e-optimal and e-random. The y-axis shows ∆ and the x-axis shows the ratio of the number of distributed fake objects to the total number of objects that the agents explicitly request.

We observe that distributing fake objects can significantly improve, on average, the chances of detecting a guilty agent. Even the random allocation of approximately 10% to 15% fake objects yields ∆ > 0.3. The use of e-optimal improves ∆ further, since the e-optimal curve lies consistently above the 95% confidence intervals of e-random. The performance difference between the two algorithms would be greater if the agents did not request the same number of objects, since this symmetry allows non-smart fake object allocations to be more effective than in asymmetric scenarios. However, we do not study


[Fig. 3. Evaluation of Explicit Data Request Algorithms: (a) average ∆ and (b) average min ∆, each shown with 95% confidence intervals as a function of B / Σ_{i=1}^{n} |Ri − Fi|, for the no-fake, e-random, and e-optimal allocations.]

this issue further here, since the advantages of e-optimal become obvious when we look at our second metric.

Figure 3(b) shows the value of min ∆ as a function of the fraction of fake objects. The plot shows that random allocation yields an insignificant improvement in our chances of detecting a guilty agent in the worst-case scenario. This was expected, since e-random does not take into consideration which agents "must" receive a fake object to differentiate their requests from those of other agents. On the contrary, algorithm e-optimal can yield min ∆ > 0.3 with the allocation of approximately 10% fake objects. This improvement is very important taking into account that without fake objects both min ∆ and ∆ are close to 0. That means that by allocating 10% fake objects, the distributor can detect a guilty agent even in the worst-case leakage scenario, while without fake objects he will be unsuccessful not only in the worst case but also in the average case.

Incidentally, the two jumps in the e-optimal curve are due to the symmetry of our scenario. Algorithm e-optimal allocates almost one fake object per agent before allocating a second fake object to one of them.

The presented experiments confirmed that fake objects can have a significant impact on our chances of detecting a guilty agent. Note also that the algorithm evaluation was on the original objective. Hence, the superior performance of e-optimal (which is optimal for the approximate objective) indicates that our approximation is effective.

8.3 Sample Requests

With sample data requests, agents are not interested in particular objects. Hence, object sharing is not explicitly defined by their requests. The distributor is "forced" to allocate certain objects to multiple agents only if the number of requested objects Σ_{i=1}^{n} mi exceeds the number of objects in set T. The more data objects the agents request in total, the more recipients an object has on average; and the more objects are shared among different agents, the more difficult it is to detect a guilty agent. Consequently, the parameter that primarily defines the difficulty of a problem with sample data requests is the ratio Σ_{i=1}^{n} mi / |T|. We call this ratio the load. Note also that the absolute values of m1, . . . , mn and |T| play a less important role than the relative values mi/|T|. Say, for example, that |T| = 99 and algorithm X yields a good allocation for the agents' requests m1 = 66 and m2 = m3 = 33. For any |T| with m1/|T| = 2/3 and m2/|T| = m3/|T| = 1/3 the problem is essentially the same, and algorithm X would still yield a good allocation.

In our experimental scenarios, set T has 50 objects and we vary the load. There are two ways to vary this number: (a) assume that the number of agents is fixed and vary their sample sizes mi, or (b) vary the number of agents who request data. The latter choice captures how a real problem may evolve. The distributor may act to attract more or fewer agents for his data, but he does not have control over the agents' requests. Moreover, increasing the number of agents allows us to increase the value of the load arbitrarily, while varying the agents' requests imposes an upper bound of n|T| on the total number of requested objects.

Our first scenario includes two agents with requests m1 and m2 that we chose uniformly at random from the interval [6, 15]. For this scenario we ran each of the algorithms s-random (baseline), s-overlap, s-sum and s-max 10 different times, since they all include randomized steps. For each run of every algorithm we calculated ∆ and min ∆ and averaged them over the 10 runs. The second scenario adds agent U3, with m3 ∼ U[6, 15], to the two agents of the first scenario. We repeated the 10 runs of each algorithm to allocate objects to the three agents of the second scenario and calculated the two metric values for each run. We continued adding agents and creating new scenarios until we reached 30 different scenarios; the last one had 31 agents. Note that we create a new scenario by adding an agent with a random request mi ∼ U[6, 15] instead of assuming mi = 10 for the new agent. We did that to avoid studying scenarios with equal agent sample request sizes, where certain algorithms have particular properties, e.g., s-overlap optimizes the sum-objective if all requests are the same size, but this does not hold in the general case.
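The scenario construction just described can be sketched as follows, assuming |T| = 50 and mi ∼ U[6, 15]; the allocation algorithms themselves (s-random, s-overlap, s-sum, s-max) are defined earlier in the paper and would be plugged into the commented loop, and the function name and parameters here are ours:

import random

def generate_scenarios(min_agents=2, max_agents=31, low=6, high=15, seed=0):
    """Yield lists of sample sizes [m1, ..., mk], adding one agent at a time.

    Each new agent draws mi ~ U[low, high]; earlier agents keep the sizes
    they were assigned, so scenario k+1 extends scenario k.
    """
    rng = random.Random(seed)
    sizes = [rng.randint(low, high) for _ in range(min_agents)]
    yield list(sizes)
    while len(sizes) < max_agents:
        sizes.append(rng.randint(low, high))
        yield list(sizes)

# 30 scenarios in total, from 2 up to 31 agents over |T| = 50 objects.
for sizes in generate_scenarios():
    scenario_load = sum(sizes) / 50
    # ... run s-random, s-overlap, s-sum and s-max 10 times each on this
    # scenario and record the average Delta and min Delta ...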

In Figure 4(a) we plot the ∆ values that we found in our scenarios. There are four curves, one for each


[Fig. 4. Evaluation of Sample Data Request Algorithms: (a) average ∆ (with 95% confidence intervals), (b) average Pr{Gi|Si}, and (c) average min ∆ (with 95% confidence intervals), each plotted against Σ_{i=1}^{n} |Ri| / |T|, for algorithms s-random, s-overlap, s-sum, and s-max.]

algorithm. The x-coordinate of a curve point shows the ratio of the total number of requested objects to the number of objects in T for the scenario. The y-coordinate shows the average value of ∆ over all 10 runs, and the error bar around each point shows the 95% confidence interval of the ∆ values in the ten different runs. Note that algorithms s-overlap, s-sum and s-max yield ∆ values that are close to 1 if agents request in total fewer objects than |T|. This was expected, since in such scenarios all three algorithms yield disjoint set allocations, which is the optimal solution. In all scenarios algorithm s-sum outperforms the other ones. Algorithms s-overlap and s-max yield similar ∆ values that lie between those of s-sum and s-random. All algorithms have ∆ around 0.5 for load = 4.5, which we believe is an acceptable value.

Note that in Figure 4(a) the performances of all algorithms appear to converge as the load increases. This is not true, and we justify that using Figure 4(b), which shows the average guilt probability in each scenario for the actual guilty agent. Every curve point shows the mean over all 10 algorithm runs and we have omitted confidence intervals to make the plot easy to read. Note that the guilt probability for the random allocation remains significantly higher than for the other algorithms for large values of the load. For example, if load ≈ 5.5, algorithm s-random yields on average guilt probability 0.8 for a guilty agent and 0.8 − ∆ = 0.35 for a non-guilty agent. Their relative difference is (0.8 − 0.35)/0.35 ≈ 1.3. The corresponding probabilities that s-sum yields are 0.75 and 0.25, with relative difference (0.75 − 0.25)/0.25 = 2. Despite the fact that the absolute values of ∆ converge, the relative differences in the guilt probabilities between guilty and non-guilty agents are significantly higher for s-max and s-sum compared to s-random. By comparing the curves in both figures we conclude that s-sum outperforms the other algorithms for small load values. As the number of objects that the agents request increases, its performance becomes comparable to s-max. In such cases both algorithms yield very good chances on average of detecting a guilty agent. Finally, algorithm s-overlap is inferior to them, but it still yields a significant improvement with respect to the baseline.

In Figure 4(c) we show the performance of all four algorithms with respect to the min ∆ metric. This figure is similar to Figure 4(a) and the only change is the y-axis. Algorithm s-sum now has the worst performance among all algorithms. It allocates all highly shared objects to agents who request a large sample and, consequently, these agents receive the same object sets. Two agents Ui and Uj who receive the same set have ∆(i, j) = ∆(j, i) = 0, so if either of Ui and Uj leaks his data we cannot distinguish which of them is guilty. Random allocation also has poor performance, since as the number of agents increases, the probability that at least two agents receive many common objects becomes higher. Algorithm s-overlap limits the random selection to the allocations that achieve the minimum absolute overlap summation. This fact improves the min ∆ values on average, since the smaller absolute overlap reduces object sharing and, consequently, the chances that any two agents receive sets with many common objects.

Algorithm s-max, which greedily allocates objects to optimize the max-objective, outperforms all other algorithms and is the only one that yields min ∆ > 0.3 for high values of Σ_{i=1}^{n} mi. Observe that the algorithm that targets sum-objective minimization proved to be the best for ∆ maximization, and the algorithm that targets max-objective minimization was the best for min ∆ maximization. These facts confirm that the approximation of the objective of Equation 7 with that of Equation 8 is effective.

9 CONCLUSIONS

In a perfect world there would be no need to hand over sensitive data to agents that may unknowingly or maliciously leak it. And even if we had to hand over sensitive data, in a perfect world we could watermark each object so that we could trace its origins with absolute certainty. However, in many cases we must indeed work with agents that may not be 100% trusted, and we may not be certain if a leaked object came from an agent or from some other source, since certain data cannot admit watermarks.

In spite of these difficulties, we have shown it is possible to assess the likelihood that an agent is responsible


for a leak, based on the overlap of his data with the leaked data and the data of other agents, and based on the probability that objects can be "guessed" by other means. Our model is relatively simple, but we believe it captures the essential trade-offs. The algorithms we have presented implement a variety of data distribution strategies that can improve the distributor's chances of identifying a leaker. We have shown that distributing objects judiciously can make a significant difference in identifying guilty agents, especially in cases where there is large overlap in the data that agents must receive.

Our future work includes the investigation of agent guilt models that capture leakage scenarios that are not studied in this paper. For example, what is the appropriate model for cases where agents can collude and identify fake tuples? A preliminary discussion of such a model is available in [14]. Another open problem is the extension of our allocation strategies so that they can handle agent requests in an online fashion (the presented strategies assume that there is a fixed set of agents with requests known in advance).

ACKNOWLEDGMENTS

This work was supported by NSF Grant CFF-0424422 and an Onassis Foundation Scholarship. We would like to thank Paul Heymann for his help with running the non-polynomial guilt model detection algorithm that we present in the Appendix of [14] on a Hadoop cluster. We also thank Ioannis Antonellis for fruitful discussions and his comments on earlier versions of this paper.

REFERENCES

[1] R. Agrawal and J. Kiernan. Watermarking relational databases. In VLDB '02: Proceedings of the 28th International Conference on Very Large Data Bases, pages 155–166. VLDB Endowment, 2002.

[2] P. Bonatti, S. D. C. di Vimercati, and P. Samarati. An algebra for composing access control policies. ACM Trans. Inf. Syst. Secur., 5(1):1–35, 2002.

[3] P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In J. V. den Bussche and V. Vianu, editors, Database Theory - ICDT 2001, 8th International Conference, London, UK, January 4-6, 2001, Proceedings, volume 1973 of Lecture Notes in Computer Science, pages 316–330. Springer, 2001.

[4] P. Buneman and W.-C. Tan. Provenance in databases. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 1171–1173, New York, NY, USA, 2007. ACM.

[5] Y. Cui and J. Widom. Lineage tracing for general data warehouse transformations. In The VLDB Journal, pages 471–480, 2001.

[6] S. Czerwinski, R. Fromm, and T. Hodes. Digital music distribution and audio watermarking.

[7] F. Guo, J. Wang, Z. Zhang, X. Ye, and D. Li. An improved algorithm to watermark numeric relational data. In Information Security Applications, pages 138–149. Springer, Berlin/Heidelberg, 2006.

[8] F. Hartung and B. Girod. Watermarking of uncompressed and compressed video. Signal Processing, 66(3):283–301, 1998.

[9] S. Jajodia, P. Samarati, M. L. Sapino, and V. S. Subrahmanian. Flexible support for multiple access control policies. ACM Trans. Database Syst., 26(2):214–260, 2001.

[10] Y. Li, V. Swarup, and S. Jajodia. Fingerprinting relational databases: Schemes and specialties. IEEE Transactions on Dependable and Secure Computing, 2(1):34–45, 2005.

[11] B. Mungamuru and H. Garcia-Molina. Privacy, preservation and performance: The 3 P's of distributed data management. Technical report, Stanford University, 2008.

[12] V. N. Murty. Counting the integer solutions of a linear equation with unit coefficients. Mathematics Magazine, 54(2):79–81, 1981.

[13] S. U. Nabar, B. Marthi, K. Kenthapadi, N. Mishra, and R. Motwani. Towards robustness in query auditing. In VLDB '06: Proceedings of the 32nd International Conference on Very Large Data Bases, pages 151–162. VLDB Endowment, 2006.

[14] P. Papadimitriou and H. Garcia-Molina. Data leakage detection. Technical report, Stanford University, 2008.

[15] P. M. Pardalos and S. A. Vavasis. Quadratic programming with one negative eigenvalue is NP-hard. Journal of Global Optimization, 1(1):15–22, 1991.

[16] J. J. K. O. Ruanaidh, W. J. Dowling, and F. M. Boland. Watermarking digital images for copyright protection. IEE Proceedings on Vision, Signal and Image Processing, 143(4):250–256, 1996.

[17] R. Sion, M. Atallah, and S. Prabhakar. Rights protection for relational data. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 98–109, New York, NY, USA, 2003. ACM.

[18] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression, 2002.

Panagiotis Papadimitriou is a PhD student in the Department of Electrical Engineering at Stanford University, Stanford, California. He received a Diploma in electrical and computer engineering from the National Technical University of Athens in 2006 and an MS in electrical engineering from Stanford University in 2008. His research interests include internet advertising, data mining, data privacy and web search.

Hector Garcia-Molina is the Leonard Bosack and Sandra Lerner Professor in the Departments of Computer Science and Electrical Engineering at Stanford University, Stanford, California. He was the chairman of the Computer Science Department from January 2001 to December 2004. From 1997 to 2001 he was a member of the President's Information Technology Advisory Committee (PITAC). From August 1994 to December 1997 he was the Director of the Computer Systems Laboratory at Stanford. From 1979 to 1991 he was on the faculty of the Computer Science Department at Princeton University, Princeton, New Jersey. His research interests include distributed computing systems, digital libraries and database systems. He received a BS in electrical engineering from the Instituto Tecnologico de Monterrey, Mexico, in 1974. From Stanford University, Stanford, California, he received an MS in electrical engineering in 1975 and a PhD in computer science in 1979. He holds an honorary PhD from ETH Zurich (2007). Garcia-Molina is a Fellow of the Association for Computing Machinery and of the American Academy of Arts and Sciences; is a member of the National Academy of Engineering; received the 1999 ACM SIGMOD Innovations Award; is a Venture Advisor for Onset Ventures; and is a member of the Board of Directors of Oracle.