Detecting Data Leakage
Panagiotis Papadimitriou [email protected]
Hector Garcia-Molina [email protected]
Dec 16, 2015
Leakage Problem
Stanford Infolab 2
[Figure: applications U1 and U2; profiles of Kathryn, Jeremy, Sarah and Mark (e.g. Name: Mark, Sex: Male; Name: Sarah, Sex: Female); other sources, e.g. Sarah's network.]
Outline
• Problem Description
• Guilt Models
  – Pr{U1 leaked data} = 0.7
  – Pr{U2 leaked data} = 0.2
• Distribution Strategies
Problem Entities
Entity                             Dataset
Distributor: Facebook              T: set of all Facebook profiles
Agents: Facebook apps U1, …, Un    R1, …, Rn (Ri: set of profiles of people who have added the application Ui)
Leaker                             S: set of leaked profiles
Agents’ Data Requests
• Sample
  – 100 profiles of Stanford people
• Explicit
  – All people who added application (example we used so far)
  – All Stanford profiles
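The difference between the two request types can be illustrated with a minimal Python sketch. The profile records, the `school` attribute and both function names are hypothetical, and uniform random sampling is only one possible way for the distributor to pick a sample; the slides do not prescribe it.

```python
import random

# Hypothetical profile store: name -> attributes (illustrative data only).
PROFILES = {
    "Kathryn": {"school": "Stanford"},
    "Jeremy": {"school": "Stanford"},
    "Sarah": {"school": "Stanford"},
    "Mark": {"school": "Berkeley"},
}

def explicit_request(profiles, condition):
    """Explicit request: return every profile that satisfies the condition."""
    return {name for name, attrs in profiles.items() if condition(attrs)}

def sample_request(profiles, condition, m, rng=None):
    """Sample request: return any m profiles that satisfy the condition."""
    rng = rng or random.Random()
    # Sort so that the result depends only on the seed, not on set ordering.
    eligible = sorted(explicit_request(profiles, condition))
    if m > len(eligible):
        raise ValueError("not enough matching profiles")
    return set(rng.sample(eligible, m))

def is_stanford(attrs):
    return attrs["school"] == "Stanford"

all_stanford = explicit_request(PROFILES, is_stanford)
two_stanford = sample_request(PROFILES, is_stanford, 2, random.Random(0))
```

With an explicit request the distributor has no choice about which profiles to hand out; with a sample request any subset of the right size is acceptable, which is the freedom the distribution strategies exploit.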
Guilt Models (1/3)
[Figure: a leaked profile may come from other sources, e.g. Sarah's network, with probability p.]
p: posterior probability that a leaked profile comes from other sources
Guilty Agent: agent who leaks at least one profile
Pr{Gi|S}: probability that agent Ui is guilty, given the leaked set of profiles S
Guilt Models (2/3)
• Agents leak each of their data items independently
• Agents leak all their data items OR nothing
[Figure: alternative leak scenarios joined by "or", labeled with the probabilities (1-p)^2, (1-p)p, p(1-p) and p^2.]
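A minimal sketch of how Pr{Gi|S} could be estimated under the independent-leak model. It rests on an assumption that is not stated on the slides: each leaked profile comes from other sources with probability p, and otherwise was leaked by exactly one of the agents holding it, each being equally likely. The function name and the example data are hypothetical.

```python
from math import prod

def guilt_probabilities(requests, leaked, p):
    """Estimate Pr{Gi|S} for every agent under the independent-leak assumption.

    requests: dict mapping agent name -> set of items Ri given to that agent
    leaked:   set S of leaked items
    p:        probability that a leaked item comes from other sources
    """
    # holders[t] = number of agents that received leaked item t
    holders = {t: sum(t in r for r in requests.values()) for t in leaked}
    result = {}
    for agent, items in requests.items():
        # Probability that the agent leaked none of the leaked items it holds
        # (ASSUMPTION: items are leaked independently of each other).
        innocent = prod(1 - (1 - p) / holders[t] for t in items & leaked)
        result[agent] = 1 - innocent
    return result

# Hypothetical example: U1 holds both leaked profiles, U2 holds only one.
requests = {"U1": {"Sarah", "Mark"}, "U2": {"Mark"}}
leaked = {"Sarah", "Mark"}
guilt = guilt_probabilities(requests, leaked, p=0.5)
```

In this example U1 is the only agent holding Sarah's profile, so its estimated guilt (0.625) is clearly higher than that of U2 (0.25), which shares its only leaked profile with U1.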
The Distributor’s Objective (1/2)
[Figure: agents U1, U2, U3 and U4 each issue a request and receive sets R1, R2, R3 and R4; S is the leaked set.]
Pr{G1|S} >> Pr{G2|S}
Pr{G1|S} >> Pr{G4|S}
The Distributor’s Objective (2/2)
• To achieve his objective, the distributor has to distribute sets R1, …, Rn that minimize
  $\sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|, \quad i, j = 1, \ldots, n$
• Intuition: Minimizing data sharing among agents makes the leaked data reveal the guilty agents
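The overlap objective of this slide, the sum over all agents i of (1/|Ri|) times the total overlap of Ri with every other set Rj, can be computed with a few lines of Python. This is a sketch only: the function name and the example allocations are hypothetical, and every Ri is assumed to be non-empty.

```python
def sum_overlap_objective(allocation):
    """Return sum_i (1/|Ri|) * sum_{j != i} |Ri & Rj| for a list of sets."""
    total = 0.0
    for i, r_i in enumerate(allocation):
        # Total number of items that Ri shares with every other agent's set.
        shared = sum(len(r_i & r_j) for j, r_j in enumerate(allocation) if j != i)
        total += shared / len(r_i)  # assumes Ri is non-empty
    return total

disjoint = [{"Kathryn", "Jeremy"}, {"Sarah", "Mark"}]
overlapping = [{"Kathryn", "Jeremy"}, {"Jeremy", "Sarah"}]

disjoint_value = sum_overlap_objective(disjoint)
overlapping_value = sum_overlap_objective(overlapping)
```

Disjoint sets give an objective of 0; every profile that two agents share raises the value, and a shared profile weighs more for agents that hold few profiles.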
Distribution Strategies – Sample (1/4)
• Set T has four profiles:
  – Kathryn, Jeremy, Sarah and Mark
• There are 4 agents:
  – U1, U2, U3 and U4
• Each agent requests a sample of any 2 profiles of T for a market survey
Distribution Strategies – Sample (2/4)
• Minimize $\sum_{i \neq j} |R_i \cap R_j|$
[Figure: two example allocations of profiles to agents U1, U2, U3 and U4, one of them labeled "Poor".]
Distribution Strategies – Sample (3/4)
• Optimal Distribution
• Avoid full overlaps and minimize $\sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|$
[Figure: allocation of profiles to agents U1, U2, U3 and U4.]
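For the running example (four profiles, four agents, two profiles per agent) the search space is small enough to enumerate. The sketch below is a brute-force illustration, not the method of the talk: it tries every allocation, discards those in which two agents receive exactly the same set, and keeps one that minimizes the overlap objective. All function and variable names are hypothetical.

```python
from itertools import combinations, product

T = ["Kathryn", "Jeremy", "Sarah", "Mark"]
N_AGENTS = 4
SAMPLE_SIZE = 2

def sum_overlap_objective(allocation):
    """Return sum_i (1/|Ri|) * sum_{j != i} |Ri & Rj| for a sequence of sets."""
    total = 0.0
    for i, r_i in enumerate(allocation):
        shared = sum(len(r_i & r_j) for j, r_j in enumerate(allocation) if j != i)
        total += shared / len(r_i)
    return total

def has_full_overlap(allocation):
    """True if two agents receive exactly the same set of profiles."""
    return len(set(allocation)) < len(allocation)

# Every possible 2-profile sample, as hashable frozensets.
candidates = [frozenset(c) for c in combinations(T, SAMPLE_SIZE)]

# Enumerate all 6^4 allocations and keep the best one without full overlaps.
best = min(
    (a for a in product(candidates, repeat=N_AGENTS) if not has_full_overlap(a)),
    key=sum_overlap_objective,
)
best_value = sum_overlap_objective(best)

# A "poor" allocation for comparison: agents pairwise receive identical sets.
poor = (
    frozenset({"Kathryn", "Jeremy"}), frozenset({"Kathryn", "Jeremy"}),
    frozenset({"Sarah", "Mark"}), frozenset({"Sarah", "Mark"}),
)
poor_value = sum_overlap_objective(poor)
```

In this small example the poor allocation and the best allocation without full overlaps reach the same objective value, which suggests why avoiding full overlaps is listed as a separate requirement: when two agents hold identical sets, a leak of that set cannot distinguish between them.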
Distribution Strategies
Sample Data Requests
• The distributor has the freedom to select the data items to provide the agents with
• General Idea:
  – Provide agents with sets of data that are as disjoint as possible
• Problem: There are cases where the distributed data must overlap, e.g., |R1|+…+|Rn|>|T|
Explicit Data Requests
• The distributor must provide agents with the data they request
• General Idea:
  – Add fake data to the distributed data to minimize the overlap of distributed data
• Problem: Agents can collude and identify fake data
• NOT COVERED in this talk
Conclusions
• Data Leakage
• Modeled as a maximum likelihood problem
• Data distribution strategies that help identify the guilty agents