Dec 16, 2015

Page 1: Detecting Data Leakage Panagiotis Papadimitriou papadimitriou@stanford.edu Hector Garcia-Molina hector@cs.stanford.edu.

Detecting Data Leakage

Panagiotis Papadimitriou
papadimitriou@stanford.edu

Hector Garcia-Molina
hector@cs.stanford.edu

Leakage Problem

Stanford Infolab 2

[Figure: profiles of Kathryn, Jeremy, Sarah and Mark; applications App. U1 and App. U2; other sources, e.g. Sarah's network. Sample profile records: "Name: Mark, Sex: Male, …" and "Name: Sarah, Sex: Female, …".]

Outline

• Problem Description

• Guilt Models
  – Pr{U1 leaked data} = 0.7
  – Pr{U2 leaked data} = 0.2

• Distribution Strategies

Stanford Infolab 3

• Problem Description

• Guilt Models

• Distribution Strategies

Stanford Infolab 4

Problem Entities

Entity                             Dataset
Distributor: Facebook              T: Set of all Facebook profiles
Agents: Facebook Apps U1, …, Un    R1, …, Rn (Ri: Set of profiles of the people who have added the application Ui)
Leaker                             S: Set of leaked profiles

Stanford Infolab 5

Agents’ Data Requests

• Sample
  – 100 profiles of Stanford people

• Explicit
  – All people who added application (example we used so far)
  – All Stanford profiles
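The two request types could be modeled roughly as follows. This is a minimal sketch, not code from the talk: the names `SampleRequest`, `ExplicitRequest` and `serve`, the `school` field, and the example values are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Set, Union

Profile = Dict[str, str]  # hypothetical record layout, e.g. {"school": "Stanford"}


@dataclass
class SampleRequest:
    """Agent asks for any m profiles that satisfy a condition."""
    m: int
    condition: Callable[[Profile], bool]


@dataclass
class ExplicitRequest:
    """Agent asks for all profiles that satisfy a condition."""
    condition: Callable[[Profile], bool]


def serve(request: Union[SampleRequest, ExplicitRequest],
          profiles: Dict[str, Profile],
          rng: random.Random) -> Set[str]:
    """Return the names of the profiles handed to the agent."""
    matching = sorted(name for name, p in profiles.items() if request.condition(p))
    if isinstance(request, ExplicitRequest):
        return set(matching)          # no freedom: every match must be returned
    # Sample request: the distributor is free to choose which m matches to give.
    return set(rng.sample(matching, min(request.m, len(matching))))


def is_stanford(profile: Profile) -> bool:
    return profile["school"] == "Stanford"


# Made-up data for illustration only.
profiles = {
    "Kathryn": {"school": "Stanford"},
    "Jeremy": {"school": "Stanford"},
    "Sarah": {"school": "Stanford"},
    "Mark": {"school": "Other"},
}
explicit = serve(ExplicitRequest(is_stanford), profiles, random.Random(0))
sample = serve(SampleRequest(2, is_stanford), profiles, random.Random(0))
```

The point of the distinction is visible in `serve`: only the sample branch leaves the distributor a choice, which is what the distribution strategies later exploit.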

Stanford Infolab 6

• Problem Description

• Guilt Models

• Distribution Strategies

Stanford Infolab 7

Guilt Models (1/3)

Stanford Infolab 8

[Figure: a leaked profile may come from other sources (e.g. Sarah's network) with probability p]

p: posterior probability that a leaked profile comes from other sources

Guilty Agent: Agent who leaks at least one profile

Pr{Gi|S}: probability that agent Ui is guilty, given the leaked set of profiles S
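The slide defines Pr{Gi|S} but does not show how to compute it. One possible estimate is sketched below under assumptions that are not stated on the slide: each leaked profile comes from other sources with probability p, otherwise from exactly one of the agents that hold it (each equally likely), and different profiles leak independently. Under those assumptions Pr{Gi|S} = 1 − ∏ (1 − (1 − p)/|Vt|) over the leaked profiles t that Ui holds, where Vt is the set of agents holding t. The function name and example data are hypothetical.

```python
from typing import Dict, Set


def guilt_probability(agent: str,
                      allocations: Dict[str, Set[str]],
                      leaked: Set[str],
                      p: float) -> float:
    """Estimate Pr{G_agent | S} under the independence assumptions above."""
    prob_innocent = 1.0
    for item in leaked & allocations[agent]:
        # Number of agents that were given this item (|V_t|).
        holders = sum(1 for given in allocations.values() if item in given)
        # Probability that this agent is NOT the one who leaked the item.
        prob_innocent *= 1.0 - (1.0 - p) / holders
    return 1.0 - prob_innocent


# Made-up allocation: U1 holds both leaked profiles, U2 holds only one.
allocations = {"U1": {"Sarah", "Mark"}, "U2": {"Sarah"}}
leaked = {"Sarah", "Mark"}
g1 = guilt_probability("U1", allocations, leaked, p=0.2)
g2 = guilt_probability("U2", allocations, leaked, p=0.2)
```

In this example U1 comes out as more likely guilty than U2, because one of the leaked profiles was given to U1 alone.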

Guilt Models (2/3)

Stanford Infolab 9

• Agents leak each of their data items independently

• Agents leak all their data items OR nothing

[Figure: alternative leak scenarios joined by "or", with probabilities (1-p)^2, (1-p)p, p(1-p) and p^2]
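The figure for this slide is not in the transcript, so the following is only one reading of the four terms: with two leaked profiles, each one independently comes from other sources with probability p and from an agent with probability 1 − p. Enumerating the origins then gives exactly four scenarios. The function name and the labels "agent" and "other" are illustrative.

```python
from itertools import product
from typing import Dict, Tuple


def scenario_probabilities(p: float, n_items: int = 2) -> Dict[Tuple[str, ...], float]:
    """Probability of every combination of origins for n_items leaked profiles."""
    scenarios = {}
    for origins in product(("agent", "other"), repeat=n_items):
        prob = 1.0
        for origin in origins:
            prob *= p if origin == "other" else 1.0 - p
        scenarios[origins] = prob
    return scenarios


scenarios = scenario_probabilities(p=0.2)
total = sum(scenarios.values())
```

Under this reading the four probabilities are exhaustive, so they sum to 1 for any p.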

Guilt Models (3/3)

[Figure: two plots, "Independently" and "NOT Independently", each with axes Pr{G1} and Pr{G2}]

Stanford Infolab 10

• Problem Description

• Guilt Models

• Distribution Strategies

Stanford Infolab 11

The Distributor’s Objective (1/2)

Stanford Infolab 12

[Figure: agents U1, U2, U3 and U4 each send a request and receive the sets R1, R2, R3 and R4. For the leaked set S shown, Pr{G1|S} >> Pr{G2|S} and Pr{G1|S} >> Pr{G4|S}.]

The Distributor’s Objective (2/2)

• To achieve his objective the distributor has to distribute sets R1, …, Rn that minimize

  \sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|, \quad i, j = 1, \ldots, n

• Intuition: Minimized data sharing among agents makes leaked data reveal the guilty agents

Stanford Infolab 13
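The objective can be evaluated directly for any candidate distribution. The sketch below assumes the sum is the one on this slide, with each agent's overlap normalized by the size of its own set; the function name and the example allocations are hypothetical.

```python
from typing import Dict, Set


def overlap_objective(allocations: Dict[str, Set[str]]) -> float:
    """Sum over agents i of (1/|R_i|) * sum over j != i of |R_i & R_j|."""
    total = 0.0
    for i, r_i in allocations.items():
        if not r_i:
            continue  # an empty set shares nothing and would divide by zero
        shared = sum(len(r_i & r_j) for j, r_j in allocations.items() if j != i)
        total += shared / len(r_i)
    return total


# Made-up allocations: disjoint sets versus identical sets.
disjoint = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Sarah", "Mark"}}
identical = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Kathryn", "Jeremy"}}
score_disjoint = overlap_objective(disjoint)
score_identical = overlap_objective(identical)
```

Disjoint sets score 0, the best possible value, while two identical sets score 2 because each agent shares its whole set with the other.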

Distribution Strategies – Sample (1/4)

• Set T has four profiles:
  – Kathryn, Jeremy, Sarah and Mark

• There are 4 agents:
  – U1, U2, U3 and U4

• Each agent requests a sample of any 2 profiles of T for a market survey

Stanford Infolab 14

Distribution Strategies – Sample (2/4)

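Poor

Minimize \sum_{i \neq j} |R_i \cap R_j|

Stanford Infolab 15

[Figure: two diagrams of agents U1, U2, U3 and U4, one labeled "Poor" and one labeled with the minimization objective]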

Distribution Strategies – Sample (3/4)

• Optimal Distribution

• Avoid full overlaps and minimize

  \sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|

Stanford Infolab 16

[Figure: diagram of the distribution to agents U1, U2, U3 and U4]
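The diagrams for the four-profile example are not in the transcript, so the three allocations below are made up to illustrate why "avoid full overlaps" is a separate condition. The sketch counts shared profiles over all agent pairs and lists pairs of agents that hold identical sets; the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def pairwise_overlap(allocations: Dict[str, Set[str]]) -> int:
    """Total number of shared profiles, counted once per pair of agents."""
    return sum(len(a & b) for a, b in combinations(allocations.values(), 2))


def full_overlaps(allocations: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Pairs of agents that were given exactly the same set."""
    return [(i, j)
            for (i, a), (j, b) in combinations(allocations.items(), 2)
            if a == b]


# Hypothetical allocations: every agent receives 2 of the 4 profiles.
poor = {u: {"Kathryn", "Jeremy"} for u in ("U1", "U2", "U3", "U4")}
paired = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Kathryn", "Jeremy"},
          "U3": {"Sarah", "Mark"}, "U4": {"Sarah", "Mark"}}
spread = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Jeremy", "Sarah"},
          "U3": {"Sarah", "Mark"}, "U4": {"Mark", "Kathryn"}}

overlaps = {name: pairwise_overlap(alloc)
            for name, alloc in (("poor", poor), ("paired", paired), ("spread", spread))}
```

Here `paired` and `spread` tie on total overlap, but `paired` gives U1 and U2 (and U3 and U4) identical sets, so a leak of such a set cannot tell the two agents apart. Only `spread` has no full overlaps.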

Distribution Strategies – Sample (4/4)

Stanford Infolab 17

Distribution Strategies

Sample Data Requests

• The distributor has the freedom to select the data items to provide the agents with

• General Idea:
  – Provide agents with sets of data that are as disjoint as possible

• Problem: There are cases where the distributed data must overlap, e.g., |R1|+…+|Rn| > |T|
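For sample requests, one simple way to keep the sets as disjoint as possible is a greedy rule: give each agent the items that have been handed out the fewest times so far. This is an illustrative sketch, not necessarily the strategy used by the authors; the function name, the alphabetical tie-break and the example requests are assumptions.

```python
from typing import Dict, List, Set


def allocate_samples(items: List[str], requests: Dict[str, int]) -> Dict[str, Set[str]]:
    """Serve each agent's request for m items with the least-shared items first."""
    times_given = {item: 0 for item in items}
    allocations: Dict[str, Set[str]] = {}
    for agent, m in requests.items():
        # Prefer items given to the fewest agents; break ties by name
        # so that the result is deterministic.
        chosen = sorted(items, key=lambda t: (times_given[t], t))[:m]
        allocations[agent] = set(chosen)
        for item in chosen:
            times_given[item] += 1
    return allocations


T = ["Kathryn", "Jeremy", "Sarah", "Mark"]
two_agents = allocate_samples(T, {"U1": 2, "U2": 2})             # 2 + 2 <= |T|
three_agents = allocate_samples(T, {"U1": 2, "U2": 2, "U3": 2})  # 2 + 2 + 2 > |T|
```

With two agents the sets come out disjoint. With three agents the requests add up to more than |T|, so some overlap is unavoidable. This greedy rule only balances how often each item is shared; it does not check for identical sets, so avoiding full overlaps would need a further tie-breaking rule.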

Explicit Data Requests

• The distributor must provide agents with the data they request

• General Idea:
  – Add fake data to the distributed ones to minimize overlap of distributed data

• Problem: Agents can collude and identify fake data

• NOT COVERED in this talk

Stanford Infolab 18

Conclusions

• Data Leakage

• Modeled as maximum likelihood problem

• Data distribution strategies that help identify the guilty agents

Stanford Infolab 19

Thank You!