Privacy through Accountability Anupam Datta Associate Professor CSD, ECE, CyLab Carnegie Mellon University
Personal Information is Everywhere
Research Challenge
Ensure organizations respect privacy expectations,
regulations, and organizational policies in the collection,
use, and disclosure of personal information
Programs and People
Web Advertising
Example privacy policies:
Not use detailed location (full IP address) for advertising
Not use health information for advertising
Privacy through Accountability:
An Emerging Research Area
Privacy as a right to restrictions on
personal information flow
Computational accountability mechanisms
for enforcement
http://www.andrew.cmu.edu/user/danupam/privacy.html
Today: Focus on Web Privacy
1. Bootstrapping Privacy Compliance in Big Data Systems
Methodology
Tool and application to Bing’s advertising system
Focus on current policies
2. Information Flow Experiments
Methodology
Tool and application to Google’s advertising system
Focus on principles that go beyond current policies
Bootstrapping Privacy Compliance in Big Data Systems
With S. Sen (CMU) and S. Guha, S. Rajamani, J. Tsai, J. M. Wing (MSR)
2014 IEEE Symposium on Security & Privacy (Best Student Paper Award)
Privacy Compliance for Bing
Setting:
Auditor has access to source code
The Privacy Compliance Challenge
Specification
Verification
Scale Compliance?
A Streamlined Audit Workflow
[Figure: audit workflow. A Legalease policy is encoded and refined; code analysis and developer annotations produce annotated code (Grok); the checker reports potential violations, which are resolved by fixing code or updating Grok.]
Workflow for privacy compliance
Legalease, usable yet formal policy specification language
Grok, bootstrapped data inventory for big data systems
Scalable implementation for Bing
Specification: Legalease
Usable: usable by lawyers and privacy champs.
Expressive: expressive enough for real-world policies.
Precise: precise semantics for local reasoning.
Legalease: Example Policy
We will not use full IP Address for Advertising. IP Address may be used for detecting abuse. In such cases, it will not be combined with account information.
DENY Datatype IPAddress
  UseForPurpose Advertising
  EXCEPT
    ALLOW Datatype IPAddress:Truncated
ALLOW UseForPurpose AbuseDetect
  EXCEPT
    DENY Datatype IPAddress, AccountInfo
Legalease: Policy Checking
Program
A Lattice of Policy Labels
[Figure: lattice of policy labels, with T at the top, IPAddress below it, and IPAddress:Truncated below IPAddress]
• If “IPAddress” use is allowed then so is everything below it
• If “IPAddress:Truncated” use is denied then so is everything above it
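The two lattice rules can be illustrated with a minimal Python sketch. It assumes a tree-shaped fragment of the label lattice in which "T" is read as the top element; the `PARENT` table and the placement of `AccountInfo` are hypothetical, chosen only to mirror the labels on the slides, and this is not the Legalease implementation.

```python
# Hypothetical fragment of the label lattice: each label maps to its parent.
# "T" is assumed to be the top element shown on the slide.
PARENT = {
    "IPAddress": "T",
    "IPAddress:Truncated": "IPAddress",
    "AccountInfo": "T",
}


def ancestors(label):
    """Return the label followed by everything above it, up to the top."""
    chain = [label]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def descendants(label):
    """Return the label followed by everything below it."""
    found = [label]
    for child, parent in PARENT.items():
        if parent == label:
            found.extend(descendants(child))
    return found


# Allowing "IPAddress" also allows everything below it.
allowed = set(descendants("IPAddress"))
# Denying "IPAddress:Truncated" also denies everything above it.
denied = set(ancestors("IPAddress:Truncated"))

print(sorted(allowed))
print(sorted(denied))
```

A real label lattice may have nodes with several parents; the parent table here is kept to a tree only for brevity.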
Designed for Precision
Designed for Expressivity (Bing, October 2013)
Designed for Expressivity (Google, October 2013)
Designed for Usability
Exceptions: how legal texts are structured; one-to-one correspondence
Local Reasoning: each exception refines its immediate parent; formally proven property
Independent of Code
H. DeYoung, D. Garg, L. Jia, D. Kaynar, and A. Datta,
“Experiences in the logical specification of the HIPAA and GLBA
privacy laws”
Legalease Usability
Survey taken by 12 policy authors within Microsoft
Encode Bing data usage policy after a brief tutorial
Time spent: 2.4 mins on the tutorial, 14.3 mins on encoding the policy
High overall correctness
Map-Reduce Programming Systems
Scope, Hive, Dremel
Data in the form of Tables
Code Transforms Columns to Columns
No Shared State
Limited Hidden Flows
[Figure: Process 1 with Dataset A, Dataset B, and Dataset C]
Verification
Nightly audit of all jobs executed.
Static source code analysis.
What data, stored where?
Who used.
Grok
[Figure: data flow graph of Processes 1 to 6 over Datasets A to J]
Grok
Purpose Labels
Annotate programs with purpose labels
[Figure: the same data flow graph, with processes labeled NewAcct, Login, Check Hijack, GeoIP, Check Fraud, and Reporting]
Initial Data Labels
Heuristics and Annotations
[Figure: the same data flow graph, with dataset columns labeled Name, Age, IPAddress, IDX, Country, Timestamp, and Hash; two columns are still unlabeled (??)]
Flow Labels
Source labels propagated via data flow graph
[Figure: the same data flow graph after propagation; the two previously unlabeled columns now carry the labels Profile and IDX]
D. E. Denning. “A lattice model of secure information flow”
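Propagating source labels through a data flow graph can be sketched as a simple worklist computation. This is a minimal illustration, not Grok's implementation: the function name `propagate_labels`, the edge list, and the initial labels are all hypothetical, and the graph below only borrows node names from the slides.

```python
from collections import deque


def propagate_labels(edges, source_labels):
    """Push labels from source nodes to every node reachable downstream.

    edges maps a node to the nodes it flows into; source_labels maps a
    node to its initial set of labels. Each node ends up with the union
    of the labels of everything that flows into it.
    """
    labels = {node: set(found) for node, found in source_labels.items()}
    queue = deque(source_labels)
    while queue:
        node = queue.popleft()
        for succ in edges.get(node, ()):
            current = labels.setdefault(succ, set())
            new = labels[node] - current
            if new:
                # Only revisit a successor when it gained a label,
                # so the loop terminates even on cyclic graphs.
                current |= new
                queue.append(succ)
    return labels


# Hypothetical edges and initial labels, for illustration only.
edges = {
    "Dataset A": ["Process 1"],
    "Dataset B": ["Process 1"],
    "Process 1": ["Dataset C"],
}
sources = {"Dataset A": {"IPAddress"}, "Dataset B": {"Name"}}
labels = propagate_labels(edges, sources)
print(sorted(labels["Dataset C"]))
```

Taking the union at each node is the conservative choice: a dataset is treated as containing a data type if any of its inputs might contain it.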
Nightly Compliance Process
Generate report
Static code analysis
Manual Audit
[Figure: data flow graph of Processes 1 to 6 over Datasets A to J]
Column names: FIMLastName, LiveId, Age, ss_user_ip, M_ANID, MCMUID, LocIds, csts, msMUID2, msnANID, UserAnidDB
Read Dataset D
Read Dataset G
Transform Data
Write Dataset H, I
Positive Patterns (40 Taxonomy values, 400 patterns)
Negative Patterns (2500 total entries)
Granular Overrides (116 total entries)
-- DENY DataType UniqueIdentifier WITH PII InStore BingStore
SELECT *
FROM (SELECT * FROM Report WHERE Taxonomy='ANID' AND Confidence>='High') AS ID
INNER JOIN (SELECT * FROM Report WHERE TaxonomyGroup='PII' AND Confidence>='High') AS P
  ON ID.VC = P.VC
files 25M+ schemas 2M+ privacy elements* 300K+ audit candidates 10K+ teams 8 audit items 1K+
Why Bootstrapping Grok Works
Pick the nodes which will label the most of the graph
~200 annotations label 60% of nodes
A small number of annotations is enough to get off the ground.
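One way to read "pick the nodes which will label the most of the graph" is as a greedy coverage problem. The slides do not say which selection algorithm is used, so the greedy strategy, the function names, and the toy graph below are all assumptions made for illustration.

```python
def reachable(edges, start):
    """Return start plus every node reachable downstream from it."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for succ in edges.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def pick_annotation_targets(edges, nodes, budget):
    """Greedily choose up to `budget` nodes to annotate.

    Each round picks the node whose annotation would label the largest
    number of still-unlabeled nodes via downstream propagation.
    """
    cover = {node: reachable(edges, node) for node in nodes}
    labeled, chosen = set(), []
    for _ in range(budget):
        best = max(nodes, key=lambda node: len(cover[node] - labeled))
        if not cover[best] - labeled:
            break  # nothing left to gain
        chosen.append(best)
        labeled |= cover[best]
    return chosen, labeled


# Hypothetical graph: A and B feed C, C feeds D and E, F feeds G.
edges = {"A": ["C"], "B": ["C"], "C": ["D", "E"], "F": ["G"]}
nodes = ["A", "B", "C", "D", "E", "F", "G"]
chosen, labeled = pick_annotation_targets(edges, nodes, budget=2)
print(chosen, len(labeled), "of", len(nodes), "nodes labeled")
```

In this toy graph two annotations label six of the seven nodes, which is the same effect the slide reports at scale: a few well-placed annotations cover most of the graph.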
Scale
77,000 jobs run each day by 7,000 entities
300 functional groups
1.1 million unique lines of code; 21% change daily on average
46 million table schemas
32 million files
Manual audit infeasible
Information flow analysis takes ~30 mins daily
Information Flow Experiments: Methodology
With Michael Carl Tschantz (CMU → UC Berkeley)
Amit Datta (CMU)
Jeannette M. Wing (CMU → Microsoft Research)
Personalized Web Advertising
[Figure: a user's browsing history and the ads shown to the user, with confounding inputs from other users, advertisers, and websites]
Probabilistic Interference
Experimental Design
[Figure: a scientist gives the experimental group a drug and the control group a placebo]
Information Flow Experiment (IFE)
[Figure: Group 1 visits substance abuse websites and receives rehab ads; Group 2 idles and receives generic ads]
IFE Methodology
[Figure: the experimenter applies the experimental and control treatments according to a random permutation, the units interact with the Internet, and the resulting measurements go through significance testing to yield a p-value]
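The random-permutation step of the methodology can be sketched as follows. This is a minimal illustration under assumed names (`assign_treatments`, the group keys, the seed parameter); it is not taken from the authors' tooling.

```python
import random


def assign_treatments(units, seed=None):
    """Randomly split units into experimental and control groups.

    A uniformly random permutation of the units is drawn; the first half
    receives the experimental treatment and the rest the control one.
    """
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"experimental": shuffled[:half], "control": shuffled[half:]}


# Ten hypothetical browser instances, numbered 1 to 10.
groups = assign_treatments(range(1, 11), seed=0)
print(groups)
```

Random assignment is what lets a later significance test attribute a difference between the groups to the treatment rather than to some pre-existing difference among the units.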
Information Flow Experiments as Science
Experimental Science    Information Flow
Natural process         System in question
Population of units     Subset of interactions
…                       …
Causation               Information flow
Theorem
Pearl’s Causation = Probabilistic Interference
Information Flow Experiments
on Personalized Ad Settings: A Tale of Opacity, Choice and Discrimination
With Amit Datta (CMU) and
Michael Carl Tschantz (UC Berkeley)
Google Ad Settings
Goals
Study transparency, choice, fairness
Methodology and tool (AdFisher)
Automation, statistical rigor, scalability, explanations
[Figure: Browsing Behavior, Internal State, Ad Settings, Ads Received]
Experiment 1: Opacity
Experimental group visits top 100 substance abuse sites
Control group idles
Then both groups visit Times of India and collect ads
Experiment 1: Significant Opacity
Substance abuse: significant effect on ads, no effect on ad settings
Disability: significant effect on ads, “unrelated” effect on ad settings
Treatment p-value
Substance abuse 0.0000053
Disability 0.0000053
Mental disorder 0.053
Infertility 0.11
Adult websites 0.42
Statistical
significance
Experiment 1: Opacity Explanation
Top ads for group visiting substance abuse webpages
The Watershed Rehab www.thewatershed.com/Help
Watershed Rehab www.thewatershed.com/Rehab
The Watershed Rehab Ads by Google
Veteran Home Loans www.vamortgagecenter.com
CAD Paper Rolls paper-roll.net/Cad-Paper
Top ads for control group
Alluria Alert www.bestbeautybrand.com
Best Dividend Stocks dividends.wyattresearch.com
10 Stocks to Hold Forever www.streetauthority.com
Delivery Drivers Wanted get.lyft.com/drive
VA Home Loans Start Here www.vamortgagecenter.com
Experiment 2: Choice
Experimental group visits top 100 dating sites; then removes dating interest from ad settings
Control group visits top 100 dating sites; then keeps dating interest
Then both groups visit Times of India and collect ads
Experiment 2: Choice Buttons have an Effect
Treatment p-value
Opting out 0.0000053
Dating 0.0000053
Weight loss 0.041
Statistical
significance
Experiment 2: Choice Explanation
Top ads for group keeping dating interest
Are You Single? www.zoosk.com/Dating
Top 5 Online Dating Sites www.consumer-rankings.com/Dating
Why can't I find a date? www.gk2gk.com
Latest Breaking News www.onlineinsider.com
Gorgeous Russian Ladies anastasiadate.com
Top ads for group removing dating interest
Car Loans w/ Bad Credit www.car.com/Bad-Credit-Car-Loan
Individual Health Plans www.individualhealthquotes.com
Crazy New Obama Tax www.endofamerica.com
Atrial Fibrillation Guide www.johnshopkinshealthalerts.com
Free $5 - $25 Gift Cards swagbucks.com
Experiment 3: Discrimination
Experimental group visits top 100 job sites with gender set to male in ad settings
Control group visits top 100 job sites with gender set to female in ad settings
Then both groups visit Times of India and collect ads
Experiment 3: Discrimination Explanation
Top ads for female group
Jobs (Hiring Now) www.jobsinyourarea.co
4Runner Parts Service www.westernpatoyotaservice.com
Criminal Justice Program www3.mc3.edu/Criminal+Justice
Goodwill - Hiring goodwill.careerboutique.com
UMUC Cyber Training www.umuc.edu/cybersecuritytraining
Top ads for male group
$200k+ Jobs - Execs Only careerchange.com
Find Next $200k+ Job careerchange.com
Become a Youth Counselor www.youthcounseling.degreeleap.com
CDL-A OTR Trucking Jobs www.tadrivers.com/OTRJobs
Free Resume Templates resume-templates.resume-now.com
Information Flow Experiments: More on methodology
Google Exhibits Complex Behavior
[Figure: ad id (0 to 45) plotted against reload number (0 to 200)]
Browser Instances are Not Independent
[Figure: bar chart; values shown: 17, 13, 13, 13, 12, 11, 10, 10, 8, 7]
Which Statistical Test to Use?
Our Idea:
Use a non-parametric test
Does not require model of Google
Specifically, a permutation test
Does not require independence among browser instances or the assumption that ads are independent and identically distributed
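Permutation Test over Keywords
Keyword counts per browser instance:
Instance   1   2   3   4   5   6   7   8   9   10
Count      0   5   6   30  30  0   19  22  31  2
Sorted by count: instances 1, 6, 10, 2, 3, 7, 8, 4, 5, 9 (counts 0, 0, 2, 5, 6, 19, 22, 30, 30, 31)
Observed split: instances 1, 6, 10, 2, 3 sum to 13 and instances 7, 8, 4, 5, 9 sum to 132; test statistic 132 - 13 = 119
Permuted split: instances 9, 6, 10, 2, 3 sum to 44 and instances 7, 8, 4, 5, 1 sum to 101; test statistic 101 - 44 = 57
[Figure: test statistics across permutations; values shown: -57, 119, 67, 7]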
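The worked example can be reproduced with a small exact permutation test. This is a simplified sketch, not the authors' tool: the test statistic (difference between the two groups' keyword-count sums) follows the numbers on the slides (13, 132, 119), while enumerating every possible split, the one-sided p-value, and the function names are assumptions made here.

```python
from itertools import combinations


def difference(counts, group1):
    """Sum of counts outside group1 minus the sum of counts inside it."""
    inside = sum(counts[unit] for unit in group1)
    return (sum(counts.values()) - inside) - inside


def permutation_test(counts, group1):
    """Exact one-sided permutation test on the difference of sums.

    Every way of choosing len(group1) units as the first group is
    treated as equally likely under the null hypothesis. The p-value is
    the fraction of splits whose statistic is at least the observed one.
    """
    observed = difference(counts, group1)
    splits = list(combinations(sorted(counts), len(group1)))
    extreme = sum(1 for split in splits if difference(counts, split) >= observed)
    return observed, extreme / len(splits)


# Keyword counts per browser instance, as read off the slides.
counts = {1: 0, 2: 5, 3: 6, 4: 30, 5: 30, 6: 0, 7: 19, 8: 22, 9: 31, 10: 2}
observed, p_value = permutation_test(counts, group1=(1, 6, 10, 2, 3))
print(observed, p_value)
```

With ten units there are only 252 possible splits, so full enumeration is cheap; for larger experiments a random sample of permutations would be the usual substitute. No model of the ad system and no independence assumption across instances is needed, because the only randomness relied on is the random assignment itself.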
Conclusion
A rigorous methodology for information flow experiments
1. Probabilistic interference = Pearl’s causation
2. Experimental design for causal determination
3. Significance testing with non-parametric statistics
An experimental study of Google Ads
1. AdFisher Tool
2. Findings of opacity, choice and discrimination
Prior Work on Behavioral Marketing
Authors            Test                     Limitation
Guha et al.        Cosine similarity        No statistical significance
Balebako et al.    Cosine similarity        No statistical significance
Wills and Tatar    Ad hoc examination       No statistical significance
Liu et al.         Process of elimination   No statistical significance
Barford et al.     χ2 test                  Assumes ads identically distributed
Lécuyer et al.     Parametric Model         Correlation, not causation; assumes ads are independent
Privacy as Restrictions on Personal Information Flow
[Figure: related work arranged along two axes. Information Flow axis: Direct, Interference, Probabilistic Interference. Restrictions axis: Temporal, Purpose & Role based. Entries: EPAL; XACML; *-access control; Purpose Planning; FOTLs [Formal Contextual Integrity, Reduce audit algorithm, Basin et al.]; Grok + Legalease; Jif, FlowCaml, … [Hayati & Abadi]; Information Flow Experiments; Differential Privacy. Application areas: Web Privacy, Healthcare Privacy.]
Summary
1. Information Flow Experiments
Methodology
Tool and application to Google’s advertising system with findings of opacity, choice and discrimination
2. Privacy Compliance in Big Data Systems
Methodology
Tool and application to Bing’s compliance workflow, privacy policies and advertising programs on production system