Problem Statement
Privacy: $\mathcal{A}$ is differentially private
Calibration: $\mathcal{P}$ can obtain or approximate $p(\theta, \sigma^2 \mid z)$
Utility: $p(\theta, \sigma^2 \mid z)$ is close to $p(\theta, \sigma^2 \mid X, y)$
Utility
‣ Maximum Mean Discrepancy (MMD) to the non-private posterior as the measure of utility
‣ Naïve method only matches the noise-aware methods in easier settings

Sufficient Statistics Update
‣ Sufficient statistics $s = \sum_{i=1}^{n} t(x_i, y_i)$ are a sum over individuals
‣ Approximate with CLT as $s \sim \mathcal{N}(n\mu_t, n\Sigma_t)$
‣ $\mu_t$ and $\Sigma_t$ derived in terms of the first four moments of $x$
Run time
‣ MCMC-Ind run time scales with population size
‣ Run time of the other methods stays constant
Differentially Private Bayesian Linear Regression
Garrett Bernstein¹, Daniel Sheldon¹,²
¹University of Massachusetts Amherst; ²Mount Holyoke College
Results
Background
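ϵ-Differential Privacy
‣ $\Pr[\mathcal{A}(X) \in O] \le \exp(\epsilon)\,\Pr[\mathcal{A}(X') \in O]$ for all neighboring data sets $X, X'$ and every set of outputs $O$
‣ Probability of any output is nearly unchanged by a single individual opting into the data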
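Bayesian Linear Regression
‣ Normal-inverse-gamma is the conjugate prior
‣ Leads to closed-form posterior $p(\theta, \sigma^2 \mid X, y)$
‣ Sufficient statistics: $s = \sum_{i=1}^{n} t(x_i, y_i) = [X^T X,\ X^T y,\ y^T y]$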
Laplace Mechanism
‣ Release noisy sufficient statistics: $z = s + \mathrm{Laplace}(\Delta/\epsilon)$
‣ $\Delta$ = sensitivity: impact on a function's output due to removal/addition of an individual's data
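
The release step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function name `release_noisy_stats` is hypothetical, the statistics are taken to be the unique entries of $[X, y]^T [X, y]$ (which holds $X^T X$, $X^T y$, and $y^T y$), and the sensitivity is passed in by the caller because bounding it requires an assumption about the data range.

```python
import numpy as np

def release_noisy_stats(X, y, epsilon, sensitivity, rng):
    """Laplace mechanism on the linear-regression sufficient statistics.

    `sensitivity` (Delta) is supplied by the caller; bounding it needs
    bounded data, which is an assumption of this sketch.
    """
    A = np.column_stack([X, y])            # A = [X, y]
    B = A.T @ A                            # holds X^T X, X^T y, y^T y
    iu = np.triu_indices(B.shape[0])       # unique entries of the symmetric matrix
    s = B[iu]                              # sufficient statistics as a vector
    z = s + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=s.shape)
    return s, z

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))              # toy covariates in [0, 1]
y = rng.uniform(size=50)                   # toy responses in [0, 1]
s, z = release_noisy_stats(X, y, epsilon=0.1, sensitivity=1.0, rng=rng)
```

Only `z` would be released; `s` is returned here purely so the two can be compared.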
Supported by National Science Foundation Grant Nos. 1522054 and 1617533
www.GarrettBernstein.com
https://github.com/gbernstein6/private_bayesian_regression
‣ Given individuals' sensitive covariate data $X$ and response data $y \sim \mathrm{Normal}(\theta^T X, \sigma^2)$, and prior $p(\theta, \sigma^2)$…
‣ …privately release $z = \mathcal{A}(X, y)$…
‣ …then calculate the posterior $p(\theta, \sigma^2 \mid z)$ via $\mathcal{P}$ with…
Private Bayesian Inference
Challenges + Solutions
‣ Only noisy sufficient statistics observed → Use Gibbs sampler
‣ Integration over all possible individuals is intractable → CLT approximation
‣ Not easy to sample from product of normal and Laplace → Scale mixture of normals
‣ Need to make assumptions about data → Only need first four moments
Sufficient Statistics-based Gibbs sampler
Want: $p(\theta, \sigma^2 \mid z)$
ϵ-Differential Privacy notation: $\mathcal{A}$ is a randomized algorithm; $X, X'$ are neighboring data sets
Private Model
Individual-based MCMC
‣ Set explicit data prior
Calibration
‣ Given true parameters $\theta, \sigma^2$, let $F_z$ be the CDF of the true posterior $p(\theta, \sigma^2 \mid z)$
‣ Using the fact that $F_z(\theta, \sigma^2)$ is uniform, test correctness of the approximate posterior by testing uniformity of its CDF, $\hat{F}_z(\theta, \sigma^2)$, via the Kolmogorov-Smirnov test
‣ Noise-aware methods are nearly as well-calibrated as the non-private method
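
A minimal sketch of the uniformity check, assuming the CDF values $\hat{F}_z(\theta)$ from repeated trials have already been collected into an array. The helper name `ks_uniform_statistic` and the two toy arrays are illustrative placeholders, not results from the poster.

```python
import numpy as np

def ks_uniform_statistic(u):
    """Kolmogorov-Smirnov distance between the empirical CDF of u and Uniform(0, 1)."""
    u = np.sort(np.asarray(u, dtype=float))
    n = u.size
    upper = np.arange(1, n + 1) / n - u    # ECDF just after each point minus uniform CDF
    lower = u - np.arange(0, n) / n        # uniform CDF minus ECDF just before each point
    return max(upper.max(), lower.max())

# Placeholder CDF values; real ones would come from repeated synthetic trials.
well_calibrated = (np.arange(4) + 0.5) / 4            # evenly spread over [0, 1]
over_confident = np.array([0.01, 0.02, 0.98, 0.99])   # mass piled near 0 and 1
d_good = ks_uniform_statistic(well_calibrated)
d_bad = ks_uniform_statistic(over_confident)
```

An over-confident posterior is too narrow, so the true parameter tends to land in its tails and the CDF values cluster near 0 and 1, which inflates the statistic.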
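Parameter Update
‣ Conjugate update uses the latent $s$ (step 5 of Algorithm 1)

Noise Update
‣ Represent Laplace noise model as scale mixture of normals for easier sampling with CLT normal
‣ $1/\omega_j^2 \sim \mathrm{InverseGaussian}\!\left(\dfrac{\epsilon}{\Delta_s\,|z_j - s_j|},\ \dfrac{\epsilon^2}{\Delta_s^2}\right)$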
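
The inverse-Gaussian draw for the noise variances can be sketched with NumPy, whose `wald` sampler is the inverse Gaussian parameterized by mean and shape. The function name `update_omega_sq` and the toy values of `z` and `s` are illustrative assumptions.

```python
import numpy as np

def update_omega_sq(z, s, epsilon, sensitivity, rng):
    """Draw omega_j^2 given z and s, using 1/omega_j^2 ~ InverseGaussian(mean, shape)."""
    mean = epsilon / (sensitivity * np.abs(z - s))   # eps / (Delta * |z_j - s_j|)
    shape = (epsilon / sensitivity) ** 2             # eps^2 / Delta^2
    inv_omega_sq = rng.wald(mean, shape)             # NumPy's wald is the inverse Gaussian
    return 1.0 / inv_omega_sq

rng = np.random.default_rng(1)
z = np.array([10.3, -2.0, 5.5])                      # toy noisy statistics
s = np.array([9.0, 1.0, 5.0])                        # toy latent statistics
omega_sq = update_omega_sq(z, s, epsilon=1.0, sensitivity=1.0, rng=rng)
```

The mean is undefined when a component of `z` equals the corresponding component of `s` exactly; a practical implementation would need to guard against that case.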
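Algorithm 1 Gibbs Sampler
1: Initialize $\theta, \sigma^2, \omega^2$
2: repeat
3:   Calculate $\mu_t$ and $\Sigma_t$ via Eqs. (2) and (3)
4:   $s \sim \mathrm{NormProduct}\big(n\mu_t,\ n\Sigma_t,\ z,\ \mathrm{diag}(\omega^2)\big)$
5:   $\theta, \sigma^2 \sim \mathrm{NIG}(\theta, \sigma^2; \mu_n, \Lambda_n, a_n, b_n)$ via Eqn. (1)
6:   $1/\omega_j^2 \sim \mathrm{InverseGaussian}\!\left(\dfrac{\epsilon}{\Delta_s\,|z_j - s_j|},\ \dfrac{\epsilon^2}{\Delta_s^2}\right)$ for all $j$

Subroutine NormProduct
1: input: $\mu_1, \Sigma_1, \mu_2, \Sigma_2$
2: $\Sigma_3 = \big(\Sigma_1^{-1} + \Sigma_2^{-1}\big)^{-1}$
3: $\mu_3 = \Sigma_3\big(\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2\big)$
4: return: $\mathcal{N}(\mu_3, \Sigma_3)$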
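
The NormProduct subroutine translates directly into NumPy. The sketch below uses explicit matrix inverses for readability (a solver-based version would be preferable numerically), and the wrapper `sample_s` is a hypothetical name for step 4 of the sampler.

```python
import numpy as np

def norm_product(mu1, Sigma1, mu2, Sigma2):
    """Mean and covariance of the normal proportional to N(mu1, Sigma1) * N(mu2, Sigma2)."""
    P1 = np.linalg.inv(Sigma1)               # precision of the first factor
    P2 = np.linalg.inv(Sigma2)               # precision of the second factor
    Sigma3 = np.linalg.inv(P1 + P2)          # precisions add
    mu3 = Sigma3 @ (P1 @ mu1 + P2 @ mu2)     # precision-weighted mean
    return mu3, Sigma3

def sample_s(mu_t, Sigma_t, n, z, omega_sq, rng):
    """Step 4: draw s from NormProduct(n mu_t, n Sigma_t, z, diag(omega^2))."""
    mu3, Sigma3 = norm_product(n * mu_t, n * Sigma_t, z, np.diag(omega_sq))
    return rng.multivariate_normal(mu3, Sigma3)

# Two unit-covariance factors centered at 0 and 2 combine to mean 1, covariance 0.5 * I.
mu3, Sigma3 = norm_product(np.zeros(2), np.eye(2), np.full(2, 2.0), np.eye(2))
```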
$\eta_{ijkl}$, and $\xi_{ij,kl}$. The current parameter values are available within the sampler, but the modeler must provide estimates for the moments of $x$, either using prior knowledge or by (privately) estimating the moments from the data. We discuss three specific possibilities in Section 3.4.4.
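$$
\begin{aligned}
\mathbb{E}[x_i y] &= \sum_{j} \theta_j \eta_{ij} \\
\mathbb{E}[y^2] &= \sigma^2 + \sum_{i,j} \theta_i \theta_j \eta_{ij} \\
\mathrm{Cov}(x_i x_j, x_k y) &= \sum_{l} \theta_l\, \xi_{ij,kl} \\
\mathrm{Cov}(x_i x_j, y^2) &= \sum_{k,l} \theta_k \theta_l\, \xi_{ij,kl} \\
\mathrm{Cov}(x_i y, x_j y) &= \sigma^2 \eta_{ij} + \sum_{k,l} \theta_k \theta_l\, \xi_{ik,jl} \\
\mathrm{Cov}(x_i y, y^2) &= \sum_{j,k,l} \theta_j \theta_k \theta_l\, \xi_{ij,kl} + 2\sigma^2 \sum_{j} \theta_j \eta_{ij} \\
\mathrm{Var}(y^2) &= 2\sigma^4 + \sum_{i,j,k,l} \theta_i \theta_j \theta_k \theta_l\, \xi_{ij,kl} + 4\sigma^2 \sum_{i,j} \theta_i \theta_j \eta_{ij}
\end{aligned}
$$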
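
These moment expressions map naturally onto `numpy.einsum`. The sketch below assumes $\eta_{ij} = \mathbb{E}[x_i x_j]$ and $\xi_{ij,kl} = \mathrm{Cov}(x_i x_j, x_k x_l)$, since the definitions are cut off in this excerpt, and stores them as arrays `eta2[i, j]` and `xi[i, j, k, l]`. The function name and dictionary keys are illustrative. In the covariance of $x_i y$ with $x_j y$, index $i$ pairs with $k$ and $j$ with $l$.

```python
import numpy as np

def regression_moments(theta, sigma_sq, eta2, xi):
    """Moments of the statistics involving y, from data moments eta2 and xi."""
    quad = theta @ eta2 @ theta                      # sum_ij theta_i theta_j eta_ij
    m = {}
    m["E[x y]"] = eta2 @ theta                       # sum_j theta_j eta_ij
    m["E[y^2]"] = sigma_sq + quad
    m["Cov(xx, xy)"] = np.einsum("l,ijkl->ijk", theta, xi)
    m["Cov(xx, y^2)"] = np.einsum("k,l,ijkl->ij", theta, theta, xi)
    m["Cov(xy, xy)"] = sigma_sq * eta2 + np.einsum("k,l,ikjl->ij", theta, theta, xi)
    m["Cov(xy, y^2)"] = (np.einsum("j,k,l,ijkl->i", theta, theta, theta, xi)
                         + 2 * sigma_sq * (eta2 @ theta))
    m["Var(y^2)"] = (2 * sigma_sq**2
                     + np.einsum("i,j,k,l,ijkl->", theta, theta, theta, theta, xi)
                     + 4 * sigma_sq * quad)
    return m

# Degenerate sanity check: x is the constant vector c, so xi = 0 and y = theta^T c + noise.
c = np.array([1.0, 2.0])
theta = np.array([0.5, -1.0])
moments = regression_moments(theta, 0.3, np.outer(c, c), np.zeros((2, 2, 2, 2)))
```

In the constant-$x$ case $y$ is normal with mean $\theta^T c$ and variance $\sigma^2$, so the outputs can be checked by hand; it does not exercise the fourth-moment terms.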
Once again, more modeling assumptions are needed than in the non-private case, where it is possible to condition on $x$. Gibbs-SS requires milder assumptions (second and fourth moments), however, than MCMC-Ind (a full prior distribution).
3.4.2 Variable augmentation for $p(z \mid s)$

The above approximation for the distribution over sufficient statistics means the full conditional distribution involves the product of a normal and a Laplace distribution,

$$p(s \mid \theta, z) \propto \mathcal{N}(s; n\mu_t, n\Sigma_t) \cdot \mathrm{Lap}(z; s, \Delta_s/\epsilon).$$

It is unclear how to sample from this distribution directly. A similar situation arises in the Bayesian Lasso, where it is solved by variable augmentation [25]. Bernstein and Sheldon [5] adapted the variable augmentation scheme to private inference in exponential family models. We take the same approach here, and represent a Laplace random variable as a scale mixture of normals. Specifically, $l \sim \mathrm{Lap}(u, b)$ is identically distributed to $l \sim \mathcal{N}(u, \omega^2)$ where the variance $\omega^2 \sim \mathrm{Exp}\big(1/(2b^2)\big)$ is drawn from the exponential distribution (with density $1/(2b^2) \exp\big(-\omega^2/(2b^2)\big)$). We augment separately for each component of the vector $z$ so that $z \sim \mathcal{N}\big(s, \mathrm{diag}(\omega^2)\big)$, where $\omega_j^2 \sim \mathrm{Exp}\big(\epsilon^2/(2\Delta_s^2)\big)$. The augmented full conditional $p(s \mid \theta, z, \omega)$ is a product of two multivariate normal distributions, which is itself a multivariate normal distribution.
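
The scale-mixture identity is easy to check by simulation. The sketch below compares direct Laplace draws with draws from the normal-exponential mixture; the values of `u`, `b`, and the sample size are arbitrary choices for illustration. NumPy parameterizes the exponential by its mean, so a rate of $1/(2b^2)$ corresponds to `scale = 2 * b**2`.

```python
import numpy as np

rng = np.random.default_rng(0)
u, b, n_draws = 0.0, 1.5, 200_000

# Direct Laplace draws with location u and scale b.
direct = rng.laplace(loc=u, scale=b, size=n_draws)

# Scale mixture: omega^2 ~ Exp(rate = 1 / (2 b^2)), then l ~ N(u, omega^2).
omega_sq = rng.exponential(scale=2 * b**2, size=n_draws)
mixture = rng.normal(loc=u, scale=np.sqrt(omega_sq))

# Both samples should have variance near 2 b^2 and mean absolute deviation near b.
var_direct, var_mixture = direct.var(), mixture.var()
mad_direct, mad_mixture = np.abs(direct - u).mean(), np.abs(mixture - u).mean()
```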
3.4.3 The Gibbs sampler

$$
\begin{aligned}
\theta, \sigma^2 &\sim \mathrm{NIG}(\theta, \sigma^2; \mu_0, \Lambda_0, a_0, b_0) \\
s &\sim \mathcal{N}(n\mu_t, n\Sigma_t) \\
\omega_j^2 &\sim \mathrm{Exp}\!\left(\frac{\epsilon^2}{2\Delta_s^2}\right) \text{ for all } j \\
z &\sim \mathcal{N}\big(s, \mathrm{diag}(\omega^2)\big)
\end{aligned}
$$

The full generative process is shown above, and the corresponding Gibbs sampler is shown in Algorithm 1. The update for $\omega^2$ follows Park and Casella [25]; the inverse Gaussian density is $\mathrm{InverseGaussian}(w; m, v) = \sqrt{v/(2\pi w^3)}\, \exp\big(-v(w - m)^2/(2m^2 w)\big)$. Note that the resulting $s$ drawn from $p(s \mid \mu_t, \Sigma_t, \omega^2)$ may require projection onto the space of valid sufficient statistics. This can be done by observing that if $A = [X, y]$ then the sufficient statistics are contained in the positive-semidefinite (PSD) matrix $B = A^T A$. For a randomly drawn $s$, we project if necessary so the corresponding $B$ matrix is PSD.
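
The text does not say which projection is used. One common choice, assumed here purely for illustration, is to symmetrize $B$ and clip its negative eigenvalues to zero, which gives the nearest PSD matrix in Frobenius norm. Mapping the vector $s$ to and from the matrix $B$ is omitted.

```python
import numpy as np

def project_to_psd(B):
    """Nearest PSD matrix in Frobenius norm: symmetrize, then clip negative eigenvalues."""
    B_sym = (B + B.T) / 2
    eigvals, eigvecs = np.linalg.eigh(B_sym)
    # Scale each eigenvector column by its clipped eigenvalue and recompose.
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T

B = np.array([[1.0, 2.0],
              [2.0, 1.0]])          # eigenvalues 3 and -1, so not PSD
B_psd = project_to_psd(B)
```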
[Figure 3 plot residue: KS statistic vs. $n$ for $\theta_0$, $\theta_{\text{bias}}$, and $\sigma^2$.]

Figure 3: Synthetic data results: (a) calibration vs. $n$ for $\epsilon = 0.1$; (b) calibration vs. $\epsilon$ for $n = 10$; (c) QQ plot for $n = 10$ and $\epsilon = 0.1$; (d) 95% credible interval coverage; (e) MMD of methods to non-private posterior; (f) method runtimes for $\epsilon = 0.1$.
…to approximations in the calculation of multivariate normal distribution fourth moments from a data prior. Utility results are shown in Figure 3e; the noise-aware methods provide at least as good utility as Naive. Run time results are shown in Figure 3f; MCMC-Ind scales with increasing population size while the Gibbs-SS methods, Naive, and Non-Private remain constant. Accordingly, we do not include results for MCMC-Ind for $n = 1000$ as its run time is prohibitive in those settings.
4.3 Predictive posteriors on real data

Figure 4: Coverage for predictive posterior 50% and 90% credible intervals.

We evaluate the predictive posteriors of the methods on a real-world data set measuring the effect of drinking rate on cirrhosis rate.⁴ We scale both covariate and response data to $[0, 1]$. There are 46 total points, which we randomly split into 36 training examples and 10 test points for each trial. After preliminary exploration to gain domain knowledge, we set a reasonable model prior of $\theta, \sigma^2 \sim \mathrm{NIG}\big([1, 0], \mathrm{diag}([.25, .25]), 20, .5\big)$. We draw samples $\theta^{(k)}, \sigma^2_{(k)}$ from the posterior given training data, and then form the posterior predictive distribution for each test point $y_i$ from these samples. Figure 4 shows coverage of 50% and 90% credible intervals on 1000 test points collected over 100 random train-test splits. Non-Private achieves nearly correct coverage, with the discrepancy due to the fact that the data is not actually drawn from the prior. Gibbs-SS-Noisy achieves nearly the coverage of Non-Private, while Naive is drastically worse in this regime. We note that this experiment emphasizes the advantage of Gibbs-SS-Noisy not needing an explicitly defined data prior, as it only requires the same parameter prior that is needed in non-private analysis.
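
The coverage computation described above can be sketched as follows. The function names are hypothetical, and the use of central (equal-tailed) intervals is an assumption, since the text does not state how the credible intervals are formed.

```python
import numpy as np

def predictive_samples(x_test, theta_samples, sigma_sq_samples, rng):
    """One predictive draw per posterior sample: y ~ N(theta^T x, sigma^2)."""
    mean = theta_samples @ x_test.T                  # (num_samples, num_test_points)
    noise = rng.normal(size=mean.shape) * np.sqrt(sigma_sq_samples)[:, None]
    return mean + noise

def interval_coverage(y_test, pred_samples, level):
    """Fraction of test points inside the central `level` credible interval.

    pred_samples has shape (num_samples, num_test_points).
    """
    tail = (1.0 - level) / 2.0
    lo = np.quantile(pred_samples, tail, axis=0)
    hi = np.quantile(pred_samples, 1.0 - tail, axis=0)
    return np.mean((y_test >= lo) & (y_test <= hi))

# Toy check with hand-made "samples" 0..100 for each of four test points:
# the 50% interval is [25, 75], which contains 50 and 60 but not 10 or 90.
toy = np.tile(np.arange(101.0)[:, None], (1, 4))
cov50 = interval_coverage(np.array([50.0, 10.0, 60.0, 90.0]), toy, level=0.5)
```

A well-calibrated method should give coverage close to the nominal level when this is averaged over many train-test splits.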
Empirical CDFs
‣ Empirical CDF plots show noise-aware methods are well-calibrated, but the naïve method is over-confident
[Plot residue: empirical CDF panels of true CDF value vs. rank index for $\theta_0$, $\theta_{\text{bias}}$, and $\sigma^2$; a further panel is labeled "coverage".]
Private Model (diagram):
parameters: $\theta, \sigma^2 \sim p(\theta, \sigma^2)$
covariate data: $x_{1:n} \overset{\text{iid}}{\sim}\ ???$ (need to make assumptions!)
response data: $y_{1:n} \sim \mathcal{N}(\theta^T x_i, \sigma^2)$
sufficient statistics: $s = \sum_{i=1}^{n} t(x_i, y_i)$
noisy sufficient statistics: $z \sim s + \mathrm{Laplace}(\Delta/\epsilon)$
‣ Implement in PyMC3
‣ Instantiate individuals

First four moments of $x$
‣ Gibbs-SS-Noisy: Release moments from sample data
‣ Gibbs-SS-Prior: Sample population via data prior, then calculate moments
‣ Gibbs-SS-Update: Hierarchical normal prior with updated parameters
$\theta, \sigma^2 \sim p(\theta, \sigma^2)$
$s \sim \mathcal{N}(n\mu_t, n\Sigma_t)$ (CLT approximation)
$\omega^2 \sim \mathrm{Exp}\big(\epsilon^2/(2\Delta^2)\big)$, $z \sim \mathcal{N}(s, \omega^2)$ (scale mixture of normals)