Differentially Private Bayesian Linear Regression Garrett ...

Problem Statement

‣ Privacy: 𝒜 is differentially private
‣ Calibration: 𝒫 can obtain or approximate 𝑝(𝜽, 𝜎² | 𝒛)
‣ Utility: 𝑝(𝜽, 𝜎² | 𝒛) is close to 𝑝(𝜽, 𝜎² | 𝑋, 𝒚)

Utility
‣ Maximum Mean Discrepancy (MMD) to the non-private posterior as the measure of utility
‣ Naïve method only matches the noise-aware methods at easier settings

Sufficient Statistics Update
‣ Sufficient statistics 𝒔 = ∑_{i=1}^{n} 𝑡(𝑥_i) are a sum over individuals
‣ Approximate with CLT as 𝒔 ∼ 𝒩(𝑛𝝁_t, 𝑛Σ_t)
‣ 𝝁_t and Σ_t derived in terms of the first four moments of 𝒙

Run time
‣ MCMC-Ind scales with population size
‣ Other methods stay constant

Differentially Private Bayesian Linear Regression
Garrett Bernstein¹, Daniel Sheldon¹,²
¹University of Massachusetts Amherst; ²Mount Holyoke College

Results

Background

ϵ-Differential Privacy
‣ Pr[𝒜(𝑿) ∈ O] ≤ exp(𝜖) · Pr[𝒜(𝑿′) ∈ O] for any output set O
‣ Probability of any output is nearly unchanged by a single individual opting into the data

Bayesian Linear Regression
‣ Normal-inverse-gamma is the conjugate prior
‣ Leads to closed-form posterior 𝑝(𝜽, 𝜎² | 𝑋, 𝒚)
‣ Sufficient statistics: 𝒔 = ∑_{i=1}^{n} 𝑡(𝒙_i, 𝑦_i) = [𝑋ᵀ𝑋, 𝑋ᵀ𝒚, 𝒚ᵀ𝒚]
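The closed-form update can be sketched directly from the sufficient statistics. The snippet below is a minimal illustration, not the authors' implementation: it assumes the common parameterization 𝜽 | 𝜎² ∼ 𝒩(𝝁₀, 𝜎²Λ₀⁻¹), 𝜎² ∼ InverseGamma(a₀, b₀), with Λ₀ a precision matrix, and the function name `nig_conjugate_update` is hypothetical.

```python
import numpy as np

def nig_conjugate_update(mu0, Lambda0, a0, b0, XtX, Xty, yty, n):
    """Normal-inverse-gamma posterior computed from sufficient statistics.

    Assumes theta | sigma^2 ~ N(mu0, sigma^2 * inv(Lambda0)) and
    sigma^2 ~ InverseGamma(a0, b0); Lambda0 is treated as a precision matrix.
    """
    Lambda_n = XtX + Lambda0
    mu_n = np.linalg.solve(Lambda_n, Lambda0 @ mu0 + Xty)
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * (yty + mu0 @ Lambda0 @ mu0 - mu_n @ Lambda_n @ mu_n)
    return mu_n, Lambda_n, a_n, b_n

# Toy noise-free data with a very weak prior, so mu_n should recover [2, -1].
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, -1.0])
mu_n, Lambda_n, a_n, b_n = nig_conjugate_update(
    mu0=np.zeros(2), Lambda0=1e-8 * np.eye(2), a0=2.0, b0=1.0,
    XtX=X.T @ X, Xty=X.T @ y, yty=y @ y, n=len(y))
```

Only 𝑋ᵀ𝑋, 𝑋ᵀ𝒚, 𝒚ᵀ𝒚, and 𝑛 enter the update, which is why releasing noisy versions of these quantities is enough for downstream inference.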

Laplace Mechanism
‣ Release noisy sufficient statistics: 𝒛 = 𝒔 + Laplace(Δ/𝜖)
‣ Δ = sensitivity: impact on a function’s output due to removal/addition of an individual’s data
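As a rough illustration of the release step, the sketch below adds componentwise Laplace noise to the stacked sufficient statistics. The helper names are hypothetical, and the sensitivity value is an assumption: with two covariates and all values scaled to [0, 1], each individual contributes six unique entries of size at most 1, so 6 is one valid L1 bound. The authors' sensitivity calculation may differ.

```python
import numpy as np

def sufficient_statistics(X, y):
    """Stack the unique entries of X^T X, X^T y, and y^T y into one vector."""
    XtX = X.T @ X
    upper = XtX[np.triu_indices(X.shape[1])]  # X^T X is symmetric
    return np.concatenate([upper, X.T @ y, [y @ y]])

def laplace_mechanism(s, sensitivity, epsilon, rng):
    """Release z = s + Laplace(sensitivity / epsilon) noise, componentwise."""
    scale = sensitivity / epsilon
    return s + rng.laplace(loc=0.0, scale=scale, size=s.shape)

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))   # covariates assumed to lie in [0, 1]
y = rng.uniform(size=50)        # responses assumed to lie in [0, 1]
s = sufficient_statistics(X, y)
z = laplace_mechanism(s, sensitivity=6.0, epsilon=0.1, rng=rng)
```

Smaller 𝜖 means a larger noise scale Δ/𝜖, so 𝒛 drifts further from 𝒔 as the privacy guarantee tightens.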

Supported by National Science Foundation Grant Nos. 1522054 and 1617533
www.GarrettBernstein.com
https://github.com/gbernstein6/private_bayesian_regression

‣ Given individuals’ sensitive covariate data 𝑋 and response data 𝒚 ∼ Normal(𝜽ᵀ𝑋, 𝜎²), and prior 𝑝(𝜽, 𝜎²)…
‣ …privately release 𝒛 = 𝒜(𝑋, 𝒚)…
‣ …then calculate the posterior 𝑝(𝜽, 𝜎² | 𝒛) via 𝒫 with…

Private Bayesian Inference

Challenges → Solutions
‣ Only noisy sufficient statistics observed → Use Gibbs sampler
‣ Integration over all possible individuals is intractable → CLT approximation
‣ Not easy to sample from product of normal and Laplace → Scale mixture of normals
‣ Need to make assumptions about data → Only need first four moments

Sufficient Statistics-based Gibbs sampler

Want: 𝑝(𝜽, 𝜎² | 𝒛)

Private Model

Individual-based MCMC

𝒜: randomized algorithm
𝑿, 𝑿′: neighboring data sets

‣ Set explicit data prior

Calibration
‣ Given true parameters 𝜽, 𝜎², let 𝐹_𝒛 be the CDF of the true posterior 𝑝(𝜽, 𝜎² | 𝒛)
‣ Using the fact that 𝐹_𝒛(𝜽, 𝜎²) is uniform, test the correctness of an approximate posterior by testing the uniformity of its CDF at (𝜽, 𝜎²) via a Kolmogorov-Smirnov test
‣ Noise-aware methods are nearly as well-calibrated as the non-private method
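The calibration check can be illustrated on a much simpler stand-in model than the regression setting of the poster. The sketch below assumes a toy conjugate model (θ ∼ 𝒩(0, 1), yᵢ ∼ 𝒩(θ, 1)); the function names and the `sd_factor` knob, which shrinks the posterior standard deviation to mimic an over-confident method, are hypothetical.

```python
import random
from statistics import NormalDist

def ks_statistic_uniform(u):
    """Kolmogorov-Smirnov distance between the sample u and Uniform(0, 1)."""
    u = sorted(u)
    n = len(u)
    return max(max((i + 1) / n - v, v - i / n) for i, v in enumerate(u))

def posterior_cdf_at_truth(num_trials, num_obs, sd_factor, rng):
    """Toy model: theta ~ N(0, 1), y_i ~ N(theta, 1).

    Returns the posterior CDF evaluated at the true theta for each trial.
    sd_factor < 1 shrinks the posterior standard deviation to mimic an
    over-confident approximate posterior.
    """
    values = []
    for _ in range(num_trials):
        theta = rng.gauss(0.0, 1.0)
        total = sum(rng.gauss(theta, 1.0) for _ in range(num_obs))
        post_mean = total / (num_obs + 1)
        post_sd = (1.0 / (num_obs + 1)) ** 0.5
        values.append(NormalDist(post_mean, sd_factor * post_sd).cdf(theta))
    return values

rng = random.Random(0)
ks_exact = ks_statistic_uniform(posterior_cdf_at_truth(2000, 5, 1.0, rng))
ks_overconfident = ks_statistic_uniform(posterior_cdf_at_truth(2000, 5, 0.5, rng))
```

With the exact posterior the CDF values are uniform and the KS statistic stays small. Halving the posterior standard deviation pushes the values toward 0 and 1, and the statistic grows.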

Parameter Update
‣ Conjugate update uses latent 𝒔
  𝜽, 𝜎² ∼ 𝑝(𝜽, 𝜎²; 𝜆′); 𝜆′ = Conjugate-Update(𝜆, 𝒔, 𝑛)

Noise Update
‣ Represent Laplace noise model as scale mixture of normals for easier sampling with CLT normal

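Algorithm 1 Gibbs Sampler
1: Initialize $\theta, \sigma^2, \omega^2$
2: repeat
3:   Calculate $\mu_t$ and $\Sigma_t$ via Eqs. (2) and (3)
4:   $s \sim \mathrm{NormProduct}\left(n\mu_t, n\Sigma_t, z, \mathrm{diag}(\omega^2)\right)$
5:   $\theta, \sigma^2 \sim \mathrm{NIG}(\theta, \sigma^2; \mu_n, \Lambda_n, a_n, b_n)$ via Eqn. (1)
6:   $1/\omega_j^2 \sim \mathrm{InverseGaussian}\left(\frac{\epsilon}{\Delta_s |z - s|}, \frac{\epsilon^2}{\Delta_s^2}\right)$ for all $j$

Subroutine NormProduct
1: input: $\mu_1, \Sigma_1, \mu_2, \Sigma_2$
2: $\Sigma_3 = \left(\Sigma_1^{-1} + \Sigma_2^{-1}\right)^{-1}$
3: $\mu_3 = \Sigma_3 \left(\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2\right)$
4: return: $\mathcal{N}(\mu_3, \Sigma_3)$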
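Two of the steps above are easy to sketch with NumPy: the NormProduct subroutine and the inverse-Gaussian draw in step 6. This is an illustrative sketch under assumptions, not the authors' code. The function names are hypothetical, the difference |z − s| is taken componentwise, and NumPy's `Generator.wald(mean, scale)` is used as the inverse Gaussian with mean `m` and shape `v`.

```python
import numpy as np

def norm_product(mu1, Sigma1, mu2, Sigma2):
    """Mean and covariance of the normal proportional to N(mu1, Sigma1) * N(mu2, Sigma2)."""
    P1 = np.linalg.inv(Sigma1)
    P2 = np.linalg.inv(Sigma2)
    Sigma3 = np.linalg.inv(P1 + P2)
    mu3 = Sigma3 @ (P1 @ mu1 + P2 @ mu2)
    return mu3, Sigma3

def sample_omega2(z, s, sensitivity, epsilon, rng):
    """Step 6: draw 1/omega_j^2 from an inverse Gaussian, then invert.

    Requires z != s in every component, since the mean divides by |z - s|.
    """
    mean = epsilon / (sensitivity * np.abs(z - s))
    shape = (epsilon / sensitivity) ** 2
    inv_omega2 = rng.wald(mean, shape)
    return 1.0 / inv_omega2

rng = np.random.default_rng(0)
# Product of N(0, 1) and N(2, 1): the mean is pulled to the midpoint.
mu3, Sigma3 = norm_product(np.array([0.0]), np.array([[1.0]]),
                           np.array([2.0]), np.array([[1.0]]))
omega2 = sample_omega2(z=np.array([1.0, 4.0]), s=np.array([0.5, 3.0]),
                       sensitivity=1.0, epsilon=0.5, rng=rng)
```

In a full sampler, `norm_product` would be called with (nμ_t, nΣ_t) and (z, diag(ω²)), and a draw from the resulting normal would give the latent s.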

… $\eta_{ijkl}$, and $\xi_{ij,kl}$. The current parameter values are available within the sampler, but the modeler must provide estimates for the moments of $x$, either using prior knowledge or by (privately) estimating the moments from the data. We discuss three specific possibilities in Section 3.4.4.

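$$
\begin{aligned}
\mathbb{E}[x_i y] &= \sum_j \theta_j \eta_{ij} \\
\mathbb{E}[y^2] &= \sigma^2 + \sum_{i,j} \theta_i \theta_j \eta_{ij} \\
\mathrm{Cov}(x_i x_j, x_k y) &= \sum_l \theta_l \xi_{ij,kl} \\
\mathrm{Cov}(x_i x_j, y^2) &= \sum_{k,l} \theta_k \theta_l \xi_{ij,kl} \\
\mathrm{Cov}(x_i y, x_j y) &= \sigma^2 \eta_{ij} + \sum_{k,l} \theta_k \theta_l \xi_{ij,kl} \\
\mathrm{Cov}(x_i y, y^2) &= \sum_{j,k,l} \theta_j \theta_k \theta_l \xi_{ij,kl} + 2\sigma^2 \sum_j \theta_j \eta_{ij} \\
\mathrm{Var}(y^2) &= 2\sigma^4 + \sum_{i,j,k,l} \theta_i \theta_j \theta_k \theta_l \xi_{ij,kl} + 4\sigma^2 \sum_{i,j} \theta_i \theta_j \eta_{ij}
\end{aligned}
$$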

Once again, more modeling assumptions are needed than in the non-private case, where it is possible to condition on $x$. Gibbs-SS requires milder assumptions (second and fourth moments), however, than MCMC-Ind (a full prior distribution).
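The first two identities above, for $\mathbb{E}[x_i y]$ and $\mathbb{E}[y^2]$, can be checked numerically. The sketch below assumes $\eta_{ij} = \mathbb{E}[x_i x_j]$ (the definition is not shown in this excerpt) and $y = \theta^T x + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ independent of $x$. The small finite "population" of covariates and the function name are made up for illustration.

```python
import numpy as np

def mean_sufficient_stats(theta, sigma2, eta):
    """Per-individual means of x_i * y and y^2, given eta_ij = E[x_i x_j]."""
    e_xy = eta @ theta                    # E[x_i y] = sum_j theta_j eta_ij
    e_yy = sigma2 + theta @ eta @ theta   # E[y^2] = sigma^2 + sum_ij theta_i theta_j eta_ij
    return e_xy, e_yy

# Hypothetical finite population of covariates (first column is a bias term).
x_pop = np.array([[1.0, 0.2], [1.0, 0.5], [1.0, 0.9], [1.0, 0.4]])
eta = x_pop.T @ x_pop / len(x_pop)        # exact second moments of this population
theta = np.array([0.3, 1.5])
sigma2 = 0.25
e_xy, e_yy = mean_sufficient_stats(theta, sigma2, eta)

# Direct computation: the zero-mean noise drops out of E[x_i y] and adds sigma^2 to E[y^2].
noise_free = x_pop @ theta
direct_xy = (x_pop * noise_free[:, None]).mean(axis=0)
direct_yy = sigma2 + (noise_free ** 2).mean()
```

The covariance identities follow the same pattern with the fourth-moment terms, which is why the modeler needs estimates of the second and fourth moments of $x$ but not a full data distribution.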

3.4.2 Variable augmentation for p(z | s)

The above approximation for the distribution over sufficient statistics means the full conditional distribution involves the product of a normal and a Laplace distribution,

$$p(s \mid \theta, z) \propto \mathcal{N}(s; n\mu_t, n\Sigma_t) \cdot \mathrm{Lap}(z; s, \Delta_s/\epsilon).$$

It is unclear how to sample from this distribution directly. A similar situation arises in the Bayesian Lasso, where it is solved by variable augmentation [25]. Bernstein and Sheldon [5] adapted the variable augmentation scheme to private inference in exponential family models. We take the same approach here, and represent a Laplace random variable as a scale mixture of normals. Specifically, $l \sim \mathrm{Lap}(u, b)$ is identically distributed to $l \sim \mathcal{N}(u, \omega^2)$ where the variance $\omega^2 \sim \mathrm{Exp}\left(1/(2b^2)\right)$ is drawn from the exponential distribution (with density $1/(2b^2) \exp\left(-\omega^2/(2b^2)\right)$). We augment separately for each component of the vector $z$ so that $z \sim \mathcal{N}\left(s, \mathrm{diag}(\omega^2)\right)$, where $\omega_j^2 \sim \mathrm{Exp}\left(\epsilon^2/(2\Delta_s^2)\right)$. The augmented full conditional $p(s \mid \theta, z, \omega)$ is a product of two multivariate normal distributions, which is itself a multivariate normal distribution.
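The scale-mixture representation can be sanity-checked by simulation. The following is a small sketch using only the standard library: it draws $\omega^2$ from an exponential with rate $1/(2b^2)$, then a normal with that variance, and compares two summary statistics against the known Laplace values. The function name and the particular $u$, $b$, and sample size are arbitrary choices for illustration.

```python
import math
import random

def sample_laplace_via_mixture(u, b, rng):
    """Draw l ~ Lap(u, b) as N(u, omega^2) with omega^2 ~ Exp(rate = 1 / (2 b^2))."""
    omega2 = rng.expovariate(1.0 / (2.0 * b * b))
    return rng.gauss(u, math.sqrt(omega2))

rng = random.Random(0)
u, b, n = 1.0, 2.0, 200_000
draws = [sample_laplace_via_mixture(u, b, rng) for _ in range(n)]

# For a Laplace(u, b) variable, E|l - u| = b and Var(l) = 2 b^2.
mean_abs_dev = sum(abs(d - u) for d in draws) / n
variance = sum((d - u) ** 2 for d in draws) / n
```

Both estimates should land close to their Laplace targets, which is what lets the sampler replace the awkward normal-times-Laplace conditional with a normal-times-normal one.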

3.4.3 The Gibbs sampler

$$
\begin{aligned}
\theta, \sigma^2 &\sim \mathrm{NIG}(\theta, \sigma^2; \mu_0, \Lambda_0, a_0, b_0) \\
s &\sim \mathcal{N}(n\mu_t, n\Sigma_t) \\
\omega_j^2 &\sim \mathrm{Exp}\left(\frac{\epsilon^2}{2\Delta_s^2}\right) \text{ for all } j \\
z &\sim \mathcal{N}\left(s, \mathrm{diag}(\omega^2)\right)
\end{aligned}
$$

The full generative process is shown above, and the corresponding Gibbs sampler is shown in Algorithm 1. The update for $\omega^2$ follows Park and Casella [25]; the inverse Gaussian density is $\mathrm{InverseGaussian}(w; m, v) = \sqrt{v/(2\pi w^3)} \exp\left(-v(w - m)^2/(2m^2 w)\right)$. Note that the resulting $s$ drawn from $p(s \mid \mu_t, \Sigma_t, \omega^2)$ may require projection onto the space of valid sufficient statistics. This can be done by observing that if $A = [X, y]$ then the sufficient statistics are contained in the positive-semidefinite (PSD) matrix $B = A^T A$. For a randomly drawn $s$, we project if necessary so the corresponding $B$ matrix is PSD.
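The excerpt does not say which projection is used. One common choice, shown here purely as an assumption, is to symmetrize $B$ and clip its negative eigenvalues to zero, which gives the nearest PSD matrix in Frobenius norm. The function name and the example matrix are hypothetical.

```python
import numpy as np

def project_to_psd(B):
    """Nearest PSD matrix in Frobenius norm: symmetrize, then clip negative eigenvalues."""
    B_sym = (B + B.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(B_sym)
    clipped = np.clip(eigvals, 0.0, None)
    return (eigvecs * clipped) @ eigvecs.T

# B is laid out as [[X^T X, X^T y], [y^T X, y^T y]] for two covariates.
# This example is deliberately not PSD (its lower-right 2x2 block is indefinite).
B = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 3.0, 1.0]])
B_psd = project_to_psd(B)
```

A matrix that is already PSD passes through unchanged, so the projection only alters draws of $s$ that could not have come from any real data set.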

[Figure 3 plots omitted: KS statistic vs. n (10^1 to 10^3) for θ0, θbias, and σ².]

Figure 3: Synthetic data results: (a) calibration vs. n for $\epsilon = 0.1$; (b) calibration vs. $\epsilon$ for n = 10; (c) QQ plot for n = 10 and $\epsilon = 0.1$; (d) 95% credible interval coverage; (e) MMD of methods to non-private posterior; (f) method runtimes for $\epsilon = 0.1$.

… to approximations in the calculation of multivariate normal distribution fourth moments from a data prior. Utility results are shown in Figure 3e; the noise-aware methods provide at least as good utility as Naive. Run time results are shown in Figure 3f; MCMC-Ind scales with increasing population size while the Gibbs-SS methods, Naive, and Non-Private remain constant. Accordingly, we do not include results for MCMC-Ind for n = 1000 as its run time is prohibitive in those settings.

4.3 Predictive posteriors on real data

Figure 4: Coverage for predictive posterior 50% and 90% credible intervals.

We evaluate the predictive posteriors of the methods on a real world data set measuring the effect of drinking rate on cirrhosis rate.⁴ We scale both covariate and response data to [0, 1]. There are 46 total points, which we randomly split into 36 training examples and 10 test points for each trial. After preliminary exploration to gain domain knowledge, we set a reasonable model prior of $\theta, \sigma^2 \sim \mathrm{NIG}\left([1, 0], \mathrm{diag}([.25, .25]), 20, .5\right)$. We draw samples $\theta^{(k)}, \sigma^2_{(k)}$ from the posterior given training data, and then form the posterior predictive distribution for each test point $y_i$ from these samples. Figure 4 shows coverage of 50% and 90% credible intervals on 1000 test points collected over 100 random train-test splits. Non-Private achieves nearly correct coverage, with the discrepancy due to the fact that the data is not actually drawn from the prior. Gibbs-SS-Noisy achieves nearly the coverage of Non-Private, while Naive is drastically worse in this regime. We note that this experiment emphasizes the advantage of Gibbs-SS-Noisy not needing an explicitly defined data prior, as it only requires the same parameter prior that is needed in non-private analysis.
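The coverage computation described above can be sketched as follows. This is an illustration under assumptions rather than the evaluation code used for Figure 4: the function name is hypothetical, the data are synthetic, and the "posterior samples" are a stand-in concentrated at the true parameters, so the measured coverage should sit near the nominal level.

```python
import numpy as np

def predictive_coverage(theta_samples, sigma2_samples, X_test, y_test, level, rng):
    """Fraction of test responses inside the central `level` predictive interval."""
    means = X_test @ theta_samples.T                       # shape (m, K)
    noise = rng.normal(size=means.shape) * np.sqrt(sigma2_samples)
    draws = means + noise                                  # one predictive draw per posterior sample
    lo = np.quantile(draws, (1.0 - level) / 2.0, axis=1)
    hi = np.quantile(draws, 1.0 - (1.0 - level) / 2.0, axis=1)
    return np.mean((y_test >= lo) & (y_test <= hi))

rng = np.random.default_rng(0)
theta_true, sigma2_true = np.array([1.0, 0.5]), 0.04
X_test = np.column_stack([np.ones(1000), rng.uniform(size=1000)])
y_test = X_test @ theta_true + rng.normal(scale=np.sqrt(sigma2_true), size=1000)

# Stand-in "posterior samples" concentrated at the true parameters (an assumption).
theta_samples = np.tile(theta_true, (2000, 1))
sigma2_samples = np.full(2000, sigma2_true)
cov90 = predictive_coverage(theta_samples, sigma2_samples, X_test, y_test, 0.9, rng)
```

An over-confident posterior would produce intervals that are too narrow and coverage well below the nominal level, which is the failure mode reported for Naive.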

⁴ http://people.sc.fsu.edu/~jburkardt/datasets/regression/x20.txt


Empirical CDFs
‣ Empirical CDF plots show the noise-aware methods are well-calibrated, but the naïve method is over-confident

[Plot omitted: empirical CDFs, true CDF value vs. rank index, for θ0, θbias, and σ².]
[Plot omitted: run time in seconds vs. n.]
[Plot omitted: MMD vs. n (10^1 to 10^3) for θ0, θbias, and σ².]

[Graphical model: parameters 𝜽, 𝜎²; covariate data 𝑋; response data 𝒚; sufficient statistics 𝒔 = (𝑋ᵀ𝑋, 𝑋ᵀ𝒚, 𝒚ᵀ𝒚); noisy sufficient statistics 𝒛]

$\theta, \sigma^2 \sim p(\theta, \sigma^2)$
$x_{1:n} \overset{iid}{\sim}\ ???$ (need to make assumptions!)
$y_{1:n} \sim \mathcal{N}(\theta^T x_i, \sigma^2)$
$s = \sum_{i=1}^{n} t(x_i, y_i)$
$z \sim s + \mathrm{Laplace}(\Delta/\epsilon)$

‣ Implement in PyMC3
‣ Instantiate individuals

First four moments of 𝒙
‣ Gibbs-SS-Noisy: Release moments from sample data
‣ Gibbs-SS-Prior: Sample population via data prior, then calculate moments
‣ Gibbs-SS-Update: Hierarchical normal prior with updated parameters

$\theta, \sigma^2 \sim p(\theta, \sigma^2)$
$s \sim \mathcal{N}(n\mu_t, n\Sigma_t)$ (CLT approximation)
$\omega^2 \sim \mathrm{Exp}\left(\epsilon^2/(2\Delta^2)\right)$ (scale mixture of normals)
$z \sim \mathcal{N}(s, \omega^2)$
