DECOUPLING SHRINKAGE AND SELECTION IN BAYESIAN
LINEAR MODELS: A POSTERIOR SUMMARY PERSPECTIVE
By P. Richard Hahn and Carlos M. Carvalho
Booth School of Business and McCombs School of Business
Selecting a subset of variables for linear models remains an active
area of research. This paper reviews many of the recent contributions
to the Bayesian model selection and shrinkage prior literature. A
posterior variable selection summary is proposed, which distills a full
posterior distribution over regression coefficients into a sequence of
sparse linear predictors.
1. Introduction. This paper revisits the venerable problem of variable selection in linear models. The vantage point throughout is Bayesian: a normal likelihood is assumed and inferences are based on the posterior distribution, which is arrived at by conditioning on observed data.

In applied regression analysis, a "high-dimensional" linear model can be one which involves tens or hundreds of variables, especially when seeking to compute a full Bayesian posterior distribution. Our review will be from the perspective of a data analyst facing a problem in this "moderate" regime. Likewise, we focus on the situation where the number of predictor variables, p, is fixed.
In contrast to other recent papers surveying the large body of literature on Bayesian variable selection [Liang et al., 2008, Bayarri et al., 2012] and shrinkage priors [O'Hara and Sillanpää, 2009, Polson and Scott, 2012], our review focuses specifically on the relationship between variable selection priors and shrinkage priors. Selection priors and shrinkage priors are related both by the statistical ends they attempt to serve (e.g., strong regularization and efficient estimation) and by the technical means they use to achieve these goals (hierarchical priors with local scale parameters). We also compare the two approaches with respect to computational considerations.
Finally, we turn to variable selection as a problem of posterior summarization. We argue that if variable selection is desired primarily for parsimonious communication of linear trends in the data, then this can be accomplished as a post-inference operation irrespective of the choice of prior distribution. To this end, we introduce a posterior variable selection summary, which distills a full posterior distribution over regression coefficients into a sequence of sparse linear predictors. In this sense, "shrinkage" is decoupled from "selection".

We begin by describing the two most common approaches to this scenario and show how both can be seen as special cases of an encompassing formalism.

Keywords and phrases: decision theory, linear regression, loss function, model selection, parsimony, shrinkage prior, sparsity, variable selection.
1.1. Bayesian model selection formalism. A now-canonical way to formalize variable selection in Bayesian linear models is as follows. Let $M_\phi$ denote a normal linear regression model indexed by a vector of binary indicators $\phi = (\phi_1, \dots, \phi_p) \in \{0,1\}^p$ signifying which predictors are included in the regression. Model $M_\phi$ defines the data distribution as

(1)  $(Y_i \mid M_\phi, \beta_\phi, \sigma^2) \sim \mathrm{N}(X_i^\phi \beta_\phi, \sigma^2),$

where $X_i^\phi$ represents the $p_\phi$-vector of predictors included in model $M_\phi$. For notational simplicity, (1) does not include an intercept. Standard practice is to include an intercept term and to assign it a uniform prior.
Given a sample $Y = (Y_1, \dots, Y_n)$ and prior $\pi(\beta_\phi, \sigma^2)$, the inferential target is the set of posterior model probabilities defined by

(2)  $p(M_\phi \mid Y) = \dfrac{p(Y \mid M_\phi)\, p(M_\phi)}{\sum_{\phi'} p(Y \mid M_{\phi'})\, p(M_{\phi'})},$

where $p(Y \mid M_\phi) = \int p(Y \mid M_\phi, \beta_\phi, \sigma^2)\, \pi(\beta_\phi, \sigma^2)\, d\beta_\phi\, d\sigma^2$ is the marginal likelihood of model $M_\phi$ and $p(M_\phi)$ is the prior over models.
Posterior inferences concerning a quantity of interest $\Delta$ are obtained via Bayesian model averaging (or BMA), which entails integrating over the model space:

(3)  $p(\Delta \mid Y) = \sum_\phi p(\Delta \mid M_\phi, Y)\, p(M_\phi \mid Y).$
As an example, optimal predictions of future values $\widetilde{Y}$ under squared-error loss are defined through

(4)  $E(\widetilde{Y} \mid Y) \equiv \sum_\phi E(\widetilde{Y} \mid M_\phi, Y)\, p(M_\phi \mid Y).$

An early reference adopting this formulation is Raftery et al. [1997]; see also Clyde and George [2004].
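For moderate p, the sum in (2) can be evaluated by exhaustive enumeration. The sketch below is a minimal illustration of ours, not code from the paper: it assumes a fixed-g Zellner g-prior with the intercept handled by centering, for which the Bayes factor against the null model has the closed form $(1+g)^{(n-1-p_\phi)/2}[1+g(1-R_\phi^2)]^{-(n-1)/2}$ (see, e.g., Liang et al. [2008]), a uniform prior over models, and p small enough that all $2^p$ subsets can be enumerated; all function names are our own.

import itertools
import numpy as np

def log_bayes_factor(y, X, subset, g):
    """Log Bayes factor of the model indexed by 'subset' against the
    intercept-only null model, under a fixed-g Zellner g-prior."""
    if not subset:
        return 0.0
    n = len(y)
    yc = y - y.mean()
    Xs = X[:, subset] - X[:, subset].mean(axis=0)  # intercept via centering
    bhat, *_ = np.linalg.lstsq(Xs, yc, rcond=None)
    R2 = 1.0 - np.sum((yc - Xs @ bhat) ** 2) / np.sum(yc ** 2)
    p_phi = len(subset)
    return (0.5 * (n - 1 - p_phi) * np.log1p(g)
            - 0.5 * (n - 1) * np.log1p(g * (1.0 - R2)))

def bma_posterior(y, X, g=None):
    """Posterior model probabilities as in (2) and the model-averaged
    coefficient vector, enumerating all 2^p models (uniform model prior)."""
    n, p = X.shape
    g = n if g is None else g                      # unit-information default
    models = [list(s) for k in range(p + 1)
              for s in itertools.combinations(range(p), k)]
    logml = np.array([log_bayes_factor(y, X, s, g) for s in models])
    probs = np.exp(logml - logml.max())
    probs /= probs.sum()
    beta_bma = np.zeros(p)
    for s, w in zip(models, probs):
        if s:
            Xs = X[:, s] - X[:, s].mean(axis=0)
            bhat, *_ = np.linalg.lstsq(Xs, y - y.mean(), rcond=None)
            beta_bma[s] += w * (g / (1.0 + g)) * bhat  # g-prior posterior mean
    return models, probs, beta_bma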
Despite its straightforwardness, carrying out variable selection in this framework demands attention to detail: priors over model-specific parameters must be specified, priors over models must be chosen, marginal likelihood calculations must be performed, and a $2^p$-dimensional discrete space must be explored. These concerns have animated Bayesian research in linear model variable selection for the past two decades [George and McCulloch, 1993, 1997, Clyde and George, 2004, Hans et al., 2007, Liang et al., 2008, Scott and Berger, 2010, Clyde et al., 2011, Bayarri et al., 2012].
Regarding model parameters, the consensus default prior is $\pi(\beta_\phi, \sigma^2) = \pi(\beta_\phi \mid \sigma^2)\pi(\sigma^2) = \mathrm{N}(0, g\Omega) \times \sigma^{-1}$. The most widely studied choice of prior covariance is $\Omega = \sigma^2 (X_\phi^t X_\phi)^{-1}$, referred to as "Zellner's g-prior" [Zellner, 1986], a "g-type" prior, or simply a g-prior. Notice that this choice of $\Omega$ dictates that the prior and likelihood are conjugate normal-inverse-gamma pairs (for a fixed value of g).
For reasons detailed in Liang et al. [2008], it is advisable to place a prior on g rather than use a fixed value. Several recent papers describe priors $p(g)$ that still lead to efficient computation of marginal likelihoods; see Cui and George [2008], Liang et al. [2008], Maruyama and George [2011], and Bayarri et al. [2012]. Each of these papers (as well as the earlier literature cited therein) studies priors of the form

(5)  $p(g) = a\,[\rho_\phi(b+n)]^a\, g^d\, (g+b)^{-(a+c+d+1)}\, \mathbb{1}\{g > \rho_\phi(b+n) - b\}$

with $a > 0$, $b > 0$, $c > -1$, and $d > -1$. Specific configurations of these hyperparameters recommended in the literature include: $a = 1$, $b = 1$, $d = 0$, $\rho_\phi = 1/(1+n)$ [Cui and George, 2008]; $a = 1/2$, $b = 1$ (or $b = n$), $c = 0$, $d = 0$, $\rho_\phi = 1/(1+n)$ [Liang et al., 2008]; and $a = 1$, $b = 1$, $c = -3/4$, $d = (n-5)/2 - p_\phi/2 + 3/4$, $\rho_\phi = 1/(1+n)$ [Maruyama and George, 2011].
Bayarri et al. [2012] motivates the use of such priors from a testing perspective, using a variety of formal desiderata based on Jeffreys [1961] and Berger and Pericchi [2001], including consistency criteria, predictive matching criteria, and invariance criteria. Their recommended prior uses $a = 1/2$, $b = 1$, $c = 0$, $d = 0$, $\rho_\phi = 1/p_\phi$. This prior is termed the robust prior, in the tradition following Strawderman [1971] and Berger [1980], who examine the various senses in which such priors are "robust". This prior will serve as a benchmark in the examples of Section 3.
Regarding prior model probabilities, see Scott and Berger [2010], who recommend a hierarchical prior of the form $\phi_j \overset{iid}{\sim} \mathrm{Ber}(q)$, $q \sim \mathrm{Unif}(0,1)$.
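A useful consequence of this hierarchical form, emphasized by Scott and Berger [2010], is an automatic multiplicity adjustment: integrating out q gives

$p(M_\phi) = \int_0^1 q^{p_\phi} (1-q)^{p - p_\phi}\, dq = \frac{p_\phi!\,(p - p_\phi)!}{(p+1)!} = \frac{1}{(p+1)\binom{p}{p_\phi}},$

so that total prior mass $1/(p+1)$ is assigned to each model size and then divided evenly among the $\binom{p}{p_\phi}$ models of that size.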
1.2. Shrinkage regularization priors. Although the formulation above provides a valuable theoretical framework, it does not necessarily represent an applied statistician's first choice. To assess which variables contribute dominantly to trends in the data, the goal may be simply to mitigate, rather than categorize, spurious correlations. Thus, faced with many potentially irrelevant predictor variables, a common first choice would be a powerful regularization prior.
Regularization, understood here as the intentional biasing of an estimate to stabilize posterior inference, is inherent to most Bayesian estimators via the use of proper prior distributions, and it is one of the often-cited advantages of the Bayesian approach. More specifically, regularization priors are priors explicitly designed with a strong bias for the purpose of separating reliable from spurious patterns in the data. In linear models, this strategy takes the form of zero-centered priors with sharp modes and simultaneously fat tails.
A well-studied class of priors fitting this description will serve to connect continuous priors to the model selection priors described above. Local scale mixtures of normal distributions take the form [West, 1987, Carvalho et al., 2010, Griffin and Brown, 2012]

(6)  $\pi(\beta_j \mid \lambda) = \int \mathrm{N}(\beta_j \mid 0, \lambda^2 \lambda_j^2)\, \pi(\lambda_j^2)\, d\lambda_j^2,$

where different priors are derived from different choices of $\pi(\lambda_j^2)$.
The last several years have seen tremendous interest in this area, motivated by an analogy with penalized-likelihood methods [Tibshirani, 1996]. Penalized-likelihood methods with an additive penalty term lead to estimating equations of the form

(7)  $\sum_i h(Y_i, X_i, \beta) + \alpha Q(\beta),$

where $h$ and $Q$ are positive functions and their sum is to be minimized; $\alpha$ is a scalar tuning variable dictating the strength of the penalty. Typically, $h$ is interpreted as a negative log-likelihood, given data $Y$, and $Q$ is a penalty term introduced to stabilize maximum likelihood estimation. A common choice is $Q(\beta) = \lVert\beta\rVert_1$, which yields sparse optimal solutions $\beta^*$ and admits fast computation [Tibshirani, 1996]; this choice underpins the lasso estimator, an initialism for "least absolute shrinkage and selection operator".
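As a concrete (simulated) illustration of (7), with $h$ the squared-error loss and $Q$ the $\ell_1$ norm, the following sketch uses scikit-learn's Lasso; the data and the penalty value are arbitrary choices of ours:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = (2.0, -1.5, 1.0)      # only three predictors matter
y = X @ beta_true + rng.standard_normal(n)

# h = squared error, Q = l1 norm; alpha plays the role of the penalty in (7)
fit = Lasso(alpha=0.2).fit(X, y)
print(np.flatnonzero(fit.coef_))      # the optimal solution vector is sparse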
Park and Casella [2008] and Hans [2009] "Bayesified" these expressions by interpreting $Q(\beta)$ as the negative log prior density and developing algorithms for sampling from the resulting Bayesian posterior, building upon work of earlier Bayesian authors [Spiegelhalter, 1977, West, 1987, Pericchi and Walley, 1991, Pericchi and Smith, 1992]. Specifically, an exponential prior $\pi(\lambda_j^2) = \mathrm{Exp}(\alpha^2)$ leads to independent Laplace (double-exponential) priors on the $\beta_j$, mirroring expression (7).
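The underlying identity is the classical exponential scale mixture of normals; written with the exponential in its rate-$\alpha^2/2$ parameterization (conventions for the rate differ across papers),

$\int_0^\infty \mathrm{N}(\beta_j \mid 0, s)\, \frac{\alpha^2}{2} e^{-\alpha^2 s/2}\, ds = \frac{\alpha}{2} e^{-\alpha |\beta_j|},$

which is the Laplace density with scale $1/\alpha$.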
This approach has two implications unique to the Bayesian paradigm. First, it presents an opportunity to treat the global scale parameter $\lambda$ (equivalently, the regularization penalty parameter $\alpha$) as a hyperparameter to be estimated. Averaging over $\lambda$ in the Bayesian paradigm has been empirically observed to give better prediction performance than cross-validated selection of $\alpha$ (e.g., Hans [2009]). Second, a Bayesian approach necessitates forming point estimators from posterior distributions; typically the posterior mean is adopted on the basis that it minimizes mean squared prediction error. Note that posterior mean regression coefficient vectors from these models are non-sparse with probability one. Ironically, the two main appeals of penalized-likelihood methods (efficient computation and sparse solution vectors $\beta^*$) were lost in the migration to a Bayesian approach. See, however, Hans [2010] for an application of double-exponential priors in the context of model selection.
Nonetheless, wide interest in "Bayesian lasso" models paved the way for more general local shrinkage regularization priors of the form (6). In particular, Carvalho et al. [2010] develops a prior over location parameters that attempts to shrink irrelevant signals strongly toward zero while avoiding excessive shrinkage of relevant signals. To contextualize this aim, recall that solutions to $\ell_1$-penalized likelihood problems are often interpreted as (convex) approximations to more challenging formulations based on $\ell_0$ penalties: $\lVert\gamma\rVert_0 = \sum_j \mathbb{1}(\gamma_j \neq 0)$. As such, it was observed that the global $\ell_1$ penalty "overshrinks" what ought to be large-magnitude coefficients. The prior of Carvalho et al. [2010], for example, may be written as

(8)  $\pi(\beta_j \mid \lambda) = \mathrm{N}(0, \lambda^2 \lambda_j^2), \qquad \lambda_j \overset{iid}{\sim} \mathrm{C}^{+}(0,1),$

with $\lambda \sim \mathrm{C}^{+}(0,1)$ or $\lambda \sim \mathrm{C}^{+}(0,\sigma^2)$. The choice of half-Cauchy arises from the insight that for scalar observations $y_j \sim \mathrm{N}(\theta_j, 1)$ with prior $\theta_j \sim \mathrm{N}(0, \lambda_j^2)$, the posterior mean of $\theta_j$ may be expressed as

(9)  $E(\theta_j \mid y_j) = \{1 - E(\kappa_j \mid y_j)\}\, y_j,$

where $\kappa_j = 1/(1 + \lambda_j^2)$. The authors observe that a U-shaped Beta(1/2, 1/2) distribution (shaped like a horseshoe) on $\kappa_j$ implies a prior over $\theta_j$ with high mass around the origin but with polynomial tails. That is, the "horseshoe" prior encodes the assumption that some coefficients will be very large and many others will be very nearly zero. This U-shaped prior on $\kappa_j$ implies the half-Cauchy prior density $\pi(\lambda_j)$. The implied marginal prior on $\beta_j$ has Cauchy-like tails and a pole at the origin, which entails more aggressive shrinkage than a Laplace prior.
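Equation (9) is easy to probe numerically. The sketch below is our own illustration, not code from the paper: it approximates $E(\kappa_j \mid y_j)$ by importance sampling, drawing $\lambda_j$ from its half-Cauchy prior and weighting each draw by the marginal likelihood $\mathrm{N}(y_j \mid 0, 1 + \lambda_j^2)$; the global scale is fixed at $\lambda = 1$ for simplicity.

import numpy as np

def horseshoe_posterior_mean(y, n_draws=200_000, seed=0):
    """E(theta | y) = (1 - E(kappa | y)) * y for y ~ N(theta, 1),
    theta ~ N(0, lambda_j^2), lambda_j ~ C+(0, 1)."""
    rng = np.random.default_rng(seed)
    lam = np.abs(rng.standard_cauchy(n_draws))      # half-Cauchy prior draws
    kappa = 1.0 / (1.0 + lam ** 2)                  # shrinkage weight in (9)
    var = 1.0 + lam ** 2                            # marginal variance of y given lam
    w = np.exp(-0.5 * y ** 2 / var) / np.sqrt(var)  # importance weights
    e_kappa = np.sum(w * kappa) / np.sum(w)
    return (1.0 - e_kappa) * y

for y in (0.5, 2.0, 5.0):
    print(y, round(horseshoe_posterior_mean(y), 3))
# small |y| is shrunk nearly to zero; large |y| is left almost untouched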
Other choices of $\pi(\lambda_j)$ lead to different "shrinkage profiles" on the "$\kappa$ scale". Polson and Scott [2012] provides an excellent taxonomy of the various priors over $\beta$ that can be obtained as scale mixtures of normals. The horseshoe and similar priors (e.g., Griffin and Brown [2012]) have proven empirically to be fine default choices for regression coefficients: they lack hyperparameters, forcefully separate strong from weak predictors, and exhibit robust predictive performance.
1.3. Model selection priors as shrinkage priors. It is possible to express model selection priors as shrinkage priors. To motivate this re-framing, observe that the posterior mean regression coefficient vector is not well-defined in the model selection framework. Using the model-averaging notion, the posterior average $\beta$ may be defined as

(10)  $E(\beta \mid Y) \equiv \sum_\phi E(\beta \mid M_\phi, Y)\, p(M_\phi \mid Y),$
where $E(\beta_j \mid M_\phi, Y) \equiv 0$ whenever $\phi_j = 0$. Without this definition, the posterior expectation of $\beta_j$ is undefined in models where the $j$th predictor does not appear. More specifically, as the likelihood is constant in variable $j$ in such models, the posterior remains whatever the prior was chosen to be.

To fully resolve this indeterminacy, it is common to set $\beta_j$ identically equal to zero in models where the $j$th predictor does not appear, consistent with the interpretation that $\beta_j \equiv \partial E(Y)/\partial X_j$.
A hierarchical prior reflecting this choice may be expressed as

(11)  $\pi(\beta \mid g, \Lambda, \Omega) = \mathrm{N}(0, g\Lambda\Omega\Lambda^t).$

In this expression, $\Lambda \equiv \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_p)$ and $\Omega$ is a positive semi-definite matrix, both of which may depend on $\phi$ and/or $\sigma^2$. When $\Omega$ is the identity matrix, one recovers (6).
In order to set $\beta_j = 0$ when $\phi_j = 0$, let $\lambda_j \equiv \phi_j s_j$ for $s_j > 0$, so that when $\phi_j = 0$, the prior variance of $\beta_j$ is set to zero (with prior mean of zero). George and McCulloch [1997] develops this approach in detail, including the g-prior specification, $\Omega(\phi) = \sigma^2 (X_\phi^t X_\phi)^{-1}$. Priors over the $s_j$ induce a prior on $\Lambda$. Under these definitions of the $\lambda_j$ and $\Omega$, the component-wise marginal distribution for $\beta_j$, $j = 1, \dots, p$, may be written as a two-component mixture: a point mass at zero when $\phi_j = 0$, and a normal distribution with variance $g s_j^2 \Omega_{jj}$ when $\phi_j = 1$.
Table 1. Selected models by different methods in the U.S. crime example. The MPM column displays marginal inclusion probabilities, with the numbers in bold associated with the variables included in the median probability model. The HS(th) column refers to the hard thresholding of Section 1.5 under the horseshoe prior. The t-stat column is the model defined by OLS p-values smaller than 0.05. The $R^2_{mle}$ row reports the traditional in-sample percentage of variation explained by the least-squares fit based only on the variables in a given column.
Example: diabetes dataset (p = 10, n = 442). The diabetes data were used to demonstrate the lars algorithm in Efron et al. [2004]. The data consist of p = 10 baseline measurements on n = 442 diabetic patients; the response variable is a numerical measurement of disease progression. As in Efron et al. [2004], we work with centered and scaled predictor and response variables. In this example we use only the robust prior of Bayarri et al. [2012]. The goal is to focus on the sequence in which the variables are included and to illustrate how DSS provides an attractive alternative to the median probability model.
Fig 1. U.S. Crime Data: DSS plots under the horseshoe prior.

Table 2 shows the variables included in each model along the DSS path, up to the 5-variable model. The DSS plots in this example (omitted here) suggest that this should be the largest model under consideration. The table also reports the median probability model.

Notice that marginal inclusion probabilities do not necessarily offer a good way to rank variable importance, particularly in cases where the predictors are highly collinear. This is evident in the current example in the "dilution" of inclusion probabilities among the variables with the strongest dependencies in this dataset: TC, LDL, HDL, TCH, and LTG. It is possible to see the same effect
in the ranking of high-probability models, as most models at the top of the list represent distinct combinations of correlated predictors. In the sequence of models from DSS, the variables LTG and HDL are chosen as the representatives for this group.

Meanwhile, a variable such as Sex appears with a marginal inclusion probability of 0.98, and yet its removal from the five-variable DSS model leads to only a minor decrease in the model's predictive ability. Thus the diabetes data offer a clear example where statistical significance can overwhelm practical relevance if one looks only at standard Bayesian outputs. The summary provided by DSS makes a distinction between the two notions of relevance, providing a clear sense of the predictive cost associated with dropping a predictor.
Fig 2. U.S. Crime Data: DSS plots under the "robust" prior of Bayarri et al. [2012] (top row) and under a g-prior with g = n (bottom row). All $2^{15}$ models were evaluated in this example.

Fig 3. U.S. Crime Data under the horseshoe prior: $\bar\beta$ refers to the posterior mean, while $\beta_{DSS}$ is the value of $\beta_\lambda$ under different values of $\lambda$ such that different numbers of variables are selected. Panels show DSS model sizes 15, 9, 7, and 2.

Example: protein activation dataset (p = 88, n = 96). The protein activity dataset is from Clyde et al. [2011]. This example differs from the previous one in that, with p = 88 predictors, the model space can no longer be exhaustively enumerated. In addition, correlation between the
potential predictors is as high as 0.99, with 17 pairs of variables having correlations above 0.95. For this example, the horseshoe prior and the robust prior are considered. To search the model space, we use a conventional Gibbs sampling strategy as in Garcia-Donato and Martinez-Beneito [2013] (Appendix A), based on George and McCulloch [1997].
Figure 4 shows the DSS plots under the two priors considered. Once again, the horseshoe prior leads to smaller estimates of $\rho^2$. And once again, despite this difference, the DSS heuristic returns the same six predictors under both priors. On this data set, the MPM under the Gibbs search (as well as the HPM and MPM given by BAS) coincides with the DSS summary model.
Table 2. Selected models by DSS and the model selection prior in the diabetes example. The MPM column displays marginal inclusion probabilities, and the numbers in bold are associated with the variables included in the median probability model. The t-stat column is the model defined by OLS p-values smaller than 0.05. The $R^2_{MLE}$ row reports the traditional in-sample percentage of variation explained by the least-squares fit based only on the variables in a given column.

Example: protein activation dataset (p = 88, n = 80). To explore the behavior of DSS in the p > n regime, we modify the previous example by randomly selecting a subset of n = 80 observations
from the original dataset. These 80 observations are used to form our posterior distribution. To define the DSS summary, we take X to be the entire set of 96 predictor values. For simplicity we use only the robust model selection prior. Figure 5 shows the results; with fewer observations, smaller models do not give up as much on the $\rho^2$ and $\psi$ scales as in the original example. A conservative reading of the DSS plots leads to the same 6-variable model; however, in this limited-information situation, the models with 5 or 4 variables are competitive. One important aspect of Figure 5 is that, even working in the p > n regime, DSS is able to evaluate the performance of, and provide a summary for, models of any dimension up to the full model. This is accomplished even though, by using the robust prior, the posterior was limited to models of dimension up to n − 1. For this to be achieved, all DSS needs is for the number of points in X to be larger than p. In situations where not enough points are available in the dataset, the user need only add (arbitrary, and without loss of generality) representative points at which to make predictions about potential future values of Y.
Fig 4. Protein Activation Data: DSS plots under model selection priors (top row) and under shrinkage priors (bottom row).

Fig 5. Protein Activation Data (p > n case): DSS plots under model selection priors.

4. Discussion. A detailed examination of the previous literature reveals that sparsity can play many roles in a statistical analysis: model selection, strong regularization, and improved computation, for example. A central, but often implicit, virtue of sparsity is that human beings find fewer variables easier to think about.

When one desires sparse model summaries for improved comprehensibility, prior distributions are an unnatural vehicle for furnishing this bias. Instead, we describe how to use a decision-theoretic approach to induce sparse posterior model summaries. Our new loss function resembles the popular penalized-likelihood objective function of the lasso estimator, but its interpretation is very different. Instead of a regularizing tool for estimation, our loss function is a posterior summarizer with an explicit parsimony penalty. To our knowledge this is the first such loss function to be proposed in this capacity. Conceptually, its nearest forerunner would be high posterior density regions, which summarize a posterior density while satisfying a compactness constraint.
R. B. O'Hara and M. J. Sillanpää. A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis, 4(1):85–117, 2009.
T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103:681–686, 2008.
L. Pericchi and A. Smith. Exact and approximate posterior moments for a normal location parameter. Journal of the Royal Statistical Society, Series B (Methodological), pages 793–804, 1992.
L. R. Pericchi and P. Walley. Robust Bayesian credible intervals and prior ignorance. International Statistical Review, pages 1–23, 1991.
N. G. Polson and J. G. Scott. Local shrinkage rules, Lévy processes and regularized regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 74(2):287–311, 2012.
N. G. Polson, J. G. Scott, and J. Windle. Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108:1339–1349, 2013.
A. Raftery, D. Madigan, and J. Hoeting. Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 92:1197–1208, 1997.
J. Scott and J. Berger. An exploration of aspects of Bayesian multiple testing. Journal of Statistical Planning and Inference, 136:2144–2162, 2006.
J. G. Scott and J. O. Berger. Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics, 38(5):2587–2619, 2010.
D. Spiegelhalter. A test for normality against symmetric alternatives. Biometrika, 64(2):415–418, 1977.
W. E. Strawderman. Proper Bayes minimax estimators of the multivariate normal mean. The Annals of Mathematical Statistics, pages 385–388, 1971.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
W. Vandaele. Participation in illegitimate activities: Ehrlich revisited. In A. Blumstein, J. Cohen, and D. Nagin, editors, Deterrence and Incapacitation, pages 270–335. National Academy of Sciences Press, 1978.
M. West. On scale mixtures of normal distributions. Biometrika, 74(3):646–648, 1987.
A. Zellner. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, pages 233–243. Amsterdam: North-Holland, 1986.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006.
H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36:1509–1533, 2008.
APPENDIX A: EXTENSIONS
A.1. Selection summary in logistic regression. Selection summary can be applied outside the realm of normal linear models as well. This section explicitly shows how to extend the approach to logistic regression and provides an illustration on real data.

Although one has many choices for judging predictive accuracy, it is convenient to note that squared prediction loss is precisely the negative log-likelihood in the normal linear model setting, which suggests the following generalization of (16):
(25)  $L(\widetilde{Y}, \gamma) = \lambda \lVert\gamma\rVert_0 - n^{-1} \log f(\widetilde{Y}, X, \gamma),$

where $f(\widetilde{Y}, X, \gamma)$ denotes the likelihood of $\widetilde{Y}$ with "parameters" $\gamma$.
In the case of a binary outcome vector using a logistic link function, the generalized DSS loss becomes

(26)  $L(\widetilde{Y}, \gamma) = \lambda \lVert\gamma\rVert_0 - n^{-1} \sum_{i=1}^n \left( \widetilde{Y}_i X_i \gamma - \log\left(1 + \exp(X_i \gamma)\right) \right).$
Taking expectations yields

(27)  $L(\bar{\pi}, \gamma) = \lambda \lVert\gamma\rVert_0 - n^{-1} \sum_{i=1}^n \left( \bar{\pi}_i X_i \gamma - \log\left(1 + \exp(X_i \gamma)\right) \right),$

where $\bar{\pi}_i$ is the posterior mean probability that $\widetilde{Y}_i = 1$. To help interpret this formula, note that it can be rewritten as a weighted logistic regression as follows. For each observed $X_i$, associate a pair of pseudo-responses $Z_i = 1$ and $Z_{i+n} = 0$ with weights $w_i = \bar{\pi}_i$ and $w_{i+n} = 1 - \bar{\pi}_i$, respectively. Then $\bar{\pi}_i X_i \gamma - \log(1 + \exp(X_i \gamma))$ may be written as

(28)  $\left[ w_i Z_i X_i \gamma - w_i \log\left(1 + \exp(X_i \gamma)\right) \right] + \left[ w_{i+n} Z_{i+n} X_i \gamma - w_{i+n} \log\left(1 + \exp(X_i \gamma)\right) \right].$
Thus, optimizing the DSS logistic regression loss is equivalent to finding the penalized maximum likelihood estimate of a weighted logistic regression in which each point in predictor space has a response $Z_i = 1$, given weight $\bar{\pi}_i$, and a counterpart response $Z_i = 0$, given weight $1 - \bar{\pi}_i$. The observed data determine $\bar{\pi}_i$ via the posterior distribution. As before, if we replace (27) by the surrogate $\ell_1$ norm,

(29)  $L(\bar{\pi}, \gamma) = \lambda \lVert\gamma\rVert_1 - n^{-1} \sum_{i=1}^n \left( \bar{\pi}_i X_i \gamma - \log\left(1 + \exp(X_i \gamma)\right) \right),$

then an optimal solution can be computed via the R package glmnet [Friedman et al., 2010].
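The following sketch implements this weighted-logistic formulation with scikit-learn in place of glmnet (a substitution of convenience on our part; the mapping of the penalty weight $\lambda$ to the package's C parameter is approximate, and all function names are ours):

import numpy as np
from sklearn.linear_model import LogisticRegression

def dss_logistic_path(X, pi_bar, lambdas):
    """Approximate minimizers of (29) along a grid of penalty weights.
    X: (n, p) predictors; pi_bar: posterior mean of P(Y_i = 1)."""
    n, _ = X.shape
    X2 = np.vstack([X, X])                          # each X_i appears twice...
    Z = np.concatenate([np.ones(n), np.zeros(n)])   # ...as in (28): Z_i = 1, Z_{i+n} = 0
    w = np.concatenate([pi_bar, 1.0 - pi_bar])      # weights pi_bar_i and 1 - pi_bar_i
    path = []
    for lam in lambdas:
        fit = LogisticRegression(penalty="l1", C=1.0 / (n * lam),
                                 solver="liblinear")
        fit.fit(X2, Z, sample_weight=w)
        path.append(fit.coef_.ravel())              # gamma_lambda, sparser as lam grows
    return np.array(path)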
The DSS summary selection plot may be adapted to logistic regression by defining the excess error as

(30)  $\psi_\lambda = \sqrt{ n^{-1} \sum_i \left( \pi_i - 2\pi_{\lambda,i}\pi_i + \pi_{\lambda,i}^2 \right) } - \sqrt{ n^{-1} \sum_i \pi_i (1 - \pi_i) },$
where $\pi_i$ is the probability that $y_i = 1$ given the true model parameters, and $\pi_{\lambda,i}$ is the corresponding quantity under the $\lambda$-sparsified model. This expression for the logistic excess error relates to the linear model case in that each expression can be derived from

(31)  $\psi_\lambda = \sqrt{ n^{-1} E\left( \lVert \widetilde{Y} - \widetilde{Y}_\lambda \rVert^2 \right) } - \sqrt{ n^{-1} E\left( \lVert \widetilde{Y} - E(\widetilde{Y}) \rVert^2 \right) },$

where the expectation is with respect to the predictive distribution of $\widetilde{Y}$ conditional on the model parameters, and $\widetilde{Y}_\lambda$ denotes the optimal $\lambda$-sparse prediction. In particular, $\widetilde{Y}_\lambda \equiv X\beta_\lambda$ for the linear model and $\widetilde{y}_{\lambda,i} \equiv \pi_{\lambda,i} = (1 + \exp(-X_i\beta_\lambda))^{-1}$ for the logistic regression model. One notable difference between the expressions for excess error under the linear model and the logistic model is that the linear model has constant variance, whereas in the logistic model the variance term depends on the predictor point as a result of the Bernoulli likelihood.
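For concreteness, (30) transcribes directly into a small helper function (ours, added for illustration):

import numpy as np

def logistic_excess_error(pi, pi_lam):
    """Excess error (30): pi are the true success probabilities,
    pi_lam those implied by the lambda-sparsified model."""
    full = np.sqrt(np.mean(pi - 2.0 * pi_lam * pi + pi_lam ** 2))
    base = np.sqrt(np.mean(pi * (1.0 - pi)))
    return full - base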
Example: German credit data (n = 1000, p = 48). To illustrate selection summary in the logistic regression context, we use the German credit data from the UCI repository, where n = 1000 and p = 48. Each record contains covariates associated with a loan applicant, such as credit history, checking account status, car ownership, and employment status. The outcome variable is a judgment of whether or not the applicant has "good credit". A natural objective when analyzing these data would be to develop a good model for assessing the creditworthiness of future applicants. A default shrinkage prior over the regression coefficients is used, based on the ideas described in Polson et al. [2013] and the associated R package BayesLogit. The DSS selection summary plots (adapted to logistic regression) are displayed in Figure 6. The plots suggest a high degree of "pre-variable selection", in that all of the predictor variables appear to add an incremental amount of prediction accuracy, with no single predictor appearing to dominate. Nonetheless, several of the larger models (smaller than the full forty-eight-variable model) do not give up much in excess error, suggesting that a moderately reduced model (of roughly 35 variables) may suffice in practice. Depending on the true costs associated with measuring those ten least valuable covariates, relative to the cost associated with an increase of 0.01 in excess error, this reduced model may be preferable.
Fig 6. DSS plots for the German credit data. For these data, each included variable seems to add an incremental amount, as the excess error plot builds steadily until reaching the null model with no predictors.
A.2. Selection summary for Gaussian graphical models. Covariance estimation is yet another area where a sparsifying loss function can be used to induce a parsimonious posterior summary.

Consider a $(p \times 1)$ vector $X = (x_1, x_2, \dots, x_p) \sim \mathrm{N}(0, \Sigma)$. Zeros in the precision matrix $\Omega = \Sigma^{-1}$ imply conditional independence among certain dimensions of $X$. As sparse precision matrices can be represented through a labelled graph, this modeling approach is often referred to as Gaussian graphical modeling. Specifically, for a graph $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges, let each edge represent a non-zero element of $\Omega$. See Jones et al. [2005] for a thorough overview. This problem is equivalent to finding a sparse representation in $p$ separate linear models for $X_j \mid X_{-j}$, making the selection summary approach developed above directly applicable.
As with linear models, one has the option of modeling the entries in the precision matrix via shrinkage priors or via selection priors with point masses at zero. Regardless of the specific choice of prior, summarizing the patterns of conditional independence favored in the posterior distribution remains a major challenge.
A DSS parsimonious summary can be achieved via a multivariate extension of (16) by once again leveraging the notion of "predictive accuracy" as defined by the negative log-likelihood:

(32)  $L(\widetilde{X}, \Gamma) = \lambda \lVert\Gamma\rVert_0 - \log\det(\Gamma) + \mathrm{tr}(n^{-1}\widetilde{X}\widetilde{X}^t\Gamma),$

where $\Gamma$ represents the decision variable for $\Omega$ and $\lVert\Gamma\rVert_0$ counts the non-zero off-diagonal elements of $\Gamma$. Taking expectations with respect to the posterior predictive distribution of $\widetilde{X}$ yields

(33)  $L(\Gamma) = E\left( L(\widetilde{X}, \Gamma) \right) = \lambda \lVert\Gamma\rVert_0 - \log\det(\Gamma) + \mathrm{tr}(\bar{\Sigma}\Gamma),$

where $\bar{\Sigma}$ represents the posterior mean of $\Sigma$.
As before, an approximate solution to the DSS graphical model posterior summary optimization problem can be obtained by employing the surrogate $\ell_1$ penalty,

(34)  $L(\Gamma) = \lambda \lVert\Gamma\rVert_1 - \log\det(\Gamma) + \mathrm{tr}(\bar{\Sigma}\Gamma),$

as developed in penalized likelihood methods such as the graphical lasso [Friedman et al., 2008].
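In practice this means the surrogate problem (34) can be handed to any off-the-shelf graphical lasso solver, with the posterior mean $\bar{\Sigma}$ standing in for the sample covariance. A minimal sketch using scikit-learn (our choice of solver, not the paper's; note that, like (32), it penalizes only off-diagonal entries):

import numpy as np
from sklearn.covariance import graphical_lasso

def dss_graphical_summary(Sigma_bar, lam):
    """Minimize (34), lam*||Gamma||_1 - log det(Gamma) + tr(Sigma_bar Gamma),
    by running the graphical lasso on the posterior mean covariance."""
    _, Gamma = graphical_lasso(Sigma_bar, alpha=lam)
    # edges of the implied conditional-independence graph
    edges = np.argwhere(np.triu(np.abs(Gamma) > 1e-8, k=1))
    return Gamma, edges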