DECOUPLING SHRINKAGE AND SELECTION IN BAYESIAN
LINEAR MODELS: A POSTERIOR SUMMARY PERSPECTIVE
By P. Richard Hahn and Carlos M. Carvalho
Booth School of Business and McCombs School of Business
Selecting a subset of variables for linear models remains an active
area of research. This paper reviews many of the recent contributions
to the Bayesian model selection and shrinkage prior literature. A
posterior variable selection summary is proposed, which distills a full
posterior distribution over regression coefficients into a sequence of
sparse linear predictors.
1. Introduction. This paper revisits the venerable problem of variable selection in linear models. The vantage point throughout is Bayesian: a normal likelihood is assumed and inferences are based on the posterior distribution, which is arrived at by conditioning on observed data.

In applied regression analysis, a "high-dimensional" linear model can be one which involves tens or hundreds of variables, especially when seeking to compute a full Bayesian posterior distribution. Our review will be from the perspective of a data analyst facing a problem in this "moderate" regime. Likewise, we focus on the situation where the number of predictor variables, p, is fixed.
In contrast to other recent papers surveying the large body of literature on Bayesian variable selection [Liang et al., 2008, Bayarri et al., 2012] and shrinkage priors [O'Hara and Sillanpää, 2009, Polson and Scott, 2012], our review focuses specifically on the relationship between variable selection priors and shrinkage priors. Selection priors and shrinkage priors are related both by the statistical ends they attempt to serve (e.g., strong regularization and efficient estimation) and by the technical means they use to achieve these goals (hierarchical priors with local scale parameters). We also compare the two approaches with respect to computational considerations.
Finally, we turn to variable selection as a problem of posterior summarization. We argue that if variable selection is desired primarily for parsimonious communication of linear trends in the data, then this can be accomplished as a post-inference operation irrespective of the choice of prior distribution. To this end, we introduce a posterior variable selection summary, which distills a full posterior distribution over regression coefficients into a sequence of sparse linear predictors. In this sense, "shrinkage" is decoupled from "selection".

We begin by describing the two most common approaches to this scenario and show how both can be seen as special cases of an encompassing formalism.

Keywords and phrases: decision theory, linear regression, loss function, model selection, parsimony, shrinkage prior, sparsity, variable selection.
1.1. Bayesian model selection formalism. A now-canonical way to formalize variable selection in Bayesian linear models is as follows. Let $M_\phi$ denote a normal linear regression model indexed by a vector of binary indicators $\phi = (\phi_1, \dots, \phi_p) \in \{0,1\}^p$ signifying which predictors are included in the regression. Model $M_\phi$ defines the data distribution as

(1)  $(Y_i \mid M_\phi, \beta_\phi, \sigma^2) \sim \mathrm{N}(X_i^\phi \beta_\phi, \sigma^2),$

where $X_i^\phi$ represents the $p_\phi$-vector of predictors included in model $M_\phi$. For notational simplicity, (1) does not include an intercept. Standard practice is to include an intercept term and to assign it a uniform prior.
Given a sample $Y = (Y_1, \dots, Y_n)$ and prior $\pi(\beta_\phi, \sigma^2)$, the inferential target is the set of posterior model probabilities defined by

(2)  $p(M_\phi \mid Y) = \dfrac{p(Y \mid M_\phi)\, p(M_\phi)}{\sum_{\phi'} p(Y \mid M_{\phi'})\, p(M_{\phi'})},$

where $p(Y \mid M_\phi) = \int p(Y \mid M_\phi, \beta_\phi, \sigma^2)\, \pi(\beta_\phi, \sigma^2)\, d\beta_\phi\, d\sigma^2$ is the marginal likelihood of model $M_\phi$ and $p(M_\phi)$ is the prior over models.
Posterior inferences concerning a quantity of interest $\Delta$ are obtained via Bayesian model averaging (or BMA), which entails integrating over the model space:

(3)  $p(\Delta \mid Y) = \sum_\phi p(\Delta \mid M_\phi, Y)\, p(M_\phi \mid Y).$
As an example, optimal predictions of future values $\widetilde{Y}$ under squared-error loss are defined through

(4)  $E(\widetilde{Y} \mid Y) \equiv \sum_\phi E(\widetilde{Y} \mid M_\phi, Y)\, p(M_\phi \mid Y).$

An early reference adopting this formulation is Raftery et al. [1997]; see also Clyde and George [2004].
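For moderate p, the sum in (2) can be evaluated by exhaustive enumeration. The sketch below is a minimal illustration of ours, not code from the paper: it assumes a fixed-g Zellner g-prior with the intercept handled by centering, for which the Bayes factor against the null model has the closed form $(1+g)^{(n-1-p_\phi)/2}[1+g(1-R_\phi^2)]^{-(n-1)/2}$ (see, e.g., Liang et al. [2008]), a uniform prior over models, and p small enough that all $2^p$ subsets can be enumerated; all function names are our own.

import itertools
import numpy as np

def log_bayes_factor(y, X, subset, g):
    """Log Bayes factor of the model indexed by 'subset' against the
    intercept-only null model, under a fixed-g Zellner g-prior."""
    if not subset:
        return 0.0
    n = len(y)
    yc = y - y.mean()
    Xs = X[:, subset] - X[:, subset].mean(axis=0)  # intercept via centering
    bhat, *_ = np.linalg.lstsq(Xs, yc, rcond=None)
    R2 = 1.0 - np.sum((yc - Xs @ bhat) ** 2) / np.sum(yc ** 2)
    p_phi = len(subset)
    return (0.5 * (n - 1 - p_phi) * np.log1p(g)
            - 0.5 * (n - 1) * np.log1p(g * (1.0 - R2)))

def bma_posterior(y, X, g=None):
    """Posterior model probabilities as in (2) and the model-averaged
    coefficient vector, enumerating all 2^p models (uniform model prior)."""
    n, p = X.shape
    g = n if g is None else g                      # unit-information default
    models = [list(s) for k in range(p + 1)
              for s in itertools.combinations(range(p), k)]
    logml = np.array([log_bayes_factor(y, X, s, g) for s in models])
    probs = np.exp(logml - logml.max())
    probs /= probs.sum()
    beta_bma = np.zeros(p)
    for s, w in zip(models, probs):
        if s:
            Xs = X[:, s] - X[:, s].mean(axis=0)
            bhat, *_ = np.linalg.lstsq(Xs, y - y.mean(), rcond=None)
            beta_bma[s] += w * (g / (1.0 + g)) * bhat  # g-prior posterior mean
    return models, probs, beta_bma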
Despite its straightforwardness, carrying out variable selection in this framework demands attention to detail: priors over model-specific parameters must be specified, priors over models must be chosen, marginal likelihood calculations must be performed, and a $2^p$-dimensional discrete space must be explored. These concerns have animated Bayesian research in linear model variable selection for the past two decades [George and McCulloch, 1993, 1997, Clyde and George, 2004, Hans et al., 2007, Liang et al., 2008, Scott and Berger, 2010, Clyde et al., 2011, Bayarri et al., 2012].
Regarding model parameters, the consensus default prior is $\pi(\beta_\phi, \sigma^2) = \pi(\beta_\phi \mid \sigma^2)\pi(\sigma^2) = \mathrm{N}(0, g\Omega) \times \sigma^{-1}$. The most widely studied choice of prior covariance is $\Omega = \sigma^2 (X_\phi^t X_\phi)^{-1}$, referred to as "Zellner's g-prior" [Zellner, 1986], a "g-type" prior, or simply a g-prior. Notice that this choice of $\Omega$ dictates that the prior and likelihood are conjugate normal-inverse-gamma pairs (for a fixed value of g).
For reasons detailed in Liang et al. [2008], it is advisable to place a prior on g rather than use a fixed value. Several recent papers describe priors $p(g)$ that still lead to efficient computation of marginal likelihoods; see Cui and George [2008], Liang et al. [2008], Maruyama and George [2011], and Bayarri et al. [2012]. Each of these papers (as well as the earlier literature cited therein) studies priors of the form

(5)  $p(g) = a\,[\rho_\phi(b+n)]^a\, g^d\, (g+b)^{-(a+c+d+1)}\, \mathbb{1}\{g > \rho_\phi(b+n) - b\}$

with $a > 0$, $b > 0$, $c > -1$, and $d > -1$. Specific configurations of these hyperparameters recommended in the literature include: $a = 1$, $b = 1$, $d = 0$, $\rho_\phi = 1/(1+n)$ [Cui and George, 2008]; $a = 1/2$, $b = 1$ (or $b = n$), $c = 0$, $d = 0$, $\rho_\phi = 1/(1+n)$ [Liang et al., 2008]; and $a = 1$, $b = 1$, $c = -3/4$, $d = (n-5)/2 - p_\phi/2 + 3/4$, $\rho_\phi = 1/(1+n)$ [Maruyama and George, 2011].
Bayarri et al. [2012] motivates the use of such priors from a testing perspective, using a variety of formal desiderata based on Jeffreys [1961] and Berger and Pericchi [2001], including consistency criteria, predictive matching criteria, and invariance criteria. Their recommended prior uses $a = 1/2$, $b = 1$, $c = 0$, $d = 0$, $\rho_\phi = 1/p_\phi$. This prior is termed the robust prior, in the tradition following Strawderman [1971] and Berger [1980], who examine the various senses in which such priors are "robust". This prior will serve as a benchmark in the examples of Section 3.
Regarding prior model probabilities, see Scott and Berger [2010], who recommend a hierarchical prior of the form $\phi_j \overset{iid}{\sim} \mathrm{Ber}(q)$, $q \sim \mathrm{Unif}(0,1)$.
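A useful consequence of this hierarchical form, emphasized by Scott and Berger [2010], is an automatic multiplicity adjustment: integrating out q gives

$p(M_\phi) = \int_0^1 q^{p_\phi} (1-q)^{p - p_\phi}\, dq = \frac{p_\phi!\,(p - p_\phi)!}{(p+1)!} = \frac{1}{(p+1)\binom{p}{p_\phi}},$

so that total prior mass $1/(p+1)$ is assigned to each model size and then divided evenly among the $\binom{p}{p_\phi}$ models of that size.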
1.2. Shrinkage regularization priors. Although the formulation above provides a valuable theoretical framework, it does not necessarily represent an applied statistician's first choice. To assess which variables contribute dominantly to trends in the data, the goal may be simply to mitigate, rather than categorize, spurious correlations. Thus, faced with many potentially irrelevant predictor variables, a common first choice would be a powerful regularization prior.
Regularization, understood here as the intentional biasing of an estimate to stabilize posterior inference, is inherent to most Bayesian estimators via the use of proper prior distributions, and it is one of the often-cited advantages of the Bayesian approach. More specifically, regularization priors are priors explicitly designed with a strong bias for the purpose of separating reliable from spurious patterns in the data. In linear models, this strategy takes the form of zero-centered priors with sharp modes and simultaneously fat tails.
A well-studied class of priors fitting this description will serve to connect continuous priors to the model selection priors described above. Local scale mixtures of normal distributions take the form [West, 1987, Carvalho et al., 2010, Griffin and Brown, 2012]

(6)  $\pi(\beta_j \mid \lambda) = \int \mathrm{N}(\beta_j \mid 0, \lambda^2 \lambda_j^2)\, \pi(\lambda_j^2)\, d\lambda_j^2,$

where different priors are derived from different choices of $\pi(\lambda_j^2)$.
The last several years have seen tremendous interest in this area, motivated by an analogy with penalized-likelihood methods [Tibshirani, 1996]. Penalized-likelihood methods with an additive penalty term lead to estimating equations of the form

(7)  $\sum_i h(Y_i, X_i, \beta) + \alpha Q(\beta),$

where $h$ and $Q$ are positive functions and their sum is to be minimized; $\alpha$ is a scalar tuning variable dictating the strength of the penalty. Typically, $h$ is interpreted as a negative log-likelihood, given data $Y$, and $Q$ is a penalty term introduced to stabilize maximum likelihood estimation. A common choice is $Q(\beta) = \lVert\beta\rVert_1$, which yields sparse optimal solutions $\beta^*$ and admits fast computation [Tibshirani, 1996]; this choice underpins the lasso estimator, an initialism for "least absolute shrinkage and selection operator".
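As a concrete (simulated) illustration of (7), with $h$ the squared-error loss and $Q$ the $\ell_1$ norm, the following sketch uses scikit-learn's Lasso; the data and the penalty value are arbitrary choices of ours:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = (2.0, -1.5, 1.0)      # only three predictors matter
y = X @ beta_true + rng.standard_normal(n)

# h = squared error, Q = l1 norm; alpha plays the role of the penalty in (7)
fit = Lasso(alpha=0.2).fit(X, y)
print(np.flatnonzero(fit.coef_))      # the optimal solution vector is sparse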
Park and Casella [2008] and Hans [2009] "Bayesified" these expressions by interpreting $Q(\beta)$ as the negative log prior density and developing algorithms for sampling from the resulting Bayesian posterior, building upon work of earlier Bayesian authors [Spiegelhalter, 1977, West, 1987, Pericchi and Walley, 1991, Pericchi and Smith, 1992]. Specifically, an exponential prior $\pi(\lambda_j^2) = \mathrm{Exp}(\alpha^2)$ leads to independent Laplace (double-exponential) priors on the $\beta_j$, mirroring expression (7).
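The underlying identity is the classical exponential scale mixture of normals; written with the exponential in its rate-$\alpha^2/2$ parameterization (conventions for the rate differ across papers),

$\int_0^\infty \mathrm{N}(\beta_j \mid 0, s)\, \frac{\alpha^2}{2} e^{-\alpha^2 s/2}\, ds = \frac{\alpha}{2} e^{-\alpha |\beta_j|},$

which is the Laplace density with scale $1/\alpha$.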
This approach has two implications unique to the Bayesian paradigm. First, it presents an opportunity to treat the global scale parameter $\lambda$ (equivalently, the regularization penalty parameter $\alpha$) as a hyperparameter to be estimated. Averaging over $\lambda$ in the Bayesian paradigm has been empirically observed to give better prediction performance than cross-validated selection of $\alpha$ (e.g., Hans [2009]). Second, a Bayesian approach necessitates forming point estimators from posterior distributions; typically the posterior mean is adopted on the basis that it minimizes mean squared prediction error. Note that posterior mean regression coefficient vectors from these models are non-sparse with probability one. Ironically, the two main appeals of penalized-likelihood methods (efficient computation and sparse solution vectors $\beta^*$) were lost in the migration to a Bayesian approach. See, however, Hans [2010] for an application of double-exponential priors in the context of model selection.
Nonetheless, wide interest in "Bayesian lasso" models paved the way for more general local shrinkage regularization priors of the form (6). In particular, Carvalho et al. [2010] develops a prior over location parameters that attempts to shrink irrelevant signals strongly toward zero while avoiding excessive shrinkage of relevant signals. To contextualize this aim, recall that solutions to $\ell_1$-penalized likelihood problems are often interpreted as (convex) approximations to more challenging formulations based on $\ell_0$ penalties: $\lVert\gamma\rVert_0 = \sum_j \mathbb{1}(\gamma_j \neq 0)$. As such, it was observed that the global $\ell_1$ penalty "overshrinks" what ought to be large-magnitude coefficients. The prior of Carvalho et al. [2010], for example, may be written as

(8)  $\pi(\beta_j \mid \lambda) = \mathrm{N}(0, \lambda^2 \lambda_j^2), \qquad \lambda_j \overset{iid}{\sim} \mathrm{C}^{+}(0,1),$

with $\lambda \sim \mathrm{C}^{+}(0,1)$ or $\lambda \sim \mathrm{C}^{+}(0,\sigma^2)$. The choice of half-Cauchy arises from the insight that for scalar observations $y_j \sim \mathrm{N}(\theta_j, 1)$ with prior $\theta_j \sim \mathrm{N}(0, \lambda_j^2)$, the posterior mean of $\theta_j$ may be expressed as

(9)  $E(\theta_j \mid y_j) = \{1 - E(\kappa_j \mid y_j)\}\, y_j,$

where $\kappa_j = 1/(1 + \lambda_j^2)$. The authors observe that a U-shaped Beta(1/2, 1/2) distribution (shaped like a horseshoe) on $\kappa_j$ implies a prior over $\theta_j$ with high mass around the origin but with polynomial tails. That is, the "horseshoe" prior encodes the assumption that some coefficients will be very large and many others will be very nearly zero. This U-shaped prior on $\kappa_j$ implies the half-Cauchy prior density $\pi(\lambda_j)$. The implied marginal prior on $\beta_j$ has Cauchy-like tails and a pole at the origin, which entails more aggressive shrinkage than a Laplace prior.
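Equation (9) is easy to probe numerically. The sketch below is our own illustration, not code from the paper: it approximates $E(\kappa_j \mid y_j)$ by importance sampling, drawing $\lambda_j$ from its half-Cauchy prior and weighting each draw by the marginal likelihood $\mathrm{N}(y_j \mid 0, 1 + \lambda_j^2)$; the global scale is fixed at $\lambda = 1$ for simplicity.

import numpy as np

def horseshoe_posterior_mean(y, n_draws=200_000, seed=0):
    """E(theta | y) = (1 - E(kappa | y)) * y for y ~ N(theta, 1),
    theta ~ N(0, lambda_j^2), lambda_j ~ C+(0, 1)."""
    rng = np.random.default_rng(seed)
    lam = np.abs(rng.standard_cauchy(n_draws))      # half-Cauchy prior draws
    kappa = 1.0 / (1.0 + lam ** 2)                  # shrinkage weight in (9)
    var = 1.0 + lam ** 2                            # marginal variance of y given lam
    w = np.exp(-0.5 * y ** 2 / var) / np.sqrt(var)  # importance weights
    e_kappa = np.sum(w * kappa) / np.sum(w)
    return (1.0 - e_kappa) * y

for y in (0.5, 2.0, 5.0):
    print(y, round(horseshoe_posterior_mean(y), 3))
# small |y| is shrunk nearly to zero; large |y| is left almost untouched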
Other choices of $\pi(\lambda_j)$ lead to different "shrinkage profiles" on the "$\kappa$ scale". Polson and Scott [2012] provides an excellent taxonomy of the various priors over $\beta$ that can be obtained as scale mixtures of normals. The horseshoe and similar priors (e.g., Griffin and Brown [2012]) have proven empirically to be fine default choices for regression coefficients: they lack hyperparameters, forcefully separate strong from weak predictors, and exhibit robust predictive performance.
1.3. Model selection priors as shrinkage priors. It is possible to express model selection priors as shrinkage priors. To motivate this re-framing, observe that the posterior mean regression coefficient vector is not well-defined in the model selection framework. Using the model-averaging notion, the posterior average $\beta$ may be defined as

(10)  $E(\beta \mid Y) \equiv \sum_\phi E(\beta \mid M_\phi, Y)\, p(M_\phi \mid Y),$
where $E(\beta_j \mid M_\phi, Y) \equiv 0$ whenever $\phi_j = 0$. Without this definition, the posterior expectation of $\beta_j$ is undefined in models where the $j$th predictor does not appear. More specifically, as the likelihood is constant in variable $j$ in such models, the posterior remains whatever the prior was chosen to be.

To fully resolve this indeterminacy, it is common to set $\beta_j$ identically equal to zero in models where the $j$th predictor does not appear, consistent with the interpretation that $\beta_j \equiv \partial E(Y)/\partial X_j$.
A hierarchical prior reflecting this choice may be expressed as

(11)  $\pi(\beta \mid g, \Lambda, \Omega) = \mathrm{N}(0, g\Lambda\Omega\Lambda^t).$

In this expression, $\Lambda \equiv \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_p)$ and $\Omega$ is a positive semi-definite matrix, both of which may depend on $\phi$ and/or $\sigma^2$. When $\Omega$ is the identity matrix, one recovers (6).
In order to set $\beta_j = 0$ when $\phi_j = 0$, let $\lambda_j \equiv \phi_j s_j$ for $s_j > 0$, so that when $\phi_j = 0$, the prior variance of $\beta_j$ is set to zero (with prior mean of zero). George and McCulloch [1997] develops this approach in detail, including the g-prior specification, $\Omega(\phi) = \sigma^2 (X_\phi^t X_\phi)^{-1}$. Priors over the $s_j$ induce a prior on $\Lambda$. Under these definitions of the $\lambda_j$ and $\Omega$, the component-wise marginal distribution for $\beta_j$, $j = 1, \dots, p$, may be written as a two-component mixture: a point mass at zero when $\phi_j = 0$, and a normal distribution with variance $g s_j^2 \Omega_{jj}$ when $\phi_j = 1$.
Table 1. Selected models by different methods in the U.S. crime example. The MPM column displays marginal inclusion probabilities, with the numbers in bold associated with the variables included in the median probability model. The HS(th) column refers to the hard thresholding of Section 1.5 under the horseshoe prior. The t-stat column is the model defined by OLS p-values smaller than 0.05. The $R^2_{mle}$ row reports the traditional in-sample percentage of variation explained by the least-squares fit based only on the variables in a given column.
Example: diabetes dataset (p = 10, n = 442). The diabetes data were used to demonstrate the lars algorithm in Efron et al. [2004]. The data consist of p = 10 baseline measurements on n = 442 diabetic patients; the response variable is a numerical measurement of disease progression. As in Efron et al. [2004], we work with centered and scaled predictor and response variables. In this example we use only the robust prior of Bayarri et al. [2012]. The goal is to focus on the sequence in which the variables are included and to illustrate how DSS provides an attractive alternative to the median probability model.
Fig 1. U.S. Crime Data: DSS plots under the horseshoe prior.

Table 2 shows the variables included in each model along the DSS path, up to the 5-variable model. The DSS plots in this example (omitted here) suggest that this should be the largest model under consideration. The table also reports the median probability model.

Notice that marginal inclusion probabilities do not necessarily offer a good way to rank variable importance, particularly in cases where the predictors are highly collinear. This is evident in the current example in the "dilution" of inclusion probabilities among the variables with the strongest dependencies in this dataset: TC, LDL, HDL, TCH, and LTG. It is possible to see the same effect
in the ranking of high-probability models, as most models at the top of the list represent distinct combinations of correlated predictors. In the sequence of models from DSS, the variables LTG and HDL are chosen as the representatives for this group.

Meanwhile, a variable such as Sex appears with a marginal inclusion probability of 0.98, and yet its removal from the five-variable DSS model leads to only a minor decrease in the model's predictive ability. Thus the diabetes data offer a clear example where statistical significance can overwhelm practical relevance if one looks only at standard Bayesian outputs. The summary provided by DSS makes a distinction between the two notions of relevance, providing a clear sense of the predictive cost associated with dropping a predictor.
Fig 2. U.S. Crime Data: DSS plots under the "robust" prior of Bayarri et al. [2012] (top row) and under a g-prior with g = n (bottom row). All $2^{15}$ models were evaluated in this example.

Fig 3. U.S. Crime Data under the horseshoe prior: $\bar\beta$ refers to the posterior mean, while $\beta_{DSS}$ is the value of $\beta_\lambda$ under different values of $\lambda$ such that different numbers of variables are selected. Panels show DSS model sizes 15, 9, 7, and 2.

Example: protein activation dataset (p = 88, n = 96). The protein activity dataset is from Clyde et al. [2011]. This example differs from the previous one in that, with p = 88 predictors, the model space can no longer be exhaustively enumerated. In addition, correlation between the
potential predictors is as high as 0.99, with 17 pairs of variables having correlations above 0.95. For this example, the horseshoe prior and the robust prior are considered. To search the model space, we use a conventional Gibbs sampling strategy as in Garcia-Donato and Martinez-Beneito [2013] (Appendix A), based on George and McCulloch [1997].
Figure 4 shows the DSS plots under the two priors considered. Once again, the horseshoe prior leads to smaller estimates of $\rho^2$. And once again, despite this difference, the DSS heuristic returns the same six predictors under both priors. On this data set, the MPM under the Gibbs search (as well as the HPM and MPM given by BAS) coincides with the DSS summary model.
Table 2. Selected models by DSS and the model selection prior in the diabetes example. The MPM column displays marginal inclusion probabilities, and the numbers in bold are associated with the variables included in the median probability model. The t-stat column is the model defined by OLS p-values smaller than 0.05. The $R^2_{MLE}$ row reports the traditional in-sample percentage of variation explained by the least-squares fit based only on the variables in a given column.

Example: protein activation dataset (p = 88, n = 80). To explore the behavior of DSS in the p > n regime, we modify the previous example by randomly selecting a subset of n = 80 observations
from the original dataset. These 80 observations are used to form our posterior distribution. To define the DSS summary, we take X to be the entire set of 96 predictor values. For simplicity we use only the robust model selection prior. Figure 5 shows the results; with fewer observations, smaller models do not give up as much on the $\rho^2$ and $\psi$ scales as in the original example. A conservative reading of the DSS plots leads to the same 6-variable model; however, in this limited-information situation, the models with 5 or 4 variables are competitive. One important aspect of Figure 5 is that, even working in the p > n regime, DSS is able to evaluate the performance of, and provide a summary for, models of any dimension up to the full model. This is accomplished even though, by using the robust prior, the posterior was limited to models of dimension up to n − 1. For this to be achieved, all DSS needs is for the number of points in X to be larger than p. In situations where not enough points are available in the dataset, the user need only add (arbitrary, and without loss of generality) representative points at which to make predictions about potential future values of Y.
Fig 4. Protein Activation Data: DSS plots under model selection priors (top row) and under shrinkage priors (bottom row).

Fig 5. Protein Activation Data (p > n case): DSS plots under model selection priors.

4. Discussion. A detailed examination of the previous literature reveals that sparsity can play many roles in a statistical analysis: model selection, strong regularization, and improved computation, for example. A central, but often implicit, virtue of sparsity is that human beings find fewer variables easier to think about.

When one desires sparse model summaries for improved comprehensibility, prior distributions are an unnatural vehicle for furnishing this bias. Instead, we describe how to use a decision-theoretic approach to induce sparse posterior model summaries. Our new loss function resembles the popular penalized-likelihood objective function of the lasso estimator, but its interpretation is very different. Instead of a regularizing tool for estimation, our loss function is a posterior summarizer with an explicit parsimony penalty. To our knowledge this is the first such loss function to be proposed in this capacity. Conceptually, its nearest forerunner would be high posterior density regions, which summarize a posterior density while satisfying a compactness constraint.
R. B. O'Hara and M. J. Sillanpää. A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis, 4(1):85–117, 2009.
T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103:681–686, 2008.
L. Pericchi and A. Smith. Exact and approximate posterior moments for a normal location parameter. Journal of the Royal Statistical Society, Series B (Methodological), pages 793–804, 1992.
L. R. Pericchi and P. Walley. Robust Bayesian credible intervals and prior ignorance. International Statistical Review, pages 1–23, 1991.
N. G. Polson and J. G. Scott. Local shrinkage rules, Lévy processes and regularized regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 74(2):287–311, 2012.
N. G. Polson, J. G. Scott, and J. Windle. Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108:1339–1349, 2013.
A. Raftery, D. Madigan, and J. Hoeting. Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 92:1197–1208, 1997.
J. Scott and J. Berger. An exploration of aspects of Bayesian multiple testing. Journal of Statistical Planning and Inference, 136:2144–2162, 2006.
J. G. Scott and J. O. Berger. Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics, 38(5):2587–2619, 2010.
D. Spiegelhalter. A test for normality against symmetric alternatives. Biometrika, 64(2):415–418, 1977.
W. E. Strawderman. Proper Bayes minimax estimators of the multivariate normal mean. The Annals of Mathematical Statistics, pages 385–388, 1971.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
W. Vandaele. Participation in illegitimate activities: Ehrlich revisited. In A. Blumstein, J. Cohen, and D. Nagin, editors, Deterrence and Incapacitation, pages 270–335. National Academy of Sciences Press, 1978.
M. West. On scale mixtures of normal distributions. Biometrika, 74(3):646–648, 1987.
A. Zellner. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, pages 233–243. Amsterdam: North-Holland, 1986.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006.
H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36:1509–1533, 2008.
APPENDIX A: EXTENSIONS
A.1. Selection summary in logistic regression. Selection summary can be applied outside the realm of normal linear models as well. This section explicitly shows how to extend the approach to logistic regression and provides an illustration on real data.

Although one has many choices for judging predictive accuracy, it is convenient to note that squared prediction loss is precisely the negative log-likelihood in the normal linear model setting, which suggests the following generalization of (16):
(25)  $L(\widetilde{Y}, \gamma) = \lambda \lVert\gamma\rVert_0 - n^{-1} \log f(\widetilde{Y}, X, \gamma),$

where $f(\widetilde{Y}, X, \gamma)$ denotes the likelihood of $\widetilde{Y}$ with "parameters" $\gamma$.
In the case of a binary outcome vector using a logistic link function, the generalized DSS loss becomes

(26)  $L(\widetilde{Y}, \gamma) = \lambda \lVert\gamma\rVert_0 - n^{-1} \sum_{i=1}^n \left( \widetilde{Y}_i X_i \gamma - \log\left(1 + \exp(X_i \gamma)\right) \right).$
Taking expectations yields

(27)  $L(\bar{\pi}, \gamma) = \lambda \lVert\gamma\rVert_0 - n^{-1} \sum_{i=1}^n \left( \bar{\pi}_i X_i \gamma - \log\left(1 + \exp(X_i \gamma)\right) \right),$

where $\bar{\pi}_i$ is the posterior mean probability that $\widetilde{Y}_i = 1$. To help interpret this formula, note that it can be rewritten as a weighted logistic regression as follows. For each observed $X_i$, associate a pair of pseudo-responses $Z_i = 1$ and $Z_{i+n} = 0$ with weights $w_i = \bar{\pi}_i$ and $w_{i+n} = 1 - \bar{\pi}_i$, respectively. Then $\bar{\pi}_i X_i \gamma - \log(1 + \exp(X_i \gamma))$ may be written as

(28)  $\left[ w_i Z_i X_i \gamma - w_i \log\left(1 + \exp(X_i \gamma)\right) \right] + \left[ w_{i+n} Z_{i+n} X_i \gamma - w_{i+n} \log\left(1 + \exp(X_i \gamma)\right) \right].$
Thus, optimizing the DSS logistic regression loss is equivalent to finding the penalized maximum likelihood estimate of a weighted logistic regression in which each point in predictor space has a response $Z_i = 1$, given weight $\bar{\pi}_i$, and a counterpart response $Z_i = 0$, given weight $1 - \bar{\pi}_i$. The observed data determine $\bar{\pi}_i$ via the posterior distribution. As before, if we replace (27) by the surrogate $\ell_1$ norm,

(29)  $L(\bar{\pi}, \gamma) = \lambda \lVert\gamma\rVert_1 - n^{-1} \sum_{i=1}^n \left( \bar{\pi}_i X_i \gamma - \log\left(1 + \exp(X_i \gamma)\right) \right),$

then an optimal solution can be computed via the R package glmnet [Friedman et al., 2010].
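The following sketch implements this weighted-logistic formulation with scikit-learn in place of glmnet (a substitution of convenience on our part; the mapping of the penalty weight $\lambda$ to the package's C parameter is approximate, and all function names are ours):

import numpy as np
from sklearn.linear_model import LogisticRegression

def dss_logistic_path(X, pi_bar, lambdas):
    """Approximate minimizers of (29) along a grid of penalty weights.
    X: (n, p) predictors; pi_bar: posterior mean of P(Y_i = 1)."""
    n, _ = X.shape
    X2 = np.vstack([X, X])                          # each X_i appears twice...
    Z = np.concatenate([np.ones(n), np.zeros(n)])   # ...as in (28): Z_i = 1, Z_{i+n} = 0
    w = np.concatenate([pi_bar, 1.0 - pi_bar])      # weights pi_bar_i and 1 - pi_bar_i
    path = []
    for lam in lambdas:
        fit = LogisticRegression(penalty="l1", C=1.0 / (n * lam),
                                 solver="liblinear")
        fit.fit(X2, Z, sample_weight=w)
        path.append(fit.coef_.ravel())              # gamma_lambda, sparser as lam grows
    return np.array(path)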
The DSS summary selection plot may be adapted to logistic regression by defining the excess error as

(30)  $\psi_\lambda = \sqrt{ n^{-1} \sum_i \left( \pi_i - 2\pi_{\lambda,i}\pi_i + \pi_{\lambda,i}^2 \right) } - \sqrt{ n^{-1} \sum_i \pi_i (1 - \pi_i) },$
where $\pi_i$ is the probability that $y_i = 1$ given the true model parameters, and $\pi_{\lambda,i}$ is the corresponding quantity under the $\lambda$-sparsified model. This expression for the logistic excess error relates to the linear model case in that each expression can be derived from

(31)  $\psi_\lambda = \sqrt{ n^{-1} E\left( \lVert \widetilde{Y} - \widetilde{Y}_\lambda \rVert^2 \right) } - \sqrt{ n^{-1} E\left( \lVert \widetilde{Y} - E(\widetilde{Y}) \rVert^2 \right) },$

where the expectation is with respect to the predictive distribution of $\widetilde{Y}$ conditional on the model parameters, and $\widetilde{Y}_\lambda$ denotes the optimal $\lambda$-sparse prediction. In particular, $\widetilde{Y}_\lambda \equiv X\beta_\lambda$ for the linear model and $\widetilde{y}_{\lambda,i} \equiv \pi_{\lambda,i} = (1 + \exp(-X_i\beta_\lambda))^{-1}$ for the logistic regression model. One notable difference between the expressions for excess error under the linear model and the logistic model is that the linear model has constant variance, whereas in the logistic model the variance term depends on the predictor point as a result of the Bernoulli likelihood.
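For concreteness, (30) transcribes directly into a small helper function (ours, added for illustration):

import numpy as np

def logistic_excess_error(pi, pi_lam):
    """Excess error (30): pi are the true success probabilities,
    pi_lam those implied by the lambda-sparsified model."""
    full = np.sqrt(np.mean(pi - 2.0 * pi_lam * pi + pi_lam ** 2))
    base = np.sqrt(np.mean(pi * (1.0 - pi)))
    return full - base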
Example: German credit data (n = 1000, p = 48). To illustrate selection summary in the logistic regression context, we use the German credit data from the UCI repository, where n = 1000 and p = 48. Each record contains covariates associated with a loan applicant, such as credit history, checking account status, car ownership, and employment status. The outcome variable is a judgment of whether or not the applicant has "good credit". A natural objective when analyzing these data would be to develop a good model for assessing the creditworthiness of future applicants. A default shrinkage prior over the regression coefficients is used, based on the ideas described in Polson et al. [2013] and the associated R package BayesLogit. The DSS selection summary plots (adapted to logistic regression) are displayed in Figure 6. The plots suggest a high degree of "pre-variable selection", in that all of the predictor variables appear to add an incremental amount of prediction accuracy, with no single predictor appearing to dominate. Nonetheless, several of the larger models (smaller than the full forty-eight-variable model) do not give up much in excess error, suggesting that a moderately reduced model (of roughly 35 variables) may suffice in practice. Depending on the true costs associated with measuring those ten least valuable covariates, relative to the cost associated with an increase of 0.01 in excess error, this reduced model may be preferable.
Fig 6. DSS plots for the German credit data. For these data, each included variable seems to add an incremental amount, as the excess error plot builds steadily until reaching the null model with no predictors.
A.2. Selection summary for Gaussian graphical models. Covariance estimation is yet another area where a sparsifying loss function can be used to induce a parsimonious posterior summary.

Consider a $(p \times 1)$ vector $X = (x_1, x_2, \dots, x_p) \sim \mathrm{N}(0, \Sigma)$. Zeros in the precision matrix $\Omega = \Sigma^{-1}$ imply conditional independence among certain dimensions of $X$. As sparse precision matrices can be represented through a labelled graph, this modeling approach is often referred to as Gaussian graphical modeling. Specifically, for a graph $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges, let each edge represent a non-zero element of $\Omega$. See Jones et al. [2005] for a thorough overview. This problem is equivalent to finding a sparse representation in $p$ separate linear models for $X_j \mid X_{-j}$, making the selection summary approach developed above directly applicable.
As with linear models, one has the option of modeling the entries in the precision matrix via shrinkage priors or via selection priors with point masses at zero. Regardless of the specific choice of prior, summarizing the patterns of conditional independence favored in the posterior distribution remains a major challenge.
A DSS parsimonious summary can be achieved via a multivariate extension of (16) by once again leveraging the notion of "predictive accuracy" as defined by the negative log-likelihood:

(32)  $L(\widetilde{X}, \Gamma) = \lambda \lVert\Gamma\rVert_0 - \log\det(\Gamma) + \mathrm{tr}(n^{-1}\widetilde{X}\widetilde{X}^t\Gamma),$

where $\Gamma$ represents the decision variable for $\Omega$ and $\lVert\Gamma\rVert_0$ counts the non-zero off-diagonal elements of $\Gamma$. Taking expectations with respect to the posterior predictive distribution of $\widetilde{X}$ yields

(33)  $L(\Gamma) = E\left( L(\widetilde{X}, \Gamma) \right) = \lambda \lVert\Gamma\rVert_0 - \log\det(\Gamma) + \mathrm{tr}(\bar{\Sigma}\Gamma),$

where $\bar{\Sigma}$ represents the posterior mean of $\Sigma$.
As before, an approximate solution to the DSS graphical model posterior summary optimization problem can be obtained by employing the surrogate $\ell_1$ penalty,

(34)  $L(\Gamma) = \lambda \lVert\Gamma\rVert_1 - \log\det(\Gamma) + \mathrm{tr}(\bar{\Sigma}\Gamma),$

as developed in penalized likelihood methods such as the graphical lasso [Friedman et al., 2008].
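In practice this means the surrogate problem (34) can be handed to any off-the-shelf graphical lasso solver, with the posterior mean $\bar{\Sigma}$ standing in for the sample covariance. A minimal sketch using scikit-learn (our choice of solver, not the paper's; note that, like (32), it penalizes only off-diagonal entries):

import numpy as np
from sklearn.covariance import graphical_lasso

def dss_graphical_summary(Sigma_bar, lam):
    """Minimize (34), lam*||Gamma||_1 - log det(Gamma) + tr(Sigma_bar Gamma),
    by running the graphical lasso on the posterior mean covariance."""
    _, Gamma = graphical_lasso(Sigma_bar, alpha=lam)
    # edges of the implied conditional-independence graph
    edges = np.argwhere(np.triu(np.abs(Gamma) > 1e-8, k=1))
    return Gamma, edges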