Bayesian inference on high-dimensional Seemingly Unrelated Regressions
Alex Lewin and Marco Banterle
Brunel University London
April 2017
Motivation
Genetic studies aiming at identifying associations between point mutations (SNPs) and multivariate phenotypes:
gene expression measurements
metabolomics data
protein concentrations
...
Looking for sparse variable selection
Take into account data correlations
Multivariate Data
Predictor matrix X: n observations × p variables.
Response matrix Y: n observations × q variables.
Aim: identify which of the p variables in X are significantly associated with the outcomes in Y.
Case study: mQTL discovery in the Northern Finland Birth Cohort study (NFBC)
The NFBC66 is a cohort of 12,000 adults followed since 1966.
Collection of data at age 31 years, including clinical data and blood samples.
DNA extracted, leading to a sample of 5746 adults genotyped across the genome (∼300,000 SNPs).
Metabolite lipid profile quantified from serum samples by NMR, giving measures of 137 metabolites.
Question of interest is the discovery of genetic markers associated with metabolite regulation of lipids.
These responses are highly structured, with strong correlations
Correlations in the mQTL data set
[Figure: correlation heat maps for Y (metabolites) and X (SNPs); only a subset plotted.]
Model
Multivariate Regression Model with Variable Selection
Frame the problem as a multivariate linear regression model:
Y_{n×q} = X_{n×p} B_{p×q} + E_{n×q}

or equivalently: Y ∼ MN(XB, I_n, R)
Sparse associations: set most elements of B to zero
Correlated outcomes: allow a non-diagonal residual covariance matrix R (a simulation sketch follows)
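As a concrete illustration of this data-generating model, the following minimal R sketch simulates Y = XB + E with correlated residuals across outcomes. The dimensions, the sparsity pattern of B and the form of R are illustrative assumptions, not the settings used in the talk.

```r
## Sketch: simulate from Y = X B + E with row-wise residual covariance R.
set.seed(1)
n <- 100; p <- 30; q <- 5
X <- matrix(rnorm(n * p), n, p)
B <- matrix(0, p, q)
B[sample(p * q, 10)] <- runif(10, 1, 2)        # sparse coefficient matrix
R <- 0.6 ^ abs(outer(1:q, 1:q, "-"))           # correlated outcomes
E <- matrix(rnorm(n * q), n, q) %*% chol(R)    # rows of E are N(0, R)
Y <- X %*% B + E
```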
Variable selection is performed through a binary matrix Γ (p × q):

γ_jk = 1 if outcome k is associated with predictor j, and 0 otherwise.
Sparsity prior γjk ∼ Bern(ωjk), ωjk ∼ Beta()
[Figure: heat map of the Gamma matrix, predictor variables × outcomes.]
Predictor X_j appears in the regression for outcome k only if γ_jk = 1.
Bayesian Setting
Very high dimensional data: p ≈ 10^4 to 10^6 variables in X.
q ranging from 1 to 10^4 variables in Y.
Around n = 5000 observations.

Focus on Sparse Bayesian Variable Selection (sparse BVS): minimise arbitrary tuning of the model.
Provides the posterior probability of association for each predictor and each response.

Bayesian model averaging: explore the space of 2^p models.
Marginal posterior inclusion probabilities are model averages.
SUR model
Seemingly Unrelated Regressions (SUR) model:
y_k = X_{γk} β_{γk} + ε_k,   for k = 1, …, q

with y_k (n×1), X_{γk} (n×d_k) and β_{γk} (d_k×1).

Cov[ε_k, ε_l] = R_{kl} ≠ 0 ⟹ outcomes do not naturally separate.
Different variables selected for each outcome: γk, k = 1, · · · , q.
So: vectorise model:
vec(y_1, y_2, …, y_q) ∼ N( vec(X_{γ1}β_{γ1}, X_{γ2}β_{γ2}, …, X_{γq}β_{γq}), R ⊗ I_n )
Covariance matrix is not block diagonal.
Priors:
vec(β_1, β_2, …, β_q) | γ_1, γ_2, …, γ_q ∼ N(0, [(I_q ⊗ W)]_{vec(γ1,γ2,…,γq)})

R ∼ IW(ν, M)

γ_jk ∼ Bern(ω_k), ω_k ∼ Beta()

W is a p × p matrix (constant, or g-prior g(X^T X)^{-1}); different rows/columns are selected for different outcomes.
Not conjugate, cannot integrate out β and R.
We can calculate the posterior full conditionals for β_k and R → a Gibbs sampler for γ_k, β_k and R.
However, this is computationally prohibitive due to the calculation of (X^t (R^{-1} ⊗ I_n) X)^{-1}, where

X = blockdiag(X_{γ1}, X_{γ2}, …, X_{γq})

is an nq × (d_1 + d_2 + ⋯ + d_q) matrix, which changes every MCMC iteration.
New work: computation for the SUR model
Idea from Zellner and Ando (2010): decompose the likelihood:

y_1 = X_{γ1}β_{γ1} + ε_1
y_2 = X_{γ2}β_{γ2} + ρ_{21}(y_1 − X_{γ1}β_{γ1}) + ε_2
⋮
y_k = X_{γk}β_{γk} + Σ_{l<k} ρ_{kl}(y_l − X_{γl}β_{γl}) + ε_k

with the ε_k independent across responses.
New work: priors in the transformed space
We aim to decompose into a product over responses, as with the likelihood.
Betas are straightforward: ∏_k N(β_k | γ_k, W)

Covariance matrix: R ∼ IW(ν, M) becomes

∏_k N({ρ_k1, …, ρ_{k,k−1}} | σ_k², M) × IG(σ_k² | {ρ_k1, …, ρ_{k,k−1}}, ν, M)
New: posterior conditionals in transformed space
Covariance matrix: in the transformed space,

∏_k N({ρ_k1, …} | σ_k², M, Y, B, Γ) × IG(σ_k² | {ρ_k1, …}, ν, M, Y, B, Γ)
So MCMC updates for R parameters factorise over responses.
Betas: MCMC for B is not so straightforward: Zellner and Ando used a simplified factorisation + Gibbs resampling.
We found the correct full conditionals for B (it does decompose):

β_{γk} | … ∼ N( W_k X_{γk}^t ỹ_k , W_k ),   k = 1, …, q

where

W_k = ( X_{γk}^t X_{γk} (1/σ_k² + Σ_{l>k} ρ_{lk}²/σ_l²) + W_{γk}^{-1} )^{-1}

and ỹ_k is a function of Y, B, Γ across responses (a sketch of this draw in R follows).
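As a sketch of how this conditional could be used inside the sampler, the R function below draws one β_{γk}. The arguments (Xk for X_{γk}, ytilde_k for ỹ_k, the response variances sig2, the lower-triangular ρ matrix and the prior precision Wg_inv) are placeholders for quantities that would be available at that point in the MCMC, not the authors' code.

```r
## Sketch: one draw from the factorised full conditional of beta_{gamma_k}.
## Xk = X[, selected columns]; sig2 = (sigma_1^2, ..., sigma_q^2);
## rho[l, k] holds rho_{lk} for l > k; Wg_inv = W_{gamma_k}^{-1}.
draw_beta_k <- function(Xk, ytilde_k, k, sig2, rho, Wg_inv) {
  q <- length(sig2)
  prec_scale <- 1 / sig2[k]
  if (k < q)                                          # add rho_{lk}^2 / sigma_l^2 terms
    prec_scale <- prec_scale + sum(rho[(k + 1):q, k]^2 / sig2[(k + 1):q])
  Wk <- solve(crossprod(Xk) * prec_scale + Wg_inv)    # posterior covariance W_k
  mu <- Wk %*% crossprod(Xk, ytilde_k)                # posterior mean W_k X' ytilde
  drop(mu + t(chol(Wk)) %*% rnorm(ncol(Xk)))          # mu + L z, with L L' = W_k
}
```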
Main point: we have got rid of the big (X^t (R^{-1} ⊗ I_n) X)^{-1} matrix calculation.
Using the ESS (evolutionary stochastic search) algorithm (Bottolo et al.) to explore the space of Γ variable selection parameters.
[Figures: computation time versus q, versus p and versus n.]

These timings are in R; in C++ the relative speed-up will be greater.
Simulated data
n = 100, q = 30, p = 30, with correlated residuals.

[Figure: true outcome correlation matrix.]
[Figures: true Gamma matrix; HESS Gamma and SUR Gamma full heat maps (predictors × outcomes); histograms of posterior inclusion probabilities (PIP) for HESS Gamma and SUR Gamma.]
Case Study: mQTL discovery
mQTL analysis of NFBC data
After quality control: n = 4023 people, q = 103 metabolites, p = 9172 SNPs on chromosome 16.
[Figures: data correlation and residual correlation heat maps.]
Evidence of enhanced linkage for Chromosome 16
Summary, Acknowledgements
Summary
Bayesian SUR model with sparsity prior to perform variable selection for multiple responses.
Estimating the residual covariance matrix increases the accuracy of the variable selection.
We have extended the Zellner and Ando method to obtain the correct posteriors directly.
Computational speed-up → the model can be used on large genomic data sets.
Thank you!

Sylvia Richardson
Leonardo Bottolo
Marjo-Riitta Jarvelin
Habib Saadi
Marc Chadeau-Hyam
Lewin A et al. (2015). MT-HESS: an efficient Bayesian approach for simultaneous association detection in OMICS datasets, with application to eQTL mapping in multiple tissues. Bioinformatics, doi 10.1093.
Bottolo L, Chadeau-Hyam M, et al. (2013). GUESS-ing polygenic associations with multiple phenotypes using a GPU-based Evolutionary Stochastic Search algorithm. PLoS Genetics, doi 10.1371.
Bhadra A and Mallick BK (2013). Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics, doi 10.1111.
Bottolo L, Petretto E, et al. (2011). Bayesian detection of expression quantitative trait loci hot spots. Genetics 189:1449-1459.
Model-based clustering for multivariate categorical data
Michael Fop, Keith Smart and Brendan Murphy
6th IBS Channel Network Conference
Back Pain Dataset
▶ A study to investigate the use of a mechanisms-based classification of musculoskeletal pain in clinical practice.
▶ The aim of the study was to assess the discriminative power of the taxonomy of pain into Nociceptive, Peripheral Neuropathic and Central Sensitization for low-back disorders.
▶ There are N = 464 patients who were assessed according to a list of 36 binary clinical indicators ("Present"/"Absent").
▶ Some of the indicators carry the same information about the pain categories, thus the interest here is to select a subset of the most relevant clinical criteria while performing a partition of the patients.
▶ Does the partition of the patients agree with the clinical taxonomy?
Clustering and Variable Selection
▶ The motivating example shows the need for:
▶ Clustering: Can we establish the existence of subgroups? How can we characterize these subgroups?
▶ Variable Selection: Can we use a subset of the variables to distinguish the subgroups?
Model-Based Clustering/Mixture Models
▶ Denote the N ×M data matrix by X
▶ The nth observation is denoted by Xn.
▶ Model-based clustering assumes that Xn arises from a finite mixture model
▶ Assuming G classes (components)
p(Xn | τ, θ, G) = Σ_{g=1}^{G} τ_g p(Xn | θ_g).
▶ τg are mixture weights
▶ p(Xn|θg ) is the component distribution.
Latent Class Analysis (LCA) model
▶ Latent Class Analysis (LCA) is a model for clustering categorical data.
▶ Let Xn = (Xn1,Xn2, . . . ,XnM) where Xnm takes a value from{1, 2, . . . ,Cm}.
▶ In LCA we assume that there is local independence between variables, so that if we knew Xn was in class g we could write its density as
p(Xn | θ_g) = ∏_{m=1}^{M} ∏_{c=1}^{Cm} θ_{gmc}^{I(Xnm = c)},

where {θ_{gm1}, …, θ_{gmCm}} give the probabilities of observing the categories {1, …, Cm} in variable m
▶ θ_g will characterize and embody the differences between groups (a short sketch of this density follows)
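For concreteness, here is a minimal R sketch of this density under local independence. The storage layout (theta as a list over variables of G × C_m matrices) is an assumption for illustration.

```r
## Sketch: LCA component density p(x | theta_g) under local independence.
## x: length-M vector of categories coded 1..Cm; theta: list of length M,
## with theta[[m]] a G x Cm matrix of category probabilities for variable m.
lca_component_density <- function(x, theta, g) {
  prod(vapply(seq_along(x), function(m) theta[[m]][g, x[m]], numeric(1)))
}

## Mixture density p(x | tau, theta) = sum over classes of tau_g * p(x | theta_g)
lca_density <- function(x, tau, theta) {
  sum(vapply(seq_along(tau),
             function(g) tau[g] * lca_component_density(x, theta, g),
             numeric(1)))
}
```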
LCA model (general)
▶ Model likelihood of the form

p(Xn | θ, τ, G) = Σ_{g=1}^{G} τ_g ∏_{m=1}^{M} ∏_{c=1}^{Cm} θ_{gmc}^{I(Xnm = c)}.
▶ More convenient to work with completed data
▶ Augment data with class labels Zn = (Zn1, Zn2, …, ZnG), where Zng = 1 if observation n belongs to group g and 0 otherwise.
▶ Then we can write down the completed data likelihood for an observation:

p(Xn, Zn | θ, τ, G) = ∏_{g=1}^{G} { τ_g ∏_{m=1}^{M} ∏_{c=1}^{Cm} θ_{gmc}^{I(Xnm = c)} }^{Zng}.
LCA model (general)
▶ Estimation by EM algorithm or VB (see the BayesLCA package)
▶ Note that G must be chosen in advance; possible to discriminate the best G for the data using information criteria (e.g. BIC)
▶ Bayesian approaches:
▶ Pandolfi, Bartolucci and Friel (2014) use reversible jump to get the posterior probability for G.
▶ White, Wyse and Murphy (2016) use a collapsed Gibbs sampler and incorporate variable selection.
Back Pain Data: LCA Results
[Figure: heat map of estimated class-conditional item probabilities (scale 0-1) for Classes 1-5 across the 36 clinical criteria.]
Back Pain Data: LCA Clustering
                          1    2    3    4    5
Central Sensitization    48    1    5    0   41
Nociceptive               0   10   96  126    3
Peripheral Neuropathic    0   89    1    1    4
Variable Selection: Dean & Raftery’s Greedy Search
▶ Dean & Raftery (2010) proposed a greedy stepwise variable selection algorithm for LCA.
▶ The observation vector Xn is partitioned as

Xn = (Xn^C, Xn^P, Xn^O)

where
▶ Xn^C are the current clustering variables.
▶ Xn^P is proposed to be added to the clustering variables.
▶ Xn^O are the other variables.
Dean & Raftery’s Greedy Search
▶ Two competing models are compared:

[Diagram: graphical models M1 and M2*, relating the latent class z to XC, XP and XO.]

▶ M1 assumes that the proposed variable has clustering structure.
▶ M2* assumes that the proposed variable has no clustering structure.
▶ This framework reduces the independence assumption of the previously described approach.
Local Independence (A Problem?)
▶ When analyzing the back pain data, we achieved very little data reduction.
▶ In fact, only one variable was labeled as non-clustering.
▶ An explanation for this is the local independence assumption in the model.
▶ Suppose we have two variables that are highly dependent and both exhibit clustering.
▶ The variable selection method will include both variables in the model, even if one variable contains no extra clustering information.
Novel Extension: Relaxing Independence Further
▶ It is unrealistic to assume that Xn^C and Xn^P are conditionally independent in M2*.
▶ We propose replacing M2* with a different model.

[Diagram: graphical models M1 and M2; in M2 the proposed variable XP depends on a subset XR ⊆ XC of the clustering variables rather than on z.]

▶ M1 assumes that the proposed variable has clustering structure.
▶ M2 assumes that the proposed variable has no clustering structure beyond that explained by the clustering variables.
Stepwise Search Algorithm
▶ We propose a stepwise search algorithm to find an optimal set of variables for clustering (a sketch of the search loop follows this list).
▶ The algorithm involves the following steps:
▶ Add: Add a variable to the current clustering variables.
▶ Remove: Remove a variable from the current clustering variables.
▶ Swap: Swap a proposed variable with one already in the clustering variables.
▶ Model selection is implemented using the Bayesian Information Criterion (BIC).
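The skeleton below sketches the remove step of such a search in R. fit_bic is a hypothetical helper returning the BIC (in the higher-is-better convention used in the tables on these slides) of the model implied by a given set of clustering variables; the add and swap moves would be implemented analogously.

```r
## Skeleton of the stepwise search (remove move only; add and swap analogous).
## fit_bic(clust_vars) is a hypothetical helper returning the model BIC,
## with larger values preferred.
stepwise_remove <- function(vars, fit_bic) {
  current <- vars
  repeat {
    bic_now <- fit_bic(current)
    cand_bics <- vapply(current,
                        function(v) fit_bic(setdiff(current, v)), numeric(1))
    best <- which.max(cand_bics)
    if (cand_bics[best] <= bic_now) break       # no removal improves BIC: stop
    current <- setdiff(current, current[best])  # accept the best removal
  }
  current
}
```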
Back Pain Data
▶ The proposed model was applied to the back pain data:
Variables     N. latent classes   BIC         ARI
All           5                   -12582.62   0.50
All           3*                  -12763.81   0.82
35 Criteria   5                   -12116.32   0.50
35 Criteria   3*                  -12305.67   0.80
11 Criteria   3                    -3965.24   0.75
▶ The new model achieves much greater data reduction.
Algorithm Run
Iter.  Proposal         BIC diff.  Decision   Proposal                   BIC diff.  Decision
1      Remove Crit.5     -122.2    Accepted
2      Remove Crit.23    -126.3    Accepted   Swap Crit.22 with Crit.5    -73.2     Rejected
3      Remove Crit.38    -109.0    Accepted   Swap Crit.25 with Crit.5    -81.5     Rejected
4      Remove Crit.4     -103.5    Accepted   Swap Crit.2 with Crit.38    -98.6     Rejected
5      Remove Crit.1      -78.3    Accepted   Swap Crit.29 with Crit.4    -23.1     Rejected
6      Remove Crit.29     -73.2    Accepted   Swap Crit.12 with Crit.1      2.7     Accepted
7      Remove Crit.1      -73.5    Accepted   Swap Crit.26 with Crit.29     3.2     Accepted
8      Remove Crit.29     -66.8    Accepted   Swap Crit.18 with Crit.12   -10.2     Rejected
9      Remove Crit.35     -63.0    Accepted   Swap Crit.7 with Crit.29     -9.0     Rejected
10     Remove Crit.7      -59.6    Accepted   Swap Crit.11 with Crit.35    -7.6     Rejected
11     Remove Crit.10     -62.9    Accepted   Swap Crit.8 with Crit.7     -76.1     Rejected
12     Remove Crit.11     -50.4    Accepted   Swap Crit.16 with Crit.10     6.8     Accepted
13     Remove Crit.8      -54.5    Accepted   Swap Crit.10 with Crit.16   -32.0     Rejected
14     Remove Crit.3      -44.2    Accepted   Swap Crit.31 with Crit.16    -9.5     Rejected
15     Remove Crit.31     -33.2    Accepted   Swap Crit.18 with Crit.16   -22.7     Rejected
16     Remove Crit.22     -30.9    Accepted   Swap Crit.24 with Crit.23    -1.7     Rejected
17     Remove Crit.14     -22.7    Accepted   Swap Crit.32 with Crit.31    -5.0     Rejected
18     Remove Crit.32     -19.2    Accepted   Swap Crit.37 with Crit.14    -8.0     Rejected
19     Remove Crit.10     -35.4    Accepted   Swap Crit.9 with Crit.3      -1.3     Rejected
20     Remove Crit.24     -17.6    Accepted   Swap Crit.30 with Crit.8     15.7     Accepted
21     Remove Crit.34     -15.7    Accepted   Swap Crit.37 with Crit.1     -0.7     Rejected
22     Remove Crit.25     -13.7    Accepted   Swap Crit.36 with Crit.1      3.3     Accepted
23     Remove Crit.18     -10.5    Accepted   Swap Crit.1 with Crit.31      8.5     Accepted
24     Remove Crit.27     -13.7    Accepted   Swap Crit.6 with Crit.26      6.1     Accepted
25     Remove Crit.31      -1.3    Accepted   Swap Crit.20 with Crit.6      5.6     Accepted
26     Remove Crit.37       1.4    Rejected   Swap Crit.6 with Crit.5      -3.1     Accepted
27     Remove Crit.5        0.4    Rejected   Swap Crit.37 with Crit.20     4.0     Rejected
Clustering / Clinical Taxonomy
▶ The clustering closely follows the clinical taxonomy.
                          Class 1  Class 2  Class 3
Nociceptive                   210       21        4
Peripheral Neuropathic          5       88        2
Central Sensitization           3        3       89

▶ It is not unusual for patients diagnosed as Nociceptive to have Peripheral Neuropathic aspects to their back pain.
Clustering Variables
▶ The selected variables exhibit strong clustering across the three groups.

[Figure: heat map of class-conditional probabilities (scale 0-1) for the selected criteria across Classes 1-3.]
Chosen Variables with Descriptions
The chosen variables have the following descriptions.

Crit.  Description                                                           Class 1  Class 2  Class 3
2      Pain associated to trauma, pathologic process or dysfunction            0.94     0.90     0.04
5      Usually intermittent and sharp with movement/mechanical provocation     0.94     0.84     0.24
8      Pain localized to the area of injury/dysfunction                        0.97     0.50     0.31
9      Pain referred in a dermatomal or cutaneous distribution                 0.06     1.00     0.11
13     Disproportionate, nonmechanical, unpredictable pattern of pain          0.01     0.00     0.91
15     Pain in association with other dysesthesias                             0.03     0.51     0.34
19     Night pain/disturbed sleep                                              0.34     0.70     0.86
26     Pain in association with high levels of functional disability           0.07     0.36     0.79
28     Clear, consistent and proportionate pattern of pain                     0.97     0.94     0.07
33     Diffuse/nonanatomic areas of pain/tenderness on palpation               0.03     0.01     0.73
37     Pain/symptom provocation on palpation of relevant neural tissues        0.07     0.57     0.19
Discarded Variables
▶ Many of the discarded variables are related to the clustering variables.

[Figure: heat map of pairwise association (scale 0-400) between the discarded criteria and the selected criteria.]

▶ These are not clustering variables because they don't exhibit clustering beyond what can be explained by the clustering variables.
Summary
▶ Model-based approaches to clustering and variable selection achieve excellent performance.
▶ Removing independence assumptions in the model achieves improved variable selection.
▶ Care is needed when interpreting the chosen/discarded variables.
Simulation 1
[Diagram: Simulation 1 structure, relating the latent class z to variables X1-X12.]
Simulation 1 Results
[Figure: Simulation 1 results (scale 0-1) for variables X1-X12.]
Simulation 2
[Diagram: Simulation 2 structure, relating the latent class z to variables X1-X10.]
Simulation 2 Results
[Figure: Simulation 2 results (scale 0-1) for variables X1-X10.]
Exploring the dependence structure between categorical variables: Benefits and limitations of using variable selection within Bayesian clustering
Michail Papathomas
6th IBS Channel Network Conference - Hasselt - 2017
Collaborators
Main collaborators in the development of the clustering approach [profile regression, based on the Dirichlet process]:
▶ Silvia Liverani
▶ David Hastie
▶ John Molitor
▶ Sylvia Richardson

Work on the relation between clustering and log-linear modelling with Sylvia Richardson
Motivation
Dependence structure and log-linear modelling
▶ Assume observations from P categorical variables {x.1, ..., x.P}. For example,

Subject   Smoking (X)   Drinking (Y)
John      0             0
Mary      0             1
Jim       1             1
...

▶ The resulting data can be arranged as counts in a P-way contingency table. For example,

Table 1: Smoking (X) and Drinking (Y).
                         Drinking (Y)
Smoking (X)      No (0)      Yes (1)
No (0)              456           44
Yes (1)             583          911
▶ Denote the cell counts as n_l, l = 1, . . . , n.
▶ A Poisson distribution is assumed for the counts, so that E(n_l) = µ_l.
▶ A Poisson log-linear model log(µ) = X_DM λ is a GLM that relates the expected counts to the variables (see Table 1 above).
Sometimes, the dependence structure between {x.1, ..., x.P} (marginal and conditional independence) can be inferred from the form of the log-linear model. For example,

▶ log(µ_ij) = λ + λ_i^X + λ_j^Y + λ_ij^XY implies that X and Y are dependent.
▶ log(µ_ij) = λ + λ_i^X + λ_j^Y implies marginal independence.
▶ In practice, for more than 3 variables, interpreting the dependence structure by looking at the presence/absence of interaction terms becomes too difficult.
Joint probabilities for the categorical variables are obtained using the parameters of the log-linear model. In our simple example,

P(X = i, Y = j) = µ_{i,j} / Σ_{i′,j′=0,1} µ_{i′,j′}.

Then, for instance,

P(smoker, drinker) = P(X = 1, Y = 1) = µ_{1,1} / (µ_{0,0} + µ_{1,0} + µ_{0,1} + µ_{1,1})
                   = exp(λ + λ_1^X + λ_1^Y + λ_{1,1}^XY) / (exp(λ) + exp(λ + λ_1^X) + ...)

▶ So, in principle, we could fully explore the variables' dependence structure by calculating many joint probabilities and considering the laws of probability,
▶ or, even better, by considering graphical log-linear models (the joint-probability calculation is sketched below).
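As a minimal R illustration of this calculation, using the Table 1 counts and the saturated model (the data-frame layout is just for this sketch):

```r
## Sketch: joint cell probabilities from a Poisson log-linear model,
## using the smoking/drinking counts of Table 1.
tab <- data.frame(x = factor(c(0, 0, 1, 1)),   # smoking
                  y = factor(c(0, 1, 0, 1)),   # drinking
                  n = c(456, 44, 583, 911))
fit <- glm(n ~ x * y, family = poisson, data = tab)   # saturated model
p_joint <- fitted(fit) / sum(fitted(fit))             # P(X = i, Y = j)
p_joint[4]                                            # P(smoker, drinker)
```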
Graphical models
▶ They allow us to visualize and build complex dependence structures for the covariates under consideration
▶ Undirected Graphs and Directed Acyclic Graphs allow us to define and write complex joint distributions through factorization and conditional independence; Lauritzen (2011)
▶ Neighborhoods of models are easily defined, and it is straightforward to move in the space of models by adding, removing or replacing edges.
Example of a graphical model

[Figure: example graphical model.]
Problems with log-linear modelling for detecting interactions
When it is of interest to detect interactions between covariates, log-linear modelling may become problematic.
▶ In a classical setting, fitting log-linear models with many parameters sometimes requires an impractically large vector of observations for valid inferences (Burton et al., IJE, 2009). Also, identifiability and collinearity problems are often present.
▶ In Bayesian model comparison, the space of models becomes vast, and model search algorithms like the Reversible Jump approach (Green, Bka 1995) require an impractically large number of iterations before they converge (Dobra and Massam, St Meth 2010).
Bayesian clustering with the Dirichlet process
▶ Partitions the subjects into groups according to their profile
▶ Flexible Bayesian clustering
▶ Uncertainty with regard to the clustering is evaluated
▶ Post-processing leads to tractable output
Notation
▶ Consider categorical variables x.p, p = 1, ..., P.
▶ For individual i, denote the variable/covariate profile by x_i = (x_i1, ..., x_iP).
▶ For example,
x.1: smokes, does not smoke
x.2: drinks, does not drink
x.3: exercises, does not exercise
▶ Now, for instance, x_i = (smokes, drinks, does not exercise).
For individual i:
▶ z_i = c allocates subject i to cluster c.
▶ φ_cp(x) is the probability that variable x.p = x, given z_i = c.
▶ Given z_i = c, x.p has a multinomial distribution with cluster-specific parameters φ_cp = [φ_cp(1), ..., φ_cp(Mp)]
▶ A priori, φ_cp ∼ Dirichlet(λ_1, ..., λ_Mp)
▶ ψ_c denotes the probability that a subject is assigned to cluster c.
Statistical Framework
For φ = {φ_cp, c ∈ ℕ, p = 1, ..., P}:
▶ a 'stick-breaking' prior on the allocation weights ψ_c (a small sketch of the truncated construction follows);
▶ x.1, ..., x.P are assumed independent given the clustering allocation and parameters...
▶ and calculating joint probabilities for the categorical variables becomes easy!
▶ Pr(x_i | z, φ) = ∏_{p=1}^{P} φ_{z_i p}(x_ip) for i = 1, 2, ..., n.
▶ This implies

Pr(x_i | φ, ψ) = Σ_{c=1}^{∞} Pr(z_i = c | ψ) ∏_{p=1}^{P} Pr(x_ip | z_i = c) = Σ_{c=1}^{∞} ψ_c ∏_{p=1}^{P} φ_cp(x_ip).
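For illustration, a truncated version of the stick-breaking construction for the ψ_c takes a few lines of R; the truncation level C and concentration α are arbitrary choices for this sketch.

```r
## Sketch: truncated stick-breaking weights psi_1, ..., psi_C.
stick_weights <- function(C, alpha = 1) {
  v <- rbeta(C, 1, alpha)          # stick-breaking proportions
  v[C] <- 1                        # close the stick at the truncation level
  v * cumprod(c(1, 1 - v[-C]))     # psi_c = v_c * prod_{l < c} (1 - v_l)
}
psi <- stick_weights(50)
sum(psi)                           # the weights sum to one
```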
Using 0-1 variable selection switches
▶ Identify covariates that contribute more than others to the formation of clusters. [Tadesse et al. (2005), Chung and Dunson (2009); Papathomas et al. (2012)]
▶ Cluster-specific binary indicators γ_cp, so that γ_cp = 1 when covariate x.p is important for allocating subjects to cluster c; otherwise γ_cp = 0.
▶ Prior for switches: given ρ_p, γ_cp ∼ Bernoulli(ρ_p).
▶ We consider a sparsity-inducing prior for ρ with an atom at zero:

ρ_p ∼ 1{w_p=0} δ_0(ρ_p) + 1{w_p=1} Beta(α_ρ, β_ρ),

where w_p ∼ Bernoulli(0.5).
Similar to Chung and Dunson (JASA, 2009), but in their set-up covariate observations contribute to the likelihood through a regression model. In our case, covariate observations contribute directly to the likelihood, and we introduce π_p(x).
Example - Using the R package PReMiuM; Liverani et al. (2015, JSS)

▶ Simulated observations from 10000 subjects, recording...
▶ 10 binary variables, say {SMO, DRI, EXE, D, E, ..., H, I, J}.

Table 2: Cluster profiles (Simulation 1). In parentheses the number of subjects typically allocated to each group.

                 SMO   DRI   EXE   D     E     F     G     H     I     J
Median(ρ_p)      0.36  0.78  0.32  0.75  0.06  0.05  0.00  0.48  0.57  0.50
Group 1 (5465)   ><    00    00    00    00    ><    ><
Group 2 (3159)   ><    00    ><    00    00    00    ><    ><
Group 3 (1376)   00    ><    00    00    00    00    00
Linear graphical model determination
Clustering and linear modelling
▶ It is not clear how clustering output translates to interactions in a log-linear regression modelling framework.
▶ Can we assist the process of comparing a large number of log-linear models with the clustering variable selection results?
▶ The important aspect of a model that combines clustering and variable selection is that covariates are not chosen in accordance with the size of a marginal effect. They are selected because they combine to create distinct groups of subjects. Consequently, we expect that this type of modelling should be able to inform on interactions in a linear model setting.
Theoretical results
Theorem 1: Consider random variables x.p and x.q, 1 ≤ p, q ≤ P, p ≠ q. If Σ_{c=1}^{C} γ_cp × γ_cq = 0 then x.p and x.q are independent.

Theorem 2: Consider a set of random variables {x.1, . . . , x.P}. If, for some p ∈ {1, ..., P}, Σ_{c=1}^{C} γ_cp × γ_cq = 0 for all q ≠ p, then x.p is independent of {x.1, . . . , x.P} \ x.p.

Proofs: See Papathomas and Richardson (2016, JSPI). Note that the converse is not true.

The previous Theorems imply the following Corollary.

Corollary: Consider covariate x.p. If Σ_{c=1}^{C} γ_cp = 0 then x.p is independent from all other covariates.
Therefore,
▶ if the selection probability ρ_p for x.p is zero or close to zero, which implies that Σ_{c=1}^{C} γ_cp is also zero or close to zero, we can assume that x.p is independent from all other covariates.
▶ Assuming that our interest lies in exploring interactions, to reduce the dimensionality of the problem when fitting log-linear models to sparse contingency tables, x.p could be removed from the analysis.
Construction of matrix T_γ
▶ For iteration it, and for each cluster c with more than one subject, form the matrix T^{c,it}, so that element (p1, p2), 1 ≤ p1 < p2 ≤ P, is either zero or one, and equal to γ_{c,p1}(it) × γ_{c,p2}(it). All other matrix cells are empty.
▶ Sum up all matrices T^{c,it}, weighting by cluster size, to create an information matrix T_γ:

T_γ = Σ_it Σ_c n_{c,it} × T^{c,it},

where n_{c,it} is the size of cluster c at iteration it. Therefore, T_γ is a straightforward summary of all T^{c,it} matrices into one, with small clusters contributing less to this summary.
▶ For ease of interpretation, reweight the elements of T_γ so that the maximum element is one: T_γ = (max{T_γ})^{-1} × T_γ.

Matrix T_γ is constructed in such a manner that if element t_γ(p1, p2), 1 ≤ p1 < p2 ≤ P, is close to zero, this implies that an edge between x.p1 and x.p2 is not likely to be present in a highly supported graphical model. (A short sketch of this construction from MCMC output follows.)
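A minimal R sketch of this construction from stored MCMC output; the storage format (a list of binary switch matrices with matching cluster sizes per iteration) is an assumption for illustration.

```r
## Sketch: build T_gamma from MCMC output.
## gamma_draws[[it]]: C_it x P binary matrix of switches at iteration it;
## sizes[[it]]: length-C_it vector of cluster sizes n_{c,it}.
build_T_gamma <- function(gamma_draws, sizes) {
  P <- ncol(gamma_draws[[1]])
  Tg <- matrix(0, P, P)
  for (it in seq_along(gamma_draws)) {
    g <- gamma_draws[[it]]
    n <- sizes[[it]]
    for (cc in which(n > 1))                  # clusters with more than one subject
      Tg <- Tg + n[cc] * tcrossprod(g[cc, ])  # (p1,p2) entry: gamma_cp1 * gamma_cp2
  }
  Tg / max(Tg)   # rescale; only the off-diagonal (p1 < p2) entries are used
}
```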
Example. Simulation 1
▶ Simulated observations from 10000 subjects.
▶ 10 binary categorical variables {A, B, C, ..., H, I, J} are observed.
▶ Observations are simulated in accordance with the log-linear model

log(µ) = λ + λ^A + λ^B + λ^C + ... + λ^J + λ^AB + λ^BC + λ^CD + λ^DA + λ^HI + λ^IJ + λ^HJ + λ^HIJ

▶ (For more details see Papathomas and Richardson (2016).)
Table 2: Cluster profiles (Simulation 1). In parentheses the number of subjects typically allocated to each group.

                 A     B     C     D     E     F     G     H     I     J
Median(ρ_p)      0.36  0.78  0.32  0.75  0.06  0.05  0.00  0.48  0.57  0.50
Group 1 (5465)   ><    00    00    00    00    ><    ><
Group 2 (3159)   ><    00    ><    00    00    00    ><    ><
Group 3 (1376)   00    ><    00    00    00    00    00
T_γ^{sim1} =

      A     B     C     D     E     F     G     H     I     J
A           .52   .08   .50   .04   .02   .02   .20   .27   .15
B                 .45   1     .06   .04   .03   .47   .64   .47
C                       .45   .02   .02   .009  .12   .23   .16
D                             .06   .04   .03   .45   .65   .48
E                                   .003  .003  .03   .04   .03
F                                         .002  .02   .03   .03
G                                               .02   .02   .02
H                                                     .61   .56
I                                                           .74
Table 3: Mixing performance of samplers (Simulation 1). Median iterations to the best model is calculated over 30 runs of the reversible jump MCMC chain; first and third quartiles in parentheses. PDV denotes the unrefined model search strategy adopted in Papathomas et al. (2011b). See Figure 2 for the highest posterior probability model.

                             Acceptance rate (%)   Iterations (median) to highest   Posterior probability of
                                                   posterior probability model      highest probability model
(a) Uniformly random (PDV)   5.1                   590 (452, 821)                   0.55
(b) Cluster specific         3.8                   247 (164, 369)                   0.55
(c) Combined (30%, 10%)      5.3                   540 (290, 674)                   0.53
(d) Combined (20%, 20%)      4.9                   403 (312, 493)                   0.55
Real data example
▶ 30 single nucleotide polymorphisms (SNPs) in chromosomes 6 and 15. (Data from 4260 subjects in a genome-wide association study of lung cancer presented in Hung et al. (2008).)
▶ 12 SNPs were indicated as important by variable selection within clustering. (Two from chromosome 15 and ten from chromosome 6.)
▶ The SNPs were highly correlated; 3 SNPs were included in the competing log-linear graphical models as representatives: rs8034191 from chromosome 15, and {rs4324798, rs1950081} from chromosome 6.
▶ Also include age, gender and smoking status in the competing log-linear graphical models, to search for gene-environment interactions.
▶ Reducing the number of SNPs from 30 to 12, and then to 3, allows for the use of reversible jump MCMC to compare competing graphical models. The 2^33 contingency table would be too sparse, with the vast majority of cells equal to zero.
▶ The highest posterior probability model (P = 0.8) is

'SNP1 + SNP2 + SNP3 + AGE*GENDER*SMOKING'

which does not support the presence of gene-gene or gene-environment interactions.
Table 6: Cluster profiles. In parentheses the number of subjects typically allocated to each group. Genetic-environmental data (GE).

                  rs8034191 (A)  rs4324798 (B)  rs1950081 (C)  age (D)  gender (E)  smoking (F)
Median(ρ_p)       0.01           0.00           0.10           0.92     0.82        0.85
Cluster 1 (2222)  00             00             00             ><       ><
Cluster 2 (2059)  00             00             00             >
T_γ^{Real data} =

       S1     S2     S3     AGE    GEN    SM
S1            .002   .01    .06    .06    .06
S2                   .001   .02    .02    .02
S3                          .09    .07    .08
AGE                                1      .98
GEN                                       .88
Table 7: Mixing performance of samplers. Median iterations to the best model is calculated over 300 runs of the reversible jump MCMC chain; first and third quartiles in parentheses. PDV denotes the unrefined model search strategy adopted in Papathomas et al. (2011b). Genetic-environmental data [including important (characterized as such by clustering) representative SNPs]. Highest probability model: 'A+B+C+DEF'.

                          Acceptance rate (%)   Iterations (median) to highest   Posterior probability of
                                                posterior probability model      highest probability model
(a) Uniformly random      6.3                   564 (257, 1205)                  0.53
(b) Cluster specific      8.4                   196 (83, 443)                    0.51
(c) Combined (30%, 10%)   6.9                   310 (147, 670)                   0.51
(d) Combined (20%, 20%)   7.5                   235 (91, 516)                    0.52
Relevant work
Relevant work on latent class structures and log-linear modelling (Johndrow et al., 2014)
▶ Johndrow et al. (2014) consider standard and novel latent class structures. The DP is a special case.
▶ Its rank is defined as the minimum number of clusters required to describe the joint probability tensor for the categorical covariates.
▶ Bounds are derived for the rank, in relation to the number and structure of the interactions in a weakly hierarchical log-linear model.
▶ A massive reduction in the upper bound of the rank is shown, under a sparse log-linear model.
▶ The rank of the latent structure depends only on variables that are not marginally independent.
▶ A straightforward application gives that an upper bound of the rank corresponding to simulation 1 is 2^7, rather than the default 2^9. The upper bound corresponding to simulation 5 is 2^8, rather than the default 2^99.
Relevant work on latent class structures and log-linear modelling (Zhou et al., 2015)
▶ Zhou et al. (2015) also utilize the idea that marginally independent variables reduce the dimensionality of the problem.
▶ A PARAFAC factorization is adopted, which can be viewed as a more general representation of the Dirichlet process.
▶ Dimensionality reduction is achieved with the sparse PARAFAC (sp-PARAFAC) formulation, where marginal independence is modelled with quantities of similar nature to the π_p(x).
▶ The focus is on providing expressions for parameters of the log-linear models, assessing the level of shrinkage, and the convergence of the probability tensor induced by sp-PARAFAC to the true probability tensor.
▶ The prior formulation for detecting marginally independent covariates and reducing dimensionality is also different in the two approaches.
▶ Different objectives, as we focus on accelerating log-linear model selection with the Reversible Jump by utilizing the clustering process.
Additional remarks - References
Summary
▶ The advantage in utilizing variable selection within partitioning to inform log-linear model selection is mostly pertinent to marginal independence.
▶ For sparse contingency tables, this information can lead to a substantial reduction in the number of covariates considered, making the exploration of the model space feasible.
▶ Informing the model search algorithm with T_γ often improves the efficiency of the search. Marginal independence is not always detected, because the converse of the Theorems does not hold.
▶ Importantly, using T_γ to assist the model search never resulted in a worse algorithm, compared to the standard model search approach in Papathomas et al. (2011b).
Further work
▶ Sparse contingency tables and conditional independence
▶ Improved model search algorithms
▶ Sparse contingency tables and identifiability
Some References
▶ Chung, Y., Dunson, D.B., 2009. Nonparametric Bayes conditional distribution modelling with variable selection. J. Am. Stat. Assoc. 104, 1646-60.
▶ Johndrow, J.E., Bhattacharya, A., Dunson, D.B., 2014. Tensor decompositions and sparse log-linear models. arXiv:1404.0396v1.
▶ Liverani, S., Hastie, D.I., Azizi, L., Papathomas, M. and Richardson, S., 2015. PReMiuM: An R package for Profile Regression Mixture Models using Dirichlet Processes. Journal of Statistical Software 64(7), 1-30.
▶ Molitor, J.T., Papathomas, M., Jerrett, M. and Richardson, S., 2010. Bayesian Profile Regression with an Application to the National Survey of Children's Health. Biostatistics 11, 484-498.
▶ Papathomas, M., Molitor, J., Richardson, S., Riboli, E. and Vineis, P., 2011. Examining the joint effect of multiple risk factors using exposure risk profiles: lung cancer in non smokers. Environmental Health Perspectives 119, 84-91.
▶ Papathomas, M., Molitor, J., Hoggart, C., Hastie, D. and Richardson, S., 2012. Exploring data from genetic association studies using Bayesian variable selection and the Dirichlet process: application to searching for gene-gene patterns. Genetic Epidemiology 36, 663-674.
▶ Papathomas, M. and Richardson, S., 2016. Exploring dependence between categorical variables: benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms. Journal of Statistical Planning and Inference 173, 47-63.
▶ Zhou, J., Bhattacharya, A., Herring, A.H., Dunson, D.B., 2015. Bayesian factorizations of big sparse tensors. J. Am. Stat. Assoc. Accepted manuscript.
Fast sampling with Gaussian scale-mixture priors in high-dimensional regression
Bani K Mallick (joint with Anirban Bhattacharya & Antik Chakraborty), Department of Statistics, Texas A&M University
April 13, 2017
Outline
▶ High Dimensional Regression
▶ Global-Local scale Mixtures of Gaussians
▶ Efficient sampling from structured multivariate Gaussian distributions
▶ Applications
Motivation
▶ High dimensional regression with p covariates and sample size n
▶ p is much larger than n
▶ Sparsity through continuous shrinkage priors
▶ Need an efficient computational scheme for p greater than n problems
Linear model with global-local prior
▶ Y = Xβ + ε, ε ∼ N(0, σ²I_n)
▶ X: n × p matrix of covariates, where p is potentially much larger than n
▶ In such a setting, one expects β, the vector of regression coefficients, to be sparse
▶ Global-local sparse prior on β
Global-local Prior
▶ β_j | λ_j, τ, σ ∼ N(0, λ_j²τ²σ²), (j = 1, ..., p)
▶ λ_j ∼ f, τ ∼ g, σ ∼ h
▶ f, g, h are densities supported on (0, ∞)
▶ λ_j²: local variance component, works through scale mixing
▶ τ²: global variance component (like the regularization parameter in the penalized likelihood formulation)
Sparsity prior on βj by scale mixing

▶ Create different sparsity priors by choosing f, the distribution of λ_j
▶ Student-t [Tipping 2001]: f is inverse-gamma
▶ Double-Exponential [Park and Casella, 2008]: f is exponential
▶ Normal/Jeffreys [Bae and Mallick, 2004]: Jeffreys prior
▶ Horseshoe Prior [Carvalho et al., 2010]: f is half-Cauchy
Horseshoe Priors
▶ Flat, Cauchy-like tails allow strong signals to remain large (un-shrunk)
▶ A tall spike at the origin provides severe shrinkage for the zero elements of β
▶ Due to the scale mixture of Gaussians formulation, most of the conditional distributions are in explicit form
Posterior computation: high-dimensional regression
▶ The conditional posterior of β given λ, τ and σ is Gaussian:

β | y, λ, τ, σ ∼ N(A^{-1}X^T y, σ²A^{-1}),   A = (X^T X + D^{-1}),

where D = τ² diag(λ_1², ..., λ_p²).
▶ Need to invert A, which is p × p
▶ Covariance no longer diagonal
▶ The design matrix X distorts the prior geometry
▶ Need to sample from a high-dimensional Gaussian per iteration
▶ The p local scale parameters λ_j have conditionally independent posteriors: λ = (λ_1, ..., λ_p)^T is updated in a block
Computational Difficulties
▶ A standard algorithm to sample from Gaussian distributions can be found in Rue (2001)
▶ It avoids inverting A and instead performs a Cholesky decomposition of A and a series of linear system solutions to generate samples
▶ This is efficient for moderate values of p, but requires obtaining a Cholesky decomposition of A at each MCMC step
Algorithm
▶ Let A be an n × n symmetric, positive definite matrix
▶ Cholesky decomposition: A = LL^T, where L is a lower triangular matrix with L_ii > 0

Algorithm: solving Ax = b where A > 0
(i) Compute the Cholesky decomposition A = LL^T.
(ii) Solve Lv = b.
(iii) Solve L^T x = v.
(iv) Return x.

▶ Then x = (L^T)^{-1}v = (L^T)^{-1}(L^{-1}b) = (LL^T)^{-1}b = A^{-1}b
▶ The solutions in (ii) and (iii) can be obtained through forward and backward substitution, due to the triangular nature of L. (A short R sketch follows.)
Computational Difficulties
▶ It becomes highly expensive for large p
▶ One cannot resort to precomputing the Cholesky factors, since the matrix D changes from one iteration to the next
▶ The resulting computational bottleneck obscures the computational advantages of global-local priors when p is large
General scheme
▶ Rue (2001): sampling from N(µ = Q^{-1}b, Σ = Q^{-1})

Algorithm (Rue, 2001)
(i) Compute the Cholesky decomposition Q = LL^T.
(ii) Solve Lv = b.
(iii) Solve L^T m = v.
(iv) Solve L^T w = z, where z ∼ N(0, I_p).
(v) Set θ = m + w. Then θ ∼ N(µ, Σ).

▶ Original motivation: GMRFs, where Q is banded.
▶ The Cholesky factor in step (i) can then be computed very fast. (A transcription in R follows.)
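A direct R transcription of this sampler, reusing the triangular solves above:

```r
## Sketch: one draw from N(Q^{-1} b, Q^{-1}), following Rue (2001).
rue_sample <- function(Q, b) {
  L <- t(chol(Q))                        # (i)   Q = L L'
  v <- forwardsolve(L, b)                # (ii)  L v = b
  m <- backsolve(t(L), v)                # (iii) L' m = v: the mean
  w <- backsolve(t(L), rnorm(nrow(Q)))   # (iv)  L' w = z, z ~ N(0, I)
  m + w                                  # (v)   theta ~ N(Q^{-1} b, Q^{-1})
}
```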
Present setting: Q = (X^T X + D^{-1}) and b = X^T Y, so the same algorithm applies with these inputs.

▶ (X^T X + D^{-1}) is a p × p dense matrix.
▶ The Cholesky decomposition is costly in the present setting.
▶ Overall complexity O(p³).
Our proposal

▶ Sample from N_p(µ, Σ) with

Σ = (X^T X + D^{-1})^{-1},   µ = Σ X^T Y.

▶ Woodbury matrix identity:

Σ = (X^T X + D^{-1})^{-1} = D − DX^T(XDX^T + I_n)^{-1}XD.

▶ Can show µ = DX^T(XDX^T + I_n)^{-1}Y.
▶ Sample η ∼ N(0, Σ) and set θ = µ + η.
▶ Data augmentation: sample an (n + p)-dimensional quantity.
Our proposal
▶ Define P = (XDX^T + I_n), S = XD and R = D, so that Σ = R − S^T P^{-1} S.
▶ For η ∼ N(0, Σ): η = u − S^T P^{-1} v.
▶ µ = DX^T(XDX^T + I_n)^{-1}Y, i.e. µ = S^T P^{-1} Y.
▶ Finally θ = µ + η, giving θ = u + S^T P^{-1}(Y − v).
Proposed algorithm

Algorithm:
(i) Sample u ∼ N(0, D) and δ ∼ N(0, I_n) independently.
(ii) Set v = Xu + δ.
(iii) Solve (XDX^T + I_n)w = (Y − v).
(iv) Set θ = u + DX^T w.

Overall complexity: O(n²p) if p > n. In p ≫ n settings, this is a reduction from cubic to linear in p. (The algorithm is sketched in R below.)
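A minimal R transcription of the proposed sampler; d holds the diagonal of D, and this is a sketch of the algorithm above rather than the authors' code.

```r
## Sketch: one draw from N(Sigma X'Y, Sigma), Sigma = (X'X + D^{-1})^{-1},
## via the proposed O(n^2 p) algorithm; d is the diagonal of D.
fast_sample <- function(X, Y, d) {
  n <- nrow(X)
  u <- rnorm(ncol(X), 0, sqrt(d))                  # (i)  u ~ N(0, D)
  v <- drop(X %*% u) + rnorm(n)                    # (ii) v = X u + delta
  XD <- sweep(X, 2, d, `*`)                        #      X D (scale columns by d)
  w <- solve(tcrossprod(XD, X) + diag(n), Y - v)   # (iii) (X D X' + I_n) w = Y - v
  u + drop(crossprod(XD, w))                       # (iv) theta = u + D X' w
}
```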
Time comparison
Table: Absolute time (in seconds) to run 6000 iterations of the Gibbs sampler, reported for the two algorithms for chosen values of p.

p      proposed        old
200     5.50          6.05
500     8.79         31.03
1000   12.83        160.92
2000   20.04        944.78
3000   27.60       2616.80
4000   35.76       5775.70
5000   43.99      11314.28
Big gains when p large.
Simulation setting
▶ Replicated simulation study with the horseshoe prior.
▶ n = 200 and p = 5000. True β0 has 5 non-zero entries.
▶ Two signal strengths:
(i) weak: β0S = ±(0.75, 1, 1.25, 1.5, 1.75)
(ii) moderate: β0S = ±(1.5, 1.75, 2, 2.25, 2.5)
▶ Two types of design matrix:
(i) independent: X_j i.i.d. N(0, I_p)
(ii) compound symmetry: X_j i.i.d. N(0, Σ), Σ_{jj′} = 0.5 + 0.5δ_{jj′}
▶ 100 data sets generated.
▶ Compare the horseshoe with MCP, SCAD.
Simulation Results
[Figure: boxplots of ℓ1, ℓ2 and prediction error across 100 simulation replicates. HSme and HSm respectively denote the posterior pointwise median and mean for the horseshoe prior. True β0 is 5-sparse with non-zero entries ±{1.5, 1.75, 2, 2.25, 2.5}. Top row: Σ = I_p (independent). Bottom row: Σ_jj = 1, Σ_{jj′} = 0.5, j ≠ j′ (compound symmetry).]
Simulation Results
[Figure: same setting as the previous figure, but with true β0 5-sparse with non-zero entries ±{0.75, 1, 1.25, 1.5, 1.75}.]
Confidence/Credible intervals
▶ Recent focus in the frequentist literature on post-selection inference (POSI): van de Geer et al. (2013), Javanmard & Montanari (2014).
▶ Provide confidence sets for coefficients of subsets of variables.
▶ Bayesian variable selection by post-processing MCMC output with one-group priors (Bondell & Reich, 2012; Hahn & Carvalho, 2015; Li & Pati, 2015).
▶ Used in conjunction with shrinkage priors.
Simulation Results
Frequentist coverages (%) and 100×lengths of pointwise 95% intervals; standard errors (%) for coverages are given in parentheses. Average coverages and lengths are reported after averaging across all signal variables (rows 1 and 2) and noise variables (rows 3 and 4). LASSO and SS respectively stand for the methods in van de Geer et al. (2013) and Javanmard & Montanari (2014). The intervals for the horseshoe (HS) are the symmetric posterior credible intervals.

p                                 500                                                   1000
Design             Independent              Comp Symm               Independent              Comp Symm
                 HS      LASSO    SS       HS      LASSO   SS      HS      LASSO    SS       HS      LASSO   SS
Signal Coverage  93(1.0) 75(12.0) 82(3.7)  95(0.9) 73(4.0) 80(4.0) 94(2.0) 78(12.0) 85(5.1)  94(1.0) 77(2.0) 82(7.4)
Signal Length    42      46       41       85      71      75      39      41       42       82      76      77
Noise Coverage  100(0.0) 99(0.8)  99(1.0) 100(0.0) 98(1.0) 99(0.6) 99(0.0) 99(1.0)  98(0.9) 100(0.0) 99(1.0) 99(0.1)
Noise Length     2       43       40       4       69      73      0.6     42       41       0.7     76      77
TCGA Data
▶ The Cancer Genome Atlas (TCGA) provides comprehensive molecular profiles for each of at least 33 different human tumor types (http://cancergenome.nih.gov) (Akbani et al., 2013).
▶ The data portal has DNA and reverse phase protein array (RPPA) expression data, with several other clinical variables, e.g. survival time of the patients.
▶ Integrative analysis is possible as we have quantitative protein expression data over large cohorts of well characterized TCGA patient tumors, with linked DNA and RNA analyses.
Classification Analysis
▶ We have integrated the RPPA data with the DNA expression data for our analysis.
▶ Toward the goal of classification we consider two types of lung tumor data, viz. lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC).
▶ After some clean-up of the raw data we end up with 179 proteins and 76 genes which are present in both tumor types. The tumors have 55 and 45 samples respectively.
▶ Our goal is to classify the tumor groups with the help of genes and proteins together, and simultaneously select the important genes and proteins.
Probit model
▶ Suppose y_i ∈ {0, 1} and X_i ∈ ℝ^p. The probit model links them through a latent variable z_i = X_i^T β + ε_i, ε_i ∼ N(0, 1), with

y_i = 1 if z_i > 0, and 0 otherwise.   (2)

▶ Given a posterior for β, predictions can be made using the posterior predictive distribution.
Variable selection
▶ To select the genes and proteins that are important to the tumor classification we use the horseshoe shrinkage prior from Carvalho et al. (2009) on the coefficients β.
▶ Specifically, π(β_j | λ_j, τ) ∼ N(0, λ_j²τ²) and λ_j, τ ∼ C+(0, 1) for j = 1, ..., p.
▶ The latent variable formulation enables simple conditional Gibbs steps for posterior computation.
▶ However, we need to employ some post-processing scheme to select important variables, due to the continuous nature of the prior.
Post processing posterior summaries of β
▶ Let β̄ denote the posterior mean of β.
▶ The posterior predictive loss ||Xβ̄ − Xβ||²₂ can be minimized subject to an l1 constraint to perform selection.
▶ Decision-theoretic justification of such a post-processing step can be found in Hahn & Carvalho (2015).
▶ We minimize the following objective function to select important variables:

β_P = argmin_β ||Xβ̄ − Xβ||²₂ + Σ_{j=1}^{p} λ_j |β_j|.   (3)

▶ Following Chakraborty et al. (2016) we use λ_j = 1/β̄_j². (A sketch of this step appears below.)
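One way to carry out this post-processing in practice is a weighted l1 fit of the fitted values on X. The sketch below uses glmnet's penalty.factor argument to supply the weights λ_j = 1/β̄_j²; it is an illustration under these assumptions, not the exact procedure used for the results that follow.

```r
## Sketch: adaptive-lasso-style summary of a posterior mean betabar.
library(glmnet)
posterior_select <- function(X, betabar) {
  yhat <- drop(X %*% betabar)                # fitted values under the posterior mean
  fit <- glmnet(X, yhat, penalty.factor = 1 / betabar^2)
  fit   # scan the path, e.g. keep the sparsest fit with small predictive loss
}
```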
Results
▶ To compare our results we choose the Lasso penalty (Tibshirani, 1996) with a logistic link function from the R packages ncvreg/glmnet.
▶ We report the selected model size and misclassification rate for both methods.
▶ For the horseshoe probit model the misclassification rate was 5% and the selected model size was 5: the selected genes are GSKA-3 and ERBB3, and the selected proteins are PI3KP110ALPHA, MYOSINIIA_pS1943, and ANNEXIN1.
▶ For the LASSO logistic model they were 5% and 32 respectively.
▶ The variables selected by the horseshoe probit model were also selected by the LASSO logistic model.
The selected variables
▶ The effect of the GSKA-3 gene, as one of the hypoxia-inducible factors resulting in solid cancer, was established in Gort et al. (2008).
▶ The gene ERBB3 has been seen to have a direct impact on lung cancer; see Sitharaman & Anderson (2008).
▶ A recent work by Elkabets et al. (2013) studies in detail the effect of the PI3KP110ALPHA gene on breast cancer and lung cancer.
▶ Wong et al. (2012) developed treatments for the suppression of lung cancer cell tumor markers related to the ANNEXIN1 gene.
Survival Analysis for pan-cancer data
▶ One fundamental interest is to establish the relationship between the survival of the patients and different proteins
▶ Data from different kinds of tumors
▶ As the sample size may not be very large in each group, a Bayesian hierarchical model will be useful to borrow strength
▶ Furthermore, to deal with high dimensionality, we require either a penalized approach (frequentist statistics) or a shrinkage-based approach (Bayesian statistics)
▶ We apply Bayesian techniques with horseshoe priors
Kidney Tumors
▶ We consider the kidney tumors: Kidney Chromophobe (KICH), 63 samples; Kidney renal clear cell carcinoma (KIRC), 150 samples; Kidney renal papillary cell carcinoma (KIRP), 124 samples
▶ All the tumors have 189 proteins
AFT model
▶ Typically, survival outcomes have two variables: time (t) and a censoring indicator (δ), which allows the data to represent either censored or uncensored observations
▶ i: patient, j: protein, k: tumor type
▶ We fit an Accelerated Failure Time (AFT) model:

log(t_ik) = Σ_{j=1}^{p} x_ijk β_jk + ε_ik,

where t_ik is the survival time of the i-th patient who has the k-th cancer, other symbols have their usual meaning, and the ε_ik are iid N(0, σ²).
▶ For Bayesian MCMC we impute the censored data w_ik: w_ik = log t_ik if t_ik is an event time, and w_ik > log t_ik if t_ik is right censored. (A sketch of the imputation step follows.)
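Within the Gibbs sampler this imputation amounts to drawing from a normal distribution truncated below at the censoring point. A minimal inverse-CDF sketch in R, where mu is the current linear predictor for that observation:

```r
## Sketch: impute a right-censored log survival time from N(mu, sigma^2)
## truncated below at the censoring point logt (inverse-CDF method).
impute_censored <- function(logt, mu, sigma) {
  lo <- pnorm(logt, mu, sigma)                 # CDF at the censoring point
  qnorm(lo + runif(1) * (1 - lo), mu, sigma)   # a draw above logt
}
```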
Shrinkage Prior
We can carry out a regular shrinkage analysis as in linear models. Here we adopt the global-local horseshoe prior and fit an individual regression model for each kind of tumor:

log t_ik | β_jk, σ² ∼ N( Σ_{j=1}^{p} x_ijk β_jk , σ_k² )
β_jk | λ_jk, τ, σ² ∼ N(0, λ_jk² τ² σ_k²)
λ_jk ∼ C+(0, 1)
τ ∼ C+(0, 1)
π(σ_k²) ∝ 1/σ_k²
Estimation of Protein Effects
Pan Cancer Model
▶ The previous plot shows that, due to the shrinkage power of the horseshoe prior and the absence of a sufficient number of samples in each kidney tumor group, almost all of the proteins turn out to be insignificant in explaining the survival curve
▶ So we decide to fit a pan-cancer model
▶ While fitting this model we make use of the idea of borrowing strength across cancers, by specifying the prior distributions of the parameters accordingly
Pan Cancer Model
log t_ik | β_jk, σ² ∼ N( Σ_{j=1}^{p} x_ijk β_jk , σ² )
β_jk | λ_jk, τ, σ² ∼ N(b_j, λ_jk² τ² σ²)
λ_jk | τ, σ² ∼ C+(0, 1)
τ | σ² ∼ C+(0, 1) I(0, 1)
π(σ²) ∝ 1/σ²
b_j ∼ N(0, σ_b²), j = 1, ..., p

Then,

Corr(β_jk, β_jk′) = σ_b² / [ (λ_jk² τ² σ² + σ_b²)^{1/2} (λ_jk′² τ² σ² + σ_b²)^{1/2} ].
Estimation
Result
▶ Some of the proteins which are significant for the KIRC tumor group are BAK, CRAF_pS338, GAB2, HER3_pY1298, PCADHERIN, PCNA, RAD51, FOXO3A_pS318S321, SF2, DIRAS3
▶ The effects of these proteins have been discussed in the literature; e.g. see Adams et al. (2012), GAB2 - A Scaffolding Protein in Cancer, Molecular Cancer Research
▶ The significant proteins for the other tumor groups can be found in a similar fashion