Graphical Models: Learning
Pradeep Ravikumar
Carnegie Mellon University
Learning Graphical Models
• In many contexts, we do not have an a priori specified graphical model distribution
• All we have access to are i.i.d. samples drawn from an *unknown* graphical model distribution
• We would like to estimate the graphical model distribution from data:
  • the graph
  • the factor functions
• We focus on parametric graphical models, where the factor functions have a specific parametric form, so this entails estimating some parameters (rather than general functions)
Learning Undirected Graphical Models
We will focus on pairwise graphical models:

p(X; θ, G) = (1/Z(θ)) exp( Σ_{(s,t)∈E(G)} θ_st φ_st(X_s, X_t) )

φ_st(x_s, x_t): arbitrary potential functions
• Ising: x_s x_t
• Potts: I(x_s = x_t)
• Indicator: I(x_s = j, x_t = k)
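The three potential choices above can be sketched directly. This is a minimal illustration, not code from the lecture; the edge weights and configuration below are made-up values.

```python
import math

def ising(xs, xt):
    # Ising potential: x_s * x_t, for states in {-1, +1}
    return xs * xt

def potts(xs, xt):
    # Potts potential: I(x_s = x_t)
    return 1.0 if xs == xt else 0.0

def indicator(j, k):
    # Indicator potential: I(x_s = j, x_t = k)
    def phi(xs, xt):
        return 1.0 if (xs, xt) == (j, k) else 0.0
    return phi

def unnormalized_p(x, edges, phi):
    """exp( sum over edges of theta_st * phi(x_s, x_t) ); dividing by Z(theta)
    (the sum of this quantity over all configurations) gives p(x; theta, G)."""
    return math.exp(sum(th * phi(x[s], x[t]) for (s, t), th in edges.items()))

edges = {(0, 1): 0.8, (1, 2): -0.5}   # illustrative weights
x = (1, 1, -1)
print(unnormalized_p(x, edges, ising))   # exp(0.8*1 + (-0.5)*(-1)) = exp(1.3)
print(unnormalized_p(x, edges, potts))   # exp(0.8*1 + (-0.5)*0) = exp(0.8)
```

Swapping the potential function changes which interactions the model can express while keeping the same exponential-family form.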
Graphical Model Selection
Graphical model selection:
• let G = (V, E) be an undirected graph on p = |V| vertices
• pairwise Markov random field: family of probability distributions

P(x_1, ..., x_p; θ) = (1/Z(θ)) exp( Σ_{(s,t)∈E} θ_st x_s x_t )

• Problem of graph selection: given n independent and identically distributed (i.i.d.) samples of X = (X_1, ..., X_p), identify the underlying graph structure

Martin Wainwright (UC Berkeley), High-dimensional graph selection, September 2009
Given: n samples of X = (X_1, ..., X_p) with distribution p(X; θ, G), where

p(X; θ) = exp( Σ_{(s,t)∈E(G)} θ_st φ_st(x_s, x_t) − A(θ) )

Problem: Estimate graph G given just the n samples.
Samples from binary-valued pairwise MRFs:
• Independence model: θ_st = 0
• Medium coupling: θ_st ≈ 0.2
• Strong coupling: θ_st ≈ 0.8
Learning Graphical Models
• Two-Step Procedures:
  ‣ 1. Model Selection: estimate the graph structure
  ‣ 2. Parameter Inference: estimate parameters given the graph structure
• Score-Based Approaches: search over the space of graphs, scoring each graph via parameter inference
• Constraint-Based Approaches: estimate individual edges via hypothesis tests for conditional independences
• Caveats: (a) it is difficult to provide statistical guarantees for these estimators; (b) computing these estimators is NP-hard
Learning Graphical Models
• State-of-the-art methods are based on estimating neighborhoods
  ‣ via high-dimensional statistical model estimation
  ‣ via high-dimensional hypothesis tests
Ising Model Selection
Given: n samples of X = (X_1, ..., X_p) with distribution p(X; θ, G), where

p(X; θ) = exp( Σ_{(s,t)∈E(G)} θ_st X_s X_t − A(θ) )

Problem: Estimate graph G given just the n samples.
Applications: statistical physics, computer vision, social network analysis
[Figure: US Senate, 109th Congress (Banerjee et al., 2008)]
Ising Model Selection
• Just computing the likelihood of a known Ising model is NP-hard, since the normalization constant requires summing over exponentially many configurations:

Z(θ) = Σ_{x ∈ {−1,1}^p} exp( Σ_{(s,t)} θ_st x_s x_t )

• Estimating the unknown Ising model parameters as well as the graph structure might seem to be NP-hard as well
• On the other hand, it is tractable to estimate the node-wise conditional distributions of one variable conditioned on the rest of the variables
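The exponential cost of Z(θ) is easy to see in code. Below is a brute-force partition function for a small Ising model, on an illustrative chain graph with made-up weights; the loop has 2^p terms, which is exactly what becomes hopeless for large p.

```python
import itertools
import math

def log_partition(edge_weights, p):
    """Brute-force log Z(theta) for an Ising model on p {-1,+1} variables."""
    Z = 0.0
    for x in itertools.product([-1, 1], repeat=p):  # 2^p configurations
        Z += math.exp(sum(th * x[s] * x[t]
                          for (s, t), th in edge_weights.items()))
    return math.log(Z)

chain = {(0, 1): 0.5, (1, 2): 0.5, (2, 3): 0.5}  # illustrative chain graph
A = log_partition(chain, 4)   # 2^4 = 16 terms; at p = 300 there are 2^300
print(A)
```

For a chain with no external field, Z factorizes as 2^p Π_e cosh(θ_e), which gives an independent check of the brute-force sum on this toy example.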
Neighborhood Estimation in Ising Models
For Ising models, the node-conditional distribution is just a logistic regression model:

p(X_r | X_{V\r}; θ, G) = exp( Σ_{t∈N(r)} 2 θ_rt X_r X_t ) / ( exp( Σ_{t∈N(r)} 2 θ_rt X_r X_t ) + 1 )

• So instead of estimating a graph-structure-constrained global Ising model, we could estimate structure-constrained local node-conditional distributions: logistic regression models
• But would node-conditional distributions uniquely specify a consistent joint, or even be consistent with any joint at all?
Conditional and Joint Distributions
• Would node-conditional distributions uniquely specify a consistent joint, or even be consistent with any joint at all?
• In general: no!
• But for the Ising model and node-wise logistic regression models: yes!
• Theorem (Besag 1974; R., Wainwright, Lafferty 2010): An Ising model uniquely specifies, and is uniquely specified by, a set of node-wise logistic regression models.
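The Ising/logistic correspondence can be checked numerically on a toy model. The sketch below, with made-up edge weights, enumerates the joint of a 3-node Ising model and confirms that every node-conditional matches the logistic form from the previous slide.

```python
import itertools
import math

# Toy 3-node Ising model on {-1,+1} variables; weights are illustrative.
theta = {(0, 1): 0.6, (0, 2): -0.4, (1, 2): 0.9}
p = 3

def weight(x):
    """Unnormalized probability exp( sum theta_st x_s x_t )."""
    return math.exp(sum(t * x[s] * x[u] for (s, u), t in theta.items()))

def field(r, x):
    """sum over neighbors t of theta_rt * x_t"""
    tot = 0.0
    for (s, u), t in theta.items():
        if s == r:
            tot += t * x[u]
        elif u == r:
            tot += t * x[s]
    return tot

def cond_from_joint(r, x):
    """P(X_r = x_r | X_rest), by enumerating the two values of X_r."""
    flipped = tuple(-v if i == r else v for i, v in enumerate(x))
    return weight(x) / (weight(x) + weight(flipped))

def cond_logistic(r, x):
    """Logistic form from the slide: exp(2 x_r f)/(exp(2 x_r f) + 1)."""
    a = math.exp(2 * x[r] * field(r, x))
    return a / (a + 1)

for x in itertools.product([-1, 1], repeat=p):
    for r in range(p):
        assert abs(cond_from_joint(r, x) - cond_logistic(r, x)) < 1e-12
print("all node-conditionals match the logistic form")
```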
Neighborhood Estimation in Ising Models
• The global constraint of a sparse, bounded-degree graph is equivalent to the local constraint of bounded node degrees (numbers of neighbors)
• Estimate node neighborhoods via constrained logistic regression models, and stitch the node neighborhoods together to form the global graph
Graph selection via neighborhood regression
Observation: Recovering the graph G is equivalent to recovering the neighborhood set N(s) for all s ∈ V.
Method: Given n i.i.d. samples X^(1), ..., X^(n), perform logistic regression of each node X_s on X_\s := {X_t, t ≠ s} to estimate the neighborhood structure N̂(s).
1. For each node s ∈ V, perform ℓ1-regularized logistic regression of X_s on the remaining variables X_\s:

θ̂[s] := argmin_{θ ∈ R^(p−1)} { (1/n) Σ_{i=1}^n f(θ; X^(i)_\s)  [logistic likelihood]  +  ρ_n ‖θ‖_1  [regularization] }

2. Estimate the local neighborhood N̂(s) as the support (non-zero entries) of the regression vector θ̂[s].
3. Combine the neighborhood estimates in a consistent manner (AND or OR rule).
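The three steps above can be sketched end-to-end on synthetic data. This is an illustrative implementation, not the authors' code: the 5-node chain graph, edge weight 0.6, Gibbs sampler settings, and regularization constant C are all hand-picked assumptions, and scikit-learn's LogisticRegression stands in for the ℓ1-regularized logistic program.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
p, n = 5, 2000
Theta = np.zeros((p, p))                     # chain graph, weight 0.6
for s in range(p - 1):
    Theta[s, s + 1] = Theta[s + 1, s] = 0.6

# Crude Gibbs sampler for the Ising model: 200 burn-in sweeps, keep every 5th.
x = rng.choice([-1, 1], size=p)
samples = []
for sweep in range(200 + 5 * n):
    for s in range(p):
        f = Theta[s] @ x                     # diagonal of Theta is zero
        prob_plus = 1.0 / (1.0 + np.exp(-2 * f))   # P(X_s = +1 | rest)
        x[s] = 1 if rng.random() < prob_plus else -1
    if sweep >= 200 and (sweep - 200) % 5 == 0:
        samples.append(x.copy())
X = np.array(samples)[:n]

# Steps 1-2: l1-regularized logistic regression of each node on the rest;
# the support of the coefficient vector is the estimated neighborhood.
# (C is the inverse regularization strength; 0.02 is a hand-tuned guess.)
neighborhoods = []
for s in range(p):
    y = (X[:, s] + 1) // 2                   # map {-1,+1} -> {0,1}
    rest = [t for t in range(p) if t != s]
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.02)
    clf.fit(X[:, rest], y)
    coef = clf.coef_.ravel()
    neighborhoods.append({rest[j] for j in np.flatnonzero(np.abs(coef) > 1e-6)})

# Step 3: AND rule -- keep edge (s,t) only if each node selects the other.
edges = {(s, t) for s in range(p) for t in neighborhoods[s]
         if s < t and s in neighborhoods[t]}
print(sorted(edges))
```

With these settings the true chain edges (0,1), (1,2), (2,3), (3,4) should typically be recovered; the AND rule makes spurious edges unlikely since both endpoints must select each other.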
Empirical behavior: Unrescaled plots
[Plot: probability of success versus raw number of samples n, for a star graph with a linear fraction of neighbors; curves for p = 64, 100, 225.]
Sufficient conditions for consistent model selection
• graph sequences G_{p,d} = (V, E) with p vertices and maximum degree d
• edge weights |θ_st| ≥ θ_min for all (s, t) ∈ E
• draw n i.i.d. samples, and analyze the probability of success indexed by (n, p, d)

Theorem (R., Wainwright, Lafferty, 2010)
Under incoherence conditions, for a rescaled sample size

θ_LR(n, p, d) := n / (d^3 log p) > θ_crit

and regularization parameter ρ_n ≥ c_1 τ sqrt(log p / n), then with probability greater than 1 − 2 exp(−c_2 (τ − 2) log p) → 1:
(a) Uniqueness: For each node s ∈ V, the ℓ1-regularized logistic convex program has a unique solution. (Non-trivial since p ≫ n implies the program is not strictly convex.)
(b) Correct exclusion: The estimated sign neighborhood N̂(s) correctly excludes all edges not in the true neighborhood.
(c) Correct inclusion: For θ_min ≥ c_3 τ √d ρ_n, the method selects the correct signed neighborhood.
Consequence: For θ_min = Ω(1/d), it suffices to have n = Ω(d^3 log p).
Assumptions
Define the Fisher information matrix of the logistic regression: Q* := E_θ*[∇²f(θ*; X)].

A1. Dependency condition: bounded eigenspectra:
C_min ≤ λ_min(Q*_SS), and λ_max(Q*_SS) ≤ C_max;
λ_max(E_θ*[X Xᵀ]) ≤ D_max.

A2. Incoherence: there exists a ν ∈ (0, 1] such that
|||Q*_{S^c S} (Q*_SS)^{−1}|||_{∞,∞} ≤ 1 − ν,
where |||A|||_{∞,∞} := max_i Σ_j |A_ij|.

• bounds on eigenvalues are fairly standard
• incoherence condition: partly necessary (prevention of degenerate models), partly an artifact of ℓ1-regularization
• the incoherence condition is weaker than correlation decay
Other Undirected Graphical Models
• Similar estimators work for other undirected parametric graphical models as well
• Discrete/Categorical Graphical Models (Jalali, Ravikumar, Sanghavi, Ruan 2011)
• Gaussian Graphical Models (Ravikumar, Raskutti, Wainwright, Yu 2011)
• Exponential Family Graphical Models (Yang, Ravikumar, Allen, Liu 2015)
Example: Mixed Graphical Models
Experiments: Cancer Genomic and Transcriptomic Data
Combine 'Level III RNA-sequencing' data and 'Level II non-silent somatic mutation and level III copy number variation' data for 697 breast cancer patients.
[Network figure: learned mixed graphical model over gene expression and mutation nodes]
TPGM-Ising graphical model:
(Yellow) gene expression via RNA-sequencing, count-valued
(Blue) genomic mutation, binary mutation status
Well-known components: (DLK1, THSD4) - (TP53)
(UT Austin) Mixed Graphical Models via Exponential Families, AISTATS 2014
Poisson-Ising Models
Example: Poisson Graphical Models
• MicroRNA network learnt from The Cancer Genome Atlas (TCGA) Breast Cancer Level II Data
[Figure: SPGM miRNA network (case study: biological results)]
An Important Special Case: Poisson Graphical Model
Joint distribution:

P(X) = exp( Σ_s θ_s X_s + Σ_{(s,t)∈E} θ_st X_s X_t − Σ_s log(X_s!) − A(θ) )

Node-conditional distributions:

P(X_s | X_{V\s}) ∝ exp( (θ_s + Σ_{t∈N(s)} θ_st X_t) X_s − log(X_s!) )

The pairwise variant was discussed as the "Poisson auto-model" in (Besag, 1974).
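The node-conditional above is just a Poisson distribution with log-rate θ_s + Σ_t θ_st X_t, which the following sketch makes concrete. The parameter values and neighbor counts are hypothetical; note the edge weights are non-positive, as required for the joint Poisson graphical model to be normalizable.

```python
import math

theta_s = 0.5                     # node parameter (hypothetical)
theta_st = {1: -0.2, 2: -0.1}     # neighbor t -> edge weight theta_st
x_nbr = {1: 3, 2: 1}              # observed neighbor values X_t

# P(X_s | X_neighbors) ∝ exp( eta * X_s - log(X_s!) ), i.e. Poisson with
# rate lam = exp(eta), where eta = theta_s + sum_t theta_st * X_t.
eta = theta_s + sum(theta_st[t] * x_nbr[t] for t in theta_st)
lam = math.exp(eta)

def cond_pmf(k):
    """Poisson(lam) pmf: P(X_s = k | X_neighbors)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Sanity check: the conditional pmf sums to ~1 over the count support.
assert abs(sum(cond_pmf(k) for k in range(60)) - 1.0) < 1e-9
```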
Learning Directed Graphical Models
Moralization
• The undirected graphical model corresponding to the moralized graph is the smallest undirected graphical model that includes the directed graphical model distribution
• Learning the undirected graphical model structure given i.i.d. samples from a directed graphical model would thus estimate the moralized graph
Recall: Moralization
[Figure: (a) a directed graph on X1, ..., X6; (b) its moralized undirected graph]
Learning Directed Graphical Models
• Two-Step Process:
  • Learn the undirected "moralized" graph
  • Orient the edges using conditional independence tests
    • PC algorithm: Spirtes, Glymour & Scheines (2000)
• Open Problem: there are no very scalable algorithms for the second step (or for overall directed graphical model estimation)