Gaussian Graphical Models with latent structure

Penalized Maximum Likelihood Inference forSparse Gaussian Graphical Models with

Latent Structure

Christophe Ambroise, Julien Chiquet and Catherine Matias

Laboratoire Statistique et Genome,La genopole - Universite d’Evry

Statistique et sante publique, le 13 janvier 2009

Ambroise, Chiquet, Matias 1

Inferring Sparse Networks with LatentStructure

Christophe Ambroise, Julien Chiquet and Catherine Matias

Laboratoire Statistique et Genome,La genopole - Universite d’Evry

Statistique et sante publique, le 13 janvier 2009


Biological networksDifferent kinds of biological interactions

Families of networksI protein-protein

interactions,I metabolic pathways,I regulation network.

lexA

dinI

recF

rpD rpH

SsB

recA

umD

rpS

Regulation example : SOS Network E. Coli

Let us focus on regulatory networks . . . and look for influencenetwork





lexA

dinI

recF

rpD rpH

SsB

recA

umD

rpS







lexA

dinI

recF

rpD rpH

SsB

recA

umD

rpS




What questions?

Network

Inference

Supervised

Un-supervised

StructureDegree

distri-bution

Communityanalysis

Stat.model

Spectralclustering

How to find the interactions?What knowledge the structurecan provide?

Given a new node, what are theinteraction with the known nodes?

Given two nodes, do theyinteract?

Communities’ characteristics?Ambroise, Chiquet, Matias 3

What questions?

Network

Inference

Supervised

Un-supervised

StructureDegree

distri-bution

Communityanalysis

Stat.model

Spectralclustering

How to find the interactions?What knowledge the structurecan provide?

Given a new node, what are theinteraction with the known nodes?

Given two nodes, do theyinteract?

Communities’ characteristics?Ambroise, Chiquet, Matias 3

ProblemInfer the interactions between genes from microarray data

Microarray gene expression data,p genes, n experiments Which ones interact/co-express?

G0 G1

G2

G3

G4

G5

G6

G7

G8

G9

Major Issues

I combinatory: 2p2

possible graphsI dimension problem: n� p

Here, we reduce p to a number of fixed genes of interest



Microarray gene expression data,p genes, n experiments

Inference

Which ones interact/co-express?

G0 G1

G2

G3

G4

G5

G6

G7

G8

G9

Major Issues

I combinatory: 2p2





Microarray gene expression data,p genes, n experiments

Inference

Which ones interact/co-express?

G0 G1

G2

G3

G4

G5

G6

G7

G8

G9

Major Issues

I combinatory: 2p2




Our ideas to tackle these issues

Introduce prior taking the topology of the network intoaccount for better edge inference

G0 G1

G2

G3

G4

G5

G6

G7

G8

G9

Relying on biological constraints

1. few genes effectively interact (sparsity),2. networks are organized (latent structure).




G0 G1

G2

G3

G4

G5

G6

G7

G8

G9






A1 A2

A3

B1

B2

B3

B4

B5

C1

C2




Outline

Give the network a modelGaussian graphical modelsProviding the network with a latent structureThe complete likelihood

Inference strategy by alternate optimizationThe E–step: estimation of the latent structureThe M–step: inferring the connectivity matrix

Numerical ExperimentsSynthetic dataBreast cancer data


Outline





Outline





Outline





GGMsGeneral settings

The Gaussian modelI Let X ∈ Rp be a random vector such as X ∼ N (0p,Σ);I let (X1, . . . , Xn) be an i.i.d. size–n sample (e.g., microarray

experiments);I let X be a n× p matrix such as (Xk)ᵀ is the kth row of X;I let K = (Kij)(i,j)∈P2 := Σ−1 be the concentration matrix.

The graphical interpretation

Xi ⊥⊥ Xj |XP\{i,j} ⇔ Kij = 0⇔ edge (i, j) /∈ network,

since rij|P\{i,j} = −Kij/√KiiKjj .

K describes the graph of conditional dependencies.


GGMsGeneral settings

The Gaussian modelI Let X ∈ Rp be a random vector such as X ∼ N (0p,Σ);I let (X1, . . . , Xn) be an i.i.d. size–n sample (e.g., microarray

experiments);I let X be a n× p matrix such as (Xk)ᵀ is the kth row of X;I let K = (Kij)(i,j)∈P2 := Σ−1 be the concentration matrix.

The graphical interpretation

Xi ⊥⊥ Xj |XP\{i,j} ⇔ Kij = 0⇔ edge (i, j) /∈ network,

since rij|P\{i,j} = −Kij/√KiiKjj .

K describes the graph of conditional dependencies.


GGMs and regressionNetwork inference as p independent regression problems

One may use p different linear regressions

Xi = (X\i)ᵀα+ ε, where αj = −Kij/Kii,

Meinshausen and Bulhman’s approach (06)Solve p independent Lasso problems (`1–norm enforcessparsity):

α = arg minα

1n

∥∥Xi −X\iα∥∥2

2+ ρ ‖α‖`1 ,

where Xi is the ith column of X, and X\i is the full matrix with ithcolumn removed.

Major drawback: need of a symmetrization step to obtain afinal estimate of K.






α = arg minα

1n

∥∥Xi −X\iα∥∥2

2+ ρ ‖α‖`1 ,








α = arg minα

1n

∥∥Xi −X\iα∥∥2

2+ ρ ‖α‖`1 ,




GGMs and LassoSolving p penalized regressions⇔maximize the penalized pseudo-likelihood

Consider the approximation P(X) =∏pi=1 P(Xi|X\i).

PropositionThe solution to

K = arg maxK,Kij 6=Kji

log L(X; K) + ρ ‖K‖`1 , (1)

with

L(X; K) =p∑i=1

( n∑k=1

log P(Xki |Xk

\i; Ki)),

shares the same null-entries as the solution of the pindependent penalized regressions.

Those p terms are not independent, as K is not diagonal ! Still requires the post-symmetrization


GGMs and LassoSolving p penalized regressions⇔maximize the penalized pseudo-likelihood

Consider the approximation P(X) =∏pi=1 P(Xi|X\i).

PropositionThe solution to

K = arg maxK,Kij 6=Kji

log L(X; K) + ρ ‖K‖`1 , (1)

with

L(X; K) =p∑i=1

( n∑k=1

log P(Xki |Xk

\i; Ki)),

shares the same null-entries as the solution of the pindependent penalized regressions.

Those p terms are not independent, as K is not diagonal ! Still requires the post-symmetrization


GGMs and penalized likelihood

The penalized likelihood of the Gaussian observationsUse a penalty term

n

2(log det(K)− Tr(SnK))− ρ‖K‖`1 ,

where Sn is the empirical covariance matrix.

Banerjee et al. Model selection through sparse maximumlikelihood estimation for multivariate Gaussian, JMLR, 2008.


GGMs and penalized likelihood

The penalized likelihood of the Gaussian observationsUse a penalty term

n

2(log det(K)− Tr(SnK))− ρ‖K‖`1 ,

where Sn is the empirical covariance matrix.

Natural generalizationUse different penalty parameters for different coefficients

n

2(log det(K)− Tr(SnK))− ‖ρZ(K)‖`1 ,

where ρZ(K) = (ρZi,Zj (Kij))i,j is a penalty function dependingon an unknown underlying structure Z.


Outline





The concentration matrix structureModelling connection heterogeneity

Assumption: there exists a latent structure spreading thevertices into a set Q = {1, . . . , q, . . . , Q} of classes of connectivity.

The classes of connectivityDenote Z = {Zi = (Zi1, . . . , ZiQ)}i where Ziq = 1{i∈q} are thelatent independent variables, with

I α = {αq}, the prior proportions of groups,I (Zi) ∼M(1, α), a multinomial distribution.

A mixture of Laplace distributionsAssume Kij |Z independent. Then Kij | {ZiqZj` = 1} ∼ fq`(·),where

fq`(x) =1

2λqèxp

{− |x|λq`

}, q, ` ∈ Q.







fq`(x) =1

2λqèxp

{− |x|λq`

}, q, ` ∈ Q.







fq`(x) =1

2λqèxp

{− |x|λq`

}, q, ` ∈ Q.


Some possible structures

Figure: From Affiliation to Bipartite

A1 A2

A3

B1

B2

B3

B4

B5

C1

C2

ExampleModular (affiliation) networkTwo kinds of Laplace distributions

1. intra-cluster q = `, fin(·;λin);

2. inter-cluster q 6= `, fout(·;λout).


Some possible structures

Figure: From Affiliation to Bipartite

A1 A2

A3

B1

B2

B3

B4

B5

C1

C2

ExampleModular (affiliation) networkTwo kinds of Laplace distributions

1. intra-cluster q = `, fin(·;λin);

2. inter-cluster q 6= `, fout(·;λout).


Outline





Looking for a criteria. . .

We wish to infer non-null entries of K knowing the data. Thenour strategy is

K = arg maxK�0

P(K|X) = arg maxK�0

log P(X,K).

Marginalization over Z

Because distribution of K is known conditional on the structure !

K = arg maxK�0

log∑Z∈ZLc(X,K,Z),

where Lc(X,K,Z) = P(X,K,Z) is complete-data likelihood.

An EM–like strategy is used hereafter to solve this problem.




K = arg maxK�0


log P(X,K).



K = arg maxK�0







K = arg maxK�0


log P(X,K).



K = arg maxK�0





The complete likelihood

Proposition

logLc(X,K,Z) =n

2(log det(K)− Tr(SnK))− ‖ρZ(K)‖`1

−∑

i,j∈P,i 6=jq,`∈Q

ZiqZjl log(2λq`) +∑

i∈P,q∈QZiq logαq + c,

where Sn is the empirical covariance matrix andρZ(K) =

(ρZiZj (Kij)

)(i,j)∈P2 is defined by

ρZiZj (Kij) =∑q,`∈Q

ZiqZj`Kij

λq`.



Proposition

logLc(X,K,Z) =n

2(log det(K)− Tr(SnK))− ‖ρZ(K)‖`1

−∑





(ρZiZj (Kij)



ZiqZj`Kij

λq`.

Part concerning K: PML with a LASSO-type approach.



Proposition

logLc(X,K,Z) =n

2(log det(K)− Tr(SnK)) − ‖ρZ(K)‖`1

−∑





(ρZiZj (Kij)



ZiqZj`Kij

λq`.

Part concerning Z: estimation with a variational approach.


Outline





An EM strategy

The conditional expectation to maximize

Q(K|K(m)

)= E

{logLc(X,K,Z)|X; K(m)

},∑Z∈Z

P(Z|X,K(m)

)logLc(X,K,Z)

=∑Z∈Z

P(Z|K(m)

)logLc(X,K,Z).

ProblemI No closed-form of Q

(K|K(m)

)because P(Z|K) cannot be

factorized.I We use variational approach to approximate P(Z|K).


An EM strategy

The conditional expectation to maximize

Q(K|K(m)

)= E

{logLc(X,K,Z)|X; K(m)

},∑Z∈Z

P(Z|X,K(m)

)logLc(X,K,Z)

=∑Z∈Z

P(Z|K(m)

)logLc(X,K,Z).

ProblemI No closed-form of Q

(K|K(m)

)because P(Z|K) cannot be

factorized.I We use variational approach to approximate P(Z|K).


Outline





Variational estimation of the latent structureDaudin et. al, 2008

PrincipleUse an approximation R(Z) of P(Z|K) in the factorized form,Rτ (Z) =

∏iRτ i(Zi) where Rτ i is a multinomial distribution with

parameters τ i.

I Maximize a lower bound of the log-likelihood

J (Rτ (Z)) = L(X,K)−DKL(Rτ (Z)‖P(Z|K)).

I Using its tractable form, we have

J (Rτ (Z)) =∑Z

Rτ (Z)Lc(X,K,Z) +H(Rτ (Z)).





parameters τ i.




J (Rτ (Z)) =∑Z


This term plays the role of E(Lc(X,K,Z)|X,K(m))





parameters τ i.




J (Rτ (Z)) =∑Z


This term plays the role of E(Lc(X,K,Z)|X,K(m))

Maximizing J leads to a fix-point relationship for τ


Outline





The M–stepSeen as a penalized likelihood problem

We aim at solvingK = arg max

K�0Qτ (K),

where

Penalized likelihood problem

Qτ (K) ={n

2(log det(K)− Tr(SnK))− ‖ρτ (K)‖`1 + Cst

},

Friedman, Hastie, Tibshirani. Sparse inverse covariance estimationwith the Lasso, Biostatistics, 2007.


We deal with a more complex penalty term here.


The M–stepSeen as a penalized likelihood problem

We aim at solvingK = arg max

K�0Qτ (K),

where

Penalized likelihood problem

Qτ (K) ={n

2(log det(K)− Tr(SnK))− ‖ρτ (K)‖`1 + Cst

},

Friedman, Hastie, Tibshirani. Sparse inverse covariance estimationwith the Lasso, Biostatistics, 2007.


We deal with a more complex penalty term here.


Let us work on the covariance matrix

PropositionThe maximization problem over K is equivalent to the following,dealing with the covariance matrix Σ:

Σ = arg max‖(Σ−Sn)·/P‖∞≤1

log det(Σ),

where ·/

is the term-by-term division and

P = (pij)i,j∈P =2n

∑q,`

τiqτj`λq`

.

The proof use some optimization, primal/dual tricks


A Block-wise resolution

Denote

Σ =

[Σ11 σ12

σᵀ12 Σ22

], Sn =

[S11 s12

sᵀ12 S22

], P =

[P11 p12

pᵀ12 P22

], (2)

where Σ11 is a (p− 1)× (p− 1) matrix, σ12 is a p− 1 lengthcolumn vector and Σ22 is a scalar.

Each column of Σ satisfies (by det of Schur complement)

σ12 = arg min{‖(y−s12)·/p12‖∞≤1}

{yᵀΣ

−1

11 y},


A `1–norm penalized writing

PropositionSolving the block-wise problem is equivalent to solve thefollowing dual problem

minβ

∥∥∥∥12Σ

1/2

11 β − Σ−1/2

11 s12

∥∥∥∥2

2

+ ‖p12 ? β‖`1 ,

where ? is the term-by-term product. Vectors σ12 and β arelinked by

σ12 = Σ11β/2.

A LASSO-like formulation with existing costless algorithms


The full EM algorithm

while bQτ ( bK(m)) has not stabilized do

//THE E-STEP: LATENT STRUCTURE INFERENCEif m = 1 then

// First passApply spectral clustering on the empirical covariance S to initialize bτ

elseCompute bτ with via fix-point algorithm, using bK(m−1)

end

//THE M-STEP: NETWORK INFERENCEConstruct the penalty matrix P according to bτwhile bΣ(m)

has not stabilized do

for each column of bΣ(m)do

Compute bσ12 by solving the LASSO–like problem with path-wisecoordinate optimization

endend

Compute bK(m) by block-wise inversion of bΣ(m)

m← m+ 1end







end





endend


m← m+ 1end







end





endend


m← m+ 1end







end





endend


m← m+ 1end


Outline





Outline





Simulations settings

Five inference methods

1. InvCorEdge estimation based on empirical correlation matrix inversion.

2. GeneNet (Strimmer et al.)Edge estimation based on partial correlation with shrinkage.

3. GLasso (Friedman et al.)Edge estimation uses a uniform penalty matrix.

4. “perfect” SIMoNe (best results our method can aspire to)Edge estimation uses a penalty matrix constructed according to the theoretic

node classification.

5. SIMoNe (Statistical Inference for MOdular NEtworks)Edge estimation uses a penalty matrix constructed according to the estimated

node classification, iteratively.


Test simulation setup

Simulated Graphs

I Graphs simulated using an affiliation model (two sets ofparameters: intra-groups and inter-groups connections)

I p = 200 nodes p(p− 1)/2 = 19900 possible interactions.I 50 graphs (repetitions) were simulated per situation.I Gene expression data (i.e., Gaussian samples) was then

simulated using the sampled graph:1. Favorable setting (n = 10p),2. Middle case (n = 2p)3. Unfavorable setting (n = p/2)

Unstructured graph

I When no structure SIMoNe is comparable to GeneNet andGLasso


Concentration matrix and structure

(a) (b) (c)

Figure: Simulation of the structured sparse concentration matrix.Adjacency matrix without (a), with (b) columns reorganizedaccording the affiliation structure and corresponding graph (c).


Example of graph recoveryFavorable case

Figure: Theoretical graph and SIMoNe estimationAmbroise, Chiquet, Matias 33

Example of graph recoveryFavorable case

Figure: Theoretical graph and SIMoNe estimationAmbroise, Chiquet, Matias 33

Precision/Recall CurvesDefinition

Precision =TP

TP + FP= Proportion of true positives among all positives

Recall =TP

TP + FN= Proportion of true positive among all edges


Precision/Recall CurvesFavorable setting – n = 10p

I With n� p, PerfectSIMoNe and SIMoNeperform equivalently

I When 3p > n > p thestucture is partiallyrecovered, SIMoNeimproves the edgesselection.

I When n ≤ p all methodsperform poorly. . .

0.0 0.2 0.4 0.6 0.8 1.0

0.2

0.4

0.6

0.8

1.0

Recall

Prec

ision

SIMoNeGLassoPerfectGeneNetInvcor

Figure: GeneNet, GLasso, PerfectSIMoNe, SIMoNe, InvCor


Precision/Recall CurvesFavorable setting – n = 6p




0.0 0.2 0.4 0.6 0.8 1.0

0.2

0.4

0.6

0.8

1.0

Recall

Prec

ision




Precision/Recall CurvesMiddle case – n = 3p




0.0 0.2 0.4 0.6 0.8 1.0

0.2

0.4

0.6

0.8

1.0

Recall

Prec

ision




Precision/Recall CurvesMiddle case – n = 2p




0.0 0.2 0.4 0.6 0.8 1.0

0.2

0.4

0.6

0.8

1.0

Recall

Prec

ision


Figure: GeneNet, GLasso, PerfectSIMoNe, SIMoNe


Precision/Recall CurvesUnfavorable case – n = p




0.0 0.2 0.4 0.6 0.8 1.0

0.2

0.4

0.6

0.8

Recall

Prec

ision

SIMoNeGLassoPerfectGeneNet



Precision/Recall CurvesUnfavorable case – n = p/2




0.0 0.2 0.4 0.6 0.8 1.0

0.2

0.4

0.6

0.8

1.0

Recall

Prec

ision

SIMoNeGLassoPerfectGeneNet



Outline





First results on real a datasetPrediction of the outcome of preoperative chemotherapy

Two types of patients

1. Patient response can be classified as either a pathologiccomplete response (PCR)

2. or residual disease (Not PCR).

Gene expression data

I 133 patients (99 not PCR, 34 PCR)I 26 identified genes (differential analysis)


First result on real a datasetPrediction of the outcome of preoperative chemotherapy

AMFR

BB_S4

BECNI

BTG3

CA12

CTNND2

E2F3

ERBB4

FGFRIOP

FLJ10916

FLJI2650

GAMT

GFRAI

IGFBP4

JMJD2B

KIA1467

MAPT

MBTP_SI

MELK

METRN

PDGFRA

RAMPI

RRM2

SCUBE2

THRAP2

ZNF552

Full SampleAmbroise, Chiquet, Matias 37


AMFR

BB_S4

BECNI

BTG3

CA12

CTNND2

E2F3

ERBB4

FGFRIOP

FLJ10916

FLJI2650

GAMT

GFRAI

IGFBP4

JMJD2B

KIA1467

MAPT

MBTP_SI

MELK

METRN

PDGFRA

RAMPI

RRM2

SCUBE2

THRAP2

ZNF552

Not PCRAmbroise, Chiquet, Matias 37


AMFR

BB_S4

BECNI

BTG3

CA12

CTNND2

E2F3

ERBB4

FGFRIOP

FLJ10916

FLJI2650

GAMT

GFRAI

IGFBP4

JMJD2B

KIA1467

MAPT

MBTP_SI

MELKMETRN

PDGFRA

RAMPI

RRM2

SCUBE2

THRAP2

ZNF552

PCRAmbroise, Chiquet, Matias 37

Conclusions

To sum-up

I We proposed an inference strategy based on apenalization scheme given by an underlying unknownstructure.

I The estimation strategy is based on a variational EM

algorithm, in which a LASSO-like procedure is embedded.I Preprint on arxiv.I R package SIMoNe

Perspectives

I Consider alternative prior more biologically relevant: hubs,motifs.

I Time segmentation when dealing with temporal data


Conclusions

To sum-up

I We proposed an inference strategy based on apenalization scheme given by an underlying unknownstructure.

I The estimation strategy is based on a variational EM

algorithm, in which a LASSO-like procedure is embedded.I Preprint on arxiv.I R package SIMoNe

Perspectives

I Consider alternative prior more biologically relevant: hubs,motifs.

I Time segmentation when dealing with temporal data


Penalty choice (1)

Let Ci denote the connectivity component of i in the trueconditional dependency graph, and Ci the correspondingcomponent resulting from the estimate K.

PropositionFix some ε > 0 and choose the penalty parameters λ such that,for all q, ` ∈ Q,

2p2Fn−2

2nλq`

(maxi 6=j

SiiSjj −1λ2q`

)−1/2

(n− 2)1/2

≤ ε,where 1− Fn−2 is the c.d.f. of a Students’s t-distribution withn− 2 degrees of freedom. Then

P(∃k, Ck * Ck) ≤ ε. (3)


Penalty choice (2)

It’s enough to choose λq` such as

λq`(ε) ≥2n

(n− 2 + t2n−2

(ε

2p2

))1/2

×

maxi 6=j

ZiqZj`=1

SiiSjj

−1/2

tn−2

(ε

2p2

)−1

.


Penalty choice (3)

Practically,

I Relax the λq` in the E–step (variational inference), thusmaking variational EM in the E-step.

I Fix the λq` in the M-step, adapting the above rule to thecontext.E.g., for an affiliation structure, we fix the ratio λin/λout = 1.2 and either let the

value 1/λin vary when considering precision/recall curves for synthetic data, or fix

this parameter relying on the above rule when dealing with real data


Gaussian Graphical Models with latent structure

Documents

network intoaccount

sos network

julien chiquet

recf regulation network

g9 p genes

regulatory networks

biological constraints1

sparse networks