Reconstructing gene regulatory networks with probabilistic models Marco Grzegorczyk Dirk Husmeier.


[Figure: the regulatory network is unknown; high-throughput experiments yield postgenomic data, which are analysed with machine learning and statistics]

Overview

• Introduction

• Bayesian networks

• Comparative evaluation

• Integration of biological prior knowledge

• A non-homogeneous Bayesian network for non-stationary processes

• Current work


Elementary molecular biological processes

Description with differential equations: rates, concentrations, kinetic parameters q

Given: Gene expression time series

Can we infer the correct gene regulatory network?

Parameters q known: Numerically integrate the differential equations for different hypothetical networks

Model selection for known parameters q

Gene expression time series predicted with different models

Measured gene expression time series

Highest likelihood: best model

Compare

Model selection for unknown parameters q

Gene expression time series predicted with different models

Measured gene expression time series

Highest likelihood: over-fitting

Bayesian model selection

Select the model with the highest posterior probability:

This requires an integration over the whole parameter space:

This integral is usually intractable
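
The two formulas this slide refers to did not survive extraction. A hedged reconstruction of the standard form, writing M for a candidate network, D for the data and q for the parameters as elsewhere in these slides (the symbols are an assumption):

```latex
P(M \mid D) \propto P(D \mid M)\, P(M),
\qquad
P(D \mid M) = \int P(D \mid q, M)\, P(q \mid M)\, dq
```

The second expression is the marginal likelihood: it averages the likelihood over all parameter values instead of picking the best-fitting one, which is what protects against over-fitting.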

Marginal likelihoods for the alternative pathways

Computationally expensive; network reconstruction ab initio is infeasible


Objective: Reconstruction of regulatory networks ab initio

Higher level of abstraction: Bayesian networks

Bayesian networks

[Figure: example Bayesian network with nodes A, B, C, D, E, F connected by directed edges]

• Marriage between graph theory and probability theory.

• Directed acyclic graph (DAG) representing conditional independence relations.

• It is possible to score a network in light of the data: P(D|M), D: data, M: network structure.

• We can infer how well a particular network explains the observed data.

P(A,B,C,D,E,F) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|D) P(F|C,D)
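
As a minimal sketch of what this factorisation means in practice, the joint probability of the six-node example (A has no parents; B and C depend on A; D on B and C; E on D; F on C and D) can be computed as a product of local terms. The conditional probabilities in `p_true` are made up for illustration and are not from the slides.

```python
from itertools import product

# Parent sets of the six-node example network.
PARENTS = {"A": (), "B": ("A",), "C": ("A",),
           "D": ("B", "C"), "E": ("D",), "F": ("C", "D")}

def p_true(parent_values):
    """Hypothetical P(node = 1 | parents): more active parents, higher probability."""
    if not parent_values:
        return 0.5
    return 0.2 + 0.6 * sum(parent_values) / len(parent_values)

def joint(assignment):
    """P(A,...,F) as the product of the local conditional probabilities."""
    prob = 1.0
    for node, parents in PARENTS.items():
        p1 = p_true(tuple(assignment[p] for p in parents))
        prob *= p1 if assignment[node] == 1 else 1.0 - p1
    return prob

# Summing the joint over all 2^6 binary assignments gives 1 (up to rounding).
total = sum(joint(dict(zip(PARENTS, values)))
            for values in product((0, 1), repeat=len(PARENTS)))
print(total)
```

Each node only needs a table indexed by its own parents, which is why the factorised form is far cheaper than storing the full joint distribution.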

Bayes net

ODE model

Linear model

[A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise

[Figure: node A with parents P1, P2, P3, P4 and edge weights w1, w2, w3, w4]

Nonlinear discretized model

[Figure: two panels, each with regulators P1 and P2 labelled activator and repressor; one panel shows activation, the other inhibition]

Allow for noise: probabilities

Conditional multinomial distribution

Model Parameters q

Integral analytically tractable!
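
To illustrate why the integral is tractable for multinomial models, here is a minimal sketch assuming a symmetric Dirichlet prior on the parameters q (the slides do not state the prior; the Dirichlet is the usual conjugate choice). The function names, the prior strength `alpha` and the counts are hypothetical.

```python
from math import exp, lgamma

def log_marginal_likelihood(counts, alpha=1.0):
    """log P(observed sequence) for one multinomial, parameters integrated out.

    counts[k] is how often state k was observed; the prior is Dirichlet(alpha, ..., alpha).
    """
    n = sum(counts)
    a = alpha * len(counts)
    return (lgamma(a) - lgamma(a + n)
            + sum(lgamma(alpha + c) - lgamma(alpha) for c in counts))

def node_log_score(counts_by_parent_state, alpha=1.0):
    """One independent multinomial per parent configuration; the log scores add up."""
    return sum(log_marginal_likelihood(c, alpha)
               for c in counts_by_parent_state.values())

# Hypothetical counts for a binary gene with one binary regulator.
table = {"regulator off": [9, 1], "regulator on": [2, 8]}
print(node_log_score(table))
print(exp(log_marginal_likelihood([1, 1])))  # one observation of each state
```

The score of a whole network is then the sum of such node scores, so no numerical integration is needed.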

Example: 2 genes -> 16 different network structures

Best network: maximum score

Identify the best network structure

Ideal scenario: Large data sets, low noise

Uncertainty about the best network structure

Limited number of experimental replications, high noise

Sample of high-scoring networks


Feature extraction, e.g. marginal posterior probabilities of the edges



High-confidence edge

High-confidence non-edge

Uncertainty about edges
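
A minimal sketch of the feature extraction step: given a sample of high-scoring networks, the marginal posterior probability of an edge is estimated as the fraction of sampled networks that contain it. The sample below is invented for illustration.

```python
from collections import Counter

def edge_posteriors(sampled_networks):
    """Fraction of sampled networks containing each directed edge."""
    counts = Counter(edge for net in sampled_networks for edge in net)
    return {edge: c / len(sampled_networks) for edge, c in counts.items()}

# Hypothetical sample of four networks, each given as a set of directed edges.
sample = [
    {("A", "B")},
    {("A", "B"), ("B", "A")},
    {("A", "B")},
    {("A", "B"), ("B", "B")},
]
posteriors = edge_posteriors(sample)
for edge, p in sorted(posteriors.items()):
    print(edge, p)
```

Edges absent from every sampled network do not appear in the result and have an estimated probability of 0.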

Can we generalize this scheme to more than 2 genes?

In principle yes.

However …

[Plot: number of structures against number of nodes]
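
To give a feel for the growth, the sketch below counts labelled directed acyclic graphs with Robinson's recurrence. This is one standard way of counting structures and may differ from the convention behind the plot; for instance, it excludes self-loops, so it yields 3 rather than 16 structures for two nodes.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))

for n in range(1, 11):
    print(n, count_dags(n))
```

The count grows super-exponentially, which is why exhaustive scoring is only possible for a handful of nodes.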

Complete enumeration infeasible -> hill climbing

Accept a move when the score increases

Configuration space of network structures

Local optimum


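MCMC

Propose a local change from the current network M_old to a new network M_new.

If P(M_new|D) > P(M_old|D): accept.

If P(M_new|D) < P(M_old|D): accept with probability P(M_new|D) / P(M_old|D).

The algorithm converges to the posterior distribution P(M|D).

Madigan & York (1995), Giudici & Castelo (2003)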
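
A toy sketch of structure MCMC with local changes, not the authors' implementation. It assumes two genes and the four possible directed edges including self-loops (16 structures), a symmetric proposal that toggles one edge, and a made-up `log_score` standing in for the log posterior.

```python
import math
import random
from collections import Counter
from itertools import product

GENES = ("A", "B")
EDGES = list(product(GENES, repeat=2))        # 4 possible edges, self-loops included
TARGET = frozenset({("A", "B"), ("B", "B")})  # hypothetical best-scoring structure

def log_score(net):
    """Made-up log posterior: +2 for every edge decision that agrees with TARGET."""
    return 2.0 * sum((e in net) == (e in TARGET) for e in EDGES)

def structure_mcmc(n_iter, seed=0):
    rng = random.Random(seed)
    net, visits = frozenset(), Counter()
    for _ in range(n_iter):
        proposal = net ^ {rng.choice(EDGES)}  # local change: toggle one edge
        diff = log_score(proposal) - log_score(net)
        # Symmetric proposal, so accept with probability min(1, score ratio).
        if rng.random() < math.exp(min(0.0, diff)):
            net = proposal
        visits[net] += 1
    return visits

visits = structure_mcmc(20000)
for net, n in visits.most_common(3):
    print(sorted(net), n / 20000)
```

Visit frequencies approximate the posterior probabilities of the structures, so the same run also provides the sample needed for edge posterior estimates.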


Problem: Local changes -> small steps -> slow convergence, difficult to cross valleys.

Problem: Global changes -> large steps -> low acceptance -> slow convergence.


Can we make global changes that jump onto other peaks and are likely to be accepted?

MCMC trace plots

[Figure: trace plots against iteration number for the conventional scheme and the new scheme]


Example: Protein signalling pathway

[Figure: signalling from the cell membrane via phosphorylation to transcription factors (TF) in the nucleus -> cell response]

Evaluation on the Raf signalling pathway

From Sachs et al., Science 2005

[Figure legend: cell membrane, receptor molecules, phosphorylated protein, interaction in signalling pathway (activation, inhibition)]

Flow cytometry data

• Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins

• 5400 cells have been measured under 9 different cellular conditions (cues)

• Downsampling to 100 instances (5 separate subsets): indicative of microarray experiments

Simulated data or “gold standard” from the literature



From Perry Sprawls

ROC curve (TP against FP)

[Figure: TP counts at 5 FP for BN, GGM and RN]

Four different evaluation criteria

DGE UGE

TP for fixed FP

Area under the curve (AUC)
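
A minimal sketch of the AUC criterion, assuming each candidate edge has a score (for example its marginal posterior probability) and a gold-standard set of true edges is available. The edge names and scores are invented, and tied scores are not treated specially.

```python
def roc_points(scores, truth):
    """(FP, TP) counts as the threshold is lowered over edges ranked by score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    points, tp, fp = [(0, 0)], 0, 0
    for edge in ranked:
        if edge in truth:
            tp += 1
        else:
            fp += 1
        points.append((fp, tp))
    return points

def auc(points):
    """Area under the ROC curve (trapezoidal rule, both axes scaled to [0, 1])."""
    fp_max, tp_max = points[-1]
    area = sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return area / (fp_max * tp_max)

# Hypothetical edge scores and gold standard.
scores = {"e1": 0.9, "e2": 0.8, "e3": 0.4, "e4": 0.1}
truth = {"e1", "e3"}
points = roc_points(scores, truth)
print(points)
print(auc(points))
```

The "TP for fixed FP" criterion can be read off the same list of points by taking the largest TP count among points with the chosen FP count.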

Synthetic data, observations

Relevance networks, Bayesian networks, Graphical Gaussian models

Synthetic data, interventions

Cytometry data, interventions


Can we complement microarray data with prior knowledge from public databases like KEGG?

KEGG pathway, Microarray data

How do we extract prior knowledge from a collection of KEGG pathways?

B_{i,j} = (total number of edges i -> j that appear in the extracted pathways) / (total number of times the gene pair [i,j] is included in the extracted pathways)

Example: Extract 20 pathways, 10 contain [i,j], 8 contain i -> j

B_{i,j} = 8/10 = 0.8

Relative frequency of edge occurrence
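
The relative frequency can be sketched as follows, assuming each extracted pathway is given as a pair (set of genes, set of directed edges). This representation and the function name are assumptions, as is returning `None` when no pathway contains the gene pair.

```python
def edge_prior(pathways, i, j):
    """Relative frequency of edge i -> j among pathways containing both genes."""
    containing = [edges for genes, edges in pathways if i in genes and j in genes]
    if not containing:
        return None  # pair never observed together; handling is left open here
    return sum((i, j) in edges for edges in containing) / len(containing)

# Reproduce the example: 20 pathways, 10 contain the pair, 8 contain the edge.
pathways = (
    [({"i", "j"}, {("i", "j")})] * 8   # pair present, edge i -> j present
    + [({"i", "j"}, set())] * 2        # pair present, edge absent
    + [({"k"}, set())] * 10            # pair absent
)
print(edge_prior(pathways, "i", "j"))
```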

Prior knowledge from KEGG

Raf network

[Figure: Raf network with KEGG-derived prior values between 0 and 1 attached to the edges]

Prior distribution over networks

Deviation between the network M and the prior knowledge B:

Prior knowledge ∈ [0,1]

Graph ∈ {0,1}

Hyperparameter

Hyperparameter β trades off data versus prior knowledge
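
The formulas on this slide were lost in extraction. The sketch below assumes a commonly used form: an energy E(M) equal to the sum of absolute differences between the prior knowledge matrix B and the adjacency matrix M, and a prior proportional to exp(-β E(M)). The normalisation is computed by brute force over all binary matrices, so it only works for tiny examples, and acyclicity is not enforced.

```python
import math
from itertools import product

def energy(M, B):
    """Assumed deviation measure: sum of |B_ij - M_ij| over all entries."""
    return sum(abs(b - m) for row_b, row_m in zip(B, M) for b, m in zip(row_b, row_m))

def prior(B, beta):
    """Normalised prior exp(-beta * E(M)) / Z over all binary adjacency matrices."""
    n = len(B)
    graphs = [tuple(tuple(bits[r * n:(r + 1) * n]) for r in range(n))
              for bits in product((0, 1), repeat=n * n)]
    weights = [math.exp(-beta * energy(M, B)) for M in graphs]
    z = sum(weights)
    return {M: w / z for M, w in zip(graphs, weights)}

# Hypothetical prior knowledge for two genes.
B = [[0.0, 0.8],
     [0.2, 0.0]]
flat = prior(B, 0.0)     # beta = 0: prior knowledge is ignored
peaked = prior(B, 5.0)   # larger beta: graphs close to B are favoured
print(max(flat.values()), min(flat.values()))
print(max(peaked, key=peaked.get))
```

With β = 0 every graph is equally likely, so the data alone decide; as β grows, the prior increasingly favours graphs that agree with B.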


[Figure: effect of the hyperparameter β, shown for small β and for large β]

Sample networks and hyperparameters from the posterior distribution

Revision

Prior distribution

Marginal likelihood

Integral analytically tractable for Bayesian networks

Application to the Raf pathway:

Flow cytometry data and KEGG

ROC curve (TP against FP)

Four different evaluation criteria

DGE UGE

TP for fixed FP

Area under the curve (AUC)

β


Example: 4 genes, 10 time points

t1 t2 t3 t4 t5 t6 t7 t8 t9 t10

X(1) X1,1 X1,2 X1,3 X1,4 X1,5 X1,6 X1,7 X1,8 X1,9 X1,10

X(2) X2,1 X2,2 X2,3 X2,4 X2,5 X2,6 X2,7 X2,8 X2,9 X2,10

X(3) X3,1 X3,2 X3,3 X3,4 X3,5 X3,6 X3,7 X3,8 X3,9 X3,10

X(4) X4,1 X4,2 X4,3 X4,4 X4,5 X4,6 X4,7 X4,8 X4,9 X4,10


Standard dynamic Bayesian network: homogeneous model

Our new model: heterogeneous dynamic Bayesian network. Here: 2 components


Our new model: heterogeneous dynamic Bayesian network. Here: 3 components

We have to learn from the data:

• Number of different components

• Allocation of time points

Two MCMC strategies

q

k

h

Number of components (here: 3)

Allocation vector
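
A minimal sketch of how an allocation vector partitions the data, using the 4 genes by 10 time points layout of the example. The allocation shown (three contiguous components) and the function name are hypothetical; in the model, both the number of components and the allocation are sampled rather than fixed.

```python
def split_by_component(series, allocation):
    """Group the time points (columns) of a gene-by-time matrix by component label."""
    groups = {}
    for t, label in enumerate(allocation):
        groups.setdefault(label, []).append([row[t] for row in series])
    return groups

# 4 genes, 10 time points, entries named as in the example table.
series = [[f"X{g},{t}" for t in range(1, 11)] for g in range(1, 5)]
allocation = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]  # hypothetical allocation vector

groups = split_by_component(series, allocation)
print(len(groups))  # number of components
for label, columns in groups.items():
    print(label, len(columns), columns[0])
```

Each group of time points then gets its own set of parameters, which is what distinguishes the heterogeneous model from the homogeneous one.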

Synthetic study: posterior probability of the number of components

Circadian clock in Arabidopsis thaliana

Collaboration with the Institute of Molecular Plant Sciences (Andrew Millar)

• Focus on 9 circadian genes.

• 2 time series T20 and T28 of microarray gene expression data from Arabidopsis thaliana.

• Plants entrained with different light:dark cycles: 10h:10h (T20) and 14h:14h (T28)

Macrophage

Cytomegalovirus (CMV): infection

Interferon gamma (IFNγ): treatment

Collaboration with DPM

Macrophage, IFNγ: 12-hour time course measuring total RNA

[Time axis: 0 to 12 hours]

72 Agilent Arrays

Time series statistical analysis (using EDGE)

Clustering Analysis

30 min sampling

24 samples per group:

• Infection with CMV

• Pre-treatment with IFNγ

• IFNγ + CMV

CMV

Posterior probability of the number of components

IRF1

IRF2

IRF3

Literature “Known” interactions between three cytokines: IRF1, IRF2 and IRF3

Evaluation: Average marginal posterior probabilities of the edges versus non-edges


IRF1

IRF2

IRF3

Gold standard known Posterior probabilities of true interactions

AUROC scores

New model, BGe, BDe

Collaboration with the Institute of Molecular Plant Sciences at Edinburgh University

2 time series T20 and T28 of microarray gene expression data from Arabidopsis thaliana.

- Focus on 9 circadian genes: LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5, and PRR3

- Both time series measured under constant light condition at 13 time points: 0h, 2h,…, 24h, 26h

- Plants entrained with different light:dark cycles: 10h:10h (T20) and 14h:14h (T28)

Circadian rhythms in Arabidopsis thaliana

Gene expression time series plots (Arabidopsis data T20 and T28)

T28 T20

Posterior probability of the number of components

Predicted network

Blue – activation

Red – inhibition

Black – mixture

Three different line widths:

- thin = PP > 0.5

- medium = PP > 0.75

- fat = PP > 0.9



Standard dynamic Bayesian network: homogeneous model


Heterogeneous dynamic Bayesian network

Heterogeneous dynamic Bayesian network with node-specific breakpoints


Evaluation on synthetic data

X

Y(1) Y(2) Y(3)

f: three phase-shifted sinusoids

AUROC: BGe versus heterogeneous BNet without/with node-specific breakpoints

Four time series for A. thaliana under different experimental conditions (KAY,KDE,T20,T28)

Blue – activation

Red – inhibition

Black – mixture

Three different line widths:

- thin = PP > 0.5

- medium = PP > 0.75

- fat = PP > 0.9

Network obtained for merged data

[Figure: two ways of combining the data sets KAY_LL, KDE_LL, T20 and T28: monolithic (merged data) versus separate]

Propose a compromise between the two

[Figure: networks M1, M2, ..., MI, each inferred from its own data set D1, D2, ..., DI and coupled through a shared network M*]

Compromise between the two previous ways of combining the data

Original work with Adriano:

Poor convergence and mixing due to overly strong coupling effects.

Marco’s current work:

Improve convergence and mixing by weakening the coupling.

Mean absolute deviation of edge posterior probabilities (independent BN inference)

      KAY    KDE    T20    T28
KAY   ---    0.14   0.15   0.14
KDE   0.14   ---    0.19   0.15
T20   0.15   0.19   ---    0.10
T28   0.14   0.15   0.10   ---

Mean absolute deviation of edge posterior probabilities (coupled BN inference)

      KAY    KDE    T20    T28
KAY   ---    0.11   0.12   0.11
KDE   0.11   ---    0.13   0.11
T20   0.12   0.13   ---    0.06
T28   0.11   0.11   0.06   ---

Mean absolute deviation of edge posterior probabilities (independent BN minus coupled BN)

      KAY    KDE    T20    T28
KAY   ---    0.03   0.03   0.03
KDE   0.03   ---    0.05   0.03
T20   0.03   0.05   ---    0.04
T28   0.03   0.03   0.04   ---
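
The pairwise comparison in these tables can be sketched as the mean absolute difference between two matrices of edge posterior probabilities, one per data set. The matrices below are invented, and averaging over all entries (including the diagonal) is an assumption.

```python
def mean_abs_deviation(p, q):
    """Mean of |p_ij - q_ij| over all entries of two equally sized matrices."""
    diffs = [abs(a - b) for row_p, row_q in zip(p, q) for a, b in zip(row_p, row_q)]
    return sum(diffs) / len(diffs)

# Hypothetical edge posterior probabilities inferred from two data sets.
post_first = [[0.9, 0.1],
              [0.4, 0.6]]
post_second = [[0.7, 0.1],
               [0.5, 0.2]]
print(mean_abs_deviation(post_first, post_second))
```

A smaller value means the two data sets lead to more similar networks, which is the effect the coupling is meant to produce.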

Summary

• Differential equation models





• Current work

Adriano Werhli

Marco Grzegorczyk

Thank you!

Any questions?
