Inferring gene regulatory networks from transcriptomic profiles

Dirk Husmeier

Biomathematics & Statistics Scotland

Overview

• Introduction

• Application to synthetic biology

• Lessons from DREAM

Network reconstruction from postgenomic data

Accuracy

Computational complexity

Methods based on correlation and mutual information

Conditional independence graphs

Mechanistic models

Bayesian networks

Methods based on correlation and mutual information

[Figure: pairwise association patterns: direct interaction, indirect interaction, and co-regulation by a common regulator]

Shortcomings

Pairwise associations do not take the context of the system into consideration


Conditional Independence Graphs (CIGs)

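Partial correlation between nodes i and j, computed from the inverse covariance matrix C^{-1}:

$$\pi_{ij} = -\frac{(C^{-1})_{ij}}{\sqrt{(C^{-1})_{ii}\,(C^{-1})_{jj}}}$$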

Direct interaction

Partial correlation, i.e. correlation conditional on all other domain variables: Corr(X1, X2 | X3, …, Xn)

Problem: #observations < #variables, so the covariance matrix is singular

Strong partial correlation π12

Inverse of the covariance matrix
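As an illustration of the partial correlation idea above, here is a minimal Python sketch that computes partial correlations from the inverse covariance matrix. The function name `partial_correlations` and the optional `ridge` term are assumptions made for this example; the ridge term is only one simple way to make a singular covariance matrix invertible, and shrinkage estimators are a common alternative when #observations < #variables.

```python
import numpy as np

def partial_correlations(data, ridge=0.0):
    """data: array of shape (n_observations, n_variables)."""
    cov = np.cov(data, rowvar=False)
    # Assumed regularizer: ridge > 0 keeps the matrix invertible when it is singular.
    cov = cov + ridge * np.eye(cov.shape[0])
    precision = np.linalg.inv(cov)          # inverse of the covariance matrix
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)      # pi_ij = -K_ij / sqrt(K_ii K_jj)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Toy chain X1 -> X2 -> X3: X1 and X3 are correlated only indirectly.
rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + 0.5 * rng.normal(size=n)
x3 = x2 + 0.5 * rng.normal(size=n)
data = np.column_stack([x1, x2, x3])

corr = np.corrcoef(data, rowvar=False)
pcor = partial_correlations(data)
print("correlation X1,X3:        ", round(corr[0, 2], 3))
print("partial correlation X1,X3:", round(pcor[0, 2], 3))
```

In this toy chain the plain correlation between X1 and X3 is strong, while their partial correlation is close to zero, which is the distinction between an indirect and a direct interaction.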

Mechanistic models

Model parameters θ

Probability theory: likelihood

1) Practical problem: numerical optimization

2) Conceptual problem: overfitting

The maximum likelihood score increases with increasing network complexity

Overfitting problem

True pathway

Poorer fit to the data

Poorer fit to the data

Equal or better fit to the data

Regularization

E.g.: Bayesian information criterion

Maximum likelihood parameters

Number of parameters

Number of data points

Data misfit term Regularization term

[Figure: likelihood and BIC plotted against model complexity]
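The trade-off between data misfit and regularization can be sketched in Python for a Gaussian linear regression. This is only an illustration under stated assumptions: the BIC is written here in the "higher is better" form, maximized log-likelihood minus (number of parameters / 2) × log(number of data points), and the data, the function names, and the choice of two true regulators are hypothetical.

```python
import numpy as np

def gaussian_max_log_likelihood(y, X):
    """Maximized Gaussian log-likelihood of the regression y ~ X."""
    n = len(y)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ w
    sigma2 = resid @ resid / n               # maximum likelihood noise variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)

def bic_score(y, X):
    """Data misfit term minus regularization term (assumed sign convention)."""
    n, k = X.shape
    n_params = k + 1                         # regression weights plus noise variance
    return gaussian_max_log_likelihood(y, X) - 0.5 * n_params * np.log(n)

rng = np.random.default_rng(1)
n = 50
candidates = rng.normal(size=(n, 6))         # six candidate regulators
y = candidates[:, 0] - candidates[:, 1] + 0.3 * rng.normal(size=n)  # two true ones

# Nested models using the first m candidate regulators, m = 1..6.
log_liks = [gaussian_max_log_likelihood(y, candidates[:, :m]) for m in range(1, 7)]
bics = [bic_score(y, candidates[:, :m]) for m in range(1, 7)]
best_m = int(np.argmax(bics)) + 1
print("log-likelihoods:", np.round(log_liks, 2))
print("BIC scores:     ", np.round(bics, 2))
print("model size selected by BIC:", best_m)
```

The log-likelihood can only go up as regulators are added, whereas the BIC penalizes each extra parameter by log(n)/2 and therefore tends to level off or drop once the true regulators are included.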

Model selection: find the best pathway

Select the model with the highest posterior probability:

This requires an integration over the whole parameter space:

Problem: huge computational costs
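For reference, the quantities described above can be written as follows. The symbols are assumptions of this note rather than the slide's own notation: M denotes the network structure, D the data, and θ the model parameters.

```latex
P(M \mid D) \;\propto\; P(D \mid M)\, P(M),
\qquad
P(D \mid M) \;=\; \int P(D \mid \theta, M)\, P(\theta \mid M)\, d\theta
```

The integral over θ is the marginal likelihood; for mechanistic models it generally has no closed form, which is the source of the computational cost.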


Bayesian networks

Friedman et al. (2000), J. Comp. Biol. 7, 601-620

Marriage between graph theory and probability theory

Bayes net

ODE model

Model parameters θ

Bayesian networks: integral analytically tractable!

UAI 1994

Linearity assumption

[A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise

[Figure: node A with incoming edges from P1, P2, P3, P4, weighted w1, w2, w3, w4]

t1 t2 t3 t4 t5 t6 t7 t8 t9 t10

X(1) X1,1 X1,2 X1,3 X1,4 X1,5 X1,6 X1,7 X1,8 X1,9 X1,10

X(2) X2,1 X2,2 X2,3 X2,4 X2,5 X2,6 X2,7 X2,8 X2,9 X2,10

X(3) X3,1 X3,2 X3,3 X3,4 X3,5 X3,6 X3,7 X3,8 X3,9 X3,10

X(4) X4,1 X4,2 X4,3 X4,4 X4,5 X4,6 X4,7 X4,8 X4,9 X4,10

Homogeneity assumption


Example: 4 genes, 10 time points

t1 t2 t3 t4 t5 t6 t7 t8 t9 t10

X(1) X1,1 X1,2 X1,3 X1,4 X1,5 X1,6 X1,7 X1,8 X1,9 X1,10

X(2) X2,1 X2,2 X2,3 X2,4 X2,5 X2,6 X2,7 X2,8 X2,9 X2,10

X(3) X3,1 X3,2 X3,3 X3,4 X3,5 X3,6 X3,7 X3,8 X3,9 X3,10

X(4) X4,1 X4,2 X4,3 X4,4 X4,5 X4,6 X4,7 X4,8 X4,9 X4,10


Standard dynamic Bayesian network: homogeneous model

Limitations of the homogeneity assumption

Our new model: heterogeneous dynamic Bayesian network. Here: 2 components

t1 t2 t3 t4 t5 t6 t7 t8 t9 t10

X(1) X1,1 X1,2 X1,3 X1,4 X1,5 X1,6 X1,7 X1,8 X1,9 X1,10

X(2) X2,1 X2,2 X2,3 X2,4 X2,5 X2,6 X2,7 X2,8 X2,9 X2,10

X(3) X3,1 X3,2 X3,3 X3,4 X3,5 X3,6 X3,7 X3,8 X3,9 X3,10

X(4) X4,1 X4,2 X4,3 X4,4 X4,5 X4,6 X4,7 X4,8 X4,9 X4,10


Our new model: heterogeneous dynamic Bayesian network. Here: 3 components

Learning with MCMC

q

k

h

Number of components (here: 3)

Allocation vector


Non-homogeneous model

Non-linear model

BGe: Linear model

[A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise

Can we get an approximate nonlinear model without data discretization?

Idea: piecewise linear model

[Figure: y plotted against x]
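The piecewise linear idea can be illustrated with a small Python sketch that approximates a nonlinear relationship by two straight-line segments and chooses the breakpoint by least squares. This is a simplified stand-in, not the changepoint model of the talk; the function names, the exhaustive breakpoint search, and the toy data are assumptions.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line; returns (slope, intercept) and the residual sum of squares."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return coef, float(resid @ resid)

def fit_two_segments(x, y, min_points=3):
    """Try every split position and keep the one with the smallest total misfit."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = None
    for split in range(min_points, len(x) - min_points + 1):
        left, rss_left = fit_line(x[:split], y[:split])
        right, rss_right = fit_line(x[split:], y[split:])
        total = rss_left + rss_right
        if best is None or total < best[0]:
            best = (total, x[split], left, right)
    return best

# Toy nonlinear response: shallow slope below x = 5, steep slope above.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 60)
y = np.where(x < 5.0, 0.2 * x, 1.0 + 2.0 * (x - 5.0)) + 0.1 * rng.normal(size=x.size)

rss_two, break_x, left_coef, right_coef = fit_two_segments(x, y)
_, rss_one = fit_line(x, y)
print("single line misfit:", round(rss_one, 2))
print("two segments misfit:", round(rss_two, 2), "breakpoint near x =", round(break_x, 2))
```

No discretization of the data is needed: each segment keeps a linear Gaussian model, and only the assignment of data points to segments changes.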

t1 t2 t3 t4 t5 t6 t7 t8 t9 t10

X(1) X1,1 X1,2 X1,3 X1,4 X1,5 X1,6 X1,7 X1,8 X1,9 X1,10

X(2) X2,1 X2,2 X2,3 X2,4 X2,5 X2,6 X2,7 X2,8 X2,9 X2,10

X(3) X3,1 X3,2 X3,3 X3,4 X3,5 X3,6 X3,7 X3,8 X3,9 X3,10

X(4) X4,1 X4,2 X4,3 X4,4 X4,5 X4,6 X4,7 X4,8 X4,9 X4,10

Inhomogeneous dynamic Bayesian network with common changepoints

Inhomogeneous dynamic Bayesian network with node-specific changepoints

t1 t2 t3 t4 t5 t6 t7 t8 t9 t10

X(1) X1,1 X1,2 X1,3 X1,4 X1,5 X1,6 X1,7 X1,8 X1,9 X1,10

X(2) X2,1 X2,2 X2,3 X2,4 X2,5 X2,6 X2,7 X2,8 X2,9 X2,10

X(3) X3,1 X3,2 X3,3 X3,4 X3,5 X3,6 X3,7 X3,8 X3,9 X3,10

X(4) X4,1 X4,2 X4,3 X4,4 X4,5 X4,6 X4,7 X4,8 X4,9 X4,10

NIPS 2009

Non-stationarity in the regulatory process

Non-stationarity in the network structure

Flexible network structure

Flexible network structure with regularization



ICML 2010

Morphogenesis in Drosophila melanogaster

• Gene expression measurements over 66 time steps of 4028 genes (Arbeitman et al., Science, 2002).

• Selection of 11 genes involved in muscle development.

Zhao et al. (2006), Bioinformatics 22

Transition probabilities: flexible structure with regularization

Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult

Application to synthetic biology



Can we learn the switch Galactose → Glucose?

Can we learn the network structure?

NIPS 2010

Node 1

Node i

Node p

Hierarchical Bayesian model


Exponential versus binomial prior distribution

Exploration of various information sharing options

Task 1: Changepoint detection

Switch of the carbon source: Galactose → Glucose

Task 2: Network reconstruction

Precision: proportion of identified interactions that are correct

Recall: proportion of true interactions that we successfully recovered
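The two definitions above translate directly into code. The following Python sketch uses hypothetical gene names and edge sets purely for illustration.

```python
def precision_recall(predicted_edges, true_edges):
    """Edges are (regulator, target) pairs; returns (precision, recall)."""
    predicted, true = set(predicted_edges), set(true_edges)
    true_positives = len(predicted & true)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true) if true else 0.0
    return precision, recall

# Hypothetical gold-standard network and prediction.
true_edges = {("G1", "G2"), ("G2", "G3"), ("G3", "G4"), ("G4", "G1")}
predicted_edges = {("G1", "G2"), ("G2", "G3"), ("G1", "G3")}

precision, recall = precision_recall(predicted_edges, true_edges)
print("precision:", round(precision, 3))  # 2 of 3 predicted edges are correct
print("recall:   ", round(recall, 3))     # 2 of 4 true edges are recovered
```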

BANJO: Conventional homogeneous DBN

TSNI: Method based on differential equations

Inference: optimization, “best” network

Sample of high-scoring networks


Feature extraction, e.g. marginal posterior probabilities of the edges
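The feature extraction step can be sketched as follows: given a sample of network structures (for example from an MCMC run), the marginal posterior probability of an edge is estimated by the fraction of sampled networks that contain it. The adjacency matrices below are hypothetical, and the function names are assumptions of this example.

```python
import numpy as np

def edge_posteriors(sampled_networks):
    """sampled_networks: iterable of 0/1 adjacency matrices (row = regulator, column = target)."""
    samples = np.asarray(sampled_networks, dtype=float)
    return samples.mean(axis=0)              # fraction of samples containing each edge

def rank_edges(posterior):
    """List all possible edges (no self-loops), most probable first."""
    n = posterior.shape[0]
    edges = [(i, j, float(posterior[i, j])) for i in range(n) for j in range(n) if i != j]
    return sorted(edges, key=lambda e: -e[2])

# Hypothetical sample of four networks over three genes.
sampled = [
    [[0, 1, 0], [0, 0, 1], [0, 0, 0]],
    [[0, 1, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 0, 1], [0, 0, 0]],
    [[0, 1, 0], [0, 0, 1], [0, 0, 0]],
]
posterior = edge_posteriors(sampled)
ranking = rank_edges(posterior)
print(posterior)
print(ranking[:3])
```

A ranking of this kind, rather than a single "best" network, is what the precision-recall and ROC assessments operate on.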

Galactose

Glucose

Prior         Coupling   Average AUC
None          None       0.70
Exponential   Hard       0.77
Binomial      Hard       0.75
Binomial      Soft       0.75

Average performance over both phases: galactose and glucose

How do we get from here …

… to there?!

Lessons from DREAM



DREAM: Dialogue for Reverse Engineering Assessments and Methods

International network reconstruction competition: June-Sept 2010

Network                 # Transcription factors   # Genes   # Chips
Network 1 (in silico)   195                       1643      805
Network 2               99                        2810      160
Network 3               334                       4511      805
Network 4               333                       5950      536

Our team

Marco Grzegorczyk, University of Dortmund, Germany

Frank Dondelinger, BioSS / University of Edinburgh, United Kingdom

Sophie Lèbre, Université de Strasbourg, France

Andrej Aderhold, BioSS / University of St Andrews, United Kingdom

Our model: developed for time series

Data: different experimental conditions, perturbations (e.g. ligand injection), interventions (e.g. gene knock-out, overexpression), time points

How do we get an ordering of the genes?

PCA

SOM

No time series → use 1-dim SOM to get a chip order

Ordering of chips → changepoint model
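To make the ordering step concrete, here is a minimal Python sketch that orders chips along the first principal component of the expression matrix. It uses PCA rather than the 1-dim SOM named above, simply because it fits in a few lines; the function name, the simulated data, and the hidden ordering variable are all assumptions of this example.

```python
import numpy as np

def order_chips_by_first_pc(expression):
    """expression: array of shape (n_chips, n_genes). Returns chip indices sorted along PC1."""
    centered = expression - expression.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, 0] * s[0]                  # projection of each chip onto PC1
    return np.argsort(scores)

# Simulated chips driven by one hidden ordering variable, then shuffled.
rng = np.random.default_rng(3)
latent = np.linspace(0.0, 1.0, 20)
loadings = rng.normal(size=8)
shuffle = rng.permutation(20)
expression = np.outer(latent, loadings)[shuffle] + 0.001 * rng.normal(size=(20, 8))

order = order_chips_by_first_pc(expression)
recovered = latent[shuffle][order]           # hidden variable in the recovered chip order
print(np.round(recovered, 2))
```

The sign of a principal component is arbitrary, so the recovered order may run in either direction; for a changepoint model over the ordered chips this does not matter.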

Problems with MCMC convergence



PNAS 2009

Linear model

[A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise

L1 regularized linear regression
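A minimal Python sketch of L1 regularized linear regression for one target gene is given below, using cyclic coordinate descent with soft thresholding. It is an illustration under stated assumptions, not the implementation used in the competition: the objective scaling, the penalty value `lam=0.2`, the iteration count, and the simulated data are all hypothetical choices.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; exactly zero inside [-threshold, threshold]."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1 / 2n) * ||y - X w||^2 + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            partial_resid = y - X @ w + X[:, j] * w[j]   # residual without regulator j
            rho = X[:, j] @ partial_resid / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

# Simulated target gene with two true regulators among ten candidates.
rng = np.random.default_rng(4)
n, p = 100, 10
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[[0, 3]] = [1.5, -2.0]
y = X @ true_w + 0.1 * rng.normal(size=n)

w_hat = lasso_coordinate_descent(X, y, lam=0.2)
selected = {int(j) for j in np.flatnonzero(np.abs(w_hat) > 1e-8)}
print("estimated weights:", np.round(w_hat, 2))
print("selected regulators:", sorted(selected))
```

The L1 penalty sets most weights exactly to zero, so the non-zero weights can be read as candidate regulators of the target gene; the surviving weights are shrunk towards zero relative to their true values.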




Assessment

Participants: had to submit rankings of all interactions

Organisers: computed areas under

1) Precision-recall curves

2) ROC curves (plotting sensitivity = recall against specificity)

Uncertainty about the best network structure

Limited number of experimental replications, high noise






High-confidence edge

High-confidence non-edge

Uncertainty about edges

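ROC curves

True positive rate (sensitivity) plotted against false positive rate (complementary specificity)

Definition of metrics

Recall = sensitivity = true positive rate = TP / total number of true edges

Precision = TP / total number of predicted edges

False positive rate = complementary specificity = FP / total number of non-edges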
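The following Python sketch shows one way to turn a ranking of interactions into areas under the ROC and precision-recall curves. It is a simplified illustration, not the organisers' scoring code: it assumes no tied scores, starts the precision-recall curve at precision 1, and integrates with the trapezoidal rule; the scores and labels are hypothetical.

```python
import numpy as np

def roc_and_pr_areas(scores, labels):
    """scores: confidence per candidate edge; labels: 1 for a true edge, 0 for a non-edge."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    ranked = np.asarray(labels, dtype=int)[order]
    tp = np.cumsum(ranked)                   # true positives among the top-k edges
    fp = np.cumsum(1 - ranked)               # false positives among the top-k edges
    tpr = np.concatenate([[0.0], tp / tp[-1]])        # recall / sensitivity
    fpr = np.concatenate([[0.0], fp / fp[-1]])        # complementary specificity
    precision = np.concatenate([[1.0], tp / (tp + fp)])
    auroc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    aupr = float(np.sum(np.diff(tpr) * (precision[1:] + precision[:-1]) / 2.0))
    return auroc, aupr

# Hypothetical ranking of six candidate edges, three of them true.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
auroc, aupr = roc_and_pr_areas(scores, labels)
print("AUROC:", round(auroc, 3))
print("AUPR: ", round(aupr, 3))
```

A perfect ranking gives an area of 1 under both curves, and a random ranking gives an expected AUROC of 0.5.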

The relation between Precision-Recall (PR) and ROC curves


[Figure: PR and ROC curves, each annotated "Better performance"]


Sensitivity: proportion of recovered true edges

Specificity: proportion of avoided non-edges

AUROC = 0.5

Joint work with Wolfgang Lehrach on ab initio prediction of protein interactions

AUROC = 0.61, 0.67, 0.67

ICML 2006



Potential advantage of Precision-Recall (PR) over ROC curves

Large number of negative examples (TN+FP)

Large change in FP may have a small effect on the false positive rate

Large change in FP has a strong effect on the precision

Small difference

Large difference
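A short numerical illustration of this point, with hypothetical counts chosen only to make the effect visible: 100 true edges, 100,000 non-edges, and two predictions that recover the same 80 true edges but differ tenfold in the number of false positives.

```python
def false_positive_rate_and_precision(tp, fp, n_neg):
    """Returns (false positive rate, precision) for the given counts."""
    fpr = fp / n_neg
    precision = tp / (tp + fp)
    return fpr, precision

n_neg = 100_000          # many non-edges (TN + FP)
tp = 80                  # same number of recovered true edges in both cases

fpr_a, prec_a = false_positive_rate_and_precision(tp, fp=100, n_neg=n_neg)
fpr_b, prec_b = false_positive_rate_and_precision(tp, fp=1000, n_neg=n_neg)

print("false positive rate:", fpr_a, "vs", fpr_b)                    # small difference
print("precision:          ", round(prec_a, 3), "vs", round(prec_b, 3))  # large difference
```

The false positive rate moves by less than one percentage point, while the precision drops from roughly 0.44 to roughly 0.07, which is why the precision-recall curve separates the two predictions much more clearly.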

Room for improvement: higher-dimensional changepoint process

Perturbations

Experimental conditions
