Inferring gene regulatory networks from transcriptomic profiles Dirk Husmeier Biomathematics & Statistics Scotland
Jan 03, 2016
Overview
• Introduction
• Application to synthetic biology
• Lessons from DREAM
Network reconstruction from postgenomic data
[Figure axes: accuracy, computational complexity]
Methods based on correlation and mutual information
Conditional independence graphs
Mechanistic models
Bayesian networks
Methods based on correlation and mutual information
Shortcomings
Pairwise associations do not take the context of the system into consideration.
[Figure: direct interaction; indirect interaction; co-regulation by a common regulator]
Conditional Independence Graphs (CIGs)
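Direct interaction
Partial correlation, i.e. correlation conditional on all other domain variables: Corr(X1,X2|X3,…,Xn)
Computed from the inverse of the covariance matrix $\Sigma$:
$$\pi_{ij} = -\frac{(\Sigma^{-1})_{ij}}{\sqrt{(\Sigma^{-1})_{ii}\,(\Sigma^{-1})_{jj}}}$$
[Figure: nodes 1 and 2 joined by an edge, indicating a strong partial correlation $\pi_{12}$]
Problem: #observations < #variables
Covariance matrix is singular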
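The partial correlation computation can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: the helper names `invert` and `partial_correlations` are made up here, and the `shrinkage` argument (which scales the off-diagonal covariance entries towards zero) is only one assumed way of dealing with a singular covariance matrix.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(cov, shrinkage=0.0):
    """pi_ij = -P_ij / sqrt(P_ii * P_jj), where P is the inverse covariance."""
    n = len(cov)
    # Assumed regularizer: shrink the off-diagonal entries towards zero.
    shrunk = [[cov[i][j] if i == j else (1.0 - shrinkage) * cov[i][j]
               for j in range(n)] for i in range(n)]
    prec = invert(shrunk)
    return [[1.0 if i == j else -prec[i][j] / (prec[i][i] * prec[j][j]) ** 0.5
             for j in range(n)] for i in range(n)]


# Covariance of a chain X1 -> X2 -> X3 with unit-variance noise at each step.
chain_cov = [[1.0, 1.0, 1.0],
             [1.0, 2.0, 2.0],
             [1.0, 2.0, 3.0]]
pc = partial_correlations(chain_cov)
print(pc[0][1], pc[0][2])  # X1-X2 is direct; X1-X3 vanishes given X2
```

In the chain example X1 and X3 are marginally correlated, but their partial correlation is zero, so only the direct interactions survive.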
Mechanistic models
Model parameters θ
Probability theory → Likelihood
1) Practical problem: numerical optimization
2) Conceptual problem: overfitting
The maximum likelihood increases with increasing network complexity
Overfitting problem
[Figure: the true pathway compared with alternative pathways; two give a poorer fit to the data, one gives an equal or better fit to the data]
Regularization
E.g.: Bayesian information criterion (BIC), the sum of a data misfit term (evaluated at the maximum likelihood parameters) and a regularization term (depending on the number of parameters and the number of data points)
[Figure: likelihood and BIC plotted against model complexity]
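As a small illustration of the trade-off between data misfit and regularization, the sketch below uses one common convention, BIC = -2 log L + k log N (lower is better), with L the maximized likelihood, k the number of parameters and N the number of data points. The convention and the log-likelihood values are assumptions chosen for illustration; they are not taken from the talk.

```python
import math


def bic(max_log_likelihood, n_params, n_data):
    """Data misfit term (-2 log L) plus regularization term (k log N)."""
    return -2.0 * max_log_likelihood + n_params * math.log(n_data)


# Made-up values: the fit improves with complexity, but with diminishing returns.
candidates = {
    "too simple": {"loglik": -120.0, "k": 2},
    "true pathway": {"loglik": -100.0, "k": 4},
    "too complex": {"loglik": -99.0, "k": 8},
}
n_data = 50
scores = {name: bic(c["loglik"], c["k"], n_data) for name, c in candidates.items()}
best = min(scores, key=scores.get)
print(best)
```

The most complex candidate has the highest likelihood, yet the penalty term makes the intermediate model the preferred one.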
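Model selection: find the best pathway
Select the model with the highest posterior probability: $P(M \mid D) \propto P(D \mid M)\,P(M)$
This requires an integration over the whole parameter space: $P(D \mid M) = \int P(D \mid \theta, M)\,P(\theta \mid M)\,d\theta$
Problem: huge computational costs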
Bayesian networks
Friedman et al. (2000), J. Comp. Biol. 7, 601-620
Marriage between graph theory and probability theory
Bayes net
ODE model
Model parameters θ
Bayesian networks: integral analytically tractable!
UAI 1994
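To make the notion of an analytically tractable integral concrete, here is a minimal sketch for a much simpler conjugate model than the one used for Bayesian networks: a single regulator with y_i ~ N(w x_i, sigma2) and prior w ~ N(0, tau2). The model, the function names and the data are assumptions for illustration. The closed-form marginal likelihood is checked against brute-force numerical integration over w.

```python
import math


def log_marginal_closed_form(x, y, sigma2=1.0, tau2=1.0):
    """log p(y | x) with w integrated out analytically: y ~ N(0, sigma2*I + tau2*x*x^T)."""
    n = len(x)
    xx = sum(a * a for a in x)
    xy = sum(a * b for a, b in zip(x, y))
    yy = sum(b * b for b in y)
    log_det = n * math.log(sigma2) + math.log(1.0 + tau2 * xx / sigma2)
    quad = (yy - tau2 * xy ** 2 / (sigma2 + tau2 * xx)) / sigma2
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


def log_marginal_numeric(x, y, sigma2=1.0, tau2=1.0, lo=-10.0, hi=10.0, steps=20001):
    """The same quantity by trapezoidal integration of likelihood times prior over w."""
    def integrand(w):
        log_lik = sum(-0.5 * math.log(2.0 * math.pi * sigma2)
                      - (b - w * a) ** 2 / (2.0 * sigma2) for a, b in zip(x, y))
        log_prior = -0.5 * math.log(2.0 * math.pi * tau2) - w * w / (2.0 * tau2)
        return math.exp(log_lik + log_prior)

    h = (hi - lo) / (steps - 1)
    total = sum(integrand(lo + i * h) for i in range(steps))
    total -= 0.5 * (integrand(lo) + integrand(hi))
    return math.log(total * h)


x = [0.5, 1.0, 1.5, 2.0]
y = [0.4, 1.1, 1.4, 2.2]
print(log_marginal_closed_form(x, y), log_marginal_numeric(x, y))
```

The numerical route needs thousands of likelihood evaluations for one parameter and scales badly with dimension, whereas the closed form is a handful of sums.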
Linearity assumption
[A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise
[Figure: node A with parents P1, P2, P3, P4 and edge weights w1, w2, w3, w4]
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
X(1) X1,1 X1,2 X1,3 X1,4 X1,5 X1,6 X1,7 X1,8 X1,9 X1,10
X(2) X2,1 X2,2 X2,3 X2,4 X2,5 X2,6 X2,7 X2,8 X2,9 X2,10
X(3) X3,1 X3,2 X3,3 X3,4 X3,5 X3,6 X3,7 X3,8 X3,9 X3,10
X(4) X4,1 X4,2 X4,3 X4,4 X4,5 X4,6 X4,7 X4,8 X4,9 X4,10
Homogeneity assumption
Example: 4 genes, 10 time points
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
X(1) X1,1 X1,2 X1,3 X1,4 X1,5 X1,6 X1,7 X1,8 X1,9 X1,10
X(2) X2,1 X2,2 X2,3 X2,4 X2,5 X2,6 X2,7 X2,8 X2,9 X2,10
X(3) X3,1 X3,2 X3,3 X3,4 X3,5 X3,6 X3,7 X3,8 X3,9 X3,10
X(4) X4,1 X4,2 X4,3 X4,4 X4,5 X4,6 X4,7 X4,8 X4,9 X4,10
Standard dynamic Bayesian network: homogeneous model
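A homogeneous model applies one set of parameters to every time point. The sketch below is a deliberately reduced version under stated assumptions: one candidate parent per target gene, no intercept, least-squares scoring instead of a Bayesian score, and made-up data for the 4 genes and 10 time points. The function names are hypothetical.

```python
def lag1_fit(parent, target):
    """Least-squares fit of target[t+1] = w * parent[t], with a single w for all t."""
    xs, ys = parent[:-1], target[1:]
    w = sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs)
    rss = sum((b - w * a) ** 2 for a, b in zip(xs, ys))
    return w, rss


def best_single_parent(data, target):
    """Gene whose value at time t best predicts the target gene at time t+1."""
    return min(data, key=lambda g: lag1_fit(data[g], data[target])[1])


# Made-up expression profiles: X(2) follows X(1) with a delay of one time step.
x1 = [float(t) for t in range(1, 11)]
data = {
    "X1": x1,
    "X2": [0.0] + [0.5 * v for v in x1[:-1]],
    "X3": [1.0, -1.0] * 5,
    "X4": [2.0, 1.0, 3.0, 1.0, 2.0, 4.0, 1.0, 3.0, 2.0, 1.0],
}
print(best_single_parent(data, "X2"), lag1_fit(data["X1"], data["X2"]))
```

Because the weight is shared across all ten time points, this fit cannot follow a regulatory relationship that changes part-way through the series.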
Limitations of the homogeneity assumption
Our new model: heterogeneous dynamic Bayesian network. Here: 2 components
Our new model: heterogeneous dynamic Bayesian network. Here: 3 components
Learning with MCMC
θ: model parameters
k: number of components (here: 3)
h: allocation vector
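An allocation vector assigns each time point to one component. As a small sketch, assuming the components are contiguous segments separated by changepoints (one possible allocation scheme; the function name and the changepoint convention are hypothetical):

```python
def allocation_vector(n_points, changepoints):
    """Label each time point 1..n_points with the index of its component.

    A changepoint c (1-based) means that a new component starts at time point c.
    """
    cps = sorted(changepoints)
    return [sum(t >= c for c in cps) + 1 for t in range(1, n_points + 1)]


labels = allocation_vector(10, [4, 8])
print(labels)            # [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
print(len(set(labels)))  # 3 components
```

An MCMC sampler over such a model would propose changes to the changepoints (and hence to the number of components) alongside changes to the network structure.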
Non-homogeneous model
Non-linear model
BGe: Linear model
[A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise
Can we get an approximate nonlinear model without data discretization?
Idea: piecewise linear model
[Figure: y plotted against x]
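The piecewise linear idea can be illustrated with ordinary least squares on two segments. This is a minimal sketch with a fixed, hand-picked breakpoint and a made-up quadratic response; in the actual model the segmentation would be inferred rather than given.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, residual sum of squares)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, rss


def piecewise_rss(xs, ys, breakpoint):
    """Total residual when separate lines are fitted left and right of the breakpoint."""
    left = [(x, y) for x, y in zip(xs, ys) if x < breakpoint]
    right = [(x, y) for x, y in zip(xs, ys) if x >= breakpoint]
    return sum(fit_line(*zip(*segment))[2] for segment in (left, right))


xs = [float(v) for v in range(10)]
ys = [v * v for v in xs]  # a nonlinear (quadratic) response
print(fit_line(xs, ys)[2], piecewise_rss(xs, ys, 5.0))
```

Two linear segments already reduce the residual substantially compared with a single line, without discretizing the data.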
Inhomogeneous dynamic Bayesian network with common changepoints
Inhomogeneous dynamic Bayesian network with node-specific changepoints
NIPS 2009
Non-stationarity in the regulatory process
Non-stationarity in the network structure
Flexible network structure
Flexible network structure with regularization
ICML 2010
Morphogenesis in Drosophila melanogaster
• Gene expression measurements over 66 time steps of 4028 genes (Arbeitman et al., Science, 2002).
• Selection of 11 genes involved in muscle development.
Zhao et al. (2006), Bioinformatics 22
Transition probabilities: flexible structure with regularization
Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult
Application to synthetic biology
Can we learn the switch Galactose → Glucose?
Can we learn the network structure?
NIPS 2010
Hierarchical Bayesian model
[Figure: hierarchical model over nodes 1, …, i, …, p]
Exponential versus binomial prior distribution
Exploration of various information sharing options
Task 1: Changepoint detection
Switch of the carbon source: Galactose → Glucose
Task 2: Network reconstruction
Precision: proportion of identified interactions that are correct
Recall: proportion of true interactions that we successfully recovered
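The two definitions translate directly into set operations on edges. A minimal sketch with made-up edge sets (the gene names and the function name are placeholders):

```python
def precision_recall(predicted, true):
    """Precision and recall of a predicted edge set against the true edge set."""
    tp = len(predicted & true)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall


true_edges = {("A", "B"), ("B", "C"), ("C", "D")}
predicted_edges = {("A", "B"), ("B", "C"), ("A", "D"), ("D", "A")}
print(precision_recall(predicted_edges, true_edges))  # precision 0.5, recall 2/3
```

Edges are treated as directed here, so ("A", "D") and ("D", "A") count as two different predictions.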
BANJO: Conventional homogeneous DBN
TSNI: Method based on differential equations
Inference: optimization, “best” network
Sample of high-scoring networks
Feature extraction, e.g. marginal posterior probabilities of the edges
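Given a sample of networks, the marginal posterior probability of an edge can be estimated as the fraction of sampled networks that contain it. A minimal sketch, assuming each sampled network is represented as a set of directed edges (the sample below is made up):

```python
from collections import Counter


def edge_posteriors(networks):
    """Fraction of sampled networks in which each edge appears."""
    counts = Counter(edge for net in networks for edge in net)
    return {edge: c / len(networks) for edge, c in counts.items()}


sample = [
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("C", "B")},
    {("A", "B"), ("B", "C"), ("A", "C")},
    {("A", "B")},
]
posteriors = edge_posteriors(sample)
print(posteriors[("A", "B")], posteriors[("B", "C")])  # 1.0 0.5
```

Edges that never occur in the sample are absent from the dictionary and have an estimated probability of zero.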
[Figure panels: Galactose; Glucose]
Average performance over both phases: galactose and glucose

Prior        Coupling  Average AUC
None         None      0.70
Exponential  Hard      0.77
Binomial     Hard      0.75
Binomial     Soft      0.75
How do we get from here …
… to there?!
Lessons from DREAM
DREAM: Dialogue for Reverse Engineering Assessments and Methods
International network reconstruction competition: June-Sept 2010
Network                # Transcription factors  # Genes  # Chips
Network 1 (in silico)  195                      1643     805
Network 2              99                       2810     160
Network 3              334                      4511     805
Network 4              333                      5950     536
Our team
Marco Grzegorczyk, University of Dortmund, Germany
Frank Dondelinger, BioSS / University of Edinburgh, United Kingdom
Sophie Lèbre, Université de Strasbourg, France
Andrej Aderhold, BioSS / University of St Andrews, United Kingdom
Our model: Developed for time series
Data: Different experimental conditions, perturbations (e.g. ligand injection), interventions (e.g. gene knock-out, overexpression), time points
How do we get an ordering of the genes?
PCA
SOM
No time series → use a 1-dim SOM to get a chip order
Ordering of chips → changepoint model
Problems with MCMC convergence
PNAS 2009
Linear model
[A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise
L1-regularized linear regression
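L1-regularized linear regression drives the weights of irrelevant regulators exactly to zero. The sketch below uses cyclic coordinate descent with soft-thresholding, a standard way of solving this problem; whether the team used this particular solver is not stated in the slides. The data and the penalty value `lam` are made up, with two candidate regulators instead of four to keep the example short.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become exactly zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso(X, y, lam, n_iter=200):
    """Minimise 0.5 * sum((y - Xw)^2) + lam * sum(|w|) by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Residual of the target with the contribution of regulator j left out.
            r = [y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            norm = sum(X[i][j] ** 2 for i in range(n))
            w[j] = soft_threshold(rho, lam) / norm
    return w


# Target A depends on regulator P1 only; P2 is an irrelevant candidate.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.1, -1.9, 2.0, -2.2]
print(lasso(X, y, lam=1.0))
```

The weight of the irrelevant regulator is set exactly to zero, while the weight of the relevant one is shrunk below its least-squares value.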
Assessment
Participants: had to submit rankings of all interactions
Organisers: computed areas under
1) Precision-recall curves
2) ROC curves (plotting sensitivity = recall against 1 − specificity)
Uncertainty about the best network structure
Limited number of experimental replications, high noise
Sample of high-scoring networks
Feature extraction, e.g. marginal posterior probabilities of the edges
High-confidence edge
High-confidence non-edge
Uncertainty about edges
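ROC curves
True positive rate (sensitivity) plotted against false positive rate (complementary specificity, i.e. 1 − specificity)
Definition of metrics
Recall = TP / (total number of true edges)
Precision = TP / (total number of predicted edges)
False positive rate = FP / (total number of non-edges)
True positive rate = TP / (total number of true edges)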
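Both curves can be traced by sweeping a threshold down a ranked list of edges and applying the definitions above at each step. A minimal sketch, assuming distinct scores (ties are not handled) and using a made-up ranking; the function names are hypothetical.

```python
def roc_and_pr_points(scores, labels):
    """Curve points from a ranking; labels are 1 for true edges and 0 for non-edges."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    roc = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    pr = []             # (recall, precision)
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        roc.append((fp / n_neg, tp / n_pos))
        pr.append((tp / n_pos, tp / (tp + fp)))
    return roc, pr


def area(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
roc, pr = roc_and_pr_points(scores, labels)
print(area(roc))  # AUROC, here 8/9
```

The same `area` helper can be applied to the precision-recall points; note that the PR curve in this sketch starts at the first recall value reached rather than at zero.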
The relation between Precision-Recall (PR) and ROC curves
[Figure: PR and ROC curves, each annotated with the direction of better performance]
Sensitivity: proportion of recovered true edges
Specificity: proportion of avoided non-edges
AUROC = 0.5 (expected for random predictions)
Joint work with Wolfgang Lehrach on ab initio prediction of protein interactions
AUROC = 0.61, 0.67, 0.67
ICML 2006
Potential advantage of Precision-Recall (PR) over ROC curves
Large number of negative examples (TN+FP)
Large change in FP may have a small effect on the false positive rate
Large change in FP has a strong effect on the precision
Small difference
Large difference
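A short numerical example makes the argument concrete. The counts below are made up: 50 true positives, 100,000 negative examples, and a tenfold increase in false positives from 50 to 500.

```python
def fpr_and_precision(tp, fp, n_negatives):
    """False positive rate FP / (TN + FP) and precision TP / (TP + FP)."""
    return fp / n_negatives, tp / (tp + fp)


n_negatives = 100_000
before = fpr_and_precision(tp=50, fp=50, n_negatives=n_negatives)
after = fpr_and_precision(tp=50, fp=500, n_negatives=n_negatives)
print(before)  # false positive rate 0.0005, precision 0.5
print(after)   # false positive rate 0.005, precision about 0.09
```

The false positive rate stays below one percent in both cases, so the ROC curve barely moves, while the precision drops from 0.5 to roughly 0.09.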
Room for improvement: Higher-dimensional changepoint process
Perturbations
Experimental conditions