Methods for evaluating inference algorithms

1

Methods for evaluating inference

algorithms

June, 2005Omer Berkman

Tel Aviv University, Israel.

2

Introduction A central goal of molecular biology is to understand

the regulatory mechanisms of gene transcription and protein synthesis.

One major goal of functional genomics research is to take large sets of biological data and elucidate functional interactions between elements in a causal pathway or network.

The invention of DNA microarrays, which measure the abundance of thousands of mRNA targets simultaneously, has been an important milestone.

Many approaches to the reverse engineering of genetic regulatory networks from gene expression data have been explored.

3

Reminder of the problem

When can genetic networks be identified from microarray data?

According to Zak et al. (2002) – not too soon: Identifying the network architecture may only be possible when a rich microarray time course is coupled with a priori biological information.

Interactions between thousands of genes have to be learned from short time series of typically only about a dozen measurements.

4

It is not clear whether an elicitation of biological structures from expression data is at all possible (whether the posterior probabilities over structures can be expected to be sufficiently informative).

Most studies assess the inference results by comparing predicted regulatory interactions with those known from the biological literature.

This approach is controversial due to the absence of known gold standards, which renders the estimation unreliable and difficult.

Is the inference possible?

5

Having detected an interaction, authors tend to delve into the literature to confirm their findings.

They back up their results with interactions between proteins which are similar to the ones encoded by genes studied in the performed experiment.

First, sequence similarity does not necessarily imply similar functions.

Second, there is an inherent arbitrariness in deciding what is similar.

One may therefore suspect that some of the reported true interactions are spurious and do not really support the interactions detected in the reverse engineering procedure.

Biological literature – problem one

6

The second and more serious drawback is the difficulty in estimating the false detection rate.

This is because on predicting a gene interaction that is not supported by the literature, it is impossible to decide, without further expensive interventions in the form of multiple gene knock-out experiments, whether the algorithm has discovered a new, previously unknown interaction, or whether it has flagged a spurious edge.

Biological literature – problem two

7

Simulation motivation Some of the predicted interactions are

biologically reasonable, some appear to be unreasonable, but the validity of the vast majority is difficult to assess.

Testing all of the predicted interactions experimentally would take decades.

No suitable approach has been formulated for evaluating the effectiveness at recovering models of complex biological systems from limited data.

To overcome this limitation, recent works used biologically reasonable simulated data.

8

Simulation approach benefits

Enables evaluation of different algorithms. Enables better understanding of the inference

algorithms. Enables parameters setting such as noise level,

network topology and sampling interval. Enables learning with unlimited data. Gives indications about yet-unrealistic

environments. Enables a priori to design experiments and

data-collection protocols that are amenable to functional network inference.

9

Simulation approach - illustration

Build a known network. Create a simulator in which we know all the

rules. Sample data from it. Present the sampled data to inference

algorithm. Evaluate the algorithm output.

Smith et al. 2002

10

The first approach for evaluating the effectiveness of inference algorithms at recovering models from limited data.

They created data simulator, sampled the simulated data, and used the sampled data to evaluate the effectiveness of the algorithm they developed.

The first application to modeling complex systems at multiple biological levels of organization beyond genetic regulation.

Smith et al. (2002) - introduction

11

BRAINSIM – Smith et al. (2002)Design A simulator which models an

electrophysiological activity (0-400 Hz) and expression level of 100 genes (0-50 arbitrary units).

90% of the simulated variables are unregulated by other variables in the system and are included simply as distracters.

Smith et al. 2002

12

BRAINSIM – Initialization

Activity begins at a random value. The initial level of a gene is called its ‘target’

value, intends to correspond to its constitutive expression level.

This value is selected as a random value in the range 0-10 for up-regulated genes, 40-50 for down-regulated genes and 0-50 for the unregulated genes.

13

1. Degradation - from all genes a constant amount, chosen to be 4, is added or subtracted to move closer to their target levels.

2. Regulation - from all regulated genes a proportion, chosen to be 0.2, of their regulator’s level is added or subtracted.The unregulated genes added or subtracted a random number, chosen to be 0-5, to simulate regulation by other unmeasured processes.

3. Noise – A random amount, chosen to be 0-6, was added to or subtracted from each gene to simulate stochasticity in gene expression.

BRAINSIM – Dynamics rules

14

BRAINSIM – Simulated DataGenerated data for Activity, Gene1, Gene10 and Gene12. Note the regulation effects and the time lags, as well as the random walk of the unregulated gene.

Data sampled every 5 time steps, time 45 – 150, causing loss of information.

Quartile discretized values, causing another information loss but making the data more robust.

15

NETWORKINFERENCE – Smith et al. (2002)

Design Takes a collection of observed data as input. Searches for Bayesian networks that are good

at explaining the observed data without unnecessary complexity.

Designed in a manner that is capable of searching for Dynamic Bayesian Networks from temporal data.

16

Bayesian networks definitions

Representation of a joint probability distribution which consists of two components:

G is a directed acyclic graph (DAG) whose vertices correspond to the random variables.

θ describes a conditional distribution for each

variable, given its parents in G.

17

Bayesian networks illustration

The network structure implies that the joint distribution has the product form

P(A,B,C,D,E) = P(A)P(B|A,E)P(C|B)P(D|A)P(E)

A graph-based model that captures properties of conditional independence between variables.

Encodes the Markov assumption: Each variable is independent of its non-descendants, given its parents in the graph.

18

Find the parameters:

From Bayes rule we have:

And:

Learning network from data

* *arg max { ( | )}, arg max { ( | *, )},GG P G D P G D

( )( | ) ( | )

( )

P MP G D P D G

P D

( ) ( | ) ( ),

( | ) ( | , ) ( | )

GP D P D G P G

P D G P D G P G d

19

Bayesian networks properties

Can capture many types of relationships between variables.

Owing to their probabilistic nature, BN algorithms are capable of handling noisy data.

Effective handling of hundreds of variables.

20

BN model genetic networks

BN were first applied to this problem by Friedman (2000), Pe’er and Hartemink (2001).

At a qualitative level, the structure of a BN describes the relationships between variables in the form of conditional independence relations.

At a quantitative level, relationships between the interacting agents are described by conditional probability distributions.

The probabilistic nature of this approach is capable of handling noise inherent in both the biological processes and the microarray experiments.

21

Bayesian networks drawbacks

Networks with the same skeleton but different edge directions can have the same marginal likelihood P(D|M). Thus it’s impossible to distinguish between them on the basis of the data.

The acyclicity constraint rules out recurrent structures. However, feedback is an essential feature of biological systems.

22

Dynamic Bayesian networks illustration

Recurrent network comprising two genes with feedback that interact with each other

An interaction between genes is not instantaneous. Its effect happens with a time delay after its cause.

Equivalent DBN obtained by unfolding the recurrent network in time

Husmeier 2003

23

NETWORKINFERENCE – Search strategy

Identifying the highest-scoring network has been shown to be hard (Chickering 1996).

NETWORKINFERENCE uses simulated annealing (with extension for re-annealing to avoid becoming trapped in local maxima).

NETWORKINFERENCE also allows to modify the prior over network structures by specifying sets of links that are required to be present or required to be absent.

24

Simulated annealing overview

The concept is based on the process of annealing - melt is slowly cooled so that the system at any time is approximately in thermodynamic equilibrium.

If the change in energy is negative the new configuration is accepted. If the change in energy is positive it is accepted with a certain probability.This processes is then repeated sufficient times to give good sampling statistics for the current temperature, and then the temperature is decremented and the process repeated until a frozen state is achieved.

If the initial temperature of the system is too low or cooling is done insufficiently slowly the system may become trapped in a local minimum energy state.

25

Simulated annealing illustration

Computational Science Education Project

http://www.phy.ornl.gov/csep

26

NETWORKINFERENCE – Results

Sampling rate of 5

Recovery: 89±0.1%, Correctness: 98±0.1%

Sampling rate of 10

Recovery: 27±0.3%, Correctness: 30±0.4%

27

It is possible to sort out noise and distracters from a specific regulatory network of interest.

It is possible to evaluate and thus develop inference algorithms by using simulated data.

Shows how sampling influences the inference. Simulator and inference are based on

distinguished models Used very simple model for simulator. The inference procedure was tested for an

unrealistically large training set Conclusions are not convincing enough.

Smith et al. (2002) - discussion

28

The objective of this study is to test the viability of the BN paradigm in a realistic simulation study.

First, gene expression data are simulated from a realistic biological network.

Then, interaction networks are inferred from these data using DBN and Bayesian learning with Markov Chain Monte Carlo.

Husmeier (2003) - abstract

29

One has to resort to heuristic optimization methods.

Markov Chain Monte Carlo (Chib and Greenberg 1995): Given a network structure Mold, a new structure Mnew is proposed, with proposal probability Q(Mnew|Mold).

Inference - MCMC

Husmeier 2003

30

To avoid an explosion of the model complexity, parameters are tied such that the transition probabilities between time slices are the same.

Edges within a time slice are not allowed. A limitation was set on the maximum number of

edges converging on a single node.

DBN settings

Nachman

2004

31

Simulation study

Data are generated from a known Bayesian network.

New networks are learned and compared with the true network.

32

Evaluating inference Sensitivity is the proportion of recovered true

edges:

TP / (TP + FN)

Complementary specificity is the proportion of erroneously recovered spurious edges:

FP / (TN + FP)

Plotting the ensuing sensitivity against the corresponding complementary specificity gives the Receiver Operator Characteristic.

33

Evaluation on synthetic data - structure

Using a structure of sub-network of the yeast cell cycle, taken from Friedman et al. (2000).

In a second series of simulations, 38 redundant nodes were added to this network as confounders, giving a total of 50 nodes.

34

Evaluation on synthetic data - distribution

First conditional probability distribution: Noisy regulation according to a binomial distribution Excitation: P(on|on) = 0.9, P(on|off) = 0.1; Inhibition: P(on|on) = 0.1, P(on|off) = 0.9; Noisy XOR co-regulation:

P(on|on,on) = P(on|off,off) = 0.1, P(on|off,on) = P(on|on,off) = 0.9;

Second conditional probability distributions: Stochastic interaction where all parameters were chosen at random (binomial or trinomial).

35

Evaluation on synthetic – results (ROC curves)

Structure 1

Structure 2

T=100 T=30 T=7

Solid line shows noisy regulation, dashed line shows stochastic interaction

36

And still, this is an upper bound

The model used for inference is identical to the simulator model.

The true continuous signals are typically sampled at discrete time points, which loses information, especially if the sampling intervals are not matched to the relaxation times of the true biological processes.

Gene expression ratios are typically discretized, which inevitably adds noise and causes further loss of information.

37

Evaluation on realistic simulated data

The study applies the model regulatory

network proposed by Zak et al. (2001), but in contrast to it, the

system was augmented by

adding 41 spurious, unconnected genes (giving a total of 50 genes), which were

up- and down-regulated at random.

Husmeier 2003

38

Realistic simulated data – two experiments The first experiment (followed closely Zak’s

study): Ligand was injected at time 1000 min. Then, 12 data points were collected over 4000 min in equidistant intervals.

The second experiment (adopted a sampling strategy different from Zak): focusing on the time immediately after external perturbation, when the system is in disequilibrium. The sampler therefore collected 12 data points over a shorter interval of only 500 min immediately after ligand injection, between times 1100 and 1600 min.

39

Inference results

True Learned

Solid arrows show true edges and dashed arrows represent spurious edges

The most restrictive prior (maximum fan-in = 2). The threshold θ was chosen so as to obtain the

same number of edges between non-spurious nodes as in the true network.

40

Realistic simulated data – Results

Experiment1

Experiment2

Fan-in=2 Fan-in=3 Fan-in=4

Averaged over three MCMC simulations. The solid line shows the ROC curve obtained from gene expression data alone, while the dash-dotted line shows the ROC curve obtained when including sequence information.

41

Realistic simulated data – Discussion

The inclusion of available prior knowledge improves the performance of the inference.

Sampling over a long time interval, which covers the system in equilibrium, gives areas that are small, and the low slope of the curves at the left-hand side implies that even the dominant true edges are obscured by a large proportion of spurious edges.

Sampling to a short time interval, when the system is in a perturbed non-equilibrium state, gives areas that are significantly increased, and the larger slope of the curves implies that the dominant true edges are obscured by far fewer spurious edges.

42

Discussion The search for new genetic interactions is hard,

but it is significantly more effective than a search from tabula rasa.

Simulations quantify how much can be learned from the data in the described unfavourable situation.

It demonstrates how the network inference performance varies with the training set size, the prior assumptions, the experimental sampling strategy and the inclusion of biological information.

43

References Friedman et al., “Learning the structure of dynamic probabilistic networks”,

Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998.

Friedman et al., “Using Bayesian Networks to Analyze Expression Data”, Journal of Computational Biology, 2000.

Husmeier Dirk, “Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks”, Bioinformatics 2003.

Nachman et al., “Inferring quantitative models of regulatory networks from expression data”, Bioinformatics 2004.

Smith et al. “Evaluating functional network inference using simulations of complex biological systems”, Bioinformatics 2002.

Yu et al.,“Advances to Bayesian network inference for generating causal networks from observational biological data”, Bioinformatics 2004.

Zak et al.,“Local Identifiability: when can genetic networks be identified from microarray data?”, Proceedings of the Third International Conference on Systems Biology 2004.