System Identification methods for Reverse Engineering Gene Regulatory Networks
by
Zhen Wang
A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science
Queen’s University
Kingston, Ontario, Canada
October 2010
Copyright © Zhen Wang, 2010
5.1 Interaction Matrix summed over 100 synthetic datasets by FOS: for the target gene at the jth column, the ijth entry of the matrix denotes the number of times this regulation from the regulator gene on the ith row was discovered in 100 synthetic datasets. Entries in bold are the actual regulations.
5.2 Interaction Matrix summed over 100 synthetic datasets by PCI: for the target gene at the jth column, the ijth entry of the matrix denotes the number of times this regulation from the regulator gene on the ith row was discovered in 100 synthetic datasets. Entries in bold are the actual regulations.
5.3 Comparisons of the inferred networks of Synthetic Data by using FOS and PCI
5.4 Comparisons of the inferred networks of Brainsim Simulated Data by using FOS and PCI
5.5 Comparisons of the inferred networks of yeast Saccharomyces cerevisiae Data by using FOS, PCI and two other available studies.
2.2 Brief illustration of Gene Expression.
2.3 Schematic illustration of one simple gene regulatory network.
2.4 Steps of a cDNA microarray experiment
2.5 A simple Bayesian Network Model: five genes; there is an edge directed from A to D, A is the parent of D and D is its child.
3.1 A simple example explaining the relationship between regulation weight matrix and GRN.
3.2 Predefined network Structure for the synthetic data
3.3 Network Structure of the GRN simulated in Brainsim Songbird Data
3.4 The Target Pathways of these 14 genes available from KEGG
4.1 Structure of a PCI model
4.2 Structure of a multiple input/single output PCI model
4.3 Structure of the modified PCI model
5.1 System Identification of FOS: Starred points are actual system outputs and solid lines denote the estimated system output using the identified model.
5.2 System Identification of PCI: Starred points are actual system outputs and solid lines denote the estimated system output using the identified model.
5.3 The histograms of the number of times that one pair of regulation is discovered from 100 synthetic datasets by (a) FOS (b) PCI
5.4 The final estimated networks of Synthetic Data by (a) FOS (b) PCI. Solid links are correctly discovered, TP; dashed links are missing ones, FN.
5.5 The histogram of the top 50 significant regulations discovered from 750 Brainsim Songbird datasets by FOS.
5.6 The final estimated networks of Brainsim songbird Data using FOS. Dashed lines mean the regulations that FOS could not recover.
5.7 The histogram of the top 50 significant regulations discovered from 750 Brainsim Songbird datasets by PCI
5.8 The final estimated networks of Brainsim songbird Data using PCI. Dashed lines mean the regulations that PCI could not recover.
5.9 The yeast cell cycle pathway inferred from Spellman data using different methods: (a) FOS (b) PCI (c) Kim [35], and (d) Zhang [85].
Chapter 1
Introduction
1.1 Gene Regulatory Networks
Genes are the basic physical and functional units of heredity. They carry all the
information relevant to what an organism is like, how it survives, and how it behaves
in an environment [67]. Proteins are the building blocks of living cells. They are
the products of genes: a gene is first transcribed into an intermediate messenger
ribonucleic acid (mRNA), and the mRNA molecule is then translated into a specific
protein. Genes in cells do not function individually; they are controlled through
intricate interconnections of cellular components, such as proteins. The gene
transcription process is controlled by a collection of proteins called Transcription
Factors (TFs), which determine when and how much specific genes are expressed, and
it is also affected by different types of enzymes, a group of proteins that catalyze
reactions [82]. These proteins are themselves products of genes, and they serve as
TFs or enzymes that take part in the expression of their target genes. The process
of genes interacting with each other can be described as a Gene Regulatory Network
(GRN). Research on GRNs can help explain why the behavior of one gene coincides
with variations in other genes.
GRNs are likely the most important organizational level in the cell, where internal
signals and the external environment are integrated in terms of the corresponding
timed expression levels of genes [10]. They act as biochemical computers in cellular
processes, organizing the level of expression for each gene in the network by
controlling whether and at what rate that gene will be transcribed. As a result,
different cells produce different types and amounts of proteins, which allows each
cell to function properly.
Temporal gene expression data are observations of genetic activity levels over a
number of time points. The advent of new high-throughput technologies for acquiring
gene expression data, such as Microarrays, has made a wealth of molecular data
available. Reverse engineering GRNs refers to the discovery of the principles
and structures of GRNs using gene expression data; it has received a great deal of
attention in recent years. Computational methods have been applied to mine
meaningful interactions between genes.
1.2 Motivation
Reverse engineering GRNs is an important issue in Bioinformatics, and can
substantially improve the understanding of biological systems on several fronts:
(i) clarification of the complex mechanisms of development and evolution in living
organisms [13]; (ii) description of the underlying network structure of gene
regulation pathways [78]; (iii) detection of pathway initiators that are potential
causes of particular genetic diseases, and extraction of possible drug targets [26];
and (iv) provision of information on possible novel regulations for future research.
Deriving a GRN from gene expression data, however, is often difficult, due to the lack
of complete knowledge of the processes and parameters of the biological system and
its environment.
Numerous computational methods have been developed and investigated to construct
GRNs from gene expression data. Popular reverse engineering methods include
Association Networks [19, 5], Boolean Networks [31], Bayesian Networks, and
Dynamic Bayesian Networks [22, 62]. These methods build upon mathematical or
statistical algorithms to reconstruct networks using correlation, mutual information,
or conditional dependence between genes. System identification algorithms are a
category of reverse engineering methods that have been applied mainly in engineering
domains [57]. GRNs are biological systems that reflect the interconnected
relationships of genes, and temporal measurements of gene expression can be treated
as time series signals. Therefore, system identification algorithms are able to
build models that reveal the dynamic behavior of gene regulation. They fit models
of dynamic systems to temporal data, and typically represent quantitative aspects.
These data-driven approaches construct models from measured input-output data,
giving the best fit to the gene expression data. The inferred models use the target
gene in a network as the output and its regulating genes as the inputs. As a result,
a structural gene network is obtained. Several system identification approaches
using different models, such as linear models [18, 79] and models consisting of
ordinary differential equations [15, 64], have been discussed recently for inferring
gene regulatory networks.
1.3 Objectives
In this thesis, two system identification algorithms, Fast Orthogonal Search (FOS)
and Parallel Cascade Identification (PCI), are discussed and implemented to build
dynamic models of GRNs. Both FOS and PCI were originally developed for nonlinear
system identification [37, 39], and have been applied in other engineering fields.
Interactive dynamic models of a synthetic dataset, a simulated songbird dataset,
and a real biological dataset are built using FOS and PCI. For every model, a GRN
is constructed that captures the time course variations of genes based on the
expression of their regulators. The performance of the two approaches is compared
with each other, as well as with other methods published in the literature, for
verification.
1.4 Contribution
The primary contributions of this work are reported here:
• Two system identification algorithms, FOS and PCI, are presented for building
dynamic models that can capture genetic regulation information. To the best
of the author’s knowledge, neither FOS nor PCI has been used for this purpose
before in the literature.
• A modification to the PCI algorithm is proposed. For a multiple-input/single-output
system, the original PCI algorithm considers only one input signal for the dynamic
system at a time; multiple input signals are added and have equal weights. The
modified method, in contrast, is able to treat multiple input signals simultaneously,
starting from the dynamic system.
• A method for building a sparse model of gene regulation from PCI is proposed.
As gene regulatory networks are known to be sparse [48], a fully connected
model does not capture the biological system well.
• Three datasets are used to evaluate and compare the algorithms' performance
in capturing GRNs.
– A time-delayed gene regulatory pathway of arbitrary structure was designed.
Its corresponding temporal artificial dataset was generated through
a stochastic function.
– A simulated temporal gene expression dataset was produced using the Brainsim
simulator introduced by [73]. It has 100 genes plus another term
named activity, and represents gene interactions in response to singing
behavior in a songbird.
– A biological dataset, comprising a subset of yeast Saccharomyces cerevisiae
data that includes the expression levels of 14 cell-cycle regulated genes
over time, was also used.
1.5 Organization of thesis
This thesis is organized as follows. Chapter 2 reviews the fundamental concepts
of molecular biology underlying GRNs. Microarray gene expression measurements
and the required preprocessing approaches are discussed. Moreover, a review of
related network inference algorithms is provided. Chapter 3 introduces the datasets
used in this study and their required preprocessing steps. The following two
chapters give a complete description of the theory and implementation of the
discussed approaches, Fast Orthogonal Search and Parallel Cascade Identification.
The statistical criteria used to evaluate each method are also introduced, and the
resulting networks are studied to illustrate the performance of the discussed
algorithms. Conclusions and future directions of this research are presented in
Chapter 6.
Chapter 2
Background
2.1 Basic Concepts in Molecular Biology
A cell is the most basic unit of an organism and is also the smallest unit making up
our bodies. There are tens of thousands of different types of cells, each of which has
unique functions; however, all cells share similarities. The most important shared
feature of cells is that they contain hereditary information in the form of
Deoxyribonucleic acid (DNA) molecules for almost all species¹, and have the basic
mechanisms for translating genetic messages into proteins. Proteins are the
fundamental structural and functional units in cells and can act as structural
components, enzyme catalysts, and antibodies [82].

¹ Some viruses have RNA genomes.

DNA is shaped as a double helix, shown in Figure 2.1(a), and consists of two long
polymers made from repeating units called nucleotides [82]. These two polymers are
complementary: the sequence of nucleotides in one strand is completely determined
by the sequence in the other strand. This feature has been recognized as one of
science’s most famous statements since Watson and Crick first presented the structure
of the DNA helix in 1953. The four nucleotides of DNA, adenine (A), guanine (G),
cytosine (C) and thymine (T), bond only to their complementary bases [82]. Adenine
in one strand can bond only with thymine in the other strand, and similarly guanine
bonds only with cytosine, as shown in Figure 2.1(b) [82].

Figure 2.1: (a) Double helix structure of Deoxyribonucleic acid; (b) Pairing rules for A, T, C, G [82]
A segment of DNA, called a gene, stores genetic code. A gene consists of a
long combination of the four different nucleotide bases. The sequence of nucleotides
in a gene determines the structure of its protein products. According to the central
dogma of molecular biology, producing a protein from the information in a gene is a
two-step process: transcription and translation. Figure 2.2 summarizes the process
of expressing a protein-encoding gene [82].
Transcription creates an equivalent messenger RNA (mRNA) copy of a portion of DNA.
Hence, the information in a gene is transcribed into an mRNA molecule. An RNA
polymerase enzyme recognizes and binds to a specific site of the DNA molecule,
which signals the initiation of transcription. In the translation step, the mRNA
produced by transcription is decoded by the ribosome to make a specific amino acid
chain, which later folds into a protein [82]. This complete process, in which a gene
gives rise to a protein, is called gene expression.

Figure 2.2: Brief illustration of Gene Expression.
DNA can be compared to a recipe in the gene expression process, because it stores
the code that instructs other components of the cell. Different portions of genes are
active in different cells; as a result, their protein products can be drastically
different. The type and amount of proteins produced in each particular cell are
extremely important for the cell to function properly.
The process of gene expression is controlled by a collection of proteins named
transcription factors (TFs). These TFs can decide when, where and at what rate a
particular gene is expressed. Because of the involvement of different TFs, which
themselves are protein products of expressed genes, genes are under regulatory
control and form complex interactions known as Gene Regulatory Networks (GRNs) [78].
A brief illustration is shown in Figure 2.3. Gene1 is first transcribed into mRNA1,
and then translated into Protein1, which serves as the TF of Gene2. Therefore, the
expression of Gene2 is determined by the product of the expression of Gene1,
and Gene1 is defined as its regulator. Furthermore, the expression processes of both
Gene2 and Gene3 are controlled by their common TF, Protein2, which is the protein
product of Gene2. Therefore, Gene2 has a self-regulation relationship in this
network, and it also functions as the regulator of Gene3. Once Protein2 binds to the
specific site on the DNA, transcription of Gene3 is activated.

Figure 2.3: Schematic illustration of one simple gene regulatory network.
2.2 Microarrays Gene Expression Measurement
Microarrays are a collection of single-stranded DNA segments deposited or synthesized
on a solid surface. They can monitor the mRNA abundance of genes in a high-throughput
fashion [69]. The single-stranded DNA segments are called probes and are
complementary to specific RNA species, based on the central dogma of molecular
biology [78]. Studies have found that the amount of mRNA is proportional to the
transcription rate of its corresponding gene [66]. Therefore, the relative
transcription rate of genes can be calculated by measuring their corresponding
mRNA levels. In this section, DNA Microarray experiments are briefly reviewed,
because gene expression data have been an important element in the advance of
reverse engineering GRNs.
Based on the type of probes used in experiments, Microarrays can be categorized
into two classes, cDNA Microarrays and oligonucleotide Microarrays [70]. The cDNA
Microarray is a widely used technology in which two samples are usually analyzed
simultaneously in a comparative fashion. To measure expression levels of genes using
a cDNA Microarray, mRNA is extracted from a test cell and a reference cell, and then
reverse transcribed into cDNA and labeled with fluorescent dyes. The test and
reference cells are labeled with dyes that are activated at different frequencies,
referred to as red and green respectively. The two fluorescently labeled samples are
then mixed and the mixture is hybridized on Microarray chips. Finally, the Microarrays
are scanned and the resulting images are analyzed to calculate gene expression values.
The steps of a cDNA microarray experiment are shown in Figure 2.4.
In oligonucleotide Microarray technology, each gene on the microarray is represented
by a set of 14 to 20 short sequences of DNA, called oligonucleotides, each of which
consists of two probes named perfect match (PM) and mismatch (MM). The DNA sequences
in every pair of PM and MM are identical, except for one nucleotide in the center of
each sequence. The PM is the exact sequence of the selected fragment of the gene. In
this approach, there is no need for reference samples. First, oligonucleotide arrays
are built onto microarray chips. Then mRNA is converted to fluorescently labeled
cDNA, followed by hybridization of the labeled cDNA samples to the Microarray.
Finally, the microarray is scanned and the resulting images are analyzed. Because the
correct gene will hybridize only to the PM, while incorrect hybridization affects both
PM and MM, the expression level of each gene is the average difference between PM and
MM [20]. The Affymetrix GeneChip is one of the most widely adopted oligonucleotide
microarray technologies.

Figure 2.4: Steps of a cDNA microarray experiment
2.3 Processing Microarray Gene Expression Data
Because of effects arising from variations in Microarray technologies and
experimental setups, gene expression measurements must be preprocessed for more
reliable data analysis. Accurate preprocessing procedures improve the comparability
of expression data. Microarray data preprocessing usually includes the following
steps [25]:
• Missing Values:
It is estimated that a microarray dataset has more than 5% missing values,
affecting more than 60% of the genes [14]. Since many data analysis methods, such
as principal component analysis, support vector machines and artificial neural
networks, require complete datasets, accurate estimation of missing values is an
important preprocessing step in microarray analysis. Repeating identical
experiments can resolve the missing value issue; however, this is costly and
time consuming [77]. A series of numerical methods have been developed to
estimate missing values: (1) replacing missing values with constants; (2) replacing
missing values with averages over time [3]; (3) the K-nearest neighbor replacement
method [77]; (4) the Bayesian principal components analysis replacement method [59];
(5) the support vector regression imputation method [80]; (6) the least squares
formulation based replacement method [34]. Considering the complexity of the
different missing value estimation algorithms, simple averaging is used in this
thesis.
• Gene Selection:
Gene expression data analysis usually focuses on differentially expressed genes
(DEGs). In a microarray experiment, the majority of genes have constant expression
levels across time. These genes do not convey any significant information; on the
contrary, they decrease efficiency and increase computational cost. As such,
several methods have been developed to select significant genes: the simplest way
to identify DEGs is to set a threshold value for detecting variation of genes;
statistical hypothesis tests, such as the t-test [11] and maximum likelihood
analysis [27], can also be used to detect DEGs; and in fold change analysis,
significant genes are determined based on the relative increase or decrease in
their expression profiles [56].
• Interpolation:
A microarray gene expression dataset usually contains far fewer time points than
genes. This is partly due to the time-consuming nature and cost of designing
experiments and acquiring data. The accuracy of many temporal data analysis
methods depends on the availability of training samples in time. Interpolation
can increase the number of samples by adding new data points within the range of
the original known measurements. Many interpolation methods are available in
numerical analysis [53]: nearest neighbor interpolation, linear interpolation,
spline interpolation, and polynomial interpolation. Appropriate interpolation can
provide more reasonable data samples for analysis.
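The three preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the thesis implementation: the function names, the use of `None` to mark a missing value, the variance threshold of 0.1, and the toy profiles are all assumptions made for the example. It uses time-averaging for missing values, a variance threshold for gene selection, and linear interpolation.

```python
from statistics import pvariance


def impute_time_average(profile):
    """Replace missing values (None) with the mean of the observed time points."""
    observed = [v for v in profile if v is not None]
    if not observed:
        raise ValueError("profile has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in profile]


def select_by_variance(profiles, threshold):
    """Keep genes whose expression variance over time exceeds the threshold."""
    return {gene: p for gene, p in profiles.items() if pvariance(p) > threshold}


def linear_interpolate(profile, factor):
    """Insert (factor - 1) evenly spaced points between consecutive samples."""
    dense = []
    for a, b in zip(profile, profile[1:]):
        for k in range(factor):
            dense.append(a + (b - a) * k / factor)
    dense.append(profile[-1])
    return dense


# Hypothetical toy data: geneB is constant over time and is filtered out.
raw = {"geneA": [1.0, None, 3.0, 5.0], "geneB": [2.0, 2.0, None, 2.0]}
filled = {gene: impute_time_average(p) for gene, p in raw.items()}
selected = select_by_variance(filled, threshold=0.1)
dense = {gene: linear_interpolate(p, factor=2) for gene, p in selected.items()}
```

In practice the threshold and the interpolation factor would be chosen per dataset; spline interpolation is a common alternative when smoother profiles are wanted.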
2.4 Network Reconstruction Algorithms
Given temporal gene expression data acquired under different experimental
conditions, a model of the gene interactions can be built through different reverse
engineering methods, and a gene regulatory network is thereby constructed. The GRN
is represented as a graphical model, whose nodes stand for a set of genes and whose
connections take on different meanings in different models. Providing an accurate
reverse engineering tool that captures a global view of gene regulation is a
challenging topic in Systems Biology.
Many reverse engineering techniques have been proposed for building gene regulatory
networks. Following different criteria, these techniques can be summarized
into several groups. Gardner and Faith [23] used mathematical graphical models
to divide them into four categories: Association networks, Boolean networks,
Bayesian Networks, and Differential Equations. Karlebach and Shamir [29] roughly
divided the various computational models for reverse engineering GRNs into three
classes based on their learning strategies: logical models, which allow a basic
understanding to be obtained; continuous models, which capture behaviors that depend
on finer timing and exact molecular concentrations; and single-molecule level models,
which follow the observation that the functionality of regulatory networks is often
affected by noise. Another broad classification, into deterministic models and
stochastic models, has also been proposed by [68]. Sima [72] reviewed different
network inference methods in two classes based on whether or not they can infer
dynamical interactions between genes. In this section, representative reverse
engineering methods, namely Association networks, Boolean networks, and Bayesian
Networks, are briefly reviewed together with their advantages and disadvantages.
The notation Gene_i denotes a gene that is associated with a random variable X_i,
whose expression level at time point t is denoted X_i(t), t = 0, . . . , T.
2.4.1 Association Networks
Association networks are amongst the simplest models for reverse engineering GRNs.
They represent GRNs using an undirected graph with edges weighted by similarities
or relevances. Popular relevance measures are covariance-based measures such as
Pearson correlation, and entropy-based measures such as mutual information.
Pearson correlation, developed by Karl Pearson, is one of the most common and
most useful measures of the linear dependence between two time series variables.
It is a coefficient calculated by dividing the covariance of the two variables by the
product of their standard deviations. The value of the coefficient ranges between −1
and 1. The closer the coefficient is to either −1 or 1, the stronger the correlation
between the variables. If the Pearson correlation coefficient is 0, the two variables
are linearly uncorrelated. The Pearson correlation coefficient between two genes
Gene1 and Gene2 is calculated as

$$\rho(X_1, X_2) = \frac{\sum_{t=0}^{T} \bigl(X_1(t) - \bar{X}_1\bigr)\bigl(X_2(t) - \bar{X}_2\bigr)}{T\,\sigma_{X_1}\sigma_{X_2}}, \qquad (2.1)$$

where $\bar{X}_i$ and $\sigma_{X_i}$ are the mean and the standard deviation of random
variable $X_i$, i = 1, 2.
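As a small illustration of eq (2.1), the coefficient can be computed directly from two expression profiles. The sketch below is an assumption-labeled example, not code from the thesis: the function name `pearson` and the toy profiles are hypothetical. The normalization by the number of samples cancels between the covariance and the two standard deviations, so it is omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations (covariance up to a constant factor).
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Root sums of squared deviations (standard deviations up to the same factor).
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cross / (dev_x * dev_y)


# Hypothetical profiles: gene2 rises with gene1, gene3 falls as gene1 rises.
gene1 = [1.0, 2.0, 3.0, 4.0]
gene2 = [2.0, 4.0, 6.0, 8.0]
gene3 = [8.0, 6.0, 4.0, 2.0]
rho_12 = pearson(gene1, gene2)  # close to +1
rho_13 = pearson(gene1, gene3)  # close to -1
```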
The Pearson correlation reaches a perfect value only when two variables are linearly
related. In contrast, mutual information can detect nonlinear dependence. It
is frequently adopted as an index to quantify the mutual dependence of two variables.
The mutual information of two random variables X1 and X2 associated with two genes
is

$$I(X_1; X_2) = \sum_{t=0}^{T}\sum_{t=0}^{T} p\bigl(X_1(t), X_2(t)\bigr)\log\left(\frac{p\bigl(X_1(t), X_2(t)\bigr)}{p\bigl(X_1(t)\bigr)\,p\bigl(X_2(t)\bigr)}\right), \qquad (2.2)$$

where p(·) is the probability, estimated from the frequencies of the corresponding
variable. The greater the mutual information, the more relevant the two variables
are to each other. If the mutual information is zero, the two variables are
independent.
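A frequency-based estimate of mutual information can be sketched as follows. The example assumes the expression levels have already been discretized, and it sums over the distinct pairs of observed levels, which is the usual way to evaluate the definition from frequencies; that reading, the function name `mutual_information`, and the toy sequences are assumptions for illustration. The result is in nats because the natural logarithm is used.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) of two discretized, equally long sequences."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("sequences must be non-empty and of equal length")
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        p_ab = c / n  # joint frequency of the level pair (a, b)
        p_a = count_x[a] / n  # marginal frequency of level a
        p_b = count_y[b] / n  # marginal frequency of level b
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


# Hypothetical binary profiles.
identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])  # equals log(2)
unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])  # equals 0
```

With short time series the frequency estimates are coarse, so values obtained this way are best read as a relative ranking of gene pairs rather than as precise quantities.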
Both Pearson correlation and mutual information have long been used in Systems
Biology to infer gene regulatory networks. D’haeseleer et al. [19] defined a distance
measure based on residual variance as $d(X_1, X_2) = 1 - \rho(X_1, X_2)^2$, where d = 0
if the two variables are perfectly correlated and d = 1 if they are uncorrelated.
Based on mutual information, a method called ARACNE was proposed by Basso et al. [5],
and it has been used for inferring genetic networks in human B cells. Simplicity and
low computational cost are the major advantages of association networks. The
limitations of such models are that they cannot reflect causality and do not take
into account that multiple genes may participate in a regulation.
2.4.2 Boolean Networks
Boolean Networks were first proposed by Kauffman [31, 30] for the purpose of modeling
gene regulation, and since then they have been extensively investigated in Systems
Biology: (1) the mapping used to study the qualitative properties of continuous
biochemical control networks through their logical structure has been further
discussed [33, 32]; (2) a model based on Boolean genetic networks has been built as
a conceptual framework to identify new drug targets for cancer treatment [26]; and
(3) Liang et al. [50] described an algorithm for inferring genetic networks from time
series of gene expression patterns using a Boolean network model, and Akutsu et al.
devised a simpler algorithm for the same problem [2].
A Boolean network uses binary variables $X_i \in \{0, 1\}$ that denote the transcript
level of Gene_i in the network as "off" or "on", and edges made up of simple Boolean
operations $F^B$: "AND", "OR" and "NOT". The state of each gene is updated as
$X_i(t + 1) = F^B_i(X_1(t), \dots, X_N(t))$. The goal of reverse engineering a Boolean
network is to find the Boolean function $F^B_i$ for each gene so that the gene
expression profile can be explained by this model. Two primary strategies have been
proposed to learn the connectivity of genes in Boolean Networks. The first computes
the mutual information between sets of two or more genes and tries to find the
smallest set of input genes that provides complete information on the output gene
[50]. The other looks for the most parsimonious set of input genes whose expression
variations are coordinated or consistent with the output gene [2].
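To make the update rule concrete, the sketch below simulates a small Boolean network forward in time. The three-gene network and its rules are hypothetical and chosen only for illustration, as are the names `step`, `simulate` and `RULES`; all genes are assumed to update synchronously. Reverse engineering addresses the inverse problem of recovering such rules from an observed trajectory.

```python
def step(state, rules):
    """Apply every gene's Boolean function to the current state synchronously."""
    return tuple(int(bool(rule(state))) for rule in rules)


def simulate(initial_state, rules, n_steps):
    """Return the list of states visited, starting from initial_state."""
    trajectory = [tuple(initial_state)]
    for _ in range(n_steps):
        trajectory.append(step(trajectory[-1], rules))
    return trajectory


# Hypothetical three-gene network (state = (Gene1, Gene2, Gene3)).
RULES = (
    lambda s: not s[2],       # Gene1(t+1) = NOT Gene3(t)
    lambda s: s[0] and s[2],  # Gene2(t+1) = Gene1(t) AND Gene3(t)
    lambda s: s[0] or s[1],   # Gene3(t+1) = Gene1(t) OR Gene2(t)
)

trajectory = simulate((1, 0, 0), RULES, n_steps=5)
```

Starting from the state (1, 0, 0), this particular network returns to its initial state after five updates, so the trajectory is a cycle.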
In contrast to Association Networks, Boolean networks capture the dynamics of gene
regulation. However, Boolean networks are limited because changes in gene expression
levels over time cannot be adequately represented by two states, and the
discretization of continuous gene expression levels into binary data is not trivial.
Furthermore, solving Boolean networks requires a large amount of experimental data,
because no constraints are placed on the form of the Boolean interaction functions
[23]. To determine a complete set of Boolean functions from data, all possible
combinations of input expression have to be considered. For a fully connected Boolean
network with N genes, approximately $2^N$ data points would be required to infer all
Boolean functions [17], since each gene can be either "off" or "on" independently.
Both Association networks and Boolean networks are simple approaches to modeling gene
regulation [6], compared with the Bayesian Networks and System Identification methods
discussed below.
2.4.3 Bayesian Networks
A Bayesian network (BN) is a probabilistic graphical model that represents a set of
variables and their conditional dependencies via a directed acyclic graph. Such a
model consists of two components: the structure G, a directed acyclic graph, and the
parameters Θ, a set of parameters of the conditional distribution of each variable
given its parents. In the graphical structure of the BN given in Figure 2.5, the
nodes stand for genes A, B, C, D, E and the edges correspond to conditional
dependencies between genes. The absence of an edge between two genes means that those
genes are conditionally independent given their parent genes; for example, B and D
are conditionally independent given their parent genes A and E. BNs follow the
first-order Markov assumption that each variable is conditionally dependent on its
parents only. The joint distribution over the set of genes can therefore be written
as the product of the probability of each gene given its parents. BNs cannot deal
with continuous values. Therefore, the probability of one gene is calculated from the
frequencies of its discretized expression levels over time.

Figure 2.5: A simple Bayesian Network Model with five genes. There is an edge directed from A to D; A is the parent of D and D is its child.
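The product form of the joint distribution mentioned above can be written out explicitly. This is the standard Bayesian network factorization; the symbol $\mathrm{Pa}(X_i)$, denoting the set of parents of $X_i$ in G, is notation introduced here for illustration and is not taken from the thesis.

```latex
P(X_1, \dots, X_N) \;=\; \prod_{i=1}^{N} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

A gene without parents contributes its marginal probability to the product, while a gene such as D in Figure 2.5, whose parents include A, contributes a term conditioned on those parents.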
Learning a BN therefore consists of learning these two components: structure
learning and parameter learning. Score-based approaches construct a BN by defining
a score function based on the posterior probability of the BN given the data, which
is then used as the criterion for selecting the optimal set of parents for each
variable. However, this selection procedure is computationally costly, because
there are too many possible local structures. Several search algorithms, such as
After C and D are available, $g_m$ can be calculated using eq(4.11):

$$g_m = \frac{C(m)}{D(m, m)}, \quad \text{for } m = 0, \dots, M. \qquad (4.11)$$

It has been proved by Korenberg in [37] that the MSE of the model defined by
eq(4.5) can be expressed as follows:

$$\text{error} = \overline{y^2(t)} - \sum_{m=0}^{M} g_m^2\, D(m, m). \qquad (4.12)$$
Comparing eq(4.7) and eq(4.12), Q(M + 1) in eq(4.10), the reduction in MSE obtained
by adding a new term $a_{M+1} p_{M+1}(t)$, takes the form

$$Q(M + 1) = g_{M+1}^2\, D(M + 1, M + 1). \qquad (4.13)$$
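The role of $g_m$, D(m, m) and Q in term selection can be illustrated with a small greedy search. This sketch deliberately uses explicit Gram-Schmidt orthogonalization of each candidate, not the implicit computation that makes FOS fast, so it should be read as a simplified stand-in rather than the FOS algorithm itself. It assumes that D(m, m) is the time average of the squared orthogonalized term and C(m) the time average of its product with the output; the function names and the toy signals are hypothetical.

```python
def mean(values):
    """Time average of a sequence."""
    values = list(values)
    return sum(values) / len(values)


def orthogonalize(candidate, basis):
    """Remove from candidate its projections onto the mutually orthogonal basis."""
    w = list(candidate)
    for b in basis:
        alpha = mean(p * q for p, q in zip(candidate, b)) / mean(q * q for q in b)
        w = [wi - alpha * bi for wi, bi in zip(w, b)]
    return w


def mse_reduction(y, w):
    """Return Q = g^2 * D for an orthogonalized term w (compare eq. 4.13)."""
    d = mean(v * v for v in w)  # assumed meaning of D(m, m)
    if d < 1e-12:  # candidate is (numerically) redundant
        return 0.0
    c = mean(a * b for a, b in zip(y, w))  # assumed meaning of C(m)
    g = c / d  # eq. (4.11)
    return g * g * d


def greedy_orthogonal_search(y, candidates, max_terms):
    """Repeatedly pick the candidate with the largest MSE reduction Q."""
    basis, chosen = [], []
    remaining = dict(candidates)
    for _ in range(max_terms):
        scored = [(name, orthogonalize(p, basis)) for name, p in remaining.items()]
        scored = [(name, w, mse_reduction(y, w)) for name, w in scored]
        if not scored:
            break
        name, w, q = max(scored, key=lambda item: item[2])
        if q <= 0.0:  # no candidate reduces the MSE any further
            break
        chosen.append((name, q))
        basis.append(w)
        del remaining[name]
    return chosen


# Hypothetical example: y = 3 * x1 + 1, with x2 unrelated to y.
candidates = {
    "const": [1.0, 1.0, 1.0, 1.0],
    "x1": [1.0, -1.0, 1.0, -1.0],
    "x2": [1.0, 1.0, -1.0, -1.0],
}
y = [4.0, -2.0, 4.0, -2.0]
chosen = greedy_orthogonal_search(y, candidates, max_terms=3)
```

In this toy case the two selected terms account for the entire mean square of y (9 + 1 = 10), so the residual error in eq(4.12) is zero and the unrelated candidate is never added.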
To select the $(M+2)$th term $p_{M+2}(t)$, we only need to carry out the above procedure
for $m = M+2$; the calculations for $m \le M+1$ need not be repeated. As
mentioned above, FOS continues to select and add the optimal candidate term
to reduce the MSE of the model until it meets a stopping criterion. Two stopping
criteria for terminating FOS are described in [37, 38]. The first is that FOS stops
searching once all candidate function terms in the candidate functional set have been
selected. The second is based on a statistical significance test: FOS is
terminated if adding a further term cannot reduce the MSE by more than fitting white
Gaussian noise would. Suppose $M$ terms have already been selected. For a given candidate
function term $p_{M+1}(t)$, the corresponding value of $Q(M+1)$ can be calculated by eq(4.13). It can be
shown that if $e(t)$ is zero-mean, independent Gaussian noise, then the correlation
coefficient $r$ satisfies
$$r = \left( \frac{Q(M+1)}{\overline{y^2(t)} - \sum_{m=0}^{M} Q(m)} \right)^{1/2} < \frac{2}{\sqrt{T-R+1}}, \tag{4.14}$$
with a probability of about 0.95, for a sufficiently long record
length $T-R+1$ [71]. Note that $1/\sqrt{T-R+1}$, the factor multiplying 2 on the R.H.S. of
eq(4.14), is the standard deviation of $r$. Moreover, the value 2 approximates 1.96, taken from
the 95% interval $-\frac{1.96}{\sqrt{T-R+1}} < r < \frac{1.96}{\sqrt{T-R+1}}$. Squaring eq(4.14) and
replacing the constant 4 by a general constant $K$, a candidate term is accepted only if
$$Q(M+1) > \frac{K}{T-R+1} \left( \overline{y^2(t)} - \sum_{m=0}^{M} Q(m) \right). \tag{4.15}$$
For example, setting $K = 4$ corresponds to a 95% C.I. [45], and setting $K = 10.9$
corresponds to a 99.9% C.I. [42].
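The relation between $K$ and the confidence level can be checked numerically. The sketch below assumes, as in the text, that $r$ is approximately normal with standard deviation $1/\sqrt{T-R+1}$, so that $\sqrt{K}$ plays the role of the two-sided normal quantile. The function names `confidence_for_K` and `accept_term` are hypothetical and not taken from the thesis implementation.

```python
from statistics import NormalDist


def confidence_for_K(K: float) -> float:
    """Two-sided normal coverage of |r| < sqrt(K) * sigma_r (assumed normal r)."""
    z = K ** 0.5
    return 2.0 * NormalDist().cdf(z) - 1.0


def accept_term(Q_new: float, mean_y2: float, Q_selected, n_samples: int,
                K: float = 4.0) -> bool:
    """Acceptance test in the form of eq (4.15).

    Q_new      : MSE reduction Q(M+1) of the candidate term
    mean_y2    : time average of y(t)^2
    Q_selected : reductions Q(0..M) of the terms already in the model
    n_samples  : record length T - R + 1
    """
    residual_mse = mean_y2 - sum(Q_selected)
    return Q_new > (K / n_samples) * residual_mse


if __name__ == "__main__":
    for K in (4.0, 10.9):
        print(K, confidence_for_K(K))
```

Under this normal approximation, $K = 4$ gives a coverage of roughly 95.4% and $K = 10.9$ roughly 99.9%, in line with the values quoted above.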
4.1.3 Network Construction using FOS
To apply FOS to gene network reverse engineering, we model the interactions of
one gene at a time in the network. Moreover, we assume that the rate of change of
a gene in time depends only on the rate of change of its regulators at the previous
time point. Consider gene expression data consisting of $N$ gene expression profiles
over $T$ time points. Focusing on one gene, $\mathrm{Gene}_j$, we treat it as the output of the
system (the target gene of the network), and the remaining $N-1$ genes constitute the
candidate function set $\xi = \{\mathrm{Gene}_1, \ldots, \mathrm{Gene}_{j-1}, \mathrm{Gene}_{j+1}, \ldots, \mathrm{Gene}_N\}$. When the
time series property is added to the system, because only the previous
time point of the regulator genes is assumed to trigger regulation, the
candidate functional set is $\xi = \{\mathrm{Gene}_1(t), \ldots, \mathrm{Gene}_{j-1}(t), \mathrm{Gene}_{j+1}(t), \ldots, \mathrm{Gene}_N(t)\}$
and the output is $\mathrm{Gene}_j(t+1)$, $t = 1, \ldots, T-1$. We do not permit self-regulation;
therefore the form defined by eq(4.3) does not include the output $y$ terms. The
time lag for the input is 1; therefore $R = 1$.
Through FOS, the corresponding MSE reduction $Q$ is calculated for every candidate function in ξ,
and the values are compared. The candidate function that yields the maximum value
of $Q$ is added to the model and deleted from the candidate functional
set ξ. Note that FOS will always select a time series to estimate the studied gene
expression profile. This procedure is repeated iteratively until either of two stopping
criteria is met: (i) adding a new function does not reduce the MSE by more
than white Gaussian noise would; or (ii) ξ is empty. The identified model is used to predict
$\mathrm{Gene}_j$ from the selected genes, which are defined as the regulators of $\mathrm{Gene}_j$. Once all
the genes $\mathrm{Gene}_j$, $j = 1, \ldots, N$, have been studied as the target, a network consisting
of all genes is constructed, in which nodes stand for genes, edges denote the regulations
between genes, and the arrows on the edges give the direction of regulation. Note
that the model built through FOS depends strongly on the predefined candidate
basis function set. One could define more complex basis functions, such as cross-products, to
construct a more complicated network.
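The per-gene procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's MATLAB implementation: it orthogonalizes candidates with explicit Gram-Schmidt rather than Korenberg's implicit recursion, uses the number of usable samples as the record length in the stopping rule, and refits the selected regulators by ordinary least squares to obtain weights. The names `fos_select` and `infer_network` are hypothetical.

```python
import numpy as np


def fos_select(y, candidates, K=10.9, max_terms=2):
    """Greedy orthogonal forward selection for one target series.

    y          : (n,) target values, e.g. Gene_j(t+1)
    candidates : (n, P) candidate series, e.g. the other genes at time t
    Returns the indices of the selected columns, in selection order.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(candidates, dtype=float)
    n = y.size
    basis = [np.ones(n)]              # w_0(t): constant term
    explained = np.mean(y) ** 2       # Q(0) for the constant term
    mean_y2 = np.mean(y ** 2)
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_terms:
        best_Q, best_j, best_w = 0.0, None, None
        for j in remaining:
            w = X[:, j].copy()
            for b in basis:           # Gram-Schmidt against chosen terms
                w -= (np.mean(w * b) / np.mean(b * b)) * b
            D = np.mean(w * w)
            if D < 1e-12:             # candidate already spanned by the model
                continue
            Q = np.mean(y * w) ** 2 / D   # g^2 * D with g = C / D
            if Q > best_Q:
                best_Q, best_j, best_w = Q, j, w
        # Stopping rule in the form of eq (4.15)
        if best_j is None or best_Q <= (K / n) * (mean_y2 - explained):
            break
        selected.append(best_j)
        remaining.remove(best_j)
        basis.append(best_w)
        explained += best_Q
    return selected


def infer_network(expr, K=10.9, max_regulators=2):
    """expr: (T, N) array, rows are time points.

    Returns an (N, N) matrix W where W[i, j] is the fitted weight of
    regulator i on target j (zero if i was not selected).
    """
    expr = np.asarray(expr, dtype=float)
    T, N = expr.shape
    W = np.zeros((N, N))
    for j in range(N):
        others = [i for i in range(N) if i != j]   # no self-regulation
        y = expr[1:, j]                            # Gene_j(t+1)
        cand = expr[:-1][:, others]                # regulators at time t
        chosen = [others[c] for c in fos_select(y, cand, K, max_regulators)]
        if chosen:
            A = np.column_stack([np.ones(T - 1), expr[:-1][:, chosen]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            W[chosen, j] = coef[1:]                # skip the constant term
    return W
```

A negative entry of `W` would be read as inhibition and a positive entry as activation, following the convention used later in the results.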
4.2 Parallel Cascade Identification
Parallel Cascade Identification (PCI) builds a model of the input/output relationship of a
system using a number of cascades. Each cascade has a dynamic component, capable
of capturing the memory of the system, followed by a static polynomial component, which
enables an accurate estimation of the system output, as shown in Figure 4.1 [39].
PCI starts by approximating the system with the first cascade. The difference
between the actual system output, $y(t)$, and the first cascade output, $z_1(t)$, is called the
residue, $y_1(t)$. The residue is then treated as the output of a new system that is
approximated by the second cascade. The residue is computed again, and another
cascade is added. The process continues until the approximation error reaches a desired
threshold.
For a system represented as eq(4.1), it follows from the Stone-Weierstrass theorem [47]
that the system can be approximated by a finite-order Volterra series², that is,
$$y_s(n) = k_0 + \sum_{m=1}^{M} V_m, \quad n = 0, 1, \ldots \tag{4.16}$$
² The Volterra series was developed in 1887 by Vito Volterra. It is a model for non-linear behavior, similar to the Taylor series, but with the ability to capture ‘memory’ effects.
Figure 4.1: Structure of a PCI model
where $M$ is the order of the Volterra series and, for $m \ge 1$, the $m$th-order Volterra
functional has the form
$$V_m = \sum_{i_1=0}^{R} \cdots \sum_{i_m=0}^{R} k_m(i_1, \ldots, i_m)\, x(n-i_1) \cdots x(n-i_m), \tag{4.17}$$
where $k_m$ is the $m$th-order symmetric Volterra kernel, which can be seen as a higher-order
impulse response of the system, and $R + 1$ is the memory length, meaning
that the series output $y_s(n)$ depends only on input delays from 0 to $R$ lags.
Consider a time series $y(t)$ as the system output and $x(t)$ as the input, $t = 0, \ldots, T$,
and assume that $y(t)$ depends on input delays from 0 to $R$. PCI starts with the first
cascade to approximate the system. Let $y_i(t)$ be the residue after the $i$th cascade has
been added to the parallel cascade model, so that $y_0(t) = y(t)$. By definition,
the following equation holds:
$$y_i(t) = y_{i-1}(t) - z_i(t), \quad i = 1, 2, \ldots \tag{4.18}$$
Consider fitting the $i$th cascade to the residue $y_{i-1}(t)$, $i = 1, 2, \ldots$. The procedure
of PCI is shown in Figure 4.1 and can be briefly described as follows:
1. Define a candidate function pool $h_i$ of possible impulse responses, each of
length $R + 1$, for the dynamic system in the $i$th cascade. $h_i$ consists of
cross-correlation functions of different orders between the input, $x(t)$, and the residue,
$y_{i-1}(t)$. The cross-correlation functions are computed over a segment of the
input and output signals extending from $t = R$ to $t = T$. For example, the
first-order cross-correlation function is
$$\phi_{x y_{i-1}}(j) = \frac{1}{T-R+1} \sum_{t=R}^{T} y_{i-1}(t)\, x(t-j). \tag{4.19}$$
2. Randomly select the impulse response $h_i(j)$ from the predefined candidate
function pool, and calculate the output of the dynamic component, $u_i(t)$,
by the following equation:
$$u_i(t) = \sum_{j=0}^{R} h_i(j)\, x(t-j). \tag{4.20}$$
3. $u_i(t)$ is then treated as the input of the static system. By fitting a static polynomial $P(\cdot)$
from the input $u_i(t)$ to the residue $y_{i-1}(t)$, a cascade is completely constructed.
The cascade output is $z_i(t) = P[u_i(t)]$.
4. Calculate the MSE of the estimated model, i.e. the mean square value of the
new residue over $t = R, \ldots, T$: $\overline{y_i^2(t)} = \overline{(y_{i-1}(t) - z_i(t))^2} = \overline{y_{i-1}^2(t)} - \overline{z_i^2(t)}$.
5. Repeat this procedure until the MSE reduction caused by adding a new cascade is
less than a threshold. Similar to the stopping criterion of FOS, a further
cascade is treated as fitting only noise when its correlation coefficient
$r = \sqrt{\overline{z_{i+1}^2(t)} / \overline{y_i^2(t)}}$ satisfies
$|r| < 2/\sqrt{T-R+1}$, which holds with a probability of about 95% for white Gaussian noise.
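The five steps above can be sketched for the single-input case as follows. This is an illustrative sketch under several assumptions that are not stated in the text: the candidate pool holds the first-order cross-correlation plus slices of the second-order cross-correlation (a common choice in the PCI literature), the static element is a degree-2 polynomial fitted by least squares, candidates that fail the significance test are rejected and another is drawn, and the loop ends after a fixed number of draws. The names `lag_matrix` and `pci_fit` are hypothetical.

```python
import numpy as np


def lag_matrix(x, R):
    """Rows correspond to t = R..T; columns hold x(t), x(t-1), ..., x(t-R)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x[R - j: x.size - j] for j in range(R + 1)])


def pci_fit(x, y, R=1, degree=2, max_tries=30, seed=0):
    """Fit parallel cascades; returns (cascades, mse_history)."""
    rng = np.random.default_rng(seed)
    X = lag_matrix(x, R)
    resid = np.asarray(y, dtype=float)[R:].copy()   # y_0(t) = y(t), t = R..T
    n = resid.size
    mse = [np.mean(resid ** 2)]
    cascades = []
    for _ in range(max_tries):
        if mse[-1] < 1e-12:                          # nothing left to explain
            break
        # Step 1: candidate impulse responses from cross-correlations
        pool = [X.T @ resid / n]                     # first order, eq (4.19)
        pool += [X.T @ (resid * X[:, a]) / n for a in range(R + 1)]  # assumed
        # Step 2: random choice, then output of the dynamic element, eq (4.20)
        h = pool[rng.integers(len(pool))]
        norm = np.linalg.norm(h)
        if norm < 1e-12:
            continue
        h = h / norm                                 # scale is absorbed by P
        u = X @ h
        # Step 3: static polynomial fitted by least squares
        coef = np.polyfit(u, resid, degree)
        z = np.polyval(coef, u)
        # Step 5: reject cascades that explain no more than noise would
        if np.mean(z ** 2) / mse[-1] < 4.0 / n:
            continue
        # Step 4: update the residue and record its mean square value
        resid = resid - z
        mse.append(np.mean(resid ** 2))
        cascades.append((h, coef))
    return cascades, mse
```

Because each polynomial is a least-squares fit to the current residue, the recorded mean square error cannot increase from one accepted cascade to the next, which mirrors the identity in Step 4.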
Before reverse engineering GRNs using PCI, the multiple-input case needs
to be discussed. The multiple-input case introduced in [39] is briefly reviewed here
and shown in Figure 4.2. For example, consider two input signals, $x_1(t)$ and $x_2(t)$;
the differences between the PCI procedure and the single-input case are:
• In Step 1, the candidate set for the impulse response also includes a further
term, the cross-correlation of the residue $y_{i-1}$ with both $x_1(t)$ and $x_2(t)$.
• In Step 2, to include both inputs in the system, the output of the linear system is
calculated by
$$w_i(t) = u_i(t) \pm C x_2(t - A), \tag{4.21}$$
where the sign is chosen randomly, $C$ is a convergent constant defined as
$\overline{y_{i-1}^2(t)} / \overline{y^2(t)}$, and the integer $A$ is selected randomly from $0, \ldots, R$.
To include three or more inputs in the system, the output of the linear system is calculated
by
$$w_i(t) = u_i(t) \pm \sum_{l \ge 2} C x_l(t - A_l), \tag{4.22}$$
where each $A_l$ is randomly selected from $\{0, \ldots, R\}$ and $C$ follows the previous definition.
Figure 4.2: Structure of a multiple input/single output PCI model
4.2.1 Network Construction using PCI
For reverse engineering of gene networks, the time lag is set to $R = 1$. In the
multiple-input case, if all input genes are assigned the same
coefficient $C$, then even though an acceptable mathematical model can be generated to
predict the time series of the output, the model is not a good representation of genetic
regulation. Since PCI randomly selects the impulse response, a modification is
made to PCI in this work, as shown in Figure 4.3. First the system output y(t)
Figure 4.3: Structure of the modified PCI model
is the gene expression levels of Genej over time, and the input of the system is
cascade, every time we generate a vector $H_i$ of impulse responses corresponding to
the input vector, instead of only one impulse response as in the original PCI. Assuming
$R = 1$, the output of the dynamic system is $u_i(t) = H_i X(t-1)$, which is used directly
as the input of the static polynomial system.
Empirical data indicate that gene regulatory networks should be sparse, with fewer
than two upstream regulators per gene on average [48]. Unlike FOS,
in which a criterion can be set to terminate the procedure once the maximum number
of accepted regulators is reached, PCI generates a relatively full matrix, except for its
diagonal (whose entries are zero, as self-regulation is not allowed in the models). A method
is therefore developed to reduce the number of links estimated by PCI. The regulation from
$\mathrm{Gene}_i$ to $\mathrm{Gene}_j$ is defined as significant if the entry $R_{ij}$ of the regulation weight matrix
$R$ has a large absolute value compared with the rest of the entries in the same
column. For example, if $R_{ij}$ lies more than $k$ standard deviations from the
mean of the corresponding column, it is kept for further study.
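The column-wise thresholding rule can be sketched as below. This is one possible reading of the rule rather than the thesis's implementation: it assumes that the column mean and standard deviation are computed over the off-diagonal entries only and that the population standard deviation is used. The function name `sparsify_columns` is hypothetical.

```python
import numpy as np


def sparsify_columns(weights, k=1.5):
    """Keep entry (i, j) only if it lies more than k standard deviations
    from the mean of the off-diagonal entries of column j."""
    Rw = np.asarray(weights, dtype=float)
    n = Rw.shape[0]
    off_diag = ~np.eye(n, dtype=bool)        # self-regulation is excluded
    kept = np.zeros_like(Rw)
    for j in range(n):
        rows = off_diag[:, j]
        col = Rw[rows, j]
        mu, sd = col.mean(), col.std()       # assumption: population std
        keep = rows & (np.abs(Rw[:, j] - mu) > k * sd)
        kept[keep, j] = Rw[keep, j]
    return kept
```

Note that with few genes the largest attainable deviation is bounded (for $n$ values and the population standard deviation it cannot exceed $\sqrt{n-1}$ standard deviations), so $k$ has to be chosen with the network size in mind.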
4.3 Assessment of Network Inferences
To evaluate the performance of the proposed methods, FOS and PCI, in
identifying gene regulatory networks from the datasets, statistical measures are
employed. For predictive analysis, a confusion matrix (Table 4.1) is a
table with two rows and two columns that reports the number of True Positives, False
Positives, True Negatives and False Negatives.
Table 4.1: Confusion Matrix

| | actual link | no actual link | total |
|---|---|---|---|
| predicted link | True Positives | False Positives | P' |
| no predicted link | False Negatives | True Negatives | N' |
| total | P | N | |

• True Positive (TP): the interaction that exists in both the actual network and
the network inferred by the reverse engineering methods;
• False Positive (FP): the interaction that does not exist in the actual network
but was falsely inferred by the reverse engineering methods;
• True Negative (TN): the interaction that does not exist in either the actual
network or the inferred network;
• False Negative (FN): the interaction that does exist in the actual network but
is not inferred by the reverse engineering methods.
Moreover, three other criteria, Precision (pre), Sensitivity (sen) and Specificity (spc),
are also employed as evaluation measures, and are defined as
$$\text{precision} = \frac{TP}{TP + FP} = \frac{\text{\# of correctly estimated interactions}}{\text{\# of all estimated interactions}},$$
$$\text{sensitivity} = \frac{TP}{TP + FN} = \frac{\text{\# of correctly estimated interactions}}{\text{\# of all actual interactions}},$$
$$\text{specificity} = \frac{TN}{TN + FP} = \frac{\text{\# of possible interactions that do not exist in the actual or estimated networks}}{\text{\# of possible interactions that do not exist in the actual network}}.$$
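As an illustration of these definitions, the sketch below counts the four confusion-matrix entries from two adjacency matrices and evaluates the three measures. It assumes directed links, excludes the diagonal because self-regulation is not modeled, and uses hypothetical function names (`confusion_counts`, `scores`).

```python
import numpy as np


def confusion_counts(actual, inferred):
    """actual, inferred: (N, N) arrays; entry [i, j] is truthy if i regulates j.
    Returns (TP, FP, FN, TN) over the off-diagonal entries."""
    actual = np.asarray(actual, dtype=bool)
    inferred = np.asarray(inferred, dtype=bool)
    off_diag = ~np.eye(actual.shape[0], dtype=bool)
    a, p = actual[off_diag], inferred[off_diag]
    tp = int(np.sum(a & p))
    fp = int(np.sum(~a & p))
    fn = int(np.sum(a & ~p))
    tn = int(np.sum(~a & ~p))
    return tp, fp, fn, tn


def scores(tp, fp, fn, tn):
    """Returns (precision, sensitivity, specificity); NaN if undefined."""
    nan = float("nan")
    precision = tp / (tp + fp) if tp + fp else nan
    sensitivity = tp / (tp + fn) if tp + fn else nan
    specificity = tn / (tn + fp) if tn + fp else nan
    return precision, sensitivity, specificity
```

For example, with the counts that correspond to the FOS column of Table 5.3 (10 TP, 0 FP, 1 FN, 61 TN), `scores(10, 0, 1, 61)` gives a precision of 1.0, a sensitivity of 10/11 and a specificity of 1.0.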
Chapter 5
Implementation and Results
Both FOS and PCI are implemented in MATLAB. This chapter provides details of their
implementations and the networks reverse engineered from each dataset
described in Chapter 3. First, the temporal synthetic dataset is used
to evaluate the performance of FOS and PCI. Then, the Brainsim songbird data are
analyzed and the resulting networks are compared with the actual network. Finally,
FOS and PCI are applied to the yeast datasets, and the inferred networks
are compared with the target network from KEGG and with two previous network
inference studies [35, 85] on the same data.
5.1 Analysis of the Temporal Synthetic Dataset
To evaluate the performance of FOS and PCI in learning the system network structure,
100 synthetic datasets were generated using the structure shown in Figure 3.2.
The datasets differ only in the noise values
E in eq(3.1). Both FOS and PCI are expected to identify the underlying
system network structure in all datasets. FOS and PCI were each applied to these datasets,
yielding two separate models.
Every synthetic dataset is composed of nine genes over 100 time points. FOS was
stopped using K = 10.9, or once two regulators had been selected for
a gene. The actual gene expressions and those estimated by the
models built with FOS and PCI are shown in Figures 5.1 and 5.2, respectively. In these
figures, the solid lines show the estimated gene expressions while the stars (∗) denote
actual system outputs. The system approximation errors are ∼ 0.001. The values
of MSE only provide a mathematical view of model accuracies. From Figures 5.1
and 5.2, it is clear that both methods construct accurate models.
Only one gene, Gene4, is not estimated well by the models constructed by either method.
The reason is that, to generate the synthetic datasets, the process starts by
assigning random values to Gene4 as its expression levels, from which the expression values
of the other genes are generated. PCI seems to have fitted the system better than FOS because
PCI includes more function terms in the model (up to eight
terms) than FOS (at most two terms).
Figure 5.1: System Identification of FOS: Starred points are actual system outputs and solid lines denote the estimated system output using the identified model.
Figure 5.2: System Identification of PCI: Starred points are actual system outputs and solid lines denote the estimated system output using the identified model.
5.1.1 Network Inference
Due to the preset stopping criteria, the regulatory weight matrix $R_f$ provided by FOS
is very sparse, with at most two nonzero entries in each column. The type of regulation is
defined as inhibition if the weight from the source gene to the target gene is negative,
and activation if it is positive. In contrast, $R_p$, the regulation matrix generated by PCI, is
relatively full; its $ij$th entry denotes the weight of regulation from the
regulator gene in the $i$th row to the target gene in the $j$th column. The criterion introduced
in Section 4.2.1 is used to reduce the size of the network, which makes the regulation
weight matrix sparser.
Finally, 100 inferred gene regulatory networks are available for each method. All
resulting links are summed into one matrix to decide which regulations are to be kept as
significant. In theory, there are 72 possible regulations in a network of nine
genes. The summed regulation matrices are shown in Tables 5.1 and 5.2 for the
models inferred through FOS and PCI, respectively. From Tables 5.1 and 5.2, one
can conclude that:
• All 100 synthetic datasets have similar structures.
• Both FOS and PCI perform steadily on these 100 synthetic datasets.
• The criterion proposed to threshold the network inferred by PCI is reasonable,
and can remove the insignificant regulations.
The histogram of the number of times a link is reverse engineered in the 100
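| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | - | **100** | 0 | 5 | 14 | 20 | 17 | **100** | 17 |
| 2 | 13 | - | **100** | 10 | 11 | 11 | 10 | 0 | 14 |
| 3 | 14 | 16 | - | 19 | **100** | **100** | 9 | 0 | 16 |
| 4 | 20 | 7 | **100** | - | 17 | 8 | 17 | **100** | 13 |
| 5 | **100** | 13 | 0 | 8 | - | 16 | 11 | 0 | 12 |
| 6 | 18 | 14 | 0 | 20 | 14 | - | **100** | 0 | 12 |
| 7 | 14 | 20 | 0 | 8 | 13 | 16 | - | 0 | **100** |
| 8 | 10 | 19 | 0 | 9 | 15 | **4** | 18 | - | 16 |
| 9 | 11 | 11 | 0 | 21 | 16 | 21 | 19 | 0 | - |

Table 5.1: Interaction Matrix summed over 100 synthetic datasets by FOS: for the target gene in the jth column, the ijth entry of the matrix denotes the number of times the regulation from the regulator gene in the ith row was discovered in the 100 synthetic datasets. Entries in bold are the actual regulations.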
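| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | - | **100** | 0 | 12 | 0 | 0 | 0 | **1** | 0 |
| 2 | 0 | - | **97** | 0 | 19 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | - | 27 | **100** | **97** | 0 | 0 | 0 |
| 4 | 0 | 0 | **92** | - | 0 | 9 | 0 | **100** | 0 |
| 5 | **100** | 0 | 0 | 25 | - | 1 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 6 | 0 | - | **100** | 0 | 0 |
| 7 | 0 | 0 | 0 | 2 | 0 | 0 | - | 0 | **100** |
| 8 | 0 | 0 | 0 | 6 | 0 | **1** | 0 | - | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | - |

Table 5.2: Interaction Matrix summed over 100 synthetic datasets by PCI: for the target gene in the jth column, the ijth entry of the matrix denotes the number of times the regulation from the regulator gene in the ith row was discovered in the 100 synthetic datasets. Entries in bold are the actual regulations.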
synthetic datasets is shown in Figure 5.3. There are two clearly separated parts in
each histogram. Therefore, a threshold can be set to identify significant interactions
and build an inferred network. A regulation is accepted if and only if it
appears in more than a threshold number of the 100 datasets. The threshold is set
to 90 for both FOS and PCI. The filtered regulations are used to build the final
networks for each method.
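The counting-and-thresholding step can be sketched as follows. The sketch assumes that each dataset yields a weight matrix in which a nonzero entry marks an inferred link; the function name `consensus_links` is hypothetical.

```python
import numpy as np


def consensus_links(weight_matrices, threshold=90):
    """Count how often each link is inferred across datasets and keep the
    links that appear in more than `threshold` of them.

    weight_matrices : iterable of (N, N) arrays, nonzero = inferred link
    Returns (counts, keep), where keep is a boolean (N, N) matrix.
    """
    counts = sum((np.asarray(W) != 0).astype(int) for W in weight_matrices)
    return counts, counts > threshold
```

The matrix `counts` corresponds to the summed interaction matrices of Tables 5.1 and 5.2, and `keep` to the final network; the comparison is strict, following the wording "more than a threshold number".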
Figure 5.4 shows the networks identified by FOS and PCI. Both methods are able
to reverse engineer most of the true regulations. Out of 11 true regulations, FOS
recovers 10 links, while PCI recovers nine. The regulation of Gene6 by Gene8 is missing
in both estimated models, and PCI did not find the regulation of Gene8 by Gene1. To
describe their performance more clearly, precision, sensitivity and specificity are
calculated, as shown in Table 5.3.
Table 5.3: Comparisons of the inferred networks of Synthetic Data by using FOS and PCI

| | Fast Orthogonal Search | Parallel Cascade Identification |
|---|---|---|
| Sensitivity | 10/11 = 91% | 9/11 = 82% |
| Precision | 10/10 = 100% | 9/9 = 100% |
| Specificity | 61/61 = 100% | 61/61 = 100% |
5.2 Analysis of the Brainsim Songbird Dataset
The Brainsim Songbird dataset by Smith [73] is a popular benchmark dataset used for
evaluating different network inference algorithms. FOS and PCI were also applied to
Figure 5.3: The histograms of the number of times that one regulation is discovered from the 100 synthetic datasets by (a) FOS (b) PCI
Figure 5.4: The final estimated networks of Synthetic Data by (a) FOS (b) PCI. Solid links are correctly discovered (TP); dashed links are missing ones (FN).
750 such Brainsim datasets, as mentioned in Chapter 3. All the datasets have the same
underlying network structure. The network structure for 100 genes and one activity
term in each dataset is reverse engineered. The stopping criteria for FOS are set to K = 10.9
and a maximum of 2 regulators, and for PCI, k = 1.5. Therefore,
as in the previous section, 750 regulation weight matrices are generated for each
method.
5.2.1 Network Inference for Songbird data
To discover the significant regulations, all 750 regulation matrices reverse engineered
by FOS are summed. Note that for 100 genes there are more than 10,000 possible
regulations, many of which appear only once or twice in the 750 datasets;
therefore we plot only the histogram of the 50 most significant regulations, shown in
Figure 5.5. A threshold of 300 is used to select the most significant regulations, so that
their number is comparable to the number of actual connections, which is 11. With
this threshold, we obtain 11 significant regulations, which are used to build the final
network, shown in Figure 5.6.
For the application of PCI to the Songbird Data, the histogram of the 50 most
significant regulations inferred from the 750 datasets is given in Figure 5.7. Due to the
criterion used to make the regulation weight matrix sparse, only a few regulations are
considered significant. Most of the insignificant regulations have therefore been
removed, the histogram follows a more uniform distribution, and the threshold is
set to 600. This results in 10 significant regulations to build the network, shown in
Figure 5.8.
By comparing the network inferred by FOS (Figure 5.6) with the original network
Figure 5.5: The histogram of the top 50 significant regulations discovered from 750 Brainsim Songbird datasets by FOS.
Figure 5.6: The final estimated network of Brainsim songbird Data using FOS. Dashed lines denote the regulations that FOS could not recover.
Figure 5.7: The histogram of the top 50 significant regulations discovered from 750 Brainsim Songbird datasets by PCI.
Figure 5.8: The final estimated network of Brainsim songbird Data using PCI. Dashed lines denote the regulations that PCI could not recover.
structure (Figure 3.3), it is observed that 10 of the 11 inferred interactions are
true and only one extra interaction is inferred: the regulation of Gene 5
by Activity. The co-regulation of Gene 6 by Gene 3 is missed by both methods;
it was also not predicted by the previous study of the Brainsim Songbird Data [73].
Gene 3 and Gene 5 control Gene 6 in a coordinated fashion, with the lower
expression level of the pair serving as the limiting factor in the regulation of Gene 6;
Gene 5 was found to have a lower expression level than Gene 3 in 89% of the temporal
cases, so Gene 5 nearly always serves as the effective regulator [73]. For the
network inferred through PCI (Figure 5.8), 6 of the 7 inferred interactions exist in
the actual network and 1 extra interaction, from Gene 1 to Gene 5, is inferred. Five
interactions are missed. The one incorrectly inferred interaction of each method is
a regulation of Gene 5. Both FOS and PCI are able to reverse engineer most of the
true regulations. To evaluate the accuracy of the networks obtained by FOS and
PCI, the criteria ‘precision’, ‘sensitivity’ and ‘specificity’ are calculated again; the
results are shown in Table 5.4. As shown, FOS performed better than PCI, with more
correctly detected regulations.
| | Fast Orthogonal Search | Parallel Cascade Identification |
|---|---|---|
| Sensitivity | 10/11 = 91% | 6/11 = 55% |
| Precision | 10/11 = 91% | 6/7 = 86% |
| Specificity | 10088/10089 ≈ 100% | 10084/10089 ≈ 100% |

Table 5.4: Comparisons of the inferred networks of Brainsim Simulated Data by using FOS and PCI
5.3 Analysis of Yeast Saccharomyces Cerevisiae Dataset
A biological dataset consisting of 14 genes from the yeast Saccharomyces cerevisiae [74],
including three time series, was finally used to evaluate the efficiency of the two
reverse engineering methods. The pathway of these genes in KEGG, shown in Figure
3.4, is regarded as the target network against which the performance
of FOS and PCI is compared and evaluated. Since CLN3 only works at the start of the cell cycle, we do not
consider its regulators in either method. The stopping criteria used for
this data are the same as for the previous two datasets: for FOS, K = 10.9
and a maximum of 2 regulators; for PCI, k = 1.5. For this data, one
network is inferred by each of the two methods.
5.3.1 Network Inference
As discussed in Chapter 3, the KEGG pathway is treated as the target network for
comparison. Complexes including one or several genes are each considered as a ‘gene’
in the network. There are 10 complexes in total. Five contain several genes: CLN3/CDC28, SWI4/SWI6,
MBP1/SWI6, CLN1/CLN2/CDC28, and CLB5/CLB6/CDC28. The other five nodes
consist of a single gene: CDC20, CDC6, SIC1, FAR1, and FUS. The following
assumptions are made:
• Genes CLN3 and CDC28 are only considered as possible regulators, as they
are starters of the cell cycle network.
• All discovered links from any gene in one complex to any gene in a
different complex are considered as one regulation. For example, if FOS or
PCI results in three regulations from the genes in the complex CLN3/CDC28 to
the genes in the complex SWI4/SWI6, CLN3 → SWI4, CLN3 → SWI6, and
CDC28 → SWI4, still only one regulation is used to construct the resulting
network. The weight of this regulation equals the maximum of the weights
of these three regulations.
• All regulations among genes in the same complex are ignored.
• If there exist two regulations between two complexes with different directions,
the weights of these regulations are compared, and only the direction of the
regulation with the higher weight is kept, which therefore determines the
directionality of regulation between these two complexes. For example, between
complexes $cplx_i$ and $cplx_j$, if $R_{ij}$ and $R_{ji}$ are both nonzero and $R_{ij} > R_{ji}$, then
the directionality is determined as $cplx_i \to cplx_j$. This interpretation is based
on the biological assumption that a small variation in the regulator gene will
result in a large change in the target gene.
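The assumptions above can be sketched as a post-processing step on a gene-level weight matrix. This is an illustration only: it compares weights by absolute value (an assumption, since weights may be negative for inhibition), it drops both directions if the two weights tie (the text does not specify tie handling), and it does not treat genes shared between complexes specially. The function name `collapse_to_complexes` and the weights in the usage example are hypothetical.

```python
import numpy as np


def collapse_to_complexes(W, genes, complexes):
    """Collapse gene-level links into complex-level links.

    W         : (N, N) array, W[i, j] = weight of gene i -> gene j (0 = no link)
    genes     : list of N gene names, in the order of W's rows/columns
    complexes : dict mapping complex name -> list of member gene names
    Returns {(source_complex, target_complex): weight}.
    """
    W = np.asarray(W, dtype=float)
    idx = {g: k for k, g in enumerate(genes)}
    strength = {}
    for a in complexes:
        for b in complexes:
            if a == b:                      # links inside a complex are ignored
                continue
            block = [W[idx[g], idx[h]]
                     for g in complexes[a] for h in complexes[b] if g != h]
            nonzero = [w for w in block if w != 0]
            if nonzero:                     # several links count as one
                strength[(a, b)] = max(nonzero, key=abs)
    edges = {}
    for (a, b), w in strength.items():
        back = strength.get((b, a))
        # keep only the direction with the larger weight (ties drop both)
        if back is None or abs(w) > abs(back):
            edges[(a, b)] = w
    return edges


if __name__ == "__main__":
    genes = ["CLN3", "CDC28", "SWI4", "SWI6"]
    complexes = {"CLN3/CDC28": ["CLN3", "CDC28"], "SWI4/SWI6": ["SWI4", "SWI6"]}
    W = np.zeros((4, 4))
    W[0, 2], W[0, 3], W[1, 2] = 0.2, 0.7, -0.4   # hypothetical weights
    print(collapse_to_complexes(W, genes, complexes))
```

Switching to a signed comparison, if that is what is intended, only requires replacing `key=abs` and the two `abs(...)` calls.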
The networks inferred from the yeast dataset using FOS and PCI are shown
in Figure 5.9 (a) and (b). They are also compared with the two previous studies
[35, 85]. Details of those methods are not discussed here; instead, their results are
adopted for comparison, and their resulting networks are shown in Figure 5.9 (c) and
(d), respectively.
By comparing the networks inferred using FOS and PCI with the KEGG pathway,
it is observed that at least forty percent of the interactions in the target network
are inferred by FOS and PCI, while two interactions are captured by Kim et al. [35]
and three by Zhang et al. [85]. Also, the results reverse engineered using
FOS and PCI outperform the previous studies in terms of predicting more correctly
Figure 5.9: The yeast cell cycle pathway inferred from Spellman data using different methods: (a) FOS (b) PCI (c) Kim [35], and (d) Zhang [85].
estimated and misdirected interactions. Using the information from all four reverse
engineering approaches to the cell cycle pathway of the yeast data, ‘precision’, ‘sensitivity’
and ‘specificity’ are calculated and displayed in Table 5.5, as a summary of Figure
5.9. Unlike the synthetic and songbird datasets, the yeast dataset does
not have replicate samples for analysis, so its inferred results are less statistically sound
and harder to evaluate. Even though the absolute values are not very high, they
show an improvement over the previously reported studies [35, 85], most clearly in sensitivity.

| | FOS | PCI | Kim [35] | Zhang [85] |
|---|---|---|---|---|
| TPs | 4 | 5 | 2 | 3 |
| FPs | 8 | 7 | 8 | 8 |
| Sensitivity | 40% | 50% | 20% | 30% |
| Precision | 29% | 36% | 15% | 27% |
| Specificity | 85% | 85% | 85% | 86% |

Table 5.5: Comparisons of the inferred networks of yeast Saccharomyces cerevisiae Data by using FOS, PCI and other two available studies.
Chapter 6
Summary and Conclusions
Reverse engineering gene regulatory networks from gene expression data is an
important but challenging area of research in systems biology. In this thesis, Fast
Orthogonal Search and Parallel Cascade Identification, two system identification
approaches inspired by engineering systems, are introduced and employed to construct
GRNs using temporal gene expression data. The fast convergence time of FOS, $O(n^2)$,
makes it an attractive approach for analyzing large-scale data. FOS searches all possible
regulator genes in a candidate set; iteratively, it selects the optimal one, adds it to the model
and deletes it from the candidate set. The selection procedure guarantees
that the search always selects the most significant regulator among the remaining
possible regulators. The other approach, PCI, considers all possible regulators
simultaneously, but assigns different weights to them. A modification to
this algorithm was proposed to make the regulation weight matrix generated by PCI
sparse.
To evaluate the reliability and efficiency of FOS and PCI for inferring causal
regulatory interactions from temporal gene expression data, a synthetic dataset is generated
and used. FOS can recover 10 out of 11 actual regulations in this dataset, and PCI
using the proposed criterion can infer a sparse network and recover nine out of 11
true regulations. The accuracies of the structures inferred
through both methods are quantified via three statistical evaluation criteria, ‘sensitivity’, ‘precision’ and
‘specificity’, as well as the mean square error.
FOS and PCI are also applied to the Brainsim songbird data, a temporal simulated
dataset with known structure that models the singing behavior in a songbird. The
inferred structures, quantified via the criteria ‘sensitivity’, ‘precision’ and ‘specificity’,
indicate a good performance of these two network inference approaches: only one
of the inferred interactions is a false regulation with either approach, and 10 true network
regulations are recovered through FOS and six through PCI.
Finally, the efficiency of FOS and PCI in learning the network structure is
evaluated using biological data, the temporal expression values of 14 genes in the yeast
Saccharomyces cerevisiae cell cycle data reported in [74]. The networks inferred from the
yeast data by FOS and PCI are compared with the KEGG pathway of the yeast as
the target network, and with two other yeast network inference studies on the same data,
using the evaluation criteria ‘sensitivity’, ‘precision’ and ‘specificity’. Even though the
absolute values of these criteria are not high, the results compare favorably with those of the two
previous studies for both FOS and PCI.
In conclusion, both FOS and PCI can deal with continuous gene expression data,
capture their dynamics, and build deterministic models. By modeling the input/output
relationship, they can infer the causality of the gene regulatory networks by assigning
the inputs as the regulators of the output.
6.1 Further directions
The design and application of methods for reverse engineering gene regulatory networks
from gene expression data is a key aspect of systems biology. We proposed applying
system identification algorithms, well known in mathematics and
engineering, to the reverse engineering of gene regulatory networks. A few future directions for this work
are listed below:
• Studying alternative basis functions to gene expression profile functions used as
the input in this work, to approximate the regression model of the association
between a given gene and that of its potential regulators in FOS and PCI.
• Considering biological knowledge to determine transcription factors or potential
gene regulators, assigning the gene expression functions of the potential genes
with higher probability to be selected as potential regulators. This can be done
by dividing the candidate functional set into several subsets; therefore, FOS
could start searching from the subset of a higher relevance and PCI can build
different groups of cascades by using different subsets.
• Generalizing the proposed model to one that allows different regulators
of a gene to regulate their target gene with different time lags, instead of the single
time lag assumed in this work. This could result in a more flexible network
inference model with higher accuracy.
• Incorporating biological information to determine the maximum number of po-
tential gene regulators for a given gene, instead of defining equal maximal num-
ber of regulators for all genes. Since FOS always selects regulators for a given
target gene, this prior knowledge can make an improvement.
• Because Parallel Cascade Identification randomly assigns coefficients to the in-
put in the dynamic system, applications of alternative algorithms used to gen-
erate the impulse response might decrease the computation time of PCI model.
• Further studying possible approaches for defining the significance of a regulation