Abstract
Computational Reconstruction of Biological Networks
by
Yuk Lap Yip
2009
Networks describe the interactions between different objects. In living systems, knowing
which biological objects interact with each other would deepen our understanding of the
functions of both individual objects and their working modules. Due to experimental limi-
tations, currently only small portions of these interaction networks are known. This thesis
describes methods for computationally inferring the complete networks based on the known
portions and related data. These methods exploit special data properties and problem
structures to achieve high accuracy. The training set expansion method handles sparse
and uneven training data by learning from information-rich regions of the network, and
propagating the information to help learn from the information-poor regions. The multi-
level learning framework combines information at different levels of a concept hierarchy,
and lets the predictors at the different levels propagate information to one another.
Combined optimization between levels allows the integrated use of data features at different
levels to improve prediction accuracy and noise immunity. Finally, proper incorporation of
heterogeneous data facilitates the identification of interactions uniquely detectable by each
kind of data. This thesis also describes some work on data integration and tool sharing,
which are crucial components of network analysis studies.
Computational Reconstruction of Biological Networks

Chapter 1
Introduction
The boat is not the material it is made from, but something else, much
more interesting, which organises the material of the planks: the boat is the
relationship between the planks. Similarly, the study of life should never be
restricted to objects, but must look into their relationships.
– Antoine Danchin, “The Delphic Boat” [45]
In computer science, graphs are used to represent object relationships, with each node
representing an object and each edge between two nodes stating that the two corresponding
objects have a certain relationship. Graphs are also called networks in some domains. For
example, in a computer network, each node is a machine and there is an edge between
two nodes if the machines are physically or logically connected. In a social network of
friends, each node is a person and there is an edge between two nodes if the two persons
know each other. In some contexts, the term “network” specifically means a graph with
weighted edges [24]. We shall not make such a distinction here, and shall treat “graph” and
“network” as synonyms.
This thesis is about the reconstruction of biological networks by computational means.
In these networks, each node is a biological object, and an edge represents a specific type of
interaction between two biological objects. For example, in a protein interaction network,
each node is a protein, and there is an edge between two nodes if the corresponding proteins
have a physical interaction. In a gene regulatory network, each node is a gene together with its protein products, and there is a directed edge from one node to another if the former regulates the
transcription of the latter. There are many other types of interesting biological networks,
such as metabolic networks, genetic interaction networks and co-evolution networks. A
brief introduction of the underlying biological concepts, as well as other concepts useful for
understanding the content of this thesis, is given in Chapter 2.
Knowing the interconnections between the objects in these networks is an important
first step to the greater goal of understanding the complex dynamics inside the biological
systems [106]. For example, the knowledge of what interaction partners a protein has can
help identify its function [186]. Analyzing the whole interaction network can provide in-
sights into the structures and mechanisms of physical binding, which cannot be obtained by
studying single objects alone [105]. Large-scale genetic and gene-drug interaction networks
are also useful in drug discovery [141].
While it would be ideal to have full access to the biological networks, currently only
small portions of them have been revealed experimentally [90, 177]. On the other hand,
in the past decade many high-throughput experimental techniques have been developed
and popularized to provide different kinds of information about the biological objects, most
notably gene expression measured by microarray [162] and sequence information by second-
generation sequencing [170]. The huge amount of data generated from these experiments
contain very rich information that can be utilized in predicting the unobserved portions of
the biological networks. As such, computational reconstruction of biological networks has
become an important research topic in bioinformatics [36, 171, 196, 199].
Network reconstruction can be formally cast as a machine-learning problem of the
following general form. The inputs to the problem are:
• A number of objects, each described by a vector of feature values. Some additional
features may be available for pairs of objects, such as the likelihood for a pair of objects
to interact according to some physical experiments that are not totally reliable.
• A gold standard positive set of known interactions.
• A gold standard negative set of known non-interactions.
The goal is to learn a predictor from the inputs, so that when presented with any two arbitrary objects i and j, it will predict the chance that (i, j) is an edge of the network.
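To make the setting concrete, the following is a minimal Python sketch (not taken from the thesis; the toy data, the logistic-regression predictor and the symmetric way of turning two object feature vectors into pair features are all illustrative assumptions). It trains a predictor from gold-standard positive and negative pairs and then scores an arbitrary pair (i, j).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 6 objects, each described by 4 feature values.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Gold-standard positive (interacting) and negative (non-interacting) pairs.
positives = [(0, 1), (1, 2), (3, 4)]
negatives = [(0, 3), (2, 5), (4, 5)]

def pair_features(i, j):
    # Symmetric combination of the two objects' features, so the order of i and j
    # does not matter.
    return np.concatenate([X[i] + X[j], np.abs(X[i] - X[j])])

pairs = positives + negatives
labels = [1] * len(positives) + [0] * len(negatives)
features = np.array([pair_features(i, j) for i, j in pairs])

predictor = LogisticRegression().fit(features, labels)

# Predicted chance that the unseen pair (0, 5) is an edge of the network.
print(predictor.predict_proba(pair_features(0, 5).reshape(1, -1))[0, 1])
```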
Since the network reconstruction problem fits in a standard machine learning setting,
one could tackle it by applying an existing learning algorithm. Indeed, there have been
studies that use support vector machines [17, 174], Bayesian approaches [73, 98] and other
standard machine learning methods [13, 59] to reconstruct biological networks.
While standard machine learning methods could make accurate predictions in some
cases, we claim that if some domain knowledge about the problem structure and data
properties is available, it is possible to design learning algorithms that make good use of
the knowledge to achieve higher prediction accuracy. In the first part of this thesis, we
demonstrate how this abstract idea is turned into practice in several studies.
In Chapter 3, we study the problem of supervised learning of protein interaction net-
works. We discuss several major difficulties in this problem, namely the large number of node pairs (18 million for the 6,000 nodes of yeast), the small number of known interactions and non-interactions, the uneven distribution of these gold-standard examples across the different nodes, and the existence of sub-class structures. We tackle the prob-
lem by building local models with training set expansion [203], which consists of semi-
supervised methods [34] that augment the original training sets by propagating information
from information-rich regions of the training network to information-poor regions. We show
that the resulting algorithms outperform a series of state-of-the-art algorithms when tested
on multiple benchmark datasets.
In Chapter 4, we continue to explore the idea of training set expansion for the protein
interaction network. In addition to making horizontal expansion (generating more training
examples for other nodes), in this case the expansion is also vertical (generating more
training examples for nodes at other levels). This idea is inspired by a special hierarchical
structure of the protein interaction network: each protein interaction involves corresponding
domain interactions, which in turn involve residue interactions. Each of the three levels of
interactions contains unique data features for learning the corresponding network. We
show that by considering all three levels of network reconstruction together, it is possible
to improve the prediction accuracy at each level [204].
In Chapter 5, we focus on the protein and domain levels, and study how inference
at the domain level could be affected by the errors at the protein level, which is a real
concern given the high false positive and false negative rates of protein interaction networks
constructed from high-throughput experiments, which are commonly used as the protein
level input. We propose different methods to perform consistent predictions at the two
levels, using maximum likelihood and constrained optimization. The resulting algorithms
display improved noise immunity.
In Chapter 6, we switch to the problem of predicting gene regulatory networks in
an unsupervised setting, which is more realistic for organisms that are not well studied.
We consider two types of data features, namely steady-state gene expression profiles after
gene knockout, and dynamic expression time series after an initial perturbation. While the
two types of data provide complementary information for predicting gene regulation, many
existing algorithms have overlooked the potential of the gene knockout expression profiles.
By developing a new procedure for identifying gene regulation from such profiles, we were
able to combine the information hidden in the two types of data and make more accurate
predictions. The effectiveness of the algorithm was demonstrated in a public challenge using benchmark datasets, in which our algorithm achieved the best accuracy, ahead of 28 other teams [202].
While the content in the first part of the thesis represents work in the core step of a
typical network reconstruction study, we emphasize that a successful study also implicitly
involves a lot of non-trivial tasks before and after the actual reconstruction stage. In the
second and third parts of the thesis, we explore two of them, namely data integration and
software sharing.
In the second part of the thesis, we discuss our work on data gathering and integration,
which is a difficult task as biological data are distributed and involve multiple naming
conventions and data formats. We study two different approaches to integrating biological
data. In Chapter 7, we use the knowledge representation formalism of semantic web [20]
to build a common platform called YeastHub [39] for integrating heterogeneous data from
different sources.
We treat the integration of data by semantic web as a long-term endeavor, as it involves
collaborative ontology building, large-scale data conversion, infrastructure and application
software design and development, and extensive training for software engineers and users.
To provide a short-term solution, in Chapter 8, we discuss our work in using Web 2.0 tech-
niques [140] to integrate life sciences data. With simple, user-friendly interfaces, biol-
ogists could easily build reusable modules for performing their daily data integration tasks
without writing any programs or scripts. We explore the potential and current limitations
of such techniques in several applications in public health and molecular biology [40, 165].
At the end of the chapter, we compare the two data integration approaches, and suggest
possible future directions.
While it is scientifically significant to demonstrate the effectiveness of new algorithms on
some specific datasets, the research community would benefit much more if the algorithms
are made publicly accessible, so that other groups could apply them on their own data
without spending extra resources on re-implementation. In the third part of the thesis, we
describe two web platforms that we developed for sharing our algorithms.
In Chapter 9, we describe tYNA (the Yale Network Analyzer) [206], which is a web
tool for network analysis. Its functionality includes statistics calculations, motif finding,
visualization, and network comparisons. The tool has been used by around 200 researchers
worldwide to analyze around 1,500 networks.
In Chapter 10, we describe a tool for studying residue co-evolution [205], which can
be viewed as a special kind of network in which each node represents an amino acid residue and two nodes are connected if the corresponding residues have undergone co-evolution
across different species. There are many ways to mathematically quantify the likelihood of
co-evolution between two residues. On the web site, we provide the implementation of more
than 100 variations of such co-evolution scoring functions, and allow users to study the
co-evolution networks of their own proteins. The site has processed more than 1,000 tasks,
and the programs have been downloaded and installed locally on the servers of a number
of other research groups to facilitate their large-scale studies of residue co-evolution.
We conclude the thesis in Chapter 11 and point out potential future directions.
Chapter 2
Biological Background
In this chapter, we introduce some basic biological concepts and experimental tech-
niques. The goal is to explain the important terms useful for understanding this thesis,
without delving into too much detail. Many related basic concepts will be omitted, and
some concepts will be presented in a simplified way. In particular, exceptions will not be
mentioned if they are rare. Additional concepts will be introduced in later chapters when
the need arises.
2.1 DNA, RNA and proteins
The basic unit of living systems is the cell. For most species, the heritable information
that distinguishes one organism from another is stored in the deoxyribonucleic acid (DNA)
sequences in living cells. Each DNA sequence is a linear chain of nucleotides. There are
four types of nucleotides in DNA sequences: adenine (A), cytosine (C), guanine (G) and
thymine (T). A DNA sequence can thus be represented by a string using an alphabet with
four characters.
In a cell, DNA sequences are arranged in a double-stranded helical structure, where
the nucleotides on the two strands are complementary to each other so that A is paired
with T, G is paired with C, and vice versa.
The DNA in a cell can be divided into different parts called chromosomes. All chro-
mosomes together form the genome of an organism. If an organism has two copies of the
set of chromosomes, it is said to be diploid. If there is only one copy, it is haploid.
For higher organisms such as humans, DNA is stored in the nucleus of a cell. Organisms
having cells with a clear nucleus are called eukaryotes. Organisms without a clear cell nucleus
are called prokaryotes.
In a DNA sequence, there are parts that can be used as templates to generate products
called ribonucleic acids (RNA). These parts are called genes, while the other parts are
called intergenic regions. The process of generating RNA from DNA is called transcription.
Like DNA, an RNA sequence is also composed of nucleotides. There are four such types
of nucleotides in RNA: A, C, G and Uracil (U). The transcription process ensures that the
resulting RNA is complementary to the DNA on the gene, with A being complementary to
U in this case.
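As a small worked example of these base-pairing rules, the sketch below (illustrative only; strand orientation is ignored) maps a DNA template strand to the complementary RNA produced by transcription.

```python
# Base-pairing rules from the text: A-T (A-U on the RNA side) and G-C.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_strand: str) -> str:
    """Return the RNA complementary to a DNA template strand."""
    return "".join(DNA_TO_RNA[base] for base in template_strand)

print(transcribe("TACGGT"))  # -> AUGCCA
```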
After transcription, some unwanted parts on the RNA are removed in higher organisms.
The corresponding DNA of the retained and removed parts are called exons and introns,
respectively. Most genes do not have RNA as their end products, but rather the RNA is
further used as a template to generate amino acid chains, which after folding into particular
three-dimensional structures are called proteins. The process of creating protein from RNA
is called translation. The translation of RNA into amino acids is again based on fixed rules.
The RNA sequence is read three nucleotides at a time, with each triple (called a codon) encoding one of the twenty types of amino acids.
When two amino acids are joined together, a water molecule is expelled in forming the bond, which is called a peptide bond. Each remaining amino acid is thus called a residue (what is left after the water is expelled). Residues will be used as the basic units of proteins, just as nucleotides are the basic units of DNA and RNA. Although
logically the chain of amino acids of a protein should be called an amino acid sequence, it is
more commonly called a protein sequence, and we will follow this convention. Note that a protein sequence is not a sequence of multiple proteins, but the sequence of amino acid residues of a single protein.
The content of DNA is subject to change by mutations. If a DNA region is important, so that mutations in the region could cause serious survival problems, then the copies of the region in surviving organisms will all have relatively few mutations and therefore look similar to each other: the region is said to be conserved. For example, if two species both require a certain protein to survive, then the encoding DNA sequence will be highly conserved across the two genomes. To identify the well-conserved and poorly conserved regions, the DNA/protein sequences of the same gene/protein from different species can be aligned using alignment algorithms that try to minimize the number of mutations required to change one sequence into the other. The idea can be generalized to involve more than two sequences, and the resulting
alignment is called a multiple sequence alignment (MSA).
Protein sequences have been compared to look for conserved regions. The conserved
sequences have been named motifs and domains, usually for shorter and longer sequences,
respectively. They are crucial to the functions and structures of proteins, and their inter-
actions with other biological objects.
Based on the concept of conservation, one may ask whether a given species has a certain gene, based on a reference sequence of the gene and a similarity/mutation threshold. By exam-
ining multiple species, each gene receives a binary vector with each bit indicating whether
the gene is present in a given species. The vector is called a phylogenetic profile of the gene.
The cell contains not only the nucleus, but also many other compartments. A protein
typically resides in only some of the compartments. A binary vector similar to a phylogenetic
profile can be constructed for a protein based on the cell compartments where it can be
found. The resulting data is called the localization profile.
Another type of large-scale dataset for genes is their RNA expression levels as measured
by microarray experiments. A microarray consists of many small wells, each containing
many copies of one type of target sequence, such as the complementary sequence of a gene
region. To measure how much RNA produced by the gene is present in a sample, a small
portion of sample is added to the well. The RNA in the sample is hybridized (bound) to the
complementary sequences in the well. The amount is detected by fluorescence. By adding
a portion of the sample to every well, the resulting measurements tell the relative RNA
expression levels of different genes in the sample. The whole set of experiments can also be
repeated for other samples, for example from the same cells in different conditions, which
would produce data allowing for comparison of the activity of the genes across different
conditions.
A gene can be artificially disabled by knockout experiments such as mutagenesis. For a
diploid organism, it is possible to knock out only one of the copies, or both. The resulting
strain of the former is called heterozygous and the latter is called homozygous.
2.2 Biological networks
In this thesis our primary interest is not individual biological objects, but the inter-
action between different objects in networks. There are many different types of biological
networks.
A protein-protein interaction (PPI) network records which proteins have physical inter-
actions with each other. The edges are undirected. We will assume that the interactions are
binary, i.e., involving only two proteins. Real biological systems contain protein complexes
that involve multiple proteins (and also multiple copies of the same proteins) physically
binding together. Each complex can be represented by a set of binary interactions.
In this thesis, we define an edge of a protein-protein interaction network as two proteins
that interact in at least one condition. We do not consider whether the proteins are per-
manently bound together or just transiently interacting. We also do not consider whether
two protein interactions can simultaneously occur.
Protein interactions can be detected by small-scale experiments such as western blot-
ting. Large-scale detection methods have also been proposed. The two most popular meth-
ods are yeast-two-hybrid (Y2H) and tandem affinity purification with mass spectrometry
(TAP-MS). The former detects binary interactions happening in the cell nucleus, while
the latter pulls down whole complexes without revealing the connections within. Though
neither of them provides the complete set of binary interactions and both have high error
rates, these experiments are the current state-of-the-art in large-scale detection of protein
interactions.
In gene-regulatory networks, edges are drawn from a regulator to its target. Transcrip-
tion is controlled by regulators called transcription factors (TFs). They are proteins that
recognize and bind to certain gene regions, to either activate or suppress the transcription.
In graph-theoretic terms, the edges are directed and signed, where a positive sign means
activation and a negative sign means suppression.
There are large-scale experiments for detecting TF binding, including chromatin im-
munoprecipitation with microarray (ChIP-chip) or with sequencing (ChIP-seq).
Metabolic networks are more commonly called metabolic pathways, which involve the
conversion of metabolites from one form to another through enzymatic actions. The most
common representation has the metabolites as nodes. There is an edge from one node to
another if the former can be converted into the latter by the action of an enzyme. The
enzyme is used as the label of the directed edge. Some reactions are reversible. In such
cases, there are two edges between the nodes, one from the first node to the second, and
the other from the second back to the first.
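A minimal sketch of this representation (the two reactions shown are standard glycolysis steps, used here purely as an example): metabolites are nodes, each directed edge is labelled with the catalyzing enzyme, and a reversible reaction contributes one edge in each direction.

```python
# Directed metabolic network: edges[(substrate, product)] = enzyme label.
edges = {}

def add_reaction(substrate, product, enzyme, reversible=False):
    edges[(substrate, product)] = enzyme
    if reversible:
        # A reversible reaction is represented by two opposite directed edges.
        edges[(product, substrate)] = enzyme

add_reaction("glucose", "glucose-6-phosphate", "hexokinase")
add_reaction("glucose-6-phosphate", "fructose-6-phosphate",
             "phosphoglucose isomerase", reversible=True)

for (src, dst), enzyme in edges.items():
    print(f"{src} --[{enzyme}]--> {dst}")
```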
In co-evolution networks, each node is a biological object and two nodes are connected
if they are evolutionarily linked so that mutations of one would trigger corresponding mu-
tations of the other. Co-evolution networks can be defined at multiple levels, from single
nucleotides in DNA, to single residues in the same protein or different proteins, to whole
proteins.
The term “genetic interaction” sometimes refers generally to the interaction between
different genes and their products [70], and sometimes refers specifically to the situation
that the absence or change of dosage of the products of two genes together (the genotypes)
causes some unexpected outcomes (the phenotypes) [182]. The most well known example
is synthetic lethality, in which a cell can survive the deletion of either of two genes, but it
cannot survive if both genes are deleted. By having each gene as a node and putting an
edge between two genes that have a certain type of genetic interaction, the resulting genetic
interaction network reflects some interesting special relationships between the genes. For
example, it has been shown that genes that have synthetic lethality are more likely to be
in parallel biological pathways [102].
A related network is the gene-drug network, in which there are two sets of nodes, one
for genes and one for drugs. An edge is drawn from a drug to a gene if the latter is the target
of the former. The network is thus a directed, bipartite graph. Since a drug could cause
inhibition of proteins, applying a drug to a cell has a net effect similar to knocking out (or
partially knocking out, i.e., knocking down) the encoding genes of the proteins. By carefully
comparing the genetic interaction network and gene-drug network, one could predict drug
targets and get more insights about the biological pathways a protein participates in.
Part I
Reconstructing Biological
Networks
Chapter 3
Exploiting Data Properties:
Training Set Expansion
3.1 Introduction
Biological networks offer a global view of the relationships between biological objects.
In recent years high-throughput experiments have enabled large-scale reconstruction of the
networks. However, as these data are usually incomplete and noisy, they can only be used as
a first approximation of the complete networks. For example, a recent study reports that the
false positive and negative rates of yeast two-hybrid protein-protein interaction data could
be as high as 25%-45% and 75%-90% respectively [90], and a recently published dataset
combining multiple large-scale yeast-two-hybrid screens is estimated to cover only 20% of
the yeast binary interactome [207]. As another example, as of July 2008, the synthetic
lethal interactions in the BioGRID database [29] (version 2.0.42) only involve 2505 yeast
genes, while there are about 5000 non-essential genes in yeast [72]. A large part of the
genetic network is likely not yet discovered.
To complement the experimental data, computational methods have been developed
to assist the reconstruction of the networks. These methods learn from some example
interactions, and predict the missing ones based on the learned models.
This problem is known as supervised network inference [187]. The input to the problem
is a graph G = (V, E, Ē) where V is a set of nodes each representing a biological object (e.g. a protein), and E, Ē ⊂ V × V are sets of known edges and non-edges respectively,
corresponding to object pairs that are known to interact and not interact respectively. For
each of the remaining pairs, whether they interact is not known (Figure 3.1(a)). A model
is to be learned from the data, so that when given any object pair (vi, vj) as input, it will
output a prediction y ∈ [0, 1] where a larger value means a higher chance of interaction
between the objects.
The models are learned according to some data features that describe the objects. For
example, in predicting protein-protein interaction networks, functional genomic data are
commonly used. In order to learn models that can make accurate predictions, it is usually
required to integrate heterogeneous types of data that contain different kinds of information.
Since the data are in different formats (e.g. numeric values for gene expression, strings for
protein sequences), integrating them is non-trivial. A natural choice for this complex data
integration task is kernel methods [164], which unify the data representation as special
matrices called kernels and facilitate easy integration of these kernels into a final kernel K
through various means [111] (Figure 3.1(b)). As long as K is positive semi-definite, K(vi, vj)
represents the inner product of objects vi and vj in a certain embedded space [130], which
can be interpreted as the similarity between the objects. Kernel methods then learn the
Figure 3.1. The supervised network inference problem. (a) Adjacency matrix of known interactions (black boxes), known non-interactions (white boxes), and node pairs with an unknown interaction status (gray boxes with question marks). (b) Kernel matrix, with a darker color representing a larger inner product. (c) Partially-complete adjacency matrix required by the supervised direct approach methods, with complete knowledge of a submatrix. In the basic local modeling approach, the dark gray portion cannot be predicted.
models from the training examples and the inner products [2]. Since network reconstruction
involves many kinds of data, in this study we will focus on kernel methods for learning.
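As an illustration of this kind of kernel integration (a sketch only, not the implementation used in the thesis; the feature matrices and the choice of normalized linear kernels are assumptions), the following builds one kernel per data type and sums them into a final kernel K. The sum remains positive semi-definite because each summand is.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5  # number of objects (e.g. proteins)

# Two hypothetical data types, e.g. expression profiles and phylogenetic profiles.
expression = rng.normal(size=(n, 10))
phylo = rng.integers(0, 2, size=(n, 8)).astype(float)

def normalized_linear_kernel(features):
    k = features @ features.T
    d = np.sqrt(np.diag(k))
    d[d == 0] = 1.0              # guard against all-zero feature vectors
    return k / np.outer(d, d)    # K(i, j) / sqrt(K(i, i) * K(j, j))

# Integrated kernel: the sum of the normalized kernels.
K = normalized_linear_kernel(expression) + normalized_linear_kernel(phylo)
print(K.shape, np.allclose(K, K.T))
```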
The supervised network inference problem differs from most other machine learning
settings in that instead of making a prediction for each input object (such as a protein),
the learning algorithm makes a prediction for each pair of objects, namely how likely these
objects interact in the biological network. Since there is a quadratic number of object pairs,
the computational cost could be very high. For instance, while learning a model for the
around 6000 genes of yeast is not a difficult task for contemporary computing machines,
the corresponding task for the around 18 million gene pairs remains challenging even for
high-end computers. Specialized kernel methods have thus been developed for this learning
problem.
For networks with noisy high-throughput data, reliable “gold-standard” training sets
are to be obtained from data verified by small-scale experiments or evidenced by multiple
methods. As the number of such interactions is small, there is a scarcity of training data.
In addition, the training data from small-scale experiments are usually biased towards some
well-studied proteins, creating an uneven distribution of training examples across proteins.
In the next section, we review some existing computational approaches to reconstruct-
ing biological networks. One recent proposal is local modeling [21], which allows for the
construction of very flexible models by building a separate local model for each object, and has shown promise in some network reconstruction tasks. However, when there
is a scarcity of training data, the high flexibility could turn out to be a disadvantage, as
there is a high risk of overfitting, i.e., the construction of overly complex models that fit
the training data well but do not represent the general trend of the whole network. As a
result, the prediction accuracy of the models could be affected.
In this study we propose methods called training set expansion that alleviate the prob-
lem of local modeling while preserving its modeling flexibility. They also handle the issue
of uneven training examples by propagating knowledge from information-rich regions to
information-poor regions. We will show that the resulting algorithms are highly competi-
tive with the existing approaches in terms of prediction accuracy. We will also present some
interesting findings based on the prediction results.
3.2 Related work: existing approaches for network reconstruction
3.2.1 The pairwise kernel approach
In the pairwise kernel (Pkernel) approach [17], the goal is to use a standard kernel
method (such as SVM) to make the predictions by treating each object pair as a data
instance (Figure 3.2(a,b)). This requires the definition of an embedded space for object
pairs. In other words, a kernel is to be defined, which takes two pairs of objects and returns
their inner product. With n objects, the kernel matrix contains O(n⁴) entries in total.
One systematic approach to constructing such pairwise kernels is to build them on top
of an existing kernel for individual objects, in which each entry corresponds to the inner
product of two objects. For example, suppose a kernel K for individual objects is given,
and v1, v2, v3, v4 are four objects, then the following function can be used to build the pairwise kernel:

K′((v1, v2), (v3, v4)) = K(v1, v3)K(v2, v4) + K(v1, v4)K(v2, v3) (3.1)

Loosely speaking, two object pairs are similar if the two objects in the first pair are respectively similar to the two objects in the second pair, under either of the two possible matchings.
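A minimal sketch of Equation 3.1 (the base kernel and object indices are hypothetical; object pairs are given as index pairs into the kernel matrix K):

```python
import numpy as np

def pairwise_kernel(K, pair_a, pair_b):
    """Pairwise kernel of Equation 3.1, built on top of a base kernel matrix K."""
    v1, v2 = pair_a
    v3, v4 = pair_b
    return K[v1, v3] * K[v2, v4] + K[v1, v4] * K[v2, v3]

# Hypothetical base kernel over four objects (a linear kernel, hence positive semi-definite).
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
K = features @ features.T

print(pairwise_kernel(K, (0, 2), (1, 3)))
```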
3.2.2 The direct approach
The direct approach [199] avoids working in the embedded space of object pairs. In-
stead, only a kernel for individual objects is needed. Given such an input kernel K and
a cutoff threshold t, the direct approach simply predicts each pair of objects (vi, vj) with
K(vi, vj) ≥ t to interact, and each other pair to not interact. Since the example interactions
and non-interactions are not used in making the predictions, this method is unsupervised.
The direct approach is related to the pairwise kernel approach through a simple pairwise
kernel:
K′((v1, v2), (v3, v4)) = K(v1, v2)K(v3, v4) (3.2)
With this kernel, each object pair (vi, vj) is mapped to the point K(vi, vj) on the real
line in the embedded space of object pairs. Thresholding the object pairs at a value t
is equivalent to placing a hyperplane in the embedded space with all pairs (vi, vj) having
K(vi, vj) ≥ t on one side and all other pairs on the other side. Therefore, if this pairwise
kernel is used, then learning a linear classifier in the embedded space is equivalent to learning
the best value for threshold t.
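A sketch of the unsupervised direct approach under this definition (the kernel matrix and the threshold t are made up for illustration):

```python
import numpy as np

def direct_predictions(K, t):
    """Predict (i, j) to interact whenever K[i, j] >= t (unsupervised direct approach)."""
    n = K.shape[0]
    above = K >= t
    np.fill_diagonal(above, False)   # ignore self-pairs in this illustration
    return [(i, j) for i in range(n) for j in range(i + 1, n) if above[i, j]]

K = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
print(direct_predictions(K, t=0.5))  # -> [(0, 1)]
```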
To make use of the training examples, two supervised versions of the direct approach
have been proposed. They assume that the sub-network of a subset of objects is completely
known, so that a submatrix of the adjacency matrix is totally filled (Figure 3.1(c)). The
goal is to modify the similarity values of the objects defined by the kernel to values that
are more consistent with the partial adjacency matrix. Thresholding is then performed on
the resulting set of similarity values.
The two versions differ in the definition of consistency between the similarity values and
the adjacency matrix. In the kernel canonical correlation analysis (kCCA) approach [199],
the goal is to identify feature f1 from the input kernel and feature f2 from the diffusion
kernel derived from the partial adjacency matrix so that the two features have the highest
correlation under some smoothness requirements. Additional feature pairs orthogonal to
the previous ones are identified in similar ways, and the first l pairs are used to redefine the
similarity between objects.
In the kernel metric learning (kML) approach [187], a feature f1 is identified by op-
timizing a function that involves the distance between known interacting objects. Again,
additional orthogonal features are identified, and the similarity between objects is redefined
by these features.
3.2.3 The matrix completion approach
The em approach [184] (which is theoretically related to the expectation-maximization
(EM) framework) also assumes a partially complete adjacency matrix. The goal is to com-
plete it by filling in the missing entries, so that the resulting matrix is closest to a spectral
variant of the kernel matrix as measured by KL-divergence. The algorithm iteratively
searches for the filled adjacency matrix that is closest to the current spectral variant of the
kernel matrix, and the spectral variant of the kernel matrix that is closest to the current
filled adjacency matrix. When convergence is reached, the predictions are read from the
final completed adjacency matrix.
3.2.4 The local modeling approach
A potential problem of the previous approaches is that one single model is built for
all object pairs. If there are different subgroups of interactions, a single model may not be
able to separate all interacting pairs from non-interacting ones. For example, protein pairs
involved in transient interactions may use a very different mechanism than those involved
in permanent complexes. These two types of interactions may form two separate subgroups
that cannot be fitted by one single model.
A similar problem has been discussed in Myers and Troyanskaya [135]. In this work,
the biological context of each gene is taken into account by conditioning the probability terms of a Bayesian model on that context. The additional modeling power of
having multiple context-dependent sub-models was demonstrated by improved accuracy in
network prediction.
Another way to allow for a more flexible modeling of the subgroups is local modeling [21].
Instead of building a single global model for the whole network, one local model is built
for each object, using its known interactions and non-interactions as the positive and
negative examples. Each pair of objects thus receives two predictions, one from the local
model of each object. In our implementation, the final prediction is a weighted sum of the
two according to the training accuracy of the two local models.
Figure 3.2 illustrates the concept of local modeling. Part (a) shows an interaction
network, with solid green lines representing known interactions, dotted red lines representing
known non-interactions, and the dashed black line representing an object pair of which the
interaction status is unknown. Part (b) shows a global model with the locations of the
object pairs determined by a pairwise kernel. The object pair (v3, v4) is on the side with
many positive examples, and is predicted to interact. Part (c) shows a local model for
object v3. Object v4 is on the side with a negative example, and (v3, v4) is predicted to not
interact.
Since each object has its own local model, subgroup structures can be readily handled
by having different kinds of local models for objects in different subgroups.
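The sketch below illustrates local modeling with kernel support vector machines (hypothetical data; scikit-learn with a precomputed kernel is used here for brevity, whereas the thesis used libsvm, and a plain average of the two local scores stands in for the thesis's weighting by training accuracy).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n = 8
features = rng.normal(size=(n, 5))
K = features @ features.T                      # base kernel over the objects

# Hypothetical known interactions (+1) and non-interactions (-1) of two objects.
labels = {0: {1: 1, 2: 1, 3: -1, 4: -1},
          5: {1: -1, 2: 1, 6: 1, 7: -1}}

def local_score(v, u):
    """Score of object u under the local model of object v."""
    idx = np.array(sorted(labels[v]))
    y = np.array([labels[v][i] for i in idx])
    model = SVC(kernel="precomputed").fit(K[np.ix_(idx, idx)], y)
    # Signed distance of u from the separating hyperplane of v's local model.
    return model.decision_function(K[u, idx].reshape(1, -1))[0]

# Prediction for the pair (0, 5): a plain average of the two local scores.
print(0.5 * (local_score(0, 5) + local_score(5, 0)))
```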
Figure 3.2. Global and local modeling. (a) An interaction network with each green solid edge representing a known interaction, each red dotted edge representing a known non-interaction, and the dashed edge representing a pair of objects with an unknown interaction status. (b) A global model based on a pairwise kernel. (c) A local model for object v3.
3.3 Our proposal: the training set expansion approach
Local modeling has been shown to be very competitive in terms of prediction accu-
racy [21]. However, local models can only be learned for objects with a sufficiently large
amount of known interactions and non-interactions. When the training sets are small, many
objects would not have enough data for training their local models. Overfitting may occur,
and in the extreme case where an object has no positive or negative examples, its local
model simply cannot be learned. As will be shown in our empirical study presented below,
this problem is especially serious when the embedded space is of very high dimension, since
very complex models that overfit the data could be formed.
In the following we propose ways to tackle this data scarcity issue while maintaining
the flexibility of local modeling. Our idea is to expand the training sets by generating
auxiliary training examples. We call it the training set expansion approach. Obviously these
auxiliary training examples need to be good estimates of the actual interaction status of the
corresponding object pairs, since expanding the training sets with wrong examples could further
worsen the learned models. We propose two methods for generating reliable examples:
prediction propagation and kernel initialization.
3.3.1 Prediction propagation (pp)
Suppose v1 and v2 are two objects, where v1 has sufficient training examples while v2
does not. We first train the local model for v1. If the model predicts with high confidence
that v1 interacts with v2, then v1 can later be used as a positive example for training the
local model of v2. Alternatively, if the model predicts with high confidence that v1 does not
interact with v2, v1 can be used as a negative example for training the local model of v2.
This idea is based on the observation that the predictions a model is most confident about are more likely to be correct. For example, if the local models are support vector machines,
the predictions for objects far away from the separating hyperplane are more likely correct
than those for objects falling in the margin. Therefore, to implement the idea, each pre-
diction should be associated with a confidence value obtained from the local model. When
expanding the training sets of other objects, only the most confident predictions should be
involved.
We use support vector regression (SVR) [172] to produce the confidence values. When training
the local model of an object vi, the original positive and negative examples of it are given
labels of 1 and -1 respectively. Then a regression model is constructed to find the best fit.
Objects close to the positive examples will receive a regressed value close to 1, meaning
that they correspond to objects that are likely to interact with vi. Similarly, objects close
to the negative examples will receive a regressed value close to -1, and hence correspond to
objects that are likely to not interact with vi. For other objects, the model is less confident in
telling whether they interact with vi. Therefore the predictions with large positive regressed
values can be used as positive examples for training other local models, and those with large
negative regressed values can be used as negative examples, where the magnitudes of the
regressed values represent the confidence.
Each time we use p% of the most confident predictions to expand the training sets of
other objects, where the numbers of new positive and negative examples are in proportion
to the ratio of positive and negative examples in the original training sets. The parameter
p is called the training set expansion rate.
To further improve the approach, we order the training of local models so that objects with more (original and augmented) training examples are trained first, as the models learned from more training examples are generally more reliable. Essentially this is han-
dling the uneven distribution of training examples by propagating knowledge from the
information-rich regions (objects with many training examples) to the information-poor
regions (objects with no or few training examples).
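A heavily simplified sketch of the prediction propagation loop (hypothetical data; the proportional balancing of new positive and negative examples and the re-ordering of objects as their training sets grow are omitted): local SVR models are trained for the objects with the most examples first, and the top fraction p of the most confident regressed values become auxiliary examples for the other objects.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
n = 10
features = rng.normal(size=(n, 6))
K = features @ features.T                      # base kernel over the n objects

# Hypothetical gold-standard labels: labels[v][u] = 1 (interact) or -1 (not).
labels = {v: {} for v in range(n)}
for v, u, y in [(0, 1, 1), (0, 2, 1), (0, 3, -1), (0, 4, -1),
                (1, 5, 1), (1, 6, -1)]:
    labels[v][u] = y
    labels[u][v] = y

p = 0.2                                        # training set expansion rate

for v in sorted(range(n), key=lambda w: -len(labels[w])):
    if len(set(labels[v].values())) < 2:
        continue                               # not enough data for a local model yet
    idx = np.array(sorted(labels[v]))
    y = np.array([labels[v][u] for u in idx], dtype=float)
    model = SVR(kernel="precomputed", C=0.5).fit(K[np.ix_(idx, idx)], y)

    unknown = [u for u in range(n) if u != v and u not in labels[v]]
    if not unknown:
        continue
    conf = model.predict(K[np.ix_(np.array(unknown), idx)])
    order = np.argsort(-np.abs(conf))          # most confident predictions first
    for k in order[: max(1, int(p * len(unknown)))]:
        u = unknown[k]
        labels[u][v] = 1 if conf[k] > 0 else -1  # auxiliary example for u's local model

print({v: partners for v, partners in labels.items() if partners})
```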
Theoretically prediction propagation is related to co-training [22], which uses the most
confident predictions of a classifier as additional training examples of other classifiers. The
major differences are that in co-training, the classifiers are to make predictions for the same
set of data instances, and the classifiers are complementary to each other due to the use of
different data features. In contrast, in prediction propagation, each model is trained for a
different object, and the models are complementary to each other due to the use of different
training examples.
Instead of regression, one can also use a support vector classifier (SVC) to determine the
confidence values, by measuring the distance of each object from the separating hyperplane.
Since we only use the ranks of the confidence values to deduce the auxiliary examples but
not their absolute magnitudes, we would expect the results to be similar. We implemented
both versions and tested them in our experiments. The two sets of results are indeed
comparable, with SVR having slightly higher accuracy on average, as we will see in the
experiment section.
3.3.2 Kernel initialization (ki)
The prediction propagation method is effective when some objects have sufficient input
training examples at the beginning to start the generation of auxiliary examples. Yet if all
objects have very few input training examples, even the object with the largest training
sets may not be able to form a local model that can generate accurate auxiliary examples.
An alternative way to generate auxiliary training examples is to estimate the interaction
status of each pair of objects by its similarity value given by the kernel. This is in line with
the idea of the direct approach, that object pairs with a larger similarity value are more
likely to interact. However, instead of thresholding the similarity values to directly give
the predictions, they are used only to initialize the training sets for learning the local
models. Also, to avoid generating wrong examples, only the ones with the largest and
smallest similarity values are used, which correspond to the most confident predictions of
the unsupervised direct method.
For each object, p% of the objects with the largest/smallest similarity values given by
the kernel are treated as positive/negative training examples in proportion to the positive
and negative examples in the original training sets. These auxiliary examples are then
combined with the original input examples to train the local models.
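A sketch of kernel initialization (simplified and with hypothetical data: the same number of auxiliary positives and negatives per object, rather than numbers proportional to the original positive-to-negative ratio):

```python
import numpy as np

def kernel_initialization(K, labels, p):
    """Add auxiliary examples from each object's most and least similar objects."""
    n = K.shape[0]
    augmented = {v: dict(labels.get(v, {})) for v in range(n)}
    m = max(1, int(p * (n - 1)))                    # auxiliary examples per class
    for v in range(n):
        others = np.array([u for u in range(n) if u != v])
        order = others[np.argsort(-K[v, others])]   # most similar objects first
        for u in order[:m]:
            augmented[v].setdefault(int(u), 1)      # auxiliary positive example
        for u in order[-m:]:
            augmented[v].setdefault(int(u), -1)     # auxiliary negative example
    return augmented

rng = np.random.default_rng(4)
features = rng.normal(size=(6, 3))
K = features @ features.T
print(kernel_initialization(K, {0: {1: 1}}, p=0.2))
```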
The kernel initialization method can be seen as adding a special prior to the object
pairs, which assigns a probability of 1 to the most similar pairs of each object and 0 to
the most dissimilar pairs. We have also tried normalizing the inner products to the [0,1]
range and using them directly as the initial estimate of the confidence of interaction. Yet
the performance was not as good as the current method, which could be due to the large
variance of confidence values of the object pairs with moderate similarity.
The two training set expansion methods fall within the class of semi-supervised learning
methods [34], which make use of both the training examples and some information about all
data instances to learn the model. Prediction propagation exploits the information about
each object pair produced by other local models to help train the current local model.
Kernel initialization utilizes the similarity between objects in the feature space to place soft
constraints on the local models: the objects most similar to the current object should
be put in the positive class and those most dissimilar to the current object should be put
in the negative class.
3.3.3 Combining the two methods (pp+ki)
Since kernel initialization is applied before learning while prediction propagation is
applied during learning, the two can be used in combination. In some cases this leads to
additional performance gain in our experiments.
3.4 Prediction accuracy
3.4.1 Data and setup
To test the effectiveness of the training set expansion approach, we compared its pre-
diction accuracy with the other approaches on three protein-protein interaction networks
of the yeast Saccharomyces cerevisiae from BioGRID [29], DIP [160], MIPS [131] and iP-
fam [63]. The BioGRID-10 dataset contains all BioGRID interactions of Saccharomyces
cerevisiae (version 2.0.44) that satisfy the following criteria:
1. Having one of the following physical interaction types:
• FRET
• Protein-peptide
• Co-crystal Structure
• Co-fractionation
• Co-purification
• Reconstituted Complex
• Biochemical Activity
• Affinity Capture-Western
• Two-hybrid
• Affinity Capture-MS
2. From one of the small-scale studies, defined as studies that report fewer than 10 physical
interactions to BioGRID
3. The proteins/genes involved in the interactions have valid values from all the features
for learning
The cutoff (10 physical interactions) was chosen so that the network is large enough to have
relatively few missing interactions, while small enough to run the different algorithms in
reasonable time. The dataset contains 5,126 interactions that involve 2,328 yeast proteins.
The BioGRID-200 dataset is similar to BioGRID-10, except that small-scale studies
are defined as studies that report fewer than 200 physical interactions to BioGRID. Notice
that since the four high-throughput datasets used as data features all have more than 200
interactions, they are not included in this dataset. The dataset contains 12,155 interactions
that involve 3,222 yeast proteins.
The DIP MIPS iPfam dataset contains the union of all interactions from DIP (7 Oct
2007 version), MIPS (18 May 2006 version) and iPfam (version 21 of Pfam) that satisfy the
following criteria:
1. For interactions in DIP, only those identified in small-scale experiments or multiple
experiments are considered
2. For interactions in MIPS, only the physical, non-Yeast two hybrid and non-TAP-MS
ones are considered
3. The involving proteins/genes have valid values from all the features for learning
The dataset contains 3,201 interactions that involve 1,681 yeast proteins.
We use BioGRID-10 as the main dataset for comparison, while DIP MIPS iPfam repre-
sents a high quality but smaller dataset, and BioGRID-200 represents one with few missing
Table 3.1. List of datasets used in the comparison study. Each row corresponds to a dataset from a publication in the Source column, and is turned into a kernel using the function in the Kernel column, as in previous studies [21, 199].
interactions, but is so large that the pairwise kernel method could not be tested, as it
caused our machine to run out of memory. The three datasets together allow us to show
the effectiveness of training set expansion in a wide spectrum of scenarios.
We tested the performance of the different approaches on various kinds of genomic
data features, including phylogenetic profiles, sub-cellular localization and gene expression
datasets using the same kernels and parameters as in previous studies [21, 200]. We also
added in datasets from tandem affinity purification with mass spectrometry using the dif-
fusion kernel, and the integration of all kernels by summing them after normalization, as in
previous studies [21, 200]. The list of datasets used is shown in Table 3.1.
We performed ten-fold cross-validations and used the area under the receiver operating characteristic curve (AUC) as the performance metric. The cross-validations were done in
two different modes. In the first mode, as in previous studies [21, 199], the proteins were
divided into ten sets. Each time one set was left out for testing, and the other nine were
used for training. All known interactions with both proteins in the training set were used as
positive training examples. As required by some of the previous approaches, the sub-network
involving the proteins in the training set was assumed completely known (Figure 3.1(c)).
As such, all pairs of proteins in the training set not known to interact were regarded as
negative examples. All pairs of proteins with exactly one of the two proteins in the training
set were used as testing examples (light gray entries in Figure 3.1(c)). Pairs with both
proteins not in the training set were not included in the testing sets (dark gray entries in
Figure 3.1(c)), as the original local modeling method cannot make such predictions.
Since all protein pairs in the submatrix are either positive or negative training examples,
there are O(n²) training examples in each fold. In the pairwise kernel approach, this translates to a kernel matrix with O(n⁴) elements. This is of the order of 10¹² for 1,000
proteins, which is too large to compute and to learn the SVC and SVR models. We therefore
did not include the pairwise kernel method in the experiments that used the first mode of
cross-validation.
Since some protein pairs treated as negative examples may actually interact, the re-
ported accuracies may not completely reflect the absolute performance of the methods.
However, as the tested methods were subject to the same setting, the results are still good
indicators of the relative performance of the approaches.
In the second mode of cross-validation, we randomly sampled protein pairs not known
to interact to form a negative training set with the same size as the positive set, as in
previous studies [17, 151]. Each of the two sets was divided into ten subsets, which were
used for left-out testing in turn. The main difference between the two modes of cross-
validation is that the train-test split is based on proteins in the first mode and protein pairs
in the second mode. Since the training examples do not constitute a complete submatrix,
the kCCA, kML and em methods cannot be tested in the second mode. The second mode
represents the more general case, where the positive and negative training examples do not
necessarily form a complete sub-network.
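A small sketch contrasting the two modes (hypothetical protein indices and gold-standard pairs): mode 1 splits the proteins and tests on pairs with exactly one held-out protein, while mode 2 splits the labeled pairs directly, with sampled negatives of the same size as the positive set.

```python
import numpy as np

rng = np.random.default_rng(5)
n_proteins = 20
positive_pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]   # hypothetical gold standard

# Mode 1: ten-fold split over proteins. Pairs among training proteins form the
# (assumed complete) training sub-network; pairs with exactly one held-out protein
# are used for testing.
protein_folds = np.array_split(rng.permutation(n_proteins), 10)
held_out = set(protein_folds[0].tolist())
train_proteins = [q for q in range(n_proteins) if q not in held_out]
test_pairs_mode1 = [(i, j) for i in train_proteins for j in held_out]

# Mode 2: ten-fold split over labeled pairs, with randomly sampled negative pairs
# (collisions with unknown true interactions are ignored in this sketch).
negative_pairs = [tuple(rng.choice(n_proteins, size=2, replace=False))
                  for _ in positive_pairs]
all_pairs = positive_pairs + negative_pairs
pair_folds = np.array_split(rng.permutation(len(all_pairs)), 10)
test_pairs_mode2 = [all_pairs[k] for k in pair_folds[0]]

print(len(test_pairs_mode1), test_pairs_mode2)
```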
We used the Matlab code provided by Jean-Philippe Vert for the unsupervised direct,
kCCA, kML and em methods with the first mode of cross-validation. We implemented the
other methods with both the first and second modes of cross-validation. We observed almost
identical accuracy values from the two implementations of the direct approach in the first
mode of cross-validation with the negligible differences due only to random train-test splits,
which confirms that the reported values from the two sets of code can be fairly compared.
For the pairwise kernel approach, we used the kernel in Equation 3.1.
We used the ε-SVR and C-SVC implementations of the Java version of libsvm [33]. In
a preliminary study, we observed that the prediction accuracy of SVR is not much affected
by the value of the termination threshold ε, while for both SVR and SVC the performance
is quite stable as long as the value of the regularization parameter C is not too small. We
thus fixed both parameters to 0.5. For PP and KI, we used a grid search to determine the
value of the training set expansion rate p.
3.4.2 Results
Since we use datasets different from the ones used in previous studies, the prediction
results are expected to be different. To make sure that our implementations are correct and
the testing procedure is valid, we compared our results on the DIP MIPS iPfam dataset
with those reported in Bleakley et al. [21] as the size of this dataset is most similar to the
one used by them. Our results (Table 3.4) display a lot of similarities with those in Bleakley
et al. [21]. For example, in the first mode of cross-validation, local modeling outperformed
the other previous approaches when object similarity was defined by phylogenetic profiles
and yeast two-hybrid data. Also, the em method had the best performance among all
previous approaches with the integrated kernel in both studies. We are thus confident that
our results represent a reliable comparison between the methods.
The comparison results for our main dataset, BioGRID-10, are shown in Table 3.2. In
the table pp, ki and pp+ki are written as local+pp, local+ki and local+pp+ki, respectively,
to emphasize that the two training set expansion methods are used on top of basic local
modeling. Notice that the accuracies in the second mode of cross-validation are in general
higher. We examined whether this is due to the presence of self-interactions in the gold-
standard set of the second mode of cross-validation but not in the first mode, by removing
the self-interactions and re-running the experiments. The results suggest that the perfor-
mance gain due to the removal of self-interactions is too small to explain the performance
difference between the two modes of cross-validation. The setting in the second mode may
thus correspond to an easier problem. The reported accuracies of the two modes should
therefore not be compared directly.
From the table, the advantages of the training set expansion methods over basic local
modeling are clearly seen. In all cases, the accuracy of local modeling was improved by
at least one of the expansion methods, and in many cases all three combinations (pp, ki
and pp+ki) performed better than basic local modeling. With training set expansion, local
modeling outperformed all the other approaches in all 9 datasets.
Inspecting the performance of local modeling without training set expansion, it is
observed that although local modeling usually outperformed the other previous methods,
Table 3.2. Prediction accuracy (percentage of AUC) of the different approaches on the BioGRID-10 dataset. The best approach for each kernel and each mode of cross-validation is in bold face.
its performance with the integrated kernel was unsatisfactory. This is probably due to
overfitting. When kernels are summed, the resulting embedded space is the direct product
of the ones defined by the kernels [164]. Since the final kernel used for the integrated dataset
is a summation of 8 kernels, the corresponding embedded space is of very high dimension.
With the high flexibility and the lack of training data, the models produced by basic local
modeling were probably overfitted. In contrast, with the auxiliary training examples, the
training set expansion methods appear to have largely overcome the problem.
Comparing the two training set expansion methods, prediction propagation resulted in a larger performance gain in most cases. This is reasonable, since prediction propagation makes use of the input training examples when deriving its auxiliary examples, whereas kernel initialization does not.
The results for BioGRID-200 and DIP MIPS iPfam are shown in Table 3.3 and Ta-
ble 3.4, respectively. They exhibit similar patterns as in the case of BioGRID-10, and thus
the above discussion also applies to them.
To better understand how the two training set expansion methods improve the pre-
dictions, we sub-sampled the gold-standard network at different sizes, and compared the
performance of local modeling with and without training set expansion using the second
mode of cross-validation. The results for two of the kernels are shown in Figure 3.3, which
show the two typical cases observed.
In general, training set expansion improved the accuracy the most with moderate gold-standard set sizes, at around 3000 interactions. For prediction propagation, this is expected, since when the training set was too small, the local models were so inaccurate that even the most confident predictions could still be wrong, which made propagation undesirable. On the other hand, when there were many training examples, there were few missing interactions.
Table 3.3. Prediction accuracy (percentage of AUC) of the different approaches on the BioGRID-200 dataset. The best approach for each kernel and each mode of cross-validation is in bold face.
Table 3.4. Prediction accuracy (percentage of AUC) of the different approaches on the DIP MIPS iPfam dataset. The best approach for each kernel and each mode of cross-validation is in bold face.
There is a significant negative correlation between the rank difference and the minimum degree of the two proteins (Spearman correlation = −0.38, p < 10^-16), which confirms that the correct predictions made by local+pp that
were missed by the other four methods correspond to the protein pairs with few known
examples. We have also tested the average degree instead of the minimum, and the Pearson
correlation instead of Spearman correlation. The results all lead to the same conclusion
(Figure S2).
Figure 3.4. Correlating the number of gold-standard examples (minimum degree of the protein pair, vertical axis) and the rank difference between local+pp and the four methods (horizontal axis). The labeled point corresponds to the (SEC11, SPC1) interaction.
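A minimal sketch of this correlation analysis, assuming scipy is available and using illustrative variable names:

```python
from scipy.stats import spearmanr

def rank_gain_vs_degree(rank_diff, min_degree):
    """rank_diff: per-interaction rank improvement of local+pp over the best of the
    other methods; min_degree: minimum number of known examples of the two proteins.
    Returns the Spearman correlation and its p-value; a significantly negative
    correlation indicates that the largest gains occur where few examples are known."""
    rho, p_value = spearmanr(rank_diff, min_degree)
    return rho, p_value
```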
A concrete example of a gold-standard interaction predicted by local+pp but ranked
low by the four methods is the one between SEC11 and SPC1. They are both subunits of
the signal peptidase complex (SPC), and are reported to interact in BioGRID according to
multiple sources. In the BioGRID-10 dataset, SPC1 is the only known interaction partner
of SEC11, while SPC1 only has one other known interaction (with SBH2). The extremely
small numbers of known examples make it difficult to identify this interaction. Indeed,
the best of the four previous methods could only give it a rank at the 74th percentile,
indicating that they were all unable to identify this interaction. In contrast, local+pp was
able to rank it at the top 7th percentile, i.e., with a rank difference of 67 percentiles (see
Figure 3.4). This example illustrates that interactions with very few known examples, while
easily missed by the previous methods, could be identified by using prediction propagation.
For local+ki, among the 2,880 commonly tested gold-standard interactions, 2,025 re-
ceived a higher rank from this method than from any of the four comparing methods.
Again, there is a negative correlation between the rank difference and the minimum de-
gree and average degree (Figure S2), which shows that kernel initialization is also able to
predict interactions for proteins with few training examples. In addition, there is a posi-
tive correlation with moderate significance between the rank difference and the similarity
between the interacting proteins according to the kernel (Figure S2, Spearman correlation
= 0.04, p = 0.04), which is expected as the kernel initialization method uses protein pairs
with high similarity as auxiliary positive training examples. Interestingly, for local+pp, a
negative correlation is observed between the rank difference and protein similarity (Figure
S2), which suggests that the prediction propagation method is able to identify non-trivial
interactions, where the two interacting proteins are not necessarily similar according to the
kernel.
3.6 Discussion
Training set expansion is a general concept that can also be applied to other problems
and used with other learning methods. The learning method is not required to make
very accurate predictions for all object pairs, and the data features do not need to define
an object similarity that is very consistent with the interactions. As long as the most
confident predictions are likely correct, prediction propagation is useful, and as long as the
most similar objects are likely to interact and the most dissimilar objects are unlikely to
interact, kernel initialization is useful. In many biological applications at least one of these
requirements is satisfied.
In the next chapter, we continue our exploration of the idea of training set expansion.
In addition to expanding the training sets of other objects at the same level, we also study
ways to expand the training sets of objects at other levels in a natural concept hierarchy of
protein interactions.
Chapter 4
Utilizing Problem Structures:
Multi-level Learning
4.1 Introduction
In the previous chapter we described methods for predicting protein interactions, and
how we improved prediction accuracy by training set expansion. While some of the methods
could predict which proteins interact with high accuracy, they do not explain how the
proteins interact. For instance, if protein A interacts with both proteins B and C, whether
B and C could interact with A simultaneously remains unknown, as they may or may not
compete for the same binding interface of A. This observation has led to the recent interest
in refining PPI networks by structural information about domains [6, 16, 104]. It has also
called for the prediction of protein interactions at finer granularities.
Since binding interfaces of proteins are enriched in conserved domains in permanent
interactions [31], it is possible to construct a second-level interaction network with protein
interactions annotated by the corresponding domain interactions. An even finer third-level
interaction network involves the residues mediating the interactions (Figure 4.1).
Figure 4.1. Schematic illustration of multi-level learning concepts. (a) The three levels of interactions. Top: the PDB structure 1piw of the homo-dimer yeast NADP-dependent alcohol dehydrogenase 6. Middle: each chain contains two conserved Pfam domain instances, PF00107 (inner) and PF08240 (outer). The interaction interface is at PF00107. Bottom: two pairs of residues predicted by iPfam to interact: 283 (yellow) with 287 (cyan), and 285 (purple) with 285. Visualization by VMD [94]. (b) The three information flow architectures. i: independent levels, ii: unidirectional flow (illustrated by downward flow), iii: bidirectional flow. (c) Coupling mechanisms for passing information from one level to another. 1: passing training information to expand the training set of the next level, 2: passing predictions as an additional feature of the next level, 3: passing predictions to expand the training set of the next level.
As will be described in the next section, some recent studies have started to perform
interaction predictions at the domain and residue levels. The data features used by each
level are quite distinct. While protein level features are mostly from functional genomic
and proteomic data such as gene expression and sub-cellular localization of whole genes and
proteins, domain level features are mainly evolutionary information such as phylogenetic-
occurrence statistics of the domain families, and residue level features are largely structural
or physical-chemical information derived from the primary sequences.
In the literature of domain-level prediction, the term “domain” is usually used to mean
a domain family, which could have multiple occurrences in different proteins. In this study
we use the terms “domain family” and “domain instance” to refer to these two concepts
respectively, in order to make a clear distinction between them. For example, PF07974 is
a domain family from Pfam, while ADP1_YEAST.PF07974 is a domain instance in the protein ADP1_YEAST.
Since the data features of the three levels describe very different aspects of the biological
objects, potentially they could contribute to the prediction of different portions of the
interaction networks. For example, some protein interactions could be difficult to detect
using whole-protein level features since they lack fine-grained physical-chemical information.
These can be supplemented by the residue level features such as charge complementarity.
Likewise, for the protein interactions that occur within protein complexes, there could
be a high correlation between the expressions of the corresponding genes. With proper gene
expression datasets included in the protein features, there is a good chance of correctly
predicting such protein interactions. Then if one such interaction involves a pair of proteins
each with only one conserved domain, it is very likely that the domain instances actually
interact.
One may worry that if the predictions at a particular level are inaccurate, the errors
would be propagated to the other levels and worsen their predictions. As we will discuss, this
issue can be handled algorithmically by carefully deciding what information to propagate
and how it is propagated. With a properly designed algorithm, combining the predictions
and utilizing the data features of all three levels can improve the predictions at each level.
In this work we propose a new multi-level machine-learning framework that combines
the predictions at different levels. Since the framework is also potentially useful for other
problems in computational biology that involve a hierarchy, such as biomedical text mining
(a journal contains papers and a paper contains key terms), we start with a high-level
description of multi-level learning and discuss three key aspects of it. Then we suggest a
practical algorithm for the problem of predicting interactions at the protein, domain and
residue levels, which integrates the information of all three levels to improve the overall
accuracy. We demonstrate the power of this algorithm by showing the improvements it
brings to the prediction of yeast interactions relative to the predictions from independent
levels.
4.2 Related work
Two main ingredients of protein-protein interaction predictions are the selection of a
suitable set of data features, and an appropriate way to integrate them into a learning
method. Many kinds of features have been considered [199], including sub-cellular localiza-
tion [92], gene expression [55, 176], and phylogenetic profiles [146]. With the many different
kinds of data features, Bayesian approaches [98] and kernel methods [17, 21, 199] are natural
choices for integrating them into a single learning algorithm. The former unifies the whole
inference process by a probabilistic framework, while the latter encodes different kinds of
data into kernel matrices that can be combined by various means [111].
Predictions of interactions between domain families are related to the more general goal
of identifying protein interaction interfaces. While some studies tackle the problem using
features at the domain level only [100], most other work assumes that a set of protein-protein
interactions are known a priori, and the goal is to predict either domain family interactions
(i.e., which domain families have their instances interact in at least one pair of proteins) or
domain-instance interactions (i.e., through which domain instances the proteins interact) [156, 161, 178, 190, 191]. The data features are mainly derived from statistics related to the
parent proteins. For example, for a pair of domain families, the frequency of co-occurrence
in interacting proteins is an informative feature, since a higher frequency may indicate a
larger chance for them to be involved in mediating the interactions.
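As an illustration of such a statistic, the following sketch counts, for each pair of domain families, how many known interacting protein pairs contain instances of both families. The data structures and names are hypothetical, not those of the cited studies.

```python
from collections import Counter
from itertools import product

def domain_pair_cooccurrence(interacting_protein_pairs, families_of):
    """interacting_protein_pairs: iterable of (protein_a, protein_b) known to interact.
    families_of: dict mapping a protein to the set of domain families it contains.
    Returns a Counter over unordered domain-family pairs."""
    counts = Counter()
    for a, b in interacting_protein_pairs:
        for fam_a, fam_b in product(families_of[a], families_of[b]):
            counts[frozenset((fam_a, fam_b))] += 1
    return counts
```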
At a finer level, identifying protein interaction interfaces involves the prediction of
residue interactions, which could be divided into two sub-tasks: 1) predicting which residues
are in any interaction interfaces of a protein [41], and 2) predicting which of these interfaces
interact [42]. Data features are mainly derived from the primary protein sequences or from
crystal structures if they are assumed available. Docking algorithms [163] represent related
approaches, but have a fundamentally different focus: Their goal is to utilize largely physical
information to deduce the structure of the complex from the unbound protein structures, a
considerably harder problem. Therefore, we do not consider them in this study and focus
on large-scale techniques.
From a theoretical perspective, our multi-level learning framework is loosely related
to co-training [22] and the meta-learning technique called stacking [195]. We will compare
them with our framework after introducing the information flow architectures and the cou-
pling mechanisms in Sections 4.4.1 and 4.4.2 respectively. Also, our framework by nature
facilitates semi-supervised learning [34]. We will briefly discuss semi-supervised learning
and its relationships with PSI-BLAST [7] in Section 4.4.2.
4.3 Problem definition
We now formally describe the learning problem we tackle in this study. The inputs of
the problem consist of the following:
• Objects: a set of proteins, each containing the instances of one or more conserved
domains, each of which contains some residues. Each protein, domain instance and
residue is described by a vector of feature values. Some additional features are avail-
able for pairs of objects, such as the likelihood for a pair of proteins to interact
according to a high-throughput experiment.
• Gold standard positive sets of known protein-protein, domain instance-domain in-
stance and residue-residue interactions. The positive sets could be 1) contaminated
with false positives, and 2) incomplete, with false negatives, and a pair of upper-level
objects in the positive set may not have any corresponding lower-level object pairs
known to be in the positive sets.
• Gold standard negative sets of non-interactions at the three levels.
We assume no crystal structures are available except for the proteins in the gold-
standard positive sets, so that the input features cannot be derived from known structures.
This is a reasonable assumption given the small number of known structures as compared
to the availability of other data features.
The objective is to use the gold standard sets and the data features to predict whether
the object pairs outside the gold standard sets interact or not. Prediction accuracies are estimated by cross-validation using holdout testing examples in the gold standard sets not involved in the training process. (As in other studies on protein interaction networks, we use the term “gold standard set” to mean a set of sufficiently reliable data useful for the prediction purpose, rather than a ground-truth set that is absolutely correct.)
In this study we focus on kernel methods [164] for learning from examples and making
predictions. The main goal of this study is to explain how the predictions at the different
levels can be integrated, and to demonstrate the resulting improvements in accuracy. We do
not attempt to boost the accuracy at each individual level to the limit. It may be possible
to improve our predictions by using other features, learning algorithms, and parameter
values. As we will see, the design of our algorithm provides the flexibility for plugging in
other state-of-the-art learning methods at each level. We expect that the more accurate the
individual algorithms are, the more benefits they will bring to the overall accuracy through
the multi-level framework.
4.4 Methods
In order to develop a method for predicting interactions at all three levels in a cohe-
sive manner, we need to define the relationships between the levels, which is the topic of
Section 4.4.1. We first describe two information flow architectures already considered in
previous studies, and then propose a new architecture that maximally utilizes the available
data. In Section 4.4.2 we discuss various possible approaches to coupling the levels, i.e.,
ways to pass information between levels. In Section 4.4.3 we discuss the data sparsity issue.
In particular, we describe the idea of local modeling, which is also useful for network pre-
dictions in general. Finally, in Section 4.4.4 we outline the actual concrete algorithm that
we have developed and used in our experiments.
4.4.1 Information flow architectures
Architecture 1: independent levels
A traditional machine-learning algorithm learns patterns from one single set of training
examples and predicts the class labels of one single set of testing instances. When there are
three sets of examples and instances instead, the most straightforward way to learn from all
three levels is to handle them separately and make independent predictions (Figure 4.1bi).
We use this architecture to set up the baseline for evaluating the performance of the other
two architectures.
Architecture 2: unidirectional flow
A second architecture is to allow downward (from protein to domain to residue) or
upward (from residue to domain to protein) flow of information, but not both (Figure 4.1bii).
This architecture is similar to some previous domain-level interaction methods described
above, which also use information from the protein level. However, in our case protein
interactions are not assumed to be known with certainty. So only the training set and the
predictions made from the training set at the protein level can be used to assist the domain
and residue levels.
Architecture 3: bidirectional flow
A third architecture is to allow the learning algorithm of each level to access the
information of any other levels, upper or lower (Figure 4.1biii). By allowing both upward
and downward flow of information, this new architecture is the most flexible among the
three, and is the architecture that we explore in this study. Theoretically, this architecture
is loosely related to co-training [22], which assumes the presence of two independent sets of
features, each of which is capable of predicting the class labels of a subset of data instances.
Here we have three sets of features, each of which is capable of predicting a portion of
the whole interaction network. Practical extensions to the ideal co-training model allow
partially dependent feature sets and noisy training examples, which fit our current problem.
Learning proceeds by iteratively building a classifier from one feature set, and adding the
highly confident predictions as if they were gold-standard examples to train another classifier
using the other feature set. The major difference between our bidirectional-flow architecture
and co-training is the presence of a hierarchy between the levels in our case, so that each
set of features makes predictions at a different granularity.
4.4.2 Different approaches to coupling the levels
To design a concrete learning algorithm, we need to specify what information is to be
passed between different levels and how it is passed. Here we suggest several possibilities,
and briefly discuss the pros and cons of each of them.
What information to pass
i. Training data
One simple idea is to pass training data to other levels (Figure 4.1c, arrow 1). This
can be useful in filling in the missing information at other levels. For example, many
known protein interactions do not have the corresponding 3D structures available, so there
is no information regarding which domain instances are involved in the interactions. The
known protein interactions can be used to compute statistics for helping the prediction of
domain-level interactions.
ii. Training data and predictions
The major limitation of passing only training data is that the usually much larger set of
data instances not in the training sets (the “unlabeled data”) would not benefit from multi-
level learning. In contrast, if the predictions made at a level are also passed to the other
levels, many more data instances could benefit (Figure 4.1c, arrows 2 and 3). For instance,
if two domain instances are not originally known to interact, but they are predicted to
interact by the domain-level features with high confidence, this information directly implies
the interaction of their parent proteins.
Algorithms adopting this idea are semi-supervised in nature [34], since they train on
not only gold-standard examples, but also predictions of data instances that are originally
unlabeled in the input data set. Note that the idea of semi-supervised learning has been
explored in the bioinformatics literature. For instance, in the PSI-BLAST method [7],
sequences that are highly similar to the query input are iteratively added as seeds to retrieve
other relevant sequences. These added sequences can be viewed as unlabeled data, as they
are not specified in the original query input.
How the information is passed
i. Combined optimization
To pass information between levels, a first approach is to combine the learning prob-
lems of the different levels into a single optimization problem. The objective function
could involve the training accuracies and smoothness requirements of all three levels. This
approach enjoys the benefits of being mathematically rigorous, and being backed by the
well-established theories of optimization. Yet the different kinds of data features at the
different levels, as well as noisy and incomplete training sets, make it difficult to define a
good objective function. Another drawback is the tight coupling of the three levels, so that
it is not easy to reuse existing state-of-the-art prediction algorithms for each level.
ii. Predictions as additional features
Another approach is to have a separate learning algorithm at each level, and use the
predictions of a level as an additional feature of another level (Figure 4.1c, arrow 2). For
example, if each pair of proteins is given a predicted probability of interaction, it can be used
as the value of an additional feature ‘parent proteins interacting’ of the domain instance
pairs and residue pairs. In this approach the different levels are loosely coupled, so that
any suitable learners can be plugged into the three levels independently, and the coupling
of the levels is controlled by a meta-algorithm.
A potential problem is the weighting of the additional features from other levels relative
to the original ones. If the original set of features is large, adding one or two extra features
without proper weighting would have negligible effects on the prediction process. Finding
a suitable weight may require a costly external optimization or cross-validation procedure.
For kernel methods, an additional challenge is integrating the predictions from other levels
into the kernel matrix, which could be difficult as its positive semi-definiteness has to be
conserved.
The idea of having a meta-algorithm that utilizes the predictions of various learners is
also used in stacked generalization, or stacking [195]. It treats the predictions of multiple
learners as a new set of features, and uses a meta-learner to learn from these predictions.
However, in our setting, the additional features come from other levels instead of the same
level.
iii. Predictions as augmented training examples
A similar approach is to add the predictions of a level to the training set of another
level (Figure 4.1c, arrow 3). The resulting training set involves the original input training
instances and augmented training data from other levels, with a coefficient reflecting how
much these augmented training data are to be trusted according to the training accuracy
of the supplying level. This approach also has the three levels loosely coupled.
A potential problem of this training set expansion approach is the propagation of errors
to other levels. The key to addressing this issue is to perform soft coupling, i.e., to associate
confidence values to predictions, and propagate only highly confident predictions to other
levels [203]. For kernel methods, this means ignoring objects falling in or close to the margin.
This approach is similar to PSI-BLAST mentioned above, which selectively includes only
the most similar sequences in the retrieval process.
In this study, we focus on this third approach. It requires a learning method for each
level, while the control of information flow between the different levels by means of training
set expansion forms the meta-algorithm. Since each level involves only one set of features
and one set of data instances, traditional machine learning methods can be used. We
chose support vector regression (SVR) [52], which is a type of kernel method. We used
regression instead of the more popular support vector machine classifiers [26] because the
former can accept confidence values of augmented training examples as inputs, and produce
real numbers as output, which can be converted back into probabilities that reflect the
confidence of interactions.
4.4.3 Global vs. local modeling, and data sparsity issues
Global modeling
Taking a closer look at the prediction problem at each individual level, one would realize
that applying a traditional learning method is actually non-trivial since we are dealing with
network data. In a traditional setting, each training instance has a class label and the
job of a learning algorithm is to identify patterns in the feature values for predicting the
class label of each unlabeled object. In our current situation, each data instance is a pair
of biological objects (proteins/domain instances/residues), with two possible class labels:
interacting and non-interacting. In order to construct a learner, one would need features for
pairs of objects. A model can then be learned using a traditional machine learning method
for all object pairs. We call this ‘global modeling’ since a single regression model is built
for all the data instances. Global modeling has a number of major drawbacks:
1. Features for object pairs: it is not easy to construct features for pairs of objects, since
most available data features are for single objects. This is particularly a problem for
kernel methods, which require a kernel matrix to encapsulate the similarity between
each pair of data instances. For network data, this means a similarity value for each
pair of object pairs. While methods have been proposed to construct such kernel
matrices [17], the resulting kernels, while formally correct, are difficult to interpret.
2. Time complexity: working with pairs of objects squares the time requirement with
respect to the number of objects in the dataset. While state-of-the-art implementa-
tions of kernel methods could easily handle thousands of proteins, it would still be
challenging to deal with millions of protein pairs, let alone the even more daunting
numbers of domain instance pairs and residue pairs.
3. Space complexity: the kernel matrix has a size quadratic in the number of data
instances. With n objects at a level, there are O(n²) pairs and thus the kernel matrix contains O(n⁴) entries.
4. Sub-clusters: the two classes of data instances may contain many sub-clusters that
cannot be handled by one single global model [21, 203]. For instance, proteins involved
in permanent complexes may use a very different interaction mechanism from transient
interactions in signaling pathways.
Local modeling
To avoid these problems, one alternative is local modeling [21], which we have described
in the previous chapter. Briefly, instead of building one single global model for all object
pairs, one local model is built for each object. For example, if the dataset contains n
proteins, then n models are built, one for each protein, for predicting whether this protein
interacts with each of the n proteins. The advantages of local modeling are obvious: 1) data
features are needed for individual objects only, 2) the time complexity is smaller than global
modeling whenever the learning method has a super-linear time complexity, 3) much less
memory space is needed for the kernel matrix, and 4) each object can have its very specific
local model. For all these benefits, in our experiments we only considered local modeling.
Local modeling is also not free from problems, but they are solvable. The most signif-
icant problem is data sparsity – some objects may have insufficient training examples (or
none at all) for building local models. For example, among the millions of yeast protein
pairs, there are only a few thousand known interactions, so many proteins have very few
known interactions. An object with zero or few known interaction partners would not have
enough training examples for building its local model.
Our proposed solution uses concepts related to semi-supervised learning: use high
confidence predictions to augment training sets [203]. Suppose protein A has sufficient
known positive and negative examples in the original training sets, and the local model
learned from these examples predicts with high confidence protein B to be an interaction
partner with A. Then when building the local model for B, A can be used as a positive
training example. Predicted non-interactions can be added as negative examples in a similar
way.
This idea is consistent with the training set expansion method proposed above for
inter-level communication. As a result, the information flow both between levels and within
a level can be handled in a unified framework. The expanded training set of a level thus
involves the input training data, highly confident predictions of the local models of the level,
and highly confident predictions from other levels.
Practically, training set expansion within the same level requires an ordered construc-
tion of the local models. Objects with many (input or derived) training examples should
have their local models constructed first, as more accurate models are likely to be obtained
from larger training sets. As these objects are added as training examples of their pre-
dicted interaction partners and non-partners, they would progressively accumulate training
examples for their own local models.
4.4.4 The concrete algorithm
We now explain how we used the ideas described in the previous sections, namely bidi-
rectional information flow, coupling by predictions passing, and local modeling with training
set expansion, to develop our concrete learning algorithm for prediction of protein, domain
instance and residue interactions. We first give a high-level overview of the algorithm, then
explain the components in more detail.
The main steps of the algorithm are:
1. Set up a learning sequence of the levels.
2. Use the model learned for the first level in the sequence to predict interactions at the
level.
3. Propagate the most confident predictions to the next level in the sequence as auxiliary
training examples.
4. Repeat the previous two steps for the second and third levels, and so on.
Learning at each level
We use training set expansion with support vector regression (SVR) to perform learning
at each level, similar to the idea in [203]. Each pair of objects in the positive and negative
training sets is given a class label of 1 and 0, respectively. An SVR model is learned for
the object (e.g. protein) with the largest number of training examples (denoted as A). The
model predicts a real value for each object, indicating the likelihood that it interacts with
A. The ones with the largest and smallest predicted values are treated as the most confident
positive and negative predictions, and are used to expand the training set. For example,
if B is an object with the largest predicted value, then A and B are predicted to interact,
and A is added as an auxiliary positive training example of B. After training set expansion,
the next object with the largest number of training examples is re-determined, its SVR is
learned, and the most confident predictions are used to expand the training set in the same
manner. The whole process then repeats until all models have been learned. Finally, each
pair of objects A and B received two predicted values, one from the model learned for A and
one from the model learned for B. The two values are weighted according to the training
accuracies of A and B to produce the predicted value for the pair. Sorting the predicted
values in descending order gives a list of predictions from the pair most likely to interact to
the one least likely. The list can then be used to evaluate the accuracy by metrics such as
the area under the receiver operator characteristic curve (AUC) [88].
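The following sketch illustrates this per-level procedure (local SVR models with training set expansion). It assumes a scikit-learn-style SVR on a precomputed kernel; the names, the fixed expansion count, and the handling of objects with too few examples are illustrative simplifications rather than the exact thesis implementation.

```python
import numpy as np
from sklearn.svm import SVR

def local_models_with_expansion(K, train_labels, n_expand=1):
    """K: n x n kernel matrix over single objects.
    train_labels: dict {object a: {partner b: label in {0, 1}}} (input training set).
    n_expand: number of most confident positive/negative predictions to propagate
              (a stand-in for the expansion rate p).
    Returns an n x n matrix of predicted scores (row a = local model of a)."""
    n = K.shape[0]
    labels = {a: dict(train_labels.get(a, {})) for a in range(n)}
    scores = np.full((n, n), np.nan)
    done = set()
    while len(done) < n:
        # model the not-yet-processed object with the most (input or derived) examples
        a = max((o for o in range(n) if o not in done), key=lambda o: len(labels[o]))
        done.add(a)
        partners = sorted(labels[a])
        if len(partners) < 2:
            continue  # too few examples to fit a local model
        y = np.array([labels[a][b] for b in partners], dtype=float)
        model = SVR(kernel="precomputed", C=0.5, epsilon=0.5)
        model.fit(K[np.ix_(partners, partners)], y)
        scores[a, :] = model.predict(K[:, partners])
        # propagate the most confident predictions as auxiliary training examples
        unlabeled = [b for b in range(n) if b != a and b not in labels[a]]
        order = sorted(unlabeled, key=lambda b: scores[a, b])
        for b in order[-n_expand:]:   # most confident positive predictions
            labels[b].setdefault(a, 1.0)
        for b in order[:n_expand]:    # most confident negative predictions
            labels[b].setdefault(a, 0.0)
    # the final score of a pair (a, b) would combine scores[a, b] and scores[b, a],
    # weighted by the training accuracies of the two local models
    return scores
```

In the actual experiments, the amount of propagation is governed by the expansion rate p chosen by grid search, and the two scores of each pair are combined with weights reflecting the training accuracies of the two local models, as described above.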
Setting up the learning sequence
One way to set up the learning sequence is to use the above procedure to deduce the
training accuracy of the three levels when treated independently, then order the three levels
into a learning sequence according to their accuracies. For example, if the protein level
gives the highest accuracy, followed by the domain level, and then the residue level, the
sequence would be “PDRPDR...”, where P, D and R stand for the protein, domain and
residue levels, respectively. Having the level with the highest training accuracy earlier in
the sequence ensures the reliability of the initial predictions of the whole multi-level learning
process, which is important since all latter levels depend on them. Notice that after learning
at the last level, we feed back the predictions to the first level to start a new iteration of
learning.
In our computational experiments we also tested the accuracy when only two levels are
involved. In such situations, we simply bypassed the left-out level. For example, to test
how much the domain and residue levels could help each other without the protein level,
the learning sequence would be “DRDR...”.
Propagating predictions between levels
The mechanism of propagating predictions from a level to another depends on the
direction of information flow.
For an upward propagation (R→D, R→P or D→P), each object pair in the next level
receives a number of predicted values from its children at the previous level. For example, if
predictions are propagated from the domain level to the protein level, each pair of domain
instances provides a predicted value to their pair of parent proteins. We tried two methods
to integrate these values. In the first method, we normalize the predicted values to the [0,
1] range as a proxy of the probability of interaction, then use the noisy-OR function [190] to
infer the chance that the parent objects interact. Let X and Y be the two sets of lower-level
objects, and let p(x, y) denote the probability of interaction between two objects x ∈ X and y ∈ Y; then the chance that the two parent objects interact is 1 − ∏_{x∈X, y∈Y} (1 − p(x, y)), i.e., the parent objects interact if and only if at least one pair of their child objects interacts.
In the second method, we simply take the maximum of the values. In the ideal case where
all predicted values are either 0 or 1, both methods are exactly the same as taking the OR
of the values. When the values are noisy, the former is more robust as it does not depend
on a single value. Yet its value is dominantly affected by a large number of fuzzy predicted
values with intermediate confidence, and is thus less sensitive. Since in our tests it does not
provide superior performance, in the following we report results for the second method.
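A minimal sketch of the two aggregation rules, assuming the child-pair predictions have already been normalized to [0, 1]:

```python
import numpy as np

def noisy_or(child_probs):
    """Parent pair interacts iff at least one child pair interacts."""
    p = np.asarray(child_probs, dtype=float)
    return 1.0 - np.prod(1.0 - p)

def max_rule(child_probs):
    """Use only the single most confident child prediction."""
    return float(np.max(child_probs))

# e.g. noisy_or([0.2, 0.3]) == 0.44 while max_rule([0.2, 0.3]) == 0.3
```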
For a downward propagation (P→D, P→R or D→R), we inherit the predicted value of
the parent pair as the prior belief that the object pairs from the two parents will interact.
In both cases, after computing the probability of interaction for each pair of objects
in the next level based on the predicted values at the current level, we again add the most
confident positive and negative predictions as auxiliary training examples for the next level,
with the probabilities used as the confidence values of these examples.
In the actual implementation, we used the Java package libsvm [33] for SVR, and the Java version of LAPACK (http://www.netlib.org/lapack/) for some matrix manipulations.
Table 4.1. Data features at the protein level.
Feature | Feature of | Data type | Kernel
COG (version 7) phylogenetic profiles [181] | Proteins | Binary vectors | RBF (σ = 8)
Sub-cellular localization [92] | Proteins | Binary vectors | Linear
Cell cycle gene expression [176] | Proteins | Real vectors | RBF (σ = 8)
Environment response gene expression [68] | Proteins | Real vectors | RBF (σ = 8)
Yeast two-hybrid [97, 185] | Protein pairs | Unweighted graph | Diffusion (β = 0.01)
TAP-MS [69, 108] | Protein pairs | Weighted graph | Diffusion (β = 0.01)
4.5 Experiments
We tested the effectiveness of multi-level learning by predicting protein, domain in-
stance and residue interactions of the yeast Saccharomyces cerevisiae.
4.5.1 Data
Protein level
Data features were gathered from multiple sources (Table 4.1), including phylogenetic
profiles, sub-cellular localization, gene expression, and yeast two-hybrid and TAP-MS net-
works. Each of them was turned into a kernel matrix and the final kernel was the summation
of them, as in previous studies [21, 199, 203].
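A sketch of how such kernels could be built and combined follows; it assumes the usual Gaussian RBF form and the Kondor-Lafferty convention for the diffusion kernel, so the exact parameterizations used in the thesis may differ.

```python
import numpy as np
from scipy.linalg import expm

def rbf_kernel(X, sigma):
    """X: n x d feature matrix; Gaussian RBF kernel with width sigma."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def diffusion_kernel(A, beta):
    """A: n x n (weighted) adjacency matrix of an interaction graph."""
    L = np.diag(A.sum(axis=1)) - A        # graph Laplacian
    return expm(-beta * L)                # matrix exponential exp(-beta * L)

# the final protein kernel is the sum of the kernels of the individual data sources,
# e.g. K_total = K_phylo + K_localization + K_expr1 + K_expr2 + K_y2h + K_tapms
```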
A gold standard positive set was constructed from the union of experimentally verified
or structurally determined protein interactions from MIPS [131], DIP [160] and iPfam [63]
with duplicates removed. The MIPS portion was based on the 18 May 2006 version, and
only physical interactions not obtained from high throughput experiments were included.
The DIP portion was based on the 7 Oct 2007 version, and only interactions from small-
scale experiments or multiple experiments were included. The iPfam portion was based on
version 21 of Pfam [64]. A total of 1681 proteins with all data features and at least one
Table 4.2. Data features at the domain level. *: These two features were used with the unidirectional and bidirectional flow architectures only, since they involve information about the training set of the protein level.
Feature | Feature of | Data type | Kernel
Phylogenetic tree correlations [76] of Pfam alignments | Domain family pairs | Real matrix | Empirical kernel map [183]
In all species, number of proteins containing an instance of the domain family | Domain families | Integers | Polynomial (d = 3)
In all species, number of proteins containing domain instances only from the family | Domain families | Integers | Polynomial (d = 3)
Number of domain instances of parent protein | Domain instances | Integers | Polynomial (d = 3)
Fraction of non-yeast interacting protein pairs containing instances of the two domains respectively that are mediated by the domain instances* | Domain family pairs | Real matrix | Constant shift embedding [157]
Fraction of protein pairs containing instances of the two domains respectively that are known to be interacting in the PPI training set* | Domain family pairs | Real matrix | Constant shift embedding
interaction were included in the final dataset, forming 3201 interactions. A gold standard
negative set with the same number of protein pairs was then created from random pairs of
proteins not known to interact in the positive set [17, 37].
Domain level
We included two types of features at the domain level: co-evolution and statistics
related to parent proteins (Table 4.2). These are similar to the features used by previous
studies for domain family/domain instance interaction predictions [100, 178].
The gold standard positive set was taken from iPfam, where two domain instances are
defined as interacting if they are close enough in 3D structure and some of their residues are
predicted to form bonding according to their distances and chemistry. After intersecting
with the proteins considered at the protein level, a total of 422 domain instance interactions
were included, which involves 272 protein interactions and 317 domain instances from 223
proteins and 252 domain families. A negative set with the same number of domain instance
Table 4.3. Data features at the residue level.
Feature | Feature of | Data type | Kernel
PSI-BLAST profiles | Residues and neighbors | Vectors of real vectors | Summation of linear
Predicted secondary structures | Residues and neighbors | Vectors of real vectors | Summation of linear
Predicted solvent accessible surface areas | Residues and neighbors | Vectors of real numbers | Summation of circular
pairs was then formed from random pairs of domain instances in the positive set. All known
yeast Pfam domain instances of the proteins were involved in the learning, many of which do
not have any known interactions in the gold standard positive set. Altogether 2389 domain
instances from 1681 proteins and 1184 domain families were included.
Residue level
We used three data features derived from sequences (Table 4.3). Charge complemen-
tarity and other features likely useful for interaction predictions are implicit in the sequence
profiles. The features are similar to those used in a previous study [42]. However, as we
do not assume the availability of crystal structures of unlabeled objects, the secondary
structures and solvent accessible surface areas we used were algorithmically predicted from
sequence instead of derived from structures. We used SABLE [1] to make such predictions.
In a previous study [42], the feature set of a residue involves not only the features of
the residue itself, but also neighboring residues closest to it in the crystal structure, which
allows for the possibility that some of them are involved in the same binding site and thus
have dependent interactions. In the absence of crystal structures, we instead included a
window of residues right before and after a residue in the primary sequence to construct its
feature set. We chose a small window size of 5 to make sure that the included residues are
physically close in the unknown 3D structures.
The gold standard positive set was taken from iPfam. Since there is a large number
of residue pairs, we only sampled 2000 interactions, which involve 228 protein pairs, 327
domain instance pairs and 3053 residues from 195 proteins, 279 domain instances and 224
domain families. Only these 3053 residues were included in the data set. A negative set
was created by randomly sampling from these residues 2000 residue pairs that do not have
known interactions in iPfam.
4.5.2 Evaluation procedure
We used ten-fold cross validation to evaluate the performance of our algorithm. Since
the objects in the three levels are correlated, an obvious performance gain would be obtained
if in a certain fold the training set of a level contains some direct information about the
testing set instances of another level. For example, if a residue interaction in the positive
training set comes from a protein pair in the testing set, then the corresponding protein
interaction can be directly inferred and thus the residue interaction would create a fake im-
provement for the predictions at the protein level. This problem was avoided by partitioning
the object pairs in the three levels consistently. First, the known protein interactions in
iPfam were divided into ten folds. Then, each domain instance interaction and each residue
interaction was put into the fold in which the parent protein interaction was assigned. Fi-
nally, the remaining protein interactions and all the negative sets were randomly divided
into ten folds.
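A sketch of this consistent partitioning, with hypothetical data structures (mappings from lower-level pairs to their parent protein pair):

```python
import random

def consistent_folds(ipfam_ppi, other_ppi, ddi_parent, rri_parent, n_folds=10, seed=0):
    """ipfam_ppi: protein pairs with known structures; other_ppi: remaining protein
    pairs (including all negatives); ddi_parent / rri_parent: dicts mapping each
    domain-instance pair / residue pair to its parent protein pair."""
    rng = random.Random(seed)
    pairs = list(ipfam_ppi)
    rng.shuffle(pairs)
    fold_of_ppi = {pair: i % n_folds for i, pair in enumerate(pairs)}
    # lower-level pairs inherit the fold of their parent protein pair
    fold_of_ddi = {d: fold_of_ppi[p] for d, p in ddi_parent.items()}
    fold_of_rri = {r: fold_of_ppi[p] for r, p in rri_parent.items()}
    # remaining protein pairs and the negative sets are split randomly
    for pair in other_ppi:
        fold_of_ppi.setdefault(pair, rng.randrange(n_folds))
    return fold_of_ppi, fold_of_ddi, fold_of_rri
```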
Each time, one of the folds was held out as the testing set and the other nine folds were
used for training. We used the area under the ROC (Receiver Operator Characteristics)
curve (AUC) [88] to evaluate the prediction accuracies. For each level, all object pairs in the
gold standard positive and negative sets were sorted in descending order of the predicted
values of interaction they received when taking the role of testing instances. The possible
values of AUC range from 0 to 1, where 1 corresponds to the ideal situation where all
positive examples are given a higher predicted value than all negative examples, and 0.5 is
the expected value of a random ordering.
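For reference, the AUC of such a ranked list equals the normalized Mann-Whitney rank-sum statistic; a minimal sketch (ties between scores are not handled here):

```python
import numpy as np

def auc_from_scores(scores, labels):
    """scores: predicted values; labels: 1 for gold-standard positives, 0 for negatives.
    Returns the probability that a random positive outranks a random negative."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)  # ascending ranks
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```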
We compared the prediction accuracies in three cases: independent levels, unidirec-
tional flow of training information only, and bidirectional flow of both training information
and predictions. For the latter two cases, we compared the performance when different
combinations of the three levels were involved in training.
For independent levels, we trained each level independently using its own training set,
and then used the predictions as initial estimates to retrain for ten feedback iterations. This
iterative procedure was to make sure that any accuracy improvements observed in the other
architectures were at least in part due to the communications between the different levels,
instead of merely the effect of semi-supervised learning at a single level. For unidirectional
flow, we focused on downward flow of information. The levels were always arranged with
upper levels coming before lower levels.
4.5.3 Results
Table 4.4 summarizes the prediction accuracies of the three levels. All numbers corre-
spond to the average results among the ten feedback iterations. Each row represents the
results of one level. For unidirectional flow and bidirectional flow, the levels involved in
training are also listed. For example, the PR column of unidirectional flow involves the use
of the protein-level training sets in setting up the initial estimate of the residue interactions.
This has no effect on the predictions at the protein level since information only flows down-
ward. The cell at the row for protein interactions is therefore left blank. The best result in
each row is in bold face.
Table 4.4. Prediction accuracies (AUC) of the three levels with different information flow architectures and training levels.
CHAPTER 5. HANDLING ERRORS IN DATA: CONSISTENT PREDICTION OF INTERACTIONS AT DIFFERENT LEVELS
Again, only the term ln Pr(D_ij | λ) depends on λ, so the MLE for each λ_mn remains the same as in Equation 5.10.
A variation of this method is to consider not only protein pairs in the training set T , but
all pairs of proteins. On the positive side, this approach utilizes also the features of protein
pairs outside the training set, which could potentially discover more domain interactions.
On the negative side, if the protein-level predictions are not too accurate, this approach
might introduce more noise to the DDI inference. We study the performance of the new
EM algorithm and this variation empirically in the next section.
5.3.3 Empirical study
We use the new EM algorithm and its variation to predict DDI and PPI as before,
and compare the results with those of the other methods. Protein features include phylo-
genetic profiles, gene expression data and high-throughput data that we used in the training set expansion study. Sub-cellular localization is not included, as it is used to define the gold-standard
negative set. We take the protein kernels constructed in the training set expansion study,
and use their elements as the feature values of protein pairs.
We would also want to study if any potential performance gain of the new algorithms
can be trivially achieved by first running a protein-level classifier to predict PPI from the
training set, and then using the results to infer DDI. To this end, we also include an approach that uses a Naive Bayes classifier to predict protein interactions, and then uses the predicted probabilities to initialize the value of λ_mn in the original EM method. Again, we have two
variations here, one uses only the predicted probabilities of the protein pairs in the training
set T , and the other uses all predicted probabilities.
Figure 5.2 shows the prediction accuracy of the various methods.
Figure 5.2. Comparing the accuracy of the new methods with the original EM algorithm (x-axis: noise level in the training set; y-axis: accuracy in AUC). (a) Accuracy of PPI prediction. (b) Accuracy of DDI prediction. Old EM: the original EM algorithm by Deng et al. Old EM + Naive Bayes (training): the original EM algorithm, with initial parameter values estimated by the Naive Bayes predictions of the protein pairs in the training set. Old EM + Naive Bayes (all): the original EM algorithm, with initial parameter values estimated by the Naive Bayes predictions of all protein pairs. New EM (training): the new EM algorithm, with variables defined for protein pairs in the training set only. New EM (all): the new EM algorithm, with variables defined for all protein pairs. Naive Bayes: the Naive Bayes predictions.
The figure shows that the new EM algorithm predicts protein interactions with a higher
accuracy than the original EM algorithm when the training set is error free. It is also much
less sensitive to errors in the training set. The performance gain is not only due to a
more accurate PPI network input, as initializing the original EM algorithm with Naive
Bayes predictions results in only a small accuracy improvement. Furthermore, the new EM
algorithm is more accurate than Naive Bayes alone in predicting protein interactions. This
result suggests that the new EM algorithm is able to utilize the information from both the
protein and domain levels to make more accurate predictions.
The two variations have very similar performance, with the one considering all protein
pairs having slightly higher accuracy at high noise levels.
The DDI performance is intriguing. When only protein pairs in the training set are
considered, the new EM algorithm is slightly more accurate than the original EM algorithm,
but is still very sensitive to noise. However, when all protein pairs are considered, the new
EM algorithm has a very stable accuracy regardless of the noise level. It appears that by
considering all protein pairs, this approach is dominantly affected by the initial likelihood
estimations that are based on information at the protein level only.
This result suggests that using only information at the protein level to estimate feature
likelihood could make it difficult to predict some domain interactions. We therefore devise a method that estimates the feature likelihood using information at the domain level as well.
5.3.4 Constrained likelihood estimation
Intuitively, if two protein pairs share a large number of common domain pairs, it is
desirable to predict the interaction status of the two protein pairs consistently, so that if
the first pair has a large likelihood, the second pair should also have a large likelihood.
This idea can be formally described as a constrained optimization problem over a graph.
Consider a graph in which each node represents a pair of proteins, labeled with its feature
likelihood. An edge is drawn from a node p to a node q if the latter pair of proteins shares
some domain pairs with the former pair. The weight of the edge is equal to the fraction of
common domain pairs:
$$w_{pq} = \frac{|D(p) \cap D(q)|}{|D(p)|} \qquad (5.19)$$
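A sketch of this weight computation with hypothetical data structures (note that the weight is asymmetric; the symmetrized matrix W* is introduced below):

```python
def edge_weight(domain_pairs_of, p, q):
    """Fraction of p's candidate domain pairs D(p) that are also domain pairs of q,
    i.e. w_pq = |D(p) & D(q)| / |D(p)| as in Equation 5.19.
    domain_pairs_of: dict mapping a protein pair to its set of candidate domain pairs."""
    dp, dq = domain_pairs_of[p], domain_pairs_of[q]
    return len(dp & dq) / len(dp)
```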
We would like to assign a new set of labels to the nodes, so that 1) nodes connected by
edges with large weights have similar labels, and 2) the new node labels do not deviate too
much from the original labels. The first criterion originates from the idea that protein pairs
that share common domain pairs should have similar feature likelihood, while the second
criterion ensures that information at the protein level continues to play a role in the final
likelihood estimations. These two criteria can be formulated mathematically as follows. Let
x be the vector of original labels and y be the vector of new labels. Define the following
objective function f :
$$f(y) = \frac{1}{n}\sum_{p}(y_p - x_p)^2 + \frac{2\alpha}{n(n-1)}\sum_{p<q} w_{pq}(y_q - y_p)^2 \qquad (5.20)$$
$$= \frac{1}{n}\,\lVert y - x\rVert^2 + \frac{2\alpha}{n(n-1)}\, y^{T}(I - W)\,y,$$
where n is the number of protein pairs, α is a tradeoff parameter between the two criteria,
I is the identity matrix, and W is the weight matrix defined as W_pq = w_pq. Again, the
summations can be taken over only protein pairs in the training set T , or all protein pairs.
To minimize the objective function, we differentiate it with respect to y:
$$\frac{\partial f(y)}{\partial y} = \frac{2}{n}(y - x) + \frac{2\alpha}{n(n-1)}\left[(I - W)^{T} + (I - W)\right]y \qquad (5.21)$$
$$= \frac{2}{n}(y - x) + \frac{4\alpha}{n(n-1)}(I - W^{*})\,y$$
$$= \frac{2}{n(n-1)}\left[(n - 1 + 2\alpha)I - 2\alpha W^{*}\right]y - \frac{2}{n}\,x,$$
where W* is the symmetrized weight matrix, W*_{pq} = (w_{pq} + w_{qp})/2. By setting the
derivative to zero, the analytical solution that minimizes f(y) is y = (n - 1)[(n - 1 + 2α)I - 2αW*]^{-1} x.
Since W is an n × n matrix where n is the number of protein pairs, which is of the order
of millions, taking the inverse directly is infeasible. Instead, the equation can be solved by
Jacobi iterations [44]. The initial estimate of y is simply x:
y^{(0)} = x    (5.22)
Subsequent approximations are based on this update formula:
y^{(t)} = \frac{n-1}{n-1+2\alpha} x + \frac{2\alpha}{n-1+2\alpha} W^* y^{(t-1)}    (5.23)
The value of a particular component p of the vector is:
y^{(t)}_p = \frac{n-1}{n-1+2\alpha} x_p + \frac{2\alpha}{n-1+2\alpha} \sum_{(m,n) \in D(p)} \left\{ \sum_{q \in P(m,n), q \neq p} \frac{1}{2} \left[ \frac{1}{|D(p)|} + \frac{1}{|D(q)|} \right] y^{(t-1)}_q \right\}    (5.24)
Brute-force summation is still infeasible if some domain pairs are shared by a large
number of protein pairs. However, the term in curly brackets can be rewritten as
\sum_{q \in P(m,n), q \neq p} \frac{1}{2} \left[ \frac{1}{|D(p)|} + \frac{1}{|D(q)|} \right] y^{(t-1)}_q    (5.25)
= \frac{1}{2|D(p)|} \sum_{q \in P(m,n)} y^{(t-1)}_q + \frac{1}{2} \sum_{q \in P(m,n)} \frac{y^{(t-1)}_q}{|D(q)|} - \frac{y^{(t-1)}_p}{|D(p)|},
where the first two terms can be pre-computed for each domain pair and the last term can
be obtained in constant time.
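To make the update concrete, here is a minimal Python sketch of the Jacobi iteration in Equation 5.23. It assumes the symmetrized weights are already available as a sparse matrix W_star; the variable names are illustrative, and the sketch uses a sparse matrix-vector product rather than the per-domain-pair precomputation of Equation 5.25 used for efficiency in the actual method.

```python
import numpy as np
from scipy.sparse import csr_matrix

def constrained_likelihood(x, W_star, alpha, n_iter=50):
    """Jacobi iterations for minimizing f(y) in Equation 5.20.

    x      : initial (protein-level) feature likelihood estimates, shape (n,)
    W_star : symmetrized weight matrix W*, scipy sparse, shape (n, n)
    alpha  : trade-off between staying close to x and smoothing over
             protein pairs that share domain pairs
    """
    n = x.shape[0]
    c1 = (n - 1) / (n - 1 + 2 * alpha)    # weight of the original labels
    c2 = 2 * alpha / (n - 1 + 2 * alpha)  # weight of the neighbour contribution
    y = x.copy()                          # y^(0) = x  (Equation 5.22)
    for _ in range(n_iter):
        y = c1 * x + c2 * W_star.dot(y)   # Equation 5.23
    return y

# toy usage with four protein pairs
x = np.array([0.9, 0.8, 0.1, 0.2])
W_star = csr_matrix(np.array([[0.0, 0.5, 0.0, 0.0],
                              [0.5, 0.0, 0.0, 0.0],
                              [0.0, 0.0, 0.0, 0.3],
                              [0.0, 0.0, 0.3, 0.0]]))
print(constrained_likelihood(x, W_star, alpha=100))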
By considering different possible values of the tradeoff parameter α, we can obtain
an intuitive interpretation of the update formula in Equation 5.23. The estimation of any
component yp of y at time t involves the initial estimate xp and a weighted sum of the
neighbors of p. If α = 0, the formula reduces to x, which corresponds to the case that
the estimation is made using protein-level information only. When α = 0.5, the weight of
x is n − 1 times the weight of each neighbor. Since there are at most n − 1 neighbors,
and each element in W ∗ is no larger than 1, this value of α is the upper threshold that
guarantees protein-level information would have an effect at least as strong as the domain-
level information. When α = (n-1)/2, the weight of x is the same as that of any of the neighbors.
Finally, when α ≫ (n-1)/2, the second term dominates and the value of y is totally determined
by the labels of the neighbors according to domain-level information.
As long as α ≤ (n-1)/2, the matrix (n - 1 + 2α)I - 2αW* remains diagonally dominant,
which ensures the convergence of the Jacobi iterations.
We use this method to compute constrained likelihood for the new EM algorithm with
several values of α. The prediction accuracy at the different values is similar, and the results
for α = 100 are shown in Figure 5.3.
[Figure 5.3: accuracy (AUC) plotted against noise level for New EM (training), New EM (all), New EM (constrained, training), and New EM (constrained, all).]
Figure 5.3. Comparing the accuracy of the new EM algorithm with constrained and unconstrained likelihood. (a) Accuracy of PPI prediction. (b) Accuracy of DDI prediction.
The performance of EM is observed to improve slightly when predicting PPI with
constrained likelihood, and to remain largely the same when predicting DDI.
5.4 Discussion
Since some DDI prediction algorithms have built-in error detection mechanisms, an-
other way to handle errors in the PPI network is to actively correct the training sets by
detecting dubious training examples, and either removing them from the training sets, or
swapping them from the positive set to the negative set, or vice versa.
We have tried this approach by identifying protein pairs that have the largest difference
between the input and predicted probabilities of interaction. For example, if a protein pair
is in the positive training set, and an algorithm predicts the two proteins to have a very
low probability of interaction according to their domain information, this protein pair is
a potential false positive. We tested if the prediction accuracy of InSite and BP could be
improved by removing these examples or swapping to the opposite training set.
The results suggest that, statistically, false positives do tend to be predicted with a
smaller probability of interaction than true positives, according to a Wilcoxon rank sum test
of the probabilities of the two sets. Similarly, false negatives do tend to be predicted with a
larger probability of interaction than true negatives. Yet the precision of error detection is
not high enough to be useful in correcting errors. Swapping is observed to be infeasible,
not high enough to be useful in correcting errors. Swapping is observed to be not feasible
as it would create even more errors. For example, when the positive training set contains
10% false positives, among the protein pairs in the positive set with the lowest predicted
probabilities of interaction, the percentage of real false positives is between 15%-20%. While
this percentage is higher than the average false positive rate in the whole positive set, the
remaining 80%-85% are true positives. Adding the detected dubious protein pairs to the
negative set would thus increase the error rate. On the other hand, while removing the examples
is guaranteed to reduce the average error rate, it also reduces the size of the training set,
so that fewer domain interactions would find evidence from the positive protein pairs.
It thus seems more effective to deal with PPI errors by working on the input PPI
network before it is used in DDI inference. We have shown that incorporating protein
features is one way to reduce noise. Another way is to enforce PPI predictions to consider
domain occurrence. The constrained likelihood method is one possible approach, yet more
work is still needed to improve its effectiveness.
In general, it is advantageous to incorporate more data if they contain some new
information. The main challenge is finding a proper way to extract such knowledge and
integrate it with existing data during learning. In the next chapter we describe a study in which
we successfully incorporated new data into our learning method and outperformed other
prediction algorithms.
Chapter 6
Adding New Perspectives to
Existing Problems: Discovering
New Information in New Data
6.1 Introduction
In this chapter we switch our focus from the protein interaction network to the gene
regulatory network. The expression of genes is tightly controlled by the regulatory ma-
chinery in the cell, by regulator proteins called transcription factors (TFs). Taking the
simplified view that each gene encodes for a protein, transcription regulation can be mod-
eled as a directed graph with each node representing a gene and its encoded protein, and an
edge from one node to another if the former is a regulator of the latter. In addition to the
directionality, the edges are also signed, with a positive sign indicating a positive regulation
(activation) and a negative sign indicating a negative regulation (suppression).
Methods have been proposed for computationally reconstructing regulatory networks.
One common approach is to use differential equations to model how the expression levels
of genes change according to the abundance of their regulator proteins over time [25, 67,
129, 188]. Since it has only recently been possible to quantitatively measure the abundance
of proteins in each cell for many proteins simultaneously by flow cytometry [43], protein
abundance has been approximated in two ways: 1) the expression level of mRNA has been
used as a proxy of the quantity of the corresponding protein; 2) a multi-cell average has
been used as a proxy of the quantity in individual cells. With the use of mRNA level to
approximate protein abundance, both the data for estimating the expression level of a gene
and the activity of its regulators are obtained from the same mRNA microarray assays.
Each set of experiments involves an initial experimental condition (e.g., an environmental
perturbation such as a heat shock), which affects the expression levels of some genes reacting
to the condition. Then additional expression profiles are obtained at different time points
as a measure of the changing internal state of the cell.
In the resulting dataset, each data point measures the expression level of a gene in
a specific condition at a certain time point. Each such observed value is a mixture of
many different factors, including the previous expression level of the gene, the activity of its
regulators, decay of mRNA transcripts, randomness, and measurement errors. The many
entangled parameters make it difficult to reconstruct the regulatory network based on this
type of data alone.
To decode this kind of complex system, one would want to reduce it to a series of
subsystems with manageable sizes by keeping the values of most parameters constant and
varying only a small number of them. Thanks to the creation of large-scale deletion li-
braries [72], it is now possible to carry out this divide-and-conquer strategy. A deletion
library contains different strains of a species (e.g. yeast), each of which has one of the genes
of the species disabled – completely (knocked out) by mutagenesis [72] or partially (knocked
down) by RNA interference (RNAi) [101]. Profiling the expression of each gene in a deletion
strain allows one to study the sub-network that is affected by the deleted gene. For instance,
if the deleted gene encodes for a protein that is the only activator of another gene, then the
expression level of the latter would be dramatically decreased in the deletion strain of the
former as compared to the wild-type strain in which the regulator gene is intact.
While deletion data are good for detecting simple, direct regulatory events, they may
not be sufficient for decoding those that are more complicated. For example, if a gene
is up-regulated by two TFs in the form of an OR circuit, so that the gene is expressed
normally as long as one of the TFs is active, these edges in the regulatory network cannot
be uncovered by single-gene deletion data. In such a scenario, traditional time course data
could supplement the deletion data in detecting the missing edges. For instance, if at a
certain time point both the TFs have a low abundance and the expression rate of the gene
is observed to be impaired, this observation could help reconstruct the OR circuit if it
provides a good fit to a differential equation model.
In this study we demonstrate how these two types of data can be used in combination
to reconstruct regulatory networks. We propose methods for predicting regulatory edges
from each type of data, and a meta-method for combining their predictions. Using a set of
fifteen benchmark datasets, we show the effectiveness of our approach, which led our team
to win first place in the public challenge of the third Dialogue for Reverse Engineering
Assessments and Methods (DREAM) [50]. We will also discuss potential weaknesses of our
approach, and directions for future studies.
6.2 Problem definition
We first formally define our problem of reconstructing regulatory networks. The target
network is a directed graph with n nodes. The edges are completely unobserved, and we
are to predict them from the data features alone. In other words, this is an unsupervised
learning setting. The edges are signed, but these signs are not considered in our experimental
evaluation. The goal is thus to learn a model from the data features, such that given an
ordered pair of two genes (i, j), one can predict whether i is a regulator of j.
We use two types of data features: perturbation data and deletion data. Deletion data
are further sub-divided into homozygous deletion and heterozygous deletion.
In a perturbation time series dataset, an initial perturbation is performed at time 0 by
setting the expression levels of each gene to a certain level. Then the regulatory system is
allowed to adjust the internal state of the cell by up- and down-regulating genes according
to the abundance of the TFs. The expression level of each gene is taken at subsequent time
points. Thus, for each perturbation experiment, each gene is associated with a vector of
real numbers that correspond to its expression level at different time points after the initial
perturbation. If there are m perturbation experiments and the i-th one involves t_i time
points, then each gene is associated with a vector of \sum_{i=1}^{m} t_i expression values.
In a deletion dataset, a gene is deleted (knocked-out or knocked-down), and the result-
ing expression level of each gene at steady state is measured. By deleting each gene one
by one, and adding the wild-type (no deletion) as control, each gene is associated with a
vector of n+1 values, corresponding to its steady-state expression level in the n+1 strains.
For diploid organisms (with two copies of each gene in the genome), the deletion can be
homozygous (with both copies deleted, i.e., “null mutant”) or heterozygous (with only one
copy deleted).
We assume both types of deletion data, as well as perturbation data, are available,
although it is trivial to modify our algorithm by simply removing the corresponding sub-
routines if any kind of data is missing.
6.3 The learning method
Our basic strategy is to learn the simple regulation cases from deletion data using
noise models, and to learn the more complex ones from perturbation data using differential
equation models. We first describe the two kinds of models and how we learn the parameter
values from data, and then discuss our method to combine the two lists of predicted edges
into a final list of predictions.
6.3.1 Learning noise models from deletion data
We consider a simple noise model for deletion data, that each data point is the super-
position of the real signal and a reasonably small Gaussian noise independent of the gene
and the time point. The Gaussian noise models the random nature of the biological system,
and the measurement error. Based on this model, the larger the change in expression
of a gene a from wild type to the deletion strain of a gene b, the less likely that the
deviation is due to the Gaussian noise alone, and thus the more likely that a is directly or
indirectly regulated by b.
Notice that the regulation could be direct (b regulates a) or indirect (b regulates c
that directly or indirectly regulates a). There are studies that try to separate the direct
regulation from the indirect ones using methods such as graph algorithms [189] and condi-
tional correlation analysis [155]. In this study we do not attempt to distinguish direct and
indirect regulation, and show that even assuming all significant deviation in deletion data
to be direct regulation could already provide substantial performance improvements over
approaches that focus on perturbation data only.
Given the observed expression level x^b_a of a gene a in the deletion strain of gene b, and
its real expression level in wild type, x^{wt*}_a, we would like to know whether the deviation
x^b_a - x^{wt*}_a is merely due to noise. To answer this question, we would need to know the
variance σ^2 of the Gaussian, assuming the noise is non-systematic and thus the mean µ
is zero. If the value of σ^2 is known, then the probability of observing a deviation as
large as x^b_a - x^{wt*}_a due to random chance alone is simply 2[1 - Φ(|x^b_a - x^{wt*}_a| / σ)], where Φ is the
cumulative distribution function of the standard Gaussian distribution. The complement,
p_{b→a} = 1 - 2[1 - Φ(|x^b_a - x^{wt*}_a| / σ)] = 2Φ(|x^b_a - x^{wt*}_a| / σ) - 1, is the probability that the deviation is
due to a regulation event. One can then rank all the gene pairs (b, a) in decreasing order of
p_{b→a}.
To implement the above procedure, it is necessary to estimate σ^2 from data, which is
standardly done using the unbiased sample variance of the deviations from wild-type
expression of data points that are not affected by the deleted gene. However, this involves two difficulties.
First, the set of genes not affected by the deleted gene is unknown and is exactly what
we are trying to learn from the data. Second, the observed expression value of a gene in
the wild-type strain, x^{wt}_a, is also subject to random noise, and thus cannot be used as the
gold-standard reference point x^{wt*}_a in the calculations.
We propose an iterative procedure to progressively refine our estimation of p_{b→a}. We
start by assuming the observed wild-type expression levels x^{wt}_a are reasonable rough estimates
of the real wild-type expression levels x^{wt*}_a. Using them as the initial reference points,
we repeat the following three steps for a number of iterations:
1. Calculate the probability of regulation p_{b→a} for each pair of genes (b, a) based on the
current reference points x^{wt}_a. Then use a p-value of 0.05 to define the set of potential
regulations: if the probability that the observed deviation from wild type of a gene a in
a deletion strain b is due to random chance alone is less than 0.05, we treat b → a as
a potential regulation. Otherwise, we add (b, a) to the set P of gene pairs for refining
the error model.

2. Use the set P to re-estimate the variance of the Gaussian noise: \sigma^2 = \frac{\sum_{(b,a) \in P} (x^b_a - x^{wt}_a)^2}{|P| - 1}.

3. For each gene a, re-estimate its wild-type expression level as the mean of its
observed expression levels in strains in which the expression level of a is unaffected
by the deletion: x^{wt}_a := \frac{x^{wt}_a + \sum_{b:(b,a) \in P} x^b_a}{1 + |\{b : (b,a) \in P\}|}.
After the iterations, the probability of regulation p_{b→a} is computed using the final
estimates of the reference points x^{wt}_a and the variance of the Gaussian noise σ^2.
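A rough Python sketch of this iterative procedure under the stated Gaussian noise model is given below; the matrix layout (X[b, a] holding the expression of gene a in the deletion strain of b), the initial noise estimate, and the function name are assumptions made for illustration, not the thesis implementation.

```python
import numpy as np
from scipy.stats import norm

def deletion_noise_model(X, x_wt, n_iter=5, p_cutoff=0.05):
    """X[b, a]: expression of gene a in the deletion strain of gene b.
    x_wt[a] : observed wild-type expression of gene a.
    Returns p[b, a], the probability that b -> a is a regulation event."""
    ref = x_wt.copy()                       # current wild-type reference points
    sigma = X.std()                         # crude initial noise estimate (assumption)
    for _ in range(n_iter):
        dev = X - ref[None, :]
        # probability of seeing a deviation this large by chance alone
        p_chance = 2 * (1 - norm.cdf(np.abs(dev) / sigma))
        unaffected = p_chance >= p_cutoff   # the set P of "no regulation" pairs
        # step 2: re-estimate the Gaussian noise variance from P
        sigma = np.sqrt((dev[unaffected] ** 2).sum() / (unaffected.sum() - 1))
        # step 3: re-estimate each wild-type level from unaffected strains
        counts = unaffected.sum(axis=0)
        ref = (x_wt + (X * unaffected).sum(axis=0)) / (1 + counts)
    dev = X - ref[None, :]
    return 2 * norm.cdf(np.abs(dev) / sigma) - 1   # p_{b->a}
```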
Notice that we have chosen to use a “conservative” p-value of 0.05 in the following
sense: when the number of genes in the network, n, is sufficiently large (e.g. n ≥ 10) and
there are relatively few regulatory edges, there is a large number of gene pairs for estimating
the parameters such that missing some of them would not seriously affect the estimation. It
would thus be good to add to P only gene pairs that are very unlikely to contain regulatory
edges. This is achieved by using a large (i.e., conservative in this context) p-value to define
the potential regulatory edges.
The above iterative procedure can be applied to both homozygous and heterozygous
deletion data, although the regulation signals are expected to be less clear in the heterozy-
gous case since deleting only one copy of a regulator gene may induce only a mild effect
on its targets. The final p-values computed from homozygous data are thus expected to be
more reliable. Yet the ones learned from heterozygous data can still be useful references in
resolving ambiguous cases, as we will discuss in more detail when describing our approach
to combining the predictions learned from the different types of data.
6.3.2 Learning differential equation models from perturbation time series
data
For time series data after an initial perturbation, we use differential equations to model
the gene expression rates. The general form is as follows:
\frac{dx_i}{dt} = f_i(x_1, x_2, ..., x_n),    (6.1)
where xi represents the expression level of gene i and fi is a function that explains how the
expression rate of gene i is affected by the expression level of all the genes in the network,
including the level of gene i itself. Various types of function fi have been proposed. We
consider three of them. The first one is a linear model [67]:
\frac{dx_i}{dt} = a_{i0} - a_{ii} x_i + \sum_{j \in S} a_{ij} x_j,    (6.2)
where a_{i0} is the basal expression rate of gene i in the absence of regulators, a_{ii} is the decay
rate of the mRNA transcripts of i, and S is the set of potential regulators of i. In theory,
S could be set as [n] = {1, 2, ..., n}, the whole set of genes in the network, as the regulators
of i are unknown. However, for performance reasons, S is usually restricted to some small
sets of genes. Our choice of S will be discussed below. For each potential regulator j, a_{ij}
explains how the expression of i is affected by the abundance of j. A positive a_{ij} indicates
that j is an activator of i, and a negative a_{ij} indicates that j is a suppressor of i.
The linear model assumes a linear relationship between the expression level of the
regulators and the resulting expression rate of the target. It is a rough first approximation
of the expression rate. An advantage of the model is the small number of parameters
(|S| + 2), yet real biological regulatory systems seem to exhibit non-linear characteristics.
The second model we consider assumes the more realistic sigmoidal relationship between
the regulators and the target [188]:
\frac{dx_i}{dt} = \frac{b_{i1}}{1 + \exp(-a_{i0} - \sum_{j \in S} a_{ij} x_j)} - b_{i2} x_i,    (6.3)
where b_{i1} is the maximum expression rate of i and b_{i2} is its decay rate. This model involves
|S| + 3 parameters.
The third model we consider has a multiplicative form, with each factor capturing the
relationship between the target and one of its regulators [129]:
\frac{dx_i}{dt} = a_{i0} \prod_{j_1 \in S_1} \frac{b_{ij_1}}{x_{j_1}^{c_{ij_1}} + b_{ij_1}} \prod_{j_2 \in S_2} \frac{x_{j_2}^{c_{ij_2}}}{x_{j_2}^{c_{ij_2}} + b_{ij_2}} - a_{i1} x_i,    (6.4)
where S_1 and S_2 represent the sets of suppressors and activators, respectively, a_{i1} is the
decay rate, b_{ij_1} and b_{ij_2} are rate constants, and c_{ij_1} and c_{ij_2} are sigmoidicity constants. This
model involves 2|S_1| + 2|S_2| + 2 parameters.
In our actual implementation, the exponent terms in the third model sometimes caused
numerical instability when the base was close to zero. We therefore based our predictions
on the first two models.
Our goal is to try different possible regulator sets S (or S1 and S2) and identify the
ones that predict the observed expression levels well in the least-square sense:
g_i(\theta) = \sum_t (\hat{x}_{it} - x_{it})^2,    (6.5)
where θ denotes the set of parameters (a, b and c), x_{it} is the expression level of gene i at
time point t, and \hat{x}_{it} is the corresponding level predicted by the model. The summation is
taken over all time points of all perturbation experiments.
The objective function is not convex with respect to the parameters. We use Newton's
method [27] to find local minima of the objective function g_i(θ) with 100 random initial
values of θ, and adopt the one that provides the best fit with the smallest g_i(θ). The
expression vector x, its gradient ∇x and Hessian ∇^2x are estimated using the closed-form
formulas provided by the second-order Runge-Kutta method [44].
We try two types of regulator sets. The first type involves single regulators, in which
we try each gene j as the potential regulator of gene i in turn, and compare the least
square errors of their best-fit models. The second type involves high-confidence potential
regulators, plus one extra regulator to be tested. As we will see in the next section, the
high-confidence potential regulators are obtained from the predictions of the noise models
learned from the deletion data, as well as those predicted by the single-regulator differential
equation models. We call such models the “guided models” since the construction of the
regulator sets is guided by previous predictions. The full detail of the resulting algorithm
will be given in the next subsection.
We also tried double regulator sets with all pairs of potential regulators. Yet the
resulting models did not appear to provide much additional information on top of the single
regulator set models, while requiring much longer computational time. We therefore decided
to consider only the single regulator sets and guided single regulator sets.
For a regulator set S and a target gene i, the value of the objective function of the best
model indicates how likely i is regulated by the members of S. The values are thus used to
rank the likelihood of existence of the regulatory edges.
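As an illustration of how a regulator set can be scored, the following Python sketch fits the linear model of Equation 6.2 by ordinary least squares on finite-difference estimates of dx_i/dt, and ranks single regulators by residual error. The thesis implementation instead minimizes g_i(θ) with Newton's method and Runge-Kutta integration over all perturbation experiments, so this is only a simplified stand-in with assumed names and a single time series.

```python
import numpy as np

def linear_model_score(x, i, S, dt=1.0):
    """Least-squares fit of the linear model (Eq. 6.2) for target gene i
    with candidate regulator set S, on one perturbation time series.

    x : expression matrix, shape (time_points, genes)
    Returns the residual sum of squares; lower means a better fit."""
    dxdt = np.diff(x[:, i]) / dt                      # finite-difference estimate of dx_i/dt
    xm = x[:-1, :]                                    # expression levels at the earlier time point
    # design matrix: basal rate, decay term, one column per candidate regulator
    A = np.column_stack([np.ones(len(dxdt)), -xm[:, i]] + [xm[:, j] for j in S])
    coef = np.linalg.lstsq(A, dxdt, rcond=None)[0]
    return float(((A @ coef - dxdt) ** 2).sum())

def rank_single_regulators(x, i, dt=1.0):
    """Try each gene j as the sole regulator of gene i and rank by fit."""
    n = x.shape[1]
    scores = {j: linear_model_score(x, i, [j], dt) for j in range(n) if j != i}
    return sorted(scores, key=scores.get)             # best-fitting regulators first
```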
6.3.3 Combining the predictions of the models
Our main idea for combining the predictions of the different models learned from dele-
tion and perturbation data is to rank the predictions according to our confidence that they
are correct. Specifically, we make predictions in batches, with the first batch containing
the most confident predictions, and each subsequent batch containing the most confident
predictions that have not been covered by the previous batches. Within each batch, the
predictions are ordered by the confidence of the models, which corresponds to the probability
of regulation p_{b→a} for noise models, and the negated objective score -g_i(θ) for differential
equation models. We define the batches as follows:
• Batch 1: all predictions with a probability of regulation larger than 0.99 according to
the noise model learned from homozygous deletion data
• Batch 2: all predictions with an objective score two standard deviations below the av-
erage according to all types of differential equation models learned from perturbation
data
• Batch 3: all predictions with an objective score two standard deviations below the
average according to all types of guided differential equation models learned from per-
turbation data, where the regulator sets contain regulators predicted in the previous
batches, plus one extra potential regulator
• Batch 4: as in batch 2, but requiring the predictions to be made by only one type of
the differential equation models as opposed to all of them
• Batch 5: as in batch 3, but requiring the predictions to be made by only one type of
the differential equation models as opposed to all of them
• Batch 6: all predictions with a probability of regulation larger than 0.95 according to
both the noise models learned from homozygous and heterozygous deletion data, and
have the same edge sign predicted by both models
• Batch 7: all remaining gene pairs, with their ranks within the batch determined by
their probability of regulation according to the noise model learned from homozygous
deletion data
In general, we put the greatest confidence in the noise model learned from homozygous
deletion data as the signals from this kind of data are clearest among the three types of
data. We are also more confident with predictions that are consistently made, either by the
different types of differential equation models (batches 2 and 3 vs. batches 4 and 5) or by
the noise models learned from homozygous and heterozygous deletion data (batch 6).
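The batch-wise combination can be sketched as a simple merge of ranked candidate lists, as below; the toy batches are placeholders for the criteria listed above, not actual predictions.

```python
def combine_batches(batches):
    """batches: list of lists, each containing (edge, confidence) pairs,
    ordered from the most trusted batch (batch 1) to the least (batch 7).
    An edge is placed in the first batch that proposes it; within a batch,
    edges are ordered by the confidence of the corresponding model."""
    seen, ranked = set(), []
    for batch in batches:
        for edge, conf in sorted(batch, key=lambda e: -e[1]):
            if edge not in seen:          # keep only edges not covered earlier
                seen.add(edge)
                ranked.append(edge)
    return ranked

# toy usage: batch 1 from the homozygous-deletion noise model,
# batch 2 from the differential equation models
batch1 = [(("G2", "G5"), 0.999), (("G1", "G3"), 0.995)]
batch2 = [(("G1", "G3"), 3.2), (("G4", "G2"), 2.8)]   # scores are -g_i(theta)
print(combine_batches([batch1, batch2]))
```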
6.4 Performance study
6.4.1 Datasets and performance metrics
We used the algorithm described above to take part in the third Dialogue for Reverse
Engineering Assessments and Methods Challenge (DREAM3) [51] on regulatory network
reconstruction. The challenge involves fifteen benchmark datasets, five of which have 10
genes, five have 50 and five have 100. The networks are constructed based on parameters
extracted from modules in real biological networks [124]. At each size, two of the networks
are based on parameters from the regulatory network of E. coli, and three are based on
yeast.
The predictions are compared against the actual edges in the networks by the DREAM
organizer using four different metrics for evaluating the accuracy:
• AUPR: The area under the precision-recall curve
• AUROC: The area under the receiver-operator characteristics curve
• pAUPR: The p-value of AUPR based on the distribution of AUPR values in 100,000
random network link permutations
• pAUROC: The p-value of AUROC based on the distribution of AUROC values in
100,000 random network link permutations
These metrics are further aggregated into an overall p-value for each size using the
geometric mean of the five p-values from the five networks, and finally an overall score
equal to -0.5 log_{10}(p_1 p_2), where p_1 and p_2 are the geometric means of the pAUPR and pAUROC
values, respectively.
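For concreteness, a small Python sketch of this aggregation (the p-values in the toy call are made up):

```python
import numpy as np

def dream_overall_score(p_aupr, p_auroc):
    """p_aupr, p_auroc: per-network p-values (five networks per size).
    Returns the overall score -0.5 * log10(p1 * p2), where p1 and p2 are
    the geometric means of the pAUPR and pAUROC values."""
    p1 = np.exp(np.mean(np.log(p_aupr)))
    p2 = np.exp(np.mean(np.log(p_auroc)))
    return -0.5 * np.log10(p1 * p2)

# toy example: five very small p-values give a large (good) score
print(dream_overall_score([1e-5] * 5, [1e-8] * 5))   # 6.5
```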
6.4.2 Results
The size-10 challenge attracted 29 participating teams, the size-50 challenge 27 teams,
and the size-100 challenge 22 teams. The large number of participants makes the
challenge currently the largest benchmark for gene network reverse engineering [51].
Our algorithm ended in first place on all three network sizes. The complete set of
performance scores for all teams can be found at the DREAM3 web site [51]. Below we
summarize our prediction results, and discuss some interesting observations.
Table 6.1 and Table 6.2 show the AUROC and pAUROC values of our predictions
reported by the DREAM organizer, respectively. From the p-values, we see that our
predictions are consistently and significantly better than random. In general, we observe that our
method performed better on the E. coli networks, but was relatively unaffected by the network
size, as evaluated by AUROC.
We notice that in some cases our first predictions are already very close to the actual
network. Figure 6.1(a) shows the actual network of the Yeast1-size10 network, where an
arrowhead represents an activation and a blunt-end represents a suppression. Figure 6.1(b)
Table 6.1. AUROC of our predictions. Networks: Ecoli1, Ecoli2, Yeast1, Yeast2, Yeast3.
Figure 7.6. An example collection of yeast essential genes represented in RDF/XML format.
The generated RDF metadata file, data file, and schema file are linked via a common
system-generated identifier that is stored in the identifier property of the
metadata file. We create an RDF repository for each file type.
7.3 Biological use case: YeastHub
To demonstrate how to use semantic web techniques to integrate diverse types of
genome data in heterogeneous formats, we have developed a prototype application called
“YeastHub”. In this application, a data warehouse has been constructed using Sesame to
store and query a variety of yeast genome data obtained from multiple sources. For perfor-
mance reasons, we create the RDF repository using main memory. The application allows
the user to register a dataset and convert it into RDF format if it is in tabular format. Once
the datasets are loaded into the repository, they can be queried in the following ways.
1. Ad hoc queries. This allows the user to compose RDF-based query statements and
issue them directly to the data repository. Currently, it allows the user to use the
following query languages: RQL, SeRQL, and RDQL. This requires the user to be
familiar with at least one of these query syntaxes as well as the structure of the RDF
datasets to be queried. SQL users should find it easy to learn RDF query languages.
2. Form-based queries. While ad hoc RDF queries are flexible and powerful, users who
do not know RDF query languages would prefer to use an alternative method to pose
queries. Even users who are familiar with RDF query languages might find these
languages arcane to use. To this end, the application allows users to query the repos-
itory through web query forms (although they are not as flexible as the ad hoc query
approach). To create these query forms, YeastHub provides a query template genera-
tor. Figure 7.7 shows the web pages that allow the user to perform the steps involved
in generating and saving the query form. First, as shown in Figure 7.7(a), the user
selects the datasets and the properties of interest. After the selection, the user pro-
ceeds to specify how to generate the query form template, as shown in Figure 7.7(b).
This page requires the user to indicate which properties are to be used for the query
output (select clause), search Boolean criteria (where clause), and join criteria. In
addition, the user is given the option to create a text field, pull-down menu, or select
list (in which multiple items can be selected) for each search property. Once the entry
is complete, the user can go ahead to generate the query form by saving it with a
name (all this information is stored as metadata in a MySQL database). The user
Table 7.1. Types of databases and data distribution formats.
                               Tabular                 XML    RDF        Rel. DB
Global Databases (GB/TB)                               BIND   UniProt
Boutique Databases (MB/GB)     SGD, YGDP, MIPS         GO                TRIPLES
Local Databases (KB/MB)        Protein Chips, Protein-Protein Interactions
can then use the generated query form, as shown in Figure 7.7(c), to perform Boolean
queries on the selected datasets. Notice that the user who generates the query is not
necessarily the same person who uses the form to query the repository. Some users
may just use the query form(s) generated by someone else to perform data querying.
These users may not have the need to create query forms themselves.
Presently, both types of queries return results in HTML format for display to the
human user. Other formats (e.g., RDF format) can be provided.
7.3.1 Example queries
Our example queries involve integrating datasets obtained from different web-accessible
databases. Table 7.1 lists these databases. In addition to showing the data distribution
formats, it categorizes the databases into the following types.
1. Global databases represent very large repositories typically consisting of gigabytes or
terabytes of data. These databases are widely accessed by researchers from different
countries via the Internet. The example here is the yeast portion of UniProt in RDF
format.
2. Boutique databases are large databases with typical sizes ranging from several megabytes
to hundreds of megabytes (or even several gigabytes). Examples include SGD, YGDP,
Figure 7.7. (a) Selection of data sources and properties for creating a query template. (b) Query template generation. (c) Generated query form template.
MIPS, BIND, GO, and TRIPLES. While SGD and MIPS datasets are typically avail-
able in tabular format, GO and BIND are available in XML format. TRIPLES is a
relational database.
3. Local databases are relatively small databases that are typically developed and used
by individual laboratories. These databases may range from several kilobytes to
several (or tens of) megabytes in size. Examples include a protein-protein interac-
tion dataset extracted from BIND and a protein kinase chip dataset. While global
and boutique databases are mostly Internet-accessible, some local databases may be
network-inaccessible and may involve proprietary data formats.
Example Query 1: Figure 7.8 shows a query form that allows the user to simultaneously
query the following yeast resources: a) essential gene list obtained from MIPS, b) essential
gene list obtained from YGDP, c) protein-protein interaction data [208], d) gene and GO
ID association obtained from SGD, e) GO annotation and, f) gene expression data obtained
from TRIPLES [110]. Datasets (a)-(d) are distributed in tab-delimited format. They were
converted into our RDF format. The GO dataset is in an RDF-like XML format (we made
some slight modifications to it to make it RDF-compliant). TRIPLES is an Oracle database.
We used D2RQ to dynamically map a subset of the gene expression data stored in TRIPLES
to RDF format.
The example query demonstrates how an integrated query can be used to correlate gene
essentiality with connectivity derived from the interaction data. The hypothesis
is that the higher a gene's connectivity, the more likely it is essential. This hypothesis
has been investigated in other work [80, 197]. In the query form shown in Figure 7.8, the
user has entered the following Boolean condition: connectivity = 80, expression level = 1,
Figure 7.8. Example integrated query form.
growth condition = vegetative, and clone id = V182B10. Such a Boolean query joins across
six resources based on common gene names and GO IDs. Figure 7.9 shows the correspond-
ing SeRQL query syntax and output. The query output indicates that the essential gene
(YBL092W) has a connectivity equal to 80. This gene is found in both the MIPS and
YGDP essential gene lists. This gives a higher confidence of gene essentiality as the two re-
sources might have used different methods and sources to identify their essential genes. The
query output displays GO annotation (molecular function, biological process, and cellular
component) and TRIPLES gene expression.
Example Query 2: This query demonstrates how to integrate the UniProt dataset with
the yeast protein kinase chip dataset that captures the number of substrates that each kinase
phosphorylates with an expression level > 1. Figure 7.10 shows the RQL query syntax and
the output that gives the number of substrates phosphorylated by kinase “YBL105C” (level
> 1) as well as the functional annotation of the kinase. This protein is listed as essential
in both MIPS and YGDP. In addition to connectivity, we might hypothesize that the more
substrates a kinase phosphorylates at a high level, the more likely the kinase is essential.
7.3.2 Performance
Sesame allows a repository to be created using a database (e.g., MySQL), native disk,
or main memory. We evaluate the performance of these approaches using example query
1 described previously. We run the same query twice against main memory, MySQL, and
native disk repositories. Each repository stores the identical datasets with a total of ∼ 800K
triple statements.
Figure 7.9. Syntax and output of example query 1.
Figure 7.10. Syntax and output of example query 2.
Table 7.2. Query performance.
Query run    Memory    MySQL    File
1            312ms     308ms    9929ms
2            306ms     44ms     11045ms
Table 7.2 shows the amount of time (in milliseconds) it takes for query execution
for each repository type. Both the main memory and MySQL approaches take about the
same amount of time on the first query run (∼ 300ms). On the second query run, the
MySQL approach is 7 times faster than the main memory one due to a cache effect (the
speed difference, however, is only a fraction of a second). The file-based approach takes the
longest query execution time.
Table 7.3 shows the amount of time (in seconds) it takes to load an RDF-formatted
UniProt data file, which contains yeast data only, into the three repositories. The file size
is about 63 MB (∼ 1.4 million triple statements). As shown in Table 7.3, the main memory
approach has the best data loading performance, while the MySQL approach has the worst
performance due to the overhead involved in creating data indexes. In conclusion, the main
memory approach gives the best overall performance.
Table 7.3. UniProt data loading performance.
Load run    Memory    MySQL    File
1           38ms      651ms    262ms
2           40ms      646ms    275ms
7.3.3 Implementation
YeastHub is implemented using Sesame 1.1. We use Tomcat as the web server. The
web interface is written using Java servlets. The tabular-to-RDF conversion is written using
Java. To access and query the repository programmatically, we use Sesame’s Sail API that
is Java-based. We use MySQL as the database server (version 3.23.58) to store information
about the correspondences between the resource properties and the query form fields. Such
information facilitates automatic generation of query forms and query statements. We
also use the database server to create an RDF repository for performance benchmark as
described previously. YeastHub is currently running on a Dell PC server that has dual
processors of 2 GHz, 2 GB main memory, and a total of 120 GB hard disk space. The
computer operating system is Red Hat Enterprise Linux AS release 3 (Taroon Update 4).
7.4 Discussion
Although the tab-delimited format is popularly used for distributing life sciences data,
there are other data distribution formats such as the record format (or the attribute-value
pair format), the XML format, and other proprietary formats. It would be logical to incorporate
these formats into our RDF data conversion scheme. In the process of our RDF data
conversion, we generate the corresponding RDF schemas. While our approach to generating
new schemas allows existing properties that are defined in other schemas to be reused,
there is a need to perform schema mapping at a later stage, as new standard RDF schemas
will emerge. How to translate one RDF schema into another RDF schema would be an
interesting semantic web research topic.
While URL’s are commonly used as a means to identify resources on the web, they
have the following problems.
1. The web server referenced by the URL may be broken or become unavailable. Also,
when a new server replaces the old one, the URL may need to be changed.
2. The syntax of the URL may change over time as the underlying data retrieval program
evolves. For example, parameter names may be changed and additional parameters
may be required.
3. The data returned by a URL may change over time as the underlying database con-
tents change. This creates a problem for researchers when they want to exactly
reproduce any observations and experiments based on a data object.
To address these problems, the Life Science Identifier project (http://www-124.ibm.
com/developerworks/oss/lsid/) has proposed a standard scheme to reference data re-
sources. Every LSID consists of up to five parts: the Network Identifier (NID); the root
DNS name of the issuing authority; the namespace chosen by the issuing authority; the ob-
ject id unique in that namespace; and finally an optional revision id for storing versioning
information. For example, “urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434” is an LSID that
references a PubMed article. Each part is separated by a colon to make LSIDs easy to
parse. The specific details of how to resolve the LSID to a given data object are left to
an LSID issuing authority. In our case, we can potentially implement an LSID resolution
provide researchers with a set of tools to carry out such computations with great ease.
The system provides five main types of functionality: (1) Network management: storing,
retrieving and categorizing networks. A comprehensive set of widely used network datasets
is preloaded, put into standard form, and categorized with a set of tags. (2) Network
visualization: displaying networks in an interactive graphical interface (Figure 9.1). (3)
Network comparison and manipulation: various kinds of filtering and multiple network
operations. (4) Network analysis: computing various statistics for the whole network and
subsets, and finding motifs and defective cliques. (5) Network Mining: predicting one
network based on the information in another.
Our system shares some elements with some other network analysis and visualization
systems, such as Cytoscape [167], JUNG¹, N-Browse², and Osprey³, but also offers some
additional features such as defective clique finding. In addition, being a Web-based system,
tYNA also has some unique advantages:
• Users can share networks through a centralized database.
• Computationally intensive tasks such as motif finding and statistics calculations can
be performed on powerful servers.
• The system can be linked from/to other online resources.
• Users can incorporate some functions of tYNA into their own programs using the
SOAP-based web service interface.
Table 9.1 summarizes some major differences between the systems. We do not attempt
1 http://jung.sourceforge.net/
2 http://nematoda.bio.nyu.edu:8080/NBrowse/N-Browse.jsp?last=false
3 http://biodata.mshri.on.ca/osprey/
Figure 9.1. The intersection of two yeast-two-hybrid datasets [97, 185] with all nodes having no edges in the intersection filtered by a statistics filter. The nodes are colored according to their degrees. Also shown in the figure are the various statistics of the resulting network.
Table 9.1. A comparison of several network analysis and visualization systems. Note that we have included only some network analysis and visualization systems in this comparison.
Columns: (1) Cytoscape Basic (2.3.1); (2) Cytoscape NetworkAnalyzer plugin (1.0); (3) Cytoscape Metabolica plugin (1.0); (4) JUNG (1.0); (5) N-Browse (10 Aug 2006); (6) Osprey (1.2); (7) tYNA (10 Aug 2006).
Main purpose: Visualization | Network analysis | Motif finding | Graph library | Visualization | Visualization | Network analysis
System: Standalone | Plugin | Plugin | Standalone | Web | Standalone | Web
· Link from external resources: Indirect: Java Web Start | N/A | N/A | No | Direct | No | Direct
· Web service interface: No | No | No | No | No | No | Yes: SOAP
Statistics calculation: No | Yes | No | Yes | No | No | Yes
· Degree: No | Yes | No | Yes | No | No | Yes
· Clustering coefficient: No | Yes | No | Yes | No | No | Yes
· Shortest path length (eccentricity): No | Yes | No | Yes | No | No | Yes
· Betweenness: No | No | No | No | No | No | Yes
Motif finding: No | No | Yes | No | No | No | Yes
· Chain: No | No | No | No | No | No | Yes
· Cycle: No | No | Yes | No | No | No | Yes
· Feed-forward loop: No | No | Yes | No | No | No | Yes
· Complete two-layer: No | No | No | No | No | No | Yes
· Defective clique: No | No | No | No | No | No | Yes
Multiple network operations: Yes | No | No | No | Yes | Yes | Yes
User network management: Session | N/A | N/A | Individual files | Database | Individual files | Database
· Network classification: By session | N/A | N/A | No | No | No | By tagging attributes
to make the list exhaustive. Furthermore, since each system has its unique goals, it is not
completely fair to compare them in this way. This table simply serves as a quick reference
for readers who are interested in knowing some of the differences between the systems.
9.2 Using tYNA
tYNA provides a simple view with some basic features, and an advanced view for more
complex analyses.
9.2.1 Uploading networks and categories
The first step of analysis is to upload networks. tYNA accepts various file formats,
including the SIF format of Cytoscape. One may also enter additional attributes to organize
the networks into groups, such as network type (e.g. protein-protein interaction), organism
(e.g. yeast) and experimental method (e.g. yeast-two-hybrid) (Figure 9.2). Furthermore,
tYNA allows users to analyze subsets of the networks (e.g., active parts in a dynamic
network [86, 123]) by using category files.
Figure 9.2. Networks uploaded and categorized in the tYNA database.
9.2.2 Loading networks into workspaces
After uploading a network, one may view its statistics and visualize it graphically by
loading it into a workspace. A workspace is a working area for a single network (Fig-
ure 9.1). Various statistics are computed, such as the clustering coefficient, eccentricity and
betweenness [210]. Networks are visualized in Scalable Vector Graphics (SVG) using the
aiSee package⁴, which facilitates an interactive interface: one may change the appearance
of the network in real time (Figure 9.3 and Figure 9.4).
Figure 9.3. Visualizing a network in a workspace.
9.2.3 Single-network operations (advanced view)
Filtering allows one to retain a portion of the network, based on a statistics cutoff
(e.g. the 5% of nodes with the highest out-degrees) or node names. This makes it easy
to identify the hubs and bottlenecks in a graph. Motif finding identifies various regular
patterns in the network, including chains, cycles, feed-forward loops and complete two-
4 http://www.aisee.com
number of sequences in the MSA, and w_{kl} is the weight for the sequence pair k, l. If the
two sites are coevolving in that radical substitutions at the first site are accompanied by
radical substitutions at the second site, the correlation will be high. Our system provides
the classical McLachlan matrix [128] that scores substitutions based on the physicochemical
properties of the residues, as well as matrices based on residue volume, pI, and hydropa-
thy index, for studying the properties individually. Two variations are provided for each
of them: the “absolute value version” considers only the magnitude, while the “raw ver-
sion” also considers the direction of change, for detecting compensatory mutations. The
correlation can be computed from raw values (Pearson correlation) or from value ranks
(Spearman correlation [143]). Several schemes are provided for the weights wkl, preventing
false coevolution signals due to uneven sequence representation or site conservation.
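A hedged Python sketch of such a correlation-based score is shown below; it computes the weighted Pearson correlation of substitution scores at two alignment columns, with the substitution matrix and the pair weights w_kl passed in as plain dictionaries. The names and data structures are illustrative, not the server's actual code.

```python
import numpy as np

def correlation_coevolution(col_i, col_j, sub_score, w):
    """Weighted correlation of substitution scores at two alignment columns.

    col_i, col_j : residues (one per sequence) at sites i and j
    sub_score    : dict mapping a residue pair (a, b) to a substitution score,
                   e.g. entries of the McLachlan matrix
    w            : dict mapping a sequence-pair index (k, l) to its weight w_kl
    """
    n = len(col_i)
    si, sj, wts = [], [], []
    for k in range(n):
        for l in range(k + 1, n):
            si.append(sub_score[(col_i[k], col_i[l])])
            sj.append(sub_score[(col_j[k], col_j[l])])
            wts.append(w.get((k, l), 1.0))
    si, sj, wts = map(np.asarray, (si, sj, wts))
    mi = np.average(si, weights=wts)
    mj = np.average(sj, weights=wts)
    cov = np.average((si - mi) * (sj - mj), weights=wts)
    var_i = np.average((si - mi) ** 2, weights=wts)
    var_j = np.average((sj - mj) ** 2, weights=wts)
    return cov / np.sqrt(var_i * var_j)
```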
10.2.2 Perturbation-based functions
The idea of perturbation-based functions is to perform a “perturbation” at a first
site, and observe its effect on a second site. The Statistical Coupling Analysis (SCA)
method [121] defines a statistical energy term for a site, and computes the energy change
at a second site when the first site is perturbed by retaining only the sequences with a
certain residue.1 The Explicit Likelihood of Subset Variation (ELSC) method [47] is based
on the same idea, but has the energy computations replaced by probabilities according to
hypergeometric distributions. The mutual information (MI) method [74] can be viewed
as a generalized perturbation method that considers the subsetting of all twenty kinds of
residues, and combines them by a weighted average according to their frequencies. To deal
with finite sample size effects and phylogenetic influence, the normalization options in [126]
are also provided.
1 Our implementation provides an asymmetric SCA score matrix, as well as extra summarizing statistics. Details can be found in the appendix of this chapter.
10.2.3 Independence tests
The chi-square test (cf. the OMES method [112]) and the quartets method [65] both
identify site pairs that are unlikely to be independent. The former computes the p-value
under the null hypothesis of independent sites. The latter counts the number of quartets
in the two-dimensional histogram of residue frequencies that deviate considerably from the
expectation.
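A minimal sketch of a chi-square independence test for one site pair is given below; it is in the spirit of, but not identical to, the published methods, and the residue-counting scheme is an assumption for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

def site_pair_pvalue(col_i, col_j):
    """Chi-square test of independence between two alignment columns.
    Small p-values flag site pairs that are unlikely to be independent."""
    residues_i = sorted(set(col_i))
    residues_j = sorted(set(col_j))
    table = np.zeros((len(residues_i), len(residues_j)))
    for a, b in zip(col_i, col_j):          # two-dimensional residue-frequency histogram
        table[residues_i.index(a), residues_j.index(b)] += 1
    _, p_value, _, _ = chi2_contingency(table)
    return p_value

# toy usage with two strongly coupled columns
print(site_pair_pvalue(list("AAAAGGGG"), list("LLLLVVVV")))
```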
10.3 Preprocessing options
To improve the sensitivity and specificity of the functions, options are provided for
preprocessing sequences, sites and site pairs.
10.3.1 Sequence filtering and weighting
Sequences that contain too many gapped positions or are too similar to others in the
MSA (which might cause sites to appear coevolving) can be removed by specifying the
gap and similarity thresholds respectively. A minimum number of sequences can also be
specified to avoid small sample size effects.
A sequence weighting scheme based on the topology of the phylogenetic tree [71] and
one based on a Markov random walk are provided. Both schemes down-weigh sequences that
are very similar to others in the MSA.
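As an illustration only, the following sketch down-weights sequences by the number of near-identical sequences in the MSA; the server's actual schemes (phylogenetic-tree topology and Markov random walk) are different, and the identity threshold here is an assumed value.

```python
import numpy as np

def identity_downweighting(seqs, threshold=0.8):
    """Down-weight sequences that are very similar to others in the MSA."""
    n = len(seqs)
    weights = np.ones(n)
    for k in range(n):
        # number of sequences (including itself) within the identity threshold
        similar = sum(
            np.mean([a == b for a, b in zip(seqs[k], seqs[l])]) >= threshold
            for l in range(n)
        )
        weights[k] = 1.0 / similar
    return weights / weights.sum()

print(identity_downweighting(["ACDEF", "ACDEF", "ACDEY", "GHIKL"]))
```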
10.3.2 Site filtering
After sequence filtering, sites that contain too many gaps or are too conserved can
be discarded. The former are likely non-informative, while the latter may artificially inflate
some coevolution scores.
10.3.3 Site pair filtering
Sites that are close in the primary sequence may produce trivial coevolution signals
that hide other more unexpected coevolution events. Such site pairs can be filtered by
specifying the minimum sequence separation. It has also been observed that insertions/
deletions of multiple residues may create artificial coevolution signals [142]. An option is
provided for filtering site pairs that participate in the same gaps in too many sequences.
10.3.4 Other options
Grouping similar residues into a smaller alphabet may increase the sensitivity [147].
Our system provides two residue groupings proposed in the literature [56, 81]. It has
also been observed that gaps might give important coevolution signals [142]. An option is
provided for treating gaps as noise or as the twenty-first residue when computing coevolution
scores.
10.4 Scores analysis
In some proteins coevolving residues tend to be close to each other in the 3D struc-
ture [47, 74]. This suggests that the instability created by the mutation of a residue may
be (partially) compensated for by a corresponding mutation of a close residue. Coevolution
signals may thus convey some information about the protein structure. For instance it is
interesting to study how well the coevolution scores predict the residue contact map [85].
Our system provides functions for plotting and analyzing the coevolution scores against
inter-residue distances, and standard machine-learning techniques (e.g. ROC curve) for
evaluating the effectiveness of the various coevolution functions in predicting interacting
residues. A shuffling scheme for evaluating the significance of the scores is also provided in
the program package for running locally.
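A small sketch of such an evaluation, labelling a residue pair as a contact when its distance falls below an assumed 8 Å cutoff (the cutoff and function names are illustrative choices, not the server's code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def contact_prediction_auc(scores, distances, contact_cutoff=8.0):
    """Evaluate how well coevolution scores predict residue contacts.

    scores    : coevolution score for each site pair
    distances : inter-residue distance (in angstroms) for the same pairs,
                e.g. taken from a PDB structure"""
    labels = np.asarray(distances) < contact_cutoff   # contact / non-contact labels
    return roc_auc_score(labels, np.asarray(scores))

# toy example: higher scores loosely track shorter distances
print(contact_prediction_auc([0.9, 0.7, 0.2, 0.1], [5.0, 7.5, 12.0, 20.0]))
```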
10.5 Example
We provide a worked example of our system in operation on the web site, which il-
lustrates coevolution in the transmembrane protein bacteriorhodopsin due to physically
constrained residues not adjacent in the primary sequence. The example can be easily
loaded by clicking the corresponding link on the main page. Running the example will
compute the coevolution scores between site pairs separated by at least 3 residues. The
scatterplot for coevolution scores against inter-residue distances generated using a known
PDB structure (Figure 10.1) shows that residue pairs receiving high scores do tend to be
closer in the crystal structure.
Due to the intensive computation involved in the score calculations, currently only one
scoring function is allowed to be used each time. Anyone interested in performing large-scale
comparisons can download the Java programs from the web site and run locally on most
platforms (Windows, Macintosh, Linux, UNIX, etc.). Detailed installation instructions are
provided on the web site.
10.6 Discussion
Although the scatterplot in Figure 10.1, and other studies in the literature, have sug-
gested some relationships between coevolution and physical constraints, to what extent
coevolution scores could help us understand physical structures remains unclear. We hope
the current application can serve as a neutral tool for further exploration in this area.
The current system focuses on functions that do not assume any mutation models.
Other functions, such as the likelihood method by Pollock et al. [147] and the Bayesian
mutational mapping method [49] may be added in a later version.
Coevolution signals have been used in recent studies to predict sequence regions in-
volved in protein-protein interactions with different levels of success [144, 85]. We plan
on extending the system to include inter-protein residue coevolution in the next phase of
development.
10.7 Appendix: our implementation of the SCA method
10.7.1 Introduction
The Statistical Coupling Analysis (SCA) method is one of the earliest and most popular
methods for measuring the coevolution of pairs of sites. It was first described in Lockless and
Ranganathan [121]. We based our implementation of the SCA method on the description
in this article, as well as the description in Suel et al. [180], and its web supplement at
http://www.hhmi.swmed.edu/Labs/rr/SCA.html. In the following we will call them “the
reference sources”. We have also referenced the Matlab software of the original authors
(SCA version 1.5) for some implementation details.
Since not all the algorithmic details are given in the reference sources, and we need to fit
our implementation into the overall software framework, we have made a number of design
choices. We have made our best effort to keep the choices reasonable and close to the
original definitions in the reference sources. Yet we have to stress that our implementation
is not completely the same as the one of the original authors, and users of our system should
be aware of the details of our implementation, which we describe below.
We have also referenced Dekker et al. [47] since the authors also implemented the SCA
method and made some design choices. However, our choices are not exactly the same as
theirs.
We have discussed our design with some of the SCA inventors (Rama Ranganathan
and William Russ). We follow their suggestion to produce a non-square, non-symmetric
SCA matrix as was done in the original SCA papers (based on the details described below,
which are close to, but not completely the same as their SCA software), with each row