26. Lecture WS 2003/04
Bioinformatics III
Network Wrap-Up
No lecture Tuesday next week (10.2.)
Questions about lectures 1-12 on 12.2.
Questions about lectures 13-25 on 17.2.
Questions about assignments on 17.2.
Today:
- review network topologies (Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004))
- review some findings of previous lectures on networks
Jansen et al. Science 302, 449 (2003)
The degree distribution, P(k), gives the probability that a selected node has exactly k links. P(k) is obtained by counting the number of nodes N(k) with k = 1, 2, ... links and dividing by the total number of nodes N. The degree distribution allows us to distinguish between different classes of networks. For example, a peaked degree distribution, as seen in a random network, indicates that the system has a characteristic degree and that there are no highly connected nodes (which are also known as hubs). By contrast, a power-law degree distribution indicates that a few hubs hold together numerous small nodes.
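The counting rule P(k) = N(k)/N can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the edge list and node names are invented, and N is taken to be the number of nodes that have at least one link.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: P(k)} for an undirected graph given as (u, v) pairs."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_nodes = len(degree)              # N: nodes with at least one link (assumption)
    counts = Counter(degree.values())  # N(k): number of nodes with exactly k links
    return {k: n_k / n_nodes for k, n_k in sorted(counts.items())}

# Hypothetical toy graph: one hub linked to four nodes, plus one extra edge.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]
p_k = degree_distribution(edges)
print(p_k)
```

In this toy graph two nodes have one link, two have two links and the hub has four, so the distribution has entries at k = 1, 2 and 4 only.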
Origin of scale-free topology and hubs in biological networks
The origin of the scale-free topology in complex networks can be reduced to two basic mechanisms: growth and preferential attachment. Growth means that the network emerges through the subsequent addition of new nodes, such as the new red node that is added to the network that is shown in part a. Preferential attachment means that new nodes prefer to link to more connected nodes. For example, the probability that the red node will connect to node 1 is twice as large as connecting to node 2, as the degree of node 1 (k1 = 4) is twice the degree of node 2 (k2 = 2). Growth and preferential attachment generate hubs through a 'rich-gets-richer' mechanism: the more connected a node is, the more likely it is that new nodes will link to it, which allows the highly connected nodes to acquire new links faster than their less connected peers.
In protein interaction networks, scale-free topology seems to have its origin in gene duplication. Part b shows a small protein interaction network (blue) and the genes that encode the proteins (green). When cells divide, occasionally one or several genes are copied twice into the offspring's genome (illustrated by the green and red circles). This induces growth in the protein interaction network because now we have an extra gene that encodes a new protein (red circle). The new protein has the same structure as the old one, so they both interact with the same proteins. Ultimately, the proteins that interacted with the original duplicated protein will each gain a new interaction to the new protein. Therefore proteins with a large number of interactions tend to gain links more often, as it is more likely that they interact with the protein that has been duplicated. This is a mechanism that generates preferential attachment in cellular networks. Indeed, in the example that is shown in part b it does not matter which gene is duplicated, the most connected central protein (hub) gains one interaction. In contrast, the square, which has only one link, gains a new link only if the hub is duplicated.
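Growth with preferential attachment can be sketched as follows. This is an illustrative toy, not the model from the cited review: each new node attaches with a single link, the seed network is two linked nodes, and the degrees of nodes 3 and 4 in the example are invented (only k1 = 4 and k2 = 2 come from the text).

```python
import random

def attachment_probabilities(degrees):
    """Probability that a new node links to each existing node (proportional to degree)."""
    total = sum(degrees.values())
    return {node: k / total for node, k in degrees.items()}

def grow(n_new, seed=0):
    """Add n_new nodes one at a time; each attaches to one existing node."""
    rng = random.Random(seed)
    degrees = {0: 1, 1: 1}                      # assumed seed: two linked nodes
    for new in range(2, 2 + n_new):
        nodes = list(degrees)
        weights = [degrees[n] for n in nodes]   # rich-gets-richer weighting
        target = rng.choices(nodes, weights=weights)[0]
        degrees[target] += 1
        degrees[new] = 1
    return degrees

# Node 1 (k=4) versus node 2 (k=2); nodes 3 and 4 are hypothetical fillers.
probs = attachment_probabilities({1: 4, 2: 2, 3: 1, 4: 1})
print(probs[1] / probs[2])

final = grow(100)
print(max(final.values()))   # degree of the largest hub in this run
```

Every added node contributes exactly one link, so the degree sum grows by two per step, while the weighting concentrates new links on nodes that are already well connected.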
Hierarchical networks
To account for the coexistence of modularity, local clustering and scale-free topology in many real systems, it has to be assumed that clusters combine in an iterative manner, generating a hierarchical network. The starting point of this construction is a small cluster of four densely linked nodes (see the four central nodes in Ca). Next, three replicas of this module are generated and the three external nodes of the replicated clusters are connected to the central node of the old cluster, which produces a large 16-node module. Three replicas of this 16-node module are then generated and the 16 peripheral nodes are connected to the central node of the old module, which produces a new module of 64 nodes. The hierarchical network model seamlessly integrates a scale-free topology with an inherent modular structure by generating a network that has a power-law degree distribution with degree exponent $\gamma = 1 + \ln 4/\ln 3 \approx 2.26$ (see Cb) and a large, system-size independent average clustering coefficient <C> ~ 0.6.
The most important signature of hierarchical modularity is the scaling of the clustering coefficient, which follows $C(k) \sim k^{-1}$, a straight line of slope -1 on a log-log plot (see Cc). A hierarchical architecture implies that sparsely connected nodes are part of highly clustered areas, with communication between the different highly clustered neighbourhoods being maintained by a few hubs (see Ca).
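The numbers quoted for the hierarchical model are easy to check. The short sketch below only reproduces the arithmetic (module sizes 4, 16, 64 and the degree exponent); it does not build the network itself, and the variable names are illustrative.

```python
from math import log

# Each step adds three replicas to the existing module, so the node count
# is multiplied by four: 4, 16, 64, ...
sizes = [4 * 4**step for step in range(3)]

# Degree exponent quoted for the hierarchical model.
gamma = 1 + log(4) / log(3)

print(sizes)
print(round(gamma, 2))
```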
Reminder A few remarks on the past lectures ...
V14: Prediction of P-P interaction from correlated mutations
Results obtained by i2h in a set of 14 two-domain proteins of known structure (= proteins with two interacting domains). Treat the 2 domains as different proteins.
A: Interaction index for the 133 pairs with 11 or more sequences in common. The true positive hits are highlighted with filled squares.
B: Representation of i2h results, reminiscent of those obtained in the experimental yeast two-hybrid system. The diameter of the black circles is proportional to the interaction index; true pairs are highlighted with gray squares. Empty spaces correspond to those cases in which the i2h system could not be applied, because they contained <11 sequences from different species in common for the two domains.
In most cases, i2h scored the correct pair of protein domains above all other possible interactions.
Pazos, Valencia, Proteins 47, 219 (2002)
V14: Co-localization of interaction partners
Use localization data to assess the
quality of prediction because two
predicted interacting partners
sharing the same subcellular
location are more likely to form a
true interaction.
Comparison of colocalization index
(defined as the ratio of the number
of protein pairs in which both
partners have the same subcellular
localization to the number of
protein pairs where both partners
have any sub-cellular localization
annotation).
Lu, ..., Skolnick, Genome Res 13, 1146 (2003)
Multithreading predictions (MTA) are
less reliable than high-confidence
interactions, but score quite well amongst
predictions + HTS screens.
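The colocalization index defined on this slide can be computed directly from a list of predicted pairs and a localization annotation. The sketch below assumes a single compartment per protein (real annotations may list several), and all protein names and compartments are invented for illustration.

```python
def colocalization_index(pairs, localization):
    """Fraction of fully annotated pairs whose partners share a compartment."""
    annotated = [(a, b) for a, b in pairs
                 if a in localization and b in localization]
    if not annotated:
        return None   # index undefined without annotated pairs
    same = sum(1 for a, b in annotated if localization[a] == localization[b])
    return same / len(annotated)

# Hypothetical data: P4 has no localization annotation.
localization = {"P1": "nucleus", "P2": "nucleus", "P3": "cytoplasm"}
pairs = [("P1", "P2"), ("P1", "P3"), ("P2", "P4")]
index = colocalization_index(pairs, localization)
print(index)
```

The pair (P2, P4) is ignored because only pairs in which both partners carry an annotation enter the denominator.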
V14: Do partners have the same function?
Proteins from different groups of
biological functions may interact with
each other.
However, the degree to which interacting
proteins are annotated to the same
functional category is a measure of
quality for predicted interactions.
Here, the predictions cluster fairly well
along the diagonal.
Lu, ..., Skolnick, Genome Res 13, 1146 (2003)
V15: Statistical significance of complexes and modules
Number of complete cliques (Q = 1) as
a function of clique size enumerated in
the network of protein interactions
(red) and in randomly rewired graphs
(blue, averaged over >1,000 graphs where the
number of interactions for each protein
is preserved).
Inset shows the same plot in log-normal
scale. Note the dramatic
enrichment in the number of cliques in
the protein-interaction graph
compared with the random graphs.
Most of these cliques are parts of
bigger complexes and modules.
Spirin, Mirny, PNAS 100, 12123 (2003)
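A random graph that preserves the number of interactions of each protein can be generated by repeated edge swaps. The following is a generic sketch of that idea, not necessarily the procedure used by Spirin and Mirny; the function names, the attempt limit and the toy edge list are assumptions.

```python
import random
from collections import Counter

def degrees(edges):
    """Number of interactions per node."""
    count = Counter()
    for u, v in edges:
        count[u] += 1
        count[v] += 1
    return count

def rewire(edges, n_swaps, seed=0):
    """Swap edge endpoints (a-b, c-d -> a-d, c-b) while keeping every degree fixed."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue                      # would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue                      # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

# Hypothetical toy network.
original = [("a", "b"), ("c", "d"), ("e", "f"), ("a", "c"), ("b", "e")]
rewired = rewire(original, n_swaps=10)
print(rewired)
```

Counting cliques in many such rewired graphs gives the null distribution against which the real network is compared.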
V15: Architecture of protein network
Fragment of the protein network. Nodes
and interactions in discovered clusters
are shown in bold. Nodes are colored by
functional categories in MIPS:
red, transcription regulation;
blue, cell-cycle/cell-fate control;
green, RNA processing; and
yellow, protein transport.
Complexes shown are the SAGA/TFIID
complex (red), the anaphase-promoting
complex (blue), and the TRAPP complex
(yellow).
Spirin, Mirny, PNAS 100, 12123 (2003)
V15: Evolution of the yeast protein interaction network
Isotemporal categories are designed
through a binary (b) coding scheme.
The b code represents the
distribution of each yeast protein's
orthologs in the universal tree of life.
Bit value 1 indicates the presence of
at least one orthologous hit for a
yeast protein in a corresponding
group of genomes, and bit value 0
indicates the absence of any
orthologous hit. The presented
example is 110011 in the b format
and 51 in the d format. Orthologous
identifications are based on COGs at
NCBI and in von Mering et al. (2002).
Qin et al. PNAS 100, 12820 (2003)
Previously, phylogenetic profiles were used to detect protein interaction partners. Here, phylogenetic profiles are used to detect modules.
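The relation between the b format and the d format is a plain binary-to-decimal conversion, as the example 110011 = 51 shows. A minimal sketch follows; the function names are illustrative and the default of six genome groups is inferred from the six bits of that example.

```python
def b_to_d(b_code):
    """Convert a binary ortholog profile (string of 0/1) to its decimal code."""
    return int(b_code, 2)

def d_to_b(d_code, n_groups=6):
    """Convert a decimal code back to a zero-padded binary profile."""
    return format(d_code, f"0{n_groups}b")

print(b_to_d("110011"))   # 51, as on the slide
print(d_to_b(51))
```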
V15: Evolution of the yeast protein interaction network
Interaction patterns.
Z scores for all possible interactions
of the isotemporal categories in the
protein interaction network.
For categories i and j,
$Z_{i,j} = (F_{i,j}^{obs} - F_{i,j}^{mean}) / \sigma_{i,j}$
where $F_{i,j}^{obs}$ is the observed number
of interactions, and $F_{i,j}^{mean}$ and $\sigma_{i,j}$ are
the average number of interactions
and the SD, respectively, in 10,000
MS02 null models.
Qin et al. PNAS 100, 12820 (2003)
The diagonal distribution of large positive Z scores indicates that yeast proteins
tend to interact with proteins from the same or closely related isotemporal
categories.
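Given the interaction counts from a set of null-model networks, the Z score for one pair of categories is a one-line computation. The sketch below does not generate the MS02 null models; the five null counts and the observed count are invented, and the sample standard deviation is used as an assumption about which SD is meant.

```python
from statistics import mean, stdev

def z_score(observed, null_counts):
    """Z = (F_obs - F_mean) / sigma, with mean and SD taken over null models."""
    mu = mean(null_counts)
    sigma = stdev(null_counts)   # sample SD (assumption)
    return (observed - mu) / sigma

# Hypothetical counts of interactions between categories i and j.
null_counts = [10, 12, 11, 9, 13]
z = z_score(20, null_counts)
print(round(z, 2))
```

A large positive value means that the two categories interact more often than expected from the null models.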
V19: Flux balancing
Any chemical reaction requires mass conservation.
Therefore one may analyze metabolic systems by requiring mass conservation.
Only required: knowledge about stoichiometry of metabolic pathways and
metabolic demands.
For each metabolite $X_i$:
$v_i = \frac{dX_i}{dt} = V_{synthesized} - V_{degraded} - (V_{used} \pm V_{transported})$
Under steady-state conditions, the mass balance constraints in a metabolic
network can be represented mathematically by the matrix equation:
S · v = 0
where the matrix S is the m × n stoichiometric matrix,
m = the number of metabolites and n = the number of reactions in the network.
The vector v represents all fluxes in the metabolic network, including the internal
fluxes, transport fluxes and the growth flux.
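The steady-state condition S · v = 0 can be checked for a concrete flux vector with a plain matrix-vector product. The three-reaction chain below (uptake of A, conversion A → B, export of B) is a made-up example, not a network from the lecture.

```python
def mat_vec(S, v):
    """Multiply the stoichiometric matrix S (list of rows) by the flux vector v."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]

def is_steady_state(S, v, tol=1e-9):
    """True if every metabolite is produced and consumed at the same rate."""
    return all(abs(x) <= tol for x in mat_vec(S, v))

# Rows: metabolites A, B. Columns: uptake of A, A -> B, export of B.
S = [
    [1, -1,  0],   # A
    [0,  1, -1],   # B
]
balanced = [2.0, 2.0, 2.0]
unbalanced = [2.0, 1.0, 1.0]

print(is_steady_state(S, balanced))
print(mat_vec(S, unbalanced))   # non-zero entry: A accumulates
```

Here m = 2 and n = 3, so one degree of freedom remains: any flux vector with three equal entries satisfies the balance.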
V19: Flux balance analysis
Since the number of metabolites is generally smaller than the number of reactions
(m < n), the flux-balance equation is typically underdetermined.
Therefore there are generally multiple feasible flux distributions that satisfy the mass
balance constraints.
The set of solutions is confined to the nullspace of matrix S.
To find the „true“ biological flux in cells (e.g. Heinzle, Huber, UdS) one needs
additional (experimental) information,
or one may impose constraints
$\alpha_i \le v_i \le \beta_i$
on the magnitude of each individual metabolic flux.
The intersection of the nullspace and the region defined by those linear inequalities
defines a region in flux space = the feasible set of fluxes.
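The feasible set can be made concrete with a toy network that is small enough to enumerate. The sketch below uses one metabolite and three reactions (uptake, biomass, byproduct), invented bounds, and a coarse integer grid instead of a linear-programming solver; picking the point with the largest biomass flux is added only to illustrate how one solution might be singled out.

```python
from itertools import product

# One metabolite A; columns: uptake, A -> biomass, A -> byproduct (hypothetical).
S = [[1, -1, -1]]
lower = [0, 0, 0]     # assumed lower bounds on each flux
upper = [10, 8, 5]    # assumed upper bounds on each flux

def in_nullspace(S, v, tol=1e-9):
    """True if S . v = 0 within tolerance."""
    return all(abs(sum(s * x for s, x in zip(row, v))) <= tol for row in S)

def within_bounds(v, lower, upper):
    """True if every flux respects its lower and upper bound."""
    return all(lo <= x <= hi for lo, x, hi in zip(lower, v, upper))

# Enumerate integer flux vectors and keep those in the feasible set.
grid = product(range(11), range(9), range(6))
feasible = [v for v in grid
            if in_nullspace(S, v) and within_bounds(v, lower, upper)]

best = max(feasible, key=lambda v: v[1])   # largest biomass flux
print(len(feasible), best)
```

With m = 1 and n = 3 the nullspace is two-dimensional, and the bounds cut out a finite region of it. A real analysis would replace the grid by a linear program.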
V19: Rerouting of metabolic fluxes
(Black) Flux distribution for the wild-type.
(Red) zwf- mutant. Biomass yield is 99% of
wild-type result.
(Blue) zwf- pnt- mutant. Biomass yield is 92% of
wild-type result. The solid lines represent
enzymes that are being used, with the
corresponding flux value noted.
Note how E. coli in silico circumvents removal of
one critical reaction (red arrow) by increasing
the flux through the alternative G6P → P6P
reaction.
Edwards & Palsson PNAS 97, 5528 (2000)
V20: Extreme Pathways
introduced into metabolic analysis by the lab of Bernhard Palsson
(Dept. of Bioengineering, UC San Diego). The publications of this lab
are available at http://gcrg.ucsd.edu/publications/index.html
Extreme pathway
technique is based
on the stoichiometric
matrix representation
of metabolic networks.
All external fluxes are
defined as pointing outwards.
Schilling, Letscher, Palsson,
J. theor. Biol. 203, 229 (2000)
V20: Feasible solution set for a metabolic reaction network
(A) The steady-state operation of the metabolic network is restricted to the region
within a cone, defined as the feasible set. The feasible set contains all flux vectors
that satisfy the physicochemical constraints. Thus, the feasible set defines the
capabilities of the metabolic network. All feasible metabolic flux distributions lie
within the feasible set, and
(B) in the limiting case, where all constraints on the metabolic network are known,
such as the enzyme kinetics and gene regulation, the feasible set may be reduced
to a single point. This single point must lie within the feasible set.
Edwards & Palsson PNAS 97, 5528 (2000)
V21: Reconfigured Network
Klamt & Stelling Trends Biotech 21, 64 (2003)
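[Figure: reconfigured network with internal metabolites A, B, C, D, P, external metabolites A(ext), B(ext), C(ext), and reactions R1, R2, R3, R4, R5, R6, R7f, R7b, R8, R9]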
3 EFMs are not systemically independent:
EFM1 = EP4 + EP5
EFM2 = EP3 + EP5
EFM4 = EP2 + EP3
V21: Relation between EPs and EFMs
What is the consequence when all exchange fluxes (and hence all
reactions in the network) are irreversible?
EFMs and EPs always coincide!
Klamt & Stelling Trends Biotech 21, 64 (2003)
V22: Correlation between genes of same metabolic pathway
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
Genes of the glycolysis pathway
(according to KEGG) were clustered
and ordered based on the correlation
in their expression profiles.
Shown here is the matrix of their
pair-wise correlations.
The cluster of highly correlated
genes (orange frame) corresponds
to genes that encode the central
glycolysis enzymes.
The linear arrangement of these
genes along the pathway is shown at
right.
Of the 46 genes assigned to the
glycolysis pathway in the KEGG
database, only 24 show a correlated
expression pattern.
In general, the coregulated genes
belong to the central pieces of
pathways.
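The matrix of pair-wise correlations shown on this slide can be computed from expression profiles with the Pearson coefficient. The sketch below uses three invented genes with four conditions each and does not perform the clustering or ordering step; profiles with constant expression would need special handling because their standard deviation is zero.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(profiles):
    """Return gene order and the full pair-wise correlation matrix."""
    genes = list(profiles)
    matrix = [[pearson(profiles[g], profiles[h]) for h in genes] for g in genes]
    return genes, matrix

# Hypothetical expression profiles over four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],   # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],   # falls as geneA rises
}
genes, matrix = correlation_matrix(profiles)
for gene, row in zip(genes, matrix):
    print(gene, [round(r, 2) for r in row])
```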
V22: Connectivity of metabolites
The connectivity of a given metabolite
is defined as the number of reactions
connecting it to other metabolites.
Shown are the distributions of
connectivity between metabolites in an
unrestricted network and in a
network where only correlated
reactions are considered.
In accordance with previous results
(Jeong et al. 2000), the connectivity
distribution between metabolites
follows a power law (log-log plot).
In contrast, when coexpression is
used as a criterion to distinguish
functional links, the connectivity
distribution becomes exponential
(log-linear plot).
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
V22: Co-expression of transporters
Transporter genes are
co-expressed with the relevant
metabolic pathways, providing
the pathways with their metabolites.
Co-expression is marked in green.
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
V25: Combination of data sets into probabilistic interactomes
(B) Combination of data sets into
probabilistic interactomes.
The 4 interaction data sets
from HT experiments were
combined into 1 PIE
(probabilistic interactome experimental).
The PIE represents a
transformation of the
individual binary-valued
interaction sets into a data
set where every protein pair
is weighted according to the
likelihood that it exists in a
complex.
Because the 4 experimental
interaction data sets contain
correlated evidence, a fully
connected Bayesian network
is used.
A "naïve" Bayesian network is used to model
the PIP (probabilistic interactome predicted) data.
These information sets hardly overlap.
Jansen et al. Science 302, 449 (2003)
V25: Static naive Bayesian Networks
The likelihood ratio L, defined as
$L(f_1 \ldots f_N) = \frac{P(f_1 \ldots f_N \mid pos)}{P(f_1 \ldots f_N \mid neg)}$
relates prior and posterior odds according to Bayes' rule:
$O_{post} = L(f_1 \ldots f_N) \, O_{prior}$
In the special case that the N features are conditionally independent
(i.e. they provide uncorrelated evidence) the Bayesian network is a so-called
"naïve" network, and L can be simplified to:
$L(f_1 \ldots f_N) = \prod_{i=1}^{N} L(f_i) = \prod_{i=1}^{N} \frac{P(f_i \mid pos)}{P(f_i \mid neg)}$
Jansen et al. Science 302, 449 (2003)
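For the naïve case, the likelihood ratio is the product of the per-feature ratios, and multiplying by the prior odds gives the posterior odds. The sketch below illustrates this with two invented features; the conditional probabilities are made up, and the prior odds of 1/600 are an assumption chosen to match the threshold of 600 that is used for Lcut.

```python
from math import prod

def naive_likelihood_ratio(p_given_pos, p_given_neg):
    """L(f1..fN) = product over features of P(f_i | pos) / P(f_i | neg)."""
    return prod(pp / pn for pp, pn in zip(p_given_pos, p_given_neg))

def posterior_odds(likelihood_ratio, prior_odds):
    """Bayes' rule in odds form: O_post = L * O_prior."""
    return likelihood_ratio * prior_odds

# Hypothetical probabilities of observing each feature value.
p_given_pos = [0.6, 0.3]
p_given_neg = [0.1, 0.15]

L = naive_likelihood_ratio(p_given_pos, p_given_neg)
O_post = posterior_odds(L, prior_odds=1 / 600)   # assumed prior odds
print(round(L, 3), round(O_post, 4))
```

Under this assumed prior, posterior odds above 1 (more likely interacting than not) require L above 600.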
V25: PIP vs. the information sources
Ratio of true to false positives (TP/FP) increases
monotonically with Lcut, confirming L as an
appropriate measure of the odds of a real
interaction.
The ratio is computed as:
$\frac{TP(L_{cut})}{FP(L_{cut})} = \frac{\sum_{L \ge L_{cut}} pos(L)}{\sum_{L \ge L_{cut}} neg(L)}$
Protein pairs with Lcut > 600 have a > 50%
chance of being in the same complex.
Jansen et al. Science 302, 449 (2003)
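The TP/FP ratio for a given cutoff can be sketched as two cumulative sums. The code below assumes that pos(L) and neg(L) are the numbers of gold-standard positive and negative pairs with likelihood ratio L, and that pairs with L at or above the cutoff are counted; the counts themselves are invented.

```python
def tp_fp_ratio(pos, neg, l_cut):
    """Sum pos(L) and neg(L) over all L >= l_cut and return TP / FP."""
    tp = sum(n for L, n in pos.items() if L >= l_cut)
    fp = sum(n for L, n in neg.items() if L >= l_cut)
    return tp / fp if fp else float("inf")

# Hypothetical counts of gold-standard pairs per likelihood ratio.
pos = {1000: 5, 700: 3, 100: 2}
neg = {1000: 1, 700: 2, 100: 50}

for l_cut in (100, 600, 1000):
    print(l_cut, round(tp_fp_ratio(pos, neg, l_cut), 3))
```

In this toy data the ratio rises as the cutoff increases, which is the behaviour the slide describes.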
V25: PIE vs. the information sources
9897 interactions are predicted from PIP and
163 from PIE.
In contrast, likelihood ratios derived from single