Statistical approaches to protein matching in Bioinformatics
Vysaul B. Nyirongo
Submitted in accordance with the requirements for the degree
of Doctor of Philosophy
The University of Leeds
Department of Statistics
January 2006
The candidate confirms that the work submitted is his own and that appropriate
credit has been given where reference has been made to the work of others. This
copy has been supplied on the understanding that it is a copyright material and that
no quotation from the thesis may be published without proper acknowledgement.
Dedication
To my father W.C. Nyirongo, the kindest
and
my mother, née Dorothy Nyirenda, the finest.
Acknowledgements
I am deeply thankful to my supervisor, Prof. K.V. Mardia, for his guidance, discussions, helpful comments and inspiring interest in this research. I am also deeply
indebted to Prof. P.J. Green for kindly providing the source code for Bayesian
alignment using hierarchical models.
I wish to thank Dr. C. Xu for his many helpful comments on spatial point processes and for kindly allowing me to use his program for analysing spatial point processes.
I am also grateful to Dr. D.R. Westhead and Dr. N.D. Gold for their many helpful
discussions, comments and for the access to functional sites database (SITESDB).
Finally, but not least, I would like to express my gratitude for financial support
from Universities UK, University of Leeds and the Department of Statistics at Uni-
versity of Leeds. My research studies were financed by Universities UK through
ORS scholarship and University of Leeds through Tetley and Lupton scholarship.
During this research, I was financially supported by the Department of Statistics,
University of Leeds.
Abstract
Structural genomics projects aim to provide structural data or accurate models
for uncharacterised proteins (Brenner and Levitt, 2000). The motivation for these
initiatives is the knowledge that similarity between protein structures can provide
evidence of common evolutionary ancestry (and hence possible functional similarity)
even where sequence similarity is undetectable, because structure is conserved for longer in evolution than sequence (Chothia and Lesk, 1986). Recent advances in
high-throughput protocols for structural determination of structural genomics target
proteins have produced an explosion in the volume of structural data prior to knowledge
of protein biochemical function. With these advances has come the need to rapidly
predict functions for proteins based on structure.
We present statistical matching of functional sites. In particular, we use the EM algorithm in a mixture model formulation to solve for correspondence and
alignment in matching two configurations of functional sites. We extend the EM
algorithm of Kent et al. (2004) to incorporate concomitant information in matching
functional sites. We also extend Green and Mardia (2006) to matching configura-
tions of coupled points using hierarchical models for Bayesian alignment.
We also present goodness-of-fit statistics for matching two functional sites un-
der the Gaussian error model. We consider the Procrustes statistic for matching of
forms. The Procrustes statistic is related to RMSD except for a divisor. P-values
are used to indicate goodness-of-fit. Related but harder is the problem of finding
the distribution for the minimum Procrustes statistic when the points are unlabelled. First we discuss this problem and its inherent difficulty. For illustrative purposes, we use Gaussian configurations on a line.
Key words: active site, binding site, Bayesian, Bioinformatics, correspondence
and alignment, EM algorithm, functional site, hierarchical models, Markov chain
Monte Carlo, mixture model, Procrustes, Root mean square deviation.
amino acid type, l = 1, . . . , 20. Note that typically Ni = 200 − 2000.
An amino acid residue is a set of atoms (and covalent bonds). This atom set can be partitioned into backbone atoms B (the same for every residue type) and side-chain atoms Rl (differing between residue types), so that Pl = {B, Rl}. Figure 1.1 shows two amino acids (si and si+1) joined by a "peptide bond". The peptide chain may be known only at the sequence level, where the identities sij of the amino acid residues are known but there is no information
Chapter 1. Introduction and Literature Review 5
Figure 1.1: Peptide bond joining two amino acids.
about three-dimensional structure. This is commonly the type of information that
emerges from genome sequencing projects. In a minority of cases, three dimensional
structure information may be available in the form of x, y and z Cartesian coordi-
nates for all the protein atoms. Information about the association of peptide chains
into complete proteins (quaternary structure) may be available in some cases.
The amino acids can be labelled by the side chain Ri, which takes one of 20 types. For example, with Ri = H we have glycine, and with Ri+1 = CH3 we have alanine. These are sometimes also referred to as peptide units. Each peptide unit can only rotate around the N−Cα and Cα−C′ bonds; the corresponding angles φ and ψ are also of interest.
Amino acids have different physico-chemical properties and can be grouped according to shared properties, e.g. hydrophobic or hydrophilic (see Table 5.4 for one possible grouping). Hydrophobic amino acids are those with side-chains that do not like to reside in an aqueous (i.e. water) environment. For this reason, these amino acids are generally buried within the hydrophobic core of a protein. On the other hand, hydrophilic amino acids tend to interact with the aqueous environment and are predominantly found on the exterior surfaces of proteins or in the reactive centres. This property is particularly important for transport proteins. These proteins are often globular structures and are generally tightly packed (compact), with hydrophilic (polar) side chains on the outside to enhance their solubility in water. They typically have hydrophobic (non-polar) side chains folded to the inside to keep water from getting in and unfolding them. In section 3.1.2 we take the hydrophobic/hydrophilic properties of the side chains into account in order to simulate globular, compact structures. We also take physico-chemical properties into account when matching functional parts of proteins (see section 1.1.4) in Chapters 5, 6 and 7.
The Swiss-Prot data bank contained sequence data for more than 212,425 proteins as of 21st March 2006. Protein 3-dimensional structures derived from X-ray diffraction and neutron-diffraction studies of crystallised proteins are housed at the Protein Data Bank (PDB), which held about 35,813 structures as of 28th March 2006; these can be accessed at http://www.rcsb.org.
1.1.4 Functional Sites
Although proteins are large molecules, in many cases only a small part of the structure (e.g. in Figure 1.2), the functional site, is functional; the rest exists to create and fix the spatial relationships among the amino acids of the functional site. The term functional site refers to both active sites and binding sites. An active site is a part of a protein where chemical reactions occur, while a binding site refers to a region which binds specific ligands (smaller molecules). For example, Figure 1.2 shows a functional site in the 5-aminolaevulinate dehydratase protein structure.
1.1.5 The SITESDB Database
In this thesis all functional sites were taken from a database of known sites (SITESDB)
(Gold, 2003). SITESDB had 91,441 entries (functional sites) as of 28th March, 2006.
The median and mean numbers of amino acids per site were 10 and 16 respectively. The lower and upper quartiles were 10 and 19, and the range was from 1 to 120.
SITESDB entries were automatically formed from the PDB (Berman et al., 2000)
by locating the local protein environment (amino acids within 5Å) around bound ligands (identified by PDB HETATM records) and author-annotated active sites (identified by PDB SITE records).

Figure 1.2: Functional site in 5-aminolaevulinate dehydratase protein structure.

A protein may contain multiple functional sites
so unique identifiers for SITESDB entries were generated from the four letter PDB
identifier with an extra integer to distinguish sites from the same protein. For
example, the identifiers 1hdx 0 and 1hdx 1 were separate sites from the protein
with PDB identifier 1hdx.
The automatic extraction of sites results in multiple and incomplete representa-
tions of functional sites containing more than one bound ligand, or sites that are
both annotated with SITE records and contain bound ligands. In these cases a
better biochemical description of the site was obtained by merging component sites
without duplication of their amino acid contents. Sites were merged if ligand atoms
occurred within 5Å of atoms in a second ligand (cf. the 5-5 rule in Park et al., 2001;
Dafas et al., 2004; Gong et al., 2005). In the absence of bound ligands, sites were
merged if they were found to contain common amino acid residues.
Availability
SITESDB is accessible at http://www.bioinformatics.leeds.ac.uk (hosted by
the Institute of Molecular and Cellular Biology, University of Leeds). The database
currently contains more than 90,000 functional sites.
1.1.6 Structure comparisons
The 3-dimensional structure of a protein is very important in understanding how proteins function, as proteins with similar 3-dimensional structures are likely to have related functions. Comparing 3-dimensional protein structures is therefore very important: a newly determined protein structure that is similar in 3-dimensional structure to a protein with known function is likely to have a similar function, which facilitates function prediction. Another useful application is protein homology detection, where structure comparison can complement the sequence similarity commonly used for homology modelling. Homology refers to proteins having descended from a common ancestor. The importance of 3-dimensional comparisons cannot be overemphasised, as structures are more conserved than amino acid sequences in homologous proteins.
Although overall protein structure comparisons are useful in some applications (see the literature review in section 1.2), they sometimes have difficulty with proteins that share similar structures and are clearly related in evolution, yet have different functions. The reverse case, proteins with functional similarity but differences in their structure, also presents difficulties; e.g. overall fold comparison misses the functional similarity of subtilisin and chymotrypsin (Blow et al., 1969; Wright et al., 1969). Thus, to complement fold comparisons, we consider comparing functional sites of proteins.
1.1.7 Objectives in matching protein structures
To appreciate the difficulty involved in matching protein structures or parts thereof, consider the configurations of functional site Cα atoms in Figure 1.3, from the 17β-hydroxysteroid dehydrogenase and carbonyl reductase proteins. These functional sites are related, but which and how many atoms correspond is unknown; moreover, it is not always known a priori whether functional sites are related at all. Our aim is to match the atoms of these configurations. Functional site matching has two objectives:
(a) To match the proteins geometrically so as to minimise the root mean square deviation (RMSD),
r(x, y) = [ (1/q) ∑_{i=1}^{q} ||x_i − y_i||^2 ]^{1/2}
where we have given q points for configuration {x} and the corresponding q
points for configuration {y}. The matched proteins should come as close as
possible (minimal RMSD) when configurations are superimposed on each other.
(b) The second objective is to maximise the matches of similar residues.
These objectives are often conflicting. Hence the question is how to optimise
this multi-objective matching problem.
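For concreteness, objective (a)'s quantity can be computed directly from matched coordinates; a minimal sketch (the function name is ours, not the thesis' implementation):

```python
import math

def rmsd(xs, ys):
    """r(x, y) = [ (1/q) * sum_i ||x_i - y_i||^2 ]^(1/2) over q matched pairs."""
    q = len(xs)
    total = sum(sum((a - b) ** 2 for a, b in zip(x, y)) for x, y in zip(xs, ys))
    return math.sqrt(total / q)
```

For example, a single matched pair of points at Euclidean distance 5 gives an RMSD of 5.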
a) 1a27 0 (63 atoms) b) 1cyd 0 (40 atoms)
Figure 1.3: RasMol (Sayle and Milner-White, 1995) ball representation of Cα atoms for the functional sites of 17β-hydroxysteroid dehydrogenase (1a27 0) and carbonyl reductase (1cyd 0).
1.2 Literature Review
Whole domain structural comparison methods such as CE (Shindyalov and Bourne,
1998) and DALI (Holm and Sander, 1993) and databases such as FSSP (Holm et al.,
1992), CATH (Orengo et al., 1997) and SCOP (Hubbard et al., 1997) provide valu-
able insight into the functions of newly determined proteins. However, discovery of
proteins adopting similar folds but exhibiting a variety of functions i.e. superfolds
(Orengo et al., 1994) and proteins showing similar functions without common an-
cestry (Blow et al., 1969; Wright et al., 1969) poses problems for comparisons at the
fold level. Note that SCOP hierarchical classification consists of class, (super)fold
and (super)family.
Protein function is usually carried out by relatively small parts of protein surfaces
at ligand binding or catalytic sites and hence new structural comparison methods
focus on the precise structural nature of these functional sites (Artymiuk et al., 1994;
Binkowski et al., 2003; Kinoshita et al., 2002; Kleywegt, 1999; Shulman-Peleg et al.,
2004; Stark et al., 2003b; Wallace et al., 1997). These methods are based on the idea that geometrically similar sites are likely to have similar functions: their amino acids are conserved in precise orientations in order to perform their chemistry, or their similar shapes and physico-chemical properties may be selective for similar small molecules such as substrates, inhibitors or cofactors. Hence, finding structural
similarity to functional sites of known and characterised proteins may facilitate
function prediction for newly determined protein structures even in the absence of
overall fold or sequence similarity.
Functional site comparison methods essentially fall into one of two categories.
The first category provides known templates of specific motifs of conserved amino
acids or atoms often involved in enzyme catalysis (Artymiuk et al., 1994; Kley-
wegt, 1999; Wallace et al., 1997). These are knowledge-based methods which aim
to discover new proteins with the same catalytic function. The second category
consists of similarity searching algorithms (Binkowski et al., 2003; Schmitt et al.,
2002; Shulman-Peleg et al., 2004; Stark et al., 2003b; Kinoshita et al., 1999) where
prior knowledge of motifs is not required and site similarity is assessed by how
closely the sites align and/or the proportion of overlap. Partial similarity between
sites can be detected and hence much larger sites such as ligand binding sites can
be compared. Methods addressing this problem generally represent functional sites
or functional site surfaces as mathematical graphs for graph-theoretic or geomet-
ric hashing comparisons where graph vertex positions are placed using a variety of
methods. CavBase (Schmitt et al., 2002), SiteEngine (Shulman-Peleg et al., 2004)
and PINTS (Stark et al., 2003b) for example use positions of pseudo-centres whereas
eF-site (Kinoshita et al., 1999) uses electrostatic potentials and surface curvature.
pvSoar (Binkowski et al., 2003) and SitesBase (Gold and Jackson, 2006) use alpha-
shapes and an all-atom model respectively. Recently, Green and Mardia (2006)
proposed a Bayesian hierarchical modelling approach using Cα atoms.
1.2.1 Matching and Superposition Algorithms
Finding the correspondence is intrinsically a combinatorial problem. Without geo-
metric constraints there are

∑_{q=1}^{min(m,n)} q! C(n, q) C(m, q)

ways of choosing corresponding pairs from two configurations with m and n points. However, with geometric constraints,
the solution space is tremendously reduced. Matching methods exploit geometric
constraints to solve for correspondence.
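The size of this unconstrained search space can be evaluated directly; a small sketch of the count above (the function name is ours):

```python
from math import comb, factorial

def n_correspondences(m, n):
    """Sum over q of q! * C(n, q) * C(m, q): the number of ways of choosing
    q corresponding pairs from configurations of m and n points."""
    return sum(factorial(q) * comb(n, q) * comb(m, q)
               for q in range(1, min(m, n) + 1))
```

Even for two small sites with m = n = 10 the count already exceeds 2.3 × 10^8, which is why geometric constraints are essential.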
To show how geometric constraints make the correspondence problem feasible,
Kuhl et al. (1984) presented a naive brute force approach for matching a molecule
to a functional site.
A naive brute force method (Kuhl et al., 1984)
With the requirement that matching pairs are geometrically as close as possible, all
degrees of freedom are expended when three pairs of matches are made. Thus after
making three matches, simply check the coincidence of other points. Suppose two
configurations are {xj} and {µi}, j = 1 . . . n and i = 1 . . .m. Kuhl et al. (1984) in
their “DOCK” algorithm proceed as follows:
(a) For each unique set of three pairings ({i1, j1},{i2, j2},{i3, j3}) of points from two
configurations:
i. Choose the first pair to superpose by translation b.
ii. Find rotation A1 to bring the second pair into optimal superposition.
iii. Find rotation A2 to superpose the third pair.
iv. We thus have x_{j_l} ≈ A1A2µ_{i_l} + b for l = 1, 2, 3. Matching pairs are the coinciding points (closest and within a defined distance of each other) of {x} and {A1A2µ + b}.
v. Calculate the number of matched pairs and the RMSD.
(b) The solution is the combination which gives the largest number of matches. In
the case of several solutions with the same number of matching pairs, the one
with smallest Procrustes distance may be taken.
The Kuhl et al. (1984) algorithm goes through mn(m − 1)(n − 1)(m − 2)(n − 2) combinations, i.e. mn ways to choose b, (m − 1)(n − 1) ways to choose A1, and (m − 2)(n − 2) ways to choose A2. Some of these combinations are unnecessary, because the ordering of the three pairings is not important; only mn(m − 1)(n − 1)(m − 2)(n − 2)/3! combinations are needed.
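The combination counts above can be checked numerically; a small sketch (names ours):

```python
from math import factorial

def dock_combinations(m, n, ordered=True):
    """Combinations examined by the brute force of Kuhl et al. (1984):
    mn choices for the translation pair, (m-1)(n-1) for the second pair
    (rotation A1) and (m-2)(n-2) for the third (rotation A2); dividing
    by 3! removes the irrelevant ordering of the three pairings."""
    count = m * n * (m - 1) * (n - 1) * (m - 2) * (n - 2)
    return count if ordered else count // factorial(3)
```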
There are a few more efficient approaches in the literature for solving the matching and superposition problem in Bioinformatics applications. These efficient matching methods fall mainly into two categories:
(a) Algorithms iterating between solving for alignment and correspondence. Alignment and correspondence support each other, making the problem solvable in reasonable time. These algorithms include the EM algorithm considered by Kent et al. (2004), which is presented in section 5.1. Wu et al. (1998) also use an iterative algorithm.
Also in this category is the approach of Green and Mardia (2006), who take a Bayesian approach in which they formulate a joint model for alignment and correspondence. The conditional models for alignment and correspondence are updated in turn. This framework is presented in section 6.1.
(b) Combinatorial algorithms which utilise inter-point distance constraints. These
distance-based methods use graph theoretic algorithms to solve for correspon-
dence. Kuhl et al. (1984) proposed to use a graph algorithm of Bron and Ker-
bosch (1973) for matching a molecule binding to a functional site. Gold (2003)
implemented a parallelised database search tool on a Beowulf system, using ei-
ther Bron and Kerbosch (1973) or Carraghan and Pardalos (1990) graph clique
detecting algorithms to match functional sites.
Below we briefly describe the graph-theoretic approach taken by Gold (2003), and for the iterative category we briefly describe the approach of Wu et al. (1998).
Graph method (Gold, 2003)
The principles of graph theory have been applied to matching biomolecular configurations for some time, e.g. Kuhl et al. (1984) and Artymiuk et al. (1994). Consider points as representing amino acid positions. These points may have attributes (concomitant information) representing amino acid groups or types. We wish to match two configurations of points {xj} and {yk} for j = 1, 2, . . . , m and k = 1, 2, . . . , n.
• Each configuration is represented by a mathematical graph.
• Vertices are placed at point positions.
• Each vertex is connected by an edge to every other vertex in the same graph.
• Each edge is labelled with the inter-point distance.
A search for the maximum similarity between two graphs G1 and G2, representing configurations {x} and {y} respectively, corresponds to finding the maximal common subgraph, or a clique within the vertex product graph of G1 and G2 (Hv = G1 ◦v G2). The vertex product graph is defined as follows:
Definition 1.2.1. Let V1 and V2 be the sets of vertices of G1 and G2 respectively. The vertex product graph Hv = G1 ◦v G2 has vertex set VH ⊆ V1 × V2 consisting of the pairs (xj, yk), with xj ∈ V1 and yk ∈ V2, that have the same attribute. An edge between two vertices vh = (xj, yk) and vh′ = (xj′, yk′) ∈ VH exists, for j ≠ j′ and k ≠ k′, if the absolute difference between the distances |xj − xj′| and |yk − yk′| is less than some threshold, say δ = 1.5Å.
Graph matches based on inter-point distances are not necessarily superimposable
(e.g. mirror image sites). Subsequently, a Procrustes algorithm (Kabsch, 1978) is
used to check that matched configurations are geometrically superimposable. The
Procrustes algorithm minimises the size-and-shape squared (least squares) distance
between two structures, say X1 and X2. The size-and-shape squared distance is

d_S^2(X1, X2) = inf_{A ∈ SO(d), b} ||X2 − A X1 − b||^2.

Here d = 3, SO(d) denotes the set of all d × d rotation matrices (orthogonal matrices with determinant equal to +1), and b is the translation vector.
Basically the algorithm is a three step process:
(a) Construct a vertex product graph.
(b) Find a maximal clique within the product graph.
(c) Check the 3-dimensional superimposition using Kabsch (1978).
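Step (c) can be sketched with NumPy via the standard SVD-based least-squares superposition in the spirit of Kabsch (1978); this is an illustrative sketch, not the implementation used with SITESDB:

```python
import numpy as np

def kabsch_rmsd(X1, X2):
    """Minimised RMSD after optimally rotating and translating X1 onto X2.
    X1, X2 are (q x d) arrays of matched coordinates (d = 3 for proteins)."""
    A1 = X1 - X1.mean(axis=0)              # remove translation
    A2 = X2 - X2.mean(axis=0)
    U, S, Vt = np.linalg.svd(A1.T @ A2)
    sign = np.sign(np.linalg.det(U @ Vt))  # force det(R) = +1 (no reflection)
    D = np.diag([1.0] * (X1.shape[1] - 1) + [sign])
    R = U @ D @ Vt                         # optimal rotation
    diff = A2 - A1 @ R
    return np.sqrt((diff ** 2).sum() / len(X1))
```

The determinant correction is what makes this a proper rotation rather than a possible reflection, which matters precisely for the mirror-image sites mentioned above.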
In the least restrictive case all vertices (points) are assumed to have the same
attribute and hence matching can occur between any two points and is only depen-
dent on inter-point distances. Alternatively points can be labelled with colours (con-
comitant information) to restrict matching points with the same colour i.e. colour is
treated as an attribute. Although concomitant information can be incorporated as
attributes, this approach is very rigid. When matching functional sites, Gold (2003) takes the amino acid type into account by introducing a score (presented in section 3.2.2). The Bron and Kerbosch (1973) algorithm finds all common subgraphs in addition to the clique, and concomitant information can be used to score all the common subgraphs in order to give preference to the solution with the highest score. Gold (2003) scores all complete subgraphs found by the Bron and Kerbosch (1973) algorithm and takes the one with the maximal score. However, the algorithm of Carraghan and Pardalos (1990) finds just the clique, so it is not possible to use concomitant information with it. Gold (2003) uses the algorithm of Carraghan and Pardalos (1990) because it is faster; optionally the algorithm of Bron and Kerbosch (1973) can be used in order to account for concomitant information.
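To illustrate the three-step process on the simplest possible data (1-dimensional points, all sharing the same attribute), here is a hypothetical sketch that builds the vertex product graph and finds a largest clique with a plain Bron-Kerbosch enumeration; all names are ours:

```python
from itertools import combinations

def product_graph(xs, ys, delta=1.5):
    """Adjacency of the vertex product graph for two 1-D configurations:
    vertices are pairs (j, k); an edge requires distinct j, k indices and
    | |x_j - x_j'| - |y_k - y_k'| | < delta."""
    V = [(j, k) for j in range(len(xs)) for k in range(len(ys))]
    adj = {v: set() for v in V}
    for (j, k), (j2, k2) in combinations(V, 2):
        if j != j2 and k != k2 and \
           abs(abs(xs[j] - xs[j2]) - abs(ys[k] - ys[k2])) < delta:
            adj[(j, k)].add((j2, k2))
            adj[(j2, k2)].add((j, k))
    return adj

def max_clique(adj):
    """Largest clique via a plain Bron-Kerbosch enumeration."""
    best = []
    def bk(R, P, X):
        nonlocal best
        if not P and not X:
            if len(R) > len(best):
                best = R
            return
        for v in list(P):
            bk(R + [v], P & adj[v], X & adj[v])
            P.remove(v)
            X.add(v)
    bk([], set(adj), set())
    return best
```

With xs = [0, 3, 7], ys = [0.1, 3.1, 7.2] and δ = 0.5, the maximal clique is the expected correspondence {(0, 0), (1, 1), (2, 2)}.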
Iterative algorithm (Wu et al., 1998)
This method is for analysing multiple protein structures; it performs superposition and averaging. The algorithm iterates between solving for correspondence and superposition. Correspondence is solved by dynamic programming and superposition by least squares regression.
Dynamic programming
Dynamic programming is used to align two sequences; specifically it finds corre-
spondence between two structures that minimises the overall distance between the
structures. Let i and j be the sequence indices of atoms in structures {µi} and {xj} respectively, for i = 1, 2, . . . , m and j = 1, 2, . . . , n. Let d(j, i) be some distance metric between atoms xj and µi. Then we can find two co-linear sequences of atoms 1 ≤ i(1) < i(2) < · · · < i(q) ≤ m and 1 ≤ j(1) < j(2) < · · · < j(q) ≤ n that minimise the function

∑_{r=1}^{q} d(j(r), i(r)) + g(0, j(1)) + ∑_{r=1}^{q−1} h(j(r), j(r+1)) + g(j(q), n + 1)
+ g(0, i(1)) + ∑_{r=1}^{q−1} h(i(r), i(r+1)) + g(i(q), m + 1),   (1.1)
where g(r, s) is the gap penalty for skipping from r to s at the end of either sequence,
and h(r, s) is the gap penalty for skipping from r to s in the middle of either sequence.
The algorithm makes two attempts to find the correspondence within each iteration. First it uses the curvature at each Cα as the distance metric for matching; second it matches using the coordinates of the Cα atoms as the distance metric.
The premise behind the dynamic programming algorithm is that an optimal correspondence can be constructed by adding two aligned elements to a previously obtained optimal alignment. This insight means that it is not necessary to search all possible alignments in order to obtain the optimal one (for two n-length sequences, this would amount to time proportional to n·4^n); rather, dynamic programming sequentially adds elements to an optimal alignment already constructed. This reduces the time cost to just O(n^2).
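The recursion can be sketched as a minimal alignment with a single constant gap penalty (simpler than the separate end and middle penalties g and h of (1.1)); d is any distance metric, and the function name is ours:

```python
def align_cost(mu, x, d, gap=1.0):
    """Minimal total cost of aligning sequences mu (length m) and x (length n):
    D[i][j] = min(match i with j, skip an element of mu, skip an element of x),
    filled in O(m*n) time."""
    m, n = len(mu), len(x)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap
    for j in range(1, n + 1):
        D[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j - 1] + d(mu[i - 1], x[j - 1]),
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)
    return D[m][n]
```

For 1-D coordinates with d(a, b) = |a − b|, aligning [0, 5] with [0, 1, 5] correctly skips the unmatched middle atom at the cost of one gap penalty.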
Mechanics of the iterative algorithm (Wu et al., 1998)
In general the algorithm allows superposition of multiple proteins. Let Xj be a
coordinate matrix of corresponding atoms in the jth protein structure. Each column
in Xj represents an atom in the protein structure. In the least-squares formulation,
they find an affine model X and transformation matrices Aj (for j = 1, . . . , J) that minimise the objective function:

∑_{j=1}^{J} ||Aj Xj − X||^2.   (1.2)
The algorithm consists of three steps:
(a) Compute a curvature function κ for each protein structure Sj. Find corresponding landmarks X_j^(1) by matching curvatures to a reference structure, and obtain the affine model X^(1) and transformation matrices A_j^(1) for j = 1, . . . , J.
(b) Find corresponding landmarks X_j^(2) by matching coordinates to a reference structure, and obtain the affine model X^(2) and transformation matrices A_j^(2).
(c) Find corresponding landmarks Xj by matching coordinates iteratively to the evolving affine model, and obtain the affine model X and transformation matrices Aj.
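Steps (a)-(c) can be caricatured for the rotation-only case (the affine model of Wu et al. (1998) is more general), assuming the correspondences are already known; a sketch with names of our choosing:

```python
import numpy as np

def rotate_onto(X, Y):
    """Proper rotation R minimising ||X @ R - Y||_F (SVD solution)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    D = np.diag([1.0] * (X.shape[1] - 1) + [np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt

def iterative_superposition(structures, n_iter=20):
    """Rotate each centred structure onto the evolving mean configuration."""
    Xs = [X - X.mean(axis=0) for X in structures]
    mean = Xs[0]                        # initial reference structure
    for _ in range(n_iter):
        Xs = [X @ rotate_onto(X, mean) for X in Xs]
        mean = np.mean(Xs, axis=0)      # update the evolving model
    return Xs, mean
```

After convergence, structures that are rigid-body copies of one another coincide with the mean up to numerical precision.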
The iterative algorithm of Wu et al. (1998) assumes and uses sequence order information in addition to spatial information in the form of point coordinates, while the graph method of Gold (2003) uses spatial information in the form of inter-point distances. A common problem with these approaches is that they do not take concomitant information about the amino acids into account in their matching and alignment in a flexible way, so as to model the amino acid substitution phenomenon taking place in proteins.
1.2.2 Extreme Values in Bioinformatics
Minimum RMSD in protein matching
Extreme values of RMSD in Bioinformatics protein matching applications arise at two levels. The first level is the pair-wise matching of two configurations, say {µ} and {x} with m and n points respectively. In this case an optimal RMSD, in some sense, is sought. The optimal RMSD could be defined to satisfy:
(a) it is the minimum RMSD among the q! C(m, q) C(n, q) values, where q ∈ {2, . . . , min(m, n)} is the number of matched points;
(b) after alignment, the distances between matching points are within a specified tolerance limit;
(c) q is maximal.
In Chapters 5 and 6 we look for corresponding points and an alignment that give the minimum RMSD.
The second level arises when searching a database for a match. Here the interest is in matches with small RMSD. For example, best-fitting matches are analysed in Chapter 4, where we develop a method to rank the best matches. In Chapter 7 we follow Stark et al. (2003b) in using the Extreme Value Distribution (EVD) to quantify the probability of matching by chance. In this set-up the null hypothesis is that the matching configurations are random and the matching is due to mere chance (random matches); that is, the matching configurations are not related in any way whatsoever.
Extreme value distribution as null distribution
In Bioinformatics applications, e.g. sequence matching or structure matching, the sample space under the random matching hypothesis is practically infinite and difficult to specify and calculate. Consider the case of two random 3-dimensional structures. These can be of any sizes, say m, n = 1, 2, . . . , and give matches of size
q = 2, 3, . . . . Each structure with m > 2 or n > 2 points can take an infinite num-
ber of configurations. Following Stark et al. (2003b), a practical way to specify the
random distribution is by collecting a large enough database of non-redundant and
non-homologous configurations. The database distribution is then used as the null
distribution under the random hypothesis. The database has to be non-homologous
and non-redundant to correctly control for Type I error rate. Ideally the background
database size should be as large as possible.
Because the interest is in the extremes from an effectively infinite database, limiting Extreme Value Distributions (EVD) are used to model the background database distribution. For example, let the distribution of the RMSD r be F(r) and denote its limiting distribution by G(r). Because the limiting EVD depends only weakly on the data-generating distribution function F(r), the null distribution can be modelled reliably even when F(r) is difficult to calculate, let alone specify. All that is required is to know how F(r) depends on m, n and q.
Limiting extreme value distributions
Extremal types distributions are limiting distributions used to model extreme deviations from the mean of probability distributions for stochastic processes. Two approaches exist today:
(a) the most common is the tail-fitting approach based on the second theorem in extreme value theory (Theorem II; Pickands, 1975; Balkema and de Haan, 1974);
(b) the basic theory approach as described by Burry (1975), which conforms to the first theorem in extreme value theory (Theorem I; Fisher and Tippett, 1928; Gnedenko, 1943).
The difference between the two theorems lies in the nature of the data generation. For Theorem I the data are generated over the full range, while for Theorem II data are only recorded when they surpass a certain threshold (Peaks Over Threshold, POT, models).
There are three classes of limiting distributions for extreme values:

Gumbel:
G(r) = exp{− exp(−r)} for −∞ < r < ∞;

Frechet:
G(r) = 0 for r ≤ 0, and G(r) = exp(−r^{−α}) for r > 0, with α > 0;

Negative Weibull:
G(r) = exp{−(−r)^{α}} for r < 0, with α > 0, and G(r) = 1 for r ≥ 0.
These classes are unified by re-parametrisation to give the Generalised Extreme Value distribution, GEV(µ, σ, ξ), with distribution function

G(r) = exp{ −[1 + ξ(r − µ)/σ]_+^{−1/ξ} }   (1.3)

where x_+ = max(x, 0) and σ > 0, so up to type the GEV distribution is

G(r) = exp[ −(1 + ξr)_+^{−1/ξ} ].   (1.4)
• Gumbel corresponds to ξ = 0 (taken as the limit ξ → 0), i.e. GEV(0, 1, 0) = Gumbel;
• Frechet corresponds to ξ > 0, i.e. GEV(1, α^{−1}, α^{−1}) = Frechet(α);
• Negative Weibull corresponds to ξ < 0, i.e. GEV(−1, α^{−1}, −α^{−1}) = Negative Weibull(α).
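The distribution function (1.3) and its special cases are easy to check numerically; a sketch (our function, with ξ = 0 handled as the Gumbel limit):

```python
import math

def gev_cdf(r, mu=0.0, sigma=1.0, xi=0.0):
    """G(r) = exp(-[1 + xi*(r - mu)/sigma]_+^(-1/xi)) as in (1.3);
    xi = 0 is taken as the Gumbel limit exp(-exp(-z))."""
    z = (r - mu) / sigma
    if xi == 0.0:
        return math.exp(-math.exp(-z))   # Gumbel
    t = 1.0 + xi * z
    if t <= 0.0:                         # outside the support
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))
```

For instance, with µ = 1 and σ = ξ = 1/α the Frechet form exp(−r^{−α}) is recovered for r > 0.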
Type identification
For particular, well-known F(r), the type of limiting distribution can be derived. For example, the normal and log-normal distributions give rise to the Gumbel type, while Student's t and the uniform give the Frechet and Negative Weibull types respectively. In general, exponentially tailed distributions give the Gumbel type; algebraically tailed distributions give the Frechet type; and distributions with a finite end-point give the Negative Weibull type. The Frechet distribution is for positive random variables while the Negative Weibull is for negative random variables. This classification facilitates easy identification of the right type of EVD to model the scores or measures. For example, the Frechet distribution is clearly the right type for RMSD (Stark et al., 2003b): RMSD values are positive and have a heavy-tailed distribution attenuated at zero.
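As a numerical check on this unification, the standardised GEV (1.4) and the Frechet form can be coded directly (a minimal pure-Python sketch; the function names are ours):

```python
import math

def gev_cdf(r, xi):
    """Standardised GEV distribution function, equation (1.4):
    G(r) = exp(-(1 + xi*r)_+^(-1/xi)), with the Gumbel limit at xi = 0."""
    if xi == 0.0:                       # limit xi -> 0 gives the Gumbel form
        return math.exp(-math.exp(-r))
    t = 1.0 + xi * r
    if t <= 0.0:                        # outside the support
        return 0.0 if xi > 0.0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def frechet_cdf(r, alpha):
    """Frechet distribution function: 0 for r <= 0, exp(-r^-alpha) for r > 0."""
    return 0.0 if r <= 0.0 else math.exp(-r ** -alpha)
```

For example, with ξ = 1/α, G(α(r − 1)) equals the Frechet(α) distribution function at r, illustrating the "up to type" (location-scale) correspondence.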
Adjusting for database size
Because of the max-stability property of the GEV distribution, the fitted null distribution can be used for searches in a database of a different size by correcting the normalising constants. In general, if for an extreme M_n of (r_i, i = 1, …, n),

(M_n − b_n)/a_n →_D EVD(µ, σ, ξ) for a random database of size n,   (1.5)

then, by the domains of attraction principle, normalising constants for searching a database of size n′ are given by 1 − F(b_{n′}) = 1/n′ and a_{n′} = h(b_{n′}) = [1 − F(b_{n′})]/f(b_{n′}). However, in Chapter 7 we use an ad hoc method (Stark et al., 2003b; Torrance et al., 2005) to adjust for sample space since F(r) is unknown.
Chapter 2
Exploratory Analysis of Protein
Geometry
In this chapter we are interested in learning some properties of proteins in general and of functional sites in particular. We consider spatial information in terms of point coordinates only, for functional sites and protein structure atoms. We explore the spatial arrangement of atoms in both functional sites and whole proteins using spatial statistics tools.
2.1 Background
We are interested in point patterns, or the spatial positions of points. We would like to characterise, say, whether the points in the configurations are clustered, regular or random, or whether there are variations in intensity in different regions.
2.1.1 Inter-event distances
Consider a configuration of points {x_j}, j = 1, …, n. Inter-event or inter-point distances are ‖x_i − x_j‖ for i, j = 1, …, n and i ≠ j. The inter-event distribution is

H(t) = P(‖X_i − X_j‖ ≤ t).
Conditional on the number of events N(A) = n of a spatial point process N in the region of observation A, where N(A) = N ∩ A, the empirical inter-event distance distribution function (EDF) is written

H(t) = [1/{n(n − 1)}] Σ_{i≠j} I{‖x_i − x_j‖ ≤ t},
where xi are the events in the observed spatial point pattern and I{.} is an
indicator function. If the theoretical inter-event distance distribution function, say
H0(t) for a theoretical spatial point process is known, deviations of H(t) from H0(t)
can be used to test the hypothesis that an observed point pattern is a realisation
from the theoretical spatial point process.
In section 2.2 we visually (informally) compare H(t) for functional site Cα atoms to H₀(t) for the complete spatial randomness (CSR) model, i.e. a uniform distribution of N points in the region A, where N(A) ∼ Poisson(λ).
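This empirical H(t) is straightforward to compute from the pairwise distances (a minimal pure-Python sketch; the function name is ours):

```python
import itertools
import math

def interevent_edf(points, t):
    """Empirical inter-event distance distribution function H_hat(t):
    the proportion of pairs i != j with ||x_i - x_j|| <= t."""
    n = len(points)
    close = sum(1 for a, b in itertools.combinations(points, 2)
                if math.dist(a, b) <= t)
    # each unordered pair is counted twice in the sum over i != j,
    # matching the 1/(n(n-1)) normalisation
    return 2.0 * close / (n * (n - 1))
```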
2.1.2 Point to nearest event distances
Another statistical tool for characterising spatial point processes is “point to nearest
event distance”. While for inter-event distance, we consider all the events in the
region, this type of analysis uses distances ti from each of m sample points in A to
the nearest of the n events. Thus point to nearest event distance summarises local
characteristics of the spatial point process.
The empirical distribution function (EDF), F(t) = m^{−1}#(t_i ≤ t), measures the “empty spaces” in A, in the sense that 1 − F(t) is an estimate of the volume (area) |B_t| of the region B_t consisting of all points in A at distance at least t from every one of the n events in A. Again, F(t) can be compared to the theoretical distribution for a particular spatial point process of interest.
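A minimal sketch of F(t) (pure Python; the function name is ours):

```python
import math

def empty_space_edf(sample_points, events, t):
    """Empirical point-to-nearest-event distribution F_hat(t) = m^-1 #(t_i <= t),
    where t_i is the distance from sample point i to its nearest event."""
    m = len(sample_points)
    hits = sum(1 for s in sample_points
               if min(math.dist(s, e) for e in events) <= t)
    return hits / m
```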
2.1.3 The K-function
The K-function was introduced by Bartlett (1964) and its potential and importance
for analysing point patterns was realised and developed by Ripley (1976, 1977).
For a stationary isotropic process, the K-function can be defined as

K(t) = λ^{−1} E(number of points within distance t of a randomly chosen point), t > 0,

where λ is the mean number of points per unit region.
A K-function provides a summary of spatial dependence over a wide range of
scales of a pattern, including all event-event distances, not just the nearest neigh-
bour distances. Since theoretical forms of the function are known for various possible
spatial point process models, the K-function can be used to explore spatial depen-
dence, in addition to suggesting specific models to represent the observed spatial
point process and to estimate the parameters of such models.
The estimator for K(t) is

K̂(t) = n^{−2}|A| Σ_{i≠j} ω(x_i, u_{ij}) I_t(u_{ij}),

where u_{ij} denotes the distance between the ith and jth events in A, ω(x_i, u_{ij}) is the proportion of the surface of the sphere with centre x_i and radius u_{ij} which lies within A, and I_t(u_{ij}) is an indicator function taking the value 1 if u_{ij} ≤ t and 0 otherwise.
We consider the K-function and inter-event distance for Cα atoms.
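Ignoring the edge-correction weight ω (i.e. taking ω ≡ 1, which is reasonable only far from the boundary of A), the estimator can be sketched as (pure Python; the function name is ours):

```python
import math

def k_function(points, t, volume):
    """Naive estimator K_hat(t) = n^-2 |A| * sum_{i != j} I(u_ij <= t),
    with the edge-correction weights omega set to 1 for simplicity."""
    n = len(points)
    s = sum(1 for i in range(n) for j in range(n)
            if i != j and math.dist(points[i], points[j]) <= t)
    return volume * s / n ** 2
```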
2.2 Functional Sites
We consider spatial distribution of Cα atoms of functional sites. We evaluate the
first and second order statistics for spatial processes:
(a) Three dimensional plot of an estimated density field.
(b) Point to nearest event distance frequency plot.
(c) The K-function. Plotted are normalised K-functions: K(t) − πt².
Figure 2.10: The K-function and inter-point distance distribution for Cα atoms in
5-aminolaevulinate dehydratase structure (1aw5).
Figure 2.11: The K-function and inter-point distance distribution for Cα atoms in
5-aminolaevulinate dehydratase structure from E. coli (1b4e).
Figure 2.12: The K-function and inter-point distance distribution for Cα atoms in
carbonyl reductase structure (1cyd).
Figure 2.13: The K-function and inter-point distance distribution for all atoms in 17-β hydroxysteroid dehydrogenase (1a27).
Chapter 3
Simulation Design and Evaluation
of Algorithms
We will consider simulations to evaluate the performance of our approach and some
other known algorithms. Simulations are used to evaluate the correct correspondence
rate for matching methods in Chapters 5 and 6. We cover the simulation scheme
in section 3.1 while 3.2 covers topics on evaluation. Highlighted in section 3.1.2
are simulations of Aszodi and Taylor (1994), producing compact random structures
with a hydrophobic core.
3.1 Simulations
In this section we are concerned with how we simulate functional sites and proteins to evaluate the performance of different algorithms.
3.1.1 Functional Sites Simulations
Functional site pairs of varying sizes were simulated. Each pair consisted of {µ} and {x}. The size n of {x} varied from 4 to 64 in steps of 4 (i.e. n = 4, 8, …, 64). The size of {µ} was taken to be m = ⌈1.1n⌉, with the additional 10% of the points in {µ} having no corresponding points in {x}. The choice of 10% should provide enough
noise in the system to evaluate our matching algorithms. Luo and Hancock (2001) added up to about 10% of non-corresponding points to evaluate different matching algorithms.
Point set configurations.
For each size n = 4, 8, …, 64:
(a) Hard-core simulation: generate a configuration of m = ⌈1.1n⌉ points constituting {µ} in a 35³ cube. We uniformly sample points inside the cube and reject a point if it is within 5 units of any other point. The inhibition distance of 5Å is to model the van der Waals radius in molecules. Aszodi and Taylor (1994) observed an average value of about 5.5Å for the van der Waals radius between Cα atoms in protein molecules. We also observed an inhibition distance of between 5Å and 6Å in the inter-event distance histograms for functional sites in section 2.2. The distance of 5Å was chosen as this is a conservative threshold for interaction between two atoms, where the atoms are either Cα atoms or atoms in side chains (Park et al., 2001).
(b) We then randomly generate colours for these points according to frequencies
of amino acids in Table 3.1.
(c) Choose randomly (without replacement) n points from {µ}.
(d) From each of the chosen n points of {µ}, simulate a point x ∼ N(µ, 0.5I₃). There is no preferred direction for the x, y and z coordinates, hence we assume an isotropic Gaussian. It is also biologically plausible to assume independence between the coordinates.
The colour of x is assigned in one of two ways:
i. First approach: simply take the colour of µ (no mutation of colour).
ii. Second approach: simulate a mutation to get the colour of x (see the simulation of the mutational process below).
A set of m points in {µ} and n points in {x} constitutes a pair of functional sites. These pairs are used for evaluating the performance of matching methods as outlined in section 3.2.1. We also use these configurations for studying the minimum RMSD distribution in section 4.1.1 of Chapter 4.
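Steps (a)-(d) above can be sketched as follows (pure Python; the function name and the use of √0.5 as the per-coordinate standard deviation for N(µ, 0.5I₃) are ours, and colours are omitted for brevity):

```python
import math
import random

def simulate_site_pair(n, rng, cube=35.0, inhibition=5.0, sd=math.sqrt(0.5)):
    """Simulate one pair ({mu}, {x}) as in section 3.1.1:
    hard-core sampling of m = ceil(1.1 n) template points in a 35^3 cube,
    then Gaussian jitter of a randomly chosen subset of n of them."""
    m = math.ceil(1.1 * n)
    mu = []
    while len(mu) < m:
        p = [rng.uniform(0.0, cube) for _ in range(3)]
        # hard-core rejection: keep only points at least `inhibition` apart
        if all(math.dist(p, q) >= inhibition for q in mu):
            mu.append(p)
    chosen = rng.sample(range(m), n)        # without replacement
    # isotropic Gaussian jitter, x ~ N(mu_i, 0.5 I_3)
    x = [[mu[i][k] + rng.gauss(0.0, sd) for k in range(3)] for i in chosen]
    return mu, x

rng = random.Random(1)
mu, x = simulate_site_pair(8, rng)
```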
Table 3.1: Frequencies of amino acids in the Accepted Point Mutation (PAM) Data.
N Asn 0.040 H His 0.034
S Ser 0.070 R Arg 0.040
D Asp 0.047 K Lys 0.081
E Glu 0.050 P Pro 0.051
A Ala 0.087 G Gly 0.089
T Thr 0.058 Y Tyr 0.030
I Ile 0.037 F Phe 0.040
M Met 0.015 L Leu 0.085
Q Gln 0.038 C Cys 0.033
V Val 0.065 W Trp 0.010
Evolution of amino acid classes.
We consider the model of Dayhoff et al. (1978) for evolutionary change in proteins. The Dayhoff et al. (1978) model for amino acid interchanges is applicable to functional site amino acids as well, since it assumes that amino acid mutation is sequence independent. However, the actual frequencies of substitutions are different in functional sites. We assume that amino acid mutation is also independent of spatial positions.
Accepted point mutations
An accepted point mutation is a replacement of one amino acid by another which is accepted by natural selection. To be viable, the new amino acid usually must function in a way similar to the old one. Chemical and physical similarities are found between the amino acids that are observed to interchange frequently. In the evolutionary change model, the likelihood of amino acid c replacing k is the same as that of k replacing c. As a result, no change in amino acid frequencies over evolutionary distance will be detected.
The probability that each amino acid will change in a given small evolutionary
interval is called the “relative mutability” of the amino acid. Thus relative mutability
of each amino acid is proportional to the ratio of changes to occurrences. Table 3.2
gives these relative mutabilities computed by Dayhoff et al. (1978).
Table 3.2: Relative mutabilities of amino acidsa.
N Asn 134 H His 66
S Ser 120 R Arg 65
D Asp 106 K Lys 56
E Glu 102 P Pro 56
A Ala 100 G Gly 49
T Thr 97 Y Tyr 41
I Ile 96 F Phe 41
M Met 94 L Leu 40
Q Gln 93 C Cys 20
V Val 74 W Trp 18
a The value for Ala has been arbitrarily set at 100.
Substitution matrix
Information about individual kinds of mutations and about the relative mutability
of amino acids is combined into one time-dependent “mutation probability matrix”.
An element of this matrix, mij , gives the probability that the amino acid in row i
will be replaced by the amino acid in column j after a given evolutionary interval.
Evolutionary distance between proteins is measured in PAM (Percent Accepted Mutation) units. 1 PAM corresponds to an evolutionary distance of one amino acid change in every 100 amino acids. Dayhoff et al. (1978), in addition to calculating mutabilities for amino acids, also compiled data on accepted point mutations. PAM substitution matrices are computed from these pieces of information. Shown in Table 3.3 is a 1 PAM matrix.
Simulation of mutational process
The mutation probability matrix provides the information with which to simulate
any degree of evolutionary change in an unlimited number of proteins. Further, we
can start with one protein and simulate its separate evolution in duplicated genes
or in divergent organisms. By considering large numbers of such related sequences,
a measure is readily obtained of the expected deviations due to random fluctuations
in the evolutionary process.
Let us simulate the effect of 1 PAM of evolutionary change on a particular
amino acid set. To determine the fate of the first amino acid, say alanine, we
obtain a uniformly distributed random number between 0 and 1. The first row of
the mutation probability matrix (Table 3.3) gives the relative probability of each
possible event that may befall alanine (neglecting deletion for simplicity). If the
random number falls between 0 and 0.9867, Ala is unchanged. If the number is
between 0.9867 and 0.9868, it is replaced with Arg, if it is between 0.9868 and 0.9872,
it is replaced with Asp, and so forth. Similarly, a random number is produced for
each amino acid in the set, and action is taken as dictated by the corresponding row
of the matrix. The result is a simulated mutant set. Any number of these can be
generated; their average distance from the original is 1 PAM.
The effects on the set of a longer period of evolution may be simulated by suc-
cessive applications of the matrix to the set resulting from the last application.
Alternatively, the matrix may be multiplied by itself repeatedly and applied once to
the sequence. The two procedures produce mutant sequences of the same average PAM distance from the initial set. Simulations in this thesis, e.g. in section 3.1.1, use PAM120, i.e. the matrix is multiplied by itself 120 times.
For simulations in which a predetermined number of changes are required, a
two-step process involving two random numbers for each mutation can be used.
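The row-sampling procedure above can be sketched with a made-up 3-letter alphabet in place of the real 20 × 20 1 PAM matrix of Table 3.3 (pure Python; the names and toy probabilities are ours):

```python
import random

def mutate(seq, M, alphabet, rng):
    """Apply one step of the mutation probability matrix M to each residue:
    draw u ~ Uniform(0, 1) and walk along the cumulative row probabilities."""
    out = []
    for aa in seq:
        row = M[alphabet.index(aa)]
        u, cum = rng.random(), 0.0
        for j, p in enumerate(row):
            cum += p
            if u < cum:
                out.append(alphabet[j])
                break
        else:
            out.append(aa)          # guard against rounding of the row sum
    return "".join(out)

def matrix_power(M, k):
    """Repeatedly multiply M by itself; e.g. k = 120 gives a PAM120 matrix."""
    n = len(M)
    R = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        R = [[sum(R[i][l] * M[l][j] for l in range(n)) for j in range(n)]
             for i in range(n)]
    return R

alphabet = "ARN"                     # toy alphabet, for illustration only
M = [[0.98, 0.01, 0.01],
     [0.02, 0.97, 0.01],
     [0.01, 0.01, 0.98]]
mutant = mutate("AARN", M, alphabet, random.Random(0))
M120 = matrix_power(M, 120)
```

Successive applications of `mutate` and a single application of the powered matrix give mutant sets of the same average PAM distance, as described above.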
and so the jth row consists of h_j repeated j times, followed by −jh_j and then q − j − 1 zeros, for j = 1, …, q − 1.
For q = 3 the full Helmert matrix is explicitly

Hf = (  1/√3    1/√3    1/√3
       −1/√2    1/√2    0
       −1/√6   −1/√6    2/√6 )
Chapter 4. Match Statistics 61
and the Helmert sub-matrix is

H = ( −1/√2    1/√2    0
      −1/√6   −1/√6    2/√6 ).
For q = 4 the full Helmert matrix is

Hf = (  1/2      1/2      1/2      1/2
       −1/√2     1/√2     0        0
       −1/√6    −1/√6     2/√6     0
       −1/√12   −1/√12   −1/√12    3/√12 )
and the Helmert sub-matrix is

H = ( −1/√2     1/√2     0        0
      −1/√6    −1/√6     2/√6     0
      −1/√12   −1/√12   −1/√12    3/√12 ).
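The general construction of the Helmert sub-matrix can be sketched as (pure Python; the function name is ours):

```python
import math

def helmert_submatrix(q):
    """(q-1) x q Helmert sub-matrix: the jth row (j = 1, ..., q-1) is
    (h_j, ..., h_j, -j*h_j, 0, ..., 0) with h_j = -1/sqrt(j(j+1)),
    so the rows are orthonormal and each sums to zero."""
    rows = []
    for j in range(1, q):
        h = -1.0 / math.sqrt(j * (j + 1))
        rows.append([h] * j + [-j * h] + [0.0] * (q - j - 1))
    return rows
```

That each row sums to zero is what removes the location effect when H is applied to a configuration.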
Definition 4.1.2. The pre-shape of a configuration X is all the geometrical information that remains when location and scale effects are filtered out from the object. That is, the pre-shape of X is given by

Z = XHᵀ/‖XHᵀ‖,

where H is the Helmert sub-matrix. The pre-shape of an object is invariant under translation and scaling of the original configuration.
Definition 4.1.3. The pre-shape space is the space of all possible pre-shapes. Formally, the pre-shape space S^q_d is the orbit space of the non-coincident q-point configurations in ℜ^d under the action of translation and isotropic scaling. The pre-shape space S^q_d ≡ S^{d(q−1)−1} is a hypersphere of unit radius in d(q − 1) real dimensions, since the centroid size of Z is ‖Z‖ = 1.
Definition 4.1.4. The Procrustes distance ρ(X,µ) is the closest great circle
distance between pre-shapes of X and µ on the pre-shape sphere. The minimisation
is carried out over rotations.
From the distribution of size-and-shape we will derive the distribution of RMSD, which is a function of size-and-shape. RMSD is commonly used in Bioinformatics applications while size-and-shape is mainly used in morphometry. We use the distribution of RMSD in a Bioinformatics application to rank best matches from a database search in section 4.1.4.
Consider the distribution of RMSD, r = d_S(X, µ)/√q, under the isotropic Gaussian model for corresponding points, i.e. x_j ∼ N(µ_i, σ²I_d), where point µ_i corresponds to point x_j and d = 3 is the dimension. Thus RMSD is a function of two random variables, S_X and ρ. Under our model we assume S_µ is fixed, while S²_X is distributed as non-central χ²_ν(λ) with ν = dq − d and λ = S²_µ/σ². After optimal superposition of configurations with q points, the full Procrustes distance satisfies

d²_F = sin² ρ(X, µ) ∼ τ₀² χ²_{dq−d(d−1)/2−d−1}

with τ₀² = σ²/S²_µ. We consider exact and approximate distributions for r in the following sections. The approximation applies when S_X ≈ S_µ and the variability of S_X is small.
Exact Distribution
We first consider the distribution for d²_S(X, µ). With

sin² ρ(X, µ) ∼ τ₀² χ²_{dq−d(d−1)/2−d−1},

the density for x = cos ρ is

f(x) = [2xβ^α/Γ(α)] (1 − x²)^{α−1} e^{−β(1−x²)},   (4.2)

where β = 1/(2τ₀²) and α = {dq − d(d−1)/2 − d − 1}/2, and the density for y = S²_X is

f(y) = [1/(2σ²√{2π√(λy/σ²)})] e^{−(λ+y/σ²)/2} (y/(λσ²))^{−1/4} {e^{√(λy/σ²)} + e^{−√(λy/σ²)}}.   (4.3)
Assuming independence between x = cos ρ and y = S²_X, the joint distribution for x and y is

f(x, y) = [2xβ^α/Γ(α)] (1 − x²)^{α−1} e^{−β(1−x²)}
    × [1/(2σ²√{2π√(λy/σ²)})] e^{−(λ+y/σ²)/2} (y/(λσ²))^{−1/4} {e^{√(λy/σ²)} + e^{−√(λy/σ²)}}.   (4.4)
Let v = S²_µ + y − 2S_µ x√y and u = y. The inverse functions are y = u and x = (S²_µ + u − v)/(2√u S_µ), and the Jacobian of the transformation is |J| = 1/(2√u S_µ). Hence the joint distribution for u and v is

f(u, v) = [(S²_µ + u − v)β^α/(2uS²_µΓ(α))] {1 − ((S²_µ + u − v)/(2√u S_µ))²}^{α−1} exp{−β[1 − ((S²_µ + u − v)/(2√u S_µ))²]}
    × [1/(2σ²√{2π√(λu/σ²)})] e^{−(λ+u/σ²)/2} (u/(λσ²))^{−1/4} {e^{√(λu/σ²)} + e^{−√(λu/σ²)}}.   (4.5)

It is not easy to integrate out u in order to get the distribution for v. Thus we consider an approximation for the size-and-shape distance.
Approximation
We consider the distribution for the approximate size-and-shape distance when the variability of S_X is so small, or S_X ≈ S_µ, that we can treat S_X as a constant as well. For example, in Bioinformatics applications the interesting cases are where matching is good, hence the configurations are of the same size, i.e. S_X ≈ S_µ. Thus

d²_S(X, µ) ≈ 2S²_µ(1 − cos ρ(X, µ)).   (4.6)
With sin² ρ(X, µ) ∼ τ₀² χ²_{dq−d(d−1)/2−d−1}, the approximate¹ density for r is

f(r) = [2qrβ^α/(S²_µΓ(α))] (2 − qr²/S²_µ) {qr²/S²_µ − (qr²/(2S²_µ))²}^{α−1} exp[−β{qr²/S²_µ − (qr²/(2S²_µ))²}],   (4.7)

where β = 1/(2τ₀²) and α = {dq − d(d−1)/2 − d}/2. We adjust the degrees of freedom because we do not allow scaling, i.e. we multiply by S²_µ. We only lose d(d−1)/2 + d degrees

¹This is the density for S_µ√{2(1 − cos ρ(X, µ))/q}, an approximate size-and-shape distance in closely fitting configurations.
of freedom for rotation and translation, as

d²_S(X, µ) = inf_{A ∈ SO(d), b} ‖µ − AX − b‖²,

where SO(d) denotes the set of all d × d rotation matrices (orthogonal matrices with determinant equal to +1).
4.1.3 Simulations for RMSD Distribution
We simulate {µ} and {x} as in section 3.1.1, but here n = m = 20 and we simulated 10,000 pairs. The order of the x_j is randomly permuted as in section 3.2.1 so that we do not “know” the corresponding µ_i and x_j points.
Figure 4.2a gives a histogram of RMSD after optimal superposition using the graph theoretic method. Superimposed on this histogram is the probability density function of equation 4.7. We observe that this approximate distribution is a good fit. Figure 4.2b is a plot of the empirical distribution function and the cumulative distribution function of equation 4.7. We also observe a good fit here. Therefore a goodness-of-fit p-value from our approximate distribution can be used.
4.1.4 Application
We did a database search with a functional site of 5-aminolaevulinate dehydratase (1b4e 0) using the graph method of Gold (2003). The standard deviation σ is estimated to be around 0.3 for matching functional sites known to be related (functional sites from 17-β hydroxysteroid dehydrogenase and carbonyl reductase proteins shown in Figure 1.3) at a threshold of 1.5Å. Thus we set σ = 0.3 for a matching distance tolerance of 1.5Å (cf. section 5.3 of Chapter 5).
Table 4.1 gives the results for the best 50 matches sorted by goodness-of-fit p-values. Also given are scores proposed by Gold (2003) and described in section 3.2.2. The scores given in Table 4.1 are found by dividing the values of score option 2 (see section 3.2.2) by the RMSD.
We observed that 1eb3 0 (No. 13) has a higher p-value than 1i8j 2 (No. 14) although the latter has a lower RMSD. There is an agreement between the p-value and the score ranking. The better match has 21 corresponding amino acids compared to 8 for the other match. This justifies a better goodness-of-fit even though its RMSD is higher. This scenario is also observed for 1gjp 0 and 1l6s 2 (No. 16 and 17), and for 1h7r 0 and 1l6y 3 (No. 18 and 19).

Figure 4.2: Approximate RMSD distribution. a) Histogram of RMSD after optimal superposition using the graph theoretic method. b) Empirical and approximate (equation 4.7) distribution functions for RMSD.
4.1.5 Summary
A bigger challenge is to work out analytically the exact distribution for the number of matches and the RMSD when matching random configurations. Unlike our attempt to find a good approximating distribution for a Procrustes metric when matching random configurations, Stark et al. (2003b) empirically modelled the distribution for RMSD with the extreme value distribution. We follow Stark et al. (2003b) to
Table 4.1: Best fitting functional sites in the database when matched against 5-
aminolaevulinate dehydratase functional site (1b4e 0).
No. SITE q Sµ SX RMSD SCORE P-value
1 1b4e 0 21 41.46 41.46 0.000000 NA 1.0000000
2 1h7n 0 21 41.46 42.19 0.325806 0.513 0.9999999
3 1i8j 4 15 32.56 32.84 0.264983 0.502 0.9999993
4 1l6s 6 15 32.56 32.87 0.275493 0.492 0.9999982
5 1l6y 6 18 37.31 37.90 0.325130 0.479 0.9999966
6 1l6s 7 15 32.56 32.87 0.285211 0.465 0.9999946
7 1i8j 5 15 32.56 32.79 0.280753 0.467 0.9999943
8 1h7p 0 21 41.46 42.53 0.394005 0.414 0.9999942
9 1ohl 0 21 41.46 42.33 0.376413 0.437 0.9999863
10 1l6y 0 20 40.27 41.07 0.371252 0.540 0.9999744
11 1i8j 0 8 16.40 16.60 0.204785 0.325 0.9999703
12 1h7o 0 21 41.46 42.57 0.412440 0.407 0.9999651
13 1eb3 0 21 41.46 42.38 0.391614 0.422 0.9999554
14 1i8j 2 8 15.34 15.65 0.239518 0.300 0.9998613
15 1l6s 4 8 15.34 15.64 0.242736 0.302 0.9998003
16 1gjp 0 20 40.27 41.48 0.447832 0.352 0.9995251
17 1l6s 2 8 15.29 15.56 0.250936 0.290 0.9995213
18 1h7r 0 20 39.97 41.16 0.473749 0.318 0.9939235
19 1l6y 3 7 12.98 13.14 0.315031 0.192 0.9617287
20 1b4k 0 20 40.27 41.00 0.457764 0.266 0.9577919
21 1e51 0 20 40.07 41.56 0.573448 0.279 0.8148693
22 1gzg 0 21 41.46 42.20 0.504140 0.234 0.7643607
23 1b4k 1 17 34.74 35.62 0.582555 0.203 0.2456983
24 1hrs 2 3 5.72 5.46 0.766603 0.020 0.0006607
25 1m7h 6 5 9.52 8.68 0.869824 0.019 0.0002631
calculate p-values for matching random (unrelated) configurations in our application
in section 7.4 of Chapter 7.
The distribution for number of matches can also be modelled by the extreme
value distribution e.g. Chen and Crippen (2005).
Chapter 5
EM Algorithm Alignment
The commonly used graph theoretic approach (reviewed in section 1.2.1) and other related approaches, e.g. geometric hashing (Wallace et al., 1997), require adjustment of a matching distance threshold a priori according to the noise in atomic positions. This is difficult to pre-determine when matching sites related by varying evolutionary distances and crystallographic precision.
To avoid the problem of specifying a matching distance threshold, in this chapter we consider using an EM algorithm in a mixture model formulation of the problem to find an alignment and point correspondences between two configurations. Assume we are given two configurations {µi : i = 1, …, m} and {xj : j = 1, …, n} in ℜ^d. Suppose there are q ∈ {2, …, n} corresponding points in these configurations under rigid body transformation¹. However, we do not know
(a) which are the corresponding points;
(b) the number of corresponding points, q;
(c) the transformation parameters.
We reviewed some approaches for solving this problem in section 1.2.1. We review a statistical approach by Kent et al. (2004) using a mixture model in section 5.1. In
¹We are interested in matching at least two points.
section 5.2, we consider using concomitant information to point coordinates in Kent
et al. (2004) mixture model framework.
5.1 Mixture Model
Given configurations {µi : i = 1, …, m} and {xj : j = 1, …, n} in ℜ^d with correspondence and alignment unknown, Taylor et al. (2003) formulate a mixture model to solve for both correspondence and alignment simultaneously. Correspondence is considered to be missing data and the EM algorithm is used. Expected values of the mixture indicator variables are calculated in the E-step; alignment parameters that maximise the expected log likelihood are estimated in the M-step using Procrustes analysis. This is known as “soft” matching because we use expected values of the correspondence indicator variables.
5.1.1 Soft Matching of Forms
Let {µi} have more points than {xj}, i.e. n ≤ m, in order to assume that {xj} has arisen from {µi} through some transformation, with possibly some points in {µi} not appearing in {xj}. This model is plausible for the motivating problem in protein functional sites and certainly in some applications in chemoinformatics as well (Dryden et al., 2006). The restriction that n ≤ m is without loss of generality in many applications because in practice there is no knowledge of which of the two configurations to be matched gave rise to the other (parentage), i.e. {xj} and {µi} are exchangeable. Furthermore, the parentage is of no practical use as far as matching configurations is concerned. Indeed, for the Bayesian approach in Chapter 6, configuration sizes do not matter even for formulating the methodological matching framework and algorithm.
Let the map π(j) = i denote correspondence between points x_j and µ_i. If π(j) = i then assume

x_j = Aᵀµ_i + b + ε_i,   ε_i ∼ N(0, σ²I_d),

where σ² is unknown and A is an orthogonal matrix. That is, for fixed j, we take for i = 0, 1, …, m,

φ(x_j | π(j) = i) = (2πσ²)^{−d/2} exp{−(1/2)‖x_j − Aᵀµ_i − b‖²/σ²} if i ≠ 0, and φ(x_j | π(j) = 0) = 1/‖W‖.   (5.1)

The convention π(j) = 0 is used to classify a point x_j which does not correspond to any point µ_i. These points are referred to as coffin bin points. Coffin bin points are assumed to be uniformly distributed in a region W ⊂ ℜ^d, i.e.

x_j | (π(j) = 0) ∼ Uniform(W).
The marginal distribution of x_j is given by the mixture model

x_j ∼ Σ_{i=1}^{m} P(π(j) = i) N(Aᵀµ_i + b, σ²I_d) + P(π(j) = 0) Uniform(W),   (5.2)

where P(π(j) = i), i = 0, …, m, are the marginal membership probabilities and Σ_{i=0}^{m} P(π(j) = i) = 1.
Alternatively we can assume a normal distribution for the coffin bin points, i.e.

x_j | (π(j) = 0) ∼ N(µ₀, σ₀²I_d),

where µ₀ can be taken to be the centre of mass of {µ} and σ₀² is large.
5.1.2 Model Likelihood
Let X = (x₁, …, x_n)ᵀ and let L be a set of labels. Given L, the likelihood is

Q(X | L) = ∏_{i=0}^{m} ∏_{j=1}^{n} [p_i φ(x_j | π(j) = i)]^{I[π(j)=i]},

where I is an indicator function such that I[π(j) = i] = 1 if π(j) = i and 0 otherwise, and p_i = P(π(j) = i) is the mixing probability for any x to have label i.
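To make the E- and M-steps concrete, here is a deliberately simplified sketch in which the rotation is fixed at the identity and only the translation b and variance σ² are estimated (pure Python; the function name, mixing weights, iteration count and coffin bin volume are our illustrative choices, not the implementation used in this thesis):

```python
import math
import random

def em_translation(mu, x, n_iter=50, sigma2=1.0, p_out=0.1, vol_w=1000.0):
    """Soft matching under the mixture model (5.2), simplified to a
    translation-only alignment (rotation A fixed at the identity).
    Returns the translation b, variance sigma2 and responsibilities W."""
    d, m, n = len(x[0]), len(mu), len(x)
    b = [0.0] * d
    p_in = (1.0 - p_out) / m            # equal mixing weights over the mu_i
    W = []
    for _ in range(n_iter):
        # E-step: W[j][i] is the expected indicator P(pi(j) = i | x_j)
        W = []
        for xj in x:
            dens = []
            for mi in mu:
                r2 = sum((xj[k] - mi[k] - b[k]) ** 2 for k in range(d))
                dens.append(p_in * (2.0 * math.pi * sigma2) ** (-d / 2.0)
                            * math.exp(-0.5 * r2 / sigma2))
            dens.append(p_out / vol_w)  # coffin bin component
            s = sum(dens)
            W.append([v / s for v in dens])
        # M-step: responsibility-weighted translation and variance
        tot = sum(W[j][i] for j in range(n) for i in range(m))
        b = [sum(W[j][i] * (x[j][k] - mu[i][k])
                 for j in range(n) for i in range(m)) / tot
             for k in range(d)]
        sigma2 = max(sum(W[j][i] * sum((x[j][k] - mu[i][k] - b[k]) ** 2
                                       for k in range(d))
                         for j in range(n) for i in range(m)) / (d * tot),
                     1e-6)              # guard against variance collapse
    return b, sigma2, W

# toy example: x is mu shifted by (1, 2, 3) plus small noise
mu = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (5.0, 5.0, 5.0)]
rng = random.Random(0)
x = [[mi[k] + (1.0, 2.0, 3.0)[k] + rng.gauss(0.0, 0.01) for k in range(3)]
     for mi in mu]
b_hat, s2_hat, resp = em_translation(mu, x)
```

With well-separated points the responsibilities concentrate on the true correspondences and b̂ recovers the shift; the full method replaces the identity rotation with a Procrustes M-step.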
(Sanchez and Sali, 1998). We used ad hoc weights (section 5.2.2) with α = 2 when matching with concomitant information. Amino acids were grouped into hydrophobic, charged, polar and glycine classes (see Table 5.4). The difference in centres of mass for the configurations and the identity matrix were taken to be the starting values for the translation vector and rotation matrix respectively.
Table 5.4: Groups of amino acids (Branden and Tooze, 1999, p. 6).
Symbols: A C D E F G H I K L M N P Q R S T V W Y
Group 1 (hydrophobic) A F I L M P V
Group 2 (charged) D E K R
Group 3 (polar) C H N Q S T W Y
Group 4 (glycine) G
Table 5.5 summarises the results when using the EM algorithm with and without amino acid group information in matching a functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) against other functional sites. Table 5.6 summarises the results obtained by the graph method. We use three scoring options as defined in section 3.2.2. The final score (Score*) is obtained by dividing the option 2 raw score by the RMSD. The rule of thumb is: the bigger the score, the better the solution. All scores by the EM algorithm using colour are bigger than when not using colour information. The EM algorithm using colour also finds better matches for 1bfk 0, 1cyd 1 and 3pga 0 than the graph method. The EM algorithm did not converge for 1h5q 0, which has an unacceptably large RMSD. In Table 5.7 we give a solution for 1h5q 0 after proper convergence, using the distance constraining techniques of section 5.3 to improve the EM algorithm.
Table 5.5: Comparison of matching results with and without colour when matching a functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) against other functional sites. A relative weight of α = 2 was used for similar amino acids when using colour information.
No colour Colour
Raw Score Raw Score
Option Option
site 0 1 2 RMSD Score* 0 1 2 RMSD Score*
1ajr 0 7 0 1.0 5.13 0.19 13 2 5.5 4.06 1.35
1b4e 0 12 1 2.0 2.79 0.72 13 1 3.0 2.67 1.12
1bfk 0 12 1 2.5 5.24 0.48 18 4 6.5 3.36 1.93
1cyd 1 32 11 15.0 1.82 8.24 31 12 16.0 1.81 8.85
1g0n 0 19 2 4.0 3.33 1.20 22 5 9.5 4.05 2.35
1h5q 0 20 1 7.0 9.35 0.75 22 4 21.0 8.99 2.34
3pga 0 13 1 3.0 5.89 0.51 24 4 11.5 4.82 2.39
Score*= Option 2 Raw Score divided by the RMSD.
Figure 5.7 illustrates that increasing the weight for same-group residues (amino acids) increases the number of same-group matches. Figure 5.6 shows the superimposition of 17-β hydroxysteroid dehydrogenase on carbonyl reductase when the EM algorithm method is used. There are 27 common matches between the colour and no colour methods.

Table 5.6: Results when matching a functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) against other functional sites using the Gold (2003) method.
Raw Score
Option
site 0 1 2 RMSD Score*
1ajr 0 12 0 3.0 1.85 1.62
1b4e 0 10 2 5.0 4.19 1.19
1bfk 0 12 1 3.5 2.44 1.43
1cyd 1 27 14 18.5 3.31 5.59
1g0n 0 31 13 21.0 3.07 6.84
1h5q 0 33 16 22.0 2.72 8.09
3pga 0 15 1 4.5 3.20 1.41
Score* = Option 2 Raw Score divided by the RMSD.

There are 3 pairs exclusively matched when using colour information but not without colour information. On the other hand, 2 pairs are exclusively
matched when not using colour and not matched when using colour. Two of the 3 pairs matched exclusively when using colour are for identical amino acids; the other pair is for same-group amino acids. However, amino acids from different groups are matched in the two exclusive pairs when not using colour information.
Figure 5.6: Superposition of carbonyl reductase and 17-β hydroxysteroid dehydrogenase sites when matching with the EM algorithm. Amino acid class information is not used in (a) but used in (b).
Advantages of using amino acid grouping
Use of amino acid group as concomitant information increases
(a) the number of same group/residue matches;
(b) the volume of the region of initial parameter estimates for which the EM algorithm converges to the global maximum parameter vector.

Challenges of using amino acid group information

Using amino acid group information in this way to increase the number of same residue matches might be at the expense of the overall number of geometrical matches. Increasing the number of same residue matches sometimes also leads to an increase in RMSD, as seen in Figure 5.7 for matching 17-β hydroxysteroid dehydrogenase and 5-aminolevulinic acid.
5.2.6 Summarising Comments
• Use of amino acid type information through weights improves the quality of the match.
• From experimentation, if the total number of colours is a, then setting α = a for the ad hoc weights and β = a − 1 for the simple prior conditional probabilities gives optimal results. Heavy weights for similar residues come at the expense of geometrical matching (RMSD), and the gain in class matching is marginal (for each pair of sites there is a maximum number of possible class matches). A typical scenario is shown in Figure 5.7.
• It is seen from simulation studies that, with the use of concomitant information, we are able to find a set of good starting values for the EM algorithm, and the algorithm converges faster. Figure 5.5 shows good starting values with and without the use of colour information.
• To overcome the problem of starting values, a simple approach would be to try several random starting values. However, we consider a more comprehensive approach using Markov chain Monte Carlo (MCMC) techniques in a Bayesian framework in section 6.2 of Chapter 6.
5.3 Distance Constraints
It is observed that the EM algorithm in sections 5.1 and 5.2 tends to match more
points, and hence with larger RMSD, than the graph method. In the graph method,
matching all inter-point distances enforces strict geometrical matching constraints.
Here we consider further techniques for forcing matched points to be closer in the
EM algorithm.
In addition to using a posterior-probability-weighted variance estimated at each
iteration of the algorithm for the mixture model, we incorporate three techniques
to ensure smaller distances between matched points:
(a) Variance cooling. If the variance increases from the previous estimate at
iteration t of the N total iterations allowed, then use:
Score:
option 1 = No. identity matches / RMSD
option 2 = option 1 + No. similar matches / (2 × RMSD)
Figure 5.7: Match scores and RMSD against α (relative weight for similar amino
acids). Matched sites are 17-β hydroxysteroid dehydrogenase and 5-aminolevulinic
acid. a) Option 1 and 2 scores. b) Total number of pairs matched, pairs with the
same amino acid and pairs with the same group. c) RMSD.
σ² = A0 (AN / A0)^(t/N)

where A0 and AN are the desired variance values at t = 0 and t = N respectively.
From an application in section 5.3.1 we observe easy convergence and better
RMSD values for A0 = 100 and A200 = σg² = 0.3² in most cases. We choose
0.3 to correspond to the threshold value of 1.5Å for matching distances in the
graph method (see section 4.1.4 in Chapter 4). Furthermore, under the Gaussian
model, the width of an 85% C.I. for matching distances in the graph method is
2 × 1.04 √(3 × 2σg²). Equating this to the threshold value of 1.5Å gives σg = 0.297.
Kent et al. (2004) independently found that using σ = 0.3 for the EM algorithm
gives similar results to the graph method when matching the 17-β hydroxysteroid
dehydrogenase and carbonyl reductase functional sites. Indeed, conversely, using
the graph solution when matching these two functional sites, we estimate σ to be
around 0.3Å.
(b) Fixing the variance for the coffin bin, σ0². This value is calculated from the
volume W of {µ}. Consider a sphere with volume W, i.e. W = (4/3)πR³. Letting
2σ0 = R gives σ0² = (1/4)(3W/4π)^(2/3).
(c) In the linear programming, rule out correspondences with probability less than
cL = φ(r; σc² I3), where φ is the spherical normal density; r and σc² are
application-specific values to be specified by the user. We use a probability
threshold value of 0.038, obtained for r = 1 and σc² = 1.019, which seems to give
reasonably good results. Probability thresholding is similar to the Bayesian
approach considered in section 6.1.5.
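The three devices above are simple to compute; the following is a minimal sketch in Python using the constants quoted in the text (A0 = 100, A200 = 0.3², r = 1, σc² = 1.019). The function names are ours, chosen for illustration, not part of any thesis software.

```python
import numpy as np

def cooled_variance(t, N=200, A0=100.0, AN=0.3 ** 2):
    # Geometric cooling schedule: sigma^2 = A0 * (AN / A0)**(t / N)
    return A0 * (AN / A0) ** (t / N)

def coffin_bin_variance(W):
    # Fixed coffin-bin variance from the volume W of {mu}:
    # W = (4/3) pi R^3 and 2 sigma_0 = R give sigma_0^2 = (1/4)(3W / 4 pi)^(2/3)
    return 0.25 * (3.0 * W / (4.0 * np.pi)) ** (2.0 / 3.0)

def correspondence_threshold(r=1.0, sigma_c2=1.019, d=3):
    # c_L = phi(r; sigma_c^2 I_3): the spherical normal density at distance r
    return np.exp(-0.5 * r ** 2 / sigma_c2) / (2.0 * np.pi * sigma_c2) ** (d / 2.0)
```

With the defaults, `correspondence_threshold()` reproduces the probability threshold of about 0.038 quoted in (c), and the schedule decreases monotonically from A0 at t = 0 to AN at t = N.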
5.3.1 Results
Here we consider both the query and the templates from the tyrosine-dependent
oxidoreductase family. We compare a functional site of 17-β hydroxysteroid
dehydrogenase (1a27 0) to representative sites from each of the 33 domains in this
family. Sites in the first column of Table 5.7 were chosen as representatives for
their respective domains.
As in section 5.2.5, we used ad hoc weights with α = 2 when matching with
concomitant information (colour). Amino acids were also grouped into hydrophobic,
charged, polar and glycine. The difference in centres of mass for the configurations
and the identity matrix were taken to be the starting values for the translation
vector and rotation matrix respectively.
Reported in Table 5.7 are RMSD values for the graph and EM algorithms with and
without the use of colour information. Also reported are the differences (θ) between
the rotations used to match the sites by the graph and EM algorithm methods. If
AG and AE are the rotation matrices in the graph and EM algorithm respectively,
θ is such that the trace of the orthogonal matrix taking AG to AE is approximately
equal to 1 + 2 cos θ (Green and Mardia, 2006). Thus θ = cos⁻¹({tr(AG AEᵀ) − 1}/2).
Results show that these distance-constraining techniques considerably lower the
number of matching points and the RMSD. Solutions for 1h5q 0 when using the EM
algorithm are now comparable to the graph theoretic solution, unlike in Table 5.6
where the EM algorithm possibly did not converge. In general, the higher the number
of matching points (q) and the lower the RMSD, the better the solution. RMSD
and q are combined into a single score, e.g. in section 5.2.4 (Tables 5.5 and 5.6), to
rank the matches. Alternatively, p-values, e.g. in Chapter 4, section 4.1.4 (Table
4.1), can be used. However, here we just informally note a number of cases with
clearly better solutions by the EM algorithm compared to solutions by the graph
method (cases italicised in Table 5.7). Obvious cases are solutions with many more
matching points with RMSD of similar magnitude, or solutions with much lower
RMSD but with a comparable number of matching points.
5.4 Multiple Transformations
For simplicity we consider a situation where the configuration {x} is related to {µ} through two different transformations. The extension to many transformations is
be the prior probability of the label π(j) to be i and the label S(j) to be s. The
posterior probability is

p_ji = P(γ_s(j) = i | x_j) = P(x_j | γ_s(j) = i) p_i / P(x_j)

and (p_ji) is an n × (2m + 1) matrix. Note that

P(x_j) = Σ_{s=1}^{2} Σ_{i=1}^{m} p_i φ(x_j | γ_s(j) = i) + p_0 φ(x_j | γ_s(j) = 0)

and P(x_j | γ_s(j) = i) ≡ φ(x_j | γ_s(j) = i).
It is straightforward to extend this model formulation to more than two groups
of transformations. For the model to be identifiable, the number of transformations
s should obviously be much smaller than n.
5.4.3 The EM Algorithm
A simple extension of the EM algorithm with a coffin bin to two separate
transformations is considered.
Let p_i be given, with starting values p_i^(0) = 1/(2m + 1), say. Then the E-step is:

p_ji^(r+1) = P(x_j | γ_s(j) = i) p_i^(r) / P(x_j).

Substituting p_ji for I[γ_s(j) = i], the log-likelihood is

Σ_{s=1}^{2} Σ_{i=0}^{m} Σ_{j=1}^{n} {p_ji log p_i + p_ji log φ(x_j | γ_s(j) = i)}.   (5.12)

Thus in the M-step, we minimise

f(A_1, A_2, b_1, b_2) = Σ_{s=1}^{2} Σ_{i=1}^{m} Σ_{j=1}^{n} p_ji ‖x_j − A_sᵀ µ_i − b_s‖²   (5.13)

using a Procrustes fit for rigid body motion, where A_s is an orthogonal matrix. If
V_s Γ U_sᵀ is a singular value decomposition of

B_s = Σ_{i=1}^{m} Σ_{j=1}^{n} p_ji (µ_i − µ̄_s)(x_j − x̄_s)ᵀ,

where

µ̄_s = Σ_{i=1}^{m} Σ_{j=1}^{n} p_ji µ_i / Σ_{i=1}^{m} Σ_{j=1}^{n} p_ji

and x̄_s is the analogous weighted mean of the x_j, then A_s = V_s U_sᵀ.
Thus for the (r + 1)th iteration we have

B_s^(r+1) = Σ_{i=1}^{m} Σ_{j=1}^{n} p_ji^(r) (µ_i − µ̄_s)(x_j − x̄_s)ᵀ,   A_s^(r+1) = (V_s U_sᵀ)^(r+1).

By minimising (5.13) w.r.t. b_s, we have

b_s^(r+1) = Σ_{i=1}^{m} Σ_{j=1}^{n} p_ji^(r) (x_j − (A_s^(r+1))ᵀ µ_i) / Σ_{i=1}^{m} Σ_{j=1}^{n} p_ji^(r).

Finally, update the mixing proportions:

p_i^(r+1) = Σ_j p_ji^(r) / Σ_{j,i} p_ji^(r).
The E and M steps are repeated until convergence of the residual sum of squares

Σ_{s=1}^{2} Σ_{i=1}^{n} (x_i^(r+1) − x_i^(r))ᵀ (x_i^(r+1) − x_i^(r)),

where x_i^(r) = A_s^(r) µ_i + b_s^(r). To ensure convergence of the correspondence
matrix (p_ji) as well, use x_i^(r) = Σ_{j=1}^{n} p_ji A_s^(r) µ_i + b_s^(r). Another
criterion of convergence could be the log-likelihood (5.12).
At the rth iteration, the correspondence-probability-weighted maximum likelihood
estimate of σ² is

(σ²)^(r) = Σ_{s=1}^{2} Σ_{i=1}^{m} Σ_{j=1}^{n} p_ji^(r) ‖x_j − (A_s^(r))ᵀ µ_i − b_s^(r)‖² / (d × Σ_{s=1}^{2} Σ_{i=1}^{m} Σ_{j=1}^{n} p_ji^(r))

where d = 3 is the dimension. The unweighted estimate is

Σ_{s=1}^{2} Σ_{i=1}^{m} Σ_{j=1}^{n} ‖x_j − (A_s^(r))ᵀ µ_i − b_s^(r)‖² / (2 × d × n × m).
We assumed these two transformations have the same nuisance parameter σ. It is
straightforward to extend the theory to the case of different parameters. The
maximum likelihood estimate of the variance, (σ_s²)^(r), becomes

Σ_{i=1}^{m} Σ_{j=1}^{n} p_ji^(r) ‖x_j − (A_s^(r))ᵀ µ_i − b_s^(r)‖² / (d × Σ_{i=1}^{m} Σ_{j=1}^{n} p_ji^(r))

and the unweighted estimator is

Σ_{i=1}^{m} Σ_{j=1}^{n} ‖x_j − (A_s^(r))ᵀ µ_i − b_s^(r)‖² / (d × n × m).
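The E and M steps of this section can be sketched in a few lines of Python. This is a minimal illustration of the technique, not the thesis implementation: we hold σ² fixed, omit the coffin bin, and choose our own function and variable names.

```python
import numpy as np

def em_two_transforms(x, mu, n_iter=50, sigma2=0.5):
    # EM sketch for matching {x} to {mu} under two rigid-body transformations
    # (coffin bin omitted, sigma^2 held fixed).  x: (n, d) data, mu: (m, d).
    n, d = x.shape
    m = mu.shape[0]
    A = [np.eye(d), np.eye(d)]            # rotations A_1, A_2
    b = [np.zeros(d), np.zeros(d)]        # translations b_1, b_2
    p = np.full((2, m), 1.0 / (2 * m))    # mixing proportions over (s, i)
    resp = np.empty((2, m, n))
    for _ in range(n_iter):
        # E-step: p_ji proportional to p_i * phi(x_j | gamma_s(j) = i)
        for s in range(2):
            diff = x[None, :, :] - (mu @ A[s])[:, None, :] - b[s]
            resp[s] = p[s][:, None] * np.exp(-0.5 * (diff ** 2).sum(-1) / sigma2)
        resp /= resp.sum(axis=(0, 1), keepdims=True)   # normalise over (s, i)
        # M-step: weighted Procrustes fit for each transformation s
        for s in range(2):
            w = resp[s]                                # (m, n) weights p_ji
            tot = w.sum()
            mu_bar = w.sum(axis=1) @ mu / tot          # weighted mean of mu
            x_bar = w.sum(axis=0) @ x / tot            # weighted mean of x
            B = (mu - mu_bar).T @ w @ (x - x_bar)      # B_s as in the text
            U, _, Vt = np.linalg.svd(B)
            if np.linalg.det(U @ Vt) < 0:              # keep a proper rotation
                U[:, -1] *= -1
            A[s] = U @ Vt                              # A_s = V_s U_s^T in the text's notation
            b[s] = x_bar - A[s].T @ mu_bar
        p = resp.sum(axis=2) / n                       # update mixing proportions
    return A, b, resp
```

The determinant check is a standard Procrustes guard against reflections; the text's derivation assumes A_s is a rotation, which the SVD alone does not enforce.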
5.4.4 Simulations
We performed simulations to evaluate the algorithm, with m = 24, n = 20 and
sets S1, S2 having 10 points each. With an equal number of points in each set, we
expect equal membership preference for either transformation.
Table 5.8 gives correct correspondence proportions and rotation errors for A1 and
A2, i.e. measures of distance between the true and estimated rotation matrices.
Reported are results for several runs using different starting values for A1, A2, b1
and b2 (the EM algorithm is very sensitive to starting values; see Figure 5.5). For
each run we had 30 dataset replicates. Correct correspondence proportions for the
algorithm are around 0.7. Here a point has a correct correspondence if it is assigned
to a true corresponding point µi under the true transformation s, or is rightly not
assigned to any µi. As expected, the performance is not as good as in the simpler
case of one transformation only. Rotation errors for the first transformation are
around 0.05 radians, while for the second transformation they are in the range of
0.1 to 0.7 radians. There is higher accuracy in estimating the rotation for the first
transformation than for the second. This is surprising considering that we had an
equal number of points in each set. However, since we estimated A1 first, the higher
accuracy could be due to the algorithm drifting quickly towards the first
transformation, as we started quite near the true parameter setting (otherwise the
designation of transformations as first or second is arbitrary). Obviously, extensive
simulations are required to assess the performance of the algorithm conclusively,
especially the transformation errors.
Table 5.8: Proportions of correct correspondence and rotation errors when using
the EM algorithm for matching forms with two transformations. A point has a
correct correspondence if it is assigned to a true corresponding point µi under the
true transformation s, or is rightly not assigned to any other point µi.

              Correspondence                    Rotation error
Run    All Points    Set 1       Set 2       A1         A2
1      0.681         0.692       0.669       0.055      0.767
       (0.0057)      (0.0071)    (0.0075)    (0.0070)   (0.0340)
2      0.695         0.706       0.684       0.051      0.136
       (0.0058)      (0.0070)    (0.0077)    (0.0040)   (0.0072)
3      0.681         0.692       0.670       0.054      0.702
       (0.0057)      (0.0072)    (0.0074)    (0.0053)   (0.0336)

Given in parentheses are the standard errors.
Chapter 6
Bayesian Alignment
In this chapter we consider Markov chain Monte Carlo (MCMC) techniques in a
Bayesian paradigm to overcome the problem of sensitivity to starting values for the
EM algorithm in Chapter 5. We consider finding the alignment and point
correspondences between two configurations using a full joint distribution for the
correspondence matrix and the transformation parameters. Using MCMC with
detailed-balance updates and drawing from the posterior of all parameters should
stand a better chance of escaping local maxima of the model than simply trying
several starting values for the EM algorithm.
6.1 Bayesian Hierarchical Model
Green and Mardia (2006) build a hierarchical model to solve the alignment and
matching of configurations according to the Bayesian paradigm. This method gives
a complete distribution of probable matches, and hence an opportunity to explore
several other solutions near the "optimal" solution.
6.1.1 Point Process Model, with Geometrical Transforma-
tion and Random Thinning
Suppose there are two point configurations in d-dimensional space Rd: {xj , j =
1, 2, . . . , m} and {yk, k = 1, 2, . . . , n}. The points are labelled for identification, but
arbitrarily.
Both point sets are regarded as noisy observations on subsets of a set of true
locations {µi}, where the mappings from j and k to i are unknown. There may be a
geometrical transformation between the x-space and the y-space, which may also be
unknown. The objective is to make model-based inference about these mappings,
and in particular to make probability statements about matching: which pairs (j, k)
correspond to the same true location?
The geometrical transformation between the x-space and the y-space is denoted
A; thus y in y-space corresponds to x = Ay in x-space. The notation does not
imply that the transformation A is necessarily linear. It may be a rotation or more
general linear transformation, a translation, both of these, or some non-rigid motion.
Regard the true locations {µi} as being in x-space.
The mappings between the indexing of {µi} and that of the data {xj} and {yk} are
captured by indexing arrays {ξj} and {ηk}; specifically assume that

xj = µξj + ε1j   (6.1)

for j = 1, 2, . . . , m, where {ε1j} have probability density f1, and

Ayk = µηk + ε2k   (6.2)

for k = 1, 2, . . . , n, where {ε2k} have density f2. All {ε1j} and {ε2k} are independent
of each other, and independent of {µi}.
6.1.2 Formulation of Poisson Process Prior
Suppose that the set of true locations {µi} forms a homogeneous Poisson process
with rate λ over a region V ⊂ Rd of volume v, and that there are N points realised in
this region. Some of these give rise to both x and y points, some to points of one
kind and not the other, and some are not observed at all. Suppose these four
possibilities occur independently for each realised point, with probabilities
(1 − px − py − ρpxpy, px, py, ρpxpy) of observing neither, x alone, y alone, or both
x and y, respectively. The parameter ρ is a measure of the a priori tendency for
points to be matched: the random thinnings leading to the observed x and y
configurations can be dependent, but remain independent from point to point.
Given N, m and n, there are L matched pairs of points in the sample if and
only if the numbers of these four kinds of occurrence among the N points are
(N − m − n + L, m − L, n − L, L). Under the assumptions above these four counts
will be independent Poisson distributed variables, with means
(λv(1 − px − py − ρpxpy), λvpx, λvpy, λvρpxpy). The prior marginal¹ probability
distribution of L conditional on m and n is therefore proportional to

e^{−λvpx} (λvpx)^{m−L} / (m − L)! × e^{−λvpy} (λvpy)^{n−L} / (n − L)! × e^{−λvρpxpy} (λvρpxpy)^{L} / L!

so that

P(L) ∝ (ρ/λv)^{L} / {(m − L)! (n − L)! L!}

for L = 0, 1, . . . , min{m, n}. Here and later, we use the generic P(·) notation for
distributions and conditional distributions in the hierarchical model.
The matching of the configurations is represented by the matching matrix M,
where Mjk indicates whether xj and yk are derived from the same µi point or not;
that is,

Mjk = 1 if ξj = ηk, and Mjk = 0 otherwise.   (6.3)

Note that M is the adjacency matrix for the bipartite graph representing the
matching, and that Σ_{j,k} Mjk = L. Assume for the moment that, conditional on L,
M is a priori uniform: there are L! (m choose L)(n choose L) different M matrices
consistent with a given value of L, and these are taken as equally likely. Thus

P(M) = P(L) P(M|L) ∝ (ρ/λv)^{L} / {(m − L)! (n − L)! L!} × {L! (m choose L)(n choose L)}^{−1} ∝ (ρ/λv)^{L}

(where here and later "∝" means proportional to, as a function of the variable(s)
to the left of the conditioning bar, in this case M). Thus

P(M) = (ρ/λv)^{L} / Σ_{ℓ=0}^{min{m,n}} ℓ! (m choose ℓ)(n choose ℓ) (ρ/λv)^{ℓ}.   (6.4)

¹Integrated over N: Σ_{N=m+n−L}^{∞} e^{−λv(1−px−py−ρpxpy)} {λv(1 − px − py − ρpxpy)}^{N−m−n+L} / (N − m − n + L)! = 1.
Because of the choice of parameterisation for the probabilities of observing hidden
points, P (M) does not involve px and py.
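The prior on the number of matches L implied by (6.4) can be tabulated directly; a small sketch (the function name and the name `ratio` for ρ/(λv) are ours):

```python
from math import comb, factorial

def prior_over_L(m, n, ratio):
    # P(L) implied by P(M) proportional to (rho/(lambda v))^L:
    # P(L) proportional to L! C(m, L) C(n, L) ratio^L, for L = 0, ..., min(m, n)
    w = [factorial(l) * comb(m, l) * comb(n, l) * ratio ** l
         for l in range(min(m, n) + 1)]
    z = sum(w)
    return [wi / z for wi in w]
```

As expected, ratio → 0 puts all prior mass on L = 0 (no matches), while a very large ratio concentrates on L = min(m, n).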
Figure 6.1: Directed acyclic graph representing the model, showing all data and
parameters treated as variable; the nodes are µ, ξ, η, M, A, τ, σ, and the data
X and Y.
6.1.3 Data Likelihood
Given M, the likelihood of the observed configurations of points is specified as
follows. Assume that the transformation is affine, taking y to Ay + τ, where A is
now a matrix and τ a translation vector. From (6.1) and (6.2), the densities of xj
and yk, conditional on A, τ, {µi}, {ξj} and {ηk}, are f1(xj − µξj) and
|A| f2(Ayk + τ − µηk) respectively, |A| denoting the absolute value of the
determinant of A.
The locations {µi} of the m − L points that generate an x observation but not
a y observation are independently uniformly distributed over the region V, so that
the likelihood contribution of these m − L observations, namely {j : Σ_k Mjk = 0},
is

∏_{j: Mjk=0 ∀k} v^{−1} ∫_V f1(xj − µ) dµ.
Similarly, the contributions from the unmatched y observations and from the
matched pairs are

∏_{k: Mjk=0 ∀j} v^{−1} ∫_V |A| f2(Ayk + τ − µ) dµ   and   ∏_{j,k: Mjk=1} v^{−1} ∫_V f1(xj − µ) |A| f2(Ayk + τ − µ) dµ

respectively. These integrals all exhibit "edge effects" from the boundary of the
region V, which can be neglected if V is large relative to the supports of f1 and f2.
In this case these three expressions approximate to

v^{−(m−L)},   (|A|/v)^{n−L},   and   (|A|/v)^{L} ∏_{j,k: Mjk=1} ∫_{R^d} f1(xj − µ) f2(Ayk + τ − µ) dµ

respectively. The last expression can be written

(|A|/v)^{L} ∏_{j,k: Mjk=1} g(xj − Ayk − τ)

where g(z) = ∫ f1(z + u) f2(u) du (the density of ε1j − ε2k).
Combining these terms, the complete likelihood is

P(x, y | M, A, τ) = v^{−(m+n−L)} |A|^{n} ∏_{j,k: Mjk=1} g(xj − Ayk − τ).   (6.5)

Multiplying (6.4) and (6.5) then gives

P(M, x, y | A, τ) ∝ |A|^{n} ∏_{j,k: Mjk=1} {(ρ/λ) g(xj − Ayk − τ)}.

Note that the constant of proportionality involves m, n, λ, ρ and v, but not A, τ,
any parameters in f1 or f2, or, of course, M.
By further assuming spherical normality for f1 and f2:

xj ∼ Nd(µξj, σx² I)   and   Ayk + τ ∼ Nd(µηk, σy² I),

with σx = σy = σ, say, then

g(z) = (σ√2)^{−d} φ(z/(σ√2))

where φ is the standard normal density in R^d, and the final joint model is

P(M, A, τ, σ, x, y) ∝ |A|^{n} P(A) P(τ) P(σ) ∏_{j,k: Mjk=1} [ρ φ({xj − Ayk − τ}/(σ√2)) / {λ (σ√2)^{d}}].   (6.6)

Note that not only px and py but also v do not appear in this expression,
principally because of the choice of parameterisation, and that only the ratio ρ/λ
is identifiable. The directed acyclic graph representing this joint probability model,
including the variables (µ, ξ and η) that have been integrated out, is displayed in
Figure 6.1.
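The joint model (6.6) can be evaluated, up to an additive constant, as a log-density; a sketch assuming flat priors P(A), P(τ), P(σ) and treating the identifiable ratio ρ/λ as a single input (the function name is ours):

```python
import numpy as np

def log_joint(M, A, tau, sigma, x, y, rho_over_lam, d=3):
    # log of (6.6) up to an additive constant, with flat P(A), P(tau), P(sigma)
    lp = y.shape[0] * np.log(abs(np.linalg.det(A)))      # the |A|^n term
    s2 = 2.0 * sigma ** 2                                # var of x_j - A y_k - tau
    for j, k in zip(*np.nonzero(M)):
        r = x[j] - A @ y[k] - tau
        lp += (np.log(rho_over_lam)
               - 0.5 * d * np.log(2.0 * np.pi * s2)      # phi(z/(sigma sqrt 2))/(sigma sqrt 2)^d
               - 0.5 * (r @ r) / s2)
    return lp
```

Each matched pair contributes a Gaussian term with variance 2σ², reflecting that xj − Ayk − τ is the difference of two errors each with variance σ².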
6.1.4 Prior Distributions and Computations
We assumed the existence of true but unobservable locations {µi} from a Poisson
process simply to formulate the mathematical framework conveniently and simplify
the algebra. The assumption of Poisson points does not exactly represent the model
for functional sites (see Chapter 2). However, in section 6.1.8 we carry out a
sensitivity analysis for the Poisson assumption and find that violations of the
assumption do not impede the effectiveness of the algorithm.
Green and Mardia (2006) treat ρ and λ as fixed, and consider inference for the
remaining unknowns M, τ, σ² and sometimes A, given the data {xj} and {yk}.
Markov chain Monte Carlo methods are used for the computation.
Suppose that prior information about τ, σ² and A is at best weak, and use
generic prior formulations that facilitate the posterior analysis. Prior assumptions
are therefore discussed in parallel with the MCMC implementation. Note that the
formulation has some affinity with mixture models, the matching matrix M playing
a similar role to the allocation variables often used in computing with mixtures;
see, for example, Richardson and Green (1997). As in that paper, this full Bayesian
analysis aims at simultaneous joint inference about both the discrete and
continuously varying unknowns, in contrast to frequentist approaches.
This model has another similarity with a mixture formulation, in that as M
varies, the number of hidden points needed to generate all the observed data also
varies, and thus there seems to be a "variable-dimension" aspect to the model.
However, the approach of integrating out the hidden point locations eliminates the
variable-dimension parameter, so that reversible jump MCMC is not needed.
Priors and MCMC updating for a rotation matrix
From equation (6.6), the full conditional distribution for A given the data and
values for all other parameters is

P(A | M, τ, σ, x, y) ∝ |A|^{n} P(A) ∏_{j,k: Mjk=1} φ({xj − Ayk − τ}/(σ√2)).

Viewing this as a density for A, there is still freedom to choose the dominating
measure for P(A) arbitrarily; the full conditional density will then be with respect
to the same measure.
In matching functional sites, we consider only rigid body transformations rather
than a general (linear) transformation. Thus, considering only rotations (orthogonal
matrices A with positive determinant) and expanding the expression above:

P(A | M, τ, σ, x, y) ∝ P(A) exp{ Σ_{j,k: Mjk=1} −½ (‖xj − Ayk − τ‖/(σ√2))² }
∝ P(A) exp{ (1/2σ²) Σ_{j,k: Mjk=1} (xj − τ)ᵀ A yk }
∝ P(A) exp{ tr[ (1/2σ²) Σ_{j,k: Mjk=1} yk (xj − τ)ᵀ A ] }.

There is (conditional) conjugacy if P(A) has the form P(A) ∝ exp(tr(F0ᵀ A)) for
some matrix F0: the posterior then has the same form, with F0 replaced by

F = F0 + (1/2σ²) Σ_{j,k: Mjk=1} (xj − τ) ykᵀ.   (6.7)

This is the matrix Fisher distribution (Downs, 1972; Mardia and Jupp, 2000,
p. 289). Here, for symmetry, we use the uniform prior with F0 = 0.
Sampling the matrix Fisher distribution
We will review how to sample from the matrix Fisher distribution in the 3-dimensional
case.
In the 3-dimensional case, A can be represented as a product of three elementary
rotations

A = A12(θ12) A13(θ13) A23(θ23)   (6.8)

as in Raffenetti and Ruedenberg (1970) and Khatri and Mardia (1977). For i < j,
Aij(θij) is the matrix with mii = mjj = cos θij, −mij = mji = sin θij, mrr = 1 for
r ≠ i, j, and all other entries 0. Each of the generalised Euler angles θij is sampled
in turn, conditioning on the other two angles and the other variables (M, τ, σ, x, y)
entering the expression for F.
The joint full conditional density of the Euler angles is
∝ exp[tr{F TA}] cos θ13
for θ12, θ23 ∈ (−π, π) and θ13 ∈ (−π/2, π/2). The cosine term arises since the natural
dominating measure, corresponding to uniform distribution of rotation, has volume
element cos θ13dθ12dθ13dθ23 in these coordinates.
By substituting the representation (6.8) and simplifying, the trace can be written
as

tr{Fᵀ A} = a12 cos θ12 + b12 sin θ12 + c12 + a13 cos θ13 + b13 sin θ13 + c13 + a23 cos θ23 + b23 sin θ23 + c23

where

a12 = (F22 − sin θ13 F13) cos θ23 + (−F23 − sin θ13 F12) sin θ23 + cos θ13 F11
b12 = (− sin θ13 F23 − F12) cos θ23 + (F13 − sin θ13 F22) sin θ23 + cos θ13 F21
a13 = sin θ12 F21 + cos θ12 F11 + sin θ23 F32 + cos θ23 F33
b13 = (− sin θ23 F12 − cos θ23 F13) cos θ12 + (− sin θ23 F22 − cos θ23 F23) sin θ12 + F31
a23 = (F22 − sin θ13 F13) cos θ12 + (− sin θ13 F23 − F12) sin θ12 + cos θ13 F33
b23 = (−F23 − sin θ13 F12) cos θ12 + (F13 − sin θ13 F22) sin θ12 + cos θ13 F32

and the cij can be ignored, being absorbed into the normalising constants. Thus the
full conditionals for θ12 and θ23 are von Mises distributions. These can be updated
by Gibbs sampling or an efficient rejection method, the Best/Fisher algorithm (see
Mardia and Jupp, 2000, p. 43).
However, the distribution of θ13 is proportional to

exp[a13 cos θ13 + b13 sin θ13] cos θ13.

Mardia and Gadsden (1977) studied this distribution without discussing how to
simulate a sample from it. Green and Mardia (2006) use a random walk Metropolis
algorithm, with a perturbation uniformly distributed on [−0.1, 0.1], to sample from
this distribution.
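These two conditional updates can be sketched as follows, assuming the coefficients a and b have already been computed from F as above. The sampler shapes follow the text (a von Mises Gibbs draw for θ12 or θ23; a uniform [−0.1, 0.1] random-walk Metropolis step for θ13); the function names are ours.

```python
import numpy as np

def sample_theta13(a, b, theta, rng, step=0.1):
    # One random-walk Metropolis update for theta_13, whose full conditional
    # is proportional to exp(a cos t + b sin t) cos t on (-pi/2, pi/2)
    def logdens(t):
        if abs(t) >= np.pi / 2.0:
            return -np.inf
        return a * np.cos(t) + b * np.sin(t) + np.log(np.cos(t))
    prop = theta + rng.uniform(-step, step)
    if np.log(rng.uniform()) < logdens(prop) - logdens(theta):
        return prop
    return theta

def sample_theta_vonmises(a, b, rng):
    # Gibbs draw for theta_12 or theta_23: a density proportional to
    # exp(a cos t + b sin t) is von Mises with mean atan2(b, a) and
    # concentration sqrt(a^2 + b^2)
    return rng.vonmises(np.arctan2(b, a), np.hypot(a, b))
```

Proposals outside (−π/2, π/2) receive log-density −∞ and are always rejected, so the θ13 chain stays in its support.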
Priors and updating for other parameters
Here τ and σ^{−2} are taken to have Gaussian and Gamma priors respectively.
These priors are computationally convenient and, most importantly, also plausible
for τ and σ in matching functional sites. Thus

τ ∼ Nd(µτ, στ² I)   and   σ^{−2} ∼ Γ(α, β).

Under the assumptions of (6.6), there is conjugacy for τ and σ, with explicit full
conditionals:

τ | M, A, σ, x, y ∼ Nd( {µτ/στ² + Σ_{j,k: Mjk=1} (xj − Ayk)/2σ²} / {1/στ² + L/2σ²},  {1/στ² + L/2σ²}^{−1} I ),

σ^{−2} | M, A, τ, x, y ∼ Γ( α + (d/2)L,  β + (1/4) Σ_{j,k: Mjk=1} ‖xj − Ayk − τ‖² ),

and the Gibbs sampler is used to update these parameters.
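These two conjugate draws can be written directly from the full conditionals above; a sketch (function names ours), taking M as a boolean matrix:

```python
import numpy as np

def gibbs_tau(M, A, sigma, x, y, mu_tau, sigma_tau2, rng):
    # Conjugate Gaussian draw for tau given everything else
    js, ks = np.nonzero(M)
    L = len(js)
    prec = 1.0 / sigma_tau2 + L / (2.0 * sigma ** 2)
    mean = (mu_tau / sigma_tau2
            + (x[js] - y[ks] @ A.T).sum(axis=0) / (2.0 * sigma ** 2)) / prec
    return rng.normal(mean, np.sqrt(1.0 / prec))

def gibbs_sigma(M, A, tau, x, y, alpha, beta, rng, d=3):
    # Conjugate Gamma draw for sigma^-2; returns sigma
    js, ks = np.nonzero(M)
    resid = x[js] - y[ks] @ A.T - tau
    shape = alpha + 0.5 * d * len(js)
    rate = beta + 0.25 * (resid ** 2).sum()
    return 1.0 / np.sqrt(rng.gamma(shape, 1.0 / rate))
```

Note that `numpy`'s Gamma sampler is parameterised by shape and scale, so the rate β + (1/4)Σ‖·‖² enters as its reciprocal.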
Updating M
The matching matrix M is updated in detailed balance using Metropolis–Hastings
moves that only propose changes to a few entries: the number of matches
L = Σ_{j,k} Mjk can only increase or decrease by 1 at a time, or stay the same.
The possible changes are
(a) adding a match: changing one entry Mjk from 0 to 1;
(b) deleting a match: changing one entry Mjk from 1 to 0;
(c) switching a match: simultaneously changing one entry from 0 to 1, and another
in the same row or column from 1 to 0.
These changes respect the constraint that matches between the js and ks should
be unique (0 ≤ Σ_j Mjk ≤ 1 and 0 ≤ Σ_k Mjk ≤ 1).
The proposal proceeds as follows: first a uniform random choice is made from all
m+n data points x1, x2, . . . , xm, y1, y2, . . . , yn. Suppose without loss of generality, by
the symmetry of the set-up, that an x is chosen, say xj . There are two possibilities:
either xj is currently matched (∃k such that Mjk = 1) or not (there is no such k).
If xj is matched to yk, with probability p⋆ propose deleting the match, and with
probability 1 − p⋆ propose switching it from yk to yk′, where k′ is drawn uniformly
at random from the currently unmatched y points. On the other hand, if xj is not
currently matched, propose adding a match between xj and a yk, where again k is
drawn uniformly at random from the currently unmatched y points.
The acceptance probabilities for these three possibilities are easily derived from
the expression (6.6) for the joint distribution, since in each case the proposed
new matching matrix M′ is only slightly perturbed from M, so that the ratio
P(M′, τ, σ | x, y)/P(M, τ, σ | x, y) has only a few factors. Taking into account also
the proposal probabilities, whose ratio is (1/nu) ÷ p⋆, where
nu = #{k ∈ 1, 2, . . . , n : Mjk = 0 ∀j} is the number of unmatched y points in M,
the acceptance probability for adding a match (j, k) is

min{ 1, ρ φ({xj − Ayk − τ}/(σ√2)) p⋆ nu / (λ (σ√2)^{d}) }.   (6.9)

Similarly, the acceptance probability for switching the match of xj from yk to yk′ is

min{ 1, φ({xj − Ayk′ − τ}/(σ√2)) / φ({xj − Ayk − τ}/(σ√2)) }   (6.10)

and for deleting the match (j, k) it is

min{ 1, λ (σ√2)^{d} / (ρ φ({xj − Ayk − τ}/(σ√2)) p⋆ n′u) }   (6.11)

where n′u = #{k ∈ 1, 2, . . . , n : M′jk = 0 ∀j} = nu + 1. Since the changes effected
are so modest, we typically make several M-updating moves per sweep, along with
just one update at a time for each of the other parameters.
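The three moves and their acceptance ratios (6.9)–(6.11) can be sketched as a single update for a chosen xj. This is a simplified illustration, not the thesis implementation: when every y is already matched it always proposes a deletion (a boundary case for which the simple p⋆ proposal probability is only approximate), and the names are ours.

```python
import numpy as np

def log_match_term(j, k, A, tau, sigma, x, y, rho_over_lam, d=3):
    # log of rho phi({x_j - A y_k - tau}/(sigma sqrt 2)) / (lambda (sigma sqrt 2)^d)
    r = x[j] - A @ y[k] - tau
    s2 = 2.0 * sigma ** 2
    return (np.log(rho_over_lam) - 0.5 * d * np.log(2.0 * np.pi * s2)
            - 0.5 * (r @ r) / s2)

def update_M_for_xj(M, j, A, tau, sigma, x, y, rho_over_lam, p_star, rng):
    # One add/delete/switch move for row j of M, accepted per (6.9)-(6.11)
    unmatched = np.where(~M.any(axis=0))[0]     # currently unmatched y points
    cur = np.where(M[j])[0]
    if cur.size == 0:                           # x_j unmatched: propose an add
        if unmatched.size == 0:
            return
        k = rng.choice(unmatched)
        log_acc = (log_match_term(j, k, A, tau, sigma, x, y, rho_over_lam)
                   + np.log(p_star * unmatched.size))              # (6.9)
        if np.log(rng.uniform()) < log_acc:
            M[j, k] = True
    else:
        k = cur[0]
        if rng.uniform() < p_star or unmatched.size == 0:          # propose a delete
            log_acc = -(log_match_term(j, k, A, tau, sigma, x, y, rho_over_lam)
                        + np.log(p_star * (unmatched.size + 1)))   # (6.11)
            if np.log(rng.uniform()) < log_acc:
                M[j, k] = False
        else:                                                      # propose a switch
            k2 = rng.choice(unmatched)
            log_acc = (log_match_term(j, k2, A, tau, sigma, x, y, rho_over_lam)
                       - log_match_term(j, k, A, tau, sigma, x, y, rho_over_lam))  # (6.10)
            if np.log(rng.uniform()) < log_acc:
                M[j, k] = False
                M[j, k2] = True
```

Because adds only pick from unmatched y points and rows change one entry at a time, the uniqueness constraints on the rows and columns of M are preserved by construction.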
6.1.5 Inference
Point estimates for M, A and τ are important in Bioinformatics applications. We
need to specify loss functions giving the cost incurred in declaring point estimates,
and we consider estimators which minimise the expected loss with respect to the
conditional posterior distributions.
Match Matrix
Suppose that the loss when Mjk = a and we declare Mjk = b, for a, b = 0, 1, is ℓab;
for example, ℓ01 is the loss associated with declaring a match between xj and yk
when there is none.
In all these cases, at least GGXG of the dinucleotide binding motif GXXGGXG is
matched. Graph theoretic solutions before the MCMC refinement step are not
significant. Figure 7.7 is a superposition of corresponding amino acids between
functional sites of glyceraldehyde-3-phosphate dehydrogenase (3dbv 3) and alcohol
dehydrogenase (1hdx 1) after the MCMC refinement step.
Chapter 7. Bayesian Refinement of Graph Solutions 150
Figure 7.5: Effect of MCMC refinement on graph matches of the NADP-binding
site of 17-β hydroxysteroid dehydrogenase (1a27 0) against NAD(P)(H) binding
sites of SCOP tyrosine dependent oxidoreductase family proteins (Case 2), where
corresponding amino acids are not restricted to others in the same group. Each site
is represented by a green circle (graph only) and a blue cross (after MCMC
refinement) connected by a straight line to highlight the difference.
7.4.4 Case 4: Alcohol Dehydrogenase and FAD/NAD(P)-binding Domain
We took a NAD-binding functional site of alcohol dehydrogenase (1hdx 1) and
matched it against FAD/NAD(P)(H) binding sites belonging to members of the
SCOP FAD/NAD(P)-binding domain (c.3.1.x), without taking into account the
amino acid chemistry. A distance threshold value of 1.0Å rather than 1.5Å was
found to give better matches for the graph theoretic solution and was used in this
case. A total of 338
Figure 7.6: RMSD against number of corresponding amino acids for matching the
17-β hydroxysteroid dehydrogenase NADP-binding site against NAD(P)(H)
binding sites of SCOP tyrosine dependent oxidoreductase family proteins (Case 2).
a) Graph matching prior to MCMC refinement, showing results with/without
amino acid property information. Each site is represented by a green circle (with)
and a blue cross (without) connected by a straight line to highlight the difference.
b) MCMC refinement of (a).
Figure 7.7: Superposition of matching amino acids (Case 3) between alcohol dehy-
drogenase (1hdx 1; blue) and glyceraldehyde-3-phosphate dehydrogenase (3dbv 3;
red) binding sites after MCMC refinement (RMSD = 0.672; number of correspond-
ing amino acids = 12; p-value = 3.68e-05). The matched dinucleotide binding motif
is shown in ball-and-stick representation. Ligands are coloured in CPK colours.
pair-wise comparisons were made, and 64 were significant before the MCMC
refinement step. Sites from dihydropyrimidine dehydrogenase (1gth 13) and
fumarate reductase (1qla 5; 1qla 7; 1qlb 2; 1qlb 6) become statistically significant
only after the MCMC refinement step (p-values for 1gth 13, 1qla 5, 1qla 7, 1qlb 2
and 1qlb 6 before the MCMC refinement step: 0.3742, 0.6621, 0.6766, 0.6199 and
0.6199; after: 0.0258, 0.0141, 0.0141, 0.0001 and 0.0001).
7.4.5 Assessing MCMC Refinement
Table 7.1 summarises the improvements achieved after the MCMC refinement step
in the applications considered, when matching with and without physico-chemical
properties. In all the cases considered, there are sites which give statistically
significant matches only after the MCMC refinement step.
When not using physico-chemical properties, much improvement (relative to
the number of sites considered) is registered in matching the query from 17-β
hydroxysteroid dehydrogenase against sites from the same SCOP family members.
There are also many improved cases in matching the query from alcohol
dehydrogenase against sites from different families and folds (FAD/NAD(P)-binding
domain). Tables 7.2 and 7.3 compare the RMSD and the number of matched amino
acids found before and after refinement when not using physico-chemical properties.
The RMSD does not change much after refinement. There are marginal mean
RMSD increases for matching functional sites of alcohol dehydrogenase and 17-β
hydroxysteroid dehydrogenase against sites of the same SCOP family members.
However, there are also marginal mean decreases for matching alcohol
dehydrogenase against sites from different families and folds. The mean number of
matched amino acids increases after MCMC refinement, except in matching
functional sites of alcohol dehydrogenase and members of the same superfamily but
different families, where there is a marginal decrease.
There are even more improvements after the MCMC refinement step when using
physico-chemical properties. However, there are fewer significant matches both
before and after refinement when matching with physico-chemical properties than
when matching without them. Tables 7.4 and 7.5 compare the RMSD and the
number of matched amino acids found before and after refinement when using
physico-chemical properties. There are marginal mean RMSD decreases after
MCMC refinement in all cases. The mean number of matched amino acids increases
after MCMC refinement as well.
7.5 Comments
The examples given above make a clear case that MCMC refinement can improve
ligand binding site matches generated by graph matching, in terms of both the sta-
tistical and biological significance of the match. We attribute this success to the lack
of dependence on a strict matching tolerance, which is enforced in graph matching.
Statistical modelling in refinement of matches appears to have been successful in
automatically adapting to shape variations in ligand binding sites, which might be
due to different noise levels in atomic positions or protein phylogeny differences,
Table 7.1: Assessment of statistical significance of functional site matching before
and after the MCMC refinement step, with and without amino acid property information.

                                                  Without amino acid property    With amino acid property
Case                                     Total    Sig. Graph    Sig. MCMC        Sig. Graph    Sig. MCMC
alcohol dehydrogenase and family           145        142           142              125           132
17β-hydroxysteroid dehydrogenase           326        248           318              159           236
  and family
alcohol dehydrogenase and superfamily      897        200           324               33           222
alcohol dehydrogenase and                  338         64            69                5            12
  FAD/NAD(P)-binding domain
Sig. Graph: significant before refinement.
Sig. MCMC: significant after the MCMC refinement step.
Table 7.2: RMSD (Å) before and after the MCMC refinement step, without amino acid
property.

                                                            Graph                MCMC
Case                                                    Mean   Std. Dev.    Mean   Std. Dev.
alcohol dehydrogenase and family                        0.590   0.2350      0.619   0.2824
17β-hydroxysteroid dehydrogenase and family             0.874   0.2208      0.958   0.1987
alcohol dehydrogenase and superfamily                   2.093   1.6820      1.934   1.7367
alcohol dehydrogenase and FAD/NAD(P)-binding domain     1.723   1.3155      1.715   1.3188
Table 7.3: The number of matched amino acids before and after the MCMC refinement
step, without amino acid property.

                                                            Graph                MCMC
Case                                                    Mean   Std. Dev.    Mean   Std. Dev.
alcohol dehydrogenase and family                        33.7    12.26       34.6    13.11
17β-hydroxysteroid dehydrogenase and family             17.0     6.83       21.8     7.04
alcohol dehydrogenase and superfamily                   13.3     2.14       12.4     2.24
alcohol dehydrogenase and FAD/NAD(P)-binding domain     10.5     1.03       10.6     1.22
Table 7.4: RMSD (Å) before and after the MCMC refinement step, with amino acid
property.

                                                            Graph                MCMC
Case                                                    Mean   Std. Dev.    Mean   Std. Dev.
alcohol dehydrogenase and family                        0.805   0.8892      0.737   0.7944
17β-hydroxysteroid dehydrogenase and family             1.047   0.6706      0.997   0.6226
alcohol dehydrogenase and superfamily                   2.751   1.9127      2.459   2.0150
alcohol dehydrogenase and FAD/NAD(P)-binding domain     3.424   2.3234      3.337   2.3814
Table 7.5: The number of matched amino acids before and after the MCMC refinement
step, with amino acid property.

                                                            Graph                MCMC
Case                                                    Mean   Std. Dev.    Mean   Std. Dev.
alcohol dehydrogenase and family                        28.1    13.51       32.8    14.74
17β-hydroxysteroid dehydrogenase and family             12.4     6.55       17.0     7.75
alcohol dehydrogenase and superfamily                    8.3     1.21        8.7     1.69
alcohol dehydrogenase and FAD/NAD(P)-binding domain      8.7     0.81        9.0     1.06
among other factors. Refined matches usually retain a similar RMSD, and achieve
greater significance through expansion of the number of matching residues from the
core graph match. We have noted however that in some cases significant reductions
in the match RMSD are also achieved by refinement.
Dependence on a strict matching tolerance is not limited to graph matching,
but is also a feature of other matching methods commonly used in the field (e.g.
geometric hashing: Wallace et al., 1997). It is important to note that the MCMC
refinement procedure can be applied to a starting match generated by any method;
the graph procedure chosen here was simply intended as an example. Equally, the
MCMC procedure can be applied with no previously generated starting match, for
example by starting from randomly generated matches. That is, the
method provides the full joint posterior distribution so that we have for example,
the posterior distribution for the matching matrix as well as the parameters of the
transformation simultaneously. However, we find that obtaining good matches by
this method is very expensive in terms of computational time. While methods such
as graph matching can be applied to database searching, where a site is matched
against all members of a large database of sites, this would be impractical for match-
ing by MCMC alone. We suggest therefore that the MCMC procedure would be
most advantageous when applied to the best hits from a database search using a
faster method, and that in many cases it would increase the number of significant
hits.
We have made only a very basic study of the effect of including amino-acid residue
physico-chemical property information in matching, contrasting matches obtained
without restriction (any residue may match any other) with slightly more restrictive
matching (residues only allowed to match within relatively broadly defined groups).
It is interesting that even with very broadly defined groups, fewer statistically sig-
nificant matches are generally obtained than when matching is without restriction.
This could suggest that the physico-chemical properties of sites binding the same
or similar ligands can change significantly in evolution. It is however most likely to
reflect increased flexibility to change in peripheral residues that are less important
for binding, and needs further investigation. The main point of this work is that
MCMC refinement can improve matches under either matching regime. Indeed in
a few cases of matching with physico-chemical groups, we showed that some graph
matches without statistical significance were converted to significant matches by the
MCMC procedure, revealing that using graph matching alone could lead to some
erroneous conclusions in this respect.
Chapter 8
Conclusions and Further Work
In this Chapter we summarise important points from Chapters 3 to 7. We also
highlight potential areas for further work on the topics discussed in this thesis.
8.1 Conclusions
A few conclusions can already be drawn from work reported in this thesis.
8.1.1 Functional Sites
Exploratory analysis shows that functional sites in SITESDB tend to consist of short
contiguous segments (motifs) from the protein chain. Although in theory, side chains
from different parts of the chain can come together spatially to form an active site
or binding site, the automatic extraction of these sites in SITESDB leads to the
inclusion of all adjacent side chains (within 5 Å of residues annotated with a SITE
RECORD in the PDB, or of bound ligands). The presence of adjacent side chains is
reflected in the dataset, but they may not be part of the core binding or functional
site. It should be noted, however, that most currently well-known functional or
active sites are motifs.
Chapter 8. Conclusions and Further Work 158
8.1.2 Simulating Random Protein Structures
In section 3.1.2 we successfully simulated random protein Cα traces, using a simpler
and more flexible approach similar to that of Aszodi and Taylor (1994). With simple
modelling of hydrophobic effects, the method produces compact, globular structures.
8.1.3 Matching Algorithms
All the algorithms considered here (graph theoretic, EM and MCMC) do better when
configuration points are further apart. As expected, performance decreases with
greater clutter of points and with increasing positional noise.
The graph method
The graph theoretic method is robust and fast, but not flexible enough to account
for concomitant information and differing noise levels in functional sites. The
method can also break down, or become very computationally intensive, when matching
configurations with many inter-point distances of similar magnitude, since the
product graph becomes very large. Consequently the graph method might not be the
ideal approach in some applications, such as matching whole protein chains.
The EM algorithm
Concomitant information can be used flexibly in the EM algorithm. With good
starting values the EM algorithm does impressively well in finding corresponding
points; however, it is sensitive to starting values. It is recommended to run the
algorithm from several starting values for the rotation, translation and noise
parameters, and to monitor convergence. Simple match-constraining techniques,
e.g. variance cooling, improve the algorithm's ability to find better solutions.
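The variance-cooling idea can be sketched as follows. This is a minimal illustration, not the implementation used in this thesis: soft correspondence probabilities are computed in the E-step, the rotation and translation are updated by weighted Procrustes (Kabsch) analysis in the M-step, and the noise variance is annealed downwards rather than re-estimated, which gradually hardens the matches. The function name, cooling schedule and parameter values are all illustrative.

```python
import numpy as np

def em_rigid_match(X, Y, n_iter=80, sigma2=1.0, cool=0.92, sigma2_min=1e-4):
    """EM matching of configurations X (n x d) and Y (m x d) under an
    unknown rotation R and translation t with isotropic Gaussian noise.
    Variance cooling anneals sigma2 downwards each iteration, which
    gradually hardens the soft correspondences (a simple
    match-constraining device)."""
    n, d = X.shape
    R, t = np.eye(d), np.zeros(d)
    for _ in range(n_iter):
        # E-step: P[j, i] = probability that y_j corresponds to x_i
        diff = Y[:, None, :] - (X @ R.T + t)[None, :, :]
        logp = -np.sum(diff ** 2, axis=2) / (2.0 * sigma2)
        P = np.exp(logp - logp.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        # M-step: weighted Procrustes (Kabsch) update of R and t
        N = P.sum()
        mu_x = (P.sum(axis=0) @ X) / N
        mu_y = (P.sum(axis=1) @ Y) / N
        H = (X - mu_x).T @ P.T @ (Y - mu_y)
        U, _, Vt = np.linalg.svd(H)
        sign = float(np.sign(np.linalg.det(Vt.T @ U.T)))
        R = Vt.T @ np.diag([1.0] * (d - 1) + [sign]) @ U.T
        t = mu_y - R @ mu_x
        # variance cooling in place of the usual ML update of sigma2
        sigma2 = max(sigma2 * cool, sigma2_min)
    return R, t, P
```

Starting the annealing with a large σ² plays a similar role to trying diffuse starting values: the early, soft E-steps smooth the likelihood surface before the correspondences are hardened.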
The Bayesian hierarchical model (MCMC)
The algorithm is largely insensitive to starting values. Unlike the EM algorithm,
MCMC can easily escape local maxima.
Although the model was formulated under the assumption of a hidden homogeneous
Poisson process, the algorithm is not sensitive to this assumption. It can match
hardcore configurations, simulated short and virtual protein chains and, most
importantly, real functional sites.
The meta algorithm
There seems to be no silver-bullet solution to matching functional sites. MCMC
does better than the EM algorithm when the EM starting values are far from the
true parameter values, and it can escape local maxima. However, MCMC is very
computationally intensive and can sometimes drift away from the optimal solution.
Convergence needs to be monitored in both the EM and MCMC algorithms. On the
other hand, the graph theoretic method is robust and fast, but not flexible
enough to account for concomitant information and differing noise levels in
functional sites. Thus Bayesian modelling of the graph solution, i.e. the MCMC
method started from the transformation parameter estimates given by the graph
method, was suggested. This meta algorithm is observed to be a good strategy:
the MCMC refinement step was able to make graph-based matches more biologically
significant.
8.1.4 Concomitant Information
Concomitant information (amino acid type) guides the EM algorithm to converge
faster to the true solution. However, in most cases the geometric information is
so rich that the contribution from amino acid type information is marginal in
both the EM and MCMC algorithms.
8.1.5 Hardening Soft Matches
Both the EM and MCMC algorithms give probabilities of matching points in a
pair-wise alignment. Using linear programming to assign unique matches optimally
gives the best results. However, for small problems, simply taking the first
non-duplicate set of matches with high probabilities, or using a greedy algorithm,
gives practically similar solutions.
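The trade-off between the optimal assignment and the greedy heuristic can be illustrated with a toy probability matrix. The brute-force search below stands in for the linear-programming (assignment problem) solution one would use in practice; the matrix is artificial, chosen to show a case where greedy hardening is suboptimal:

```python
from itertools import permutations

def greedy_assign(P):
    """Harden soft matches greedily: repeatedly pick the highest
    remaining probability whose row and column are both unused."""
    n = len(P)
    used_r, used_c, pairs = set(), set(), []
    cells = sorted(((P[i][j], i, j) for i in range(n) for j in range(n)),
                   reverse=True)
    for _, i, j in cells:
        if i not in used_r and j not in used_c:
            pairs.append((i, j))
            used_r.add(i)
            used_c.add(j)
    return sorted(pairs)

def optimal_assign(P):
    """Assignment maximising the total match probability; brute force
    over permutations stands in for linear programming on toy problems."""
    n = len(P)
    best = max(permutations(range(n)),
               key=lambda perm: sum(P[i][perm[i]] for i in range(n)))
    return [(i, best[i]) for i in range(n)]

# A matrix where greedy (total 1.1) loses to optimal (total 1.7):
P = [[0.9, 0.8, 0.1],
     [0.8, 0.1, 0.1],
     [0.1, 0.1, 0.1]]
```

On well-separated probabilities the two procedures coincide, which is why greedy hardening is adequate for small problems; the example above shows the kind of near-tie where only the optimal assignment is safe.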
8.1.6 Assessing Significance of Matches
We have considered assessing the significance of matches, i.e. that they are
non-random, under the null hypothesis of random matches. We have also considered
goodness-of-fit for matching related configurations.
Random versus non-random matches
The significance of matching two configurations under the null hypothesis of
random matches depends on the RMSD, the total number of amino acids in each
configuration and the number of amino acids matched. An extreme value distribution
with empirically derived constants for matching two random configurations can be
used to evaluate p-values, taking all three of these quantities into account.
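The p-value calculation can be sketched as follows, assuming a Gumbel-type limiting distribution for the minimum RMSD; `mu` and `beta` below are placeholders for the empirically derived constants, which in practice depend on the numbers of amino acids m and n and on the number matched q:

```python
import math

def evd_pvalue(rmsd, mu, beta):
    """P-value for an observed minimum RMSD under the null hypothesis
    of random matching, using a Gumbel-type extreme value distribution
    for minima: P(RMSD_min <= r) = 1 - exp(-exp((r - mu) / beta)).
    mu and beta are illustrative placeholders for the empirically
    derived constants; small RMSD values give small p-values."""
    return 1.0 - math.exp(-math.exp((rmsd - mu) / beta))
```

The function is monotone increasing in the observed RMSD, so a tighter-than-expected superposition of a given size yields a smaller p-value.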
Goodness-of-fit
We considered goodness-of-fit for matches known to be related (i.e. not matching
by chance) in section 4.1 of Chapter 4. P-values for assessing goodness-of-fit,
or for ranking matches, derived from the RMSD distribution under the isotropic
Gaussian error model mostly agree with decisions based on the score suggested by
Gold (2003).
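A simple Monte Carlo sketch of the goodness-of-fit p-value under the isotropic Gaussian error model: the null distribution of the RMSD of q matched points with coordinate noise N(0, σ²) is simulated directly. This sketch deliberately ignores the degrees of freedom absorbed by fitting the rigid transformation, which the exact RMSD distribution accounts for, and the function name is illustrative.

```python
import numpy as np

def rmsd_gof_pvalue(rmsd_obs, q, sigma, n_sim=20000, seed=0):
    """Monte Carlo p-value for the observed RMSD of q matched points
    under the isotropic Gaussian error model: each residual coordinate
    is N(0, sigma^2) in 3 dimensions.  An RMSD that is large relative
    to sigma indicates a poor fit (small p-value)."""
    rng = np.random.default_rng(seed)
    # squared residual norms: sum over the 3 coordinates of N(0, sigma^2)^2
    res = rng.normal(0.0, sigma, size=(n_sim, q, 3))
    rmsd_sim = np.sqrt(np.mean(np.sum(res ** 2, axis=2), axis=1))
    return float(np.mean(rmsd_sim >= rmsd_obs))
```

An exact chi-square calculation is possible under the same model; the simulation keeps the sketch dependency-free and makes the null distribution explicit.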
8.1.7 Application: Matching NAD Binding Functional Sites
In Chapter 7, section 7.4, when using the meta algorithm of graph theoretic and
MCMC to match NAD(P)(H) binding sites, we find examples where significant
shape-based matches do not retain similar amino acid chemistry. Matches were within
single Rossmann fold families, between different families in the same superfamily,
and in different folds. This indicates that even within families the same ligand may
be bound using substantially different physico-chemistry. We also showed that the
procedure finds significant matches between binding sites for the same cofactor in
different families and different folds.
In our basic study of the effect of including amino-acid residue physico-chemical
property information in matching, we contrasted matches obtained without restric-
tion (any amino acid could match any other) with slightly more restrictive matching
(amino acids only allowed to match within relatively broadly defined groups). It is
interesting that even with very broadly defined groups, fewer statistically signifi-
cant matches were generally obtained than when matching is without restriction. It
is also interesting to note that MCMC refinement improved matches under either
matching regime. Indeed in a few cases of matching with physico-chemical groups,
we showed that some graph matches without statistical significance were converted
to significant matches by the MCMC procedure, revealing that using graph matching
alone could lead to some erroneous conclusions in this respect.
8.2 Further Work
This work has also raised some interesting issues that warrant further work.
8.2.1 Simulating Random Protein Structures
Aszodi and Taylor (1994) and our alternative method in section 3.1.2 use fixed
target distances between Cα atoms, taking into account only the hydrophobic
property of amino acids. Further work could explore varying the target distances
according to the target chain length. This would require exploring how distances
between different types of amino acids in 3-dimensional structures vary with
their spacing in the sequence as well as with the length of the chain.
In our method, the minimum of three random angles from the von Mises distribution
was used at each hydrophobic Cα atom, in order to fold the chain towards the
centre of mass and create a hydrophobic interior core. Further work could use a
variable number of random angles, depending on the number of Cα atoms already in
the chain. This would control the compactness of the structure and would also
reduce the chance of the chain clashing with itself.
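The angle rule can be sketched as follows; the concentration parameter and the fixed k = 3 are illustrative, and the function name is not from the thesis:

```python
import numpy as np

def fold_angle(rng, mu=0.0, kappa=2.0, k=3):
    """Bending angle at a hydrophobic C-alpha atom: the minimum of k
    draws from a von Mises distribution with mean mu and concentration
    kappa.  k = 3 reproduces the rule used here; letting k depend on
    the number of C-alpha atoms already placed would tune compactness,
    since larger k biases the chain towards sharper turns."""
    return float(np.min(rng.vonmises(mu, kappa, size=k)))
```

Taking the minimum of several symmetric draws shifts the angle distribution to one side, which is what bends the chain systematically towards the centre of mass.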
8.2.2 Matching Statistics
More research is required on the distributions of the RMSD (or size-and-shape
distance) and of the number of matches when matching random configurations. The
following are quite interesting and very much open questions:
(a) What is the exact distribution of the number of matches q when matching two
random configurations {x} and {µ} with, say, n and m points?
(b) What is the exact distribution of the RMSD for matching q pairs of points
between two random configurations with m and n points?
Empirical approaches (Stark et al., 2003b; Chen and Crippen, 2005) have been
quite successful in answering these questions. However, the true analytical
distributions have not been worked out. With analytical distributions, the
adjustment for database size would be straightforward, and the empirical
(model-fitting) approximation by the limiting distribution (EVD) in section 1.2.2
for the minimum RMSD or the number of matches in a database search would not be
required.
8.2.3 Matching Algorithms
In this thesis we have only considered matching pair-wise configurations. An impor-
tant extension to this approach is matching multiple configurations simultaneously.
The EM algorithm
A number of questions of interest remain to be investigated with regard to EM
algorithm alignment. For example, further work needs to be done on the
multiple-transformations approach. We observed an asymmetry in the performance of
the algorithm when matching configurations with two transformations: the algorithm
gave fewer errors with respect to the first transformation than with respect to
the second. Further work is needed to understand this observation.
Sensitivity Analysis for Multiple Transformations
Relevant questions in multiple transformation approach include:
(a) What are the effects of mis-specifying the number of transformations, e.g.
assuming the presence of two transformations when there is actually only one?
Simulations similar to, but more extensive than, those in section 5.4.4 are
required.
(b) How can the number of transformations be inferred?
(c) When does the problem become over-parameterised (the number of transformations
versus the number of points in the configurations)?
Using Multiple Atoms for each Amino Acid
We will consider using more than one atom from each amino acid for matching
functional sites in the future. Some of the issues to be considered are:
(a) Which atoms should be chosen?
(b) How should dependence between atoms be accounted for?
The Bayesian hierarchical model
In the future we will consider alternative formulations to relax the assumption of
conditional normal distribution for the second atom given the first atom when using
two atoms in an amino acid for matching functional sites.
Sequence ordering
All the matching algorithms (MCMC, graph theoretic and EM) can be extended to
take sequence ordering information into account, especially when matching whole
protein structures. In addition to an enhanced capability to solve alignment and
correspondence for configurations with many points, this would speed up the
running times of the algorithms: sequence information would constrain the matching
further and dramatically reduce the solution space.
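One cheap way to impose such a constraint is to require the matched index pairs to be co-monotone in both chains, so that the correspondence never crosses the sequence order. A sketch (illustrative, not from the thesis) that could be used to filter candidate matches:

```python
def sequence_consistent(matches):
    """Check that a set of matched residue index pairs (i, j) respects
    the chain ordering: after sorting by the first chain's index, the
    second chain's indices must be strictly increasing as well.
    Filtering candidate correspondences with this test shrinks the
    solution space before (or during) matching."""
    pairs = sorted(matches)
    return all(b1 < b2 for (_, b1), (_, b2) in zip(pairs, pairs[1:]))
```

For whole-chain matching the same idea also admits dynamic-programming formulations, since order-preserving correspondences have the structure of a sequence alignment.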
8.2.4 Application: Matching NAD Binding Functional Sites
There is a suggestion that the physico-chemical properties of sites binding the same
or similar ligands can change significantly in evolution. This was observed when
matching NAD(P)(H) binding sites within single Rossmann fold families, between
different families in the same superfamily, and in different folds in Chapter 7, section
7.4. It is however most likely to reflect increased flexibility to change in peripheral
residues that are less important for binding, and this needs further investigation.
Bibliography
Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J.P., Chothia, C. and Murzin,
A.G. (2004). SCOP database in 2004: refinements integrate structure and se-
quence family data. Nucleic Acids Res. 32, D226–D229.
Applegate, D. and Johnson, D. An implementation of the Carraghan and Pardalos