Evaluating Distance Functions in TR Clustering 1
Evaluating distance functions for clustering
tandem repeats
Suyog Rao1,2 Alfredo Rodriguez2 Gary Benson2,3
1 Department of Electrical and Computer Engineering, Boston University, Boston, MA
2 Laboratory for Biocomputing and Informatics, Boston University, Boston, MA
3 Departments of Computer Science and Biology, Graduate Program in Bioinformatics, Boston University, Boston, MA
Abstract
Tandem repeats are an important class of DNA repeats and much
research has focused on their efficient identification [3, ?, ?],
their use in DNA typing and fingerprinting [9,???], and
their causative role in trinucleotide repeat diseases such as
Huntington Disease, myotonic dystrophy, and Fragile-X mental
retardation. We are interested in clustering tandem repeats into
groups or families based on sequence similarity so that their
biological importance may be further explored. To cluster tandem
repeats we need a notion of pairwise distance, which we obtain by
alignment. In this paper we evaluate five distance functions used to
produce those alignments: Euclidean, Entropy-weighted, Consensus,
Entropy-Surface, and Shannon Divergence. It is important to analyze
and compare these functions because the choice of distance metric
forms the core of any clustering algorithm. We employ a novel method
to compare alignments and thereby compare the distance
functions themselves. We rank the distance functions based on the
cluster validation techniques Average Cluster Density and Silhouette
Index. Finally, we propose a multi-phase clustering method
which produces good-quality clusters. In this study, we analyze
clusters of tandem repeats from five sequences: Human Chromosomes 3,
5, 10 and X and C. elegans Chromosome III.
Keywords: Tandem repeats, Cluster Analysis, Cluster
Validation
1 Introduction
DNA molecules are subjected to a variety of mutational events,
one of which is tandem duplication, which produces tandem repeats. A
tandem repeat is an occurrence of two or more adjacent,
often approximate copies of a sequence of nucleotides. For
example,
ACTTAGT ACTTAGT ACTAAGT ACTTAGT
We are interested in clustering repeats into families based on
their sequence similarities. Members of a family have similar
sequence but occur at different locations in a genome or in
different genomes. Families have been detected in both prokaryotic
and eukaryotic genomes, including the E. coli, P. aeruginosa, S.
cerevisiae, C. elegans, and human genomes. To accurately and
effectively compare repeats, we cannot use standard measures like
BLAST [2] or straightforward sequence alignment [7], because variant
copy number and copy ordering are problematic for these methods.
Benson [4] used a profile representation of the repeats to overcome
these difficulties. A profile [5] is a sequence whose length equals
the number of columns in a multiple alignment and whose individual
elements are the character compositions of the columns.
In this representation, the n individual copies of a single
tandem repeat are aligned to form a multiple alignment M of length k
(see Figure 1). Let Mi,j represent the element in the ith row
and jth column of M. A profile for M is a sequence S = C1, C2, ..., Ck
of compositions, where each Cj is a vector of frequencies of
characters in M∗,j: Cj = (fA, fC, fG, fT, f−), with fσ
indicating the frequency of letter σ and f− indicating the frequency
of gaps in the column.

Figure 1: Multiple alignment view of a tandem repeat. Individual
copies are aligned to the repeat consensus to obtain the profile
representation of the repeat. Common mutations among the copies,
as are evident in this view, are reflected in the profile
compositions.
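As a concrete illustration of the profile definition above, the following minimal Python sketch computes the composition vectors for a small multiple alignment. The function name `profile`, the fixed character order `ACGT-`, and the use of equal-length gapped strings as input are assumptions made for this example; they are not taken from the TRDB implementation.

```python
from collections import Counter

# Assumed column order of each composition vector: (fA, fC, fG, fT, f-)
ALPHABET = "ACGT-"

def profile(alignment):
    """Return the profile of a multiple alignment.

    alignment: list of equal-length gapped strings, one per repeat copy
    (the rows of M). The result has one composition tuple per column.
    """
    n = len(alignment)
    compositions = []
    for column in zip(*alignment):          # iterate over columns M[*, j]
        counts = Counter(column)
        compositions.append(tuple(counts[ch] / n for ch in ALPHABET))
    return compositions

# The four copies from the example in the Introduction.
copies = ["ACTTAGT", "ACTTAGT", "ACTAAGT", "ACTTAGT"]
example_profile = profile(copies)
```

In this example the fourth column holds T in three copies and A in one, so its composition records the single mutation as fA = 0.25 and fT = 0.75.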
Alignment of profiles requires a pairwise distance function for
compositions. In [4], Benson explored a distance function based on
minimal path lengths along an entropy surface. In this paper we
explore five distance functions: two related to the entropy surface
(Entropy-weighted and Entropy-Surface), a third, the Shannon
Divergence, which is also based on entropy, and two others,
Euclidean and Consensus. Each function produces scores on its own
scale, and the alignments themselves may also differ. Gauging the
effect of distance functions strictly from scores is therefore not
convincing and might lead us to wrong conclusions. To
overcome these difficulties we propose a new approach for gauging
the closeness or similarity of these distance functions to each
other by comparing the alignments which they produce. Thus by the
end of our experiment we obtain a metric between these distance
functions with respect to the alignments. From this analysis and
further examination of clusters produced with these functions, we
choose a single function for use in clustering.
Finally, we present a multi-phase clustering scheme, which
initially uses the Hierarchical Clustering Method, and as a
secondary step uses the Partition Around Medoids (PAM) [?]
algorithm to obtain good quality clusters. Multi-phase clustering
methods like CURE [?] and BIRCH [?] have been used previously to
refine cluster quality. Clustering methods available in the R
statistical programming language [?, ?] were used in our
analysis.
The paper is organized as follows. Section 2 describes the
repeats data we used in our analysis, Section 3 defines the distance
functions and describes our method for comparing the alignments
produced by the different functions, and Section 4 describes our
analysis of cluster quality and the multi-phase clustering approach.
Finally, Section 5 summarizes our conclusions.
2 Repeats
2.1 Data Collection and Cleaning
It is important to start with a good set of data so our
conclusions will be robust. The data used for this analysis were
obtained from the Tandem Repeats Database (TRDB) [1] using default
parameters and consist of 1000 pairs of related tandem repeats from Human
Chromosomes 3, 5, 10 and X (NCBI Build 34, July 2003 Assembly) and
C. elegans Chromosome III (Sanger Institute, Aug 2002).
These repeats were obtained using the Tandem Repeats Finder (TRF)
[3] program. From the original set of repeats in each chromosome, we
used the TRDB filtering capability to select only repeats which
have a copy number greater than 5 and whose pattern size is greater
than 35. The results of this selection are shown in Table 1.
Table 1: Results of TRF analysis and TRDB filtering on the
chromosomes.

Chromosome             Size in bp      Number of    Repeats after
                       (incl. gaps)    repeats      filter
Human Chr. 3           199344050       37643        660
Human Chr. 5           181034922       35922        971
Human Chr. 10          135037215       29510        795
Human Chr. X           153692391       31779        794
C. elegans Chr. III    13002367        5011         354
These repeats were subjected to the pre-existing clustering
algorithm in TRDB. Repeats from each chromosome were clustered
separately with a connected components algorithm using Entropy
Surface as the distance function. In total, 223 clusters
over all five chromosomes resulted from the clustering. Our
data set was sampled from this pool of repeats in an automated and
randomized manner. From the pool, we chose 400 pairs of repeats,
each pair from a single cluster, from each of Human Chromosomes 3,
5, 10 and X and 150 pairs from C. elegans Chromosome III. This
candidate set of 1750 pairs was next subjected to the data cleansing
process described in the next section.
2.2 Data Cleansing - identifying and removing subpatterns in tandem repeats
A tandem repeat in the data set may contain within itself one or
more copies of tandem repeats (repeats within a repeat), which we
define to be subpatterns of the original repeat. A tandem repeat is
considered to have a perfect subpattern if its pattern length is a
perfect multiple of the subpattern length. Although there are many
tandem repeats which have perfect subpatterns, it is important
that we also consider tandem repeats whose pattern size is a close
multiple of some subpattern length. Thus, we are interested in
subpatterns whose copies span exactly or almost the entire length
of the original pattern. We call these strong subpatterns. Note also
that the subpatterns in a tandem repeat may be approximate copies of
each other.
Why care about strong subpatterns?
Because we are comparing alignments produced by the different
scoring functions, we do not want to include situations where the
alignments differ only because a repeat has cyclically shifted in the
alignment as a result of containing a strong subpattern. Hence we
eliminate these repeats from our data set. As an illustration,
consider the tandem repeat X, consisting of subpatterns X1, X2 and
X3 as shown in Figure 1. If X1 ≈ X2 ≈ X3, and we try to align this
tandem repeat with another tandem repeat, the pattern might rotate
cyclically as shown in Figure 2, depending on the distance function
used, which is undesired. Formally, given a tandem repeat
sequence X, the problem is to determine whether a strong
subpattern exists within it. To achieve this we first identify
the subpatterns in a given set of repeats and then associate with
each a subpattern score or strength. This allows us to
identify a threshold that separates the strong subpatterns from the
weak subpatterns. We omit further discussion of this task.
Figure 2: Subpatterns can cause cyclic shifting of a repeat
within an alignment when using different distance functions.
3 Comparing Alignments
3.1 Distance functions
We evaluated five distance functions. In what follows, Ci is a
composition vector of k = 5 character frequencies fσ1, ..., fσk,
one for each DNA base and the gap character:
1. Consensus:
\[
\mathrm{Cons}(C_1, C_2) =
\begin{cases}
0 & \text{if the majority characters in } C_1 \text{ and } C_2 \text{ match,} \\
1 & \text{otherwise.}
\end{cases}
\]
2. Euclidean:
\[
\mathrm{Euc}(C_1, C_2) = \sqrt{\sum_{i=1}^{k} \Delta^2_{\sigma_i}}
\]
where ∆2σi is the square of the difference between the
frequencies for character σi in C1 and C2.
3. Jensen-Shannon Divergence:
\[
\mathrm{JS}(C_1, C_2) = H(\pi_1 C_1 + \pi_2 C_2) - \pi_1 H(C_1) - \pi_2 H(C_2)
\]
where H(C) is the entropy of vector C, that is,
\[
H(C) = -\sum_{i=1}^{k} f_{\sigma_i} \log_2 f_{\sigma_i},
\]
and πi is a weighting factor for vector Ci. We used πi = 0.5.
4. Entropy Surface: This function and the next are related to
one defined by Benson in [4]. They are based on the entropy
function
\[
H(C) = -\sum_{\sigma} f_{\sigma} \log f_{\sigma}
\]
defined over all possible compositions C. The entropy function
describes a curved surface in six dimensions (five for the character
frequencies fσ and one for the entropy value). Any composition
C defines a point H(C) which is the projection of C in 5-space onto
the entropy surface. For the distance measure, we project the
straight line segment connecting C1 and C2 onto the entropy surface.
The distance between C1 and C2 is the length of the resulting curve
(which we numerically approximate with chords).
5. Entropy Weighted: Similar to the preceding, except the length
of the curve is weighted by the entropy value itself (in our
numerical approximation, the length of each approximating chord
is multiplied by its midpoint entropy).
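The five functions listed above can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: composition vectors are plain lists ordered (fA, fC, fG, fT, f-), entropy is taken in base 2 throughout, ties for the majority character are broken by position, the curve on the entropy surface is approximated with `steps` equal chords (a hypothetical parameter), and "midpoint entropy" is read as the entropy of the composition at the chord midpoint. The scaling of each function to the range 0 to 255 is not shown.

```python
import math

def entropy(c):
    """Shannon entropy (base 2) of a composition; 0 * log(0) is taken as 0."""
    return -sum(f * math.log2(f) for f in c if f > 0)

def consensus(c1, c2):
    """0 if the majority characters of c1 and c2 match, 1 otherwise."""
    return 0 if c1.index(max(c1)) == c2.index(max(c2)) else 1

def euclidean(c1, c2):
    """Square root of the summed squared frequency differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def jensen_shannon(c1, c2, w1=0.5, w2=0.5):
    """H(w1*c1 + w2*c2) - w1*H(c1) - w2*H(c2)."""
    mix = [w1 * a + w2 * b for a, b in zip(c1, c2)]
    return entropy(mix) - w1 * entropy(c1) - w2 * entropy(c2)

def entropy_surface(c1, c2, steps=100, weighted=False):
    """Chord approximation of the segment c1-c2 projected onto the surface.

    Each point on the surface is (composition, entropy). With weighted=True
    every chord is multiplied by the entropy at its midpoint composition,
    giving a sketch of the Entropy Weighted variant.
    """
    total = 0.0
    prev = list(c1)
    for s in range(1, steps + 1):
        t = s / steps
        cur = [(1 - t) * a + t * b for a, b in zip(c1, c2)]
        base = euclidean(prev, cur)              # length in composition space
        rise = entropy(cur) - entropy(prev)      # change in the entropy axis
        chord = math.sqrt(base ** 2 + rise ** 2)
        if weighted:
            mid = [(p + q) / 2 for p, q in zip(prev, cur)]
            chord *= entropy(mid)
        total += chord
        prev = cur
    return total
```

For two columns that are each purely one (different) base, the Consensus distance is 1, the Jensen-Shannon divergence is 1 bit, and the Entropy Surface length exceeds the Euclidean distance because the path climbs and descends the entropy axis.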
Each distance function was scaled to values between 0 and 255
inclusive. To correct for differences between the number of copies
in each repeat, all composition vectors were normalized to standard
vectors for a 10 copy repeat. The frequencies in a standard vector
are drawn from {0, 0.1, 0.2, ..., 0.9, 1.0}. The standard vector
closest by Euclidean distance to an original vector becomes that
vector's normalized representative.
3.2 Calculating the distance between repeats
The distance between repeats is calculated using alignment
scores. We use a cyclic alignment algorithm [6] in conjunction with
our distance functions because the relative starting position of
one profile to another may be incorrect in the original data. (That
is why we omit repeats with strong subpatterns.)
We create the inter-repeat distance matrix for all repeats in our
dataset, which becomes the input to our clustering method. The
distance between two repeats R1 and R2 is calculated as follows:
\[
\mathrm{Dist}(R_1, R_2) = \frac{\text{Alignment Score}(R_1, R_2)}{255 \cdot \text{Alignment length}(R_1, R_2)} \tag{1}
\]
where 255 is the worst score possible (scaled), using any
distance function.
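Formula (1) can be checked with a few lines of Python. The sketch below simply divides an alignment score by the worst possible score for that alignment length; the function and constant names are illustrative. The example values are the Consensus and Euclidean rows of Table 2, whose alignments have 15 and 14 columns respectively.

```python
MAX_COLUMN_SCORE = 255  # worst (scaled) score of a single aligned column

def repeat_distance(alignment_score, alignment_length):
    """Normalized distance between two repeats, as in formula (1)."""
    return alignment_score / (MAX_COLUMN_SCORE * alignment_length)

consensus_distance = repeat_distance(1020, 15)   # worst score 255 * 15 = 3825
euclidean_distance = repeat_distance(1185, 14)   # worst score 255 * 14 = 3570
```

Rounded to two decimals these give 0.27 and 0.33, matching the last column of Table 2.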
3.3 Effect of distance functions on the alignments
After collecting a good set of data and cleaning it to remove
repeats with strong subpatterns, we perform our experiment of
comparing the different distance functions. We cannot simply compare
the alignment scores produced by the five functions on a
particular repeat pair, as each function has its own properties and
the distribution of alignment scores depends strongly on the distance
function used. For example, consider two tandem repeats from
the human genome (???) with consensus patterns ATACACAC and
CTCCCAGC, and copy numbers 24.6 and 30.0 respectively. Table 2
shows the resulting alignment scores and cyclic shifting of the
patterns when using the five distance functions. Also shown are the
worst possible scores with each function for the same pair. The
worst possible score depends on the alignment length, and the
alignment length can vary with the distance function used. This worst
possible score is our normalizing factor in formula (1). The
distance between the same pair using each distance function is
provided in the last column of the table.
3.4 Computing relative distances of the distance functions with respect to alignments
To compare two distance functions we use the following
procedure:
• Let A and B be the two repeats to be aligned, and let D1 and
D2 be the two distance functions to be compared.
• Align A and B with D1 to get the alignment AP1.
• Align A and B with D2 to get the alignment AP2.
• Calculate the number of identical pairs in AP1 and AP2. By
identical pairs we mean the same pair of nucleotides aligned in both
alignments.
Iterating this procedure for N pairs of repeats, we calculate
the distance between two functions as:
\[
\mathrm{Dist}_{nb}(D_1, D_2) = 1 - \frac{1}{N} \sum_{N} \frac{2 \cdot (\text{No. of identical pairs})}{\mathrm{len}(AP_1) + \mathrm{len}(AP_2)} \tag{2}
\]
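The procedure and formula (2) can be sketched as follows. The sketch assumes each alignment is given as two equal-length gapped strings, that positions in each repeat are numbered from the start of the string (so both alignments of a pair must use the same numbering, which the cyclic alignment would need to guarantee), and that len(AP) is the number of alignment columns. All function names are hypothetical.

```python
def aligned_pairs(top, bottom, gap="-"):
    """Set of (i, j) index pairs for columns where both rows hold a base."""
    pairs = set()
    i = j = 0
    for a, b in zip(top, bottom):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs

def function_distance(alignment_pairs):
    """Formula (2): alignment_pairs is a list of (AP1, AP2) tuples, where
    each AP is a (top, bottom) pair of gapped strings for the same repeats."""
    total = 0.0
    for (top1, bottom1), (top2, bottom2) in alignment_pairs:
        identical = len(aligned_pairs(top1, bottom1) & aligned_pairs(top2, bottom2))
        total += 2 * identical / (len(top1) + len(top2))
    return 1 - total / len(alignment_pairs)

# Toy example: the two alignments agree on three of their aligned pairs.
ap1 = ("AC-GT", "ACTGT")
ap2 = ("ACG-T", "ACTGT")
toy_distance = function_distance([(ap1, ap2)])
```

In the toy example both alignments have five columns and share three aligned pairs, so the distance is 1 − (2·3)/10 = 0.4; two functions that always produce the same alignment without gaps would be at distance 0.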
Distance            Aligned                          Alignment   Worst   Distance
Function            Repeats                          Score       Score
Consensus           - C - - T - - - C C C A G - C    1020        3825    0.27
                    - A - - T - - A C A C A - - C
Entropy-Weighted    - C - - T - - - C C C A G - C    727         3825    0.19
                    - A - - T - - A C A C A - - C
Euclidean           A - - G - - C - C T - C C C      1185        3570    0.33
                    A - - T - - A - C A - C A C
Jensen-Shannon      A G - C - - - C T - C C C        1105        3315    0.33
                    A - - T - - A C A - C A C
Entropy-Surface     A G - C - - - C T - C C C        1265        3315    0.38
                    A - - T - - A C A - C A C

Table 2: Alignments produced by distance functions. The starting
position of the upper pattern is cyclically permuted in these
alignments. Note that columns aligning dash to dash are an artifact
of the consensus pattern representation. Dash columns are not
present in consensus patterns, but are present in the profile when a
repeat contains characters in a column in less than half its
copies. These columns often occur when a repeat has many copies, as
is the case here.
Thus we take into account the number of positions at which the
alignments were identical, and use this to form the output of our
alignment experiment. Figure 3 shows a tree comparing the 5 distance
functions. The tree was obtained by hierarchical single
linkage clustering of the distance functions.
Based on this tree, Euclidean and Entropy Surface are the closest in
terms of the alignments produced. Height in the figure represents
distance between the functions.
4 Clustering
Using the repeats from Human Chromosome 10 we produced clusters
with the Hierarchical Agglomerative Clustering (HAC) method using
the single linkage algorithm [?]. Hierarchical Clustering is a
widely used algorithm despite its time complexity. The HAC
algorithm is a bottom-up strategy that initially places all data
points in singleton clusters. It then merges these clusters into
larger and larger clusters based on the cluster linkage criteria.
The single linkage method works by merging the two clusters or points
which are closest to each other. The hierarchical clustering
algorithm takes as input an N × N distance matrix and a cut-off value
which specifies at which height the clustering is terminated. We
performed this clustering procedure with the different distance
functions. (??? what was the cutoff)
4.1 Cluster Validation
The importance and effect of cluster structure with respect to
tandem repeat families is still unclear. However, we analyze the
shape and density of the clusters and would like to produce good
clusters using these metrics. We assess the quality of clusters
produced by the Hierarchical Clustering method using the cluster
validation techniques Average Cluster Density and Silhouette Index
defined by [?]. Consequently we can rank the individual distance
functions based on the quality of clusters they produce.
Figure 3: Cluster tree depicting the relative comparison of
distance functions.

Figure 4: Hierarchical cluster tree of repeats in Human Chromosome 10.

Average Cluster Density: This measures the compactness and
density of the clusters. The cluster density is calculated by using
the cluster diameter, which is the largest distance between any pair
of points in the cluster.
\[
\text{ClusterDensity} = \frac{\text{ClusterDiameter}}{\text{AverageLength}} \tag{3}
\]
where Average Length is the average distance between any two
points in the cluster. The Average Cluster Density is the average
over all clusters. If the Average Cluster Density is close to 1, we
have highly compact clusters.
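A minimal Python sketch of formula (3) is given below. It assumes clusters are lists of indices into a distance matrix and skips clusters for which the ratio is undefined (singletons, or clusters whose members are all at distance 0); how such clusters were treated in this study is not stated, so that handling is an assumption.

```python
from itertools import combinations

def cluster_density(dist, members):
    """Diameter divided by average pairwise distance; None if undefined."""
    pair_dists = [dist[i][j] for i, j in combinations(members, 2)]
    if not pair_dists or max(pair_dists) == 0:
        return None
    diameter = max(pair_dists)
    average_length = sum(pair_dists) / len(pair_dists)
    return diameter / average_length

def average_cluster_density(dist, clusters):
    """Mean density over all clusters for which the density is defined."""
    values = [cluster_density(dist, c) for c in clusters]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)

# Toy data: two clusters of points on a line.
points = [0, 1, 3, 10, 11]
toy_dist = [[abs(a - b) for b in points] for a in points]
toy_density = average_cluster_density(toy_dist, [[0, 1, 2], [3, 4]])
```

In the toy data the first cluster has diameter 3 and average pairwise distance 2 (density 1.5) and the second has a single pair (density 1.0), so the average is 1.25. The ratio can never fall below 1, and it equals 1 only when all pairwise distances in a cluster are equal.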
Silhouette Width: This is a measure of the membership of an
object i to a cluster C. The silhouette width shows which objects
lie well within the cluster and which ones are between clusters.
Consider an object i of the data set, and let Ci denote
the cluster to which it is assigned. We calculate: 1) a(i) = average
distance of i to all other objects of Ci, 2) for each cluster C
such that C ≠ Ci, d(i, C) = the average distance of i to all
objects of C, and 3) over all clusters C ≠ Ci,
b(i) = min(d(i, C)), the average distance of i to its nearest
neighbor cluster. Silhouette width S(i) is given by
\[
S(i) = \frac{b(i) - a(i)}{\max\{b(i), a(i)\}}
\]
From this, S(i) lies between -1 and +1. The
average silhouette Savg is the average of S(i) over all the
objects in the dataset. If Savg is close to 1, the objects are well
clustered or structured.
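The silhouette width can be computed directly from a distance matrix and a cluster assignment, as in the sketch below. Two conventions here are assumptions rather than statements from the text: an object that is alone in its cluster receives S(i) = 0, and S(i) is set to 0 when both a(i) and b(i) are 0. The sketch requires at least two clusters.

```python
def silhouette_widths(dist, labels):
    """Silhouette width S(i) for every object.

    dist: symmetric N x N distance matrix; labels: cluster label per object.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:                      # singleton cluster (assumed S = 0)
            widths.append(0.0)
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        # b(i): smallest average distance to the members of another cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Toy data: two tight, well-separated pairs on a line.
points = [0, 1, 10, 11]
toy_dist = [[abs(p - q) for q in points] for p in points]
toy_widths = silhouette_widths(toy_dist, [0, 0, 1, 1])
toy_average = sum(toy_widths) / len(toy_widths)
```

For the first point, a = 1 and b = 10.5, giving S = 9.5/10.5 ≈ 0.90; all four objects score close to 1, as expected for well-separated clusters.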
We calculate the cluster statistics using different distance
functions. Table 3 shows the cluster qualities of each of the five
distance functions on Human Chromosome 10. We chose the Entropy
Weighted distance function for the remainder of our analysis
because it scores well on both measures and was best in terms of
number of clusters produced and the percentage of repeats
clustered. Figure 4 shows the cluster dendrogram of repeats in Human
Chromosome 10 with a cut-off criterion of 25.5 in distance.

Figure 5: These graphs show the relationship between the number
of clusters and percentage of repeats clustered at different
distance cut-offs (75% to 99%) using the Hierarchical clustering
method and the Entropy-weighted distance function. The "mountain"
line is the number of multi-repeat clusters (unary clusters are not
counted), the descending line is the number of repeats in
multi-repeat clusters. Comparison of these graphs with those
produced by the other functions (not shown) indicated that
Entropy-weighted was able to cluster a higher percentage of
repeats than the other distance functions. This was one criterion
for picking Entropy-Weighted as the preferred distance function.
4.2 Multiple-phase Clustering
A defect of hierarchical clustering is that clusters can be
low-quality in the sense that they are elongated and less dense.
To split these chained clusters formed by single-linkage, we can
subject them to clustering again, using other partition based
clustering methods. Using the cluster validation techniques, we
can identify these low-quality clusters. We use the Partition
around Medoids (PAM) [?] algorithm to re-cluster the chained
clusters, splitting them into smaller clusters.
PAM is a k-medoids method, a variant of the popular k-means approach,
but is more robust than k-means because medoids are less influenced by
outliers. PAM works by iteratively finding representative objects,
called medoids, in the clusters. PAM requires as input K, which is
the number of clusters to be formed from the data set, and an N × N
distance matrix. To determine K, we run PAM on the data set several
times, each time with a different K, and select the K which yields
the highest Average Silhouette Width. PAM works effectively for
small data sets but does not scale well for large data sets because
of its time complexity, O(K(N − K)^2), where N is the number of data
points and K is the number of clusters. Human Chromosome 10 when
subjected to this multi-phase clustering yielded 44
clusters with an Average Silhouette Width of 0.76.
Figure 6 shows an example where a poor quality cluster produced
by the Hierarchical method on Human Chromosome 10 was re-clustered
using the PAM method. Initially the Hierarchical clustering
produced a cluster containing 138 repeats and a Sil Width of 0.45.
Running PAM on this cluster produced two smaller clusters while
increasing the Sil Width to 0.6. As the alignments illustrate,
repeats within a cluster are much more closely related than those
between clusters.

Table 3: Human Ch10 clustering results using different distance
functions ??? at what cutoff

Distance            No. of      Sil      Avg
function            clusters    Width    Diameter
Consensus           38          0.8      0.85
Entropy-Weighted    40          0.73     0.90
Euclidean           38          0.64     0.93
Entropy-Surface     40          0.76     0.89
Jensen-Shannon      36          0.62     0.93
5 Conclusion
We have described a new quantitative approach to evaluate
distance functions with respect to alignments, and we study their
effects on discovering families of tandem repeats. We describe a
relative comparison between the distance functions and also an
individual evaluation using cluster validation techniques. Tandem
repeats from Human Chromosome 10 were clustered using a multi-phase
approach that combines the Hierarchical Agglomerative method and
Partition around Medoids. Our results show that for
clustering repeats a multi-phase clustering approach produces
better quality clusters. The two entropy based functions, Entropy
Weighted and Entropy Surface, outscore the other distance functions
in our alignment experiment, in quality and number of clusters
produced, and also in the number of repeats clustered. Future
clustering tools in the Tandem Repeats Database will employ entropy
based distance functions and multi-phase clustering as demonstrated
in this work.
References
[1] TRDB at http://tandem.bu.edu/cgi-bin/trdb/trdb.exe.
[2] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman.
Basic local alignment search tool. J. Mol. Biol., 215:403–410,
1990.
[3] G. Benson. Tandem repeats finder: a program to analyze DNA
sequences. Nucleic Acids Research, 27:573–580, 1999.
[4] G. Benson. A new distance measure for comparing sequence
profiles based on paths along an entropy surface. In Proceedings of
the European Conference on Computational Biology 2002, 2002.
[5] M. Gribskov, R. Lüthy, and D. Eisenberg. Profile analysis.
Methods in Enzymology, 183:146–159, 1990.
[6] M. Maes. On a cyclic string-to-string correction problem.
Information Processing Letters, 35:73–78, 1990.
[7] T. Smith and M. Waterman. Comparison of biosequences.
Advances in Applied Mathematics, 2:482–489, 1981.
Figure 6: Result of PAM on a cluster produced by Hierarchical
Clustering method.