

Evaluating distance functions for clustering tandem repeats

Suyog Rao1,2 Alfredo Rodriguez2 Gary Benson2,3

[email protected] [email protected] [email protected]

1 Department of Electrical and Computer Engineering, Boston University, Boston, MA

2 Laboratory for Biocomputing and Informatics, Boston University, Boston, MA

3 Departments of Computer Science and Biology, Graduate Program in Bioinformatics, Boston University, Boston, MA

Abstract

Tandem repeats are an important class of DNA repeats and much research has focused on their efficient identification [3, ?, ?], their use in DNA typing and fingerprinting [9,???], and their causative role in trinucleotide repeat diseases such as Huntington Disease, myotonic dystrophy, and Fragile-X mental retardation. We are interested in clustering tandem repeats into groups or families based on sequence similarity so that their biological importance may be further explored. To cluster tandem repeats we need a notion of pairwise distance, which we obtain by alignment. In this paper we evaluate five distance functions used to produce those alignments: Euclidean, Entropy-weighted, Consensus, Entropy-Surface, and Shannon Divergence. It is important to analyze and compare these functions because the choice of distance metric forms the core of any clustering algorithm. We employ a novel method to compare alignments and thereby compare the distance functions themselves. We rank the distance functions based on the cluster validation techniques Average Cluster Density and Silhouette Index. Finally, we propose a multi-phase clustering method which produces good-quality clusters. In this study, we analyze clusters of tandem repeats from five sequences: Human Chromosomes 3, 5, 10 and X and C. elegans Chromosome III.

Keywords: Tandem repeats, Cluster Analysis, Cluster Validation

1 Introduction

DNA molecules are subjected to a variety of mutational events, one of which is tandem duplication, which produces tandem repeats. A tandem repeat is an occurrence of two or more adjacent, often approximate copies of a sequence of nucleotides. For example,

ACTTAGT ACTTAGT ACTAAGT ACTTAGT

We are interested in clustering repeats into families based on their sequence similarities. Members of a family have similar sequence but occur at different locations in a genome or in different genomes. Families have been detected in both prokaryotic and eukaryotic genomes, including the E. coli, P. aeruginosa, S. cerevisiae, C. elegans, and human genomes. To accurately and effectively compare repeats, we cannot use standard measures like BLAST [2] or straightforward sequence alignment [7], because variant copy number and copy ordering are problematic for these methods. Benson [4] used a profile representation of the repeats to overcome these difficulties. A profile [5] is a sequence whose length equals the number of columns in a multiple alignment and whose individual elements are the character compositions of the columns.

Figure 1: Multiple alignment view of a tandem repeat. Individual copies are aligned to the repeat consensus to obtain the profile representation of the repeat. Common mutations among the copies, as are evident in this view, are reflected in the profile compositions.

In this representation, the n individual copies of a single tandem repeat are aligned to form a multiple alignment M of length k (see Figure 1). Let Mi,j represent the element in the ith row and jth column of M. A profile for M is a sequence S = C1, C2, . . . Ck of compositions, where each Cj is a vector of frequencies of characters in M∗,j: Cj = (fA, fC, fG, fT, f−), with fσ indicating the frequency of letter σ and f− indicating the frequency of gaps in the column.
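The profile construction described here can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `profile`, the `ALPHABET` ordering, and the assumption that the copies are already aligned to equal length are all hypothetical choices made for the example. The input reuses the four copies of the example repeat from the Introduction.

```python
from collections import Counter

# Assumed column order for each composition: (fA, fC, fG, fT, f-).
ALPHABET = "ACGT-"


def profile(alignment):
    """Return one composition vector per column of a multiple alignment.

    `alignment` is a list of equal-length strings (the aligned copies of
    one tandem repeat); '-' marks a gap.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned copies must have the same length")
    n = len(alignment)
    compositions = []
    for j in range(width):
        counts = Counter(row[j] for row in alignment)
        # Frequency of each character in column j.
        compositions.append(tuple(counts[ch] / n for ch in ALPHABET))
    return compositions


copies = ["ACTTAGT", "ACTTAGT", "ACTAAGT", "ACTTAGT"]
prof = profile(copies)
```

In this example the fourth column holds T in three copies and A in one, so its composition records that mutation as fA = 0.25 and fT = 0.75.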

Alignment of profiles requires a pairwise distance function for compositions. In [4], Benson explored a distance function based on minimal path lengths along an entropy surface. In this paper we explore five distance functions: two related to the entropy surface, Entropy-weighted and Entropy-Surface; a third, the Shannon Divergence, which is also based on entropy; and two others, Euclidean and Consensus. Each function produces a different score and the alignments may differ also. Hence trying to gauge the effect of distance functions strictly from scores is not a very convincing or effective process and might lead us to wrong conclusions. To overcome these difficulties we propose a new approach for gauging the closeness or similarity of these distance functions to each other by comparing the alignments which they produce. Thus by the end of our experiment we obtain a metric between these distance functions with respect to the alignments. From this analysis and further examination of clusters produced with these functions, we choose a single function for use in clustering.

Finally, we present a multi-phase clustering scheme, which initially uses the Hierarchical Clustering Method, and as a secondary step uses the Partition Around Medoids (PAM) [?] algorithm to obtain good quality clusters. Multiphase clustering methods like CURE [?] and BIRCH [?] have been used previously to refine cluster quality. Clustering methods available in the R statistical programming language [?, ?] were used in our analysis.

The paper is organized as follows. Section 2 describes the repeats data we used in our analysis, section 3 defines the distance functions and describes our method for comparing the alignments produced by the different functions, and section 4 describes our analysis of cluster quality and the multiphase clustering approach. Finally, section 5 summarizes our conclusions.

2 Repeats

2.1 Data Collection and Cleaning

It is important to start with a good set of data so our conclusions will be robust. The data used for this analysis were obtained from the Tandem Repeats Database (TRDB) [1] using default parameters and consist of 1000 pairs of related tandem repeats from Human Chromosomes 3, 5, 10 and X (NCBI Build 34, July 2003 Assembly) and C. elegans Chromosome III (Sanger Institute, Aug 2002). These repeats were obtained using the Tandem Repeats Finder (TRF) [3] program. From the original set of repeats in each chromosome, we used the TRDB filtering capability to select only repeats which have a copy number greater than 5 and whose pattern size is greater than 35. The results of this selection are shown in Table 1.

Table 1: Results of TRF analysis and TRDB filtering on the chromosomes.

Chromosome             Size in bp      Number of    Repeats after
                       (incl. gaps)    repeats      filter
Human Chr. 3           199344050       37643        660
Human Chr. 5           181034922       35922        971
Human Chr. 10          135037215       29510        795
Human Chr. X           153692391       31779        794
C. elegans Chr. III    13002367        5011         354

These repeats were subjected to the pre-existing clustering algorithm in TRDB. Repeats from each chromosome were clustered separately with a connected components algorithm using Entropy Surface as the distance function. In total, 223 clusters resulted over all five chromosomes. Our data set was sampled from this pool of repeats in an automated and randomized manner. From the pool, we chose 400 pairs of repeats, each pair from a single cluster, from each of Human Chromosomes 3, 5, 10 and X, and 150 pairs from C. elegans Chromosome III. This candidate set of 1750 pairs was then subjected to the data cleansing process described in the next section.

2.2 Data Cleansing - identifying and removing subpatterns in tandem repeats

A tandem repeat in the data set may contain within itself one or more copies of tandem repeats (repeats within a repeat), which we define to be subpatterns of the original repeat. A tandem repeat is considered to have a perfect subpattern if its pattern length is a perfect multiple of the subpattern length. Although there are many tandem repeats which have perfect subpatterns, it is important that we also consider tandem repeats whose pattern size is a close multiple of some subpattern length. Thus, we are interested in subpatterns whose copies span exactly or almost the entire length of the original pattern. We call these strong subpatterns. Note also that the subpatterns in a tandem repeat may be approximate copies of each other.

Why care about strong subpatterns?

Because we are comparing alignments produced by the different scoring functions, we do not want to include situations where the alignments differ because a repeat has cyclically shifted in the alignment simply because it contains a strong subpattern. Hence we eliminate these repeats from our data set. As an illustration, consider the tandem repeat X, consisting of subpatterns X1, X2 and X3 as shown in Figure 1. If X1 ≈ X2 ≈ X3, and we try to align this tandem repeat with another tandem repeat, the pattern might rotate cyclically as shown in Figure 2, depending on the distance function used, which is undesired. Formally, given a tandem repeat sequence X, the problem is to determine whether a strong subpattern exists within it. To do this we first identify the subpatterns in a given set of repeats and then associate with each a subpattern score or strength. This allows us to identify a threshold that separates the strong subpatterns from the weak subpatterns. We omit further discussion of this task.


Figure 2: Subpatterns can cause cyclic shifting of a repeat within an alignment when using different distance functions.

3 Comparing Alignments

3.1 Distance functions

We evaluated five distance functions. In what follows, Ci is a composition vector of k = 5 character frequencies fσ1, . . . , fσk, one for each DNA base and the gap character:

1. Consensus:

\[
\mathrm{Cons}(C_1, C_2) =
\begin{cases}
0 & \text{if the majority characters in } C_1 \text{ and } C_2 \text{ match, and} \\
1 & \text{otherwise}
\end{cases}
\]

2. Euclidean:

\[
\mathrm{Euc}(C_1, C_2) = \sqrt{\sum_{i=1}^{k} \Delta_{\sigma_i}^{2}}
\]

where $\Delta_{\sigma_i}^{2}$ is the square of the difference between the frequencies for character σi in C1 and C2.

3. Jensen-Shannon Divergence:

\[
\mathrm{JS}(C_1, C_2) = H(\pi_1 C_1 + \pi_2 C_2) - \pi_1 H(C_1) - \pi_2 H(C_2)
\]

where H(C) is the entropy of vector C, that is, $H(C) = -\sum_{i=1}^{k} f_{\sigma_i} \log_2(f_{\sigma_i})$, and πi is a weighting factor for vector Ci. We used πi = 0.5.

4. Entropy Surface: This function and the next are related to one defined by Benson in [4]. They are based on the entropy function

\[
H(C) = -\sum_{\sigma} f_{\sigma} \log(f_{\sigma})
\]

defined over all possible compositions C. The entropy function describes a six-dimensional curved surface (five dimensions for the character frequencies fσ and one for the entropy value). Any composition, C, defines a point H(C) which is the projection of C in 5-space onto the entropy surface. For the distance measure, we project the straight line segment connecting C1 and C2 onto the entropy surface. The distance between C1 and C2 is the length of the resulting curve (which we numerically approximate with chords).

5. Entropy Weighted: Similar to the preceding, except the length of the curve is weighted by the entropy value itself (in our numerical approximation, the length of each approximating chord is multiplied by its midpoint entropy).
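The five definitions above can be sketched in Python as follows. This is an illustrative reading rather than the authors' code: the function names, the tie-breaking rule for the majority character (first maximum wins), the number of chords (`steps`), the use of the natural logarithm on the entropy surface, and the choice to evaluate "midpoint entropy" at the midpoint composition of each chord are all assumptions. The values are unscaled; the rescaling to the range 0 to 255 is not shown.

```python
import math


def entropy(c, base=2.0):
    """Shannon entropy of a composition; terms with f = 0 contribute 0."""
    return -sum(f * math.log(f, base) for f in c if f > 0)


def consensus(c1, c2):
    """0 if the majority characters match, 1 otherwise (ties: first maximum)."""
    return 0 if c1.index(max(c1)) == c2.index(max(c2)) else 1


def euclidean(c1, c2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))


def jensen_shannon(c1, c2, w1=0.5, w2=0.5):
    mix = [w1 * a + w2 * b for a, b in zip(c1, c2)]
    return entropy(mix) - w1 * entropy(c1) - w2 * entropy(c2)


def entropy_surface(c1, c2, steps=100, weighted=False):
    """Chord approximation of the curve obtained by lifting the segment
    C1 -> C2 onto the surface (C, H(C)). With weighted=True each chord
    length is multiplied by the entropy at the chord's midpoint composition.
    """
    total = 0.0
    prev = list(c1)
    for s in range(1, steps + 1):
        t = s / steps
        cur = [(1 - t) * a + t * b for a, b in zip(c1, c2)]
        dh = entropy(cur, math.e) - entropy(prev, math.e)
        chord = math.sqrt(euclidean(prev, cur) ** 2 + dh ** 2)
        if weighted:
            mid = [(a + b) / 2 for a, b in zip(prev, cur)]
            chord *= entropy(mid, math.e)
        total += chord
        prev = cur
    return total


pure_a = (1.0, 0.0, 0.0, 0.0, 0.0)
pure_c = (0.0, 1.0, 0.0, 0.0, 0.0)
```

For the two pure compositions `pure_a` and `pure_c`, the Jensen-Shannon value is 1 bit and the Euclidean value is √2, while the entropy-surface length exceeds √2 because the lifted curve rises over the surface between the two endpoints.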


Each distance function was scaled to values between 0 and 255 inclusive. To correct for differences between the number of copies in each repeat, all composition vectors were normalized to standard vectors for a 10 copy repeat. The frequencies in a standard vector are drawn from {0, 0.1, 0.2, . . . , 0.9, 1.0}. The standard vector closest by Euclidean distance to an original vector becomes that vector's normalized representative.
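The normalization step can be sketched by brute force. The sketch below assumes that a standard vector's five frequencies must sum to 1 (the text states only the allowed values), and the names `standard_vectors` and `normalize` are hypothetical. Under that assumption there are 1001 standard vectors, so an exhaustive nearest-neighbor search is cheap.

```python
import itertools


def standard_vectors(copies=10, k=5):
    """All length-k vectors of multiples of 1/copies that sum to 1
    (assumption: standard vectors are valid compositions)."""
    vectors = []
    for counts in itertools.product(range(copies + 1), repeat=k):
        if sum(counts) == copies:
            vectors.append(tuple(c / copies for c in counts))
    return vectors


def normalize(composition, standards):
    """Return the standard vector closest to `composition` by Euclidean
    distance (squared distance gives the same ordering)."""
    return min(
        standards,
        key=lambda s: sum((a - b) ** 2 for a, b in zip(composition, s)),
    )


STANDARDS = standard_vectors()
nearest = normalize((0.33, 0.0, 0.0, 0.67, 0.0), STANDARDS)
```

When an original vector lies exactly halfway between two standard vectors, this sketch returns whichever candidate `min` encounters first; the source does not say how such ties were resolved.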

3.2 Calculating the distance between repeats

The distance between repeats is calculated using alignment scores. We use a cyclic alignment algorithm [6] in conjunction with our distance functions because the relative starting position of one profile to another may be incorrect in the original data. (That is why we omit repeats with strong subpatterns.) We create the inter-repeat distance matrix for all repeats in our dataset, which becomes the input to our clustering method. The distance between two repeats R1 and R2 is calculated as follows:

\[
\mathrm{Dist}(R_1, R_2) = \frac{\text{Alignment Score}(R_1, R_2)}{255 \times \text{Alignment length}(R_1, R_2)} \tag{1}
\]

where 255 is the worst score possible (scaled), using any distance function.
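Formula (1) is a one-line computation; the sketch below (with the hypothetical name `repeat_distance`) applies it to the alignment scores reported in Table 2. The alignment lengths of 15, 14 and 13 columns used here are inferred from the reported worst scores (3825, 3570 and 3315, each divided by 255).

```python
MAX_COLUMN_SCORE = 255  # worst scaled score for a single alignment column


def repeat_distance(alignment_score, alignment_length):
    """Normalized distance between two repeats, as in formula (1)."""
    if alignment_length <= 0:
        raise ValueError("alignment length must be positive")
    return alignment_score / (MAX_COLUMN_SCORE * alignment_length)


# (alignment score, alignment length) per distance function, from Table 2.
table2 = {
    "Consensus": (1020, 15),
    "Entropy-Weighted": (727, 15),
    "Euclidean": (1185, 14),
    "Jensen-Shannon": (1105, 13),
    "Entropy-Surface": (1265, 13),
}
distances = {
    name: round(repeat_distance(score, length), 2)
    for name, (score, length) in table2.items()
}
```

Rounded to two decimals, these values agree with the Distance column of Table 2.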

3.3 Effect of distance functions on the alignments

After collecting a good set of data and cleaning it to remove repeats with strong subpatterns, we perform our experiment of comparing the different distance functions. We cannot simply use the different alignment scores produced by the five functions on a particular repeat pair, as each function has its own properties and the distribution of alignment scores is very much dependent on the distance function used. For example, consider two tandem repeats from the human genome (???) with consensus patterns ATACACAC and CTCCCAGC, and copy numbers 24.6 and 30.0 respectively. Table 2 shows the resulting alignment scores and cyclic shifting of the patterns when using the five distance functions. Also shown are the worst possible scores with each function for the same pair. The worst possible score depends on the alignment length, and the alignment length can vary with the distance function used. This worst possible score is our normalizing factor in formula (1). The distance between the same pair under each distance function is provided in the last column of the table.

3.4 Computing relative distances of the distance functions with respect to alignments

To compare two distance functions we use the following procedure:

• Let A and B be the two repeats to be aligned, and let D1 and D2 be the two distance functions to be compared.

• Align A and B with D1 to get the alignment AP1.

• Align A and B with D2 to get the alignment AP2.

• Calculate the number of identical pairs in AP1 and AP2. By identical pairs we mean the same pair of nucleotides aligned in both alignments.

Iterating this procedure for N pairs of repeats, we calculate the distance between two functions as:

\[
\mathrm{Dist}_{nb}(D_1, D_2) = 1 - \frac{1}{N} \sum_{n=1}^{N} \frac{2 \times \text{No. of identical pairs}}{\mathrm{len}(AP_1) + \mathrm{len}(AP_2)} \tag{2}
\]
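The comparison procedure and formula (2) can be sketched as follows. The representation is an assumption made for illustration: each alignment is a list of columns, and each column is a tuple `(i, j)` giving the position in A and the position in B that are aligned, with `None` standing for a gap. Under this reading, gap columns count toward the alignment length but never toward the identical pairs. All function names are hypothetical.

```python
def aligned_pairs(alignment):
    """Set of (position in A, position in B) pairs aligned to each other."""
    return {(i, j) for i, j in alignment if i is not None and j is not None}


def alignment_similarity(ap1, ap2):
    """2 * (identical pairs) / (len(AP1) + len(AP2)) for one repeat pair."""
    identical = len(aligned_pairs(ap1) & aligned_pairs(ap2))
    return 2 * identical / (len(ap1) + len(ap2))


def function_distance(alignment_pairs):
    """Formula (2): one minus the mean similarity over N repeat pairs.

    `alignment_pairs` is a list of (AP1, AP2) tuples, one per repeat pair,
    where AP1 was produced with D1 and AP2 with D2.
    """
    n = len(alignment_pairs)
    return 1 - sum(alignment_similarity(a, b) for a, b in alignment_pairs) / n


# Two alignments of the same repeat pair that agree on two aligned pairs.
ap1 = [(0, 0), (1, 1), (2, None), (3, 2)]
ap2 = [(0, 0), (1, None), (2, 1), (3, 2)]
d = function_distance([(ap1, ap2)])
```

Here the alignments share the pairs (0, 0) and (3, 2) and each has four columns, so the similarity is 4/8 and the resulting distance is 0.5.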


Distance            Aligned                          Alignment   Worst    Distance
Function            Repeats                          Score       Score

Consensus           - C - - T - - - C C C A G - C    1020        3825     0.27
                    - A - - T - - A C A C A - - C

Entropy-Weighted    - C - - T - - - C C C A G - C    727         3825     0.19
                    - A - - T - - A C A C A - - C

Euclidean           A - - G - - C - C T - C C C      1185        3570     0.33
                    A - - T - - A - C A - C A C

Jensen-Shannon      A G - C - - - C T - C C C        1105        3315     0.33
                    A - - T - - A C A - C A C

Entropy-Surface     A G - C - - - C T - C C C        1265        3315     0.38
                    A - - T - - A C A - C A C

Table 2: Alignments produced by distance functions. The starting position of the upper pattern is cyclically permuted in these alignments. Note that columns aligning dash to dash are an artifact of the consensus pattern representation. Dash columns are not present in consensus patterns, but are present in the profile when a repeat contains characters in a column in less than half its copies. These columns often occur when a repeat has many copies, as is the case here.

Thus we take into account the number of positions at which the alignments were identical, and use this to form the output of our alignment experiment. Figure 3 shows a tree comparing the five distance functions. The tree was obtained by hierarchical single linkage clustering of the distance functions. Based on this tree, Euclidean and Entropy Surface are the closest in terms of the alignments produced. Height in the figure represents distance between the functions.

4 Clustering

Using the repeats from Human Chromosome 10 we produced clusters using the Hierarchical Agglomerative Clustering (HAC) method using the single linkage algorithm [?]. Hierarchical Clustering is a widely used algorithm despite its time complexity. The HAC algorithm is a bottom-up strategy and initially places all data points as singleton clusters. It then merges these clusters into larger and larger clusters based on the cluster linkage criteria. The single linkage method works by merging two clusters or points which are closest to each other. The hierarchical clustering algorithm takes as input an N × N distance matrix and a cut-off value which specifies at which height the clustering is terminated. We performed this clustering procedure with the different distance functions. (??? what was the cutoff)
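Cutting a single-linkage dendrogram at a given height yields the connected components of the graph that joins every pair of points whose distance is at most the cut-off. The sketch below uses that equivalence with a small union-find structure in plain Python; it is not the R routine used in the analysis, and the function name and the toy distance matrix are hypothetical.

```python
def single_linkage_clusters(dist, cutoff):
    """Clusters obtained by cutting single-linkage clustering at `cutoff`.

    `dist` is a symmetric N x N matrix (list of lists). Two points end up
    in the same cluster if a chain of pairs with distance <= cutoff
    connects them.
    """
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= cutoff:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Points 0-1 and 1-2 are close, 0-2 is not: single linkage chains them.
toy = [
    [0.00, 0.05, 0.20, 0.90],
    [0.05, 0.00, 0.08, 0.90],
    [0.20, 0.08, 0.00, 0.90],
    [0.90, 0.90, 0.90, 0.00],
]
clusters = single_linkage_clusters(toy, cutoff=0.1)
```

In the toy matrix, points 0 and 2 are farther apart than the cut-off yet share a cluster because point 1 links them, which is the chaining behavior characteristic of single linkage.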

4.1 Cluster Validation

The importance and effect of cluster structure with respect to tandem repeat families is still unclear. However, we analyze the shape and density of the clusters and would like to produce good clusters using these metrics. We assess the quality of clusters produced by the Hierarchical Clustering method using the cluster validation techniques Average Cluster Density and Silhouette Index defined by [?]. Consequently we can rank the individual distance functions based on the quality of clusters they produce.

Figure 3: Cluster tree depicting the relative comparison of distance functions.

Figure 4: Hierarchical Cluster tree of repeats in Human Chromosome 10.

Average Cluster Density: This measures the compactness and density of the clusters. The cluster density is calculated by using the cluster diameter, which is the largest distance between any pair of points in the cluster.

\[
\text{ClusterDensity} = \frac{\text{ClusterDiameter}}{\text{AverageLength}} \tag{3}
\]

where Average Length is the average distance between any two points in the cluster. The Average Cluster Density is the average over all clusters. If the Average Cluster Density is close to 1, we have highly compact clusters.
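Formula (3) as written can be sketched directly. The function names are hypothetical, and two choices here are assumptions not stated in the text: singleton clusters are skipped when averaging (they have no pairwise distances), and a cluster whose members are all at distance zero is given a density of 1.

```python
from itertools import combinations


def cluster_density(dist, members):
    """Cluster diameter divided by the average pairwise distance."""
    pairs = [dist[i][j] for i, j in combinations(members, 2)]
    if not pairs:
        raise ValueError("density is undefined for a singleton cluster")
    diameter = max(pairs)
    average = sum(pairs) / len(pairs)
    if average == 0:
        return 1.0  # assumption: identical points form a fully compact cluster
    return diameter / average


def average_cluster_density(dist, clusters):
    """Mean density over all clusters with at least two members."""
    multi = [c for c in clusters if len(c) > 1]
    return sum(cluster_density(dist, c) for c in multi) / len(multi)


toy = [
    [0, 1, 2],
    [1, 0, 3],
    [2, 3, 0],
]
density = cluster_density(toy, [0, 1, 2])
```

For the toy cluster the diameter is 3 and the average pairwise distance is 2, giving 1.5; a cluster whose pairwise distances are all equal gives exactly 1.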

Silhouette Width: This is a measure of the membership of an object i to a cluster C. The silhouette width shows which objects lie well within the cluster and which ones are between clusters. Consider an object i of the data set, and let Ci denote the cluster to which it is assigned. We calculate: 1) a(i) = the average distance of i to all other objects of Ci; 2) for each cluster C such that C ≠ Ci, d(i, C) = the average distance of i to all objects of C; and 3) over all clusters C ≠ Ci, b(i) = min(d(i, C)), the average distance of i to its nearest neighbor cluster. Silhouette width S(i) is given by

\[
S(i) = \frac{b(i) - a(i)}{\max\{b(i), a(i)\}}
\]

From this, S(i) lies between -1 and +1. The average silhouette Savg(i) is the average over all the objects in the dataset. If Savg(i) is close to 1, the objects are well clustered or structured.
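The three steps and the formula for S(i) can be sketched as below. The function name is hypothetical, the input must contain at least two clusters, and assigning a width of 0 to objects in singleton clusters is a convention assumed here rather than something stated in the text.

```python
def silhouette_widths(dist, clusters):
    """Return {object index: S(i)} for a clustering given as lists of indices.

    `dist` is a symmetric N x N distance matrix; `clusters` must contain
    at least two clusters.
    """
    widths = {}
    for ci, cluster in enumerate(clusters):
        for i in cluster:
            others = [j for j in cluster if j != i]
            if not others:
                widths[i] = 0.0  # assumed convention for singleton clusters
                continue
            # a(i): average distance to the other members of i's own cluster.
            a = sum(dist[i][j] for j in others) / len(others)
            # b(i): smallest average distance to any other cluster.
            b = min(
                sum(dist[i][j] for j in other) / len(other)
                for cj, other in enumerate(clusters)
                if cj != ci
            )
            widths[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return widths


# Two tight pairs that are far from each other.
toy = [
    [0, 1, 4, 4],
    [1, 0, 4, 4],
    [4, 4, 0, 1],
    [4, 4, 1, 0],
]
widths = silhouette_widths(toy, [[0, 1], [2, 3]])
average_width = sum(widths.values()) / len(widths)
```

Every object in the toy example has a(i) = 1 and b(i) = 4, so each silhouette width, and therefore the average, is 0.75.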

We calculate the cluster statistics using different distance functions. Table 3 shows the cluster qualities of each of the five distance functions on Human Chromosome 10. We chose the Entropy Weighted distance function for the remainder of our analysis because it scores well on both measures and was best in terms of number of clusters produced and the percentage of repeats clustered. Figure 4 shows the cluster dendrogram of repeats in Human Chromosome 10 with a distance cut-off of 25.5.

Figure 5: These graphs show the relationship between the number of clusters and percentage of repeats clustered at different distance cut-offs (75% – 99%) using the Hierarchical clustering method and the Entropy-weighted distance function. The "mountain" line is the number of multi-repeat clusters (unary clusters are not counted), the descending line is the number of repeats in multi-repeat clusters. Comparison of these graphs with those produced by the other functions (not shown) indicated that Entropy-weighted was able to cluster a higher percentage of repeats than the other distance functions. This was one criterion for picking Entropy-Weighted as the preferred distance function.

4.2 Multiple-phase Clustering

A defect of hierarchical clustering is that clusters can be low-quality in the sense that they are elongated and less dense. To split these chained clusters formed by single-linkage, we can subject them to clustering again, using other partition based clustering methods. Using the cluster validation techniques, we can identify these low-quality clusters. We use the Partition around Medoids (PAM) [?] to re-cluster the chained clusters, splitting them into smaller clusters.

PAM is one of the variants of the popular k-means approach but is more robust than k-means because medoids are less influenced by outliers. PAM works by iteratively finding representative objects, called medoids, in the clusters. PAM requires as input K, which is the number of clusters to be formed from the data set, and an N × N distance matrix. To determine K, we run PAM on the data set several times, each time with a different K, and select the K which yields the highest Average Silhouette Width. PAM works effectively for small data sets but does not scale well for large data sets because of its time complexity, O(K(N − K)^2), where N is the number of data points and K is the number of clusters. Human Chromosome 10 when subjected to this multi-phase clustering yielded 44 clusters with an Average Silhouette Width of 0.76.
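The selection of K by Average Silhouette Width can be sketched as follows. Note the substitution: instead of PAM's iterative build and swap search, this sketch finds the medoids by exhaustive search over all K-subsets, which optimizes the same objective (total distance of each point to its nearest medoid) but is only practical for very small inputs. All names are hypothetical, and singleton clusters are assumed to contribute a silhouette width of 0.

```python
from itertools import combinations


def k_medoids_exhaustive(dist, k):
    """Clusters for the k medoids minimizing total distance to the nearest
    medoid. Exhaustive search, NOT PAM's swap heuristic."""
    n = len(dist)
    medoids = min(
        combinations(range(n), k),
        key=lambda ms: sum(min(dist[i][m] for m in ms) for i in range(n)),
    )
    clusters = {m: [] for m in medoids}
    for i in range(n):
        clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
    return [c for c in clusters.values() if c]


def average_silhouette(dist, clusters):
    """Average silhouette width; requires at least two clusters."""
    total = 0.0
    for ci, cluster in enumerate(clusters):
        for i in cluster:
            if len(cluster) == 1:
                continue  # assumed convention: singletons contribute 0
            a = sum(dist[i][j] for j in cluster if j != i) / (len(cluster) - 1)
            b = min(
                sum(dist[i][j] for j in other) / len(other)
                for cj, other in enumerate(clusters)
                if cj != ci
            )
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
    return total / len(dist)


def choose_k(dist, k_values):
    """Return (best K, its average silhouette width); each K must be >= 2."""
    scored = [
        (average_silhouette(dist, k_medoids_exhaustive(dist, k)), k)
        for k in k_values
    ]
    best_score, best_k = max(scored)
    return best_k, best_score


# Six points on a line forming two well-separated groups of three.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in positions] for p in positions]
best_k, best_score = choose_k(dist, [2, 3, 4])
```

On this toy data the two natural groups are recovered with K = 2, which gives the highest average silhouette width among the values tried.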

Figure 6 shows an example where a poor-quality cluster produced by the Hierarchical method on Human Chromosome 10 was re-clustered using the PAM method. Initially the Hierarchical clustering produced a cluster containing 138 repeats and a Sil Width of 0.45. Running PAM on this cluster produced two smaller clusters while increasing the Sil Width to 0.6. As the alignments illustrate, repeats within a cluster are much more closely related than those between clusters.

Table 3: Human Ch10 clustering results using different distance functions (??? at what cutoff)

Distance            No. of      Sil      Avg
function            clusters    Width    Diameter
Consensus           38          0.8      0.85
Entropy-Weighted    40          0.73     0.90
Euclidean           38          0.64     0.93
Entropy-Surface     40          0.76     0.89
Jensen-Shannon      36          0.62     0.93

5 Conclusion

We have described a new quantitative approach to evaluate distance functions with respect to alignments, and we study their effects on discovering families of tandem repeats. We describe a relative comparison between the distance functions and also an individual evaluation using cluster validation techniques. Tandem repeats from Human Chromosome 10 were clustered using a multi-phase approach that combines the Hierarchical Agglomerative method and Partition around Medoids. Our results show that for clustering repeats a multi-phase clustering approach produces better quality clusters. The two entropy based functions, Entropy Weighted and Entropy Surface, outscore the other distance functions in our alignment experiment, in the quality and number of clusters produced, and in the number of repeats clustered. Future clustering tools in the Tandem Repeats Database will employ entropy based distance functions and multi-phase clustering as demonstrated in this work.

References

[1] TRDB at http://tandem.bu.edu/cgi-bin/trdb/trdb.exe.

[2] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. J. Mol. Biol., 215:403–410, 1990.

[3] G. Benson. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27:573–580, 1999.

[4] G. Benson. A new distance measure for comparing sequence profiles based on paths along an entropy surface. In Proceedings of the European Conference on Computational Biology 2002, 2002.

[5] M. Gribskov, R. Lüthy, and D. Eisenberg. Profile analysis. Methods in Enzymology, 183:146–159, 1990.

[6] M. Maes. On a cyclic string-to-string correction problem. Information Processing Letters, 35:73–78, 1990.

[7] T. Smith and M. Waterman. Comparison of biosequences. Advances in Applied Mathematics, 2:482–489, 1981.


Figure 6: Result of PAM on a cluster produced by Hierarchical Clustering method.
