A generic motif discovery algorithm for sequential data

BIOINFORMATICS

Kyle L. Jensen^a, Mark P. Styczynski^a, Isidore Rigoutsos^a,b, Gregory N. Stephanopoulos^a,∗

a Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
b IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA

ABSTRACT

Motivation: Motif discovery in sequential data is a problem of great interest and with many applications. However, previous methods have been unable to combine exhaustive search with complex motif representations and are each typically only applicable to a certain class of problems.

Results: Here we present a GEneric MOtif DIscovery Algorithm (Gemoda) for sequential data. Gemoda can be applied to any dataset with a sequential character, including both categorical and real-valued data. As we show, Gemoda deterministically discovers motifs that are maximal in composition and length. In addition, the algorithm allows any choice of similarity metric for finding motifs. Finally, Gemoda’s output motifs are representation-agnostic: they can be represented using regular expressions, position weight matrices, or any number of other models for any type of sequential data. We demonstrate a number of applications of the algorithm, including the discovery of motifs in amino acid sequences, a new solution to the (l,d)-motif problem in DNA sequences, and the discovery of conserved protein sub-structures.

Availability: Gemoda is freely available at http://web.mit.edu/bamel/gemoda.

Contact: [email protected]

Supplementary Information: Available at http://web.mit.edu/bamel/gemoda.

INTRODUCTION

Motif discovery encompasses a wide variety of methods used to find recurrent trends in data. In bioinformatics, the two predominant applications of motif discovery are sequence analysis and microarray data analysis. Less common applications include discovering structural motifs in proteins and RNA (Holm et al., 1992; Murthy and Rose, 2003).

Motif discovery in sequence analysis typically involves the discovery of binding sites, conserved domains, or otherwise discriminatory subsequences. There are many publicly-available tools, each of which is quite adept at addressing a specific subclass of motif discovery problems. Some of the commonly-used tools for motif discovery in nucleotide and amino acid sequences include MEME (Bailey and Elkan, 1994), Gibbs sampling (Lawrence et al., 1993), Consensus (Hertz and Stormo, 1999), Block Maker (Henikoff et al., 1995), Pratt (Jonassen et al., 1995), and Teiresias (Rigoutsos and Floratos, 1998). Newer, less-widely used tools include Projection (Buhler and Tompa, 2001), MultiProfiler (Keich and Pevzner, 2002), MITRA (Eskin and Pevzner, 2002), and ProfileBranching (Price et al., 2003). This list is not intended to be exhaustive; however, it is indicative of the wealth of options available for solving such problems.

∗ To whom correspondence should be addressed.

All of the existing motif discovery tools for nucleotide and amino acid sequences can be classified on a spectrum ranging from exhaustive tools using simple motif representations to non-exhaustive tools using more complex representations. The majority of the tools can be found at the extreme ends of the spectrum, with tools that exhaustively enumerate regular expressions (or single consensus sequences) at one end and probabilistic tools, based on position weight matrices (PWMs), at the other. This partitioning of tools is due to a computational trade-off: more descriptive motif representations such as PWMs frequently make exhaustive searches computationally infeasible.

Depending on the task at hand, a specific type of motif discovery tool may be more useful than others. For example, the PWM-based tools excel at finding cis-regulatory binding elements (Tompa et al., 2005), whereas the regular expression-based tools are well-suited to finding conserved domains in large protein families (Rigoutsos et al., 1999). Generally, it can be difficult to know a priori which motif discovery tool is best suited to a given problem.

ALGORITHM

Gemoda was designed to meet the demand for complex motif representations, like PWMs, while still being exhaustive. The philosophical underpinnings of the Gemoda algorithm can be traced back to Teiresias (Rigoutsos and Floratos, 1998); Winnower (Pevzner and Sze, 2000); the algorithm by Mancheron and Rusu (2003); and a variety of algorithms for association mining (Zaki, 2000; Zaki and Ogihara, 1998). In particular, Gemoda shares some of its logical steps with the Teiresias algorithm while incorporating a more flexible definition of “similarity” and allowing motif representations other than regular expressions.

Gemoda’s design goals can be summarized as follows: exhaustive discovery of all maximal motifs in a way that allows flexibility in motif representation, incorporation of a variety of similarity metrics, and the ability to handle diverse sequential data types. Each point of emphasis can be explained as follows:

• Exhaustive discovery: Gemoda’s combinatorial nature provides an algorithmic guarantee that all motifs meeting certain criteria are deterministically discovered.

© The Author (2005). Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Bioinformatics Advance Access published October 27, 2005

• Maximal motifs: Gemoda returns only motifs that are maximal in both length and composition with respect to the similarity and clustering functions.

• Motif representation: The motifs discovered by Gemoda are reported as short multiple sequence alignments (in the case of motif discovery in nucleotide and amino acid sequences) and can be modeled using regular expressions, PWMs/PSSMs, Markov models, or any other representation.

• Similarity metrics: Any criterion, ranging from sequence alignment scores to geometric functions, may be used to compare sequences.

• Sequential data types: The nature of Gemoda’s computations is not unique to any specific type of data, and thus Gemoda can be used on any data with a sequential character, that is, data in which there is a natural left-to-right order, such as a sequence of nucleotides or amino acids. In the most general sense, sequential data also include real-valued series data, such as a stock price or the ordered $(x, y, z)$ triplets of an alpha-carbon trace in a protein structure.

The algorithm has three distinct phases: comparison, clustering, and convolution. During the comparison phase, short overlapping windows in the data set are compared. During clustering, these windows are grouped together to form elementary motifs. Finally, during convolution, these motifs are “stitched” together to form maximal motifs (see Figure 5). In the following sections, we give some brief definitions and nomenclature, then describe each of the algorithm’s three phases in detail. Finally, we illustrate a few applications of Gemoda.

Preliminary definitions and nomenclature

The input to Gemoda is a set of sequences of data points $S = \{s_1, s_2, \ldots, s_n\}$, where sequence $s_i$ has length $W_i$. So, for example, the $j$th member of the $i$th sequence is denoted by $s_{i,j}$. Each $s_{i,j}$ is a primitive, or atomic unit, for the data that is being analyzed. For time-series data, $s_{i,j}$ may be a point sampled from $\mathbb{R}^K$ (with $K$ arbitrary), whereas for a DNA sequence it would be one of the characters $\{A, T, G, C\}$.

Typically, one seeks motifs of a minimal, domain-dependent length. We denote this minimum length by $L$ and we define a matrix $A$ of size $N \times N$, where $N = \sum_{i=1}^{n} (W_i - L + 1)$. That is, $A$ is a matrix with one row and one column for each window of size $L$ in our entire sequence set. For example, the 10th window of size $L$ in the 5th sequence would be expressed as $s_{5,10:10+L-1}$, where “$10 : 10 + L - 1$” denotes “position 10 through position $10 + L - 1$, inclusive.” To keep track of which window corresponds to which index in $A$, we define the one-to-one function $\mathcal{M}(s_{i,j:j+L-1}) \mapsto q \in [1, N]$. (For simplicity, we define $(s_{i,j} + 1)$ to be $s_{i,j+1}$, unless $s_{i,j+1}$ does not exist, in which case $(s_{i,j} + 1)$ is undefined.) Similarly, $\mathcal{M}^{-1}(q) \mapsto (s_{i,j:j+L-1})$ such that $i \in [1, n]$ and $j \in [1, W_i - L + 1]$.
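The window bookkeeping above can be made concrete with a small sketch. The following Python fragment is an illustration only, not the Gemoda implementation: the function name `build_window_index` and the dictionary/list representation of $\mathcal{M}$ and $\mathcal{M}^{-1}$ are assumptions introduced here, and it assumes every sequence has length $W_i \geq L$.

```python
from typing import Dict, List, Tuple

Window = Tuple[int, int]  # (sequence number i, start position j), both 1-based


def build_window_index(lengths: List[int], L: int) -> Tuple[Dict[Window, int], List[Window]]:
    """Enumerate all windows of size L and number them 1..N.

    `forward` plays the role of M (window -> q) and `inverse` the role of
    M^-1 (q -> window, stored at list position q - 1).
    Hypothetical helper, assuming W_i >= L for every sequence.
    """
    forward: Dict[Window, int] = {}
    inverse: List[Window] = []
    for i, W in enumerate(lengths, start=1):
        for j in range(1, W - L + 2):  # j runs over 1 .. W_i - L + 1
            inverse.append((i, j))
            forward[(i, j)] = len(inverse)
    return forward, inverse


# Two toy sequences of lengths 6 and 5, window size L = 4.
lengths = [6, 5]
L = 4
forward, inverse = build_window_index(lengths, L)
N = len(inverse)  # equals sum over i of (W_i - L + 1)
```

With this layout, the successor of a window (the "+1" operation in the text) is simply the next index within the same sequence, and it is undefined for the last window of each sequence.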

We also define a similarity function $\mathcal{S}(s_{i,j:j+L-1}, s_{q,z:z+L-1})$ that takes as arguments two arbitrary windows and returns a real-valued number indicating the level of similarity between the two windows. In the simplest case, $\mathcal{S}$ may use the identity matrix to count how many DNA bases two windows have in common; for real-valued data, the function may return the sum-of-squares error between two windows or any other measure of similarity.

We define a motif $p$ as a data structure with two features: a width $\mathcal{W}(p)$ and a list of locations in the data where the motif occurs, $\mathcal{L}(p)$. A motif has the property that the locations in $\mathcal{L}(p)$ meet some predefined clustering requirements (discussed below) based on the similarity function $\mathcal{S}$ for each window of length $L$ within the motif. The support of a motif is equal to the number of its occurrences (or “embeddings”), $|\mathcal{L}(p)|$.
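As a minimal sketch of this data structure, a motif can be held as a width plus a list of locations. The class name `Motif` and the choice to store each location as a (sequence number, start position) pair are assumptions made for illustration; the text does not prescribe a concrete layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Motif:
    """Illustrative container for a motif p (hypothetical layout)."""
    width: int                                   # W(p)
    locations: List[Tuple[int, int]] = field(default_factory=list)  # L(p)

    @property
    def support(self) -> int:
        """Number of occurrences ("embeddings"), |L(p)|."""
        return len(self.locations)


# A motif of width 5 occurring at position 3 of sequence 1 and position 8 of sequence 2.
example = Motif(width=5, locations=[(1, 3), (2, 8)])
```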

We say a maximal motif is a motif which has the following properties:

1. The motif’s width cannot be extended in either direction (left or right) without producing a motif with fewer embeddings (i.e., without $|\mathcal{L}(p)|$ decreasing); and

2. The motif is not missing any instances, i.e. $\mathcal{L}(p)$ includes the locations of all instances of the motif.

These two criteria can be summarized qualitatively by stating that a maximal motif is not “missing” any locations and is as wide as possible, and thus it is as specific and sensitive as possible.

Given these explanations and definitions, we can now detail the computations involved in each phase of the Gemoda algorithm. A simple natural-language example illustrating how each phase proceeds is included in the supplementary materials.

Comparison phase

In the comparison phase of the Gemoda algorithm, the sequences are divided into overlapping windows of size $L$ which are then compared to each other in a pairwise manner to produce a similarity matrix, $A$ (see Figure 5). Formally, $A_{i,j} = \mathcal{S}(\mathcal{M}^{-1}(i), \mathcal{M}^{-1}(j)) = \mathcal{S}(s_{i,j:j+L-1}, s_{q,z:z+L-1})$.

$A$ is then, quite simply, a similarity matrix for all $N$ windows based on the similarity function $\mathcal{S}$. In most cases, $\mathcal{S}$ is commutative (and the $A$ matrix is symmetric); however, this is not a requirement.
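A minimal sketch of the comparison phase for categorical data follows. It uses the identity-count similarity mentioned earlier (the number of positions at which two windows agree); the function names and the dense list-of-lists matrix are assumptions chosen for readability, not a description of how the distributed software stores $A$.

```python
from typing import Callable, List, Sequence, Tuple


def identity_score(a: Sequence, b: Sequence) -> int:
    """Count positions at which two equal-length windows agree."""
    return sum(x == y for x, y in zip(a, b))


def similarity_matrix(
    seqs: List[str], L: int, score: Callable[[Sequence, Sequence], float]
) -> Tuple[List[str], List[List[float]]]:
    """Slide a window of size L over every sequence and compare all pairs.

    Returns the list of windows (in index order) and the N x N matrix A.
    Illustrative only: this is quadratic in the number of windows.
    """
    wins = [s[j:j + L] for s in seqs for j in range(len(s) - L + 1)]
    A = [[score(a, b) for b in wins] for a in wins]
    return wins, A


seqs = ["ACGTAC", "CGTAA"]
wins, A = similarity_matrix(seqs, L=4, score=identity_score)
# wins is ["ACGT", "CGTA", "GTAC", "CGTA", "GTAA"]
```

Any other scoring function with the same two-window signature (for example, one based on a substitution matrix or a geometric distance) could be passed as `score` without changing the rest of the sketch.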

Clustering phase

The purpose of the clustering phase is to use the similarity matrix $A$ to group similar windows in clusters. These clusters will become “elementary motifs” from which the final, maximal motifs will be constructed.

We define a clustering function $\mathcal{C}(A) = c^L = \{c^L_1, c^L_2, \ldots, c^L_Z\}$, where each $c^L_i$ is a set of indices in $A$ and $c^L_i[q]$ is the $q$th member of $c^L_i$. Note that $\mathcal{C}$ can be any function; common clustering functions include hierarchical clustering, k-nearest-neighbors clustering, and many others. We call each $c^L_i$ an “elementary motif” of length $L$. We note that a clustering function may assign each node (window) to one or more groups. In the latter case, each $c^L_i$ may have a non-null intersection with any $c^L_j$.

Convolution phase

The purpose of this phase is to “stitch together” the elementary motifs to generate the final, maximal motifs (Rigoutsos and Floratos, 1998). For the purposes of Gemoda (and consistent with the above concept of convolution), we say that a motif $h$ of width $\mathcal{W}(h) > L$ meets the similarity criterion if, for each window of length $L$ completely within the motif, all instances participate in a cluster together based on $\mathcal{S}$ and $\mathcal{C}$. In this manner, we can piece together longer continuous motifs from smaller motifs that all meet the similarity criterion over windows of length $L$.


Next we define the “directed intersection” of two elementary motifs, $c^L_i \curvearrowright c^L_j = c^{L+1}_r$, where $c^{L+1}_r$ is the set of those indices $q$ in $c^L_i$ such that $\mathcal{M}(\mathcal{M}^{-1}(c^L_i[q]) + 1)$ is in $c^L_j$. That is, $c^{L+1}_r$ is the set of indices in $c^L_i$ that are located, in the sequences $S$, one position earlier than the indices in $c^L_j$. $c^{L+1}_r$ is then a motif of length $L + 1$.

We define the operation “$<$” as follows: $c^L_i \curvearrowright c^L_j < c^{L+1}$ is true if the set of indices $c^L_i \curvearrowright c^L_j$ is a subset or a superset of the indices in any member of $c^{L+1}$. This operation compares a convolved motif of length $L + 1$ to all previously-convolved motifs of length $L + 1$ to identify significant overlap: if the list of locations in the proposed motif is a superset or subset of the list for any other motif, the result of this operation is true. With this step, Gemoda can identify and eliminate redundant and non-maximal motifs.

If $c^L_i \curvearrowright c^L_j < c^{L+1}$, then all super- or sub-sets of the proposed convolved motif are removed from $c^{L+1}$; these windows are then taken together with the proposed motif, and the union of those sets of windows is returned to $c^{L+1}$.

Our objective is to find all the maximal motifs in the sequence set using the elementary patterns. We do this by performing $c^k_i \curvearrowright c^k_j$ for all $i$ and $j$ at each length $k \geq L$ until $c^k$ is empty ($|c^k| = 0$). We then define the set of maximal motifs comprising $c^k$ for all $k$ as $P$, the final set of motifs that are returned to the user. This simple induction scheme guarantees that all (and only) the maximal motifs are in $P$ given appropriate clustering functions (see supplementary materials).
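The convolution loop can be sketched in a few lines. The fragment below is one plausible reading of the description above, not the authors’ implementation: clusters are frozensets of 0-based window indices, `nxt` is a hypothetical successor function standing in for the "+1" operation, and a motif is treated as non-maximal when a convolution extends it to the left or right without losing any embeddings. The `min_support` cut-off is likewise an assumption.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

Cluster = FrozenSet[int]


def directed_intersection(ci: Cluster, cj: Cluster,
                          nxt: Callable[[int], Optional[int]]) -> Cluster:
    """Indices q in ci whose successor window nxt(q) lies in cj."""
    return frozenset(q for q in ci if nxt(q) is not None and nxt(q) in cj)


def convolve(elementary: List[Cluster], nxt: Callable[[int], Optional[int]],
             min_support: int = 2) -> List[Tuple[int, Cluster]]:
    """Return (extra_length, cluster) pairs judged maximal.

    extra_length = 0 means length L, 1 means L + 1, and so on.
    """
    maximal: List[Tuple[int, Cluster]] = []
    level = list(dict.fromkeys(elementary))  # de-duplicate, keep order
    extra = 0
    while level:
        non_maximal = set()
        next_level: List[Cluster] = []
        for ci in level:
            for cj in level:
                r = directed_intersection(ci, cj, nxt)
                if len(r) < min_support:
                    continue
                if len(r) == len(ci):
                    non_maximal.add(ci)  # extended rightwards, no loss
                if len(r) == len(cj):
                    non_maximal.add(cj)  # extended leftwards, no loss
                # Merge with any stored sub- or superset, as in the text.
                related = [m for m in next_level if m <= r or m >= r]
                for m in related:
                    next_level.remove(m)
                    r = r | m
                next_level.append(r)
        maximal.extend((extra, c) for c in level if c not in non_maximal)
        level, extra = next_level, extra + 1
    return maximal


# Toy layout: sequence 1 owns windows 0..3, sequence 2 owns windows 4..7.
def nxt(q: int) -> Optional[int]:
    return None if q in (3, 7) else q + 1


elementary = [frozenset({0, 4}), frozenset({1, 5}), frozenset({2, 6})]
result = convolve(elementary, nxt)
```

In this toy case the three elementary motifs are consecutive in both sequences, so they are stitched into a single motif of length $L + 2$ that starts at windows 0 and 4.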

Implementation

Choice of clustering function

Gemoda can use any clustering function; however, as the size of the input sequence set increases, storing the matrix $A$ can become practically difficult. In these cases, it can be easier to store true/false values in $A$, where the value is true if the similarity score between two windows is better than a user-defined threshold $g$. The matrix $A$ can then be viewed as an unweighted, undirected graph with a vertex for each window and edges between those nodes with pairwise similarity scores better than $g$ (see Figures 5 and 2). When constructed as such, we have found that clustering functions based on finding either cliques (see footnote 1) or connected components (maximal disjoint subgraphs) can be effective for motif discovery in diverse applications.

In the case where the clustering function $\mathcal{C}(A)$ is chosen such that each $c^L_i$ is a clique in the $g$-thresholded $A$ matrix, the Gemoda algorithm has a guarantee of compositional and length maximality, relative to the threshold $g$. That is, Gemoda will discover all motifs where each pair of instances has a similarity score better than $g$ over every window of size $L$, there are no “missing” instances having this property, and the motif cannot be extended either to the left or right (see inductive proof in the supplementary material).
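To illustrate clique-based clustering, the sketch below thresholds a similarity matrix and enumerates maximal cliques with the Bron-Kerbosch algorithm with pivoting. This particular enumeration algorithm is a choice made here for illustration; the text does not state which clique-finding method the software uses. The sketch also assumes that higher scores are better and that "better than $g$" means a score of at least $g$.

```python
from typing import Dict, FrozenSet, List, Set


def threshold_graph(A: List[List[float]], g: float) -> Dict[int, Set[int]]:
    """Adjacency sets of the undirected graph with an edge where both
    A[u][v] and A[v][u] reach the threshold g (assumed convention)."""
    n = len(A)
    return {u: {v for v in range(n)
                if v != u and A[u][v] >= g and A[v][u] >= g}
            for u in range(n)}


def maximal_cliques(adj: Dict[int, Set[int]]) -> List[FrozenSet[int]]:
    """Bron-Kerbosch with pivoting; returns every maximal clique."""
    cliques: List[FrozenSet[int]] = []

    def expand(R: Set[int], P: Set[int], X: Set[int]) -> None:
        if not P and not X:
            cliques.append(frozenset(R))
            return
        # Pivot on the vertex with the most neighbours still in P.
        pivot = max(P | X, key=lambda u: len(adj[u] & P))
        for v in list(P - adj[pivot]):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}

    expand(set(), set(adj), set())
    return cliques


# Windows 0, 1, 2 are mutually similar; window 3 is similar only to window 2.
A = [[4, 3, 3, 0],
     [3, 4, 3, 1],
     [3, 3, 4, 3],
     [0, 1, 3, 4]]
clusters = maximal_cliques(threshold_graph(A, g=3))
```

Note that window 2 belongs to two clusters here, which is the overlapping-cluster situation described in the clustering phase.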

Clique enumeration is NP-complete (Garey and Johnson, 1979; Tomita et al., 1989); however, in practice this complexity is usually not an issue because the density (the ratio of the number of edges to the number of vertices) of graphs is usually low for datasets of nucleotide or amino acid sequences (with reasonable choice of $g$).

1 We define a clique as a maximal, fully-connected subgraph. It may be alternatively defined without the requirement for maximality, thus making the clusters we discuss “maximal cliques”. We use the former definition for the sake of brevity and clarity when discussing the maximality of extending motifs.

In the case where the clustering function $\mathcal{C}(A)$ is chosen such that each $c^L_i$ is a maximal disjoint subgraph in the $g$-thresholded $A$ matrix (i.e., $c^L$ represents the connected components of $A$), the computational complexity for the clustering phase is significantly less than for clique-based clustering. In addition, in the case where Gemoda is applied to nucleotide and amino acid sequences, the motifs from this connected components method may be more intuitive than motifs found using clique-based clustering.
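Connected-components clustering can be sketched with a breadth-first search over the thresholded matrix. As before, this is an illustration under assumptions: the function name is hypothetical, $A$ is taken to be symmetric, and a score of at least $g$ is treated as an edge.

```python
from collections import deque
from typing import FrozenSet, List


def connected_components(A: List[List[float]], g: float) -> List[FrozenSet[int]]:
    """Group window indices into connected components of the
    g-thresholded graph (edge where A[u][v] >= g, A assumed symmetric)."""
    n = len(A)
    seen = set()
    components: List[FrozenSet[int]] = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if v not in seen and v != u and A[u][v] >= g:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(frozenset(component))
    return components


# Windows 0-3 are chained together by similar pairs; window 4 is isolated.
A = [[4, 3, 3, 0, 0],
     [3, 4, 3, 1, 0],
     [3, 3, 4, 3, 0],
     [0, 1, 3, 4, 1],
     [0, 0, 0, 1, 4]]
components = connected_components(A, g=3)
```

Each window lands in exactly one component, and the search visits each matrix entry at most once, which is consistent with the lower clustering cost noted above. Windows 0 and 3 end up in the same cluster even though they are not directly similar, which is the key difference from clique-based clustering.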

The space and time requirements of this implementation are modest. In most cases, memory usage is not a limiting factor. For instance, the peak memory usage for a large sequence set containing 65,000 characters is 1 GB, within the reach of many personal computers. Furthermore, the examples given in this work can all be run in reasonable times. The amino acid sequence example and protein structure example take at most tens of seconds on an average desktop PC, while the hardest of the DNA sequence examples takes two hours. These times are more than reasonable given the exhaustive guarantees provided by the algorithm.

Estimation of motif significance

The absolute significance of motifs depends strongly on the choice of the similarity metric and clustering function and is difficult to derive a priori. However, for a specific pair of similarity metric and clustering function, the relative significance can be easy to calculate. For the clique-based clustering function described above, the relative significance can be estimated solely from the matrix $A$ using a bootstrapping method. (A description of this calculation is included in the supplementary materials.) Such significance calculations are equally valid for many different motif discovery problems (e.g., nucleotide sequences or protein structures) because the calculation method uses only the matrix $A$: it is data-type agnostic.

Summary of user-supplied parameters

The input to Gemoda is a set of sequences (categorical or real-valued), a window length, a similarity function, and a clustering function. Various clustering functions may require other parameters. For example, the clique-finding and connected components clustering algorithms discussed above require both a threshold parameter $g$ and, optionally, a minimal support parameter $k$. Other parameters can be easily incorporated into various clustering functions, such as a “unique support” parameter $p$ that limits returned motifs to those that occur in at least $p$ different sequences.
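The two support parameters can be illustrated with a short filter. This is a sketch only: the function name `filter_by_support` and the representation of a motif as a list of (sequence number, position) pairs are assumptions, and in practice such a filter could equally be applied inside the clustering function.

```python
from typing import List, Tuple

Locations = List[Tuple[int, int]]  # (sequence number, start position)


def filter_by_support(motifs: List[Locations], k: int = 2, p: int = 1) -> List[Locations]:
    """Keep motifs with at least k occurrences in at least p distinct sequences."""
    kept = []
    for locations in motifs:
        n_sequences = len({seq for seq, _ in locations})
        if len(locations) >= k and n_sequences >= p:
            kept.append(locations)
    return kept


motifs = [
    [(1, 4), (2, 9), (3, 1)],  # three occurrences in three sequences
    [(1, 4), (1, 30)],         # two occurrences, both in sequence 1
    [(2, 7)],                  # a single occurrence
]
kept = filter_by_support(motifs, k=2, p=2)
```

With $k = 2$ and $p = 2$, only the first motif survives: the second has enough occurrences but they fall in a single sequence, and the third occurs only once.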

Availability

We have written open source programs implementing the Gemoda algorithm that are publicly available at the following URL: http://web.mit.edu/bamel/gemoda. The software includes a number of “helper” applications for interoperability with common bioinformatics tools. For example, applications are included that allow users to model Gemoda’s output motifs (in the case of nucleotide or amino acid sequences) as PSSMs, using the pftools package available via the Prosite database (Hofmann et al., 1999), or as hidden Markov models, using the popular HMMer software (Eddy, 1998).

The implementation is distributed in two variants, each with a different comparison stage of the algorithm. The gemoda-s variant is for motif discovery in FastA-formatted text strings, typically nucleotide or amino acid sequences. The gemoda-r variant is used for motif discovery in sets of multi-dimensional, real-valued sequences. The gemoda-s variant is distributed with a number of similarity functions based on various nucleotide and amino acid substitution matrices. The gemoda-r variant is distributed with similarity functions based on the root mean square deviation, with options for optimal translation and rotation.

APPLICATION

In this section, we demonstrate Gemoda’s capability by presenting several sample applications. Specifically, we address motif discovery in amino acid sequences, in nucleotide sequences, and in protein structures.

As discussed previously, the clustering and convolution stages of the Gemoda algorithm are generic: they are independent of the nature of the input data. However, the comparison stage is data-specific. In what follows, we discuss how the comparison stage is changed for each kind of data and outline the types of results Gemoda is capable of finding.

Motif discovery in amino acid sequences

To use Gemoda to find motifs in amino acid sequences, the comparison stage needs to reflect the notion of “similarity” for amino acid sequences. Specifically, we choose a window comparison function $\mathcal{S}$ that returns a sequence alignment score, such as the bit-score from an amino acid scoring matrix (e.g., the popular Blosum matrices (Henikoff and Henikoff, 1992)).

Here, we demonstrate how Gemoda can be used for motif discovery in amino acid sequences by “discovering” known protein domains in the (ppGpp)ase family of enzymes. These eight enzymes catalyze the hydrolysis of guanosine 3’,5’-bis(diphosphate) to guanosine 5’-diphosphate (GDP) and are classified by the Enzyme Commission (EC) number 3.1.7.2 (Bairoch, 2000).

We used Gemoda to identify motifs in these eight (ppGpp)ase enzymes using the Blosum-62 scoring matrix as the basis of our similarity function $\mathcal{S}$ and the clique-based clustering function described previously. Specifically, we sought motifs that occurred in all eight sequences, were at least 50 residues long, and had a pairwise bit-score of at least 50 bits over a window of 50 residues.

With these parameters, Gemoda discovers four motifs in this set of eight sequences; the longest motif, with a length of 103 amino acids, is shown in Figure 1 as an alignment of the regions that correspond to instances of this motif (see also Figure 2). A comparison with the known protein domains in the NCBI Conserved Domain Database (version 2.02) (Marchler-Bauer et al., 2003) reveals that this motif captures the RelA_SpoT domain (CDD PSSM-id 15904).

The remaining three motifs are not present in the CDD database. However, further inspection using the tools available from the PFAM database (Bateman et al., 2004) revealed that they compose the left, middle, and right regions of the HD domain (Aravind and Koonin, 1998). In the SpoT enzymes, this domain has a number of insertions and deletions that give rise to gaps, so that Gemoda identified and reported the left, middle, and right regions of conservation of the HD domain individually.

In this example, the Blosum-62 matrix was chosen as the similarity metric because it is optimized for detecting distant homologs. The Gemoda input parameters $L = 50$ and $g = 50$ were chosen to enforce a score of one bit per residue, which should rise above random “noise” since, by design, the expected bit-score for two randomly aligned amino acids is negative for the Blosum set of scoring matrices.

To test the sensitivity of these results to noise, we conducted an experiment to determine the degree to which these (ppGpp)ase motifs could be found if obscured by noise caused by adding random spurious sequences to the 8 enzyme sequences. We found that, with the Gemoda input parameters described above and using random sequences selected from Swiss-Prot (Release 45.0) (Bairoch and Apweiler, 2000), the target motifs could still be detected in up to an eight-fold excess of spurious sequences.

Motif discovery in nucleotide sequences

The discovery of motifs in nucleotide sequences is most commonly used in the search for cis-regulatory elements. The “Motif Challenge Problem,” or the (l,d)-motif problem (Pevzner and Sze, 2000), is an abstraction of the cis-regulatory element discovery problem.

The original (l,d)–motif problem can be paraphrased as follows:

Within a set of random DNA sequences with i.i.d. nucleotides, a parent motif of length $l$ is embedded in each sequence in a random location. Each time the motif is embedded, it is mutated in $d$ locations. The (l,d)-motif problem is to recover the locations of the embeddings, knowing only the parameters $l$ and $d$ and that each sequence contains exactly one instance of the motif.

To a certain extent, this is a reasonable abstraction of the cis-regulatory element discovery problem. It is also a problem in which false positive motifs are not expected to occur by chance: the occurrence of a motif with an instance of $d$ or fewer mutations in each of the 20 sequences has a probability of approximately $10^{-15}$ for $l = 15$ and $d = 4$ (Buhler and Tompa, 2001). However, the probability is 0.057 that any two windows of length 15 may be 4 mutations from a common ancestor. In a set of 20 sequences each of length 600, one would then expect any given window to be “similar” to 663 other windows purely by chance. With such significant noise obscuring the smaller, easily-identifiable signal, this is a difficult problem that, as Pevzner and Sze (2000) pointed out, commonly-used tools are incapable of solving accurately.

Gemoda can provide a direct solution to this problem, using clique-based clustering and a comparison function based on the identity matrix. The selection of $g$ is simple, as any two motif instances, each with $d$ mutations in $l$ positions, must have at least $l - 2d$ bases in common. The only additional step necessary is to verify that each of the motif instances identified by Gemoda could have the same ancestor, a simple task. We have previously reported (Styczynski et al., 2004) that a data set used by Pevzner and Sze (2000) in their initial presentation of the challenge problem in fact had an instance of the parent motif that occurred completely by chance and had gone otherwise undetected. With Gemoda, we can easily identify this instance without any additional work or manipulation. The sequence logo for the planted motif from Pevzner and Sze’s initial dataset is shown in Figure 3; the consensus sequence is GGCTTTGTAGCTAAC. The “accidental” instance of the embedded motif that can be identified using Gemoda is GGATTGATAGCTAAG.
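The bound behind the choice of $g$ can be checked numerically. The sketch below plants two independently mutated copies of a parent motif and confirms that they never share fewer than $l - 2d$ bases; it also compares the two sequences quoted above. The helper names and the use of the reported consensus as a stand-in for the parent motif are assumptions made for illustration.

```python
import random

BASES = "ACGT"


def mutate(parent: str, d: int, rng: random.Random) -> str:
    """Return a copy of parent changed at exactly d distinct positions."""
    out = list(parent)
    for pos in rng.sample(range(len(parent)), d):
        out[pos] = rng.choice([b for b in BASES if b != out[pos]])
    return "".join(out)


def shared_bases(a: str, b: str) -> int:
    """Identity-matrix score: number of positions where a and b agree."""
    return sum(x == y for x, y in zip(a, b))


l, d = 15, 4
g = l - 2 * d  # two instances can differ at no more than 2d positions

consensus = "GGCTTTGTAGCTAAC"   # consensus quoted in the text
accidental = "GGATTGATAGCTAAG"  # chance instance quoted in the text

rng = random.Random(0)
worst = min(
    shared_bases(mutate(consensus, d, rng), mutate(consensus, d, rng))
    for _ in range(1000)
)
# worst is never below g; the two quoted sequences agree at l - d positions.
agreement = shared_bases(consensus, accidental)
```

The bound follows because each instance leaves at least $l - d$ positions untouched, so at least $l - 2d$ positions are untouched in both.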

Clearly, Gemoda was not originally designed to address the (l,d)-motif problem and, consequently, it does not exploit all of the characteristics of the problem to solve it in the fastest possible way. However, it does provide a direct, exhaustive solution to the problem that identifies otherwise undetectable results.

Identifying natural cis-regulatory elements

For some regulons in E. coli with mild to strong consensus sequences, Gemoda returns results that are similar to or improve upon the results from commonly-used motif discovery tools. For instance, using the set of upstream regions (400 base pairs upstream and 50 base pairs downstream of the translation start site) for the 9 operons believed to be regulated by LexA (Salgado et al., 2004), Gemoda’s top-scoring motif was used to generate the sequence logo found in Figure 3. This motif closely matches the literature PWM for the LexA binding site and represents 80% of the literature-found binding sites with no false positives. Of course, the difficulty of DNA motif discovery problems varies greatly, and this is only one straightforward example of such problems.

The parameters used for this search were $L = 20$, $g = 10$, and $k = 6$ with the identity matrix scoring scheme and clique-based clustering described above. The length was selected based on the knowledge that the DNA-binding domain of LexA is a helix-turn-helix variant, and so it was likely to be a relatively long motif. The similarity threshold was chosen as one-half of $L$, which we know from the (l,d)-motif problem ought to be approximately sufficient to prevent the graph from being too dense (and thus expensive to cluster). The support threshold was chosen to be about two-thirds the total number of sequences, allowing for some noise in the data. Of course, the judicious selection of parameters is an outstanding problem in binding site discovery. It is worth noting that most of these selections were simple or intuitive and that there was some tolerance in the results for slight perturbations in parameters.

Motif discovery in protein structures

The detection of 3-dimensional motifs in sets of protein structures is another problem type that Gemoda can address. Often, homologs that are related through a distant lineage show little to no sequence similarity, particularly at the nucleotide level (Eidhammer et al., 2000). However, these homologs frequently show conserved tertiary structures (Dietmann and Holm, 2001), making motif discovery in protein structures often revealing in situations where there appears to be no similarity at a sequence level.

There are a number of well-developed tools for the pair-wise comparison of protein structures or the comparison of a single protein structure to precomputed structural motifs; these have been reviewed elsewhere (Eidhammer et al., 2000). Some of the more popular tools include SSAP (Orengo and Taylor, 1996), VAST (Madej et al., 1995), Dali (Holm and Sander, 1993), and Mammoth (Ortiz et al., 2002). The Gemoda algorithm, when used for structural motif discovery, is most similar to the Sarf algorithm (Alexandrov, 1996; Alexandrov and Fischer, 1996) and, to a lesser degree, algorithms by Hunter and Subramaniam (2003) and Jonassen et al. (2002). Conceptually, Gemoda could be thought of as a hybrid of the Sarf and Teiresias algorithms, combining 3-D elementary motif discovery with convolution. To the best of our knowledge, Gemoda is the only tool that can compare an arbitrary number of protein structures simultaneously and produce an exhaustive set of maximal motifs.

To discover motifs in protein structures, Gemoda compares $L$-residue windows of the proteins’ alpha-carbon trace using the minimized RMSD similarity metric (one of many possible metrics for comparing protein sub-structures (Kolodny et al., 2005)). Here we use “minimized” to indicate that the protein structures are optimally super-imposed via rigid-body rotation and translation (Horn, 1987; Arun et al., 1987); occasionally this term is implicit. Using the clique-finding clustering algorithm, Gemoda finds motifs that are sets of alpha-carbon traces (in a set of protein structures) that can be super-imposed with an RMSD less than $g$ Å over each window of $L$ residues on a pair-wise basis. Similar to the amino acid and nucleotide applications of Gemoda, these structural motifs are maximal in both length and support.
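A minimized RMSD between two windows of alpha-carbon coordinates can be sketched as follows. This example uses the SVD-based superposition commonly known as the Kabsch method, which is in the same family as the least-squares approaches cited above; whether the distributed software uses this exact formulation is not stated in the text. The function name, the use of NumPy, and the toy coordinates are assumptions.

```python
import numpy as np


def minimized_rmsd(P, Q) -> float:
    """RMSD between two (L x 3) coordinate windows after optimal
    rigid-body translation and rotation (SVD-based superposition)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("expected two arrays of identical shape (L, 3)")
    # Optimal translation: move both centroids to the origin.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    # Guard against an improper rotation (a reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = Pc @ R.T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))


# A toy five-point trace and a rotated, translated copy of it.
trace = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.5], [1.5, 1.5, 1.0],
                  [0.0, 1.5, 1.5], [0.0, 0.0, 2.0]])
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta), np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
moved = trace @ rot.T + np.array([3.0, -2.0, 5.0])
rmsd_same = minimized_rmsd(trace, moved)  # effectively zero

perturbed = moved.copy()
perturbed[2] += np.array([1.0, 0.0, 0.0])
rmsd_perturbed = minimized_rmsd(trace, perturbed)
```

In a comparison phase for structures, a pair of windows would be linked in the thresholded graph when this value falls below the chosen $g$; because lower RMSD is better, the threshold test is reversed relative to the score-based examples.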

Here, we demonstrate how the Gemoda algorithm can be used for structural motif discovery by “discovering” the structural homology between the human galactose-1-phosphate uridylyltransferase (PDB id 1HXQ) (Wedekind et al., 1996) and fragile histidine triad proteins (PDB id 3FIT) (Lima et al., 1997), originally reported elsewhere (Holm and Sander, 1997). Using Gemoda, we looked for motifs of at least 30 residues, occurring in at least three chains, that had a pairwise RMSD of 1.5 Å or less (based on superposition of the alpha-carbon backbone) over each window of 30 residues.

This search returns 4 motifs, the longest of which is 66 residues(see Figure 4). This motif has one embedding in the 3FIT proteinand two, in different chains, in the 1HXQ protein. As shown in thefigure, the motif is an alpha helix followed by a beta sheet.

DISCUSSION

Gemoda makes four contributions. First, the algorithm is generic in that it is equally applicable to any variety of sequential data. Second, Gemoda allows arbitrary similarity metrics. In the examples shown here, we chose relatively simple metrics (scoring matrices and RMSD-based metrics); however, similarity metrics can be easily changed or added. For example, in the case of amino acid sequences, one can easily define hybrid metrics incorporating primary, secondary, and tertiary structure features. In the case of nucleotide sequences, the metric may be changed to incorporate methylation information. The third contribution is that Gemoda returns motifs that are not tied to any particular motif representation. In the case of amino acid sequence motifs, it is easy to model Gemoda’s motifs using regular expressions, hidden Markov models, or position-specific scoring matrices. Finally, when used with the clique-finding clustering algorithm, Gemoda returns an exhaustive set of maximal motifs. To the best of our knowledge, Gemoda is the only motif discovery algorithm incorporating the above features.

As mentioned in the introduction, Gemoda integrates the best characteristics from a number of previously published motif and association discovery algorithms. For specific problems, Gemoda’s performance can be improved further, though at the expense of generality. For example, a window sampling approach such as that used by Blast (Altschul et al., 1997) would be useful in applications where speed is more important than completeness of results. For protein structure comparisons, Gemoda could also be altered to use contact maps like those used by Dali (Holm and Sander, 1993). The convolution stage could also be made faster by using heuristic, non-exhaustive convolution methods. Also, the clustering phase could be expedited by using approximate clique finding methods.

Furthermore, the Gemoda algorithm could be modified to find gapped motifs. As currently formulated, Gemoda can find motifs with short, fixed–length gaps; however, if a gap causes a motif to fail to meet the similarity threshold during convolution, then it is not extended. It may be possible to alter the convolution step to allow for large or variable–length gapped motifs. Another option is to look for maximal motifs whose offsets are highly correlated. Our studies indicate that such post hoc analysis of Gemoda’s output can usually find well–conserved gapped motifs, including those with variable gap lengths, as was the case for the (ppGpp)ase example.


Gemoda’s generic nature makes it readily applicable to many problems. In the protein sequence application, Gemoda’s exhaustive search using a scoring matrix as a similarity metric identified multiple motifs. It provided an accurate representation of these domains in as much as an eight–fold excess of spurious sequences. In the DNA motif discovery application, Gemoda identified an otherwise unintentional result in a synthetic dataset and satisfactorily described a motif embedded in a genomic dataset. In the protein structure application, Gemoda demonstrated that it can compare multiple arbitrary–dimensional structures simultaneously and return results previously shown in the literature. Gemoda can also be directly applied to other diverse types of sequential datasets, or it can be extended to address problems not yet considered.

REFERENCES

Alexandrov,N.N. (1996) SARFing the PDB. Protein Eng, 9 (9), 727–732.

Alexandrov,N.N. and Fischer,D. (1996) Analysis of topological and nontopological structural similarities in the PDB: new examples with old structures. Proteins, 25 (3), 354–365.

Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25, 3389–402.

Aravind,L. and Koonin,E.V. (1998) The HD domain defines a new superfamily of metal-dependent phosphohydrolases. Trends Biochem Sci, 23 (12), 469–472.

Arun,K.S., Huang,T.S. and Blostein,S.D. (1987) Least-squares fitting of two 3-d point sets. IEEE Trans. Pattern Anal. Mach. Intell., 9 (5), 698–700.

Bailey,T.L. and Elkan,C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol, 2, 28–36.

Bairoch,A. (2000) The ENZYME database in 2000. Nucleic Acids Res, 28 (1), 304–305.

Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res, 28 (1), 45–48.

Bateman,A., Coin,L., Durbin,R., Finn,R.D., Hollich,V., Griffiths-Jones,S., Khanna,A., Marshall,M., Moxon,S., Sonnhammer,E.L.L., Studholme,D.J., Yeats,C. and Eddy,S.R. (2004) The Pfam protein families database. Nucleic Acids Res, 32 Database issue, 138–141.

Buhler,J. and Tompa,M. (2001) Finding motifs using random projections. In Proceedings of the fifth annual international conference on Computational biology pp. 69–76 ACM Press.

Dietmann,S. and Holm,L. (2001) Identification of homology in protein structure classification. Nat Struct Biol, 8 (11), 953–957.

Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14 (9), 755–763.

Eidhammer,I., Jonassen,I. and Taylor,W.R. (2000) Structure comparison and structure patterns. J Comput Biol, 7 (5), 685–716.

Eskin,E. and Pevzner,P.A. (2002) Finding composite regulatory patterns in DNA sequences. Bioinformatics, 18 Suppl 1, 354–363. Evaluation Studies.

Garey,M. and Johnson,D. (1979) Computers and Intractability: A Guide to the Theory of NP–Completeness. W.H. Freeman and Company, New York.

Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A, 89 (22), 10915–10919.

Henikoff,S., Henikoff,J.G., Alford,W.J. and Pietrokovski,S. (1995) Automated construction and graphical presentation of protein blocks from unaligned sequences. Gene, 163 (2), GC17–26.

Hertz,G.Z. and Stormo,G.D. (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15 (7-8), 563–577.

Hofmann,K., Bucher,P., Falquet,L. and Bairoch,A. (1999) The PROSITE database, its status in 1999. Nucleic Acids Res, 27, 215–9.

Holm,L., Ouzounis,C., Sander,C., Tuparev,G. and Vriend,G. (1992) A database of protein structure families with common folding motifs. Protein Sci, 1 (12), 1691–1698.

Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of distance matrices. J Mol Biol, 233 (1), 123–138.

Holm,L. and Sander,C. (1997) Enzyme HIT. Trends Biochem Sci, 22 (4), 116–117. Letter.

Horn,B.K.P. (1987) Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A, 4 (4), 629–642.

Hunter,C.G. and Subramaniam,S. (2003) Protein fragment clustering and canonical local shapes. Proteins, 50 (4), 580–588. Evaluation Studies.

Jonassen,I., Collins,J.F. and Higgins,D.G. (1995) Finding flexible patterns in unaligned protein sequences. Protein Sci, 4 (8), 1587–1595.

Jonassen,I., Eidhammer,I., Conklin,D. and Taylor,W.R. (2002) Structure motif discovery and mining the PDB. Bioinformatics, 18 (2), 362–367.

Keich,U. and Pevzner,P.A. (2002) Finding motifs in the twilight zone. Bioinformatics, 18 (10), 1374–1381. Evaluation Studies.

Kolodny,R., Koehl,P. and Levitt,M. (2005) Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J Mol Biol, 346 (4), 1173–88.

Lawrence,C.E., Altschul,S.F., Boguski,M.S., Liu,J.S., Neuwald,A.F. and Wootton,J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262 (5131), 208–214.

Lima,C.D., D’Amico,K.L., Naday,I., Rosenbaum,G., Westbrook,E.M. and Hendrickson,W.A. (1997) MAD analysis of FHIT, a putative human tumor suppressor from the HIT protein family. Structure, 5 (6), 763–774.

Madej,T., Gibrat,J.F. and Bryant,S.H. (1995) Threading a database of protein cores. Proteins, 23 (3), 356–369.

Mancheron,A. and Rusu,I. (2003) Pattern discovery allowing wild-cards, substitution matrices, and multiple score functions. In Algorithms in Bioinformatics, Proceedings Lecture notes in Bioinformatics pp. 124–138 Springer–Verlag.

Marchler-Bauer,A., Anderson,J.B., DeWeese-Scott,C., Fedorova,N.D., Geer,L.Y., He,S., Hurwitz,D.I., Jackson,J.D., Jacobs,A.R., Lanczycki,C.J., Liebert,C.A., Liu,C., Madej,T., Marchler,G.H., Mazumder,R., Nikolskaya,A.N., Panchenko,A.R., Rao,B.S., Shoemaker,B.A., Simonyan,V., Song,J.S., Thiessen,P.A., Vasudevan,S., Wang,Y., Yamashita,R.A., Yin,J.J. and Bryant,S.H. (2003) CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res, 31 (1), 383–387.

Murthy,V.L. and Rose,G.D. (2003) RNABase: an annotated database of RNA structures. Nucleic Acids Res, 31 (1), 502–504.

Orengo,C.A. and Taylor,W.R. (1996) SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol, 266, 617–635.

Ortiz,A.R., Strauss,C.E.M. and Olmea,O. (2002) MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci, 11 (11), 2606–2621. Evaluation Studies.

Pevzner,P. and Sze,S.H. (2001). private communication.

Pevzner,P.A. and Sze,S. (2000) Combinatorial approaches to finding subtle signals in DNA sequences. In Proceedings International Conference on Intelligent Systems for Molecular Biology pp. 269–278 AAAI Press.

Price,A., Ramabhadran,S. and Pevzner,P.A. (2003) Finding subtle motifs by branching from sample strings. Bioinformatics, 19 Suppl 2, II149–II155.

Rigoutsos,I. and Floratos,A. (1998) Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics, 14, 55–67.

Rigoutsos,I., Floratos,A., Ouzounis,C., Gao,Y. and Parida,L. (1999) Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins. Proteins, 37, 264–77.

Salgado,H., Gama-Castro,S., Martinez-Antonio,A., Diaz-Peredo,E., Sanchez-Solano,F., Peralta-Gil,M., Garcia-Alonso,D., Jimenez-Jacinto,V., Santos-Zavaleta,A., Bonavides-Martinez,C. and Collado-Vides,J. (2004) RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res, 32 (Database issue), 303–306.

Styczynski,M., Jensen,K., Rigoutsos,I. and Stephanopoulos,G. (2004) An extension and novel solution to the Motif Challenge Problem. Genome Informatics, 15 (2). In press.

Tomita,E., Tanaka,A. and Takahasi,H. (1989) An optimal algorithm for finding all the cliques. SIG Algorithms, 12, 91–98.

Tompa,M., Li,N., Bailey,T.L., Church,G.M., De Moor,B., Eskin,E., Favorov,A.V., Frith,M.C., Fu,Y., Kent,W.J., Makeev,V.J., Mironov,A.A., Noble,W.S., Pavesi,G., Pesole,G., Regnier,M., Simonis,N., Sinha,S., Thijs,G., van Helden,J., Vandenbogaert,M., Weng,Z., Workman,C., Ye,C. and Zhu,Z. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol, 23 (1), 137–144.

Wedekind,J.E., Frey,P.A. and Rayment,I. (1996) The structure of nucleotidylated histidine-166 of galactose-1-phosphate uridylyltransferase provides insight into phosphoryl group transfer. Biochemistry, 35 (36), 11560–11569.

Zaki,M.J. (2000) Scalable algorithms for association mining. Knowledge and Data Engineering, 12 (2), 372–390.

Zaki,M.J. and Ogihara,M. (1998) Theoretical foundations of association rules. In Proceedings of 3rd SIGMOD’98 Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD’98), Seattle, Washington.


GKIKYKSEQAENYRKLILATAEDPRVILLKLSDRLDNVKTLWVFREEKRKKIAKETMEIY SPOT_AQUAE
HNKTRSIKEANTISKMFFAMTHDIRIIIIKLADKLHNMTTLSYLPKNRQDRIAKDCLSTY SPOT_BORBU
KFRDKKEAQAENFRKMIMAMVQDIRVILIKLADRTHNMRTLGSLRPDKRRRIARETLEIY SPOT_ECOLI
KFRTRQEAQVENFRKMILAMTRDIRVVLIKLADRTHNMRTLGSLRPDKRRRIAKETLEIY SPOT_HAEIN
LKNKKENLNLKSFVNIAINSQQEINVMVLKLADRLDNIASIEFLPIEKQKVIAKETLELY SPOT_MYCGE
LNRKKEDLNLKSLVNIAMSSQQEVNALVLKLADRLDNISSIEFLAVEKQKIIAKETLELY SPOT_MYCPN
AKENRTQIKAQYLRKLYLSMAKDIRVIIVKLADRLHNLKTIGYLKPERQQIIARESLEIY SPOT_SPICI
NFSSTTEHQAENFRRMFLAMAKDIRVIVVKLADRLHNMRTLDALSPEKQRRIARETKDIF SPOT_SYNY3

APLAHRLGVWSIKNELEDWAFKYLYPEEYEKVRNFVKESRKNLEE SPOT_AQUAE
VPIAERLGISSLKTYLEDLSFKHLYPKDYKEIKNFLSETKIEREK SPOT_BORBU
SPLAHRLGIHHIKTELEELGFEALYPNRYRVIKEVVKAARGNRKE SPOT_ECOLI
CPLAHRLGIEHIKNELEDLSFQAMHPHRYEVLKKLVDVARSNRQD SPOT_HAEIN
AKIAGRIGMYPVKTKLADLSFKVLDLKNYDNTLSKINKQKVFYDN SPOT_MYCGE
AKIAGRIGMYPVKTQLADLSFKVLDPKNFNNTLSKINQQKVFYDN SPOT_MYCPN
SAIAHRLGMKAVKQEIEDISFKIINPVQYNKIVSLLESSNKEREN SPOT_SPICI
APLANRLGIWRFKWELEDLSFKYLEPDSYRKIQSLVVEKRGDRES SPOT_SYNY3

Fig. 1. The RelASpoT motif detected in the 3.1.7.2 enzyme sequences.

Fig. 2. The similarity graph for the 3.1.7.2 enzyme example. (A) is the similarity matrix A, which contains one row and column for each window of 50 residues in the set of input sequences. Entries in the matrix have been thresholded such that pairs of windows that can be aligned with a bit–score greater than 20 are given a black dot and all others are white, producing the familiar dot–plot appearance of the matrix. (B) is a graph representation of A. Each vertex represents a window, and two vertices are connected with an edge if they have a black dot in the top image. The breakout shows a clique of size eight, which represents a set of windows that participate in the motif shown in Figure 1. In general, as the bit–score threshold is lowered, the number of edges in the graph increases, making the clustering stage more computationally intensive. When using clique–based clustering with too small a threshold, computational expense may make the problem infeasible. At these thresholds the “signal” cannot be distinguished from the “noise.” However, with the parameters used in this example, the clustering phase is quite easy, which is intuitive given the number of disjoint subgraphs shown in the bottom image.
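The relationship between the thresholded matrix and the window graph described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the matrix is taken to be symmetric, the helper names `threshold_graph` and `maximal_cliques` are hypothetical, and the clique enumeration is the basic Bron–Kerbosch recursion without pivoting rather than the method of Tomita et al. (1989) cited in the text.

```python
from typing import Dict, List, Sequence, Set


def threshold_graph(A: Sequence[Sequence[float]], g: float) -> Dict[int, Set[int]]:
    """Link windows i and j (i != j) when their score exceeds the threshold g."""
    n = len(A)
    return {i: {j for j in range(n) if j != i and A[i][j] > g} for i in range(n)}


def maximal_cliques(adj: Dict[int, Set[int]]) -> List[List[int]]:
    """Enumerate all maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    found: List[List[int]] = []

    def expand(R: Set[int], P: Set[int], X: Set[int]) -> None:
        # R: current clique, P: candidates that extend it, X: already explored
        if not P and not X:
            found.append(sorted(R))
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}

    expand(set(), set(adj), set())
    return sorted(found)


# Toy symmetric score matrix for four windows (values are made up).
A = [[9, 5, 5, 0],
     [5, 9, 5, 0],
     [5, 5, 9, 4],
     [0, 0, 4, 9]]
print(maximal_cliques(threshold_graph(A, 3)))  # [[0, 1, 2], [2, 3]]
print(maximal_cliques(threshold_graph(A, 4)))  # [[0, 1, 2], [3]]
```

Lowering the threshold from 4 to 3 adds an edge and a second multi-window clique, which is a small-scale version of the growth in clustering cost noted in the caption. Clique enumeration is exponential in the worst case, so a sketch like this is only practical on sparse graphs.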

Fig. 3. The sequence logo for a) the motif implanted in each sequence for the (l,d)–motif problem and b) the LexA binding site motif generated from the highest–scoring motif returned by Gemoda.

Fig. 4. A motif showing structural conservation between the human galactose-1-phosphate uridylyltransferase and fragile histidine triad proteins originally reported by Holm and Sander (1997). The motif, as shown here, was “discovered” using the Gemoda algorithm along with three other, smaller, structural motifs that are highly conserved between the two proteins. Notably, the proteins show little sequence similarity over the region displayed in the structural motif above. Graphics created using PyMol (DeLano Scientific, San Carlos, CA, USA).


Fig. 5. A sketch showing the flow of the Gemoda algorithm for an example input set of protein sequences. The various colors in the input sequences are used to indicate the sequential ordering of the L–residue windows. The various shapes are used to indicate a particular window’s sequence of origin. (1) In the comparison stage, each window is compared to each other window on a pair–wise basis. Here we show the similarity matrix, A, where the values in the matrix have been thresholded. Those pairs of windows in A that have a similarity score above the threshold are colored black. Note that the graph looks very similar to a standard dot plot. (2) In the clustering phase, groups of windows are clustered together. Here, we show the clusters as cliques, or maximal fully–connected subgraphs in the thresholded matrix A. (3) Finally, these clusters are “stitched” together in the convolution phase using the sequential ordering of the windows to reveal the maximal motifs. A similar process applies for any kind of sequential data analyzed by Gemoda.
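The “stitching” in step (3) can be illustrated with a simplified Python sketch. It assumes that each cluster is a set of windows written as (sequence index, offset) pairs, that a motif must keep at least k windows (a hypothetical minimum support parameter), and that a motif counts as maximal when it cannot be extended by one position in either direction without losing a window. The names `convolve` and `grow_motifs` are hypothetical, and this is not a reproduction of Gemoda’s actual convolution procedure.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Window = Tuple[int, int]  # (sequence index, start offset)


def convolve(left: Iterable[Window], right: Set[Window]) -> FrozenSet[Window]:
    """Keep windows of `left` whose successor (offset + 1) lies in `right`."""
    return frozenset((s, i) for (s, i) in left if (s, i + 1) in right)


def grow_motifs(clusters: Iterable[Iterable[Window]], L: int, k: int):
    """Extend clusters one position at a time; report those that stop growing."""
    level = {frozenset(c) for c in clusters}
    level = {c for c in level if len(c) >= k}
    length, found = L, []
    while level:
        nxt, subsumed = set(), set()
        for a in level:
            for b in level:
                c = convolve(a, b)
                if len(c) < k:
                    continue
                nxt.add(c)
                if len(c) == len(a):
                    subsumed.add(a)  # a extends to the right with no loss
                if len(c) == len(b):
                    subsumed.add(b)  # b extends to the left with no loss
        found += [(length, sorted(m)) for m in level - subsumed]
        level, length = nxt, length + 1
    return sorted(found)


# Three made-up clusters of length-2 windows over sequences 0, 1 and 2.
clusters = [{(0, 0), (1, 0)}, {(0, 1), (1, 1)}, {(0, 2), (1, 2), (2, 5)}]
print(grow_motifs(clusters, L=2, k=2))
# [(2, [(0, 2), (1, 2), (2, 5)]), (4, [(0, 0), (1, 0)])]
```

In this toy input the three clusters chain together in sequences 0 and 1, giving a motif of length 4 starting at offset 0, while the third cluster is also reported on its own at length 2 because it has an extra occurrence in sequence 2 that the longer motif does not cover.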
