Following the Evolution of New Protein Folds via Protodomains [Report]

Examination for the Advancement to Candidacy

Following the Evolution of New Protein Folds via Protodomains

Spencer E. Bliven

January 28, 2013

Bioinformatics & Systems Biology, University of California, San Diego


Committee Members

Philip E. Bourne, Chair

Milton H. Saier, Co-Chair

Russell F. Doolittle

Michael K. Gilson

Adam Godzik


Abstract

The rate at which novel protein folds are discovered has declined rapidly, leading to some hope that currently known protein structures cover the majority of the protein fold space utilized by nature, at least within well-sampled classes of proteins [1]. This presents an opportunity to globally analyze fold space. In my proposed thesis I will look for answers to the following ambitious questions:

• How do new folds evolve from existing ones? What are the relative frequencies of known mechanisms for forming new folds?

• Is fold space continuous or discrete? How can it display properties of both?

• How can an understanding of protein fold space translate into insights about specific protein families?

Although these questions are quite broad, I think that the key elements are now in place to make answers accessible in a PhD thesis. I first redefine these questions as clear computational problems. I then propose some algorithms that could be used to solve these problems, and summarize the steps that have already been taken towards understanding fold space. Finally, I describe an evolutionary model that places the computational results in a biological framework.

Publications:

[2] Andreas Prlić, Spencer Bliven, Peter W Rose, Wolfgang F Bluhm, Chris Bizon, Adam Godzik, and Philip E Bourne. Pre-calculated protein structure alignments at the RCSB PDB website. Bioinformatics, 26(23):2983–2985, December 2010.

[3] Spencer Bliven and Andreas Prlić. Circular permutation in proteins. PLoS Comput Biol, 8(3):e1002445, March 2012.

[4] Andreas Prlić, Andrew Yates, Spencer E Bliven, Peter W Rose, Julius Jacobsen, Peter V Troshin, Mark Chapman, Jianjiong Gao, Chuan Hock Koh, Sylvain Foisy, Richard Holland, Gediminas Rimša, Michael L Heuer, H Brandstätter-Müller, Philip E Bourne, and Scooter Willis. BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics, 28(20):2693–2695, October 2012.

Contents

1 Introduction
    1.1 Fold Space
    1.2 Protodomains
    1.3 Protodomain Rearrangements in Evolution

2 Specific Aims

3 Preliminary Research
    3.1 Detecting Circular Permutations (CE-CP)
    3.2 Detecting Protein Symmetry
    3.3 Classifying Quaternary Structure Based on Symmetry
    3.4 Domain-based All-vs-all Structural Comparison

4 Research Design and Methodology
    4.1 Evolutionary Model

5 Impact

6 Conclusion

1 Introduction

1.1 Fold Space

The question of the nature of protein fold space has captured the attention of numerous structural and computational biologists. The number of possible protein sequences is so large as to be practically infinite, yet proteins with completely different primary sequences may fold into nearly identical structures [5]. Since the structure of a protein is essential to performing its function, understanding which structures are possible and how evolution has sampled the space of possible structures can have far-reaching consequences.

A number of questions regarding the nature of protein fold space remain open. One of the more controversial questions is whether fold space is composed primarily of discrete protein folds, or whether folds are connected by a continuum of possible but unobserved folds [6, 7, 8, 9]. This question has practical implications for the designability of proteins, the utility of structural genomics initiatives, and the design of structure classification methods.

With the proliferation of protein structures, several schemes for classifying proteins into discrete categories emerged. Some notable examples include SCOP [10] and CATH [9]. Such classifications are undeniably useful as a description of the possible folds observed in nature. However, numerous examples exist of proteins that share clear structural similarity yet are classified as discrete folds by these methods. For example, Grishin [11] describes a sequence of structurally similar proteins leading from an all-β to an all-α protein. Such observations led to the view of fold space as a continuum. In this view, protein classifications are more like clusters of closely related structures. Some structures may lie near the edge of multiple clusters, incorporating structural features of each. Efforts have been made to formalize the notion of continuous fold space by defining rigorous distance functions for pairwise comparisons and computing all-against-all pairwise comparisons of known proteins [12, 8, 13]. Multidimensional scaling can then be used to embed protein folds in a Euclidean space [14, 15, 16]. The problem with such approaches is that while they are generally able to distinguish protein classes, they cannot capture the finer classifications of fold and superfamily. Thus, they have limited utility in predicting evolutionary relationships or functional characteristics.

It now seems that neither the discrete nor continuous views of protein fold space can fully explain the relationships between protein folds [6]. Instead of focusing on the geometric similarities between folds, perhaps the secret to understanding protein fold space lies in the evolutionary history of proteins. In this proposal I suggest a model for the long-term evolution of protein folds, and discuss the algorithms that need to be developed in order to apply this model to unraveling the evolutionary relationships among the plethora of folds observed today. By examining the evolutionary history of folds, individual cases of folds which developed by incremental changes ("continuous fold space") can be distinguished from the rapid introduction of new folds either de novo or through the recombination of existing folds ("discrete fold space").

1.2 Protodomains

One difficulty with clearly defining the nature of fold space is the multi-scale nature of proteins. Various subdivisions of proteins along genetic, geometric, structural, and functional criteria are broadly used in the literature, often inconsistently. To avoid any confusion, I will briefly explain which definitions will be used in this proposal, before introducing a new term, the protodomain.

The largest and most biologically relevant unit of proteins is the biological assembly, which consists of the protein as it exists in the cell [17]. This may be as a monomer, a protein complex, or an aggregate. A single polypeptide may exist in several types of biological assemblies at the same time due to dynamics or different cellular conditions. The prevalent biological assembly for a given cell and protein can be determined experimentally, or can be predicted from a crystal structure using methods such as PISA [17].

The most unambiguous decomposition of a biological assembly is into a set of chains, based on polypeptide connectivity. Chains typically correlate one-to-one with genes (although alternative splicing and post-translational cleavage and ligation lead to notable exceptions). Thus, chains are a good approximation of a decomposition along genetic criteria. Since a chain must be translated as a unit, all transcriptional and translational regulation occurs at the chain level.

Decomposing a biological assembly based on structural criteria leads to a set of domains. Several different specific criteria are used to define domains (e.g. SCOP [10], CATH [9], or PDP [18]), but domains are generally defined to be compact, independent structural units capable of folding independently [19, p. 27]. In SCOP 1.75, 60% of protein structures contain multiple domains. Because domains are defined based on structural criteria, domains may consist of several non-contiguous portions of a chain or even portions from several chains. Although most domains are formed by a single contiguous portion of a chain, about 4% of SCOP domains contain two or more non-contiguous blocks, and 1% span multiple chains.

Several domain classification schemes exist, including Pfam [20], SCOP, and CATH. Each attempts to cluster domains based on structural similarity and evolutionary relationships. Using the SCOP nomenclature, a fold is a group of domains that contain the same major secondary structural elements in the same mutual orientation and with the same connectivity [10, 6]. A continuing challenge is to demonstrate whether domains with a common fold derive from a common ancestor or whether they represent convergent evolution.

The problem with comparing domains is twofold. First, while domains may be able to fold independently, proteins evolve in the context of their biological assembly. Changes in the structure of the biological assembly may not correspond to significant changes in the structure of component domains. Second, domains often consist of combinations of subdomains, which could plausibly be evolutionarily related. A prime example is symmetric domains, which consist of multiple copies of a subdomain following a gene duplication event (see section 1.3, Protodomain Rearrangements in Evolution).

To address the issue of evolution, we introduce the concept of a protodomain. A protodomain is a minimal, independently evolving protein unit with a conserved structure. In many cases this may include a whole domain, or even the whole chain. Other domains may consist of several protodomains that have independent evolutionary histories. As a part of a domain, a protodomain is not required to fold independently. Rather, it is a syntenic block that maintains the same fold throughout evolution. The fact that members of a protodomain are related by homology distinguishes them from structural motifs, which are small, common structures that can evolve independently (e.g. zinc finger motifs).

1.3 Protodomain Rearrangements in Evolution

To provide motivation for this definition of protodomains, it is useful to consider several evolutionary processes that conserve the structure of the biological assembly while dramatically changing the structure of component chains. First, circular permutation can be viewed as an alteration in the sequence order in which two protodomains occur. Second, the evolution of internal pseudosymmetry from quaternary symmetry maintains the structure of the biological assembly while fusing the participating protodomains into a single chain. Arbitrary recombinations of structural motifs could also represent protodomain rearrangements, but unless they show sequence conservation, the evidence for homology between motifs is too tenuous to consider them protodomains.

Figure 1: Several proteins that contain multiple copies of a hypothetical protodomain. (a) Glyoxalase I from Clostridium acetobutylicum [3HDP] contains two symmetrical copies of the protodomain that bind a nickel ion near the axis of rotation. The authors list the structure as monomeric, but PISA suggests a dimer similar to (c). (b) GTP binding regulator from Thermotoga maritima [1VR8] also contains two copies of the protodomain, but at a different relative orientation. Here it is shown superimposed on one protodomain from glyoxalase I, in yellow. (c) Dimer form of glyoxalase I in E. coli [1F9Z]. While each chain individually has a very different structure from the C. acetobutylicum homologue, the protodomains are oriented identically between chains and conserve the metal-binding site between protodomains. (d) One chain from Pseudomonas 1,2-dihydroxynaphthalene dioxygenase [2EHZ] contains four copies of the protodomain, but it assembles into an octamer.

Figure 2: Schematic representation of a circular permutation in two proteins [3]. The first protein (outer circle) has the sequence a-b-c. After the permutation the second protein (inner circle) has the sequence c-a-b. The letters N and C indicate the location of the amino- and carboxy-termini of the protein sequences and how their positions change relative to each other.

Figure 3: Schematic of the genetic modifications that can lead to circular permutation, with protodomains represented as arrows. (a) Fission and fusion mechanism. (b) Permutation by duplication mechanism (duplication and fusion followed by truncation).

Two protein chains are related by a circular permutation if they contain two regions which are homologous but which occur in a permuted order (see figure 2). A large number of proteins related by a circular permutation are known [21]. Despite their permuted sequence, the structures of pairs of circularly permuted proteins are usually extremely similar.

Andreas Prlić and I described the two major mechanisms by which circularly permuted forms of proteins evolve in our 2012 review of the subject [3]:

There are two main models that are currently being used to explain the evolution of circularly permuted proteins: permutation by duplication and fission and fusion. The two models have compelling examples supporting them, but the relative contribution of each model in evolution is still under debate [22]. Other, less common, mechanisms have been proposed, such as “cut and paste” [23] or “exon shuffling.”

Permutation by Duplication

The earliest model proposed for the evolution of circular permutations is the permutation by duplication mechanism [24]. In this model, a precursor gene first undergoes a duplication and fusion to form a large tandem repeat. Next, start and stop codons are introduced at corresponding locations in the duplicated gene, removing redundant sections of the protein (see figure 3b).

One surprising prediction of the permutation by duplication mechanism is that intermediate permutations can occur. For instance, the duplicated version of the protein should still be functional, since otherwise evolution would quickly select against such proteins. Likewise, partially duplicated intermediates where only one terminus was truncated should be functional. Such intermediates have been extensively documented in protein families such as DNA methyltransferases [25].

Saposin and swaposin. An example for permutation by duplication is the relationship between saposin and swaposin. Saposins are highly conserved glycoproteins that consist of an approximately 80 amino acid residue long protein forming a four alpha helical structure. They have a nearly identical placement of cysteine residues and glycosylation sites. The cDNA sequence that codes for saposin is called prosaposin. It is a precursor for four cleavage products, the saposins A, B, C, and D. The four saposin domains most likely arose from two tandem duplications of an ancestral gene [27]. This repeat suggests a mechanism for the evolution of the relationship with the plant-specific insert (PSI) (see figure 4). The PSI is a domain exclusively found in plants, consisting of approximately 100 residues and found in plant aspartic proteases [28]. It belongs to the saposin-like protein family (SAPLIP) and has the N- and C-termini “swapped”, such that the order of helices is 3-4-1-2 compared with saposin, thus leading to the name “swaposin” [29]. For a review on functional and structural features of saposin-like proteins, see [30].

Figure 4: Suggested relationship between saposin and swaposin. They could have evolved from a similar gene [26]. Both consist of four alpha helices with the order of helices being permuted relative to each other.

Fission and Fusion

Another model for the evolution of circular permutations is the fission and fusion model. The process starts with two partial proteins. These may represent two independent polypeptides (such as two parts of a heterodimer), or may have originally been halves of a single protein that underwent a fission event to become two polypeptides (see figure 3a).

The two proteins can later fuse together to form a single polypeptide. Regardless of which protein comes first, this fusion protein may show similar function. Thus, if a fusion between two proteins occurs twice in evolution (either between paralogues within the same species or between orthologues in different species) but in a different order, the resulting fusion proteins will be related by a circular permutation.

Evidence for a particular protein having evolved by a fission and fusion mechanism can be provided by observing the halves of the permutation as independent polypeptides in related species, or by demonstrating experimentally that the two halves can function as separate polypeptides [31].

Transhydrogenases. An example for the fission and fusion mechanism can be found in nicotinamide nucleotide transhydrogenases [32]. These are membrane-bound enzymes that catalyze the transfer of a hydride ion between NAD(H) and NADP(H) in a reaction that is coupled to transmembrane proton translocation. They consist of three major functional units (I, II, and III) that can be found in different arrangement in bacteria, protozoa, and higher eukaryotes (see figure 5). Phylogenetic analysis suggests that the three groups of domain arrangements were acquired and fused independently [22].

Figure 5: Transhydrogenases in various organisms can be found in three different domain arrangements. In cattle, the three domains are arranged sequentially. In the bacteria E. coli, Rb. capsulatus, and R. rubrum, the transhydrogenase consists of two or three subunits. Finally, transhydrogenase from the protist E. tenella consists of a single subunit that is circularly permuted relative to cattle transhydrogenase. [Figure labels: vertebrates (B. taurus), bacteria (E. coli, R. rubrum, Rb. capsulatus), parasitic protozoans (E. tenella); accessions shown: P11024, P07001, D8AU95, Q2RSB2, Q2RSB3, Q2RSB4, D5APA8, D5APA9, Q07600.]

Both mechanisms of circular permutation require major rearrangements to the underlying genes that code for the protein, but result in minimal changes to the protein structure. The prevailing theory for the evolution of internal symmetry shares this pattern of modifying gene structure while conserving the functional assembly.

Symmetry has been known to be important in proteins since Perutz’s 1968 structure of hemoglobin [33, 34]. It plays a fundamental role in protein allostery [35] and contributes to function and folding [36]. Among current structures in the PDB, 43% have symmetric quaternary structure¹. In addition, 19% of SCOP superfamilies contain domains with internal pseudosymmetry (see section 3.2 below).

Symmetric molecules are thought to evolve via a duplication mechanism from multimers [37]. For instance, a monomer with three-fold rotational symmetry could evolve from a symmetric trimer via fusion with two duplicate copies of the gene (see figure 6) [31]. The biological assembly in both the trimeric and monomeric forms consists of three copies of the protodomain, but with a different genetic composition. One might expect that monomers with a number of protodomains that is not a power of two would be disfavored, since they would require several sequential duplications whose chains would not form complete assemblies. However, intermediates consisting of two biological assemblies with one strand-swapped chain can stably fold (figure 6b). Intermediates may also be stabilized by single-protodomain paralogues.

While symmetry and circular permutation are not the only ways in which protodomains can recombine, they are readily identifiable and are associated with established evolutionary processes. Thus, these form a basis from which to start decomposing biological assemblies into protodomains. Additional protodomains may then be identified based on structural similarity, sequence similarity, or other criteria.

¹ The Protein Data Bank. http://www.rcsb.org, accessed 1/18/2013

Figure 6: Hypothetical precursors to fibroblast growth factor 1 (FGF-1) synthesized in [31]. (a) Trimer, with one protodomain per chain [3OL4]. (b) Trimer, with two protodomains per chain. Two barrels are formed, each consisting of three protodomains [3OGF]. (c) Fully symmetric monomer consisting of three protodomains [3O4D].

2 Specific Aims

1. Improve algorithms to identify conserved protodomains globally across the PDB. The first step to understanding the evolution of protein architecture is defining the basic repeating units, or protodomains in our structural terminology. A crude approximation of this is simply the domain architecture of various proteins, for instance SCOP superfamilies. However, this breaks down for symmetric proteins, proteins that have undergone circular permutation, and other complex cases such as strand swapping. While we have made progress on algorithms for discovering such cases (see Preliminary Research), additional algorithmic advances are needed to accurately assign individual residues to protodomains. These algorithms will be made freely available to the community under an open source license.

2. Identify structurally similar and potentially homologous protodomains across fold space. After protodomains are identified, an all-vs-all comparison of representative protodomains will be performed using structural comparison algorithms. These pairwise similarities can be used to construct a network of structurally similar protodomains, allowing clustering analysis to identify potential homologues. Shared ancestry between protodomains can then be further established using sequence comparison and consistency with evolutionary trees. An all-vs-all comparison has already been performed at the domain level utilizing the Open Science Grid supercomputer, but extending this to protodomains could identify links between seemingly dissimilar folds.

3. Integrate protodomain arrangements with domain and quaternary structure information to create a parsimonious model of fold evolution across the tree of life. A model of protein fold evolution that incorporates protodomain architecture is suggested here and will be further refined based on future data. This model will form the basis for identifying key events in the evolution of protein folds, with the goal of eventually building a parsimonious 'tree of proteins' to document the evolution of existing protodomains.

4. Apply protodomain principles to understanding the evolution of specific protein families. The protodomain architecture data will be applied to specific protein families in more detail. A deep knowledge of evolutionary relationships within a specific family will bring external corroboration to any insights discovered through the new algorithmic developments. Additionally, knowledge of symmetry could be used to suggest useful protein engineering tasks relevant to the specific system. Several candidate systems are presented, including ion transporters and beta propellers.


3 Preliminary Research

3.1 Detecting Circular Permutations (CE-CP)

The Combinatorial Extension (CE) algorithm is able to accurately identify alignments between distantly related proteins based on structural similarity [38]. It is one of the most widely used structural alignment algorithms, and an open-source implementation is currently available through the BioJava project [39]. However, the algorithm is limited to comparing proteins that have the same order of residues. Relationships between proteins that differ by rearrangement events cannot be detected by CE, and require more advanced topology-independent structural alignment algorithms.

The most basic case of rearrangement is that of circular permutation, which requires just a single change in sequence topology to align the pair of proteins. A version of CE adapted for circular permutations was implemented, called Combinatorial Extension with Circular Permutations (CE-CP) [4]. To get around the requirement that the topology of both proteins being aligned be identical, we use an algorithm analogous to that proposed in [40] for detecting circular permutations by sequence similarity. To compare two proteins, A and B, which are suspected of being related by a circular permutation, first a duplicate of B is created by concatenating the sequence of B to itself. Then the full CE algorithm is run to compare A and BB. If no circular permutation is found, this will result in two identical alignment paths aligning A with each copy of B. However, if a circular permutation has occurred, the optimal alignment path will span the boundary between copies of B, as shown in figure 7.
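The duplication trick can be illustrated with a toy string analogue. The sketch below is not CE-CP: it assumes exact character matches in place of structural alignment, and the function name `find_permutation_site` is hypothetical. It only shows why searching A against the doubled sequence BB exposes the permutation point.

```python
from typing import Optional


def find_permutation_site(a: str, b: str) -> Optional[int]:
    """Toy analogue of the A-vs-BB comparison (exact matches only).

    Returns the index in b at which a begins when b is read circularly,
    or None if a is not a circular permutation of b. An offset of 0
    means the two sequences have the same topology.
    """
    if len(a) != len(b):
        return None
    if not b:
        return 0
    doubled = b + b            # the "BB" construct from the text
    offset = doubled.find(a)   # a match that crosses the copy boundary
    if offset == -1:           # implies a non-zero permutation site
        return None
    return offset % len(b)


# Figure 2 style example: a-b-c versus c-a-b.
print(find_permutation_site("abc", "cab"))  # 1
```

A real structural alignment tolerates mismatches and gaps, so the match is scored rather than exact, but the role of the doubled sequence is the same.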

Figure 7: (a) CE-CP alignment of periplasmic molybdate-binding protein [1ATG] (orange and yellow) on OpuAC [2B4L] (cyan and blue). (b) Dotplot of duplicated search matrix. CE-CP finds the red alignment, which crosses the duplication boundary at the position of the circular permutation. This information is then mapped back to equivalent positions (grey alignment) to form the final alignment.

One difficulty with the duplication algorithm is that for difficult cases, the optimal AFP path may contain a large enough insertion in BB that the same residue in different copies of B is assigned to different portions of A, one before and one after the permutation site. This introduces ambiguity as to where the circular permutation occurred. To solve this problem we choose the permutation site that results in the longest alignment, and discard portions of the optimal path that are inconsistent with the topology of the chosen permutation site.
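One way to picture the site-selection step is as a sliding window over the doubled sequence. The sketch below is an illustration under stated assumptions, not the BioJava implementation: alignment length is taken to be the number of aligned residue pairs, the input format (a list of index pairs) is hypothetical, and ties are broken in favour of the earliest site.

```python
def choose_permutation_site(pairs, len_b):
    """Pick the permutation site that keeps the most aligned pairs.

    pairs: list of (i, j) with i an index in A and j an index in the
           doubled sequence BB (0 <= j < 2 * len_b).
    Returns (site, kept_pairs), where kept_pairs have j mapped back
    onto the original B. Each window [site, site + len_b) covers every
    residue of B exactly once, so no residue of B is used twice.
    """
    best_site, best_kept = 0, []
    for site in range(len_b):
        kept = [(i, j) for i, j in pairs if site <= j < site + len_b]
        if len(kept) > len(best_kept):
            best_site, best_kept = site, kept
    return best_site, [(i, j % len_b) for i, j in best_kept]


# Residue 3 of B is aligned twice (j = 3 and j = 9); one copy is dropped.
path = [(0, 3), (1, 4), (2, 5), (3, 6), (4, 7), (5, 8), (6, 9)]
site, mapped = choose_permutation_site(path, len_b=6)
print(site, mapped)
```

Pairs falling outside the chosen window are exactly the portions of the path that are inconsistent with that permutation site.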

The CE-CP algorithm is available as part of the BioJava open source library. It is also provided as an alignment algorithm on the RCSB PDB website (http://www.rcsb.org/pdb/workbench/workbench.do).

3.2 Detecting Protein Symmetry

The CE algorithm has also been adapted to detect pseudosymmetry within protein chains. Unlike in crystallographic symmetry, where symmetric chains are known to have identical sequences and structures, pseudosymmetric subunits within a domain can have wildly divergent sequences. Thus, detecting pseudosymmetry requires a structural alignment algorithm to find regions of self-similarity within the protein.

The CE-Symm algorithm is able to find pseudosymmetry and other types of internal repeats through a few modifications of CE. A protein is compared to itself in a sequential manner using dynamic programming. Like CE-CP, the protein is searched against a duplicated copy of itself to allow a single discontinuity in sequence. The alignment is also restricted to lie a minimum distance from the diagonal, in order to avoid the trivial alignment. After identifying optimal regions of self-similarity, the alignment is post-processed to identify whether rotational symmetry is present and to determine the symmetry order.
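The off-diagonal restriction and the order determination can be sketched on a toy sequence. This is only an analogy, assuming exact character matches and perfect repeats; CE-Symm itself works on structure and scores alignments statistically. The names `best_self_offset` and `symmetry_order` and the `min_offset` parameter are hypothetical.

```python
from math import gcd


def best_self_offset(seq: str, min_offset: int = 2):
    """Find the circular shift that best maps seq onto itself.

    Shifts closer than min_offset to the diagonal (shift 0, or
    equivalently shift len(seq)) are skipped to avoid the trivial
    self-alignment. Index wrap-around plays the role of the
    duplicated copy. Returns (offset, matches); the smallest offset
    wins ties.
    """
    n = len(seq)
    best_offset, best_score = None, -1
    for k in range(min_offset, n - min_offset + 1):
        score = sum(1 for i in range(n) if seq[i] == seq[(i + k) % n])
        if score > best_score:
            best_offset, best_score = k, score
    return best_offset, best_score


def symmetry_order(length: int, offset: int) -> int:
    """Number of times the shift must be applied to return to the start."""
    return length // gcd(length, offset)


offset, matches = best_self_offset("ABCDABCDABCD")
print(offset, matches, symmetry_order(12, offset))  # 4 12 3
```

Note that the shift of 8 scores equally well in this example; it corresponds to the same three-fold repeat traversed in the opposite direction, which is one reason the order has to be derived in a post-processing step.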

CE-Symm was run on a non-redundant set of proteins representing one protein from each SCOP superfamily (http://source.rcsb.org/jfatcatserver/scopResults.jsp). Alignments with a z-score greater than 3.5 were considered to be symmetric molecules. This search found that roughly 20% of SCOP superfamilies are symmetric (19% over all classes; see table 1), which is slightly higher than results found by other methods [41].

SCOP class                                              Percent symmetric
(a) All alpha proteins                                  23%
(b) All beta proteins                                   26%
(c) Alpha and beta proteins (a/b)                       16%
(d) Alpha and beta proteins (a+b)                       14%
(e) Multi-domain proteins (alpha and beta)              3%
(f) Membrane and cell surface proteins and peptides     24%
(a-f) All classes                                       19%

Table 1: Percentage of SCOP superfamilies found to contain symmetry.

A manuscript describing the current version of CE-Symm is in preparation. However, the program is limited in that while it can accurately determine whether a domain contains pseudosymmetry, it is much less reliable at determining the minimal protodomains that comprise the protein. Under Aim 1 of this proposal, I will improve the accuracy of CE-Symm for identifying protodomains.

Figure 8: Three-fold dihedral symmetry in 5-enol-pyruvyl shikimate-3-phosphate (EPSP) synthase [1G6S]. (a) Top and (b) side views of EPSP, projected along the 3-fold axis and one of the three 2-fold axes respectively. (c) Dot-plot showing the six possible alignments consistent with the D3 point group (labelled e, R, R2, F, FR, and FR2). The dotted lines represent the hinge regions between the two halves of the structure. (d) CE-Symm finds an alignment around one 2-fold axis (corresponding to the pink alignment in the dot plot, which requires only one circular permutation). (e) After manually resolving the domain swapping between the 2-fold symmetric domains (residues 20-241, middle quadrant between the dotted lines), CE-Symm is able to find the 3-fold axis as well.

3.3 Classifying Quaternary Structure Based on Symmetry

Peter Rose recently developed an algorithm for quickly determining the symmetry of a protein at the level of quaternary structure. The algorithm first finds the stoichiometry of each component within the biological assembly of the input structure. Identical chains are identified within a protein using a sequence identity threshold. For instance, human hemoglobin consists of two alpha and two beta subunits. Thus for high sequence thresholds it would be classified as having α2β2 stoichiometry. However, the subunits can be aligned with 43% identity. Thus, if chains are clustered at 40% identity, hemoglobin will be classified as having 4 corresponding components (α4). Next, a number of rotations of the assembly are performed. Rotations that result in corresponding subunits being superimposed are stored as valid operations. Finally, the point group of the assembly is determined based on which rotations are valid. Hemoglobin is classified as having 2-fold rotational symmetry at strict thresholds, while at relaxed thresholds where all subunits are considered equivalent, the algorithm detects the 2-fold dihedral pseudosymmetry. The algorithm is also able to determine all axes of rotation for the proteins and display them using a user-friendly Java applet.
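The dependence of stoichiometry on the identity threshold can be made concrete with a small sketch. This is not the algorithm's actual code: it assumes a greedy single-representative clustering, a caller-supplied `identity` function, and Latin letters in place of α and β; all of these names are illustrative. The 43% alpha/beta identity is the figure quoted above, and the chain labels are hypothetical.

```python
from string import ascii_uppercase


def stoichiometry(chain_ids, identity, threshold):
    """Cluster chains greedily by sequence identity and summarise counts.

    identity(x, y) returns the fractional identity between two chains.
    A chain joins the first cluster whose representative (first member)
    it matches at or above the threshold. Supports up to 26 clusters.
    """
    clusters = []
    for chain in chain_ids:
        for cluster in clusters:
            if identity(chain, cluster[0]) >= threshold:
                cluster.append(chain)
                break
        else:
            clusters.append([chain])
    clusters.sort(key=len, reverse=True)
    return "".join(
        f"{ascii_uppercase[k]}{len(c)}" for k, c in enumerate(clusters)
    )


# Hemoglobin-like assembly: two alpha and two beta chains.
kind = {"A": "alpha", "B": "beta", "C": "alpha", "D": "beta"}


def hb_identity(x, y):
    return 1.0 if kind[x] == kind[y] else 0.43


print(stoichiometry("ABCD", hb_identity, 0.95))  # A2B2
print(stoichiometry("ABCD", hb_identity, 0.40))  # A4
```

The rotation search then only needs to test operations that map each chain onto another chain in the same cluster, which is why the relaxed threshold reveals the higher pseudosymmetry.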

Using this algorithm, a census of quaternary structure symmetry in the PDB was performed. This found that about 80% of all biological assemblies of two or more chains contain symmetry. These results will soon be incorporated into the PDB to enable searching for proteins based on quaternary symmetry.

The algorithm currently uses sequence comparisons to identify corresponding chains. Thus it is only able to align very closely related chains. Future work will focus on extending this algorithm to use structurally similar protodomains as well. This should reveal additional symmetry from protodomains that have fused into a single chain.

3.4 Domain-based All-vs-all Structural Comparison

In addition to developing new methods for identifying protodomains, we have also made preliminary progress on characterizing fold space. Two all-vs-all structural comparisons of the entire PDB have been completed. Initially, a comparison of all chains in the PDB was calculated. This was later extended to include structural comparisons of all domains. The results are available through the PDB website and are updated weekly.

Detailed methods for the structural comparison are reported in [2]. In brief, a non-redundant set of protein chains was selected by clustering sequences at 40% identity. These representatives were then decomposed into domains using either SCOP domains (where available) or Protein Domain Parser (PDP) assignments. The rigid FATCAT algorithm was used to align each pair of domains. This was made possible by running the approximately 300 million alignments on the Open Science Grid distributed supercomputer.

After calculating alignments for all pairs, insignificant alignments were removed and the remaining significant alignments used to form a network of structural similarity (see figure 9). Domains with similar folds tend to cluster together. Mapping additional information onto the network, such as SCOP classification or EC number, allows the correlation between structure and function to be probed. However, it is difficult to draw conclusions about evolutionary relationships between proteins due to the imperfect correlation between structural similarity and evolutionary distance.
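The construction of such a network can be sketched in a few lines. This is an illustration only: the domain identifiers and scores are invented, and the cutoff of 0.5 simply mirrors the TM-score threshold quoted in the caption of figure 9 rather than the significance filter actually applied to the FATCAT results.

```python
def build_similarity_network(scores, threshold):
    """Build an undirected graph from pairwise similarity scores.

    scores maps an unordered pair (a, b) to a similarity score. Every
    domain becomes a node; an edge is added only when the score exceeds
    the threshold, so insignificant alignments are dropped.
    """
    graph = {}
    for (a, b), score in scores.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the clusters of the network as a list of node sets."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


# Hypothetical scores for four domains (not real alignment results).
scores = {
    ("d1", "d2"): 0.72, ("d2", "d3"): 0.61, ("d1", "d3"): 0.35,
    ("d1", "d4"): 0.18, ("d2", "d4"): 0.22, ("d3", "d4"): 0.20,
}
network = build_similarity_network(scores, threshold=0.5)
clusters = connected_components(network)
```

Note that d1 and d3 end up in the same cluster even though their direct score is below the cutoff, because both are similar to d2. This transitivity is one reason clusters in the network cannot be read directly as evolutionary families.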

4 Research Design and Methodology

Aim 1. Improve algorithms to identify conserved protodomains globally across the PDB

Significant progress has already been made towards this aim with the implementation of CE-CP and CE-Symm. These allow the decomposition of some domains into subdomains, which can then be used to seed searches for other related protodomains with different architectures. However, CE-Symm requires additional development to be able to identify hypothetical protodomains in symmetric proteins.

Since CE identifies structurally similar motifs within a protein, the alignments it returns do not always define a one-to-one correspondence between protodomains in the protein. Additionally, the alignment returned is not guaranteed to represent the minimal symmetric subunit. For instance, a four-fold symmetric molecule may be aligned along its 180° rotation rather than identifying the more fundamental 90° rotation which would lead to the correct protodomain assignment (see figure 10).

Figure 9: Network showing structural similarity between protein domains that are annotated in the TCDB as belonging to a membrane protein. Domains with sequence identity above 40% are clustered together into a single node. Domains which can be aligned with TM-Scores greater than 0.5 are connected by an edge.

Figure 10: CE-Symm alignment of chicken Triose Phosphate Isomerase, a TIM barrel composed of 8 αβ repeating units. CE-Symm chooses the highest scoring of the 7 possible alignments, which happens to correspond to a 180° rotation. Rather than finding the maximal 8-fold symmetry, CE-Symm finds only 2-fold.

The problem of non-one-to-one alignments can be solved by suitably post-processing alignments to restore this property. Some attempts at this have been made, but a suitable solution has yet to be implemented. The problem of non-minimal protodomain assignment could be solved by modifying CE to find multiple distinct, high-scoring alignments rather than simply returning the top hit. An alternative would be to use the SymD algorithm to identify protein symmetry [41]. SymD is significantly slower than CE-Symm, but is able to find some alignments corresponding to multiple rotations.
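The relationship between the alignment that is returned and the symmetry order that is detected can be made concrete with a little group arithmetic. The sketch below idealizes a protein as a perfect cyclic arrangement of n repeats, which is an assumption (real structures such as the TIM barrel are only pseudosymmetric), and the function names are hypothetical. An alignment that maps each repeat onto the repeat `shift` positions away corresponds to a rotation of shift × 360°/n, and that rotation alone generates a symmetry of order n / gcd(n, shift).

```python
from math import gcd


def rotation_angle(n_repeats, shift):
    """Rotation angle in degrees for an alignment shifted by `shift` repeats."""
    return 360.0 * shift / n_repeats


def detected_order(n_repeats, shift):
    """Order of the symmetry generated by a single shifted self-alignment."""
    if not 0 < shift < n_repeats:
        raise ValueError("shift must be between 1 and n_repeats - 1")
    return n_repeats // gcd(n_repeats, shift)


# The 7 possible non-trivial self-alignments of an idealized 8-repeat barrel.
n = 8
table = [(shift, rotation_angle(n, shift), detected_order(n, shift))
         for shift in range(1, n)]
for shift, angle, order in table:
    print(f"shift {shift}: {angle:5.1f} degrees, order {order}")
```

In this idealization, only the alignments with an odd shift recover the full 8-fold symmetry, while the 180° alignment (shift 4) yields order 2. Returning several distinct high-scoring alignments, as proposed above, would make it possible to select one that generates the maximal order.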

After protodomain detection algorithms become sufficiently accurate, they will be run across all domains in the PDB’s nonredundant set. This will result in the set of all known protodomains, which could be easily kept up-to-date with the PDB’s weekly release schedule. This list of protodomains will be made available to the community as a deliverable from Aim 1.

Aim 2. Identify structurally similar and potentially homologous protodomains across fold space

Given a set of protodomains from Aim 1, an all-vs-all structural comparison will be computed. Since many domains consist of only a single protodomain, many of these comparisons will overlap with the existing domain-based all-vs-all calculation and will not need to be recomputed. Nonetheless, this will require significant computational resources and will be performed on the Open Science Grid (OSG).

Aim 1 is likely to consist of several iterations of improving the assignment of protodomains, as the algorithms become progressively better. The all-vs-all comparison can also be iteratively improved, with only the new or changed protodomain definitions being recomputed.
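A minimal sketch of this incremental update follows. It assumes, purely for illustration, that each protodomain is stored as an identifier mapped to a residue-range string; the identifiers, ranges, and function name are hypothetical. A pair is scheduled for alignment only if at least one of its members is new or has a changed definition, and all other pairs reuse their stored results.

```python
from itertools import combinations


def pairs_to_recompute(old_defs, new_defs):
    """Return the unordered pairs whose alignment must be (re)computed.

    old_defs and new_defs map a protodomain identifier to its definition.
    A protodomain counts as changed if it is absent from old_defs or if
    its definition differs between the two releases.
    """
    changed = {pid for pid, definition in new_defs.items()
               if old_defs.get(pid) != definition}
    return [(a, b) for a, b in combinations(sorted(new_defs), 2)
            if a in changed or b in changed]


# Hypothetical definitions from two successive iterations of Aim 1.
old_defs = {"p1": "A:1-60", "p2": "A:61-120", "p3": "B:1-80"}
new_defs = {"p1": "A:1-60", "p2": "A:58-120", "p3": "B:1-80", "p4": "C:5-70"}

todo = pairs_to_recompute(old_defs, new_defs)
total = len(new_defs) * (len(new_defs) - 1) // 2
print(f"{len(todo)} of {total} pairs need alignment")
```

In this toy example the savings are small, because two of four protodomains changed. When only a small fraction of definitions change between iterations, the number of pairs to recompute grows roughly linearly with the size of the set rather than quadratically.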

After the comparison is complete, the network of structurally similar protodomains will be analyzed. Identifying clusters within this network will suggest closely related protodomains. Network analysis can be used to associate clusters with interesting properties such as ligand binding, symmetry order, enzymatic activity, and distribution across organisms.

Aim 3. Integrate protodomain arrangements with domain and quaternary structure information to create a parsimonious model of fold evolution across the tree of life

Aim 3 seeks to place protodomains within the context of the biological assembly, and thence into the broader context of the evolution of protein folds. Much as the preliminary work by Dr. Rose characterized biological assemblies by the symmetry and composition of the chains, so can biological assemblies be classified according to the symmetry and composition of their protodomains. This will be a noisier process due to the much greater divergence of protodomains compared to identical chains in a crystal, but the sensitivity and specificity of the process can be controlled by the clustering parameters used to establish the set of protodomains in each assembly.

Special attention will be focused on cases where very similar protodomains are found in biological assemblies with different components. These represent major changes in the overall protein, and could lead to different selective pressures on the assembly. Likewise, cases where the protodomain content of the assembly is conserved but the chains composing that assembly differ signal genomic changes without large structural changes. These orthogonal processes of genomic rearrangement and structural rearrangement form the core of a general model for the evolution of proteins.
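The distinction between the two processes can be sketched as a comparison of two compositions. The representation is an assumption made for illustration: an assembly is a list of chains, each chain is an ordered list of protodomain cluster labels, and the function names and category labels are hypothetical.

```python
from collections import Counter


def protodomain_composition(assembly):
    """Multiset of protodomain clusters in an assembly, ignoring chains."""
    return Counter(label for chain in assembly for label in chain)


def chain_composition(assembly):
    """Multiset of chains, each described by its ordered protodomains."""
    return Counter(tuple(chain) for chain in assembly)


def compare_assemblies(a, b):
    """Label the relationship between two biological assemblies."""
    content_a = protodomain_composition(a)
    content_b = protodomain_composition(b)
    if content_a == content_b:
        if chain_composition(a) == chain_composition(b):
            return "identical architecture"
        # Same protodomains, packaged into different chains
        return "genomic rearrangement"
    if content_a.keys() & content_b.keys():
        # Shared protodomains in assemblies with different components
        return "structural rearrangement"
    return "unrelated"


homodimer = [["P"], ["P"]]   # two identical single-protodomain chains
fused = [["P", "P"]]         # the same content on a single chain
extended = [["P", "Q"]]      # shares P but has a different component

relation_fusion = compare_assemblies(homodimer, fused)
relation_change = compare_assemblies(homodimer, extended)
```

Under this representation a circular permutation such as [["A", "B"]] versus [["B", "A"]] is also reported as a genomic rearrangement, since the protodomain content is unchanged.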

By combining structural similarity information, the protodomain composition, and external information about evolution such as sequence conservation, distribution among organisms, and evolutionary phylogenies, it should be possible to reconstruct major changes in biological assembly for key protein families.

One interesting question is how novel proteins evolved. For those who view fold space as continuous, new proteins come about through many small structural changes that lead from an ancestral fold to the child fold through countless intermediates. However, expanding a structural similarity network such as that in figure 9 to include all known proteins does not result in a single connected graph for reasonable similarity thresholds. This indicates that there remain folds which cannot be related to one another through continuous structural changes. While some such relationships may require additional sampling of fold space to reveal, for others, searching for protodomain rearrangements may reveal intermediates that connect these orphan folds through plausible evolutionary paths.

Aim 4. Apply protodomain principles to understanding the evolution of specific protein families

A protein family will be identified which is suitable for more detailed study. The purpose of this is twofold. First, the family will act as a benchmark to test the algorithms developed. When dealing with global surveys and large datasets, finding a test set where the evolution of protodomains is well understood can make naive assumptions and algorithmic limitations clear. Second, by focusing on a subset of proteins it becomes easier to make testable predictions about practical problems. Knowledge of the protodomain architecture of a family will foster solutions to applications such as protein engineering and structure prediction.

To be a functional test set, the protein family should

• Have good structural coverage

• Contain multiple members with symmetry at either the domain or quaternary structure level

• Contain circularly permuted members

• Span a diverse set of folds

Furthermore, to have practical applications it should contain proteins with connections to human health and disease.

The beta-propeller family could make a good benchmark. Propellers composed of between four and eight protodomains are known, as well as diverse quaternary assemblies. They are widespread throughout the tree of life, and a reasonable amount of evidence suggests that they evolved from a common ancestor [42]. However, the evolution of beta propellers has been well studied and it is not clear whether other all-beta protein families are closely related.

Ion channels are also an exciting prospect for further investigation. Symmetry is known to be important to the function of several ion channels [43], and it appears at both the quaternary and domain levels. Some channels show signs of having repeat regions that have since become asymmetric to the point where they are difficult to align [44]. Our preliminary network analysis also shows that the SCOP membrane proteins class is the most likely to be structurally similar to other SCOP classes. This diversity could uncover interesting structural relationships between membrane proteins and cytosolic families.

Figure 11: Structure of the E. coli ammonia channel AmtB [3C1G]. (a) Three-fold quaternary symmetry. (b) CE-Symm alignment of one chain, showing 2-fold symmetry around the two ion channels.

Aim 4 will consist of a detailed evolutionary analysis of the benchmark family, for comparison with the computational predictions of Aim 3.

4.1 Evolutionary Model

Traditional evolutionary analysis relies on sequence comparisons to establish relationships between proteins. For instance, Pfam uses sequence profiles to create families of homologous proteins. Sophisticated sequence-based methods are able to detect homology down to the "twilight zone" of about 20% identity [45, 46, 47].

More distantly related proteins can be discerned if structures are available for both proteins. Because structure changes much less rapidly than sequence, classifications that include structural information, such as SUPERFAMILY, are able to merge much older protein families [48].

Figure 12: Diagram showing evolutionary changes under the model. The two-color shapes represent heteromer formation, disintegration, fusion, and fission. The path on the right represents the same processes for homomers. Although local mutations slowly change the structure of all states, only after a homomer has monomerized can local mutations break its symmetry.

Here we present a model of protein fold evolution with an emphasis on structure. The model contains six general operations. These operations separate the underlying genetic events, which form the mechanism of evolution (changes in DNA such as insertions/deletions, mutations, etc.) from the effect of those changes on the 3D protein complex that provides function to the cell.

1. Local mutation. Any change to protein structure that does not involve changes to the protodomain architecture of the functional biological complex.

2. Protodomain fusion. Two chains of a protein complex fuse to become a single chimeric gene. This can be from a gene fusion (heteromer) or a gene duplication (homomer).

3. Protodomain fission. One protein chain is split into two independently translated genes.

4. Gain of protein-protein interface. Two previously unassociated proteins form a complex. This can be either a heteromer or a homomer.

5. Loss of protein-protein interface. Proteins that previously formed a complex lose their interaction.

6. Development of new protodomains. Dramatic evolutionary events may lead to the creation of an entirely novel fold where no precursor can be found. For instance, the evolution of a folded protein from a disordered one could result in a new protodomain with no structural ancestors. Additionally, primordial folds found in the last universal common ancestor (LUCA) would be considered new protodomains, since their evolutionary history is lost to us.
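One possible way to represent these operations computationally is sketched below. The representation is an assumption made for illustration, not a committed design: a biological assembly is a list of chains, each chain is a tuple of protodomain labels, and only fusion and fission are implemented, since the interface operations act on pairs of assemblies rather than on chains. All names are hypothetical.

```python
from enum import Enum


class Operation(Enum):
    """The six operations of the proposed evolutionary model."""
    LOCAL_MUTATION = 1
    PROTODOMAIN_FUSION = 2
    PROTODOMAIN_FISSION = 3
    INTERFACE_GAIN = 4
    INTERFACE_LOSS = 5
    NEW_PROTODOMAIN = 6


def fuse(assembly, i, j):
    """Protodomain fusion: chains i and j become one chain (i, then j)."""
    if i == j:
        raise ValueError("fusion requires two different chains")
    fused = assembly[i] + assembly[j]
    rest = [chain for k, chain in enumerate(assembly) if k not in (i, j)]
    return rest + [fused]


def split(assembly, i, position):
    """Protodomain fission: chain i is cut before index `position`."""
    chain = assembly[i]
    if not 0 < position < len(chain):
        raise ValueError("cut must fall between two protodomains")
    rest = [c for k, c in enumerate(assembly) if k != i]
    return rest + [chain[:position], chain[position:]]


# A homodimer of a single protodomain P fuses into one pseudosymmetric chain.
homodimer = [("P",), ("P",)]
single_chain = fuse(homodimer, 0, 1)
# Fission reverses the change at the level of chains.
restored = split(single_chain, 0, 1)
```

Both operations leave the multiset of protodomains in the assembly unchanged, which reflects the separation the model draws between genetic events and their effect on the functional complex.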

Distinguishing protodomain rearrangements at a genetic level from structural changes is especially useful for explaining the evolution of circular permutations and pseudosymmetric domains. The circularly permuted proteins identified by CE-CP and other methods often contain a high degree of conservation. Thus the two halves of a circularly permuted protein can be considered protodomains. The fission & fusion mechanism (see figure 3a) can be easily encapsulated in the proposed model as two fusions, both of which preserve the protodomain architecture of the biological assembly. The permutation by duplication mechanism is slightly more complicated to fit into the proposed model, since it appears that the intermediate contains twice as many copies of the protodomains as the original domain. However, the intermediate phase can be considered two biological assemblies on one chain, each containing the same protodomain architecture as the original.
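The permutation by duplication mechanism can be illustrated on a chain written as a tuple of protodomain labels. This is a toy sketch with hypothetical names: the chain is duplicated in tandem, and the permuted protein is the window of the original length that starts at the cut point, which corresponds to losing material from both termini of the intermediate.

```python
def permute_by_duplication(chain, cut):
    """Circularly permute a chain through a tandem-duplicated intermediate.

    Returns the intermediate and the permuted chain. The intermediate
    holds two copies of every protodomain; trimming `cut` protodomains
    from the start and the remainder from the end yields the permutation.
    """
    if not 0 < cut < len(chain):
        raise ValueError("cut must fall between two protodomains")
    intermediate = chain + chain
    permuted = intermediate[cut:cut + len(chain)]
    return intermediate, permuted


original = ("A", "B")
intermediate, permuted = permute_by_duplication(original, cut=1)
```

Here the intermediate is (A, B, A, B) and the permuted chain is (B, A). The intermediate can be read as two complete copies of the original architecture on one chain, which is the interpretation adopted in the model.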

Protodomains are also useful for explaining the evolution of internal pseudosymmetry within protein chains. Symmetric proteins are thought to evolve from homomeric complexes via gene duplication. A rotationally symmetric domain consists of multiple copies of a single protodomain arranged in a symmetric fashion.

5 Impact

The analysis of fold space is motivated both by basic science questions about the evolution of proteins and by the practical insights that can come from understanding the relationships between proteins. In particular, the following areas can benefit from analyzing fold space:

1. Protein design. Designing proteins with novel functions requires mutating one or more scaffold proteins such that they attain the desired structure [49]. Candidate scaffolds could be selected based on the association of the desired function with portions of fold space. To reduce the computational complexity, pseudosymmetric scaffolds could be simplified to fully symmetric multimeric scaffolds with many fewer degrees of freedom.

2. Function prediction. A better understanding of fold space could be used to improve function prediction. For instance, identifying proteins with identical protodomain architectures could be used to propagate functional predictions, even when the order and connectivity of those protodomains have changed.

3. Protein classification. The graph of fold space can be used to extend existing classification schemes to newly solved structures.

Although significant prior work has gone into analyzing fold space, there are several reasons why now is a good time to revisit this topic. The number of structures available for analysis has greatly increased. Previous attempts to characterize fold space relied on all-vs-all comparisons of hundreds of proteins [16]. Thanks to a recent effort by the Protein Data Bank (PDB) to prepare all-vs-all comparisons of all known proteins, the Bourne lab now has unique access to pairwise alignments of 18,000 non-redundant proteins representing over 70,000 individual structures. This massive increase in the number of structures considered provides a much more thorough sample of fold space.

Estimates of the rate of discovery of novel folds have led to the conclusion that we are nearing the point of sampling all naturally occurring protein folds. The number of novel folds deposited in the PDB has declined, even while the number of depositions rises exponentially. This has led to several estimates of the total number of protein folds, generally in the thousands [50, 51, 52]. The most recent version of SCOP (1.75) contains 1137 non-transmembrane folds, indicating that we are approaching saturation for protein folds. The express goal of structural genomics initiatives is to determine structures for all remaining novel folds. Although attaining this goal is still far in the future, it is likely that the majority of distinct protein folds are present in the current PDB [1]. Since fold space is fairly well sampled by protein structures (at least among soluble proteins), additional insights into the nature of fold space and its relationship to function may be accessible.

6 Conclusion

A doctorate in Bioinformatics and Systems Biology should show my proficiency at both solving computational challenges and making original contributions to our knowledge of biology. I believe that if I fulfill the plan set forth in this proposal I will have achieved both those goals.

The project requires some significant algorithmic developments. In addition to my existing work on the CE-CP and CE-Symm algorithms, I will solve the difficult problem of decomposing biological assemblies into their constituent protodomains. I will also develop algorithms to integrate structural similarity data with evolutionary histories so that the evolutionary history of each protodomain can be calculated.

Running these algorithms globally across the PDB will require dealing with large quantities of data. Computation will be run in a scalable, parallel manner on the Open Science Grid supercomputer. The output will then be analyzed both in aggregate and as specific case studies, and the results made available to the public where appropriate.

I believe that this thesis will also make a significant contribution to biology. The nature of fold space has been hotly debated. Answering questions about fold space in the context of an evolutionary model will yield answers that are much more biologically relevant and descriptive than previous characterizations of fold space. Furthermore, incorporating information on the functional biological assemblies into our model rather than focusing on isolated components better mimics the selective pressures at work in cells.

Understanding protein symmetry is directly applicable to studies of allostery, protein folding, and evolution. Studies of evolution can give insight into the next drug target or protein design scaffold. And the tools to detect relationships between distant proteins will pave the way for future advances in Bioinformatics.

Acknowledgements

Andreas Prlić has been a strong mentor throughout my time at UCSD, and has contributed advice and code to most of the projects discussed here. Peter Rose did all the quaternary symmetry studies, and is gracious enough to let me adapt them for protodomains. Douglas Myers-Turnbull has helped with numerous structural comparison searches of protodomains and identified some great examples. Almost all of the algorithms were either a part of or built on top of the BioJava library, to which many great bioinformaticians have contributed. My wife, Christine, has been very supportive of my long hours. Phil Bourne has been the best advisor a grad student could hope for, and he always makes time to discuss science and life.

References

[1] Lukasz Jaroszewski, Zhanwen Li, S Sri Krishna, Constantina Bakolitsa, John Wooley, Ashley M Deacon, Ian A Wilson, and Adam Godzik. Exploration of uncharted regions of the protein universe. PLoS Biol, 7(9):e1000205, September 2009.

[2] Andreas Prlić, Spencer Bliven, Peter W Rose, Wolfgang F Bluhm, Chris Bizon, Adam Godzik, and Philip E Bourne. Pre-calculated protein structure alignments at the RCSB PDB website. Bioinformatics, 26(23):2983–2985, December 2010.

[3] Spencer Bliven and Andreas Prlić. Circular permutation in proteins. PLoS Comput Biol, 8(3):e1002445, March 2012.

[4] Andreas Prlić, Andrew Yates, Spencer E Bliven, Peter W Rose, Julius Jacobsen, Peter V Troshin, Mark Chapman, Jianjiong Gao, Chuan Hock Koh, Sylvain Foisy, Richard Holland, Gediminas Rimša, Michael L Heuer, H Brandstätter-Müller, Philip E Bourne, and Scooter Willis. BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics, 28(20):2693–2695, October 2012.

[5] Manfred J Sippl. Fold space unlimited. Curr Opin Struct Biol, 19(3):312–320, June 2009.

[6] Ruslan I Sadreyev, Bong-Hyun Kim, and Nick V Grishin. Discrete-continuous duality of protein structure space. Curr Opin Struct Biol, 19(3):321–328, June 2009.

[7] Jeffrey Skolnick, Adrian K Arakaki, Seung Yup Lee, and Michal Brylinski. The continuity of protein structure space is an intrinsic property of proteins. PNAS, 106(37):15690–15695, September 2009.

[8] I N Shindyalov and Philip E Bourne. An alternative view of protein fold space. Proteins, 38(3):247–260, February 2000.

[9] Christine A Orengo, A D Michie, S Jones, D T Jones, M B Swindells, and Janet M Thornton. CATH–a hierarchic classification of protein domain structures. Structure, 5(8):1093–1108, August 1997.

[10] Alexey G Murzin, Steven E Brenner, T Hubbard, and C Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 247(4):536–540, April 1995.

[11] N V Grishin. Fold change in evolution of protein structures. J Struct Biol, 134(2-3):167–185, April 2001.

[12] Manfred J Sippl. On distance and similarity in fold space. Bioinformatics, 24(6):872–873, March 2008.

[13] Brian Marsden and Ruben Abagyan. SAD–a normalized structural alignment database: improving sequence-structure alignments. Bioinformatics, 20(15):2333–2344, October 2004.

[14] C A Orengo, T P Flores, W R Taylor, and J M Thornton. Identification and classification of protein fold families. Protein Eng, 6(5):485–500, July 1993.

[15] L Holm and C Sander. Mapping the protein universe. Science, 273(5275):595–603, August 1996.

[16] Jingtong Hou, Gregory E Sims, Chao Zhang, and Sung-Hou Kim. A global representation of the protein fold space. PNAS, 100(5):2386–2390, March 2003.

[17] Evgeny Krissinel and Kim Henrick. Inference of macromolecular assemblies from crystalline state. J Mol Biol, 372(3):774–797, September 2007.

[18] Nickolai Alexandrov and Ilya Shindyalov. PDP: protein domain parser. Bioinformatics, 19(3):429–430, February 2003.

[19] Jenny Gu and Philip E Bourne, editors. Structural Bioinformatics. John Wiley & Sons, 2nd edition, January 2009.

[20] Marco Punta, Penny C Coggill, Ruth Y Eberhardt, Jaina Mistry, John Tate, Chris Boursnell, Ningze Pang, Kristoffer Forslund, Goran Ceric, Jody Clements, Andreas Heger, Liisa Holm, Erik L L Sonnhammer, Sean R Eddy, Alex Bateman, and Robert D Finn. The Pfam protein families database. Nucleic Acids Res, 40(Database issue):D290–301, January 2012.

[21] Wei-Cheng Lo, Chi-Ching Lee, Che-Yu Lee, and Ping-Chiang Lyu. CPDB: a database of circular permutation in proteins. Nucleic Acids Res, 37(Database issue):D328–32, 2009.

[22] January Weiner and Erich Bornberg-Bauer. Evolution of circular permutations in multidomain proteins. Mol. Biol. Evol., 23(4):734–743, April 2006.

[23] Janusz M Bujnicki. Sequence permutations in the molecular evolution of DNA methyltransferases. BMC Evol. Biol., 2:3, March 2002.

[24] Bruce A Cunningham, John J Hemperly, Thomas P Hopp, and Gerald M Edelman. Favin versus concanavalin A: Circularly permuted amino acid sequences. PNAS, 76(7):3218–3222, July 1979.

[25] A Jeltsch. Circular permutations in the molecular evolution of DNA methyltransferases. J Mol Evol, 49(1):161–164, July 1999.

[26] H Ponstingl, K Henrick, and Janet M Thornton. Discriminating between homodimeric and monomeric proteins in the crystalline state. Proteins, 41(1):47–57, October 2000.

[27] Einat Hazkani-Covo, Neta Altman, Mia Horowitz, and Dan Graur. The evolutionary history of prosaposin: two successive tandem-duplication events gave rise to the four saposin domains in vertebrates. J Mol Evol, 54(1):30–34, January 2002.

[28] K Guruprasad, K Törmäkangas, J Kervinen, and T L Blundell. Comparative modelling of barley-grain aspartic proteinase: a structural rationale for observed hydrolytic specificity. FEBS Lett, 352(2):131–136, September 1994.

[29] C P Ponting and R B Russell. Swaposins: circular permutations within genes encoding saposin homologues. Trends Biochem Sci, 20(5):179–180, May 1995.

[30] Heike Bruhn. A short guided tour through functional and structural features of saposin-like proteins. Biochem. J., 389(Pt 2):249–257, July 2005.

[31] Jihun Lee and Michael Blaber. Experimental support for the evolution of symmetric protein architecture from a simple peptide motif. PNAS, 108(1):126–130, January 2011.

[32] Y Hatefi and M Yamaguchi. Nicotinamide nucleotide transhydrogenase: a model for utilization of substrate binding energy for proton translocation. FASEB J, 10(4):444–452, March 1996.

[33] M F Perutz, H Muirhead, J M Cox, L C Goaman, F S Mathews, E L McGandy, and L E Webb. Three-dimensional Fourier synthesis of horse oxyhaemoglobin at 2.8 Å resolution: (1) x-ray analysis. Nature, 219(5149):29–32, July 1968.

[34] M F Perutz, H Muirhead, J M Cox, and L C Goaman. Three-dimensional Fourier synthesis of horse oxyhaemoglobin at 2.8 Å resolution: the atomic model. Nature, 219(5150):131–139, July 1968.

[35] Jacques Monod, Jeffries Wyman, and Jean-Pierre Changeux. On the Nature of Allosteric Transitions: A Plausible Model. J Mol Biol, 12:88–118, May 1965.

[36] K Kinoshita, A Kidera, and N Go. Diversity of functions of proteins with internal symmetry in spatial arrangement of secondary structural elements. Protein Sci, 8(6):1210–1217, June 1999.

[37] Anne-Laure Abraham, Joël Pothier, and Eduardo P C Rocha. Alternative to homo-oligomerisation: the creation of local symmetry in proteins by internal amplification. J Mol Biol, 394(3):522–534, December 2009.

[38] I N Shindyalov and Philip E Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng, 11(9):739–747, August 1998.

[39] R C G Holland, T A Down, M Pocock, Andreas Prlić, D Huen, K James, S Foisy, A Dräger, A Yates, M Heuer, and M J Schreiber. BioJava: an open-source framework for bioinformatics. Bioinformatics, 24(18):2096–2097, September 2008.

[40] S Uliel, A Fliess, A Amir, and R Unger. A simple algorithm for detecting circular permutations in proteins. Bioinformatics, 15(11):930–936, November 1999.

[41] Changhoon Kim, Jodi Basner, and Byungkook Lee. Detecting internally symmetric protein structures. BMC Bioinformatics, 11:303, 2010.

[42] Indronil Chaudhuri, Johannes Söding, and Andrei N Lupas. Evolution of the beta-propeller fold. Proteins: Structure, Function, and Bioinformatics, 71(2):795–803, May 2008.

[43] Lucy R Forrest and Gary Rudnick. The rocking bundle: a mechanism for ion-coupled solute flux by symmetrical transporters. Physiology (Bethesda), 24:377–386, December 2009.

[44] Lucy R Forrest, Reinhard Krämer, and Christine Ziegler. The structural basis of secondary active transport mechanisms. Biochim Biophys Acta, 1807(2):167–188, February 2011.

[45] R F Doolittle. Similar amino acid sequences: chance or common ancestry? Science, 214(4517):149–159, October 1981.

[46] L Jaroszewski, L Rychlewski, and A Godzik. Improving the quality of twilight-zone alignments. Protein Sci, 9(8):1487–1496, August 2000.

[47] J Söding. Protein homology detection by HMM-HMM comparison. Bioinformatics, 21(7):951–960, March 2005.

[48] J Gough, K Karplus, R Hughey, and C Chothia. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol, 313(4):903–919, November 2001.

[49] Brian Kuhlman, Gautam Dantas, Gregory C Ireton, Gabriele Varani, Barry L Stoddard, and David Baker. Design of a novel globular protein fold with atomic-level accuracy. Science, 302(5649):1364–1368, November 2003.

[50] C Zhang and Charles DeLisi. Estimating the number of protein folds. J Mol Biol, 284(5):1301–1305, December 1998.

[51] S Govindarajan, R Recabarren, and R A Goldstein. Estimating the total number of protein folds. Proteins, 35(4):408–414, June 1999.

[52] Y I Wolf, Nick V Grishin, and E V Koonin. Estimating the number of protein folds and families from complete genome data. J Mol Biol, 299(4):897–905, June 2000.

Figure 13: Timeline. Gantt chart listing each task (Title) with its planned duration (Effort); the time axis runs from July 2012 to January 2015.

1) Aim 1: 36w 1d
1.1) Refine CE-Symm alignments: 4w 3d
1.2) Return multiple CE-Symm alignments: 6w
1.3) Run algorithms on PDB: 3w 1d
1.4) Build NR protodomain set: 3w 4d
1.5) Investigate additional protodomain detection algorithms: 18w 3d

2) Aim 2: 38w 2d
2.1) Protodomain all-vs-all on OSG: 10w
2.2) Analyze protodomain similarity network: 6w
2.3) Annotate network with phylogenetic history: 9w 4d
2.4) Optimize protodomain clustering: 12w 3d

3) Aim 3: 30w 3d
3.1) Determine protodomain composition of BAs: 8w
3.2) Identify interesting examples of BA/protodomain coevolution: 14w 3d
3.3) Apply evolutionary model to network data: 8w

4) Aim 4: 48w 4d
4.1) Review literature on ion channel & beta propeller evolution: 21w 1d
4.2) Computationally determine evolutionary history of target family w.r.t. model: 10w 3d
4.3) Validate model based on literature: 4w
4.4) Apply new knowledge to protein-specific problem: 13w

5) Thesis: 13w
5.1) Compile Thesis: 13w
