
A report from the American Academy of Microbiology


An Experimental Approach to Genome Annotation


Copyright © 2004
American Academy of Microbiology
1752 N Street, NW
Washington, DC 20052
http://www.asm.org

This report is based on a colloquium sponsored by the American Academy of Microbiology held July 19-20, 2004, in Washington, DC.

The American Academy of Microbiology is the honorific leadership group of the American Society for Microbiology. The mission of the American Academy of Microbiology is to recognize scientific excellence and foster knowledge and understanding in the microbiological sciences.

The American Academy of Microbiology is grateful for the generous support of the National Science Foundation. The American Academy of Microbiology strives to include women and underrepresented scientists in all activities.

The opinions expressed in this report are those solely of the colloquium participants and may not necessarily reflect the official positions of our sponsors or the American Society for Microbiology.


By Richard J. Roberts, Peter Karp, Simon Kasif, Stuart Linn, and Merry R. Buckley

An Experimental Approach to Genome Annotation


Board of Governors, American Academy of Microbiology

Eugene W. Nester, Ph.D. (Chair)
University of Washington

Kenneth I. Berns, M.D., Ph.D.
University of Florida Genetics Institute

Arnold L. Demain, Ph.D.
Drew University

E. Peter Greenberg, Ph.D.
University of Iowa

J. Michael Miller, Ph.D.
Centers for Disease Control and Prevention

Stephen A. Morse, Ph.D.
Centers for Disease Control and Prevention

Harriet L. Robinson, Ph.D.
Emory University

Abraham L. Sonenshein, Ph.D.
Tufts University Medical School

George F. Sprague, Jr., Ph.D.
Institute for Molecular Biology, University of Oregon

David A. Stahl, Ph.D.
University of Washington

Judy D. Wall, Ph.D.
University of Missouri

Colloquium Steering Committee

Richard J. Roberts, Ph.D. (Chair)
New England Biolabs, Beverly, Massachusetts

Peter Karp, Ph.D.
SRI International, Menlo Park, California

Simon Kasif, Ph.D.
Boston University

Stuart Linn, Ph.D.
University of California, Berkeley

Carol A. Colgan
Director, American Academy of Microbiology


Colloquium Participants

Cheryl Arrowsmith, Ph.D.
University of Toronto, Ontario, Canada

Tadhg Begley, Ph.D.
Cornell University

Robert Bender, Ph.D.
University of Michigan

Barry R. Bochner, Ph.D.
Biolog, Hayward, California

Eric Brown, Ph.D.
McMaster University, Hamilton, Ontario, Canada

Frank Collart, Ph.D.
Argonne National Laboratory

Valerie de Crecy-Lagard, Ph.D.
The Scripps Research Institute

Andras Fiser, Ph.D.
Albert Einstein College of Medicine

Michael Y. Galperin, Ph.D.
National Library of Medicine, National Institutes of Health

Jon Goguen, Ph.D.
University of Massachusetts Medical School

Howard Goldfine, Ph.D.
University of Pennsylvania School of Medicine

Eugene Kolker, Ph.D.
Biatech, Bothell, Washington

Eugene Koonin, Ph.D.
National Library of Medicine, National Institutes of Health

Frank W. Larimer, Ph.D.
Oak Ridge National Laboratory

Thomas Leyh, Ph.D.
Albert Einstein College of Medicine

Paul Ludden, Ph.D.
University of California, Berkeley

Edward Marcotte, Ph.D.
University of Texas at Austin

Kenneth Nealson, Ph.D.
University of Southern California

Eugene Nester, Ph.D.
University of Washington

Andrei Osterman, Ph.D.
The Burnham Institute, La Jolla, California

Julian Parkhill, Ph.D.
The Sanger Centre, Hinxton, Cambridge, England

Dan Robertson, Ph.D.
Diversa Corporation, San Diego, California

Margaret Romine, Ph.D.
Battelle Pacific Northwest National Laboratory

Steven Salzberg, Ph.D.
The Institute for Genomic Research, Rockville, Maryland

Jeff Skolnick, Ph.D.
Buffalo Center of Excellence in Bioinformatics, Buffalo, New York

Gary Stormo, Ph.D.
Washington University School of Medicine, St. Louis, Missouri

Alfonso Valencia, Ph.D.
CNB-CSIC, Madrid, Spain

Eric Vimr, Ph.D.
University of Illinois

Jeremy Zucker, Ph.D.
Dana-Farber Cancer Institute, Boston, Massachusetts


Observers

Elaine Akst, Ph.D.
Fundamental Space Biology Division, National Aeronautics and Space Administration

Patrick P. Dennis, Ph.D.
Division of Molecular and Cellular Biosciences, National Science Foundation

Daniel W. Drell, Ph.D.
Health Effects and Life Sciences, U.S. Department of Energy

Irene Anne Eckstrand, Ph.D.
National Institute of General Medical Sciences, National Institutes of Health

Brad W. Fenwick, Ph.D.
U.S. Department of Agriculture, Kansas State University

Maria Y. Giovanni, Ph.D.
National Institute of Allergy and Infectious Diseases, National Institutes of Health

Maryanna P. Henkart, Ph.D.
Division of Molecular and Cellular Biosciences, National Science Foundation

John Houghton, Ph.D.
Biological and Environmental Research, U.S. Department of Energy

Eric Jakobsson, Ph.D.
National Institute of General Medical Sciences, National Institutes of Health

Matthew D. Kane, Ph.D.
Division of Environmental Biology, National Science Foundation

Rachel E. Levinson
Office of Science and Technology Policy

Ann Lichens-Park, Ph.D.
U.S. Department of Agriculture

Anna Palmisano, Ph.D.
U.S. Department of Agriculture

Joanne Tornow, Ph.D.
Division of Molecular and Cellular Biosciences, National Science Foundation


Executive Summary

The American Academy of Microbiology convened a colloquium July 19-20, 2004, in Washington, DC, to address the critical challenge of prokaryotic genome annotation and to seek ways to accelerate progress in the field. Recent advances in DNA sequencing have produced a spectacular amount of new data; literally hundreds of thousands of sequenced prokaryotic genes now await annotation. These genes can be enumerated, compared, and grouped by sequence similarity into families, yet an understanding of their biochemical functions is lacking. Genomics provides that rare opportunity in science where the boundaries of current knowledge can be clearly defined. The annotation initiative proposed in this document will extend those boundaries and will likely lead to new applications and new progress in healthcare, biodefense, energy, the environment, and agriculture. This research could also impact many commercial enterprises, such as the chemical, food, and dairy industries.

Colloquium participants included microbiologists, biochemists, and bioinformaticians. Observers from the National Institutes of Health, the National Science Foundation, the U.S. Department of Energy, the National Aeronautics and Space Administration, the Office of Science and Technology Policy, and the U.S. Department of Agriculture were also in attendance. Participants discussed the currently available sources of genome annotation information and the strengths and limitations of those sources. Four areas of concern in genomic annotation were identified:

(1) As many as 40% of all predicted genes in completed prokaryotic genomes have no functional annotation.

(2) Many genes have a predicted function, but that prediction has not been experimentally validated.

(3) As many as 5-10% of predicted gene functions may be incorrect.

(4) Many known enzymes have no corresponding genes identified in the sequence databases.

Much of the currently available annotation information is provided by computer programs that predict the functions of newly sequenced genes on the basis of their similarity to genes of known (or predicted) function. This technique is inherently limited in both breadth and accuracy by the small size of the core foundational set of genes with experimentally established functions. By expanding that foundational set through a systematic program of biochemical study of genes of unknown function, we can dramatically increase the quality of prokaryotic genome annotations, and enhance our understanding of current and future genome sequences.

The experimental elucidation of function for a hypothetical gene can be a significant challenge for the biochemist. However, in the past five years new bioinformatics techniques, mostly based on comparative genomics, have been developed that can provide clues about the function of a gene. Functional genomics methods, such as gene expression chips, can also provide hints about gene function. Such clues can greatly accelerate experimental studies by suggesting plausible hypotheses to be tested in the laboratory.

Colloquium participants agreed that accurate and complete annotation is vital to making full use of genomic data. However, there are great deficiencies in currently available annotation sources. Moreover, there are few sources of dedicated funding for experimental approaches to annotation. In light of these facts, it was recommended that a new initiative be undertaken that would synergistically combine computational methodologies for functional prediction with a systematic experimental approach to test those predictions. It would also broaden the foundational set of experimentally determined gene functions by finding missing genes for known enzymatic functions. Such a program would both increase experimental knowledge and spur further accuracy in bioinformatics prediction, leading to repeated cycles of validation and prediction.

As part of the proposed initiative, a new resource focused on annotation should be developed. The central component is a database containing:

• Predictions regarding the functions of genes of unknown function, deposited by bioinformaticians, based on computationally inferred clues, which will serve as a starting point for experimental investigations.

• The results, positive or negative, of those experimental investigations, which in many cases will establish new gene annotations backed by rigorous experimental work.

• A prioritized list of sequenced genes for which no functional information is currently available.

• A list of biochemically-characterized functions for which no gene has yet been assigned (referred to as orphan functions).

• Data on previously characterized proteins currently in the public databases.


The basic design of the database was discussed, and recommendations for hosting, administration, and management of the database were put forth.

Achieving an accurate and detailed annotation of a newly sequenced genome is a critical, but often difficult, step in the process of analyzing the sequence data. This is especially difficult for organisms where genetic tools have yet to be developed. Unfortunately, the pace of experimental elucidation of gene function is very slow compared to the pace of sequencing and computational prediction of function. Thus, rather than attempt to experimentally explore the functions of every unknown gene in every sequenced genome, it is preferable to focus experimental investigations on the most informative targets. For instance, the scope and accuracy of existing bioinformatics techniques would be greatly enhanced by obtaining one or a few experimental functions for members of gene families that are found in many organisms. This experimental annotation initiative will encourage and enable experimental biochemists to participate in the annotation of prokaryotic genomes.

The initial focus of this particular initiative would be prokaryotes (bacteria and archaea) because (a) they possess relatively small genomes comprising genes that are usually easily defined, (b) a great deal of prokaryotic genomic sequence data is available in the public domain, and (c) they are experimentally tractable. Schemes need to be developed to determine which among the prokaryotic gene products and orphan proteins should receive attention first. In one possible plan, priority would be given to families of similar genes that are found in many different genomes, because determining a biochemical function for one member will likely implicate all family members as possessing the same or a similar function.

The details of the bioinformatics part of the initiative, such as database design and operation, should be open to the discretion of those researchers who apply for funding to construct it, but certain broad recommendations for the content and administrative aspects of the resource were formulated by the colloquium participants. For example, the database should include not only protein gene products but also functional RNA products. It was stressed that the input of both bioinformaticians and experimentalists would be vital to the success of the initiative, and their collaboration should be encouraged. The creation of an external database advisory board was also recommended. Funding would be required to support the bioinformaticians who will make and evaluate the bioinformatics predictions and generate and maintain the computational resource. However, the largest requirement for funding would be to support the experimental biochemical work testing bioinformatics predictions. It was proposed that one or more pilot projects be undertaken to assess the feasibility of the approach before embarking on a large-scale initiative.

The potential impact of the proposed initiative is difficult to overstate since it would affect all aspects of biology. The participants feel that this project is essential to enable the next step in moving genomic science forward from accumulating a large depository of sequences towards achieving a true understanding of the basic elements of prokaryotic biology. Without a forward-looking initiative like the one proposed here, the functional data needed to propel systems biology forward will not be available, and those trying to understand the complex interactions of genes and their products in living cells will continue to work with many components of unknown function. In addition, elucidating the enzymatic functions essential for prokaryotic life will impact our understanding of eukaryotic organisms, which possess many of these same genes. This initiative will also foster closer collaborations between experimental and computational scientists and help to reinstate the importance of biochemical research. Finally, although much of the project will focus on traditional biochemistry, the initiative can be expected to stimulate new advances in functional screening, new functional genomic technologies such as phenotype arrays, and significant industrial and commercial opportunities in the form of new targets for both medical and industrial applications of prokaryotic biology.

Introduction

Since the emergence of large-scale sequencing technologies in the 1990s, the complete genomes of hundreds of organisms have been sequenced and archived. The success of these sequencing programs has been breathtaking. Today, the genomes of organisms as diverse as bacteria and humans and as extraordinary as puffer fish and loblolly pine have been catalogued. Sequences abound, and we are in the midst of a genomics revolution.

However, the more difficult work of interpreting genomic sequences has hardly begun. Roughly 40% of predicted genes have not been assigned even tentative functions. It is rare in science to be able to clearly delineate the boundaries of current knowledge, but that is exactly where genomics stands today. The genome sequences available at this time are a great resource, and to fully realize the potential they represent they must be annotated accurately through a combined approach of bioinformatics and experimentation. Since many genes are found in some form in more than one species, assigning function to any individual gene can impact our understanding of many different organisms. Hence, a focused effort in functional annotation of individual genes can have an extensive impact on our understanding of many species and systems. Within these sequences lie novel drug targets, new enzymes for biotechnology, and likely an abundance of novel regulatory elements.

Simple annotation is only a first, but essential, step in understanding the complexity of the whole organism. Functional annotation should be more than a cataloguing of protein functions. It should include information about the interactions of the gene products; these interactions result in a hierarchy of function about which we know very little. The consequence of these interactions is a system that is far greater than the sum of its parts and results in a self-replicating organism. One goal in determining the functions of the genes, therefore, is to provide a basis from which the workings of the whole organism can be explored and ultimately understood. This is the realm of systems biology.

Systems biologists seek to achieve an understanding of the whole organism by studying the products of the genome, how they interrelate, and how they come together in synergistic networks that accomplish the complex functions of life. Because systems biology deals directly with the products of the genome, the field relies heavily on careful functional annotation to describe the individual roles of these products. Unfortunately, efforts in systems biology are presently hampered by the slow pace of gene annotation. It is therefore essential that we uncover the full biochemical potential of prokaryotic genes if we are ever to understand the internal machinery of the simplest forms of life.

The slow pace of functional annotation

Despite the necessity of experimental, functional annotation, progress in this area has been sluggish, resulting in calls for action from members of the genomics community (Roberts, RJ, PLoS Biology 2: E42, 2004; Karp, PD, Genome Biol. 5: 401, 2004). The slow pace of annotation has led to the current situation, in which a large number of putative gene annotations are based on a much smaller foundational set of functionally characterized gene products. This inverse pyramid is not a satisfactory basis from which to form hypotheses. In this situation, with limited breadth of knowledge about the diversity of gene functions, functions can be easily misconstrued, and too much confidence might be placed on sequence correlations between distantly-related genes. To make full use of the resource of genomic sequences, therefore, experimental annotation of gene products must be given high priority.

One reason for the current gap between sequence determination and the functional characterization of genes is a dearth of funding opportunities for annotation. Up to now, funding has been scattered and fairly limited, posing a significant barrier to progress in annotation. The current funding paradigm in the United States does not allow for short, intensive investigations of the functions of individual genes or gene families; funding sources are more focused on high throughput research strategies or directed towards particular biochemical problems of interest to an investigator.

Current sources of functional annotation and their limitations

A number of different sources of functional annotation, some well-known, others more obscure or under development, are presently available to researchers, but they have a number of limitations and do not meet all of the requirements of current research. The scientific literature is the best source of annotation, but the relevant information is often scattered among many papers and may not be widely known. Moreover, a great deal of relevant information was published prior to availability of genome sequences, but this information was not consistently incorporated into current lists of annotations.

The advent of computational tools has also caused some problems in acquiring and compiling annotation data. As computational methods have come to dominate genome annotation, early annotation errors have been propagated. Also, functions discovered since the first annotation was made have been missed in some cases and have not been incorporated into the databases. To address these problems, the field requires a single, central source of annotation information that is regularly updated and is distinct from the current archival databases. This resource could be used as a point of reference for future annotators and other biologists.

The annotation sources available today frequently apply thresholds of evidence that must be exceeded before a given annotation can be included in a database. This approach has the advantage that it avoids misleading users by omitting wildly speculative functions. However, clues that might be indicative of a gene’s function, but do not rise to the standards required by the genome annotators, are unavailable to researchers. The reliability of annotations made available through these resources is also a concern, as databases rarely provide quantitative estimates of the trustworthiness of the evidence used in annotation. Moreover, the experimental methods employed vary among research groups. As a result, incorrect annotations made using faulty techniques can go undetected and may be propagated from one genome to another.

Updating annotation information can also be difficult. New information and gene product characterizations that appear in the literature often fail to be transmitted to electronic annotation sources because updating procedures have not been put in place. Conversely, making annotation errors known to the scientific community also poses a problem and is often neglected. (Model organism databases such as EcoCyc, FlyBase, and the Saccharomyces Genome Database are exceptions in that these databases perform intensive literature-based curation efforts that extract new experimentally determined functions from the literature and update their respective databases. However, these efforts are limited to less than 10% of current genomes and the annotations are not automatically passed to related genes in other organisms. Furthermore, these databases do not currently store predictions or clues about function from either computational or experimental methodologies.)

Finally, the names that these annotation sources employ for genes and gene products elicit a great deal of confusion among researchers. Currently very few abstracts that describe protein functions use standard functional descriptions or systematic gene identifiers.

Recent breakthroughs in bioinformatics

The majority of bioinformatics function prediction algorithms infer the function of a gene based on similarity of its sequence to that of previously characterized genes. But what if none of the homologs of a gene have assigned functions? What if a gene has no homologs in the public sequence databases? These situations present serious roadblocks for applying sequence-based methods for prediction of gene function.

Box 1: The Enzyme Genomics Initiative

One group has initiated systematic work to find the genes associated with known enzymatic functions. SRI International has begun a project called the Enzyme Genomics Initiative, the goal of which is to find at least one genetic sequence for each known enzymatic function. SRI’s estimates, taken from cross-referencing the EC numbers available in the ENZYME database with information on enzyme genes from a number of databases, including Swiss-Prot and TrEMBL, conclude that there are at least 1,400 enzymes for which the genetic sequence is not known. SRI refers to these enzymes as “orphans”, a term suggestive of their indefinite genetic heritage.

Researchers working on the SRI initiative have devised a ranking system for predicting the ease with which the gene for a given orphan could be revealed. Highest marks are given to those enzymes that come from an organism whose genome has been fully sequenced or for which at least part of the protein sequence is known. In these cases, identifying the gene simply requires computationally matching the biochemical properties of the enzyme to the open reading frames in the genome sequence and then testing the prediction experimentally. A second priority is given to those enzymes that have been experimentally characterized recently. Lesser marks are granted to enzymes with available information regarding certain physical characteristics, like molecular weight, isoelectric point or the results of protease digestion.

See http://bioinformatics.ai.sri.com/enzyme-genomics/ for more information on SRI’s Enzyme Genomics Initiative.

Taken from a lecture presented by Peter Karp, SRI International
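The cross-referencing and ranking steps described for the Enzyme Genomics Initiative can be illustrated with a small Python sketch. This is not SRI's actual procedure: the class, the field names, the numeric tiers, and the placeholder identifiers are all hypothetical, chosen only to mirror the three priority levels described in the sidebar.

```python
from dataclasses import dataclass


@dataclass
class Enzyme:
    """Hypothetical record for one known enzymatic function."""
    ec_number: str                           # placeholder identifiers, not real EC numbers
    genome_sequenced: bool = False           # source organism has a complete genome
    partial_protein_sequence: bool = False   # part of the protein sequence is known
    recently_characterized: bool = False     # experimentally characterized recently
    physical_data: bool = False              # e.g. molecular weight, isoelectric point


def find_orphans(known_enzymes, ec_with_sequence):
    """Keep enzymes whose EC number has no sequence in any database."""
    return [e for e in known_enzymes if e.ec_number not in ec_with_sequence]


def priority_tier(enzyme):
    """Assumed tiers: 1 is the easiest case, 4 means no useful information."""
    if enzyme.genome_sequenced or enzyme.partial_protein_sequence:
        return 1
    if enzyme.recently_characterized:
        return 2
    if enzyme.physical_data:
        return 3
    return 4


def rank_orphans(orphans):
    """Sort orphans so the most tractable ones come first (stable sort)."""
    return sorted(orphans, key=priority_tier)


# Toy data, invented for illustration only.
known_enzymes = [
    Enzyme("EC-placeholder-1", physical_data=True),
    Enzyme("EC-placeholder-2", genome_sequenced=True),
    Enzyme("EC-placeholder-3", recently_characterized=True),
    Enzyme("EC-placeholder-4", genome_sequenced=True),
]
ec_with_sequence = {"EC-placeholder-4"}  # functions that already have a gene

ranked = rank_orphans(find_orphans(known_enzymes, ec_with_sequence))
for enzyme in ranked:
    print(priority_tier(enzyme), enzyme.ec_number)
```

The set-membership test stands in for the cross-referencing of EC numbers against sequence databases; a real pipeline would have to parse those databases and reconcile identifiers first.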

A lack of hints about the possible function of a gene also poses a problem in applying experimental approaches to functional characterization. Hints about gene function are helpful because they can indicate which biochemical assays should be applied to confirm or refute the hypothetical function. Knock-out mutants for the gene of interest could provide clues to function, unless a knock-out has no observable phenotype. Gene expression studies can also provide hints in cases where a gene is co-expressed or co-regulated with a relatively small number of other genes with common functions.

Computational approaches can also provide indications of gene function. Recent breakthroughs in bioinformatics have produced three classes of computational techniques that can increase the chance that experimentalists will have a starting point for their research. The first class of methods is based on comparative genomics techniques that infer functional associations between genes from gene patterns observed across many genomes. For example, imagine that a gene of interest, A, is adjacent in an organism of interest to gene B, whose function is known. By itself this observation is of little consequence since the neighbors of a gene frequently have completely unrelated functions. But imagine further that we observe five or more diverse organisms in which the homolog of gene A in each organism is adjacent to the homolog of gene B in that organism. Observation of conserved chromosomal proximity across many genomes supports an inference that genes A and B have related functions, such as being in the same pathway. Similar techniques based on phylogenetic co-occurrence, and the existence of rare gene fusions, are also used.
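The conserved-neighbor argument can be made concrete with a minimal Python sketch. It assumes that homologs have already been grouped into families, so each genome is just an ordered list of family labels. The genome names, gene orders, and the `MIN_GENOMES` cut-off are illustrative (the cut-off echoes the "five or more" in the example), and the sketch ignores chromosome circularity and does not check that the organisms are actually diverse.

```python
def adjacent(gene_order, family_a, family_b):
    """True if members of the two families are immediate neighbours."""
    for left, right in zip(gene_order, gene_order[1:]):
        if {left, right} == {family_a, family_b}:
            return True
    return False


def conserved_neighbour_count(genomes, family_a, family_b):
    """Count genomes in which the two families sit next to each other.

    genomes maps a genome name to its gene families in chromosomal order.
    """
    return sum(adjacent(order, family_a, family_b) for order in genomes.values())


MIN_GENOMES = 5  # illustrative threshold, taken from the "five or more" example

# Toy gene orders, invented for illustration only.
genomes = {
    "genome_1": ["X", "A", "B", "Y"],
    "genome_2": ["B", "A", "Z"],
    "genome_3": ["Q", "R", "A", "B"],
    "genome_4": ["A", "B"],
    "genome_5": ["Y", "B", "A", "X"],
    "genome_6": ["A", "X", "B"],   # A and B are present but not adjacent
}

count = conserved_neighbour_count(genomes, "A", "B")
likely_related = count >= MIN_GENOMES
print(count, likely_related)
```

A count above the threshold only suggests a functional link between A and B; as the text stresses, the prediction still has to be tested in the laboratory.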

The second class of methods is based on applying machine learning techniques to very large datasets of information under the assumption that proteins with certain functions tend to have certain physico-chemical properties. This type of method has been used to infer whether or not a protein is an enzyme and to infer which of the six top-level Enzyme Commission classes an enzyme belongs to, based on information such as the amino-acid composition, molecular weight, and pI of the protein.
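As a rough illustration of this second class, the Python sketch below turns a protein sequence into an amino-acid composition vector and assigns a label with a nearest-centroid rule. The nearest-centroid classifier is a stand-in chosen for brevity, not the method used in the published work; the training sequences and labels are toy strings with no biological meaning, and molecular weight and pI are omitted.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence):
    """Fraction of each standard amino acid, in AMINO_ACIDS order."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_label(features, centroids):
    """Label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: squared_distance(features, centroids[label]))


# Toy training data: made-up strings, labelled arbitrarily for the demo.
training = {
    "enzyme": ["GGGGAAAA", "GGGAAAAA"],
    "non_enzyme": ["KKKKLLLL", "KKKLLLLL"],
}
centroids = {
    label: centroid([composition(seq) for seq in seqs])
    for label, seqs in training.items()
}

predicted = nearest_label(composition("GGAAAAGA"), centroids)
print(predicted)
```

In practice the feature vector would be extended with further properties such as molecular weight and pI, and the classifier would be trained on a large set of proteins with experimentally established functions.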

The third class of methods is based on integrative technologies that bring together different sources of evidence, such as protein-protein interaction screens, functional genomics screens, pathway information, RNAi screens, and computational predictions, to piece together a global picture of the proteome of a given organism.

An annotation initiative

In light of the pivotal importance of functional annotation to the progress of biological science, and because of the lack of clearly targeted support for experimental annotation and limitations of the current annotation initiatives, it is recommended that a centralized annotation initiative be established in the United States. (A related initiative, the Biosapiens project, is already underway in Europe.) The new initiative proposed here would (a) require bioinformaticians, in collaboration with experimentalists and the database managers, to establish a prioritized list of genes for experimental study and (b) engage experimentalists to test those predictions in the laboratory. In addition, enzymes for which good biochemical assays exist, but for which no genes have been reported, should also be targeted for study under the initiative [see Box 1]. The results would lead to a database of reliable, experimentally-derived annotations that would be used by the entire biological community. The establishment of a central database to support experimental annotation would alert biologists to the need for careful interpretation of genomic sequences, and would provide immediate practical assistance to genome annotators. By providing funds for small, distinct annotation projects, the initiative could support and highlight an important niche within the biological community.

The experimental assignment of function for a given gene product is often a discrete, achievable project that can be accomplished within a reasonable amount of time by new Ph.D. students, rotation students, or even good Honors undergraduate students. The key is for the student to work in an established laboratory where reagents, substrates for biochemical assays, and technical know-how are available. A focused effort to assign functions to unknown genes would provide many excellent opportunities for training of graduate-level students.

Under the initiative, a functional prediction and validation database would tie together catalogs of uncharacterized gene products and orphan proteins with functional information that has been acquired through experimental means. The database would allow bioinformaticians and experimentalists to post new predictions and experimental results along with detailed information about the methods and evidence used to derive them. The database would require the contributions of a reliable staff of professionals who are able to work with the community of annotators and maintain the quality required by such a valuable resource.
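One way to picture the records such a database might hold is the Python sketch below. The report leaves the design to whoever builds the resource, so every class and field name here is a hypothetical illustration: each gene carries a list of posted predictions and a list of experimental results, with the method and evidence stored alongside each entry.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Prediction:
    """A computational prediction posted by a bioinformatician."""
    predicted_function: str
    method: str        # how the prediction was derived
    evidence: str      # supporting clues
    submitted_by: str


@dataclass
class ExperimentalResult:
    """Outcome of a laboratory test; negative results are stored too."""
    tested_function: str
    confirmed: bool
    method: str
    submitted_by: str


@dataclass
class GeneRecord:
    gene_id: str
    predictions: List[Prediction] = field(default_factory=list)
    results: List[ExperimentalResult] = field(default_factory=list)

    def validated_function(self) -> Optional[str]:
        """First experimentally confirmed function, or None if there is none."""
        for result in self.results:
            if result.confirmed:
                return result.tested_function
        return None


# Toy entries, invented for illustration only.
record = GeneRecord("gene_0001")
record.predictions.append(
    Prediction("function_A", "sequence similarity", "weak homolog hit", "group_1")
)
record.results.append(
    ExperimentalResult("function_A", False, "in vitro assay", "lab_1")
)
status_after_negative = record.validated_function()

record.results.append(
    ExperimentalResult("function_B", True, "in vitro assay", "lab_2")
)
status_after_positive = record.validated_function()
print(status_after_negative, status_after_positive)
```

Keeping negative results in the same structure as positive ones lets later users see which hypotheses have already been ruled out for a gene.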


The initiative would also enable further improvements to and validation of bioinformatics algorithms. By facilitating comparisons of functional predictions generated using bioinformatics with the results of experiments designed to test those predictions, the predictive power of the computational tools available through the database could be continually evaluated and improved. Thus, as the database expands so will the power of the predictive methods.
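The feedback loop between prediction and validation can be sketched as a simple tally in Python. The sketch assumes each tested prediction is reduced to a pair (method name, whether the experiment confirmed it); the method names and outcomes below are invented, and the fraction confirmed is only one of several measures that could be used.

```python
def confirmation_rate(outcomes):
    """Fraction of tested predictions that were confirmed, per method.

    outcomes is a list of (method, confirmed) pairs.
    """
    tally = {}
    for method, confirmed in outcomes:
        tested, correct = tally.get(method, (0, 0))
        tally[method] = (tested + 1, correct + int(confirmed))
    return {method: correct / tested for method, (tested, correct) in tally.items()}


# Toy outcomes, invented for illustration only.
outcomes = [
    ("gene_neighbourhood", True),
    ("gene_neighbourhood", True),
    ("gene_neighbourhood", False),
    ("gene_neighbourhood", True),
    ("sequence_similarity", True),
    ("sequence_similarity", False),
]

rates = confirmation_rate(outcomes)
for method, rate in sorted(rates.items()):
    print(method, rate)
```

Recomputing such rates as new experimental results arrive is what would allow a prediction method to be re-evaluated, and if necessary retrained, as the database grows.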

An annotation initiative of this type would benefit all of the fields that rely upon genomics and informatics by providing detailed knowledge about gene products and generating hypotheses about their biological role. Biodefense, biotechnology, synthetic biological chemistry and drug development would all be helped, and the functional data generated would be of key significance for systems biology and many of the post-genomic initiatives that study multi-gene systems.

Outlining the annotation problem

An annotation initiative to support and compile accurate functional annotations should be designed to focus on the most worthy and central annotation problems at hand. With countless genes of unknown function and many scores of functions with unknown genes, where should the work of functional annotation begin? Also, how should erroneous functional assignments in publicly available databases be brought to light and corrected?

The initial thrust of the initiative should be the annotation of prokaryotic genomes. Bacteria and archaea possess relatively small genomes that could be more easily interpreted in a reasonable amount of time than those of eukaryotes. Also, scores of such genomes have been sequenced and serve as a great resource for experimentalists. Finally, prokaryotic organisms are currently the main platform for synthetic biology, molecular engineering, and systems biology. Hence, there is an urgent need to compile a carefully curated parts-list for future research and biomedical and commercial applications.

Setting priorities for experimentation

A prioritization scheme is needed to determine which genes among the available prokaryotic genomes should first be targeted for experimentation. A number of distinct schemes can be put forth, but the most effective and realistic approach would combine many considerations. Ideally, the cost and length of the experimental study would be balanced against the probability of success and the overall utility of the studied gene product.
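As an illustration only, the balance described above can be sketched as a simple score. The report does not propose any formula; the "expected utility per week" ratio below, the field names, and the example values are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    gene_id: str       # hypothetical identifier
    p_success: float   # estimated probability that the study succeeds (0..1)
    utility: float     # estimated usefulness of knowing the function (arbitrary units)
    cost_weeks: float  # estimated length of the study, used here as a proxy for cost


def priority_score(c: Candidate) -> float:
    """Expected utility per week of effort (an assumed formula, not from the report)."""
    if c.cost_weeks <= 0:
        raise ValueError("cost_weeks must be positive")
    return c.p_success * c.utility / c.cost_weeks


def rank(candidates):
    """Return candidates ordered from highest to lowest priority score."""
    return sorted(candidates, key=priority_score, reverse=True)


# Made-up example values, for illustration only.
candidates = [
    Candidate("geneA", p_success=0.8, utility=10.0, cost_weeks=4.0),
    Candidate("geneB", p_success=0.5, utility=30.0, cost_weeks=10.0),
    Candidate("geneC", p_success=0.9, utility=3.0, cost_weeks=1.0),
]
ranked_ids = [c.gene_id for c in rank(candidates)]
```

A real scheme would need to combine further considerations, such as protein family size or public health relevance, and the weights would be a matter for the community to decide.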

One obvious criterion could be the size and promiscuity of a family of proteins. For example, if experimentally determining the function of one member of a cluster of orthologous groups would provide an understanding about many of the other members, then that protein should be a priority for investigation.

Targets for experimental work could also be scored according to the reliability of the available functional evidence. In this scheme, those genes for which the data are missing or extremely vague would take highest priority, those genes with slightly more reliable data would come next, etc. Another priority is those gene products for which the information is likely to be reliable, but imprecise. These would include broad functional predictions, such as “a glycosyl transferase.” It would require experimental work to determine the specific substrates and pathways that are relevant to such a protein.

The gene products of well-studied model organisms like Escherichia coli could be targeted for experimentation in an effort to achieve a complete understanding of one organism that could then be applied in other contexts. Another approach might be to focus on genes that participate in common functions, such as sporulation or DNA repair, or are members of a common metabolic pathway. A very successful example of such a study that focused on tRNA modification is outlined in Box 2. Along these same lines, genes for which a great deal is known (aside from function) could also be appropriate targets for experimentation.

Biological significance or impact on society can also be guides to establishing priorities. It may be desirable to target those genes that are associated with human pathogenesis, like the genes involved in host interactions, antibiotic resistance, infectivity, and virulence, in an effort to make an impact on public health and medicine. It may also be useful to target gene products that are of economic significance, such as enzymes for industrially or environmentally important processes.

Whether a formal prioritization scheme should be used to guide funding decisions for the initiative remains to be decided. On one hand, a guide to priorities could serve to focus the efforts of a funding program on the most promising targets. On the other hand, it may be advisable to leave the decisions in the hands of experimentalists, who would have to find appropriate assays and who would be required to make a convincing case for their work to a review panel or a goal-oriented funding initiative.

Finding genes for enzymes with known functions

Prioritization is also needed in cases where a cellular function has been assigned to a given enzyme, but the genetic sequence of the protein is not known (see Box 1). It is advisable to place the highest priority on those enzymes which have already been isolated and purified. N-terminal sequencing of these proteins should enable easy identification for an organism with a completely sequenced genome. Outside of these instances, priority should be placed on identifying the genes associated with important gaps in metabolic pathways.

Another criterion for prioritizing enzymes could be the degree to which the function is distributed phylogenetically. Finally, emphasis could be placed on enzymes that act on important metabolites or upon drugs.

Finding Missing Genes by Comparative Genomics

To provide accurate and complete annotations of the available genomes, it is necessary to identify a gene or gene set for every cellular function. In one approach to this task, Valerie de Crecy-Lagard is tracking down the genes that are associated with tRNA modification using bioinformatics and common sense biology.

De Crecy-Lagard tracks down functions that haven’t been attributed to a gene through a combination of pathway reconstruction, subsystem analysis, and consultation with experts in different areas of biochemistry. Once a missing function has been identified, she then applies several bioinformatics tools (available through public databases, including COG, String, PhydBac, KEGG, MetaCyc, and SGD Model) to track down genes that may produce that function. The phylogenetic occurrence of a given function can be a powerful device for finding the appropriate gene; if the function does not occur in organisms X, Y, and Z, for example, then candidate genes for that function might be absent in those species as well. Several databases enable phylogenetic queries that facilitate finding missing genes in this way.

Gene clustering is another important indicator. If other genes in the pathway have been found to cluster at a certain point on the genome, the missing gene might be found within or next to the cluster. Co-expression patterns can provide leads, as can the detection of fusion proteins, shared regulatory sites, and protein-protein interactions. Mining databases for these types of biological clues enables the researcher to compile enough information to formulate hypotheses that can be tested in the laboratory.

Taken from a lecture presented by Valerie de Crecy-Lagard, University of Florida. Other lectures exemplifying how informatics can guide experimentation were presented by Eugene Kolker (Biatech, Bothell, Washington) and Andrei Osterman (The Burnham Institute, La Jolla, California).
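The phylogenetic-occurrence argument in the box can be sketched as a comparison of presence/absence profiles across genomes. This is a minimal sketch and is not the method or tooling used in the lecture. The organism names, gene family names, and the simple agreement score are all hypothetical.

```python
def profile_agreement(function_profile, gene_profile):
    """Fraction of shared genomes in which gene presence matches function presence."""
    shared = function_profile.keys() & gene_profile.keys()
    if not shared:
        return 0.0
    agree = sum(function_profile[g] == gene_profile[g] for g in shared)
    return agree / len(shared)


def rank_candidates(function_profile, gene_profiles):
    """Order candidate gene families by agreement with the function's profile."""
    scored = [
        (profile_agreement(function_profile, profile), name)
        for name, profile in gene_profiles.items()
    ]
    # Highest agreement first; ties broken alphabetically by family name.
    return [name for score, name in sorted(scored, key=lambda t: (-t[0], t[1]))]


# Hypothetical data: the function occurs in org1 and org2 but not in orgX or orgY.
function_profile = {"org1": True, "org2": True, "orgX": False, "orgY": False}
gene_profiles = {
    "famA": {"org1": True, "org2": True, "orgX": False, "orgY": False},
    "famB": {"org1": True, "org2": True, "orgX": True, "orgY": True},
    "famC": {"org1": False, "org2": True, "orgX": False, "orgY": False},
}
ranked = rank_candidates(function_profile, gene_profiles)
```

A candidate with a well-matched profile is only a hypothesis; as the box notes, clues such as gene clustering and co-expression would be combined with it before laboratory testing.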

It is important to note that genes identified by this phylogenetic approach will have homologs in the sequenced genomes and consequently gene-function relationships established by this method will complement the genome-based approach.

Incorrect functional assignments

The problem of incorrect functional information in publicly available databases is a serious one. Charles Darwin wrote, “False facts are highly injurious to the progress of science, for they often endure long; but false views, if supported by some evidence, do little harm, for every one takes a salutary pleasure in proving their falseness.” Incorrect annotations provided to the public are often seen as “false facts,” results that can mislead researchers and lead to wasted time and resources. However, it is important that researchers realize that annotations are interpretations, subject to error, and should therefore be seen as “false views,” open to challenge and/or verification. Incorrect functional assignments should be corrected and disagreements between predictions reconciled, but this is difficult given the size and archival nature of current databases. Creating a new, independent database, and initially populating it with only functionally characterized genes without transitive annotation, will be of immediate benefit to researchers and genome annotators, providing a solid foundation for subsequent annotation. This database will form the core, to which will be added the functional assignments that will be the outcome of this initiative.

To avoid and correct the dissemination of inaccurate data, the annotation database should be designed to allow dynamic access and should contain means for reassessing previous assignments and incorporating new data and evidence as they become available. On the experimental side, more value should be placed on the validation or the disproof of function. Disproving a functional assignment should be publishable in the relevant literature if the correct assignment has been made or in the central database if no definitive assignment is provided.

A publicly available database of functional assignments that contains the results of rigorous experimental work and includes bioinformatic predictions with levels of certainty indicated would be extremely useful. It is essential that experimental assignments be clearly distinguished from computational predictions. Furthermore, functions inferred from high-throughput experimentation should be distinguished from more reliable low-throughput biochemical studies. This information would solve many of the problems derived from the use of incorrect assignments and would be invaluable to the genome centers seeking to annotate new genomes.
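One way to picture the distinctions drawn above is a record type that carries an explicit evidence class and an optional level of certainty. The sketch below is illustrative only: the report deliberately leaves the database design open, so the class names, the numeric ranking, and the placeholder records are assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Evidence(IntEnum):
    """Evidence classes named in the text; the numeric ranks are an assumption."""
    COMPUTATIONAL_PREDICTION = 1
    HIGH_THROUGHPUT_EXPERIMENT = 2
    LOW_THROUGHPUT_EXPERIMENT = 3


@dataclass(frozen=True)
class Annotation:
    gene_id: str                       # hypothetical identifier
    function: str                      # free-text functional assignment
    evidence: Evidence
    certainty: Optional[float] = None  # level of certainty for predictions (0..1)

    @property
    def is_experimental(self) -> bool:
        return self.evidence is not Evidence.COMPUTATIONAL_PREDICTION


def best_supported(annotations):
    """Return the annotation with the strongest evidence class.

    Ties are broken by certainty; a missing certainty counts as 0.
    """
    return max(annotations, key=lambda a: (a.evidence, a.certainty or 0.0))


# Placeholder records for a single hypothetical gene.
records = [
    Annotation("gene0001", "predicted function P", Evidence.COMPUTATIONAL_PREDICTION, 0.9),
    Annotation("gene0001", "function Q (screen hit)", Evidence.HIGH_THROUGHPUT_EXPERIMENT),
    Annotation("gene0001", "function Q (purified enzyme assay)", Evidence.LOW_THROUGHPUT_EXPERIMENT),
]
best = best_supported(records)
```

Keeping the evidence class as a required field means a computational prediction cannot be stored without being labeled as one, which is the separation the text calls essential.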

Database requirements

Many different kinds of data would be acquired, integrated, and maintained by this annotation initiative. These include (a) a current dataset of experimental and predicted annotations found in the sequence databases, (b) a new set of predicted annotations, which would drive the experimental aspect of the initiative, and (c) the annotations that emerge from this experimental effort. These annotations would need to be incorporated into the definitive set. These data could all be handled within a single comprehensive database from which any given set of data might be retrieved. Alternatively, two or more integrated databases might be considered. The design specifics of data handling should be left open to those researchers who will construct it. However, some important features to address in designing data handling include:

• The scope of the database(s),

• How genes are defined in the context of the database,

• The specific information required for an entry (which may include information from informatics studies and/or experimental data) and the organization of that information, and

• How the information will be curated and maintained.

Although most of the particulars of the data handling remain to be determined, certain questions of content and standards were discussed at the colloquium. These issues include the criteria for a designation of a “definitive annotation,” ways to tackle verification problems when mistakes arise, the inclusion of functional RNA in the database, the hosting, administration and management of the database, and the predicted costs of the project.

Setting criteria for definitive annotations

Not all annotation evidence is equivalent, and the concept of a “correct” annotation is subjective to some extent. In assembling a set of reliable gene annotations, it will be necessary to set criteria that define the type and degree of evidence required for a “definitive” annotation and to decide what kinds of less definitive data should be included. One way to set the bar is to ask whether the annotation explains the known biological properties of the gene. Under some conditions, a general assessment of the biochemical activity of the gene may be sufficient for moving forward. While it would be ideal to meet the high standards of publications like the Enzyme Handbook, this is not always possible, even in cases where the gene product has been thoroughly characterized. Any incremental information about a gene’s product, even a negative result, is probably worth including in the data set.

High throughput methods available today are often suggestive of function, but on their own they do not provide validation. In cases where a gene product has been sufficiently characterized to merit inclusion in the database, clues about function from high throughput methods should be added.

It should be noted that among computational biologists seemingly congruent functional predictions can be made separately by more than one group. However, if the tools these groups used to arrive at the predictions are related to one another or if the methods depend on the same underlying calculations, then the predictions would be correlated and should not be represented as separate and independent.
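The point about correlated predictions can be made concrete with a small sketch that counts support per distinct method family rather than per submitting group. The notion of a "method family" label, the field names, and the example records are assumptions introduced for illustration; deciding which tools truly share underlying calculations would require expert judgment.

```python
from collections import defaultdict


def independent_support(predictions):
    """Count distinct method families behind each predicted function.

    Predictions from different groups that rely on the same method family
    are treated as correlated and counted once.
    """
    families = defaultdict(set)
    for p in predictions:
        families[p["function"]].add(p["method_family"])
    return {function: len(fams) for function, fams in families.items()}


# Hypothetical example: three groups agree, but two used related homology-based tools.
predictions = [
    {"group": "lab1", "function": "function P", "method_family": "sequence homology"},
    {"group": "lab2", "function": "function P", "method_family": "sequence homology"},
    {"group": "lab3", "function": "function P", "method_family": "gene context"},
]
support = independent_support(predictions)
```

Here three congruent predictions amount to two independent lines of evidence, which is how they would be presented to an experimentalist weighing the prediction.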

The verification of functional predictions

A central feature of the proposed initiative is to engage experimentalists in testing functional predictions made by bioinformaticians. This is not to suggest that computational prediction and experimental validation are independent activities. On the contrary, they form a closed loop initiative where computational predictions are validated or refined through experimentation and fed back into the computational pipeline to generate new cycles of hypothesis generation and validation. To facilitate this type of collaboration, the data set of computational predictions needs to be presented to the experimentalist in a user-friendly manner that allows the experimentalist to assess potentially contradictory predictions and use his or her biochemical judgment to select targets for study.

In verifying the functions of known genes or the genes of known functions, the preferred methods will be strongly dependent upon the problem at hand. When strong predictions are available, then cloning of the gene and direct biochemical assay in the hands of an expert could quickly lead to an answer. In the case of genes of unknown functions, a number of methods may come into play in an attempt to refine predictions. High throughput assays of the activity of expressed unknown proteins against large arrays of single potential substrates could prove helpful. However, these high throughput methods may not reveal sufficient detailed functional information to produce a definitive annotation on their own. Similarly, microarray assays to identify the gene products that are expressed in concert with the unknown gene or under stress conditions that alter expression may provide clues for more detailed biochemical exploration, but would not provide definitive evidence of function.

In the case of known enzymatic functions for which no corresponding gene has been described, protein purification and sequencing are likely to be the best, most straightforward approach to identifying the gene. Once a potential relationship has been established, then a simple biochemical assay could prove the assignment. If a candidate gene were present in strains such as E. coli or Bacillus subtilis, for which complete knockout libraries exist, then the suspected function could be assayed directly in both the appropriate knockout strain and its wild type counterpart. However, cloning the suspected gene and assaying its product directly would still be advisable.

It may not always be possible to verify conclusively the substrate of a particular enzyme, since many enzymes show specificities that are either broad or seemingly imprecise. Moreover, many unknown gene products may have complex enzymatic functions or they may lack an enzymatic function altogether, complicating efforts to uncover their functions by direct experimentation. Consequently, the proposed database should include functional clues as well as the results of experimental characterizations.

Many proteins do not exhibit an enzymatic activity, but rather play a role that depends on protein-protein interactions. The chaperonins, which help other proteins fold, adaptor proteins, and some of the integral membrane proteins that play a structural role, are good examples of proteins that are not easily tested by direct biochemical assays. In these cases, more indirect assays will be needed, and genetic tests may be helpful. Hence, the annotation initiative must be open to a wide variety of data from biochemical and genetic experiments that will help to define gene function.

Functional RNAs

Since the initiative should include all genes, it must include genes that encode functioning RNAs, such as tRNAs, inhibitory RNAs, specialized RNAs such as tmRNA, and perhaps others that remain to be discovered. Because some of these RNAs lack broad conservation across species, there are unique difficulties in prediction for RNA-encoding genes which would need to be addressed.

Management of the Database

Hosting and administration

The question of where the annotation database should be housed is of significant practical concern. The administrators of smaller, individual prediction databases may be reluctant to share predictions with a centralized database like that proposed in this document if they perceive that the central database is biased in some way. Thus, there may be merit in seeing the database hosted at the National Center for Biotechnology Information (NCBI), which has already expressed interest in contributing to this initiative.

Design and management

It is critical to involve not only bioinformaticians in the design and curation of the proposed database, but also the experimentalists who will use the database to set their research agenda. Proposals to design and realize the database should require the collaboration of both types of professionals. It would also be helpful if, as part of this initiative, bioinformatics tools were made more accessible to experimentalists to enable these professionals to make functional predictions and address those predictions in their own laboratories.

The quality of the user interface of the database will be important to both bioinformaticians and experimentalists; the database must be intuitive and easy to navigate. Moreover, the process by which predictions, evidence, and methods may be submitted must be simple for prospective contributors to use and must permit bulk submissions of thousands of predictions programmatically, as well as bulk removal and replacement of old predictions with improved predictions made by the same investigators.
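The bulk submission and replacement requirement can be illustrated with a small in-memory sketch. It is not a proposed design or an existing interface: the class, its method names, and the identifiers are hypothetical, and a real system would add authentication, persistence, and evidence fields.

```python
class PredictionStore:
    """Toy store keyed by submitter, so one investigator's set can be replaced in bulk."""

    def __init__(self):
        self._by_submitter = {}  # submitter -> {gene_id: predicted function}

    def bulk_submit(self, submitter, predictions):
        """Add or update many (gene_id, function) pairs; returns how many were given."""
        pairs = list(predictions)
        bucket = self._by_submitter.setdefault(submitter, {})
        for gene_id, function in pairs:
            bucket[gene_id] = function
        return len(pairs)

    def bulk_replace(self, submitter, predictions):
        """Drop all of this submitter's old predictions and store the new set.

        Returns the number of old predictions removed. Other submitters'
        predictions are left untouched.
        """
        removed = len(self._by_submitter.get(submitter, {}))
        self._by_submitter[submitter] = dict(predictions)
        return removed

    def predictions_for(self, gene_id):
        """Return {submitter: function} for every prediction on one gene."""
        return {
            submitter: bucket[gene_id]
            for submitter, bucket in self._by_submitter.items()
            if gene_id in bucket
        }


store = PredictionStore()
store.bulk_submit("lab1", [("gene0001", "function P"), ("gene0002", "function R")])
store.bulk_submit("lab2", [("gene0001", "function Q")])
removed = store.bulk_replace("lab1", [("gene0001", "function Q")])
current = store.predictions_for("gene0001")
```

Scoping replacement to the submitter reflects the text's condition that old predictions are replaced by improved ones from the same investigators.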

It will be critical for experimental researchers to deposit the results of their efforts into the database as quickly as possible. Promptness is a difficult objective to enforce, however, and there may be problems in convincing researchers to treat their data as community property. We recommend that the initiative adopt a strict requirement that all experimental results must be deposited into the database at the time of publication, and that there must be database accession numbers associated with each publication. These publications must be submitted by researchers as a criterion for continued funding.

The establishment of an external advisory board for the database project is recommended. Curation of the database should be a community effort in which advice is solicited from experts in specific subjects. Detailed descriptions of the organization and operation of the database are not necessary at this stage; proposals should be encouraged to include innovative solutions to meet the general goals of the initiative. It is important that the database be a public resource that provides facilities for both internet access and full downloads (to enable analysis and data mining) freely to the community.

Coordination with Model Organism Databases

The database project must have tight interactions with the model organism databases that have been established for many important experimental organisms. The EcoCyc project serves as a model of this type of revision; the project constantly combs the E. coli experimental literature for newly identified gene functions and incorporates those results into EcoCyc. (Annotation data should also be transferred from EcoCyc to the annotation database of the initiative. Likewise, experimentally determined gene functions should be transferred from the annotation database to EcoCyc.)

Peer review of data submitted to the database

Peer review of data and other supporting materials submitted to the database is critical to ensuring the integrity of this resource. However, lengthy peer review processes can stymie progress by increasing the amount of time between the discovery and the wide availability of the findings. It is recommended that submitted data be made immediately available through the database and that these data be accompanied by a clear warning statement that they have not yet been peer reviewed. The eventual outcome of the peer review process could then be posted in the database as it becomes available.

Communication and collaboration among annotation researchers

Communication between the two types of researchers involved in gene annotation, namely bioinformaticians who make function predictions and the experimentalists who take those hypotheses into the lab and test them, is not always effective. The annotation initiative described in this document could serve to bring these groups closer together in the interest of producing authoritative gene annotations in a timely, cost-efficient manner, particularly if joint funding opportunities exist. It could also serve to motivate the biochemists by giving them a sense of community with the genomics initiatives currently underway.

Linking predictions available in the database with the names and contact information of the bioinformaticians who make the predictions would bring the bioinformaticians and experimentalists closer together, would assist experimentalists in acquiring the information they need to carry out their work, and would help to provide scientific credibility to the bioinformaticians. Courses and tutorials on bioinformatics for experimental scientists could be offered under the auspices of the initiative to help these researchers better understand prediction methods and how to interpret and use them effectively. Finally, the submission of collaborative proposals should be encouraged in order to bring the different types of expertise of these professionals together on the same project.

For the sake of experimentalists, and to make the best use of function predictions, it is important to provide as much information as possible on the sources of the function predictions presented in the database. This information should include, but not be limited to, the sources of the predictions, the degree of confidence or quality values of the prediction, links to the raw data used to make predictions, the results of expression analyses, and any phylogenetic data that have been generated.

The expertise of bioinformaticians could also be made available to experimentalists through a “help desk” system sponsored by the database initiative.

Predicted Costs and Funding of the Project

It is important that funding be available for improvement of bioinformatics function-prediction methods, whether through this initiative, or through a separate initiative. Although significant progress has been made in the newly developed bioinformatics methods, this field is still an extremely active one, and we expect significant improvements will be made to these methods for a number of years. Moreover, the database of experimentally-verified annotations will provide a higher quality “gold-standard” set of functional annotations that will aid in further refinement of these methods.

The costs of experimental verification will vary quite widely, depending on the gene of interest and the functions being investigated. For instance, a student project in an established laboratory testing a strong prediction might take only a few weeks with a low incremental cost. Since this could easily be a major component of the initiative in its early stages, it will be crucial to develop funding mechanisms able to support such activities. However, development of higher throughput methods, for example to check ligand binding, could be more expensive. Projects like this could be funded through more conventional routes.

The typical funding approach, in which proposals are solicited from interested investigators, would be most appropriate for funding the work behind this project. Some modifications to this scheme may be appropriate, however, including encouraging the use of a wide range of funding, from large down to very small awards. Also, the normal process of reviewing applications could be modified by conducting the review process in a manner similar to the review of scholarly publications, instead of the usual review method, in which study sections are employed.

Another possibility would be to encourage scientists to submit proposals for supplements to existing grants that might permit funded researchers to assign students to work on specific genes within the laboratory’s area of expertise. This might help to elicit the involvement of the experimentalists who will provide the data at the heart of the project by providing specifically targeted funding.

Funding for some of the work described in this initiative might be provided from sources that are already available. For example, a request for proposals has been issued under the auspices of the PSI that provides an opportunity for bioinformaticians and experimentalists to work with a set of proteins that are already earmarked for funding because their structures are known or in the process of being determined.

Finally, it seems desirable to launch a small pilot project to test the efficacy of the proposed initiative. If early success can be achieved at modest cost, then there would be some impetus to raise the stakes and launch a major attack on the problem.

Summary of Recommendations

• Considering the importance of genome annotation to full exploitation of sequence information, progress in experimental annotation has been slow, largely due to a lack of available funding for experimental annotation approaches. It is recommended that an annotation initiative be undertaken to catalyze and coordinate funding for experimental approaches to genome annotation.

• Given the current lack of a reliable source of functional annotation data, a central data resource should be established. It should incorporate:

  ◦ a database of peer-reviewed, experimentally verified gene annotations,

  ◦ a catalog of the genes that have yet to be annotated, which users can sort by gene family, species, priority, etc.,

  ◦ a catalog of the functions for which a gene remains to be found, and

  ◦ all available experimental information relevant to function.

• Priority in designating funding for annotation through the initiative should be placed first on prokaryotic genomes. In selecting among prokaryotic genes, emphasis should be placed on those gene products that belong to large protein families, since knowledge of these genes is most likely to impart an understanding of the biology of many diverse systems. In this way, a small investment of experimental work and funding can lead to big rewards in understanding many species.

• Progress in functional annotation could be enhanced to some extent by developing mechanisms and information systems that encourage collaboration between bioinformaticians and experimentalists. This would allow experimental scientists to quickly access and test the predictions made using informatics tools, while providing bioinformaticians access to experts on the function of particular genes and enzyme systems.

• In efforts to identify the gene encoding a given product, funding priority should be given to those proteins or RNAs that have been purified and characterized, as simple sequencing would then lead to identification of the gene.

• The design of the database should remain open to the input of researchers who choose to submit proposals for constructing and maintaining it. However, it must support the maintenance and update of working hypotheses about gene function.

• To avoid conflicts with the interests of potential contributors, the database should be hosted by an unbiased organization without a vested interest in the content of the data.

• The collaborative contributions of experimentalists and bioinformaticians should be encouraged through the requests for applications announced by the database coordinators.

• A variety of funding types will be needed, including small awards to support students who might work in an experienced investigator’s laboratory for a short period of time.

• The resources produced by the initiative should be made public and be freely available to the global scientific community.

This report was designed by Pensaré Design Group, www.pensaredesign.com