Automated Prokaryotic Annotation at JCVI

Post on 10-May-2015

DESCRIPTION

Conference: Annual BRC Meeting (BRC6), Oct 28-29, 2008 in Ft. Lauderdale, Florida. Presenter: Dan Haft

Transcript

Automated Prokaryotic Annotation at the JCVI

Daniel Haft
2008

A Dual-Use Pipeline

• Multiple types of stored evidence
• Persistent & flexibly interleaved
• Supports selective re-annotation
• Features annotation-driving databases
  - CHAR
  - TIGRFAMs
  - Genome Properties
  - BrainGrab Rules
• Evidence used by machine and by experts
• MANATEE interface for annotators
• Capture new rules with BrainGrab

Computable objects: Output from one program becomes input to another.

HMM results drive Genome Properties

Genome Properties guide GO process assignments

GO process terms
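The chain described above (HMM results, then Genome Properties, then GO process terms) can be sketched roughly as follows. This is only an illustration: the model identifiers, the property definition and the GO mapping are hypothetical placeholders, not content from the JCVI databases.

```python
# Hypothetical data: which HMM models make up a property, and which GO
# process term a satisfied property suggests. Not real JCVI content.
PROPERTY_COMPONENTS = {
    "example_pathway": {"TIGR_A", "TIGR_B", "TIGR_C"},
}
PROPERTY_TO_GO_PROCESS = {
    "example_pathway": "GO:example_process",
}


def properties_from_hmm_hits(hmm_hits):
    """hmm_hits maps protein id -> set of HMM model ids that hit it.

    A property counts as satisfied here only if every component model
    was found somewhere in the genome (an assumption of this sketch).
    """
    found = set().union(*hmm_hits.values())
    return {name for name, needed in PROPERTY_COMPONENTS.items()
            if needed <= found}


def go_process_assignments(hmm_hits):
    """Give each protein that carries a component of a satisfied
    property the GO process term linked to that property."""
    satisfied = properties_from_hmm_hits(hmm_hits)
    assignments = {}
    for protein, models in hmm_hits.items():
        for prop in satisfied:
            if models & PROPERTY_COMPONENTS[prop]:
                assignments.setdefault(protein, set()).add(
                    PROPERTY_TO_GO_PROCESS[prop])
    return assignments
```

The point of the sketch is only that each stage consumes the previous stage's output as plain data, which is what makes the objects "computable".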

Identification of Genome Features

Genome Sequence

ORFs:
• Glimmer builds a statistical model from the training set (IMM built)

Other Genome Features:
• rRNA, tRNA, Rfam
• IS elements · Phage regions · Repeats

Gene Finding
Glimmer & friends, homology methods

Homology Searches (gathering evidence)
BLAST-Extend-Repraze, Hidden Markov Models, misc.

Structural Curation (ORF Management)
Auto_Gene_Curate (start sites, overlaps), InterEvidence

Functional Assignments
Auto_Annotate, Manual, Mapped

Data Availability

Homology Searches

• HMM searches: TIGRFAMs & Pfam
• BLAST searches: against internal NIAA
• PROSITE motifs
• InterPro
• TMHMM
• SignalP
• Lipoprotein
• PSORT
• Generate Paralogous Families
• Custom database searches (TransportDB, Rules)

Gene Model Curation

• Overlaps resolved by evidence competition

• Start site curation

• Missed genes / unsupported gene calls

Evidence can Overhang the Gene: Blast-Extend-Repraze (BER)

The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein coding sequence. Blue line indicates predicted protein coding sequence, green line indicates up- and downstream extensions. Red line is the match protein.

[Figure: ORFxxxxx with 300 bp extensions beyond end5 and end3; the search protein is aligned against match proteins in these cases:]

• normal full length match

• similarity extends in the same frame through a stop codon

• similarity extends upstream through a start, or downstream past a frameshift
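A rough sketch of how the overhang cases might be told apart once an alignment is in hand. The function name, its inputs and the returned labels are assumptions made for illustration; the actual BER implementation is not described on the slides.

```python
EXTENSION_BP = 300  # flank length shown on the slide


def classify_overhang(orf_start, orf_end, aln_start, aln_end, frame_changes):
    """Classify where similarity to a match protein falls.

    All coordinates are forward-strand nucleotide positions on the
    extended search sequence, with orf_start < orf_end.
    frame_changes is True when the alignment switches reading frame
    (a hypothetical input; a real aligner would report this).
    """
    overhang_up = aln_start < orf_start
    overhang_down = aln_end > orf_end
    if not (overhang_up or overhang_down):
        return "normal match"
    if frame_changes:
        return "possible frameshift (FS)"
    if overhang_down:
        # Same frame, yet similarity continues past the stop codon.
        return "possible in-frame stop codon (PM)"
    return "similarity extends upstream of the start"
```

For example, an ORF at 301..1200 on the extended sequence with similarity running to position 1350 in the same frame would be reported as a possible in-frame stop codon.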

Pfam vs. TIGRFAMs

TIGRFAMs: RULES
• Functional assignments to proteins
• Granularity tuned for single-hit equivalogs (mono-functional!)
• Generates computable objects --> pathway reconstructions

Pfam: Explanations
• Names for homology domains in proteins
• Granularity tuned for twilight-level sequence similarity detection
• Explains things to annotator

TIGRFAMs equivalogs vs. Pfam domains

[Figure: proteins labeled X, X, X, Y and Z, bracketed by a TIGRxxxxx model and a PFxxxxx model.]

TIGRFAMs as annotation rules

• EC number: computable!
• GO term: computable!
• protein name: computable?
• HMM hit: computable!!

Isology (homology) types: ranking our rules

• EXCEPTION: additional info, e.g. “vegetative”

• EQUIVALOG: the SAME (in enough ways) to receive the same name across multiple genomes, reflecting one specific function.

• SUBFAMILY: can name a whole class

• DOMAIN: class name for a protein region (and apply these classifications also to Pfam)
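One way to picture the ranking is as an ordered enumeration that decides which hit names a protein. The sketch below assumes that the order in which the types are listed above is the priority order, and uses made-up model identifiers and scores.

```python
from enum import IntEnum


class Isology(IntEnum):
    # Lower value = higher priority. Taking the slide's listing order
    # as the ranking is an assumption of this sketch.
    EXCEPTION = 0
    EQUIVALOG = 1
    SUBFAMILY = 2
    DOMAIN = 3


def best_hit(hits):
    """hits: list of (model_id, Isology, score) tuples.

    Return the hit whose rule type ranks highest; ties between hits of
    the same type are broken by the higher score.
    """
    return min(hits, key=lambda h: (h[1], -h[2]))
```

With this ordering, a modest equivalog hit outranks a high-scoring domain hit, which matches the idea that the more specific rule should supply the name.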

CHAR: Experimentally Characterized Protein Database

• Highly curated database of experimentally characterized proteins; connects protein accessions, known function, and the scientific literature.

• What does it include:
– Controlled vocabulary describes the type of experimentation performed in each publication
– Key annotation fields (protein name, gene symbol, Enzyme Commission (EC) number, taxonomic data, Gene Ontology (GO) terms) are extracted
– Synonymous protein accessions obtained from public databases (GenBank and UniProt) are stored
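The fields listed above suggest a record shaped roughly like the one below. The class name, field names and lookup helper are hypothetical and do not reflect the real CHAR schema; the example entry uses only the accession, name and EC number that appear elsewhere in this talk, plus a placeholder synonym.

```python
from dataclasses import dataclass, field


@dataclass
class CharEntry:
    accession: str
    protein_name: str
    gene_symbol: str = ""
    ec_number: str = ""
    taxon: str = ""
    go_terms: list = field(default_factory=list)
    # Synonymous accessions from public databases.
    synonyms: list = field(default_factory=list)
    # (reference id, experiment type) pairs; the experiment type would
    # come from the controlled vocabulary.
    experiments: list = field(default_factory=list)


def find_entry(entries, accession):
    """Return the entry whose primary or synonymous accession matches."""
    for entry in entries:
        if accession == entry.accession or accession in entry.synonyms:
            return entry
    return None


CHEA = CharEntry(
    accession="SP|P07363",
    protein_name="Chemotaxis protein cheA",
    ec_number="2.7.13.3",
    synonyms=["EXAMPLE_SYNONYM"],  # placeholder, not a real accession
)
```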

Annotation Proceeds from …

Inside --> out (e.g. AutoAnnotate): for every protein
• Collect evidence
• Best-guess annotation

Outside --> in (e.g. TIGRFAMs): for every model
• Search tool + cutoff + standards = annotation rule
• Achieves partial coverage

Hybrid (BrainGrab): for every unfinished protein
• Look for means to annotate: blastp, synteny, hole-filling, etc.
• Capture annotator logic as a new rule
• Add to library of rules/models for all future genomes

[Figure labels: Subject Genome; Trusted, Complete, Automatic; Proper Realm of Annotator Attention; RULES; BrainGrab; NEW; genome, genome; share; validate; IMPORT]

EcHS_A1984 is manually annotated confidently because it is similar enough to:

SP|P07363|CHEA_ECOLI Chemotaxis protein cheA EC 2.7.13.3

(method: defines “similar enough”)

BLASTP_MATCH [SP|P07363, 1600, 95, 92, 60, 1]

Must be the only protein in the genome that scores >= 1600 by blastp, covering >= 95 % of the length of the characterized protein and >= 92 % of the target protein, with >= 60 % sequence identity.
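The BLASTP_MATCH rule above can be read as a small data structure plus a predicate. The sketch below is one possible reading, not the BrainGrab implementation: the class and field names are invented, the final "1" is assumed to mean "at most one qualifying protein per genome", and the hit values in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BlastpMatchRule:
    reference: str              # characterized protein, e.g. "SP|P07363"
    min_score: float            # blastp score threshold
    min_ref_coverage: float     # % of the characterized protein covered
    min_target_coverage: float  # % of the target protein covered
    min_identity: float         # % sequence identity
    max_qualifying: int         # assumed meaning of the final "1"


@dataclass
class BlastpHit:
    target_id: str
    score: float
    ref_coverage: float
    target_coverage: float
    identity: float


def apply_rule(rule, hits):
    """Return the target ids the rule annotates, or [] if it does not fire."""
    qualifying = [
        h.target_id for h in hits
        if h.score >= rule.min_score
        and h.ref_coverage >= rule.min_ref_coverage
        and h.target_coverage >= rule.min_target_coverage
        and h.identity >= rule.min_identity
    ]
    # Too many qualifying proteins means the match is not unique.
    if 0 < len(qualifying) <= rule.max_qualifying:
        return qualifying
    return []


RULE = BlastpMatchRule("SP|P07363", 1600, 95, 92, 60, 1)
```

Under these assumptions, `apply_rule(RULE, [BlastpHit("EcHS_A1984", 1700, 99, 98, 80)])` fires, while adding a second protein that also clears every threshold makes the rule decline to annotate either one.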

A Teachable Moment

a sample of expert opinion: “For This Particular Protein Family”

I (D.H.H.) assert that any > 75 %-identical, full-length match is the same protein.

Ditto any > 65 % match, as long as the region is clearly syntenic.

Ditto any single-copy > 50 % match, as long as it fills this hole in this otherwise mostly complete pathway.
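The three assertions above form a tiered rule in which weaker sequence identity is acceptable when other evidence backs it up. A minimal sketch, assuming that "ditto" carries the full-length requirement into every tier (the slide does not say so explicitly) and that synteny, copy number and hole-filling arrive as precomputed booleans:

```python
def same_protein(identity, full_length, syntenic=False,
                 single_copy=False, fills_pathway_hole=False):
    """Decide whether a match counts as 'the same protein'.

    identity is percent sequence identity. Requiring full_length in
    every tier is an assumption of this sketch.
    """
    if not full_length:
        return False
    if identity > 75:
        return True
    if identity > 65 and syntenic:
        return True
    if identity > 50 and single_copy and fills_pathway_hole:
        return True
    return False
```

Capturing the expert's reasoning in this form is what lets the same judgment be replayed automatically on later genomes.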

B “Bag of Genes”

G Genome Properties

E Evidence to drive other programs

Image from Gödel, Escher, Bach: an Eternal Golden Braid by Douglas Hofstadter, 1979

Genome Properties: annotation at the level of systems

• pathway (glyoxylate shunt)
• system (type III secretion)
• structure (outer membrane)
• genometrics (GC content)
• phenotype (motility, pathogenesis)

States: YES / some evidence / not supported / NO
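The four states could be derived from component evidence along the following lines. The decision logic is an assumption made for illustration (the slides give only the state names), and the `ruled_out` flag stands in for whatever explicit negative evidence would justify a NO.

```python
def property_state(required, found, ruled_out=False):
    """Map component evidence to one of the four states on the slide.

    required: ids of the components the property needs.
    found: ids of components with evidence in the genome.
    ruled_out: hypothetical flag for explicit negative evidence.
    """
    required = set(required)
    present = required & set(found)
    if ruled_out:
        return "NO"
    if required and present == required:
        return "YES"
    if present:
        return "some evidence"
    return "not supported"
```

For instance, a property with three required components of which two were found would be reported as "some evidence" under this scheme.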

Some Novel Genome Properties

12 subtypes of CRISPR/Cas system

PEP-CTERM / exo-sortase: Biofilm-associated protein sorting

Type VI secretion (53 loci in B. mallei 23344)

Post-translational selenium-modified enzymes

Heterocycle-containing bacterial toxin production: BA_2677 = “heterocyclo-anthracin”

A family of variable putative toxins with patterns of CGG insertions.

Future Annotation Pipeline Enhancements

• Populate the Characterized Protein Database

• Develop META-RULES from CHAR

• BrainGrab for novel content

• Import additional computable evidence

• Improve exchanges of validation sets

• Build a protein names ontology

Acknowledgements

Ramana Madupu

Jeremy Selengut

Alex Richter

JCVI microbial annotation team
