1 MICROBIAL GENOME ANNOTATION Loren Hauser Miriam Land Yun-Juan Chang Frank Larimer Doug Hyatt Cynthia Jeffries

Apr 01, 2015


Shyanne Cantor

MICROBIAL GENOME ANNOTATION

Loren Hauser

Miriam Land

Yun-Juan Chang

Frank Larimer

Doug Hyatt

Cynthia Jeffries


NEB Educational Support

http://www.neb.com/nebecomm/course_support.asp?


Why study Computational Biology and Bioinformatics?

· DNA sequencing output is growing faster than Moore’s law!
· 1 Illumina sequencing machine = 0.5 Tbp/week
· There are hundreds of these and thousands of other sequencing machines around the world.
· New sequencing technology will conceivably allow sequencing a human genome for less than $1K in less than 1 day!


Why study Medical Bioinformatics?

· In the near future, most cancer diagnostics will involve DNA or RNA sequencing!

· In the near future, every baby born in the developed world will have their genome sequenced. Protecting privacy and your doctor's ability to use that information are the only real impediments!

· Hospitals are using DNA sequencing to track antibiotic resistant bacterial infections.


DOE Undergraduate Research in Microbial Genome Analysis and

Functional Genomics

http://www.jgi.doe.gov/education


Why Study Microbial Genomes?

· Large biological mass (50% of total)
· photosynthetic (Prochlorococcus)
· fix N2 gas to NH3 (Rhodopseudomonas)
· NH3 to NO2 (Nitrosomonas)
· bioremediation (Shewanella, Burkholderia)
· pathogens, BW (Yersinia pestis - plague)
· food production (Lactobacillus)
· CH4 production (Methanosarcina)
· H2 production (Rhodopseudomonas)


Example of Current Microbial Genome Projects

· UC Davis – FDA funded 100K bacterial genomes project associated with food.

· 100K genomes over 5 years = 20K per year; 20K per year / 200 days/year = 100 genomes/day!
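The throughput estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name is illustrative, and the 200 sequencing days per year is the slide's own figure.

```python
def genomes_per_day(total_genomes: int, years: int, days_per_year: int) -> float:
    """Average number of genomes that must be completed per sequencing day."""
    genomes_per_year = total_genomes / years
    return genomes_per_year / days_per_year


# Figures taken from the slide: 100K genomes, 5 years, 200 days/year.
print(genomes_per_day(100_000, 5, 200))
```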


Web Resources and Contact Information

· http://genome.ornl.gov/microbial/
· http://www.jgi.doe.gov/
· http://genome.jgi-psf.org/
· http://www.jcvi.org/
· http://www.ncbi.nlm.nih.gov/
· http://www.sanger.ac.uk/
· http://www.ebi.ac.uk/
· ftp://ftp.lsd.ornl.gov/pub/JGI
- artemis ready files for each scaffold = (feature table plus fasta sequence file)

· Contact:- [email protected]; [email protected]


Evolution of Sequencing Throughput

Sequencing Technology | Samples/run | bp/sample | runs/week | bp/week | year
Maxam and Gilbert | 1 | 100 | 5 | 500 | 1977
Manual Sanger | 5 | 400 | 5 | 10000 | 1985
Automated Sanger (96 lanes/gel) | 100 | 500 | 5 | 250000 | 1995
Automated Sanger (384 capillaries) | 400 | 600 | 10 | 2400000 | 2002
454 sequencing (new titanium) | 1,000,000 | 400 | 5 | 2E+09 | 2009
Solexa (Illumina) | 300,000,000 | 75 | 1 | 2.25E+10 | 2009
Solexa (Illumina) | 1,000,000,000 | 200 | 1 | 2.00E+12 | ?2010
PacBio realtime sequencing | 100,000,000 | 1000 | 10 | 1E+12 | ?2010
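As a rough illustration of how the bp/week column relates to the others, the sketch below assumes it is the product of samples/run, bp/sample and runs/week. That relationship is inferred from the table rather than stated on the slide, and the function name is illustrative.

```python
def bp_per_week(samples_per_run: int, bp_per_sample: int, runs_per_week: int) -> int:
    """Weekly throughput in base pairs, assuming a simple product of the three inputs."""
    return samples_per_run * bp_per_sample * runs_per_week


# A few rows copied from the table: (technology, samples/run, bp/sample, runs/week).
rows = [
    ("Maxam and Gilbert", 1, 100, 5),
    ("Manual Sanger", 5, 400, 5),
    ("Automated Sanger (96 lanes/gel)", 100, 500, 5),
    ("454 sequencing (new titanium)", 1_000_000, 400, 5),
]

for name, samples, read_length, runs in rows:
    print(f"{name}: {bp_per_week(samples, read_length, runs):.3g} bp/week")
```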


Sequenced Microbial Genomes

· ARCHAEAL GENOMES - 159 FINISHED; 218 IN PROGRESS
· BACTERIAL GENOMES - 3363 FINISHED; 11831 IN PROGRESS
· ENVIRONMENTAL COMMUNITIES - > 50,000 samples (see MG-RAST)
· as of Sept 6, 2012
· http://www.expasy.ch/alinks.html
· http://www.genomesonline.org
· http://metagenomics.anl.gov/


Published Genomes

· Nitrosomonas europaea - J. Bac. 185(9):2759-2773 (2003)
· Prochlorococcus MED4 & MIT9313 - Nature 424:1042-1047 (2003)
· Synechococcus WH8102 - Nature 424:1037-1042 (2003)
· Rhodopseudomonas palustris - Nat. Biotech. 22(1):55-61 (2004)
· Yersinia pseudotuberculosis - PNAS 101(22):13826-31 (2004)
· Nitrobacter winogradskyi - Appl. Envir. Micro. 72(3):2050-63 (2006)
· Nitrosococcus oceani - Appl. Envir. Micro. 72(9):6299-315 (2006)
· Burkholderia xenovorans - PNAS 103(42):15280-7 (2006)
· Thiomicrospira crunogena - PLoS Biology 4(12):e383 (2006)
· Nitrosomonas eutropha C91 - Env. Micro. 9(12):2993-3007 (2007)
· Sulfuromonas denitrificans - Appl. Envir. Micro. 74(4):1145-56 (2008)
· Nitrosospira multiformis - Appl. Envir. Micro. 74(11):3559-72 (2008)
· Nitrobacter hamburgensis - Appl. Envir. Micro. 74(9):2852-63 (2008)
· Saccharophagus degradans - PLoS Genetics 4(5):e1000087 (2008)
· R. palustris, 5 strain comparison - PNAS 105(47):18543-8 (2008)
· L. rubarum and L. ferrodiazotrophum - Appl. Envir. Micro. (in press)


Basic Annotation Impacts

· Design of oligonucleotide arrays
· Design & prioritize protein expression constructs
· Design & prioritize gene knockouts
· Assessment of overall metabolic capacity
· Database for proteomics
· Allows visualization of whole genome


Additional Analysis Impacts

· Revised functional assignments based on domain fusions, functional clustering, phylogenetic profile
· Regulatory motif discovery
· Operon and regulon discovery
· Regulatory and protein association network discovery


Microbial Genome Annotation Pipeline (flowchart)

Components shown: Scaffolds or contigs; Prodigal; Model correction; Final Gene List; Blast; InterPro; COGs; PRIAM; TMHMM; SignalP; Function call; tRNAs; rRNA, Misc_RNAs; Simple repeats; Complex Repeats; GC Content, GC skew; Feature table; Web Pages


Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)

· Unsupervised:  Automatically learns the statistical properties of the genome.

· Indifferent to GC Content:  Prodigal performs well irrespective of the GC content of the organism.

· Draft:  Prodigal can train on multiple sequences then analyze individual draft sequences.

· Open Source:  Prodigal is freely available under the GPL.

· Reference:  Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11(1):119. (Highly Accessed)


G+C Frame Plot Training

· Takes all ORFs above a specified length in the genome.

· Examines the G+C bias in each frame position of these ORFs.

· Does a dynamic programming algorithm using G+C frame bias as its coding scoring function to predict genes.

· Takes those predicted genes and gathers dicodon usage statistics.
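The second training step, measuring G+C bias per frame position, can be sketched as follows. This is a minimal illustration rather than Prodigal's actual code: the function name, the `min_length` parameter and its default are assumptions made for the example.

```python
def gc_frame_bias(orfs, min_length=90):
    """Fraction of G/C bases at codon positions 1, 2 and 3 over all long ORFs.

    orfs: iterable of in-frame nucleotide strings (each starting at codon position 1).
    min_length: hypothetical cutoff; ORFs shorter than this are ignored.
    """
    gc_counts = [0, 0, 0]
    totals = [0, 0, 0]
    for orf in orfs:
        if len(orf) < min_length:
            continue
        seq = orf.upper()
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(usable):
            position = i % 3
            totals[position] += 1
            if seq[i] in "GC":
                gc_counts[position] += 1
    return [g / t if t else 0.0 for g, t in zip(gc_counts, totals)]


# Toy input: two 6-base "ORFs", with the cutoff lowered so they are counted.
print(gc_frame_bias(["ATGGCC", "ATGGCG"], min_length=6))
```

A frame position whose G+C fraction stands out from the other two is the kind of signal the slide describes using as a coding score.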


Gene Prediction

· Dicodon usage coding score
· Length factor added to coding score (GC-content-dependent)
· Coding/noncoding thresholds sharpened (starts downstream of starts with higher coding scores get penalized by the difference).
· Dynamic programming to put genes together.
· Bonuses for operon distances, larger bonus for -1/-4 overlaps.
· Same strand overlap allowed (up to 60 bases).
· Opposite strand -->3'r 5'f<- allowed (up to 250 bases)
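The idea of using dynamic programming to put genes together can be sketched as a highest-scoring chain of candidate genes. This is a simplified illustration, not Prodigal's implementation: the `Candidate` type and function names are hypothetical, operon-distance bonuses are left out, the opposite-strand rule is reduced to "strands differ", and only consecutive genes in a chain are checked for overlap.

```python
from dataclasses import dataclass

SAME_STRAND_MAX_OVERLAP = 60       # bases, from the slide
OPPOSITE_STRAND_MAX_OVERLAP = 250  # bases, from the slide (orientation rule simplified)


@dataclass(frozen=True)
class Candidate:
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    strand: str   # "+" or "-"
    score: float  # coding score of this candidate


def compatible(prev: Candidate, nxt: Candidate) -> bool:
    """True if nxt may directly follow prev in a chain."""
    if nxt.start <= prev.start or nxt.end <= prev.end:
        return False
    overlap = prev.end - nxt.start + 1  # zero or negative means no overlap
    if prev.strand == nxt.strand:
        return overlap <= SAME_STRAND_MAX_OVERLAP
    return overlap <= OPPOSITE_STRAND_MAX_OVERLAP


def best_chain(candidates):
    """Return the highest-scoring chain of pairwise-compatible consecutive genes."""
    genes = sorted(candidates, key=lambda g: (g.end, g.start))
    if not genes:
        return []
    best = [g.score for g in genes]  # best[i]: top score of a chain ending at gene i
    back = [-1] * len(genes)         # back[i]: previous gene in that chain
    for i, gene in enumerate(genes):
        for j in range(i):
            if compatible(genes[j], gene) and best[j] + gene.score > best[i]:
                best[i] = best[j] + gene.score
                back[i] = j
    i = max(range(len(genes)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(genes[i])
        i = back[i]
    return chain[::-1]


example = [
    Candidate(1, 300, "+", 10.0),
    Candidate(200, 500, "+", 9.0),   # overlaps the first gene by 101 bases
    Candidate(250, 600, "+", 8.0),   # overlaps the first gene by 51 bases
    Candidate(700, 900, "-", 5.0),
]
print([(g.start, g.end) for g in best_chain(example)])
```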


Start Site Scoring: Shine-Dalgarno Motif

· Examines initially predicted genes and gathers statistics on the starts (RBS motifs, ATG vs GTG vs TTG frequency)
· Moves starts based on these discoveries.
· Gathers statistics on the new set of starts and repeats this process until convergence (5-10 iterations).
· RBS motifs based on AGGAGG sequence, 3-6 base motifs, with one mismatch allowed in 5 base or longer motifs (e.g. GGTGG, or AGCAG).
· Does a final dynamic programming with the start scoring function.
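The RBS motif rule above (3-6 base pieces of AGGAGG, one mismatch allowed in motifs of 5 bases or longer) can be illustrated with a small matcher. This is a sketch under stated assumptions: the function names are hypothetical, and ranking hits by length and then by mismatch count is a simplification, not Prodigal's trained scoring.

```python
CONSENSUS = "AGGAGG"


def sd_submotifs(min_len=3, max_len=6):
    """All distinct substrings of AGGAGG with lengths 3-6, in a fixed order."""
    motifs = [
        CONSENSUS[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(CONSENSUS) - n + 1)
    ]
    return list(dict.fromkeys(motifs))  # drop duplicates, keep order


def best_rbs_match(upstream):
    """Return (motif_length, mismatches, position, matched_text) or None.

    Longer motifs are preferred, then fewer mismatches (an assumed ranking).
    """
    upstream = upstream.upper()
    best = None
    for motif in sd_submotifs():
        allowed = 1 if len(motif) >= 5 else 0
        for pos in range(len(upstream) - len(motif) + 1):
            window = upstream[pos:pos + len(motif)]
            mismatches = sum(a != b for a, b in zip(motif, window))
            if mismatches > allowed:
                continue
            rank = (len(motif), -mismatches)
            if best is None or rank > best[0]:
                best = (rank, pos, window)
    if best is None:
        return None
    (length, neg_mismatches), pos, window = best
    return (length, -neg_mismatches, pos, window)


print(best_rbs_match("TTTGGTGGTTTT"))
print(best_rbs_match("CCAGGAGGCC"))
```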


Start Site Scoring: Other Motifs

· If Shine-Dalgarno scoring is strong, use it – this accounts for ~85% of genomes.

· If Shine-Dalgarno scoring is weak, look for other motifs

· If a strong scoring motif is found, use it (example GGTG in A. pernix)

· If no strong scoring motif is found, use the highest score of all found motifs (example: Crenarchaea, where transcription (Tc) and translation (Tl) start sites are the same, but internal operon genes use weak Shine-Dalgarno motifs)


Annotated Gene Prediction


Prodigal Scoring


Gene Prediction Problems – Pseudogenes


Pseudogenes – Internal deletion


Pseudogenes – Premature stop codon


Pseudogenes – N-terminal deletion


Pseudogenes – Transposon insertion


Pseudogenes – Multiple frameshifts


Pseudogenes – Premature Stop and Frameshift


Pseudogenes – Dead Start Codon


GENE PAGE


ORGANISM’S (PSYC) COGS LIST

Contig | Gene Num | Prot | Group | COG | COG Description | Gene Name | Score | E-Value | Category
Contig1 | 1 | -- | I | COG1211 | 4-diphosphocytidyl-2-methyl-D-erithritol synthase | IspD | 86 | 9.00E-19 | Lipid metabolism
Contig1 | 3 | -- | E | COG0137 | Argininosuccinate synthase | ArgG | 565 | 1.00E-162 | Amino acid transport and metabolism
Contig1 | 4 | -- | S | COG1376 | Uncharacterized protein conserved in bacteria | ErfK | 61 | 4.00E-11 | Function unknown
Contig1 | 6 | -- | K | COG0583 | Transcriptional regulator | LysR | 123 | 2.00E-29 | Transcription
Contig1 | 8 | -- | R | COG0628 | Predicted permease | PerM | 145 | 6.00E-36 | General function prediction only
Contig1 | 10 | -- | L | COG0593 | ATPase involved in DNA replication initiation | DnaA | 104 | 8.00E-24 | DNA replication recombination and repair
Contig1 | 11 | -- | No COG
Contig1 | 14 | -- | J | COG0172 | Seryl-tRNA synthetase | SerS | 557 | 1.00E-160 | Translation ribosomal structure and biogenesis
Contig1 | 15 | -- | G | COG0021 | Transketolase | TktA | 992 | 0 | Carbohydrate transport and metabolism
Contig1 | 16 | -- | No COG
Contig1 | 17 | -- | L | COG0551 | Zn-finger domain associated with topoisomerase type I | TopA | 38 | 9.00E-04 | DNA replication recombination and repair
Contig1 | 19 | -- | L | COG0507 | ATP-dependent exoDNAse (exonuclease V) alpha subunit - helicase superfamily I member | RecD | 61 | 3.00E-10 | DNA replication recombination and repair
Contig1 | 20 | -- | G | COG0057 | Glyceraldehyde-3-phosphate dehydrogenase/erythrose-4-phosphate dehydrogenase | GapA | 344 | 7.00E-96 | Carbohydrate transport and metabolism
Contig1 | 23 | -- | R | COG1451 | Predicted metal-dependent hydrolase | COG1451 | 104 | 3.00E-24 | General function prediction only
Contig1 | 24 | -- | P | COG0168 | Trk-type K+ transport systems membrane components | TrkG | 181 | 1.00E-46 | Inorganic ion transport and metabolism
Contig1 | 25 | -- | P | COG0569 | K+ transport systems NAD-binding component | TrkA | 125 | 4.00E-30 | Inorganic ion transport and metabolism
Contig1 | 26 | -- | No COG
Contig1 | 27 | -- | M | COG2885 | Outer membrane protein and related peptidoglycan-associated (lipo)proteins | OmpA | 113 | 1.00E-26 | Cell envelope biogenesis outer membrane
Contig1 | 28 | -- | No COG
Contig1 | 29 | -- | M | COG1538 | Outer membrane protein | TolC | 114 | 2.00E-26 | Cell envelope biogenesis outer membrane / Intracellular trafficking and secretion
Contig1 | 29 | -- | R | COG1538 | Outer membrane protein | TolC | 114 | 2.00E-26 | Cell envelope biogenesis outer membrane / Intracellular trafficking and secretion
Contig1 | 30 | -- | R | COG2274 | ABC-type bacteriocin/lantibiotic exporters contain an N-terminal double-glycine peptidase domain | SunT | 410 | 1.00E-115 | Defense mechanisms
Contig1 | 31 | -- | R | COG1566 | Multidrug resistance efflux pump | EmrA | 82 | 7.00E-17 | Defense mechanisms
Contig1 | 33 | -- | No COG
Contig1 | 34 | -- | No COG
Contig1 | 35 | -- | C | COG2010 | Cytochrome c mono- and diheme variants | CccA | 38 | 3.00E-04 | Energy production and conversion
Contig1 | 36 | -- | R | COG3019 | Predicted metal-binding protein | COG3019 | 136 | 1.00E-33 | General function prediction only
Contig1 | 37 | -- | Q | COG2132 | Putative multicopper oxidases | SufI | 271 | 9.00E-74 | Secondary metabolites biosynthesis transport and catabolism
Contig1 | 38 | -- | P | COG3667 | Uncharacterized protein involved in copper resistance | PcoB | 148 | 6.00E-37 | Inorganic ion transport and metabolism
Contig1 | 39 | -- | S | COG3544 | Uncharacterized protein conserved in bacteria | COG3544 | 43 | 8.00E-06 | Function unknown
Contig1 | 40 | -- | R | COG0491 | Zn-dependent hydrolases including glyoxylases | GloB | 100 | 2.00E-22 | General function prediction only
Contig1 | 41 | -- | P | COG2217 | Cation transport ATPase | ZntA | 754 | 0 | Inorganic ion transport and metabolism
Contig1 | 42 | -- | R | COG1826 | Sec-independent protein secretion pathway components | TatA | 59 | 4.00E-11 | Intracellular trafficking and secretion
Contig1 | 43 | -- | No COG
Contig1 | 45 | -- | S | COG1937 | Uncharacterized protein conserved in bacteria | COG1937 | 49 | 5.00E-08 | Function unknown
Contig1 | 46 | -- | G | COG2814 | Arabinose efflux permease | AraJ | 57 | 2.00E-09 | Carbohydrate transport and metabolism
Contig1 | 47 | -- | O | COG0435 | Predicted glutathione S-transferase | ECM4 | 507 | 1.00E-145 | Posttranslational modification protein turnover chaperones
Contig1 | 51 | -- | J | COG0261 | Ribosomal protein L21 | RplU | 126 | 3.00E-31 | Translation ribosomal structure and biogenesis
Contig1 | 52 | -- | J | COG0211 | Ribosomal protein L27 | RpmA | 119 | 4.00E-29 | Translation ribosomal structure and biogenesis
Contig1 | 53 | -- | M | COG2834 | Outer membrane lipoprotein-sorting protein | LolA | 122 | 3.00E-29 | Cell envelope biogenesis outer membrane
Contig1 | 57 | -- | S | COG4399 | Uncharacterized protein conserved in bacteria | COG4399 | 41 | 2.00E-04 | Function unknown
Contig1 | 58 | -- | O | COG1138 | Cytochrome c biogenesis factor | CcmF | 589 | 1.00E-169 | Posttranslational modification protein turnover chaperones
Contig1 | 59 | -- | O | COG0526 | Thiol-disulfide isomerase and thioredoxins | TrxA | 49 | 2.00E-07 | Posttranslational modification protein turnover chaperones / Energy production and conversion
Contig1 | 59 | -- | C | COG0526 | Thiol-disulfide isomerase and thioredoxins | TrxA | 49 | 2.00E-07 | Posttranslational modification protein turnover chaperones / Energy production and conversion
Contig1 | 60 | -- | O | COG3088 | Uncharacterized protein involved in biosynthesis of c-type cytochromes | CcmH | 157 | 4.00E-40 | Posttranslational modification protein turnover chaperones
Contig1 | 61 | -- | O | COG4235 | Cytochrome c biogenesis factor | COG4235 | 103 | 3.00E-23 | Posttranslational modification protein turnover chaperones
Contig1 | 62 | -- | F | COG0563 | Adenylate kinase and related kinases | Adk | 168 | 3.00E-43 | Nucleotide transport and metabolism
Contig1 | 64 | -- | R | COG1949 | Oligoribonuclease (3'->5' exoribonuclease) | Orn | 293 | 7.00E-81 | RNA processing and modification
Contig1 | 65 | -- | I | COG1502 | Phosphatidylserine/phosphatidylglycerophosphate/cardiolipin synthases and related enzymes | Cls | 230 | 1.00E-61 | Lipid metabolism
Contig1 | 66 | -- | R | COG0790 | FOG: TPR repeat SEL1 subfamily | COG0790 | 56 | 4.00E-09 | General function prediction only
Contig1 | 67 | -- | M | COG1519 | 3-deoxy-D-manno-octulosonic-acid transferase | KdtA | 330 | 2.00E-91 | Cell envelope biogenesis outer membrane
Contig1 | 68 | -- | S | COG1385 | Uncharacterized protein conserved in bacteria | COG1385 | 120 | 1.00E-28 | Function unknown
Contig1 | 69 | -- | P | COG1840 | ABC-type Fe3+ transport system periplasmic component | AfuA | 164 | 1.00E-41 | Inorganic ion transport and metabolism
Contig1 | 70 | -- | P | COG1178 | ABC-type Fe3+ transport system permease component | ThiP | 287 | 1.00E-78 | Inorganic ion transport and metabolism
Contig1 | 71 | -- | E | COG3842 | ABC-type spermidine/putrescine transport systems ATPase components | PotA | 318 | 5.00E-88 | Amino acid transport and metabolism
Contig1 | 74 | -- | L | COG0188 | Type IIA topoisomerase (DNA gyrase/topo II topoisomerase IV) A subunit | GyrA | 941 | 0 | DNA replication recombination and repair
Contig1 | 76 | -- | O | COG0625 | Glutathione S-transferase | Gst | 94 | 7.00E-21 | Posttranslational modification protein turnover chaperones
Contig1 | 77 | -- | No COG
Contig1 | 78 | -- | R | COG0515 | Serine/threonine protein kinase | SPS1 | 83 | 2.00E-17 | General function prediction only / Signal transduction mechanisms / Transcription / DNA replication recombination and repair
Contig1 | 78 | -- | T | COG0515 | Serine/threonine protein kinase | SPS1 | 83 | 2.00E-17 | General function prediction only / Signal transduction mechanisms / Transcription / DNA replication recombination and repair


Taxonomic Distribution of Top KEGG BLAST Hits


Frequency distance distributions

Salgado et al. PNAS (2000) 97:6652, Fig. 2


Frequency distance distributions

Salgado et al. PNAS (2000) 97:6652, Fig. 3b


Branched Chain Amino Acid Transporter family

Organism |  |  | COG0410 (ATPase) | COG0411 (ATPase) | COG0559 (Permease) | COG0683 (PBP)
Nostoc punctiforme | Cyano | JGI | 3 | 3 | 6 | 4
Trichodesmium erythraeum | Cyano | JGI | 2 | 1 | 3 | 6
Helicobacter pylori J99 | epsilon | COG | 0 | 0 | 0 | 0
Helicobacter pylori 26695 | epsilon | COG | 0 | 0 | 0 | 0
Campylobacter jejuni subsp. jejuni NCTC 11168 | epsilon | COG | 1 | 1 | 2 | 2
Geobacter metallidurans | delta | JGI | 1 | 1 | 2 | 1
Desulfovibrio desulfuricans | delta | JGI | 2 | 2 | 4 | 4
Escherichia coli K12 | gamma | COG | 1 | 1 | 2 | 2
Escherichia coli O157:H7 EDL933 | gamma | COG | 1 | 1 | 2 | 2
Buchnera sp. APS | gamma | COG | 0 | 0 | 0 | 0
Pseudomonas aeruginosa, PAO1 | gamma | COG | 3 | 3 | 7 | 4
Pseudomonas fluorescens | gamma | JGI | 3 | 3 | 5 | 3
Pseudomonas syringae | gamma | JGI | 7 | 4 | 8 | 5
Psychrobacter | gamma | JGI | 0 | 0 | 1 | 0
Vibrio cholerae O1 biovar eltor str. N16961 | gamma | COG | 0 | 0 | 1 | 0
Yersinia pestis, CO92 | gamma | COG | 2 | 2 | 4 | 2
Yersinia pseudotuberculosis | gamma | JGI | 2 | 2 | 4 | 2
Haemophilus influenzae Rd KW20 | gamma | COG | 0 | 0 | 0 | 0
Pasteurella multocida subsp. multocida str. Pm70 | gamma | COG | 0 | 0 | 0 | 0
Xylella fastidiosa (3 strains) | gamma | COG, JGI | 0 | 0 | 0 | 0
Azotobacter vinelandii | gamma | JGI | 3 | 4 | 8 | 2
Psychrobacter | gamma | JGI | 0 | 0 | 0 | 0
Burkholderia fungorum | beta | JGI | 22 | 20 | 34 | 29
Burkholderia mallei | beta |  | 6 | 6 | 11 | 8
Burkholderia pseudomallei | beta |  | 7 | 7 | 13 | 10
Ralstonia metallidurans | beta | JGI | 9 | 8 | 16 | 12
Ralstonia eutropha | beta | JGI | 18 | 19 | 36 | 28
Nitrosomonas europaea | beta | JGI | 0 | 0 | 0 | 0
Neisseria meningitidis MC58 | beta | COG | 0 | 0 | 0 | 0
Neisseria meningitidis Z2491 | beta | COG | 0 | 0 | 0 | 0
Caulobacter crescentus | alpha | COG | 0 | 0 | 0 | 0
Mesorhizobium loti | alpha | COG | 7 | 7 | 16 | 10
Agrobacterium tumefaciens | alpha | COG | 7 | 7 | 15 | 9
Bradyrhizobium japonicum | alpha |  | 27 | 26 | 50 | 59
Brucella melitensis | alpha |  | 7 | 7 | 12 | 13
Brucella suis | alpha |  | 6 | 6 | 12 | 11
Sinorhizobium meliloti | alpha |  | 5 | 5 | 12 | 8
Rickettsia conorii | alpha |  | 0 | 0 | 1 | 0
Rickettsia prowazekii | alpha | COG | 0 | 0 | 1 | 0
Rhodobacter sphaeroides | alpha | JGI | 6 | 6 | 12 | 5
Rhodospirillum rubrum | alpha | JGI | 6 | 7 | 13 | 9
Rhodopseudomonas palustris | alpha | JGI | 20 | 20 | 40 | 38


Probable Ancient Gene (Liv Operon)


Branched Chain Amino Acid Transporter family – Rhodopseudomonas palustris

Target ID | Description | Putative Ligand | Thermal Shift Assay Binding Ligand | ΔTm (°C) for 1000 µM Ligand OR (100 µM) Ligand | Tm (°C), No Ligand
RPA0985 | putative branched-chain amino acid transport system substrate-binding protein | branched chain AAs | 4-Hydroxybenzoate, Benzoate, Salicylate, Benzaldehyde | 29.0, 13.5, 2.5, 2.0 | 56.5
RPA4029 | possible branched-chain amino acid ABC transport system substrate-binding protein | branched chain AAs | 4-Hydroxybenzoate, p-Coumarate | 17.0, 2.0 | 58.6
RPA4648 | possible ABC transporter binding protein component | spermidine/putrescine | p-Coumarate | 2.0 | 55.5
RPA1250 | amide-urea binding protein | branched chain AAs | Urea | 5.0 | 63.0
RPA1789 | putative branched-chain amino acid transport system substrate-binding protein | branched chain AAs | p-Coumarate | 7.0 | 67.0
RPA3669 | putative urea short-chain amide or branched-chain amino acid uptake ABC transporter periplasmic solute-binding protein precursor | branched chain AAs | Urea | 6.0 | 59.5
RPA3810 | putative periplasmic binding protein of ABC transporter | branched chain AAs | Ala, Gly, Ser, Met, Leu, Cys | 11.5, 6.5, 4.5, 2.5, 2.0, 2.0 | 77.5
RPA2043 | putative ABC transporter, periplasmic substrate-binding protein | nitrate/taurine | Malate | 4.0 | 52.5
RPA2628 | polar amino acid ABC transport substrate-binding protein, aapJ-2 (aapJ-2) | amino acids, prefers polar aas | Met, Cys, His | 10.0, 6.5, 3.5 | 63.0
RPA0668 | putative ABC transporter subunit, substrate-binding component | branched chain AAs | 4-Hydroxybenzoate, Salicylate, Benzaldehyde | 13.0, (6.0, 2.0) | 61.5
RPA1741 | possible branched-chain amino acid transport system substrate-binding protein | branched chain AAs | Met, Leu, Malate, Gly, Pro | 6.0, 3.0, 3.0, 2.0, 2.0 | 52.0
RPA2193 | putative ABC transporter, periplasmic binding protein, branched chain amino acids | branched chain AAs | Glutarate | 5.0 | 64.5
RPA3486 | putative branched-chain amino acid transport system substrate-binding protein | branched chain AAs | Glutarate | 3.0 | 44.5
RPA2499 | possible ABC transporter, periplasmic protein | nitrate/taurine or aliphatic sulfonates | Asn | 7.0 | 53.5


Example of Lateral Transfer


Transporter Gene Loss in Yersinia pestis

· 36 genes involved in transport in Yersinia pseudotuberculosis (YPSE) are nonfunctional in Yersinia pestis (YPES)
· 13 lost due to frameshifts
· 11 lost due to deletions
· 6 lost due to IS element insertions
· 4 (2 pair) lost due to recombination causing deletions and frameshifts
· 2 lost due to premature stop codons


Nostoc punctiforme Signal Transduction Histidine Kinases


Nostoc punctiforme Signal Transduction Histidine Kinases


Nostoc punctiforme Signal Transduction Histidine Kinases

Columns: Gene # | aa# | COG | N-term. (TM) | N-term. RRR | Other domain | PAS/PAC | GAF (PHY) | HAMP + TM | HisKA | HATPase | C-term. RRR | Operon structure
(rows list only the non-empty cells, in column order)

R1448 374 COG0642 1 1 1 K1448/K1449
R1449 444 COG0642 1 1 1 K1448/K1449
R1550 595 COG0642 1 1 1
R1597 1042 COG4191 4 (3) 1 1
R1685 1559 COG0642 unk. (2) 1 Chase/1 Hpt 1 3(1) 1 1 2
R1759 706 COG4251 2 (1) 1 1 RRR1757/RRR1758/K1759/K1760
R1760 595 COG0642 1 1 1 1 RRR1757/RRR1758/K1759/K1760
R1778 451 COG5002 unk. (2) 1 1 K1778/WHTH1779
R1798 1098 COG0642 unk.(1) 1 1 1 1 1 K-R1798/K-F1799/LuxR-F1800
R1868 713 COG0642 unk. (3) 1 1 1 1
R2035 1080 COG0642 5 1 1 1 K-R2035/cNMP-F2036
R2209 430 COG0642 1 1 1
R2262 657 COG0642 1 1 1 1 1 2262-9
R2263 740 COG0642 1 2 2 2262-9
R2268 709 COG0642 2 1 1 2262-9
R2271 504 COG4191 1 1 1 K2271/PK&K2272
R2272 1801 COG3899 1 Prt. Kin. 1 1? 1 K2271/PK&K2272
R2375 1211 COG5278 (sp) 1 Chase 1 1 1 1 3
R2408 928 COG4585 unk. (1) 1 Cache 1 HisKA_3 1 LuxR2407/K2408
R2421 421 COG4585 unk. (4) HisKA_3 1 LuxR2420/K2421
R2485 530 COG0642 unk. 1 1
R2901 629 COG0642 1 1 1 1 1 K2901/RRR2902/K2903
R2903 1116 COG4251 1 3 2 (1) 1 1 K2901/RRR2902/K2903
R2909 103 COG4251 0 0.5
R3010 210 COG0642 1 0
R3052 475 COG2205 unk. (1) 1 1 1


Nostoc punctiforme Signal Transduction Histidine Kinases

169 predicted genes total
12 pseudogenes
3 genes with sensors but no kinase domain (do these work with the genes with no sensor domains - not in the same operon)

154 functional Signal Transduction Histidine Kinases
2 with 2 kinase domains (fused genes? or a 1 gene cascade?)
1 with an Adenylate Cyclase domain
12 with a Ser/Thr Protein Kinase domain, a COG3899 domain, 1 or more GAF domains, and possibly other domains
3 with Hpt domain
3 with CBS domains
6 with Chase domains
3 with Cache domains
1 with an Amino acid transporter as a sensor? domain
23 with a N-terminal RRR domain
15 with only a N-terminal RRR as a sensor? domain
66 with 1 or more RRR domains (86 RRR domains)
46 with 1 or more C-terminal RRR domains (i.e. Hybrid kinases)
61 with 1 or more PAS/PAC domains (147 PAS/PAC domains total)
59 with 1 or more GAF or Phytochrome domains (96 total - 38 phytochromes)
21 with HAMP domains (34 total)
64 with unknown N-terminal sensor domains
82 with multiple N-terminal sensor domains
3 with no sensor domain (do these work with the genes with no kinase domains - not in the same operon)
1 with large C-terminal unknown domain
1 with N-terminal RRR & WHTH (fused genes?)
1 cNMP binding sensor domain
2 with HisKA_2 type dimerization/autophosphorylation domains
5 with HisKA_3 type dimerization/autophosphorylation domains
8 putative operons with common (bidirectional) promoter

TM: Transmembrane alpha helical domain
RRR: response regulator receiver domain (Phospho accepting Asp containing domain)


Nostoc punctiforme Regulatory Proteins

570 Regulatory Proteins | Comments/Pseudogenes

201 Transcription/Elongation/Termination Factors

14 Sigma Factors
  9 Cyanobacterial Sigma Factors | 2 sets of pseudogenes: pNPAR018 truncated by transposase; pNPAR022, 3, 4 are remnants of a decayed gene
  0 Sigma-54 (RpoN)
  0 Sigma 32 (RpoH)
  2 Sigma 28 (Flagella/Sporulation)
  2 Sigma-24 (RpoE/FecI) (ECF subfamily)
  1 Unknown Sigma factor (ECF subfamily) | 1 set of pseudogenes: NpR2325/6

17 Anti/Anti-Anti Sigma Factors
  1 Anti-Sigma regulatory factor (Ser/Thr protein kinase and phosphatase)
  8 Anti-Sigma-factor antagonist (STAS) domain protein
  1 Anti-Sigma-factor antagonist (STAS) and sugar transferase
  1 Predicted transmembrane transcriptional regulator (anti-sigma factor)
  5 Putative Anti-Sigma regulatory factor (Ser/Thr protein kinase)
  1 Sigma 54 modulation protein/ribosomal protein S30EA

3 Termination/Antitermination Factors
  1 NusA antitermination factor | S1 RNA binding domain:KH domain / RNA binding
  1 NusB antitermination factor
  1 NusG antitermination factor

0 Elongation Factors
  0 GreA/GreB family elongation factors

167 Transcription factors
  3 Ferric uptake regulator (FUR) family
  1 Negative regulator of class I heat shock protein
  2 phage shock protein A, PspA
  1 Phosphate uptake regulator, PhoU
  1 Plasmid maintenance system antidote protein
  6 Predicted transcriptional regulator | 4 different COGs
  1 SOS-response transcriptional repressor, LexA
  1 Putative transcriptional activator, Baf
  1 Transcriptional Regulator, AbrB family
  5 Transcriptional Regulator, AraC family
  1 Transcriptional Regulator, AraC family with Methyltransferase activity
  5 Two Component Transcriptional Regulator, AraC family


Burkholderia xenovorans Regulatory Proteins

946 Regulatory Proteins | Comments

704 Transcription/Elongation/Termination Factors

22 Sigma Factors
  4 Sigma 70 (RpoD)
  2 Sigma-54 (RpoN)
  2 Sigma 32 (RpoH)
  1 Sigma 28 (Flagella/Sporulation)
  12 Sigma-24 (RpoE/FecI) (ECF subfamily)
  1 Unknown Sigma factor (ECF subfamily)

13 Anti/Anti-Anti Sigma Factors
  1 Anti Sigma-E protein, RseA, Burkholderiaceae specific
  1 Anti-Sigma regulatory factor (Ser/Thr protein kinase and phosphatase)
  1 Anti-Sigma(ECF) factor, ChrR
  2 Anti-Sigma-factor antagonist (STAS) domain protein
  4 Predicted transmembrane transcriptional regulator (anti-sigma factor)
  1 Putative Anti-Sigma regulatory factor (Ser/Thr protein kinase)
  1 Putative Anti-Sigma-28 factor, FlgM
  1 Putative Sigma E regulatory protein, MucB/RseB
  1 Sigma-54 modulation protein also called ribosomal protein S30AE

6 Termination/Antitermination Factors
  1 transcription termination factor Rho | Cold-shock DNA-binding domain (related to S1 RNA binding domain)
  2 Response regulator receiver (CheY) and ANTAR domain protein | ANTAR = RNA binding, anti-termination
  1 NusA antitermination factor | S1 RNA binding domain:KH domain / RNA binding
  1 NusB antitermination factor
  1 NusG antitermination factor

3 Elongation Factors
  3 GreA/GreB family elongation factors

660 Transcription factors
  7 Cold-shock DNA-binding domain protein
  1 Possible Ferric uptake regulator (FUR) family
  2 Ferric uptake regulator (FUR) family
  1 Negative regulator of class I heat shock protein
  1 Negative transcriptional regulator


Regulatory ProteinIdentification Scheme

Number | Category | Product Description | COG1 COG2 InterPro Pfam Smart TIGR

5 | Chemotaxis Signal Transduction | Possible Bacterial chemotaxis sensory transducer | COG0840
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer | COG0840 MCPsignal MA
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer | IPR004089
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer, TarH (aspartate) sensor | COG0840 MCPsignal and Tar MA and TarH
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer, Pas/Pac sensor | COG0840 MCPsignal MA sensory_box
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer, Cache sensor | COG0840 MCPsignal and Cache MA
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer, GAF sensor | COG0840 MCPsignal MA and GAF
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer, Phytochrome sensor | COG0840 IPR001294 MCPsignal MA and GAF
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer, Phytochrome sensor | IPR001294 and IPR004089

5 | Chemotaxis Signal Transduction | CheW protein | COG0835 CheW CheW
5 | Chemotaxis Signal Transduction | CheW protein | IPR002545 CheW
5 | Chemotaxis Signal Transduction | Two component CheW protein | IPR002545 and IPR001789 CheW
5 | Chemotaxis Signal Transduction | Possible CheA Signal Transduction Histidine Kinases (STHK), weak homolog, no good domain identification | COG0643
5 | Chemotaxis Signal Transduction | Possible CheA Signal Transduction Histidine Kinases (STHK) | COG0643 HATPase_c HATPase_c
5 | Chemotaxis Signal Transduction | CheA Signal Transduction Histidine Kinases (STHK) | COG0643 IPR008207 and IPR003594
5 | Chemotaxis Signal Transduction | CheA Signal Transduction Histidine Kinases (STHK) | IPR002545 and IPR003594 and IPR004105 HPT and CheW
5 | Chemotaxis Signal Transduction | CheB methylesterase | COG2201 CheB_methylest
5 | Chemotaxis Signal Transduction | CheB methylesterase | IPR000673 and IPR001789 CheB_methylest
5 | Chemotaxis Signal Transduction | Two component CheB methylesterase | COG2201 IPR001789 CheB_methylest
5 | Chemotaxis Signal Transduction | MCP methyltransferase, CheR-type | COG1352 CheR MeTrc
5 | Chemotaxis Signal Transduction | MCP methyltransferase, CheR-type | IPR000780 CheR MeTrc
5 | Chemotaxis Signal Transduction | MCP methyltransferase, CheR-type with PAS/PAC sensor | COG1352 CheR MeTrc sensory_box
5 | Chemotaxis Signal Transduction | MCP methyltransferase/methylesterase, CheR/CheB with PAS/PAC sensor | COG1352 CheR and CheB_methylest MeTrc sensory_box
5 | Chemotaxis Signal Transduction | CheC, inhibitor of MCP methylation | COG1776
5 | Chemotaxis Signal Transduction | CheD, stimulates methylation of MCP proteins | COG1871
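
The scheme on this slide maps a combination of COG, InterPro, Pfam, SMART and TIGR hits to a product description. A minimal Python sketch of that kind of lookup follows, using a few rows from the table. The names `RULES` and `describe`, and the choice to try the most specific rule first, are assumptions made for illustration; this is not the ORNL pipeline's code.

```python
# Each rule pairs a set of required hits with a product description.
# The hit names and descriptions are copied from rows of the table;
# ordering rules from most to least specific is an assumption of this sketch.
RULES = [
    ({"COG0840", "MCPsignal", "MA", "sensory_box"},
     "Bacterial chemotaxis sensory transducer, Pas/Pac sensor"),
    ({"COG0840", "MCPsignal", "MA"},
     "Bacterial chemotaxis sensory transducer"),
    ({"COG0840"},
     "Possible Bacterial chemotaxis sensory transducer"),
    ({"COG1352", "CheR", "MeTrc"},
     "MCP methyltransferase, CheR-type"),
    ({"COG1776"}, "CheC, inhibitor of MCP methylation"),
    ({"COG1871"}, "CheD, stimulates methylation of MCP proteins"),
]


def describe(hits):
    """Return the description of the first rule whose required hits are all present."""
    hits = set(hits)
    for required, description in RULES:
        if required <= hits:  # subset test: every required hit was found
            return description
    return None  # no rule matched; the protein is left unclassified
```

A protein with only a COG0840 hit falls through to the "Possible" description, which mirrors how the table separates weak evidence from full domain support.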

Page 53:

Summary of automated transporter annotation --- Zymomonas

317 Transporter Proteins

69 1 Channels/Pores
82 2 Electrochemical Potential-driven transporters
116 3 Primary Active Transporters
2 4 Group Translocators
2 5 Transport Electron Carriers
14 8 Accessory Factors Involved in Transport
29 9 Incompletely Characterized Transport Systems

23 1.A alpha-type channels
46 1.B beta barrel porins
73 2.A Porters (uniporters, symporters, antiporters)
9 2.C Ion-gradient-driven energizers
103 3.A P-P-bond-hydrolysis-driven transporters
2 3.B Decarboxylation-driven transporters
13 3.D Oxidoreduction-driven transporters
2 4.A Phosphotransfer-driven group translocators
1 5.A Transmembrane 2-Electron Transfer Carriers
1 5.B Transmembrane 1-Electron Transfer Carriers
14 8.A Auxiliary transport proteins
12 9.A Recognized transporters of unknown biochemical mechanism
17 9.B Putative uncharacterized transport proteins

Page 54:

Zymomonas transporters: complete listing

GROUP 2.A.53 2 proteins
Porters | sulfate transporter or Xanthine/uracil/vitamin C transporter | or0489 | 2.A.53
Porters | carbonic anhydrase, sulfate transporter SulP family | or1027 | 2.A.53

GROUP 2.A.6 8 proteins
Porters | putative lipooligosaccharide nodulation factor exporter, NolGHI, RND superfamily | or0146 | 2.A.6.3/2
Porters | hydrophobe/amphiphile efflux-1 HAE1, RND superfamily | or0252 | 2.A.6.2
Porters | acriflavin resistance protein, RND superfamily | or0704 | 2.A.6.2
Porters | efflux transporter, RND family, MFP subunit | or1290 | 2.A.6.2/3.A.1.122/8.A.1
Porters | acriflavin resistance protein, RND superfamily | or1378 | 2.A.6.2
Porters | acriflavin resistance protein, RND superfamily | or1379 | 2.A.6.2
Porters | hopanoid biosynthesis associated RND transporter like protein HpnN | or1439 | 2.A.6.5/7
Porters | export membrane protein SecD, RND superfamily | or1719 | 2.A.6.4.1

GROUP 2.A.64 3 proteins
Porters | twin-arginine translocation protein TatC | or1107 | 2.A.64.1.1
Porters | twin-arginine translocation protein TatB | or1108 | 2.A.64.1.1
Porters | twin-arginine translocation protein TatA | or1109 | 2.A.64.1.1

GROUP 2.A.66 5 proteins
Porters | multi antimicrobial extrusion protein MatE | or0190 | 2.A.66.1
Porters | polysaccharide biosynthesis protein | or0202 | 2.A.66.2
Porters | polysaccharide biosynthesis protein | or1191 | 2.A.66.2
Porters | polysaccharide biosynthesis protein | or1303 | 2.A.66.2
Porters | virulence factor MviN family | or1478 | 2.A.66.4.1

GROUP 2.A.69 2 proteins
Porters | predicted transporter, putative auxin efflux carrier component | or0625 | 2.A.69.2./1
Porters | predicted transporter, putative auxin efflux carrier component | or0626 | 2.A.69.2./1
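
Per-class and per-subclass transporter counts can be tallied from per-protein TC numbers like those in this listing. The Python sketch below shows one way to do that. The function name `tc_summary` is hypothetical, and counting only the first-listed TC number when several are given (as in `2.A.6.2/3.A.1.122/8.A.1`) is an assumption of this sketch, not necessarily how the slide's totals were produced.

```python
from collections import Counter


def tc_summary(tc_numbers):
    """Tally TC numbers by class (e.g. '2') and by subclass (e.g. '2.A')."""
    by_class, by_subclass = Counter(), Counter()
    for tc in tc_numbers:
        # Assumption: when several assignments are listed, count only the first.
        primary = tc.split("/")[0]
        fields = primary.split(".")
        by_class[fields[0]] += 1
        by_subclass[".".join(fields[:2])] += 1
    return by_class, by_subclass


# TC numbers taken from the listing above.
example = ["2.A.53", "2.A.6.2", "2.A.6.2/3.A.1.122/8.A.1", "2.A.64.1.1", "2.A.66.4.1"]
by_class, by_subclass = tc_summary(example)
print(by_class["2"], by_subclass["2.A"])  # prints: 5 5
```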

Page 55:

Transcriptome Analysis Pipeline: RNA sequences to GRN

· Collect RNAseq data
· Map reads to genomes
· Calculate reads/bp
· Display frequency plot
· Determine operons from frequency plot
· Compare operon determinations (genome coordinates)
· Predict operons in silico
· Improve algorithm
· Determine orthologous operons
· Determine orthologs with OrthoMCL
· Align orthologous promoters
· Determine TFBS from alignments
· Determine TISs with 5’ RACE
· Cluster analysis from gene expression arrays
· Predict TFBS in silico
· Cluster analysis of gene expression changes
· GRN (genetic regulatory network)
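
The first steps of the pipeline (mapped reads, reads/bp, operons from the frequency plot) can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the pipeline's implementation: reads are given as 0-based, end-exclusive `(start, end)` intervals, genes are sorted by start and lie on one strand, and two neighbouring genes are joined into an operon when read depth never falls below `min_depth` in the gap between them. The names `per_base_coverage`, `call_operons` and `min_depth` are hypothetical.

```python
def per_base_coverage(genome_length, reads):
    """Read depth at each base, from (start, end) intervals (0-based, end-exclusive)."""
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1   # depth rises where a read begins
        diff[end] -= 1     # and falls just past where it ends
    depth, running = [], 0
    for delta in diff[:genome_length]:
        running += delta
        depth.append(running)
    return depth


def call_operons(genes, depth, min_depth=1):
    """Group consecutive genes whose intergenic gap stays at or above min_depth.

    genes: list of (name, start, end), sorted by start, same strand (assumption).
    Overlapping or abutting genes have an empty gap and are always joined here,
    which is a simplification of this sketch.
    """
    operons = []
    for gene in genes:
        if operons:
            prev_end = operons[-1][-1][2]
            gap = depth[prev_end:gene[1]]
            if all(d >= min_depth for d in gap):
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return [[g[0] for g in operon] for operon in operons]


# Toy data: genes a and b are covered continuously; coverage drops to zero before c.
depth = per_base_coverage(100, [(0, 50), (10, 60), (80, 95)])
genes = [("a", 0, 20), ("b", 30, 55), ("c", 75, 95)]
print(call_operons(genes, depth))  # prints: [['a', 'b'], ['c']]
```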

Page 56:

Dynamic range and sensitivity

Page 57:

New gene, wrong start, riboswitch

Page 58:

Small Regulatory RNA ???

Page 59:

Differential gene expression

Page 60:

Operon with Internal Promoter

Page 61:

Long Term Vision

· Develop TPing SOPs and an automated analysis pipeline.

· Initially produce TPs and preliminary GRNs for all important DOE microbial genomes (i.e. BESC), and eventually all DOE microbial genomes.

· Incorporate the TP analysis pipeline into ORNL’s automated microbial annotation pipeline, and eventually into IMG and GenBank files.

· Add further experimental methods to improve the GRN determinations.