Family of HMMs Nam Nguyen University of Texas at Austin
Jan 17, 2016

Transcript
Page 1: Family of HMMs

Family of HMMs

Nam Nguyen
University of Texas at Austin

Page 2: Family of HMMs

Outline of Talk

Background

Family of HMMs

Model

Alignment algorithm

Applications of fHMM

SEPP (Mirarab, Nguyen, and Warnow. PSB 2012)

TIPP (Nguyen, et al. Under review)

Conclusions and future work

Page 3: Family of HMMs

Phylogenetics

Study of the evolutionary relationships among different species

Applications to many fields such as drug discovery, agriculture, and biotechnology

Tools for alignment and phylogeny estimation are critical.

Courtesy of Tree of Life Project

Page 4: Family of HMMs

Gather Sequences

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Page 5: Family of HMMs

Align Sequences

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
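A quick sanity check for an alignment like this one: every row should have the same length, and stripping the gap characters should give back the input sequence. A minimal Python sketch using the four sequences from the slide; the helper name `check_alignment` is illustrative and not part of any tool named in the talk.

```python
# Unaligned input sequences, copied from the slide.
sequences = {
    "S1": "AGGCTATCACCTGACCTCCA",
    "S2": "TAGCTATCACGACCGC",
    "S3": "TAGCTGACCGC",
    "S4": "TCACGACCGACA",
}

# Aligned rows, copied from the slide ('-' marks a gap).
alignment = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}


def check_alignment(sequences, alignment):
    """Return True if all rows have equal length and degap to the inputs."""
    lengths = {len(row) for row in alignment.values()}
    if len(lengths) != 1:
        return False
    return all(
        alignment[name].replace("-", "") == seq
        for name, seq in sequences.items()
    )


print(check_alignment(sequences, alignment))
```

The check says nothing about alignment quality; it only confirms that the rows are a valid gapped version of the inputs.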

Page 6: Family of HMMs

Estimate Tree

[Figure: tree estimated on S1, S2, S3, and S4]

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

Page 7: Family of HMMs

Multiple Sequence Alignment

Fundamental step in bioinformatics pipelines

Used in phylogeny estimation, prediction of 2D/3D protein structure, and detection of conserved regions

Can be formulated as an NP-hard optimization problem

Popular heuristics include progressive alignment methods and iterative methods

Heuristics do not scale linearly with the number of sequences

Less accurate on large datasets or evolutionarily divergent datasets

Page 8: Family of HMMs

Profile Hidden Markov Model (HMM)

• Statistical model for representing an MSA

• Uses include

• inserting sequences into an alignment

• taxonomic identification

• homology detection

• functional annotation
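To give a feel for how an MSA becomes a statistical model, the sketch below estimates per-column emission probabilities from the four-sequence alignment shown earlier in the talk. It is a deliberate simplification and not a full profile HMM: there are no insert or delete states and no transition probabilities. The function name, the pseudocount of 1, and the 50% gap cutoff for skipping a column are all assumptions made for illustration.

```python
from collections import Counter


def match_emissions(rows, alphabet="ACGT", pseudocount=1.0, max_gap_fraction=0.5):
    """Estimate emission probabilities for each match column of an MSA.

    Columns where more than max_gap_fraction of the rows are gaps are
    skipped (a common convention is to treat them as insertions).
    """
    n_rows = len(rows)
    profile = []
    for col in range(len(rows[0])):
        column = [row[col] for row in rows]
        if column.count("-") / n_rows > max_gap_fraction:
            continue
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return profile


rows = [
    "-AGGCTATCACCTGACCTCCA",
    "TAG-CTATCAC--GACCGC--",
    "TAG-CT-------GACCGC--",
    "-------TCAC--GACCGACA",
]
profile = match_emissions(rows)
print(len(profile), "match columns")
print(profile[0])
```

Tools such as HMMER additionally estimate insert and delete states, transitions, and sequence weights, which this sketch leaves out.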

Page 11: Family of HMMs

Metagenomics

Courtesy of Wikipedia

• Study of sequencing genetic material directly from the environment

• Applications to biofuel production, agriculture, human health

• Sequencing technology produces millions of short reads from unknown species

• A fundamental step in the analysis is identifying the taxon of each read

Page 12: Family of HMMs

Phylogenetic Placement

Fragmentary reads (60-200 bp long):

ACCG, CGAG, CGG, GGCT, TAGA, GGGGG, TCGAG, GGCG, GGG, ..., ACCT

Known full-length sequences, with an alignment and a tree (500-10,000 bp long):

ACT..TAGA..A (species5)
AGC...ACA (species4)
TAGA...CTT (species3)
TAGC...CCA (species2)
AGG...GCAT (species1)

Page 13: Family of HMMs

Phylogenetic Placement

• Input: (Backbone) Alignment and tree on full-length sequences and a query sequence (short read)

• Output: Placement of the query sequence on the backbone tree

• Use placement to infer the relationship between the query sequence and the full-length sequences in the backbone tree

• Applications in metagenomic analysis

• Millions of reads

• Reads from different genomes mixed together

• Use placement to identify each read

Page 14: Family of HMMs

Phylogenetic Placement

Align each query sequence to backbone alignment to produce an extended alignment

Place each query sequence into the backbone tree using extended alignment

Page 15: Family of HMMs

Align Sequence

[Figure: backbone tree on S1, S2, S3, and S4]

S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = TAAAAC

Page 16: Family of HMMs

Align Sequence

[Figure: backbone tree on S1, S2, S3, and S4]

S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------

Page 17: Family of HMMs

Place Sequence

[Figure: backbone tree on S1, S2, S3, and S4 with query Q1 placed]

S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------

Page 18: Family of HMMs

Place Sequence

[Figure: backbone tree on S1, S2, S3, and S4 with queries Q1, Q2, and Q3 placed]

S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------

Query sequences are aligned and placed independently

Page 19: Family of HMMs

Phylogenetic Placement

Align each query sequence to backbone alignment:

HMMALIGN (Eddy, Bioinformatics 1998)

PaPaRa (Berger and Stamatakis, Bioinformatics 2011)

Place each query sequence into backbone tree, using extended alignment:

pplacer (Matsen et al., BMC Bioinformatics 2010)

EPA (Berger et al., Systematic Biology 2011)

Page 21: Family of HMMs

HMMER and PaPaRa results

[Figure: results plotted against increasing rate of evolution]

Backbone size: 500

5000 fragments

20 replicates

Page 22: Family of HMMs

Reducing Evolutionary Distance

Page 27: Family of HMMs

Family of HMMs (fHMM)

Represents the MSA with multiple HMMs

Input: backbone alignment and tree on full-length sequences S and max decomposition size N

Two steps:

Decompose tree into subtrees of closely related sequences, with at most N leaves in each subtree

Build HMMs on subalignments induced by subtrees
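The decomposition step can be sketched as a recursive split of the tree. The slide only states that subtrees have at most N leaves; the rule used below (repeatedly cut the edge that most evenly balances the number of leaves on each side) is an assumption chosen for illustration, and the function names and the small eight-leaf tree are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (node, node) edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def split(adj, edge):
    """Remove one edge and return the adjacency maps of the two components."""
    def reach(start):
        seen, stack = {start}, [start]
        while stack:
            a = stack.pop()
            for b in adj[a]:
                if {a, b} == set(edge) or b in seen:
                    continue
                seen.add(b)
                stack.append(b)
        return seen

    parts = []
    for start in edge:
        nodes = reach(start)
        parts.append({n: [m for m in adj[n] if m in nodes] for n in nodes})
    return parts


def decompose(adj, leaf_names, max_size):
    """Split the tree until every piece has at most max_size (>= 1) leaves."""
    leaves = sorted(n for n in adj if n in leaf_names)
    if len(leaves) <= max_size:
        return [leaves]
    edges = sorted({tuple(sorted((a, b))) for a in adj for b in adj[a]})

    def imbalance(edge):
        first, _ = split(adj, edge)
        n_first = sum(1 for n in first if n in leaf_names)
        return abs(len(leaves) - 2 * n_first)

    best = min(edges, key=imbalance)
    first, second = split(adj, best)
    return (decompose(first, leaf_names, max_size)
            + decompose(second, leaf_names, max_size))


# Hypothetical unrooted tree: ((S1,S2),(S3,S4)) joined to ((S5,S6),(S7,S8)).
tree = build_adjacency([
    ("S1", "a"), ("S2", "a"), ("S3", "b"), ("S4", "b"),
    ("a", "e"), ("b", "e"), ("e", "f"), ("c", "f"), ("d", "f"),
    ("S5", "c"), ("S6", "c"), ("S7", "d"), ("S8", "d"),
])
leaf_names = {f"S{i}" for i in range(1, 9)}
for subset in decompose(tree, leaf_names, max_size=4):
    print(subset)
```

Each returned leaf set induces a subalignment of the backbone alignment, and one HMM would then be built per subalignment (for example with HMMER's hmmbuild).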

Page 30: Family of HMMs

Alignment using fHMM

Score query sequence against every HMM and select HMM that yields best bit score

Insert query sequence into subalignment, and by transitivity align query sequence to backbone alignment
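The two steps can be sketched as follows, under stated assumptions: the bit scores are taken as given (in practice they would come from an HMM search tool), each subalignment column is assumed to map to exactly one backbone column, and query characters that fall in insert states are ignored. The function names, the scores, and the column map are hypothetical.

```python
def pick_best_hmm(bit_scores):
    """Return the name of the HMM with the highest bit score."""
    return max(bit_scores, key=bit_scores.get)


def lift_to_backbone(query_row, column_map, backbone_length):
    """Move a query row from subalignment columns to backbone columns.

    column_map[i] is the backbone column that subalignment column i
    came from; this mapping is what "by transitivity" relies on.
    """
    lifted = ["-"] * backbone_length
    for sub_col, char in enumerate(query_row):
        lifted[column_map[sub_col]] = char
    return "".join(lifted)


# Hypothetical bit scores of one query against three HMMs.
bit_scores = {"HMM 1": 12.5, "HMM 2": 31.0, "HMM 3": 7.2}
best = pick_best_hmm(bit_scores)
print(best)

# Hypothetical subalignment that kept 6 of 8 backbone columns.
column_map = [0, 1, 3, 4, 6, 7]
print(lift_to_backbone("A-CG-T", column_map, backbone_length=8))
```

Backbone columns that the chosen subalignment dropped are filled with gaps for the query, which keeps the backbone alignment itself unchanged.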

Page 33: Family of HMMs

SEPP

SEPP = SATé-Enabled Phylogenetic Placement

Developers: Mirarab, Nguyen, and Warnow

Two stages of decomposition:

Placement decomposition

Alignment decomposition

Parameterized by N and M

N: maximum size of alignment subsets

M: maximum size of placement subsets

N ≤ M

Published at Pacific Symposium on Biocomputing 2012

Page 34: Family of HMMs

Stage 1: Placement decomposition

[Figure: backbone tree on S1 through S15]

N=4, M=8

Decompose tree into placement sets of size ≤ 8

Decompose each placement set into alignment sets of size ≤ 4

Page 35: Family of HMMs

SEPP 4/8: Decompose Tree

N=4, M=8

Decompose tree into placement sets of size ≤ 8

Decompose each placement set into alignment sets of size ≤ 4

[Figure: backbone tree on S1 through S15 decomposed into placement sets and alignment sets]

Page 36: Family of HMMs

Stage 2: Alignment decomposition

N=4, M=8

Decompose tree into placement sets of size ≤ 8

Decompose each placement set into alignment sets of size ≤ 4

[Figure: backbone tree on S1 through S15 decomposed into placement sets and alignment sets]

Page 37: Family of HMMs

Align and Place Fragment

Align to best HMM

Place within placement subtree containing HMM

[Figure: decomposed backbone tree, with S10 shown apart from the subsets]

Page 38: Family of HMMs

Align and Place Fragment

Align to best HMM

Place within placement subtree containing HMM

[Figure: decomposed backbone tree, with S10 shown apart from the subsets and a query labeled Q]

Page 39: Family of HMMs

Align and Place Fragment

Align to best HMM

Place within placement subtree containing HMM

[Figure: decomposed backbone tree, with S10 shown apart from the subsets and labels Q and Q’]

Page 47: Family of HMMs

SEPP Parameters: Simulated

M2 model condition, 500 true backbone

Increasing placement size: 10, 50, 250, 500

Increases: accuracy, memory, running time

Page 55: Family of HMMs

SEPP Parameters: Simulated

M2 model condition, 500 true backbone

Decreasing alignment size: 250, 50, 10

Increases: running time, accuracy?

Page 61: Family of HMMs

SEPP Parameters: Biological

16S.B.ALL dataset, 13k curated backbone

Decreasing alignment size: 1000, 250, 100, 50

Increases: running time, accuracy

Page 62: Family of HMMs

SEPP (10% rule) Simulated Results

[Figure: results plotted against increasing rate of evolution]

Backbone size: 500

5000 fragments

20 replicates

Page 63: Family of HMMs

SEPP Biological Results

16S.B.ALL dataset, curated alignment/tree, 13k backbone, 13k total fragments

For 1 million fragments:

PaPaRa+pplacer: ~133 days

HMMER+pplacer: ~30 days

SEPP 1000/1000: ~6 days
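The running times above can be put on a common scale by dividing by the SEPP figure. A small sketch using only the numbers on this slide; the dictionary and function names are illustrative.

```python
# Approximate running times for 1 million fragments, in days (from the slide).
days_per_million = {
    "PaPaRa+pplacer": 133,
    "HMMER+pplacer": 30,
    "SEPP 1000/1000": 6,
}


def relative_to(method, baseline="SEPP 1000/1000"):
    """Running time of a method as a multiple of the baseline's."""
    return days_per_million[method] / days_per_million[baseline]


for method in days_per_million:
    print(f"{method}: {relative_to(method):.1f}x the SEPP running time")
```

On these figures HMMER+pplacer takes 5 times as long as SEPP 1000/1000, and PaPaRa+pplacer about 22 times as long.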

Page 64: Family of HMMs

SEPP summary

Two stages of decomposition

Placement decomposition to form placement sets

Alignment decomposition to form fHMM

Results in 40% lower placement error than HMMER+pplacer on divergent datasets

1/5 running time of HMMER+pplacer on large backbones

Local placement uses less than 2 GB peak memory compared to 60-70 GB peak memory for global placement

Page 65: Family of HMMs

Outline of Talk

Background

Family of HMMs

Model

Alignment algorithm

Applications of fHMM

SEPP (Mirarab, Nguyen, and Warnow. PSB 2012)

TIPP (Nguyen, et al. Under review)

Conclusions and future work

Page 66: Family of HMMs

Taxonomic Identification and Profiling

Taxonomic identification

Objective: Given a query sequence, identify the taxon (species, genus, family, etc.) of the sequence

Classification problem

Methods include MEGAN, PhymmBL, MetaPhyler, and MetaPhlAn

Taxonomic profiling

Objective: Given a set of query sequences collected from a sample, estimate the population profile of the sample

Estimation problem

Can be solved via taxonomic identification

Page 67: Family of HMMs

Using SEPP

Fragmentary unknown reads (60-200 bp long):

ACCG, CGAG, CGG, GGCT, TAGA, GGGGG, TCGAG, GGCG, GGG, ..., ACCT

Known full-length sequences, with an alignment and a tree (500-10,000 bp long):

ACT..TAGA..A (species5)
AGC...ACA (species4)
TAGA...CTT (species3)
TAGC...CCA (species2)
AGG...GCAT (species1)

ML placement 40%

Page 68: Family of HMMs

Taxonomic Identification using Phylogenetic Placement

Adding uncertainty

Fragmentary unknown reads (60-200 bp long):

ACCG, CGAG, CGG, GGCT, TAGA, GGGGG, TCGAG, GGCG, GGG, ..., ACCT

Known full-length sequences, with an alignment and a tree (500-10,000 bp long):

ACT..TAGA..A (species5)
AGC...ACA (species4)
TAGA...CTT (species3)
TAGC...CCA (species2)
AGG...GCAT (species1)

ML placement 40%

2nd highest likelihood placement 38%

Page 69: Family of HMMs

TIPP: Taxonomic identification and phylogenetic profiling

• Developers: Nguyen, Mirarab, Pop, and Warnow

• SEPP takes the best extended alignment and finds the ML placement.

• We modify SEPP to use uncertainty:

• Find many extended alignments of fragments to each reference alignment to reach alignment support threshold

• Find many placements of fragments for each extended alignment to reach placement support threshold

• Takes alignment and placement support values

• Classify each fragment at the Lowest Common Ancestor of all placements obtained for the fragment

• Under review
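The classification rule can be sketched as: collect placements in order of decreasing probability until their total reaches the placement support threshold, then report the lowest common ancestor (LCA) of the taxa collected. The sketch below is not TIPP's implementation. The toy taxonomy, the 22% third placement, the assignment of the slide's 40% and 38% to particular species, and the threshold values are all assumptions for illustration.

```python
def lineage(taxon, parent):
    """Path from a taxon up to the root, inclusive."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Deepest taxonomy node shared by the lineages of all given taxa."""
    paths = [lineage(t, parent) for t in taxa]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up from the first taxon, the first shared node is the deepest.
    return next(node for node in paths[0] if node in common)


def classify(placements, parent, support_threshold):
    """placements: list of (taxon, probability) pairs for one fragment."""
    chosen, total = [], 0.0
    for taxon, prob in sorted(placements, key=lambda p: p[1], reverse=True):
        chosen.append(taxon)
        total += prob
        if total >= support_threshold:
            break
    return lowest_common_ancestor(chosen, parent)


# Hypothetical taxonomy: child -> parent.
parent = {
    "species1": "genusA", "species2": "genusA", "species3": "genusB",
    "genusA": "familyX", "genusB": "familyX",
}
placements = [("species1", 0.40), ("species2", 0.38), ("species3", 0.22)]

print(classify(placements, parent, support_threshold=0.30))
print(classify(placements, parent, support_threshold=0.75))
print(classify(placements, parent, support_threshold=0.95))
```

A higher threshold pulls in more placements and so tends to push the classification to a higher taxonomic rank, which is the precision versus sensitivity trade-off the threshold controls.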

Page 70: Family of HMMs

Experimental Design

Taxonomic identification

Used leave-one-out experiments to examine accuracy when classifying novel taxa

Used non-leave-one-out experiments with fragments simulated under different error models to examine robustness

Fragments simulated under Illumina-like and 454-like error models

Taxonomic profiling

Collected simulated datasets from various studies

Estimated profiles on simulated samples

Computed Root Mean Squared Error for each profile
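For reference, a minimal sketch of a root mean squared error between an estimated and a true abundance profile. The slide does not say how taxa missing from one profile were handled or how the error was normalized; here a missing taxon counts as abundance 0 and the mean is taken over the union of taxa, both of which are assumptions. The profiles are made-up numbers.

```python
import math


def rmse(estimated, true):
    """Root mean squared error over the union of taxa in both profiles."""
    taxa = set(estimated) | set(true)
    squared = [(estimated.get(t, 0.0) - true.get(t, 0.0)) ** 2 for t in taxa]
    return math.sqrt(sum(squared) / len(taxa))


# Hypothetical profiles: taxon -> relative abundance.
true_profile = {"A": 0.5, "B": 0.3, "C": 0.2}
estimated_profile = {"A": 0.6, "B": 0.3, "D": 0.1}

print(round(rmse(estimated_profile, true_profile), 4))
```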

Page 71: Family of HMMs

Leave-one-out comparison

Page 72: Family of HMMs

Robustness to sequencing error

454_3 error model has reads with average length of 285 bps, with 60 indels per read

Page 73: Family of HMMs

Taxonomic profiling

• Selected 9 different simulated metagenomic model conditions

• Divided datasets into two groups:

• short fragments (<= 100 bps)

• long fragments (>= 100 bps)

• Report RMSE relative to TIPP’s RMSE

Page 74: Family of HMMs

Profiling: Short Fragments

Page 75: Family of HMMs

Profiling: Short Fragments

Note: PhymmBL does not report species level classification

Page 76: Family of HMMs

Profiling: Long Fragments

Note: PhymmBL does not report species level classification

Page 77: Family of HMMs

TIPP Summary

Combines SEPP with statistical support threshold to increase precision with minor reduction in sensitivity

Better sensitivity for classifying novel reads compared to MetaPhyler

Very robust to sequencing errors

Results in overall more accurate profiles (lowest average error in 10 of 12 conditions)

Can be parameterized for precision or sensitivity

Page 78: Family of HMMs

Summary

• fHMM as a statistical model for MSA

• Algorithm for alignment using fHMM

• Computes HMMs on closely related subsets

• Aligns query sequence to fHMM

• fHMM improves sequence alignment to an existing alignment

Page 79: Family of HMMs

Future work

• Use fHMM as a replacement for profile HMM in other domains

– Homology detection

– Functional annotation

• Use different alignment methods within fHMM framework

• TIPP

– Statistical models for combining profiles on different markers

– Expand marker sets to include more genes

Page 80: Family of HMMs

Acknowledgements

Siavash Mirarab, Tandy Warnow, Mihai Pop, Bo Liu

Supported by NSF DEB 0733029, University of Alberta

Page 81: Family of HMMs

1KP P450 transcriptome dataset

Full-length P450 gene ~500 AA

Total sequences before filtering ~225K

Page 82: Family of HMMs

Ultra-large sequence alignment

Most MSA techniques do not scale linearly with the number of sequences

Alignments are needed on very large datasets

Pfam contains families with more than 100,000 sequences

More than 1 million 16S sequences in the Greengenes database

Datasets can contain fragmentary and full-length sequences

Page 83: Family of HMMs

HMMs for MSA

Given a seed alignment (e.g., in Pfam) and a collection of sequences for the protein family:

Represent seed alignment using HMM

Align each additional sequence to the HMM

Use transitivity to obtain MSA

Can we do something like this without a seed alignment?

Page 84: Family of HMMs

UPP: Ultra-large alignment using SEPP

Developers: Nguyen, Mirarab, and Warnow

Input: set of sequences S, backbone size B, and alignment subset size A

Output: MSA on S

Algorithm:

Select B random full-length sequences (backbone set) from S

Estimate backbone alignment and backbone tree on backbone set

Align remaining sequences to backbone alignment

Uses nested hierarchical fHMM

In preparation
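The first step of the algorithm, picking a random backbone of full-length sequences, might look like the sketch below. The slide does not define "full-length"; treating a sequence as full-length when its length is within 25% of the median is an assumption made here, as are the function name, the fixed seed, and the toy sequences.

```python
import random
import statistics


def select_backbone(sequences, backbone_size, tolerance=0.25, seed=0):
    """Pick up to backbone_size random full-length sequences.

    Returns (backbone names, remaining names). A sequence counts as
    full-length if its length is within tolerance * median of the median.
    """
    median = statistics.median(len(s) for s in sequences.values())
    full_length = sorted(
        name for name, s in sequences.items()
        if abs(len(s) - median) <= tolerance * median
    )
    rng = random.Random(seed)
    backbone = rng.sample(full_length, min(backbone_size, len(full_length)))
    chosen = set(backbone)
    remaining = [name for name in sequences if name not in chosen]
    return backbone, remaining


# Hypothetical input: five full-length sequences and three fragments.
sequences = {f"full{i}": "ACGT" * 25 for i in range(5)}
sequences.update({f"frag{i}": "ACGTAC" * 5 for i in range(3)})

backbone, remaining = select_backbone(sequences, backbone_size=3)
print(sorted(backbone))
print(sorted(remaining))
```

The backbone alignment and tree would then be estimated on the selected set with existing tools, and the remaining sequences aligned to the backbone with the fHMM.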

Page 85: Family of HMMs

Disjoint HMMs

[Figure: four disjoint HMMs, labeled HMM 1 through HMM 4]

Page 86: Family of HMMs

Nested HMMs

[Figure: HMM 1]

Page 87: Family of HMMs

Nested HMMs

[Figure: HMM 1, HMM 2, and HMM 3]

Page 88: Family of HMMs

Nested HMMs

[Figure: HMM 1 through HMM 7]

Page 89: Family of HMMs

Experimental Design

Examined both simulated and biological DNA, RNA, and AA datasets

Generated fragmentary datasets from the full-length datasets

Compared Clustal-Omega, MAFFT, MUSCLE, and UPP

ML trees estimated on alignments using FastTree

Scored alignment and tree error

Tree error measured in FN rate or Delta FN rate

Page 90: Family of HMMs

Tree error on simulated RNA datasets

UPP(Fast): Backbone size=100, Alignment size=10

Average full-length sequence size 1500 bps

Only UPP completes on all datasets within 24 hours on a 12 core machine with 24 GB

Page 91: Family of HMMs

Running time on simulated RNA datasets

UPP has close to linear scaling

Page 92: Family of HMMs

Tree Error on fragmentary RNASim 10K dataset

UPP(Default): Backbone size=1000, Alignment size=10

Average fragment length of 500 bps

Average full-length sequence size 1500 bps

Delta FN error: ML(Estimated)-ML(True)

Page 93: Family of HMMs

One Million Sequences: Tree Error

UPP(100,100): 1.6 days using 12 processors

UPP(100,10): 7 days using 12 processors

Note: UPP Decomposition improves accuracy

Page 94: Family of HMMs

UPP summary

Uses nested hierarchical fHMM for sequence alignment

Overall, results in the most accurate alignments (not shown) and trees on full-length simulated datasets

Larger differences on highly divergent datasets

Results in comparable or more accurate alignments and trees on biological datasets (not shown)

Yields most accurate trees on both full-length and mixed datasets

Only method that can complete within 24 hours on datasets with up to 200K sequences, 1M in less than 2 days

Page 95: Family of HMMs

Tree Error

[Figure: true tree and estimated tree on leaves u, v, w, x, and y, with an FN edge marked]

False Negative (FN): an edge in the true tree that is missing from the estimated tree

Delta Error: the difference in FN of the backbone tree+placement and the backbone tree
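Both definitions can be computed from the bipartitions (edges) of each tree. In the sketch below a tree is given simply as a list of its internal edges, each written as the set of leaves on one side; the two five-leaf topologies are hypothetical and are not read off the figure, and the function names are illustrative.

```python
def canonical(side, all_leaves):
    """Represent a bipartition by the side that excludes the smallest leaf."""
    side = frozenset(side)
    anchor = min(all_leaves)
    return all_leaves - side if anchor in side else side


def bipartitions(splits, leaves):
    """Canonical set of bipartitions, so either side may be given as input."""
    all_leaves = frozenset(leaves)
    return {canonical(s, all_leaves) for s in splits}


def false_negatives(true_splits, estimated_splits, leaves):
    """Number of true-tree edges missing from the estimated tree."""
    missing = bipartitions(true_splits, leaves) - bipartitions(estimated_splits, leaves)
    return len(missing)


def delta_error(fn_extended, fn_backbone):
    """FN of the backbone tree plus placement, minus FN of the backbone tree."""
    return fn_extended - fn_backbone


leaves = {"u", "v", "w", "x", "y"}
# Hypothetical true tree: edges {u,v}|{w,x,y} and {x,y}|{u,v,w}.
true_splits = [{"u", "v"}, {"x", "y"}]
# Hypothetical estimated tree: edges {u,v}|{w,x,y} and {w,x}|{u,v,y}.
estimated_splits = [{"w", "x", "y"}, {"w", "x"}]

print(false_negatives(true_splits, estimated_splits, leaves))
print(delta_error(fn_extended=2, fn_backbone=1))
```

Dividing the FN count by the number of internal edges in the true tree gives an FN rate.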

Page 96: Family of HMMs

Tree Error

[Figure: true tree and estimated tree on leaves u, v, w, x, and y, each with z added]

False Negative (FN): an edge in the true tree that is missing from the estimated tree

Delta Error: the difference in FN of the backbone tree+placement and the backbone tree

Page 98: Family of HMMs

Profile HMM

Q:

Page 99: Family of HMMs

Profile HMM

Q:

Page 100: Family of HMMs

Profile HMM

Q:A

Page 101: Family of HMMs

Profile HMM

Q:

Page 102: Family of HMMs

Profile HMM

Q:A

Page 103: Family of HMMs

Profile HMM

Q:At

Page 104: Family of HMMs

Profile HMM

Q:Atc

Page 105: Family of HMMs

Profile HMM

Q:

Page 106: Family of HMMs

Profile HMM

Q:A

Page 107: Family of HMMs

Profile HMM

Q:A-

Page 108: Family of HMMs

Profile HMM

Q:A-tcTCA-tATG
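The final string can be read as a path through the profile HMM. Assuming the usual convention for profile HMM alignments (uppercase for a residue emitted by a match state, lowercase for a residue emitted by an insert state, and '-' for a delete state), which the slides do not spell out, a short sketch can tally the states and recover the unaligned query. The function names are illustrative.

```python
def summarize_path(aligned_query):
    """Count match, insert, and delete states in an aligned query string."""
    counts = {"match": 0, "insert": 0, "delete": 0}
    for ch in aligned_query:
        if ch == "-":
            counts["delete"] += 1
        elif ch.islower():
            counts["insert"] += 1
        else:
            counts["match"] += 1
    return counts


def ungapped(aligned_query):
    """Recover the original query sequence from its aligned form."""
    return aligned_query.replace("-", "").upper()


path = "A-tcTCA-tATG"
print(summarize_path(path))
print(ungapped(path))
```

Under this reading, the number of model positions the path visits is the match count plus the delete count.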

Page 109: Family of HMMs

Metagenomic data analysis

NGS data produce fragmentary sequence data

Metagenomic analyses include unknown species

Taxon identification: given short sequences, identify the species for each fragment

Applications: Human Microbiome and other metagenomic projects

Issues: accuracy and speed

Page 110: Family of HMMs

Contributions

MRL and SuperFine+MRL: new supertree methods. Nguyen, Mirarab, and Warnow. AMB 2012

SEPP: SATé-Enabled phylogenetic placement. Mirarab, Nguyen, and Warnow. PSB 2012

TIPP: Taxonomic identification and phylogenetic profiling. Nguyen, Mirarab, Pop, and Warnow. Under review.

UPP: Ultra-large alignment using SEPP. Nguyen, Mirarab, Kumar, Guo, Wang, and Warnow. In preparation.

Comparison of different methods for masking alignments. Nguyen, Linder, and Warnow. In preparation.

Page 112: Family of HMMs

Tree Error

[Figure: true tree and estimated tree on leaves u, v, w, x, and y, with an FN edge marked]

False Negative (FN): an edge in the true tree that is missing from the estimated tree

Page 113: Family of HMMs

Placement Error

[Figure: true tree and estimated tree on leaves u, v, w, x, and y, each with query Q1 placed]

Delta Error: the difference in FN of the extended tree and the backbone tree

Page 115: Family of HMMs

Placement Error

[Figure: true tree and estimated tree on leaves u, v, w, x, and y, each with query Q1 placed]

Delta Error: the difference in FN of the extended tree and the backbone tree

Delta Error = 2 - 1 = 1

Page 116: Family of HMMs

Alignment using fHMM

Align query to best scoring HMM

Insert query sequence into backbone alignment using transitivity