Phylogenomics Symposium and Software School Tandy Warnow Departments of Computer Science and Bioengineering The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu Supported by NSF grant DBI-1461364

Dec 19, 2015

Page 1: Phylogenomics Symposium and Software School Tandy Warnow Departments of Computer Science and Bioengineering The University of Illinois at Urbana-Champaign.

Phylogenomics Symposium and Software School

Tandy Warnow

Departments of Computer Science and Bioengineering

The University of Illinois at Urbana-Champaign

http://tandy.cs.illinois.edu

Supported by NSF grant DBI-1461364

Page 2

Species Tree

[Figure: tree on Orangutan, Gorilla, Chimpanzee, Human]

From the Tree of Life Website, University of Arizona

Page 3

Phylogenomics

Phylogenomics = genome-scale phylogeny estimation

Note: Jonathan Eisen coined this term, but used it to mean something else.

Page 4

Phylogenomic pipeline

• Select taxon set and markers

• Gather and screen sequence data, possibly identify orthologs

• Compute multiple sequence alignments for each locus

• Compute species tree or network:

– Compute gene trees on the alignments and combine the estimated gene trees, OR

– Estimate a tree from a single site from each locus, OR

– Estimate a tree from a concatenation of the multiple sequence alignments

• Get statistical support on each branch (e.g., bootstrapping)

• Estimate dates on the nodes of the phylogeny

• Use species tree with branch support and dates to understand biology

Page 6

Gene trees inside the species tree (Coalescent Process)

Present

Past

Courtesy James Degnan

Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.

Page 7

Incomplete Lineage Sorting (ILS)

• Two (or more) lineages fail to coalesce in the first ancestral population

• Chance of ILS depends on

– Time: shorter branches make ILS likelier

– Population size: larger populations (wider branches) increase ILS

• The most likely gene tree is not necessarily the same as the species tree [Degnan, Rosenberg, 2006, PLoS Genetics]

JH Degnan, NA Rosenberg, Trends in Ecology & Evolution, 2009
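
To make the two dependencies concrete, here is a minimal sketch for the simplest case, a rooted three-species tree ((A,B),C). It relies on the standard multispecies coalescent result that the A and B lineages fail to coalesce on an internal branch of length T (in coalescent units) with probability exp(-T). That formula is not on the slide, and the function names and example numbers are illustrative assumptions. With three taxa the matching topology is always at least as likely as the others; the anomaly noted in the last bullet needs larger trees.

```python
import math


def coalescent_units(generations, effective_pop_size):
    """Convert a branch length in generations to coalescent units (diploid: t / 2N)."""
    return generations / (2.0 * effective_pop_size)


def three_taxon_gene_tree_probs(t):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    t is the internal branch length in coalescent units (assumed parameterization).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # Probability that the A and B lineages fail to coalesce on the internal branch.
    p_ils = math.exp(-t)
    # After ILS, the three topologies are equally likely in the root population.
    return {
        "((A,B),C)": 1.0 - (2.0 / 3.0) * p_ils,
        "((A,C),B)": p_ils / 3.0,
        "((B,C),A)": p_ils / 3.0,
    }


if __name__ == "__main__":
    # Hypothetical scenarios: (generations on the internal branch, population size).
    for gens, n_e in [(100_000, 10_000), (10_000, 10_000), (10_000, 100_000)]:
        t = coalescent_units(gens, n_e)
        probs = three_taxon_gene_tree_probs(t)
        print(f"T = {t:.2f}:", {k: round(v, 3) for k, v in probs.items()})
```

Shortening the branch or enlarging the population both reduce T, which pushes the three probabilities toward 1/3 each.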

Page 8

New Coalescent-based Methods

• New summary methods:

– ASTRAL, BUCKy-pop, NJst, Phylonet, STAR, STEM, etc.

• “Binning” genes:

– Naïve binning, Statistical Binning, Weighted Statistical Binning

• Site-based methods:

– SNAPP, SVDquartets

• Co-estimation methods:

– BEST, *BEAST, BBCA

Page 10

Phylogenetic Network

From L. Nakhleh, “Evolutionary Phylogenetic Networks: Models and Issues”, in Problem Solving Handbook in Computational Biology and Bioinformatics, Springer.

Page 11

Metagenomic Taxon Identification

Objective: classify short reads in a metagenomic sample

Page 12

The Tree of Life: Multiple Challenges

Scientific challenges:

• Ultra-large multiple-sequence alignment

• Alignment-free phylogeny estimation

• Supertree estimation

• Estimating species trees from many gene trees

• Genome rearrangement phylogeny

• Reticulate evolution

• Visualization of large trees and alignments

• Data mining techniques to explore multiple optima

• Theoretical guarantees under Markov models of evolution

Techniques: machine learning, applied probability theory, graph theory, combinatorial optimization, supercomputing, and heuristics

Page 13

Schedule

http://tandy.cs.illinois.edu/symposium-2015-schedule.html

Monday

• 1:00-1:10 Introductions

• 1:10-1:50 Tandy Warnow (Multiple sequence alignment)

• 1:50-2:30 Laura Kubatko (Phylogenomics: SVDquartets)

• 2:30-3:10 Siavash Mirarab (Phylogenomics: ASTRAL)

• 3:10-3:25 Break

• 3:25-4:05 Nam Nguyen (TIPP: metagenomic data analysis)

• 4:05-4:45 Luay Nakhleh (Phylogenetic Networks)

• 4:45-5:15 Downloading and installing software

Page 14

Software School Schedule

http://tandy.cs.illinois.edu/symposium-2015-schedule.html

May 19, 2015: Software School (USB 2244 and USB 1250)

9-10 AM:

• ASTRAL in USB 1250 (taught by Siavash Mirarab)

• TIPP in USB 2244 (taught by Nam-Phuong Nguyen)

10-11 AM:

• UPP in USB 2244 (taught by Nam-Phuong Nguyen)

• Phylonet in USB 1250 (taught by Yun Yu and Luay Nakhleh)

11 AM - 12 noon:

• PASTA in USB 1250 (taught by Siavash Mirarab)

12 noon - 1:30 PM: lunch break

1:30 PM - 2:30 PM:

• SVDquartets in USB 1250 (taught by Laura Kubatko and Dave Swofford)

Page 15

PASTA and UPP: two new methods for multiple sequence alignment

Tandy Warnow

Departments of Computer Science and Bioengineering

The University of Illinois at Urbana-Champaign

http://tandy.cs.illinois.edu

Page 16

Phylogeny (evolutionary tree)

[Figure: tree on Orangutan, Gorilla, Chimpanzee, Human]

From the Tree of Life Website, University of Arizona

Page 17

DNA Sequence Evolution

[Figure: the sequence AAGACTT evolves down a tree, accumulating substitutions. Time axis: -3 mil yrs, -2 mil yrs, -1 mil yrs, today. Intermediate sequences include TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT; sequences at the leaves: TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, AGGGCAT.]

Page 18

Phylogeny Problem

[Figure: five sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) labeled U, V, W, X, Y, and a tree on the leaves U, V, W, X, Y]

Page 19

The “real” problem

[Figure: five sequences of different lengths (AGAT, TAGACTT, TGCACAA, TGCGCTT, AGGGCATGA) labeled U, V, W, X, Y, and a tree on the leaves U, V, W, X, Y]

Page 20

Indels (insertions and deletions)

…ACGGTGCAGTTACCA…

[Figure labels: Mutation, Deletion]

…ACCAGTCACCA…

Page 21

…ACGGTGCAGTTACCA…

[Figure labels: Substitution, Deletion, Insertion]

…ACCAGTCACCTA…

• The true multiple alignment

– Reflects historical substitution, insertion, and deletion events

– Defined using transitive closure of pairwise alignments computed on edges of the true tree

…ACGGTGCAGTTACC-A…

…AC----CAGTCACCTA…

Page 22

Input: unaligned sequences

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Page 23

Phase 1: Alignment

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

Page 24

Phase 2: Construct tree

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

[Figure: tree on leaves S1, S4, S2, S3]

Page 25

First Align, then Compute the Tree

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

[Figure: tree on leaves S1, S4, S2, S3]
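
Once the alignment is fixed, the tree-building phase sees only its columns. As a rough illustration of what a distance-based second phase might start from, the sketch below computes pairwise p-distances (the fraction of mismatches among columns where neither sequence has a gap) on the four-sequence alignment from the slide. Ignoring gapped columns and the names `ALIGNMENT` and `p_distance` are assumptions made for this example; the methods named in the talk use likelihood models rather than raw p-distances.

```python
from itertools import combinations

# The alignment shown on the slide.
ALIGNMENT = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}


def p_distance(row_a, row_b):
    """Fraction of mismatching sites among columns where both rows have a residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either row (an assumed convention)
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else float("nan")


if __name__ == "__main__":
    for a, b in combinations(sorted(ALIGNMENT), 2):
        print(a, b, round(p_distance(ALIGNMENT[a], ALIGNMENT[b]), 3))
```

Because the distances are read off the alignment, any alignment error feeds directly into the tree estimate, which is the weakness of the two-phase approach.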

Page 26

Multiple Sequence Alignment (MSA): another grand challenge [1]

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA

Novel techniques needed for scalability and accuracy

• NP-hard problems and large datasets

• Current methods do not provide good accuracy

• Few methods can analyze even moderately large datasets

• Many important applications besides phylogenetic estimation

[1] Frontiers in Massive Data Analysis, National Academies Press, 2013

Page 27

This talk

• Simulation studies comparing two-phase methods

• SATé (Science 2009, Systematic Biology 2012)

• PASTA (RECOMB 2014 and JCB 2014), co-estimation of alignments and trees (improved version of SATé)

• UPP (RECOMB 2015 and Genome Biology, in press), designed for datasets with fragmentary sequences, or with structural alignments

PASTA and UPP can analyze datasets with 1,000,000 sequences, and are highly accurate

All methods are available in open-source form, on github.

Tutorials tomorrow.

Page 28

Simulation Studies

[Figure: a true tree and alignment, the unaligned sequences derived from them, and an estimated tree and alignment, with a "Compare" step between truth and estimate. Recoverable panel content follows.]

Unaligned Sequences

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Alignment, first panel (tree on leaves S1, S2, S3, S4)

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

Alignment, second panel (tree on leaves S1, S4, S3, S2)

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-C--T-----GACCGC--
S4 = T---C-A-CGACCGA----CA

Page 29

Quantifying Error

FN: false negative (missing edge)

FP: false positive (incorrect edge)

[Figure: true and estimated trees with one FN edge and one FP edge marked; 50% error rate]
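
The FN and FP rates can be computed by comparing the bipartitions (splits of the leaf set) induced by the internal edges of each tree. The sketch below assumes trees written as nested Python tuples of leaf names and compared as unrooted trees. The example trees are made up so that one of two internal edges differs, reproducing a 50% rate like the one on the slide. All names here are illustrative.

```python
def bipartitions(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of leaf names.

    Each bipartition is stored as the side that does NOT contain the
    alphabetically smallest leaf, so the two trees are compared as unrooted.
    """
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    visit(tree)
    return splits


def fn_fp_rates(true_tree, estimated_tree):
    """Return (FN rate, FP rate) between two trees on the same leaf set."""
    true_splits = bipartitions(true_tree)
    est_splits = bipartitions(estimated_tree)
    fn = len(true_splits - est_splits) / len(true_splits) if true_splits else 0.0
    fp = len(est_splits - true_splits) / len(est_splits) if est_splits else 0.0
    return fn, fp


if __name__ == "__main__":
    true_tree = (("A", "B"), "C", ("D", "E"))
    estimated_tree = (("A", "C"), "B", ("D", "E"))
    print(fn_fp_rates(true_tree, estimated_tree))
```

When both trees are fully resolved they have the same number of internal edges, so the FN and FP rates coincide.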

Page 30

Two-phase estimation

Alignment methods

• Clustal

• POY (and POY*)

• Probcons (and Probtree)

• Probalign

• MAFFT

• Muscle

• Di-align

• T-Coffee

• Prank (PNAS 2005, Science 2008)

• Opal (ISMB and Bioinf. 2007)

• FSA (PLoS Comp. Bio. 2009)

• Infernal (Bioinf. 2009)

• Etc.

Phylogeny methods

• Bayesian MCMC

• Maximum parsimony

• Maximum likelihood

• Neighbor joining

• FastME

• UPGMA

• Quartet puzzling

• Etc.

RAxML: heuristic for large-scale ML optimization

Page 31

1000-taxon models, ordered by difficulty (Liu et al., 2009)

Page 32

Large-scale Alignment Estimation

• Many genes are considered unalignable due to high rates of evolution

• Only a few methods can analyze large datasets

• iPlant (NSF Plant Biology Collaborative) and other projects planning to construct phylogenies with 500,000 taxa

Page 33

1kp: Thousand Transcriptome Project

G. Ka-Shu Wong (U Alberta), N. Wickett (Northwestern), J. Leebens-Mack (U Georgia), N. Matasci (iPlant), T. Warnow (UIUC), S. Mirarab (UT-Austin), N. Nguyen (UT-Austin), Md. S. Bayzid (UT-Austin)

Plus many many other people…

First study (Wickett, Mirarab, et al., PNAS 2014) had ~100 species and ~800 genes, gene trees and alignments estimated using SATé, and a coalescent-based species tree estimated using ASTRAL

Second study: Plant Tree of Life based on transcriptomes of ~1200 species, and more than 13,000 gene families (most not single copy)

Gene Tree Incongruence

Upcoming Challenges:

• Species tree estimation from conflicting gene trees

• Alignment of datasets with > 100,000 sequences

Page 34

Re-aligning on a tree

[Figure: a tree whose leaves are divided into subsets A, B, C, D, with a cycle of four steps]

• Decompose dataset (into subsets A, B, C, D)

• Align subproblems

• Merge sub-alignments (into ABCD)

• Estimate ML tree on merged alignment

Page 35

SATé and PASTA Algorithms

[Figure: loop that passes a Tree to the alignment step and an Alignment to the tree step]

• Obtain initial alignment and estimated ML tree

• Use tree to compute new alignment

• Estimate ML tree on new alignment

• If new alignment/tree pair has worse ML score, realign using a different decomposition

• Repeat until termination condition (typically, 24 hours)
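
The control flow of the loop above can be sketched as a small driver. This shows only the iteration logic as described on the slide, not the SATé or PASTA code: the five callables, the `decompositions` list, and the iteration and time limits are all hypothetical stand-ins that a caller would have to supply (for example, wrappers around an aligner and an ML tree estimator).

```python
import time


def coestimate(sequences, initial_align, estimate_tree, realign_on_tree, score,
               decompositions, max_iterations=10, time_limit_s=60.0):
    """Iterate between alignment and tree estimation, keeping the best-scoring pair.

    All callables are caller-supplied placeholders (assumptions for this sketch):
      initial_align(sequences) -> alignment
      estimate_tree(alignment) -> tree
      realign_on_tree(sequences, tree, decomposition) -> alignment
      score(alignment, tree) -> number, higher is better (e.g. ML score)
    """
    alignment = initial_align(sequences)
    tree = estimate_tree(alignment)
    best_score = score(alignment, tree)
    strategy = 0  # index of the decomposition currently in use
    start = time.monotonic()

    for _ in range(max_iterations):
        if time.monotonic() - start > time_limit_s:
            break  # termination condition (the slide mentions 24 hours)
        new_alignment = realign_on_tree(sequences, tree, decompositions[strategy])
        new_tree = estimate_tree(new_alignment)
        new_score = score(new_alignment, new_tree)
        if new_score > best_score:
            alignment, tree, best_score = new_alignment, new_tree, new_score
        else:
            # Worse (or equal) score: try a different decomposition next time.
            strategy = (strategy + 1) % len(decompositions)

    return best_score, alignment, tree
```

Passing the steps in as functions keeps the sketch independent of any particular aligner or tree-estimation program.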

Page 36

SATé-1 (Science 2009) performance

1000-taxon models, ordered by difficulty

SATé-1 24 hour analysis, on desktop machines

(Similar improvements for biological datasets)

SATé-1 can analyze up to about 30,000 sequences.

Page 37

1000 taxon models ranked by difficulty

SATé-1 and SATé-2 (Systematic Biology, 2012)

Page 38

PASTA (2014): even better than SATé-2

PASTA vs. SATé-2

(a) Faster

(b) Can analyze larger datasets (up to 1,000,000 sequences; SATé-2 can analyze 50,000 sequences)

(c) More accurate!

Page 39

1kp: Thousand Transcriptome Project

G. Ka-Shu Wong (U Alberta), N. Wickett (Northwestern), J. Leebens-Mack (U Georgia), N. Matasci (iPlant), T. Warnow (UIUC), S. Mirarab (UT-Austin), N. Nguyen (UT-Austin), Md. S. Bayzid (UT-Austin)

Plus many many other people…

Plant Tree of Life based on transcriptomes of ~1200 species

More than 13,000 gene families (most not single copy)

Gene Tree Incongruence

Challenge: Alignment of datasets with > 100,000 sequences

Page 41

1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary

All standard multiple sequence alignment methods we tested performed poorly on datasets with fragments.

Page 42

1kp: Thousand Transcriptome Project

G. Ka-Shu Wong (U Alberta), N. Wickett (Northwestern), J. Leebens-Mack (U Georgia), N. Matasci (iPlant), T. Warnow (UIUC), S. Mirarab (UT-Austin), N. Nguyen (UT-Austin), Md. S. Bayzid (UT-Austin)

Plus many many other people…

Plant Tree of Life based on transcriptomes of ~1200 species

More than 13,000 gene families (most not single copy)

Gene Tree Incongruence

Challenge: Alignment of datasets with > 100,000 sequences with many fragmentary sequences

Page 43

UPP: large-scale MSA estimation

UPP = “Ultra-large multiple sequence alignment using Phylogeny-aware Profiles”

Nguyen, Mirarab, and Warnow. In press, Genome Biology 2015.

Available in open source form on github

Page 44

Profile HMMs

Page 45

A simple idea

• Select random subset of sequences, and build “backbone alignment”

• Construct a profile Hidden Markov Model (HMM) to represent the backbone alignment

• Add all remaining sequences to the backbone alignment using the HMM

Fast!

But is it accurate?

Page 47

Simple technique:

1) Build one HMM for the entire alignment

2) Align fragment to the HMM, and insert into alignment

Page 48

One Hidden Markov Model for the entire alignment?

Page 49

Or 2 HMMs?

Page 50

Or 4 HMMs?

Page 51

Or 7 HMMs?

Page 52

UPP Algorithmic Approach

• Select random subset of sequences, and build “backbone alignment”

• Construct an “Ensemble of Hidden Markov Models” on the backbone alignment (the family has HMMs on overlapping subsets of different sizes)

• Add all remaining sequences to the backbone alignment using the Ensemble of HMMs
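
To illustrate what "subsets of different sizes" can look like, the sketch below builds a hierarchy of nested subsets by repeatedly halving a list of backbone sequence names and keeping every subset along the way. One HMM would then be trained per subset. Halving a list is only a stand-in assumption for this example (the slides do not spell out the decomposition rule), and `ensemble_subsets` and `min_size` are made-up names. With eight sequences and a stopping size of two, the hierarchy has 1 + 2 + 4 = 7 subsets, the same count the preceding slides arrive at.

```python
from collections import deque


def ensemble_subsets(backbone, min_size=2):
    """Return every subset produced by recursively halving `backbone`.

    Subsets are listed level by level (largest first); a subset is split
    further only while it has more than `min_size` members.
    """
    subsets = []
    queue = deque([list(backbone)])
    while queue:
        current = queue.popleft()
        subsets.append(current)
        if len(current) > min_size:
            mid = len(current) // 2
            queue.append(current[:mid])
            queue.append(current[mid:])
    return subsets


if __name__ == "__main__":
    names = [f"seq{i}" for i in range(1, 9)]
    for subset in ensemble_subsets(names, min_size=2):
        print(len(subset), subset)
```

Each remaining sequence would then be scored against every HMM in the ensemble and aligned with the one that fits it best, which is how the smaller, more specific subsets can help under high rates of evolution.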

Page 53

RNASim: alignment error

Note: Mafft was run under default settings for 10K and 50K sequences and under Parttree for 100K sequences, and fails to complete under any setting for 200K sequences. Clustal-Omega only completes on the 10K dataset.

All methods given 24 hrs on a 12-core machine

Page 54

RNASim: tree error

Note: Mafft was run under default settings for 10K and 50K sequences and under Parttree for 100K sequences, and fails to complete under any setting for 200K sequences. Clustal-Omega only completes on the 10K dataset.

All methods given 24 hrs on a 12-core machine

Page 55

RNASim Million Sequences: alignment error

Notes:

• We show alignment error using the average of SP-FN and SP-FP. UPP variants have better scores than PASTA.

• But for the Total Column (TC) scores, PASTA is better than UPP: it recovered 10% of the columns compared to less than 0.04% for UPP variants.

• PASTA under-aligns: its alignment is 43 times wider than the true alignment (~900 Gb of disk space). UPP alignments were closer in length to the true alignment (0.93 to 1.38 times as wide).
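
The three scores can be illustrated on a toy case. The sketch below treats SP-FN as the fraction of residue pairs aligned in the reference but not in the estimate, SP-FP as the fraction of pairs aligned in the estimate but not in the reference, and TC as the fraction of reference columns with at least two residues that reappear unchanged in the estimate. These are common definitions, stated here as assumptions rather than taken from the slides. The two small alignments are the ones from the Simulation Studies slide, with the first taken as the reference, which is also an assumption.

```python
from itertools import combinations


def columns(alignment):
    """Columns of an alignment as frozensets of (name, residue index), gaps dropped."""
    names = sorted(alignment)
    width = len(alignment[names[0]])
    if any(len(alignment[n]) != width for n in names):
        raise ValueError("all rows must have the same length")
    seen = {n: 0 for n in names}  # residues consumed so far in each row
    cols = []
    for j in range(width):
        col = set()
        for n in names:
            if alignment[n][j] != "-":
                col.add((n, seen[n]))
                seen[n] += 1
        cols.append(frozenset(col))
    return cols


def aligned_pairs(cols):
    """All unordered pairs of residues that share a column."""
    pairs = set()
    for col in cols:
        pairs.update(frozenset(p) for p in combinations(sorted(col), 2))
    return pairs


def alignment_error(reference, estimate):
    """Return (SP-FN, SP-FP, TC) for an estimate against a reference alignment."""
    ref_cols, est_cols = columns(reference), columns(estimate)
    ref_pairs, est_pairs = aligned_pairs(ref_cols), aligned_pairs(est_cols)
    sp_fn = len(ref_pairs - est_pairs) / len(ref_pairs) if ref_pairs else 0.0
    sp_fp = len(est_pairs - ref_pairs) / len(est_pairs) if est_pairs else 0.0
    scored = {c for c in ref_cols if len(c) >= 2}
    tc = len(scored & set(est_cols)) / len(scored) if scored else 0.0
    return sp_fn, sp_fp, tc


REFERENCE = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}
ESTIMATE = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-C--T-----GACCGC--",
    "S4": "T---C-A-CGACCGA----CA",
}

if __name__ == "__main__":
    sp_fn, sp_fp, tc = alignment_error(REFERENCE, ESTIMATE)
    print(f"SP-FN={sp_fn:.3f}  SP-FP={sp_fp:.3f}  TC={tc:.3f}")
```

The pairwise scores and TC can disagree, as in the notes above: an alignment that spreads residues over many extra columns makes few false pairings yet may still leave most reference columns unrecovered, and the reverse can also happen.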

Page 56

RNASim Million Sequences: tree error

Notes:

• UPP(Fast,NoDecomp) took 2.2 days,

• UPP(Fast) took 11.9 days, and

• PASTA took 10.3 days (all using 12 processors).

Page 57

UPP vs. PASTA: impact of fragmentation

Performance on fragmentary datasets of the 1000M2 model condition

Under high rates of evolution, PASTA is badly impacted by fragmentary sequences (the same is true for other methods).

Under low rates of evolution, PASTA can still be highly accurate (data not shown).

UPP continues to have good accuracy even on datasets with many fragments under all rates of evolution.

Page 58

Running Time

Wall-clock time used (in hours) given 12 processors

Page 59

Summary

• SATé-1 (Science 2009), SATé-2 (Systematic Biology 2012), and PASTA (RECOMB 2014): methods for co-estimating gene trees and multiple sequence alignments.

PASTA can analyze up to 1,000,000 sequences, and is highly accurate for full-length sequences. But none of these methods are robust to fragmentary sequences.

PASTA TUTORIAL TOMORROW

• HMM Ensemble technique: uses a collection of HMMs to represent a “backbone alignment”. HMM ensembles improve accuracy, especially in the presence of high rates of evolution.

• Applications of HMM Ensembles in:

– UPP (ultra-large multiple sequence alignment), Genome Biology (in press) – TUTORIAL TOMORROW

– SEPP (phylogenetic placement), PSB 2012 (not shown)

– TIPP (metagenomic taxon identification and abundance profiling), Bioinformatics 2014 (not shown) – TUTORIAL TOMORROW

Page 60

Acknowledgments

PhD students: Nam Nguyen* and Siavash Mirarab**

Undergrad: Keerthana Kumar

Lab Website: http://www.cs.utexas.edu/users/phylo

Personal Website: http://tandy.cs.illinois.edu

Write to me: [email protected] (I am recruiting students!)

Funding: Guggenheim Foundation, NSF, Microsoft Research New England, David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), and the University of Alberta (Canada)

TACC and UTCS computational resources

* Now a postdoc with Becky Stumpf and Bryan White (ICB, UIUC)

** Supported by HHMI Predoctoral Fellowship

Page 61

Tutorial Downloads

• PASTA: https://github.com/smirarab/pasta/blob/master/pasta-doc/pasta-tutorial.md

• UPP: https://github.com/smirarab/sepp/blob/master/tutorial/upp-tutorial.md

• ASTRAL: https://github.com/smirarab/ASTRAL/blob/master/astral-tutorial.md

• Phylonet: https://wiki.rice.edu/confluence/pages/viewpage.action?pageId=8898533

• TIPP: https://github.com/smirarab/sepp/blob/master/tutorial/tipp-tutorial.md

• SVDquartets: will be run within PAUP*, http://people.sc.fsu.edu/~dswofford/paup_test/

Please go to these sites to obtain the software, and to get instructions on how to download, install, and run the software. (Do this today, in advance of the software school tomorrow.)
