Phylogenetic Tree Reconstruction
Modified from material by Dr. Chun-Chieh Shih, Institute of Information Science, Academia Sinica
Taken From: http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
Concept of evolutionary trees
Number of Trees
Types of data used in phylogenetic inference
Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.
Distance-based methods: Transform the sequence data into pairwise distances, and use the matrix during tree building.
Data set collection
Multiple sequence alignment
Tree construction
– Character-based (optimality criteria): Parsimony, Maximum likelihood
– Distance-based: UPGMA, NJ, Fitch-Margoliash, KITSCH
Test reliability of the tree by analytical and/or resampling procedure
Distance Methods
Calculate changes between each pair in a group of sequences (the first step in producing a multiple sequence alignment)
Find the closest neighbors among the group of sequences
Identify the tree that correctly positions neighbors and that also has branch lengths that reproduce the original data as closely as possible
Distance Methods - Example
distances between sequences
distance table
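The step from aligned sequences to a distance table can be sketched as follows. This is a minimal illustration, assuming the simplest possible measure (the p-distance, i.e. the fraction of differing sites) and no correction for multiple substitutions; the function names and the three short sequences are hypothetical and not taken from the slides.

```python
def p_distance(a, b):
    """Fraction of aligned positions at which sequences a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_table(seqs):
    """Return {(name_i, name_j): distance} for every ordered pair of sequences."""
    return {(i, j): p_distance(seqs[i], seqs[j]) for i in seqs for j in seqs}


# Hypothetical aligned sequences, used only to illustrate the table layout.
example = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGA"}
table = distance_table(example)
for i in example:
    print(i, [table[(i, j)] for j in example])
```

A distance-based tree-building method would take a table of this kind as its only input.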
FITCH: estimates a phylogenetic tree assuming additivity of branch lengths, using the Fitch-Margoliash method
KITSCH: same as FITCH, but under the assumption of a molecular clock
NEIGHBOR: estimates phylogenies using either:
– Neighbor-joining (no molecular clock assumed)
– Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
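To make the UPGMA option concrete, here is a small sketch of the clustering loop. It is not the NEIGHBOR program's code: it returns only the topology as a Newick string, ignores branch lengths, and breaks ties by taking the first closest pair. The function name, the input layout (a dict keyed by `frozenset` pairs), and the four-taxon distances are all assumptions made for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA and return a Newick string (topology only).

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> distance between labels a and b
    """
    # Each active cluster maps its Newick label to the number of leaves it holds.
    sizes = {name: 1 for name in names}
    d = dict(dist)
    while len(sizes) > 1:
        # Pick the closest pair of active clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = f"({a},{b})"
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance to the merged cluster is the size-weighted mean of the old distances.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes)) + ";"


# Hypothetical distances: A and B are close, C and D are close.
pairs = {("A", "B"): 2, ("C", "D"): 4, ("A", "C"): 8,
         ("A", "D"): 8, ("B", "C"): 8, ("B", "D"): 8}
dist = {frozenset(k): v for k, v in pairs.items()}
print(upgma(["A", "B", "C", "D"], dist))
```

Because UPGMA averages distances as if all lineages evolved at the same rate, it corresponds to the molecular-clock assumption, which neighbor-joining does not make.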
Where maximum parsimony fails
Parsimony can give misleading information when rates of sequence change vary among the branches of the tree represented by the sequence data
Real tree: 2 long branches in which G has turned to A independently, possibly with some intermediate steps.
Parsimony analysis implicitly treats rates of change as equal along all branches, so in this case the tree it predicts will not be correct.
Standard problem: Maximum Parsimony (Hamming distance Steiner Tree)
Input: Set S of n aligned sequences of length k
Output: A phylogenetic tree T
– leaf-labeled by sequences in S
– additional sequences of length k labeling the internal nodes of T
such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized, where H(i,j) is the Hamming distance between the sequences labeling nodes i and j.
Maximum Parsimony - Example
Maximum parsimony (example)
Input: Four sequences
– ACT
– ACA
– GTT
– GTA
Question: which of the three trees has the best MP score?
All possible unrooted trees
– Tree 1: ((ACT, GTT), (ACA, GTA))
– Tree 2: ((ACA, GTT), (ACT, GTA))
– Tree 3: ((ACT, ACA), (GTT, GTA))
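Possible substitutions
– Tree 1: ((ACT, GTT), (ACA, GTA)), internal nodes labeled GTT and GTA; substitutions per edge: ACT–GTT 2, GTT–GTA 1, ACA–GTA 2; MP score = 5
– Tree 2: ((ACA, GTT), (ACT, GTA)), internal nodes labeled ACA and ACT; substitutions per edge: GTT–ACA 3, ACA–ACT 1, GTA–ACT 3; this labeling costs 7 (an optimal labeling, e.g. both internal nodes ACT, costs 6)
– Tree 3: ((ACT, ACA), (GTT, GTA)), internal nodes labeled ACA and GTA; substitutions per edge: ACT–ACA 1, ACA–GTA 2, GTT–GTA 1; MP score = 4
Optimal MP tree: Tree 3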
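The scores in this example can be checked with a short sketch of Fitch's small-parsimony algorithm, which computes the minimum number of substitutions for a fixed topology one alignment column at a time. Fitch's algorithm is not named on the slides; it is used here as one standard way to obtain an optimal labeling. Each unrooted tree is rooted arbitrarily on its internal edge, which does not change the score, and the function names are illustrative. For the topology ((ACA, GTT), (ACT, GTA)) the labeling drawn on the slide sums to 7, while the optimal labeling found this way costs 6.

```python
def fitch_site(tree, states):
    """Return (state_set, cost) for one alignment column on a rooted binary tree.

    tree:   a leaf name (str) or a pair (left, right) of subtrees
    states: dict mapping leaf name -> character observed at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra substitution needed here.
        return common, left_cost + right_cost
    # The children must differ: count one substitution.
    return left_set | right_set, left_cost + right_cost + 1


def mp_score(tree, seqs):
    """Minimum number of substitutions for the given topology, summed over all sites."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(length)
    )


# Leaves are named by their own sequences, as in the example.
seqs = {s: s for s in ("ACT", "ACA", "GTT", "GTA")}
trees = {
    "((ACT,GTT),(ACA,GTA))": (("ACT", "GTT"), ("ACA", "GTA")),
    "((ACA,GTT),(ACT,GTA))": (("ACA", "GTT"), ("ACT", "GTA")),
    "((ACT,ACA),(GTT,GTA))": (("ACT", "ACA"), ("GTT", "GTA")),
}
for name, tree in trees.items():
    print(name, mp_score(tree, seqs))  # prints 5, 6 and 4 respectively
```

Scoring one fixed tree is cheap; the expensive part of maximum parsimony is searching over all possible topologies.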
Maximum Parsimony: computational complexity
Finding the optimal MP tree is NP-hard
For a fixed tree, an optimal labeling of the internal nodes can be computed in linear time O(nk)
(Figure: the optimal tree from the example, ((ACT, ACA), (GTT, GTA)), with internal nodes ACA and GTA; MP score = 4)
Maximum likelihood approach
Method uses probability calculations to find a tree that best accounts for the variation in a set of sequences
Similar to maximum parsimony method in that analysis is performed on each column of a multiple sequence alignment
Start with an evolutionary model of sequence change that provides estimates of rates of substitution of one base for another (transitions and transversions).
Maximum likelihood approach
Statistical method - powerful and flexible, also computationally complex
Given a particular tree and a model of the evolutionary change, calculate the likelihood of the tree based on data, i.e. the given multiple sequence alignment
Likelihood (tree | data) proportional to Probability (data | tree)
Maximum likelihood approach
Tree with branches of lengths v_k
Probability of character change P_AC(t) for A → C in time t
Don’t know character states inside tree (in the past) so calculate for all possibilities, e.g. A, C, G, T
Maximum likelihood does best in simulation but is also slowest method
Variety of new heuristics to find ML tree faster
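The calculation outlined on these slides can be written out as a formula. This is a sketch under the usual assumption, not stated on the slide, that alignment columns evolve independently. The symbols introduced here for illustration are: D_s for column s of an alignment D with n columns, π for the base frequencies at the root, and x for an assignment of one base to every node of T that matches the observed bases at the leaves. Each edge (a, b) runs from parent a to child b and has branch length v_ab.

```latex
\begin{align*}
L(T, v \mid D) &\propto P(D \mid T, v) = \prod_{s=1}^{n} P(D_s \mid T, v), \\
P(D_s \mid T, v) &= \sum_{x} \pi_{x_{\mathrm{root}}} \prod_{(a,b) \in E(T)} P_{x_a x_b}(v_{ab}).
\end{align*}
```

The sum over x is the "calculate for all possibilities" step: every combination of A, C, G, T at the unobserved internal nodes contributes to the probability of a column.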
Maximum Likelihood (ML)
Given: stochastic model of sequence evolution (e.g. Jukes-Cantor) and a set S of sequences
Objective: Find tree T and probabilities p(e) of substitution on each edge, to maximize the probability of the data.
Preferred by some systematists, but even harder than MP in practice.
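As an illustration of "the probability of the data" for one fixed tree, the sketch below evaluates the likelihood under the Jukes-Cantor model by summing over all unobserved states at internal nodes, one column at a time. This is only the scoring step: a real ML program would also search over topologies and optimize branch lengths, which is not attempted here. The tree encoding, the function names, and the branch lengths are assumptions made for the example.

```python
import math

BASES = "ACGT"


def jc69(i, j, t):
    """Jukes-Cantor probability that base i is replaced by base j along a branch
    of length t (measured in expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partial(node, site):
    """Conditional likelihoods {x: P(leaves below node | node has base x)}.

    node is either a leaf sequence (str) or a list of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {x: 1.0 if x == node[site] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:
        below = partial(child, site)
        for x in BASES:
            # Sum over every possible unobserved state y at the child end of the branch.
            result[x] *= sum(jc69(x, y, t) * below[y] for y in BASES)
    return result


def log_likelihood(tree, n_sites):
    """Log-probability of the alignment given the tree, assuming independent sites."""
    total = 0.0
    for site in range(n_sites):
        root = partial(tree, site)
        total += math.log(sum(0.25 * root[x] for x in BASES))
    return total


# Hypothetical branch lengths on the tree ((ACT, ACA), (GTT, GTA)).
tree = [([("ACT", 0.1), ("ACA", 0.1)], 0.05), ([("GTT", 0.1), ("GTA", 0.1)], 0.05)]
print(log_likelihood(tree, 3))
```

Repeating this calculation for many candidate trees and branch lengths is what makes ML slow in practice.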
Quality of the tree
Phylogenetic trees can vary dramatically with
slight changes in data
We want to know which branches are reliable, and
which branches do not have strong support from the
data
Bootstrapping is the most common method used
A general statistical technique for determining how
much error is in a set of results
Confidence assessment
Bootstrapping
1. Original data set with n characters; run the original analysis, e.g. MP, ML, NJ.
2. Draw n characters randomly with replacement. Repeat m times.
3. This gives m pseudo-replicates, each with n characters.
4. Repeat the original analysis on each of the pseudo-replicate data sets.
5. Evaluate the results from the m analyses.
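The resampling procedure can be sketched in a few lines. The tree-building step is left out because it depends on the chosen method; the sketch only shows how pseudo-replicate alignments are drawn column by column and how support for a group could be tallied once each replicate has been analysed. The function names, and the representation of a tree as a set of clades (each a `frozenset` of taxon names), are assumptions made for illustration.

```python
import random


def bootstrap_replicates(alignment, m, seed=None):
    """Yield m pseudo-replicate alignments, each built from n columns
    drawn with replacement from the n columns of the original alignment."""
    rng = random.Random(seed)
    n = len(next(iter(alignment.values())))
    for _ in range(m):
        cols = [rng.randrange(n) for _ in range(n)]
        # The same resampled columns are used for every sequence, so columns stay intact.
        yield {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def clade_support(replicate_trees, clade):
    """Fraction of replicate trees (each given as a set of clades) that contain clade."""
    return sum(clade in tree for tree in replicate_trees) / len(replicate_trees)


# Hypothetical alignment; each replicate would be passed to the original analysis.
alignment = {"s1": "ACGTAC", "s2": "ACGAAC", "s3": "TCGAAT"}
for replicate in bootstrap_replicates(alignment, 3, seed=0):
    print(replicate)
```

The bootstrap value reported on a branch is this fraction, expressed as a percentage of the m replicate trees.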
Confidence assessment
Bootstrap sampling of phylogenies
Confidence assessment
What do the bootstrap values mean?
Bootstrap values for phylogenetic trees do not behave like conventional statistical confidence levels
Bootstrap value 95% is actually close to 100% confidence in that branch
Bootstrap value 75% is often close to 95% confidence
Bootstrap value 60% is much lower confidence
Less than 50% bootstrap: no confidence in that branch over an alternative
Computer Software for Phylogenetics
Due to the lack of consensus among evolutionary biologists about basic principles for phylogenetic analysis, it is not surprising that there is a wide array of computer software available for this purpose.
– PHYLIP is a free package that includes 30 programs that compute various phylogenetic algorithms on different kinds of data.
– The GCG package (available at most research institutions) contains a full set of programs for phylogenetic analysis, including simple distance-based clustering and the complex cladistic analysis program PAUP (Phylogenetic Analysis Using Parsimony)
– CLUSTALX is a multiple alignment program that includes the ability to create trees based on Neighbor Joining.
– MacClade is a well designed cladistics program that allows the user to explore possible trees for a data set.
Phylogenetics on the Web
There are several phylogenetics servers available on the Web
– some of these will change or disappear in the near future
– these programs can be very slow so keep your sample sets small
The Institut Pasteur, Paris has a PHYLIP server at:
http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
Louxin Zhang at the Natl. University of Singapore has a WebPhylip server:
http://sdmc.krdl.org.sg:8080/~lxzhang/phylip/
The Belozersky Institute at Moscow State University has their own "GeneBee" phylogenetics server:
http://www.genebee.msu.su/services/phtree_reduced.html
The Phylodendron website is a tree drawing program with a nice user interface and a lot of options; however, the output is limited to gifs at 72 dpi - not publication quality.
Joseph Felsenstein (author of PHYLIP) maintains a comprehensive list of Phylogeny programs at:
http://evolution.genetics.washington.edu/phylip/software.html
Introduction to Phylogenetic Systematics, Peter H. Weston & Michael D. Crisp, Society of Australian Systematic Biologists
University of California, Berkeley Museum of Paleontology (UCMP)
http://www.ucmp.berkeley.edu/clad/clad4.html
Software Hazards
– There are a variety of programs for Macs and PCs, but you can easily tie up your machine for many hours with even moderately sized data sets (e.g. fifty 300 bp sequences)
– Moving sequences into different programs can be a major hassle due to incompatible file formats.
– Just because a program can perform a given computation on a set of data does not mean that it is the appropriate algorithm for that type of data.
Which Method to Choose?
Depends upon the sequences that are being compared
– Strong sequence similarity: Maximum parsimony
– Clearly recognizable sequence similarity: Distance methods
– All others: Maximum likelihood
Best to choose at least two approaches
Compare the results: if they are similar, you can have more confidence
Comparison of Methods
Neighbor-joining
– Uses only pairwise distances
– Minimizes distance between nearest neighbors
– Very fast
– Easily trapped in local optima
– Good for generating tentative tree, or choosing among multiple trees
Maximum parsimony
– Uses only shared derived characters
– Minimizes total distance
– Slow
– Assumptions fail when evolution is rapid
– Best option when tractable (<30 taxa, homoplasy rare)
Maximum likelihood
– Uses all data
– Maximizes tree likelihood given specific parameter values
– Very slow
– Highly dependent on assumed evolution model
– Good for very small data sets and for testing trees built using other methods
Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
More topics related to Phylogenetics
Phylogenetic epidemiology
Supertree / Tree of life
Phylogeography
Idea of the ‘Tree of Life’
The idea that the evolution of life can be represented as a tree, with leaves corresponding to extant species and nodes to extinct ancestors, came from Charles Darwin
The earliest trees formed by Ernst Haeckel and others were based on a general idea of a hierarchy of relationships between species and higher taxa
Gradually, quantitative criteria have been developed to measure the degree of morphological difference that was thought to reflect evolutionary distance
Winds of Change
In the early days of molecular phylogenetics, a gene tree was usually equated with the species tree. This view was typified using ribosomal RNA (rRNA) sequences as the principal molecular phylogenetic marker
This resulted in the discovery of a previously unrecognized domain of life, the Archaea, and in a tree topology that has been aptly called the ‘standard model’ of evolution
This model involves the early descent of the bacterial clade from the last universal common ancestor and a subsequent separation of archaea and eukaryotes.
All this was to change once comparative genomics yielded more information and multiple complete genome sequences became available for comparison
The three domains of Life
Identified by phylogenetic analysis of the highly
conserved 16S ribosomal RNA
Three strategies for constructing phylogenies
Homologous single-gene data set
– Rely on many taxa for a single gene
– Large number sequence alignment
Sequence concatenation
– Combine or concatenate multiple sequences for the same set of species
– Need for close concordance of species sampling among genes, which is difficult because of the hit-or-miss sampling in the databases.
– Fewer genes and fewer samples
Supertree construction
– Sample multiple genes only for minimally overlapping sets of species
– Tree constructed by a set of subtrees
Assembling the Tree of Life (ATOL): what is the difficulty in computing?
With current computational tools, phylogenetic analysis of 1,000 species is possible with adequate computer resources
It is currently impossible to reach a reasonable solution for 500,000 species, even with months of computation.
Tree of Life (30,000 species)
David Hillis, Science, 2003
PARALLEL ALGORITHMS FOR GENETICS
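One way to see why large trees are hard is to count how many topologies an exhaustive search would have to consider. The sketch below uses the standard result that n labeled leaves admit (2n-5)!! distinct unrooted binary trees and (2n-3)!! rooted ones; the function names are illustrative.

```python
def num_unrooted_trees(n):
    """Number of distinct unrooted binary topologies on n labeled leaves: (2n-5)!!"""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


def num_rooted_trees(n):
    """Number of distinct rooted binary topologies on n labeled leaves: (2n-3)!!"""
    # Rooting a tree is equivalent to attaching one extra leaf at the root position.
    return num_unrooted_trees(n + 1)


for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

Four leaves give the 3 unrooted trees of the parsimony example, while 10 leaves already give 2,027,025, so exhaustive search is out of reach long before data sets reach hundreds of species.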
Assembling large data matrices by concatenation
Advantages
Improve the accuracy of a specific portion of a tree
The addition of species can be useful in cases of so-called
‘long-branch attraction’, in which high substitution rates or long
intervals of time can mislead phylogenetic inference methods
Two potential problems
Multiple genes can mix phylogenetic signals arising from different
evolutionary histories
Some sequences are usually unavailable for some species,
‘missing data’, with possible deleterious effects on accuracy
Domination by biological problems
Reconstruction of trees from large data matrices
Two issues in constructing phylogenetic trees
– Computation time
– Reliability
Two time-consuming computational problems
– Multiple sequence alignment
– Phylogenetic inference
Domination by computational problems
– Optimal methods (parsimony and maximum likelihood) are time-consuming
– Even heuristic approaches are slow: months of processor time were devoted to a heuristic parsimony analysis of the Chase et al. dataset of ~500 sequences, and it never ran to completion (Sanderson and Driskell, 2004)
Synthesis of large trees: supertree
Tree constructed from a set of trees
Advantages
– Independent studies can be combined into a single tree
– Initial trees can be based on different kinds of data
– Initial trees can be obtained by different methodologies
– Initial trees often have been selected from competing trees by professional judgment
– There are most likely no common data for all species
– Methods such as maximum likelihood would not be computationally tractable on such a large dataset
Synthesis of large trees: supertree
Classification ( Wilkinson et al, 2001, Bininda-Emonds et al, 2002 )
(Figure: supertree techniques, past and present; Bininda-Emonds, 2004)
Reconstructing the “Tree” of Life
Handling large datasets: millions of species
The “Tree of Life” is not really a tree: reticulate evolution
Phylogenetic Epidemiology
Infectious diseases are caused by pathogens
pathogen: microbe that causes disease
microbe: microscopic organism
The major classes of disease-causing microbes are
viruses, bacteria, and eukaryotes (protists, fungi, and worms)
RNA Viruses
The RNA viruses are more often associated with epidemic and
emerging diseases in humans than DNA viruses.
The gene sequences of many RNA viruses change so rapidly
that it is possible to watch spatial and temporal patterns unfold on
a ‘real time’ scale that is not usually visible in other organisms.
Diseases caused by RNA viruses: avian influenza, HIV, dengue...
The rapidity of RNA virus evolution is caused by a combination of (Holmes, 2004)
Extremely high mutation rates
Short generation times
Immense population sizes.
These factors produce rates of nucleotide substitution that are, on
average, some six orders of magnitude higher than those in eukaryotes
and DNA viruses (Jenkins et al. 2002).
The high rates of substitution found in viruses and bacteria allow
phylogenies to be reconstructed for sequences that have diverged
only recently
Molecular phylogenies have come to play an increasingly important
role in epidemiological studies of microbial pathogens, as they
provide information about the location, timing, and mechanisms by
which virulent strains arise.
Guan et al. (2002) Emergence of multiple genotypes of H5N1 avian influenza viruses in Hong Kong SAR. Proc Natl Acad Sci U S A, 99, 8950-8955.
Moya, A., Holmes, E.C., and Gonzalez-Candelas, F. (2004) The population genetics and evolutionary epidemiology of RNA viruses. Nat Rev Microbiol, 2, 279-288.
Maximum likelihood estimate of phylogeny of eight strains of influenza A isolated from humans, swine, and birds based on an analysis of the HA gene. The divergence years prior to 1870, estimated using a partially constrained molecular clock, are shown at the left of the branch. The branch lengths (after 1870) are calibrated in units of years (scale at bottom).
Rannala, B. 2002. Molecular phylogenies and virulence evolution. In Adaptive Dynamics of Infectious Diseases: In Pursuit of Virulence Management
Difficulties With Phylogenetic Analysis
Horizontal or lateral transfer of genetic material
(for instance through viruses) makes it difficult to
determine phylogenetic origin of some evolutionary
events
Garbage in, garbage out! Alignment is crucial
Genes under selective pressure can evolve rapidly, masking earlier changes that had occurred phylogenetically
Difficulties With Phylogenetic Analysis
Two sites within comparative sequences may be
evolving at different rates
Rearrangements of genetic material can lead to
false conclusions
Duplicated genes can evolve along separate pathways, leading to different functions
Gene trees vs species trees
– Gene duplication can complicate phylogenetic analysis
– Paralogues (duplicated genes) do not fit in the evolutionary tree
Phylogenetics - Issues
Choice of target sequence type
– Ribosomal RNA (slowest change / mutation rate): use for very long-term evolutionary studies, spanning species boundaries & biological kingdoms
– DNA / RNA (fastest change / mutation rate): (a) use for short-term studies of closely-related species; (b) contains more evolutionary information than protein
– Protein (medium change / mutation rate): (a) use for wide species comparisons; (b) more reliable alignment than DNA
NO HOMEWORK! Happy??
A problem will appear in the Final Exam:
Give an example and design a flowchart to show how to construct a tree
Your answer should include, at least:
(a) Where did you find the example? (Google, books, or papers)
(b) Why did you choose this example? (curiosity, simplicity, or no reason?)
(c) Where do you plan to get the sequences? (a database in the public domain)
(d) What kind of method do you plan to use to construct your tree?