Phylogenetics
WHO-TDR Bioinformatics Workshop
Jessica Kissinger
New Delhi, India
October, 2005

Why do Phylogenetics?
• We make evolutionary assumptions in our everyday research life. For example, we need a drug that will kill the parasite and not us. Thus, we need a target that is present in the parasite and not us.
• We need a good model system. Which parasite (or host) is most closely related to P. falciparum or humans?
Goals for this lecture
• Become familiar with concepts
• Become familiar with vocabulary
• Become familiar with the data analysis
Why Phylogenetics?
• This strain is resistant to drug and this one is sensitive, what has changed?
• Where did this parasite come from? Has it “co-evolved” with humans? Did it enter the human lineage from another source?
• Which other mosquitoes are likely to serve as a host for my parasite in nature?
Phylogenetics
• What is Phylogenetics?
– Molecular Systematics
  • The use of molecular data to infer the relationships of the host species, e.g. using rRNA to build trees to look at the relationship of the bacteria to the eukaryotes
– Molecular Evolution
  • Use trees to infer how a molecule, protein, or gene has evolved (insertions, deletions, substitutions).

• Define a question
• Select sequences appropriate to answer your question (not all sequences are equally good!)
• Make a multiple sequence alignment
• Edit your alignment to make it better
• Perform lots and lots of analyses
• Perform Bootstrap analyses to test confidence
Multiple Sequence Alignment
Study your Alignments!
A Word About Methods
• There are two overall categories of methods
– Transformed distance methods (data are transformed into a distance matrix). The matrix is used to build a single tree. UPGMA and Neighbor-Joining are examples of this method. They are computationally simple and very fast.
– Optimality methods (tree generation is separate from tree evaluation). Parsimony and Maximum-likelihood methods divorce the issue of tree generation from evaluating how good a tree is. For parsimony, there may be more than one “most parsimonious” or “shortest” tree found.
Distance methods
• UPGMA
– Assumes all lineages evolve at the same rate
– Produces a root
– Produces only one tree
– Computationally very fast
– Trees are additive
• Neighbor-joining
– Permits variation in rates of evolution
– Does not produce a root
– Produces only one tree
– Computationally very fast
– Trees are additive
Create a distance matrix
Can use scoring schemes to transform data into distances (e.g. do transitions occur more often than transversions?)

1 ATTGCTCAGA
2 AATGCTCTGA
3 ATAGGACTGA

1 vs 2 = 80% similar = 0.2 distance
1 vs 3 = 60% similar = 0.4 distance
2 vs 3 = 60% similar = 0.4 distance

      1     2     3
1     -    0.2   0.4
2           -    0.4
3                 -

The UPGMA algorithm is applied to this matrix to produce the tree. A new matrix is calculated at each iteration.
[Figure: UPGMA tree of sequences 1, 2, and 3, with branch lengths 0.1, 0.1, 0.1, and 0.2]
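The clustering steps can be sketched in a few lines of Python. This is a minimal illustration, not the workshop's software: the function names `p_distance` and `upgma` are made up for this example, distances are simple proportions of differing sites, and ties are broken alphabetically.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma(seqs):
    """Cluster sequences with UPGMA and return a Newick string.

    seqs maps a taxon label to its aligned sequence. Each cluster is
    stored as (size, height, newick_text).
    """
    clusters = {name: (1, 0.0, name) for name in seqs}
    dist = {frozenset((a, b)): p_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}
    while len(clusters) > 1:
        # Join the closest pair (labels are sorted to break ties reproducibly).
        pair = min(dist, key=lambda k: (dist[k], sorted(k)))
        a, b = sorted(pair)
        (na, ha, ta), (nb, hb, tb) = clusters.pop(a), clusters.pop(b)
        height = dist.pop(pair) / 2  # the new node sits halfway between the pair
        new = f"({a},{b})"
        newick = f"({ta}:{height - ha:.2f},{tb}:{height - hb:.2f})"
        # Recalculate the matrix: size-weighted average distance to the new cluster.
        for c in clusters:
            dac = dist.pop(frozenset((a, c)))
            dbc = dist.pop(frozenset((b, c)))
            dist[frozenset((new, c))] = (na * dac + nb * dbc) / (na + nb)
        clusters[new] = (na + nb, height, newick)
    return next(iter(clusters.values()))[2] + ";"

seqs = {"1": "ATTGCTCAGA", "2": "AATGCTCTGA", "3": "ATAGGACTGA"}
print(upgma(seqs))
```

For the three example sequences this joins 1 and 2 first (distance 0.2, so each branch is 0.1) and then attaches 3 at height 0.2, which agrees with the branch lengths shown on the slide.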
An unrooted Neighbor-joining tree of the same dataset
Factors that Affect Phylogenetic Inference
1. Relative base frequencies (A, G, T, C)
2. Transition/transversion ratio
3. Number of substitutions per site
4. Number of nucleotides (or amino acids) in sequence
5. Different rates in different parts of the molecule
6. Synonymous/non-synonymous substitution ratio
7. Substitutions that are uninformative or obfuscatory
   1. Parallel substitutions
   2. Convergent substitutions
   3. Back substitutions
   4. Coincidental substitutions

Models of evolution: choosing parameters
In general, the more factors that are accounted for by the model (i.e., the more parameters), the larger the error of estimation. It is often best to use fewer parameters by choosing the simpler model.
Some distance models: p-distance
• p = nd/n, where n is the number of sites (nucleotides or amino acids), and nd is the number of differences between the two sequences examined.
• Very robust when divergence times are recent and the effect of complicating phenomena is minor
Some distance models: Jukes-Cantor
• Used to estimate the number of substitutions per site
• The expected number of substitutions per site is:
• d = 3αt = -(3/4) ln[1 - (4/3)p], where p is the proportion of differences between 2 sequences
• Variance can be calculated
• No allowance is made for unequal nucleotide frequencies or differential substitution rates (all bases and all substitution types are treated alike)

Substitution rates:
      A   T   C   G
A     -   α   α   α
T     α   -   α   α
C     α   α   -   α
G     α   α   α   -
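As a small worked example, the p-distance and the Jukes-Cantor correction can be computed directly from the formulas on these two slides. The function names are illustrative only, and the sequences are the three used in the distance-matrix example.

```python
import math

def p_distance(a, b):
    """p = nd / n: differing sites divided by the number of sites compared."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jukes_cantor(p):
    """d = -(3/4) ln[1 - (4/3) p], the corrected number of substitutions per site."""
    if p >= 0.75:
        # The logarithm is undefined once p reaches the saturation value of 3/4.
        raise ValueError("p must be below 0.75 for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - (4 / 3) * p)

p12 = p_distance("ATTGCTCAGA", "AATGCTCTGA")   # 2 differences in 10 sites
p13 = p_distance("ATTGCTCAGA", "ATAGGACTGA")   # 4 differences in 10 sites
print(p12, round(jukes_cantor(p12), 3))
print(p13, round(jukes_cantor(p13), 3))
```

The corrected distance is always at least as large as p, and the gap widens as sequences diverge, because the correction estimates substitutions hidden by multiple hits at the same site.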
Some distance models: Kimura two-parameter
• Used to estimate the number of substitutions per site
• d = 2rt, where r is the substitution rate (per site, per year) and t is the time since the two sequences diverged; r = α + 2β, so:
• d = 2αt + 4βt
• Accounts for different transition and transversion rates
• No allowance is made for unequal nucleotide frequencies; variance is greater than for Jukes-Cantor

[Figure: transitions (rate α) occur within the purines (A ↔ G) and within the pyrimidines (C ↔ T); transversions (rate β) occur between a purine and a pyrimidine]
α = transition rate
β = transversion rate
These are treated the same for long divergence times.
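The slide gives the expected distance in terms of α, β, and t, but not how to estimate it from data. A common estimator, which is an addition here and not taken from the slides, is Kimura's formula d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the observed proportions of transitional and transversional differences. A minimal sketch, with an illustrative function name:

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(a, b):
    """Kimura two-parameter distance between two aligned DNA sequences.

    P = proportion of sites differing by a transition (A<->G or C<->T),
    Q = proportion of sites differing by a transversion.
    Raises ValueError (from math.log) if the sequences are too divergent.
    """
    n = len(a)
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    P, Q = transitions / n, transversions / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))

# Sequences 1 and 2 from the distance-matrix example differ by two transversions.
print(round(k2p_distance("ATTGCTCAGA", "AATGCTCTGA"), 4))
```

Counting the two classes of change separately is what lets the model weight transitions and transversions differently, which the p-distance and Jukes-Cantor distance cannot do.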
Other models
• Hasegawa, Kishino, Yano (HKY): corrects for unequal nucleotide frequencies and takes transition/transversion bias into account
• Unrestricted model: allows different rates between all pairs of nucleotides
• General Time Reversible model: allows different rates between all pairs of nucleotides and corrects for unequal nucleotide frequencies
• Many other models have been invented to correct for specific problems
• The more parameters are introduced, the larger the variance becomes
Optimality Methods
• All possible trees (or a heuristic sampling of trees) are generated and evaluated according to Parsimony or Maximum likelihood.
• Note: Tree generation is divorced from tree evaluation. More than one tree topology may be optimal according to your criteria
General differences between optimality criteria

Maximum Likelihood | Maximum Parsimony | Minimum evolution
Model based | “Model free” | Model based
Can account for many types of sequence substitutions | Assumes that all substitutions are equal | Can account for many types of sequence substitutions
Computationally slow | Computationally fast | Computationally fast
Well understood statistical properties (easy to test) | Poorly understood statistical properties (hard to test) | Well understood statistical properties (easy to test)
Can estimate branch lengths with some degree of accuracy | Cannot estimate branch lengths accurately | Can accurately estimate branch lengths (important for molecular clocks)
Works well with strong or weak sequence similarity | Works only when sequence similarity is high | Works well with strong or weak sequence similarity
Rooted Tree vs. Unrooted Tree
A rooted tree has a definite beginning and polarity: a root.
[Figure: a rooted tree and an unrooted tree, labeled with terminal branches, nodes, internal branches, and the root]
[Figure: rooted and unrooted trees for taxa 1, 2, and 3]
In the world of trees, there are more rooted topologies for a given number of taxa than unrooted ones.
More trees than the number of atoms in the universe!
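To see how fast these numbers grow, the counts of strictly bifurcating trees can be computed with the standard double-factorial formulas: (2n - 5)!! unrooted and (2n - 3)!! rooted trees for n taxa. The formulas are not stated on the slides, and the function names below are illustrative.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def unrooted_trees(n_taxa):
    """Number of bifurcating unrooted topologies for n_taxa >= 3."""
    return double_factorial(2 * n_taxa - 5)

def rooted_trees(n_taxa):
    """Number of bifurcating rooted topologies for n_taxa >= 2."""
    return double_factorial(2 * n_taxa - 3)

for n in (3, 4, 5, 10, 15, 19):
    print(n, unrooted_trees(n), rooted_trees(n))
```

For 19 taxa this gives about 6.33 x 10^18 unrooted and 2.22 x 10^20 rooted trees, which appears to correspond to the unconstrained counts quoted later for the paper fastener data set.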
Tree search considerations
• Exhaustive searches are searches of all possible trees for the number of taxa in your data set (15 taxa or fewer)
• If you have more than 15 taxa, then heuristic methods must be employed, in which you search a sample of all possible trees. There are many algorithms for the generation of different populations of trees.
Tree search considerations
Strategy                      Type
• Stepwise addition           Algorithmic
• Star decomposition          Algorithmic
• Exhaustive                  Exact
• Branch & bound              Exact
• Branch swapping             Heuristic
• Genetic algorithm           Heuristic
• Markov Chain Monte Carlo    Heuristic
Parsimony basics & scores
• Based on shared derived characters (synapomorphies)
• Identical characters which evolve more than once are “homoplasies”
• Unique characters are “autapomorphies”
• The score of the tree is the total of all the changes needed to map the data. The scale bar is the number of changes.
• Smaller, i.e. more parsimonious, scores are better
• More than one tree topology may have the same score

An informative position is one that can favor one tree over another when some criterion is applied.
    1 2 3 4 5 6 7 8 9
1   A A G A G T G C A
2   A G C C G T G C G
3   A G A T A T C C A
4   A G A G A T C C G
* **
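[Figure: the three possible trees for taxa 1-4, drawn with tip states 1 = A, 2 = G, 3 = A, 4 = G. The tree grouping (1,2) and (3,4) has length 2; the tree grouping (1,3) and (2,4) has length 1; the tree grouping (1,4) and (2,3) has length 2.]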
Position #5 is informative: it permits us to choose a shorter tree from among the options. It prefers the tree of length 1 over those of length 2.
[Figure: positions 1, 2, and 3 mapped onto the three possible trees for taxa 1-4. Tip states A, A, A, A need 0 changes on every tree; tip states G, A, G, G need 1 change on every tree; tip states G, C, A, A need 2 changes on every tree.]
Not all alignment positions can help pick a better tree
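The per-position scoring can be checked with a short script. This sketch uses Fitch's counting rule, which the slides do not name, on the four-taxon alignment above; the tree encoding and function names are assumptions made for the example.

```python
ALIGNMENT = {
    1: "AAGAGTGCA",
    2: "AGCCGTGCG",
    3: "AGATATCCA",
    4: "AGAGATCCG",
}

# The three possible groupings of four taxa, written as nested tuples.
TREES = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]

def fitch_length(tree, states):
    """Minimum number of changes needed to map one column onto one tree."""
    def visit(node):
        if not isinstance(node, tuple):
            return {states[node]}, 0          # a tip: its observed state, no changes
        (left, lc), (right, rc) = visit(node[0]), visit(node[1])
        common = left & right
        if common:
            return common, lc + rc            # children agree: no extra change
        return left | right, lc + rc + 1      # children disagree: one more change
    return visit(tree)[1]

n_sites = len(ALIGNMENT[1])
lengths = [
    [fitch_length(tree, {taxon: seq[i] for taxon, seq in ALIGNMENT.items()})
     for tree in TREES]
    for i in range(n_sites)
]

# A position is informative if it scores differently on different trees.
informative = [i + 1 for i, row in enumerate(lengths) if len(set(row)) > 1]
totals = [sum(row[t] for row in lengths) for t in range(len(TREES))]
print("informative positions:", informative)
print("tree lengths:", totals)
```

Under this rule positions 5, 7, and 9 are the informative ones. Position 5 favors the grouping (1,2),(3,4), while the tip states A, G, A, G drawn in the earlier figure match column 9, which favors (1,3),(2,4). Summed over all nine columns, the grouping (1,2),(3,4) is the shortest tree.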
The bootstrap process

Original data (each column represented once):
    1 2 3 4 5 6 7 8 9
1   A A G A G T G C A
2   A G C C G T G C G
3   A G A T A T C C A
4   A G A G A T C C G

1st sample (columns drawn at random, with replacement):
    8 8 3 2 1 4 6 5 9
1   C C G A A A T G A
2   C C C G A C T G G
3   C C A G A T T A A
4   C C A G A G T A G

2nd sample, etc.: 2 9 6 2 1 3 4 8 7
3rd sample, etc.: 6 3 3 1 6 5 7 4 9
Repeat for 100 or 1,000 samples.

Then build a consensus of all trees produced by the sample datasets. This provides support for nodes.
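The resampling step is easy to sketch. The code below only builds the pseudo-alignments; building a tree from each one would be done with whatever tree method is being tested. Function names and the fixed random seed are choices made for this illustration.

```python
import random

ALIGNMENT = {
    1: "AAGAGTGCA",
    2: "AGCCGTGCG",
    3: "AGATATCCA",
    4: "AGAGATCCG",
}

def resample_columns(alignment, columns):
    """Build a pseudo-alignment from a list of 1-based column numbers."""
    return {name: "".join(seq[c - 1] for c in columns)
            for name, seq in alignment.items()}

def bootstrap_replicates(alignment, n_reps, seed=0):
    """Yield (columns, pseudo_alignment) pairs, sampling columns with replacement."""
    rng = random.Random(seed)
    length = len(next(iter(alignment.values())))
    for _ in range(n_reps):
        columns = [rng.randint(1, length) for _ in range(length)]
        yield columns, resample_columns(alignment, columns)

# Reproduce the first sample shown on the slide.
first = resample_columns(ALIGNMENT, [8, 8, 3, 2, 1, 4, 6, 5, 9])
print(first[1], first[2], first[3], first[4])

replicates = list(bootstrap_replicates(ALIGNMENT, 100))
print(len(replicates), "pseudo-alignments generated")
```

Each pseudo-alignment has the same number of columns as the original, but some columns appear more than once and others not at all, which is what lets the replicate trees differ.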
A caution about alignments: characters in columns are homologous
[Figure: 15 trees of stickman, Daffy, Donald, RoadRunner, Tweety, Bugs, Goofy, Mickey, Wile E, and Pluto, each drawn with a scale bar of 1 change]
15 equally parsimonious trees of Disney characters. All trees have the same, smallest score.
[Figure: strict consensus tree of stickman, Daffy, RoadRunner, Tweety, Donald, Bugs, Goofy, Mickey, Wile E, and Pluto]
[Figure: majority-rule consensus tree of the same taxa, with node values of 100, 100, 60, 100, and 100]
Comparison of real trees; assessment of support

Bootstrap Example
[Figure: tree of Donald Duck, Daffy Duck, and Tweety bird with a bootstrap value of 71, next to alternative arrangements marked “?”]
If this relationship holds 71% of the time, then 29% of the time it is something else.
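Node support and the two kinds of consensus can be illustrated by counting how often each grouping appears in a set of trees. In this sketch a tree is reduced to the set of groupings (clades) it contains. The 100 replicate trees are made up for the example: 71 contain Donald Duck with Daffy Duck, and the grouping assigned to the other 29 is purely hypothetical.

```python
from collections import Counter

def clade_support(trees):
    """Percentage of trees in which each clade appears.

    Each tree is a set of clades; each clade is a frozenset of taxon names.
    """
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: 100 * n / len(trees) for clade, n in counts.items()}

def strict_consensus(trees):
    """Clades present in every tree."""
    return {c for c, s in clade_support(trees).items() if s == 100}

def majority_rule_consensus(trees):
    """Clades present in more than half of the trees."""
    return {c for c, s in clade_support(trees).items() if s > 50}

ducks = frozenset({"Donald Duck", "Daffy Duck"})
other = frozenset({"Daffy Duck", "Tweety bird"})   # hypothetical alternative

# Hypothetical bootstrap result: 71 replicates with one grouping, 29 with another.
replicate_trees = [{ducks}] * 71 + [{other}] * 29

support = clade_support(replicate_trees)
print(support[ducks], support[other])
print(majority_rule_consensus(replicate_trees) == {ducks})
print(strict_consensus(replicate_trees) == set())
```

A grouping seen in 71% of replicates survives in the majority-rule consensus but not in the strict consensus, which keeps only groupings found in every tree.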
Some points to consider for the paper fasteners:

We decided, in our evolutionary model, that material was so important that we needed to give it extra weight, so we did (weight = 2).

Based on external information, such as the archeological record, we have learned that metal predates plastic, so we ordered our characters: metal must precede plastic.

We decided to use as an “outgroup” an unbent piece of metal (taxon 21) to polarize the direction of evolution within our tree, i.e. we have evolved from a straight piece of metal into a “paper fastening device”. We will not allow reversion to this “unbent” state.

We will enforce the assumptions/decisions made above by using a constraint tree. By using this constraint tree, we reduce the number of possible rooted trees from 2.216431 x 10^20 to 273,922,023,375 and we reduce the number of unrooted trees from 6.332660 x 10^18 to 54,784,404,674, a considerable savings!

We removed taxa 4 and 11 from the data set because they are non-homologous, i.e. they have a similar function but they do not share a common evolutionary descent or path. What we have here is a case of convergent evolution, i.e. independent origins of a paper fastening solution!
Neighbor-joining analysis and bootstrap of clip dataset
Some of the >37,500 trees generated by a parsimony analysis of the clip dataset
Consensus of 5,000 parsimony trees
Bootstrap of clips
Software and Books
• “How to make a phylogenetic Tree” by Barry Hall, comes with PAUP* CD, ~$30, Sinauer Press
• Phylip - Joe Felsenstein, Free via internet
• PAML - Free via internet
• MrBayes - Free via internet
• ClustalW or ClustalX - Free via internet
• Fundamentals of Molecular Evolution, Second edition, Wen-Hsiung Li, Sinauer Press

* Best on a Mac, but also command line
Giving Credit
• Several slides in this presentation were provided by Mike Thomas, via a presentation he posted on the internet in 2002.