Lecture 5 Maximum Likelihood and model selection Joe Felsenstein
Feb 01, 2016
Maximum Likelihood: The explanation that makes the observed outcome the most likely
First use in phylogenetics: Cavalli-Sforza and Edwards (1967) for gene frequency data; Felsenstein (1981) for DNA sequences
L = Pr(D|H)
Probability of the data, given an hypothesis
The hypothesis is a tree topology, its branch-lengths and a model under which the data evolved
Example: Suppose you are flipping coins and counting the number of times “heads” appears. This is your data. You throw the coin twice and observe “heads” both times. You might have two hypotheses to explain these data.
[Figure: two coins, each showing its “heads” and “tails” faces]
• H1 is the hypothesis that the coin is normal: “heads” on one side, “tails” on the other and each has the same probability, p = 0.5, of appearing.
• H2 is the hypothesis that the coin is rigged with an 80% chance of getting “heads”, p = 0.8.
• What is the likelihood of H1?
• What is the likelihood of H2?
• The probability of observing “heads” in each of two flips under H1 is: L(H1|data) = 0.5 × 0.5 = 0.25
• The probability of observing “heads” in each of two flips under H2 is: L(H2|data) = 0.8 × 0.8 = 0.64
Since the probability of observing the data under H2 is greater than under H1, you might argue that the “rigged” coin hypothesis is the more likely.
However, if you flipped the same coin 20 times and got “heads” 8 times, would H2 still be the better explanation of the data?
• Note that you could flip “heads” and “tails” in different orders
• E.g. HTHHTHTTHTTTTHHTTHTT
• Or HHHHHHHHTTTTTTTTTTTT
There are actually 20 choose 8 ways to do this:
$\binom{n}{k} = {}^{n}C_{k} = \frac{n!}{k!\,(n-k)!}$
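The likelihood for H1 of observing 8 “heads” and 12 “tails” (where each has an equal chance of appearing) is:
$L(H_1 \mid \text{data}) = {}^{20}C_{8}\,(0.5)^{8}(0.5)^{12} = \frac{20 \times 19 \times 18 \times \cdots \times 2 \times 1}{(8 \times 7 \times 6 \times \cdots \times 2 \times 1)(12 \times 11 \times 10 \times \cdots \times 2 \times 1)}\,(0.5)^{8}(0.5)^{12}$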
The likelihood for H2 of observing 8 “heads” and 12 “tails” (where “heads” has a probability of 0.8 of appearing) is:
$L(H_2 \mid \text{data}) = {}^{20}C_{8}\,(0.8)^{8}(0.2)^{12}$
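Numbers can be very low, so normally take natural logs:
lnL(H1) = -2.119
lnL(H2) = -9.355
With 8 “heads” in 20 flips, H1 now has the higher likelihood, so it is the better explanation of these data.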
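The coin calculation can be checked numerically. The following is a minimal sketch; the function name `binomial_likelihood` is an illustrative choice, not something from the lecture.

```python
from math import comb, log

def binomial_likelihood(heads: int, flips: int, p: float) -> float:
    """Probability of `heads` heads in `flips` flips when Pr(heads) = p."""
    tails = flips - heads
    # comb(flips, heads) counts the orders in which the heads can appear
    return comb(flips, heads) * p**heads * (1 - p)**tails

lik_h1 = binomial_likelihood(8, 20, 0.5)  # fair coin
lik_h2 = binomial_likelihood(8, 20, 0.8)  # rigged coin

print(f"lnL(H1) = {log(lik_h1):.3f}")
print(f"lnL(H2) = {log(lik_h2):.3f}")
```

The binomial coefficient is the same under both hypotheses, so it does not affect which hypothesis has the higher likelihood.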
Maximum likelihood is less likely to be misleading with more data
Plotting likelihood (or –lnL) values for different parameter values (e.g. equilibrium base frequency for Adenine, πA) gives the likelihood surface. The best score on this surface (the lowest point) identifies the maximum likelihood estimate (MLE), and indicates the hypothesis best supported by the data.
The maximum likelihood estimate (MLE)
[Figure: likelihood surface, with –lnL (y-axis, 0 to 140) plotted against πA (x-axis, 0 to 1)]
Here the MLE for πA ≈ 0.35
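The idea of scanning a likelihood surface for its lowest –lnL can be sketched with the coin data (8 “heads” in 20 flips) instead of πA. This is only an illustration: a grid search over p in steps of 0.01 is an assumption made here, and real programs use numerical optimisers.

```python
from math import comb, log

def neg_log_likelihood(p: float, heads: int = 8, flips: int = 20) -> float:
    """-lnL of the coin data for a given probability of heads, p."""
    return -log(comb(flips, heads) * p**heads * (1 - p)**(flips - heads))

# Candidate parameter values 0.01, 0.02, ..., 0.99 (0 and 1 give lnL = -infinity)
grid = [i / 100 for i in range(1, 100)]

# The MLE is the grid point with the lowest -lnL
mle = min(grid, key=neg_log_likelihood)
print(mle)
```

The lowest point falls at the observed proportion of heads, 8/20.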
In phylogenetics, the hypothesis is a tree topology, its branch-lengths and a model under which the data evolved
[Figure: four-taxon tree of Sheep, Goat, Cow and Bison, with branch-lengths 0.10, 0.08, 0.05, 0.32 and 0.14 given as expected numbers of substitutions per site]
Parsimony seeks to minimise the number of substitutions. Likelihood seeks to estimate the actual number of substitutions.
Parsimony assumes that having knowledge about how some sites are evolving tells us nothing about other sites. Likelihood assumes that sites evolve independently, but by common mechanisms
Oak   ATGACCGCTGCCAG
Ash   ACGCTCGCCATCAG
Maple ATGCTCGCTACCGG
Transitions are observed at six sites, but only one transversion is observed
Hence, an ML model should allow for different transition and transversion substitution rates
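The counts for the Oak, Ash and Maple alignment can be reproduced with a short script. This is a sketch: the function `classify_site` is a hypothetical helper, and a column mixing purines and pyrimidines is simply labelled a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_site(column) -> str:
    """Label one alignment column as constant, transition or transversion."""
    bases = set(column)
    if len(bases) == 1:
        return "constant"
    # All bases within one chemical class: only transitions are needed
    if bases <= PURINES or bases <= PYRIMIDINES:
        return "transition"
    # Purines and pyrimidines both present: at least one transversion
    return "transversion"

alignment = {
    "Oak":   "ATGACCGCTGCCAG",
    "Ash":   "ACGCTCGCCATCAG",
    "Maple": "ATGCTCGCTACCGG",
}

kinds = [classify_site(col) for col in zip(*alignment.values())]
n_transitions = kinds.count("transition")
n_transversions = kinds.count("transversion")
print(n_transitions, n_transversions)
```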
[Figure: the four bases, purines (A, G) and pyrimidines (C, T); transitions occur within a class, transversions between classes]
The model is reversible, i.e. p(A→G) = p(G→A), so the root can be placed at any node
[Figure: rooted tree with node states A, A, G, G, A, G; all five branch-lengths are 0.5]
Pattern probability = p(G→G) p(G→G) p(G→A) p(A→A) p(A→A)
Under the simple Jukes-Cantor model, all base frequencies = 0.25 and all substitutions are equally probable. b is branch-length (subs/site)
$P_{ij}\ (i = j) = 0.25 + 0.75\,e^{-b}$
$P_{ij}\ (i \neq j) = 0.25 - 0.25\,e^{-b}$
Where b = 0.5, Pij (i=j) = 0.7049, Pij (i≠j) = 0.0984
Site pattern probability = p(G→G) p(G→G) p(G→A) p(A→A) p(A→A)
= 0.7049 × 0.7049 × 0.0984 × 0.7049 × 0.7049 = 0.0243
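The numbers on this slide can be reproduced with a few lines. This sketch follows the slide's parametrisation, in which the exponent is -b; many textbooks write the Jukes-Cantor formula with exp(-4b/3) when b is in expected substitutions per site, so the scaling used here is an assumption chosen to match the quoted values.

```python
from math import exp

def p_same(b: float) -> float:
    """Probability that the base is unchanged along a branch of length b."""
    return 0.25 + 0.75 * exp(-b)

def p_diff(b: float) -> float:
    """Probability of ending at one particular different base."""
    return 0.25 - 0.25 * exp(-b)

b = 0.5
# p(G->G) p(G->G) p(G->A) p(A->A) p(A->A): four unchanged branches, one change
pattern_prob = p_same(b) ** 4 * p_diff(b)

print(f"{p_same(b):.4f} {p_diff(b):.4f} {pattern_prob:.4f}")
```

Each row of the transition matrix sums to one: p_same(b) + 3 × p_diff(b) = 1 for any b.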
[Figure: the tree redrawn 16 times with tips A, A, G, G, once for each assignment of A, C, G or T to the two internal nodes]
The site likelihood is the sum of the probabilities for the 16 possible site patterns = 0.0333
Hence, the site –lnL = 3.402
The likelihood of a tree is the product of the site likelihoods. Taken as natural logs, the site likelihoods can be summed to give the log likelihood: the tree with the highest likelihood (lowest –lnL) is the ML tree.
Site     –lnL(1)     –lnL(2)
1        2.457       2.891
2        1.568       1.943
...      ...         ...
1206     2.541       1.943
Total    2052.456    2043.655
Tree 2 is the ML tree by 8.801 –lnL units
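Why logs are summed rather than likelihoods multiplied can be seen in a small sketch. The per-site values below are hypothetical (every site is given the same –lnL so that the totals match the table); only the totals come from the slide.

```python
from math import exp, prod

n_sites = 1206

# Hypothetical per-site -lnL values: equal shares of the totals in the table
site_neg_lnl_tree1 = [2052.456 / n_sites] * n_sites
site_neg_lnl_tree2 = [2043.655 / n_sites] * n_sites

# Multiplying raw site likelihoods underflows double precision to 0.0
raw_product = prod(exp(-x) for x in site_neg_lnl_tree1)

# Summing the logs keeps the score representable
total1 = sum(site_neg_lnl_tree1)
total2 = sum(site_neg_lnl_tree2)
difference = total1 - total2  # positive: tree 2 has the lower -lnL

print(raw_product, round(total1, 3), round(total2, 3), round(difference, 3))
```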
[Figure: Tree 1 and Tree 2]
Sequence at root: AGACTGATCGAATCGATTAG
Sequence at 1: ATACGGACAGAACGGTTAAG
Sequence at 2: AGACTGATCGATTCGATTAG
Sequence at 3: AGAATGATCGATTCGATTAG
Sequence at 4: CGAATGATCGAATGGACTTG
[Figure: simulation tree for taxa 1 to 4, with changes marked as true synapomorphy, back mutation or parallel change]
Parsimony reconstruction
Long-branch attraction
Goremykin (Molec. Biol. Evol., 2005) chloroplast genome sequences
Maximum likelihood tree
Under the correct model, Maximum likelihood will account for parallel changes and back mutations
Maximum parsimony tree
Long-branch attraction
Major programs supporting maximum likelihood analysis
• PHYLIP (http://evolution.gs.washington.edu/phylip/software.html)
• PAUP* (http://paup.csit.fsu.edu)
• PHYML (http://atgc.lirmm.fr/phyml/)
• PAML (http://abacus.gene.ucl.ac.uk/software/paml.html)
• TREE-PUZZLE (http://www.tree-puzzle.de)
• DAMBE (http://aix1.uottawa.ca/~xxia/software/software.htm)
General time-reversible (GTR+I+Γ) likelihood parameters
Base frequencies: πA πG πC πT (3 free parameters)
Substitution rates: A-C, A-G, A-T, C-G, C-T, G-T (5 free parameters)
Proportion of invariant sites: I (1 free parameter)
Shape of the gamma distribution: α (1 free parameter)
Branch-lengths: (2n-3 free parameters on unrooted trees, where n is the number of taxa)
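The parameter count can be tallied in a short sketch. The function name is hypothetical and the tree is assumed to be unrooted and fully resolved.

```python
def gtr_i_gamma_free_parameters(n_taxa: int) -> int:
    """Free parameters of GTR+I+gamma on an unrooted, fully resolved tree."""
    base_frequencies = 3     # four frequencies that must sum to 1
    substitution_rates = 5   # six rates, one fixed as the reference
    invariant_sites = 1      # proportion of invariant sites, I
    gamma_shape = 1          # shape of the gamma distribution
    branch_lengths = 2 * n_taxa - 3
    return (base_frequencies + substitution_rates + invariant_sites
            + gamma_shape + branch_lengths)

print(gtr_i_gamma_free_parameters(16))
```

For 16 taxa this gives 10 model parameters plus 29 branch-lengths.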
Substitution model categories
Model         Substitution types                  Base frequencies
GTR           6                                   unequal
SYM           6                                   equal
TrN           3 (transversion, 2 transitions)     unequal
K3ST          3 (transition, 2 transversions)     equal
HKY85 / F84   2 (transversion, transition)        unequal
K2P           2 (transversion, transition)        equal
F81           1                                   unequal
JC            1                                   equal
Note: there are also models for codon and amino acid data
Which is the most appropriate model?
Too few parameters can lead to inaccuracy, convergence upon the wrong tree (inconsistency)
Too many parameters can reduce statistical power, the ability to reject an hypothesis
Likelihood ratio test (LRT)
The test statistic is δ = 2(lnLmodel_1 − lnLmodel_2)
δ is compared to a χ² distribution critical value, where the degrees of freedom is the difference in the number of free parameters being estimated between the two (nested) models.
H0: models 1 and 2 explain the data equally well
[Figure: null distribution of δ; H0 is accepted below the critical value and rejected above it]
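A minimal sketch of the likelihood ratio test, using the JC and F81 scores quoted in the Modeltest output later in this lecture. The function name `likelihood_ratio_test` is hypothetical, and the small table of 5% critical values of the χ² distribution stands in for a statistics library.

```python
# Upper 5% critical values of the chi-square distribution, by degrees of freedom
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

def likelihood_ratio_test(lnl_null: float, lnl_alt: float, df: int):
    """Return (delta, reject_null) for two nested models at the 5% level."""
    delta = 2.0 * (lnl_alt - lnl_null)
    return delta, delta > CHI2_CRITICAL_05[df]

# JC (null) against F81 (alternative): F81 adds 3 free base frequencies
delta, reject = likelihood_ratio_test(-28513.7910, -28409.5176, df=3)
print(round(delta, 4), reject)
```

Here δ is far above the critical value of 7.815, so the simpler model is rejected.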
The Akaike Information Criterion (AIC)
AIC for each model = −2lnL + (2 × the number of free parameters)
Choose the model with the lower AIC
• Can be compared between non-nested models
• Does not assume a χ² distribution
• May tend to over-parameterization more than LRT
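A sketch of an AIC comparison, taking AIC = −2lnL + 2k. The scores are the JC and F81 values from the Modeltest output in this lecture; k counts only substitution-model parameters here, on the assumption that both models are scored on the same tree so the branch-lengths add the same amount to each.

```python
def aic(neg_lnl: float, n_free_parameters: int) -> float:
    """Akaike Information Criterion from a -lnL score and a parameter count."""
    return 2.0 * neg_lnl + 2.0 * n_free_parameters

aic_jc = aic(28513.7910, 0)    # JC: no free substitution-model parameters
aic_f81 = aic(28409.5176, 3)   # F81: 3 free base frequencies

best = "F81" if aic_f81 < aic_jc else "JC"
print(round(aic_jc, 4), round(aic_f81, 4), best)
```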
How well does the model reflect the substitution process?
Parametric bootstrap: compares the observed sequence data with data predicted from the model (observed vs. expected site pattern frequencies)
1 AGCA
2 AGAT
3 TGAT
4 TGCT
The test statistic = likelihood ratio between the multinomial likelihood T(X) and the standard substitution model likelihood
$T(X) = \sum_{i=1}^{n} N_i \ln(N_i) - N \ln(N)$
i is the ith unique site pattern, Ni is the number of times that pattern appears, n is the number of unique site patterns and N is the total number of sites.
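T(X) can be computed directly from an alignment by counting site patterns. This sketch uses the four example sequences above; the function name `multinomial_lnl` is hypothetical.

```python
from collections import Counter
from math import log

def multinomial_lnl(sequences) -> float:
    """T(X) = sum_i N_i ln(N_i) - N ln(N) over the unique site patterns."""
    # Each alignment column (one base per sequence) is a site pattern
    pattern_counts = Counter(zip(*sequences))
    n_sites = sum(pattern_counts.values())
    return (sum(c * log(c) for c in pattern_counts.values())
            - n_sites * log(n_sites))

sequences = ["AGCA", "AGAT", "TGAT", "TGCT"]
t_x = multinomial_lnl(sequences)
print(round(t_x, 3))
```

In this toy alignment all four site patterns are unique, so every Ni = 1 and T(X) reduces to −4 ln(4).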
1. Calculate the test statistic δO for the observed data
2. Simulate many pseudoreplicate datasets using the ML topology, branch lengths and model parameters
3. Calculate the test statistic δi for each of the pseudoreplicates
4. If δO is greater than (e.g.) 95% of the ranked values of δi then the null hypothesis is rejected
[Figure: frequency histogram of δi from the pseudoreplicates, x-axis from 105 to 145]
δO = 126.7
p = 0.317
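The final step of the parametric bootstrap, turning the simulated statistics into a p-value, can be sketched as follows. The simulated values are made up for illustration and the function name is hypothetical; simulating the datasets themselves needs a phylogenetics package and is not shown.

```python
def parametric_bootstrap_p(delta_observed: float, delta_simulated) -> float:
    """Proportion of simulated statistics at least as large as the observed one."""
    n_as_large = sum(1 for d in delta_simulated if d >= delta_observed)
    return n_as_large / len(delta_simulated)

# Hypothetical statistics from six pseudoreplicates
delta_simulated = [110.0, 120.0, 125.0, 128.0, 130.0, 140.0]
p_value = parametric_bootstrap_p(126.7, delta_simulated)

# Reject the model at the 5% level only if the observed value is in the top 5%
reject_model = p_value < 0.05
print(p_value, reject_model)
```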
Maximum likelihood analysis is computationally expensive
So for ML on large datasets it is not feasible to use non-parametric bootstrapping, i.e. reconstructing the tree from many resamplings of the observed data (draw with replacement n nucleotide sites, where n = sequence length)
Time (t) for one ML (GTR+I+Γ) heuristic search on an X-taxon dataset, 3425 nt in length (using a Pentium 4 processor)
Taxa (X)   t (parameters estimated)   t (parameters fixed)
5          46s                        0.3s
6          4m 37s                     1.3s
7          15m 58s                    2.5s
8          39m 16s                    5.5s
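The resampling step of non-parametric bootstrapping is cheap; the expense lies in the ML search on every replicate. A minimal sketch of the resampling only, with a hypothetical function name and a toy alignment:

```python
import random

def bootstrap_alignment(sequences, rng: random.Random):
    """Draw n sites with replacement, where n is the alignment length."""
    n_sites = len(sequences[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same sampled columns are used for every sequence
    return ["".join(seq[i] for i in picks) for seq in sequences]

original = ["AGCA", "AGAT", "TGAT", "TGCT"]
rng = random.Random(1)  # fixed seed so the sketch is repeatable
replicate = bootstrap_alignment(original, rng)
print(replicate)
```

Each pseudoreplicate has the same dimensions as the original alignment, and every column in it is a column of the original.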
Accounting for stochastic (sampling) error
Flip 2 coins 100 times each: does coin A give more tails than coin B?
Can the difference in likelihood between two hypotheses be explained by sampling error?
Compare with a null distribution
Proportion of tails
Choosing just a finite number of nucleotide sites (e.g. 500) also has the problem of sampling error
Null distribution for the likelihood ratio statistic
Here δ = lnLT1 − lnLT2
[Figure: null distribution of δ, with the critical value separating the regions where H0 is accepted and rejected]
Maximum likelihood Hypothesis testing
If comparing a small number of alternative trees, ML hypothesis tests allow full parameter optimisation
Shimodaira-Hasegawa (SH) test
Kishino-Hasegawa (KH) test
Approximately unbiased (AU) test
Swofford, Olsen, Waddell, Hillis (SOWH) test
Winning sites tests
Parametric bootstrap test
Kishino-Hasegawa test
H0: T1 and T2 explain the data equally well, i.e. δ = 0
[Figure: the two four-taxon topologies T1 and T2]
1. Calculate the likelihood ratio statistic δ between T1 and T2
2. Bootstrap the data (or site likelihoods) to generate pseudoreplicates
3. Optimise ML on each pseudoreplicate for T1 and T2
4. Calculate δi for each pseudoreplicate
5. Centre the distribution by subtracting the mean of δi from each value of δi
6. If δ lies outside of 2.5% - 97.5% of the ranked distribution of δi, then H0 is rejected
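The steps above can be sketched using the "site likelihoods" shortcut: per-site log-likelihoods are resampled rather than re-optimising each pseudoreplicate, so step 3 is skipped. That shortcut, the function name `kh_test` and the toy inputs are all assumptions of this sketch, not a description of any particular program.

```python
import random

def kh_test(site_lnl_t1, site_lnl_t2, n_reps: int = 1000, seed: int = 0):
    """Two-tailed KH-style test from per-site log-likelihoods of T1 and T2."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(site_lnl_t1, site_lnl_t2)]
    n_sites = len(diffs)
    delta = sum(diffs)  # step 1: observed lnL(T1) - lnL(T2)
    # steps 2 and 4: resample sites with replacement and recompute delta
    reps = [sum(rng.choices(diffs, k=n_sites)) for _ in range(n_reps)]
    # step 5: centre the bootstrap distribution on zero
    mean_rep = sum(reps) / n_reps
    centred = sorted(d - mean_rep for d in reps)
    # step 6: compare delta with the 2.5% and 97.5% points
    lower = centred[int(0.025 * n_reps)]
    upper = centred[int(0.975 * n_reps) - 1]
    return delta, (delta < lower or delta > upper)

# Toy input: every site favours T1 by 0.5 lnL units
delta, reject = kh_test([-2.0] * 50, [-2.5] * 50)
print(delta, reject)
```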
Shimodaira-Hasegawa test
Corrects for comparing multiple topologies
H0: That all topologies are equally good explanations of the data
HA: That some topologies do not explain the data as well as others
Approximately unbiased test
Generates pseudoreplicates that differ in length from the original dataset, in order to explore site-pattern space. This allows more accurate correction for comparing multiple topologies
H0 and HA as above
SOWH test
H0: That T1 is the true tree
HA: That some other tree is the true tree
[Figure: the four-taxon topologies T1 and TML]
1. Optimise T1 and TML on the original data and calculate the likelihood ratio δ
2. Simulate many datasets using T1 (topology, branch-lengths, substitution parameters)
3. For each dataset (i) optimise the likelihood for T1 to give L1i and find TML to give LMLi
4. Calculate δi for each L1i and LMLi pair
5. If δ is greater than 95% of the ranked values of δi, reject H0
Which hypothesis test to use?
If just pairwise hypothesis testing and neither is a priori known to be the ML tree, then the KH test
If comparing many topologies simultaneously and the curvature of site-pattern space can be defined, then the AU test; otherwise the SH test (which is very conservative, and whose power to reject H0 depends on the number of topologies tested)
SOWH tests the full phylogenetic model (topology, branch-lengths, substitution parameters), so can be difficult to interpret when the object is specifically to compare topologies. If the model is misspecified it will not be conservative enough.
A maximum likelihood analysis: What are the phylogenetic affinities of turtles?
Turtles and many early reptilian groups
Squamates: e.g. lizards/snakes
Some marine reptiles, derived from diapsids
Mammals
Turtle placement hypotheses
H1: Anapsida
H2: Diapsida
H3: Archosauria
[Figure: candidate placements of turtles on a tree of Amphibia (outgroup), Mammalia, Squamata, Aves and Crocodilia]
Complete mitochondrial genome RNA sequences: 3,110 nucleotides
Modeltest (Posada and Crandall, Bioinformatics, 1998)
Hierarchical Likelihood Ratio Tests (hLRTs)
Equal base frequencies
Null model = JC, -lnL0 = 28513.7910
Alternative model = F81, -lnL1 = 28409.5176
δ = 2(lnL1-lnL0) = 208.5469, df = 3, P-value = <0.000001
Optimized model: on LRT → TrN+I+Γ; on AIC → GTR+I+Γ
Base frequencies = (0.3546 0.2105 0.1780 0.2569)
Rate matrix = (1.0000 5.3176 1.0000 1.0000 8.7021 1.0000)
Rates = gamma, Pinvar = 0.2284, Shape = 0.9845
ML (TrN+I+Γ) heuristic search
[Figure: ML tree, scale bar 0.05 substitutions/site. Taxa: Dog, 3 toed Sloth, Kangaroo, Opossum, Platypus, Echidna, Skink, Iguana, Green Turtle, Painted Turtle, Alligator, Caiman, Cassowary, Penguin, Caecilian, Salamander. Clade labels: Archosauria, Diapsida, Mammalia, Amphibia]
[Figure: Tree 1, Tree 2 and Tree 3, three alternative topologies for Birds, Crocodilia, Turtles, Squamates, Mammalia and Amphibia, with non-parametric bootstrap support values of 88 and 98]
        Tree 1    Tree 2    Tree 3
-lnL    +36.1     +11.7     <best>
KH      0.002     0.044     --
SH      0.003     0.153     --
AU      0.001     0.054     --
SOWH    <0.001    <0.001    --
• Turtles are not a remnant of early anapsid reptiles, instead they group within Diapsida
• The anapsid condition must be a reversal in turtles, as has occurred convergently with other armoured diapsid groups
• Within Diapsida, turtles are likely sister to Archosauria (inc. birds and crocodiles) - this explains the missing 80Ma fossil record of turtles (prior to 230 Ma)
[Figure: Ankylosaurus]