Lecture 5 Maximum Likelihood and model selection

Joe Felsenstein

Maximum Likelihood: The explanation that makes the observed outcome the most likely

First use in phylogenetics: Cavalli-Sforza and Edwards (1967) for gene frequency data; Felsenstein (1981) for DNA sequences

L = Pr(D|H)

Probability of the data, given an hypothesis

The hypothesis is a tree topology, its branch-lengths and a model under which the data evolved

Example: Suppose you are flipping a coin and counting the number of times “heads” appears. This is your data. You throw the coin twice and observe “heads” both times. You might have two hypotheses to explain these data.

• H1 is the hypothesis that the coin is normal: “heads” on one side, “tails” on the other and each has the same probability, p = 0.5, of appearing.

• H2 is the hypothesis that the coin is rigged, with an 80% chance of getting “heads”, p = 0.8.

• What is the likelihood of H1?

• What is the likelihood of H2?

• The probability of observing “heads” in each of two flips under H1 is:

L(H1|data) = 0.5 × 0.5 = 0.25

• The probability of observing “heads” in each of two flips under H2 is:

L(H2|data) = 0.8 × 0.8 = 0.64

Since the probability of observing the data under H2 is greater than under H1, you might argue that the “rigged” coin hypothesis is the more likely.

However, if you flipped the same coin 20 times and got “heads” 8 times, would H2 still be the better explanation of the data?

• Note that you could flip “heads” and “tails” in different orders

• E.g. HTHHTHTTHTTTTHHTTHTT

• Or HHHHHHHHTTTTTTTTTTTT

There are actually 20 choose 8 ways to do this:

$^{n}C_{k} = \frac{n!}{k!\,(n-k)!}$

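The likelihood of H1, given 8 “heads” and 12 “tails” (where each has an equal chance of appearing), is:

$L(H_1 \mid \text{data}) = {}^{20}C_{8}\,(0.5)^{8}(0.5)^{12} = \frac{20 \times 19 \times 18 \cdots 2 \times 1}{(8 \times 7 \times 6 \cdots 2 \times 1)(12 \times 11 \times 10 \cdots 2 \times 1)}\,(0.5)^{8}(0.5)^{12}$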

The likelihood of H2, given 8 “heads” and 12 “tails” (where “heads” has a probability of 0.8 of appearing), is:

$L(H_2 \mid \text{data}) = {}^{20}C_{8}\,(0.8)^{8}(0.2)^{12}$

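Numbers can be very low, so we normally take natural logs:

lnL(H1) = −2.119

lnL(H2) = −9.355

With 8 “heads” in 20 flips, H1 has the higher log-likelihood and is now the better explanation of the data.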
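The two log-likelihoods for the 20-flip example can be checked numerically. This is a minimal sketch using only the Python standard library; the function name `binomial_log_likelihood` is an illustrative choice, not something from the lecture.

```python
from math import comb, log

def binomial_log_likelihood(p, heads, flips):
    """ln L(p | data) for `heads` heads observed in `flips` independent flips."""
    tails = flips - heads
    # ln(nCk) + k ln(p) + (n - k) ln(1 - p)
    return log(comb(flips, heads)) + heads * log(p) + tails * log(1 - p)

lnL_H1 = binomial_log_likelihood(0.5, heads=8, flips=20)  # fair coin
lnL_H2 = binomial_log_likelihood(0.8, heads=8, flips=20)  # rigged coin
print(f"lnL(H1) = {lnL_H1:.3f}")
print(f"lnL(H2) = {lnL_H2:.3f}")
```

Working in logs from the start avoids multiplying many small probabilities together.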

Maximum likelihood is less likely to be misleading with more data

Plotting likelihood (or –lnL) values for different parameter values (e.g. equilibrium base frequency for Adenine, πA) gives the likelihood surface. The best score on this surface (the lowest point when –lnL is plotted) identifies the maximum likelihood estimate (MLE), and indicates the hypothesis best supported by the data.

The maximum likelihood estimate (MLE)

[Figure: likelihood surface, –lnL (y-axis, 0 to 140) plotted against πA (x-axis, 0 to 1). Here the MLE for πA ≈ 0.35]
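A likelihood surface can be sketched for the coin example by scoring a grid of candidate values. Note the assumption: this uses the probability of heads p for the 8-heads-in-20-flips data, not πA, because the slide gives no sequence data behind its plot. The grid spacing of 0.01 is also an arbitrary choice.

```python
from math import comb, log

def neg_log_likelihood(p, heads=8, flips=20):
    """-lnL of the coin data for a given probability of heads p."""
    lnL = log(comb(flips, heads)) + heads * log(p) + (flips - heads) * log(1 - p)
    return -lnL

# Evaluate the surface on a grid, leaving out 0 and 1 where ln(0) is undefined.
grid = [i / 100 for i in range(1, 100)]
surface = [(p, neg_log_likelihood(p)) for p in grid]

# The MLE is the grid point with the lowest -lnL.
p_mle, best_score = min(surface, key=lambda point: point[1])
print(f"MLE of p = {p_mle:.2f}, -lnL = {best_score:.3f}")
```

For this one-parameter case the grid minimum falls at the observed proportion of heads, 8/20. Real phylogenetic programs optimise many parameters at once with numerical optimisers rather than a grid.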

In phylogenetics, the hypothesis is a tree topology, its branch-lengths and a model under which the data evolved

[Figure: tree of Sheep, Goat, Cow and Bison, with branch-lengths (0.10, 0.08, 0.05, 0.32, 0.14) given as expected numbers of substitutions per site]

Parsimony seeks to minimise the number of substitutions. Likelihood seeks to estimate the actual number of substitutions.

Parsimony assumes that having knowledge about how some sites are evolving tells us nothing about other sites. Likelihood assumes that sites evolve independently, but by common mechanisms

Oak   ATGACCGCTGCCAG
Ash   ACGCTCGCCATCAG
Maple ATGCTCGCTACCGG

Transitions are observed at six sites; only one transversion is observed

Hence, an ML model should allow for different transition and transversion substitution rates
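The transition and transversion counts for the Oak, Ash and Maple alignment can be reproduced with a short script. This is a sketch with one simplifying assumption: a variable site is called a transition only if all of its bases are purines or all are pyrimidines, and anything else counts as a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

sequences = {
    "Oak":   "ATGACCGCTGCCAG",
    "Ash":   "ACGCTCGCCATCAG",
    "Maple": "ATGCTCGCTACCGG",
}

def classify_site(column):
    """Return 'constant', 'transition' or 'transversion' for one alignment column."""
    bases = set(column)
    if len(bases) == 1:
        return "constant"
    # Only purines, or only pyrimidines: the difference is a transition.
    if bases <= PURINES or bases <= PYRIMIDINES:
        return "transition"
    return "transversion"

kinds = [classify_site(column) for column in zip(*sequences.values())]
print("transitions:", kinds.count("transition"))
print("transversions:", kinds.count("transversion"))
```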

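[Figure: the purines (A, G) and the pyrimidines (C, T). Transitions are substitutions within a class (A↔G, C↔T); transversions are substitutions between the classes]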

The model is reversible, i.e. p(A→G) = p(G→A), so the root can be placed at any node

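[Figure: tree with tips A, A, G, G. The root is placed at an internal node with state G; the other internal node has state A]

Pattern probability = p(G→G) × p(G→G) × p(G→A) × p(A→A) × p(A→A)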

Under the simple Jukes-Cantor model, all base frequencies = 0.25 and all substitutions are equally probable. b is the branch-length (subs/site)

$P_{ij}\ (i = j) = 0.25 + 0.75\,e^{-b}$

$P_{ij}\ (i \neq j) = 0.25 - 0.25\,e^{-b}$

Site pattern probability = p(G→G) × p(G→G) × p(G→A) × p(A→A) × p(A→A)

= 0.7049 × 0.7049 × 0.0984 × 0.7049 × 0.7049 = 0.0243

Where b = 0.5 on all five branches, Pij (i=j) = 0.7049 and Pij (i≠j) = 0.0984

[Figure: the 16 possible assignments of states (A, C, G or T) to the two internal nodes of the tree with tips A, A, G, G]

The site likelihood is the sum of the probabilities for the 16 possible site patterns = 0.0333

Hence, the site lnL = −3.402
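The site likelihood calculation can be sketched as follows. Assumptions to note: the code follows the slides in writing the exponent as e^(-b) and in leaving out a prior on the root state (the textbook Jukes-Cantor form, with b in expected substitutions per site, uses e^(-4b/3)); the arrangement of tips, with one internal node joined to the two A tips and the other to the two G tips, is inferred from the pattern probability shown; and the function names are illustrative.

```python
from itertools import product
from math import exp, log

def jc_prob(i, j, b):
    """Probability of going from state i to state j along a branch of length b."""
    return 0.25 + 0.75 * exp(-b) if i == j else 0.25 - 0.25 * exp(-b)

def pattern_prob(x, y, b):
    """Probability of tips A, A, G, G given internal states x (next to A, A) and y (next to G, G)."""
    return jc_prob(x, "A", b) ** 2 * jc_prob(x, y, b) * jc_prob(y, "G", b) ** 2

b = 0.5
print(f"P(i=j) = {jc_prob('A', 'A', b):.4f}")
print(f"P(i!=j) = {jc_prob('A', 'G', b):.4f}")

# The single assignment worked through on the slide: internal states A and G.
single = pattern_prob("A", "G", b)
print(f"single assignment = {single:.4f}")

# Sum over all 16 assignments of states to the two internal nodes.
site_likelihood = sum(pattern_prob(x, y, b) for x, y in product("ACGT", repeat=2))
print(f"site likelihood = {site_likelihood:.4f}, lnL = {log(site_likelihood):.3f}")
```

The two branch probabilities and the single-assignment value reproduce 0.7049, 0.0984 and 0.0243. The summed site likelihood comes out close to the 0.0333 quoted; a small difference is to be expected, since the slide does not state its rounding or the exact tree it summed over.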

The likelihood of a tree is the product of the site likelihoods. Taken as natural logs, the site likelihoods can be summed to give the log likelihood: the tree with the highest likelihood (lowest –lnL) is the ML tree.

Site     –lnL(1)     –lnL(2)
1        2.457       2.891
2        1.568       1.943
...      ...         ...
1206     2.541       1.943
Total    2052.456    2043.655

Tree 2 is the ML tree by 8.801 –lnL units
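One practical reason for summing logs rather than multiplying site likelihoods is numerical underflow. The sketch below uses a hypothetical alignment in which all 1206 sites have the same likelihood of 0.0333; these are not the per-site values from the table.

```python
from math import log, prod

# Hypothetical data: 1206 sites, every site with likelihood 0.0333.
site_likelihoods = [0.0333] * 1206

# The raw product is far too small for double precision and underflows to 0.0.
tree_likelihood = prod(site_likelihoods)

# Summing the site log-likelihoods stays well within range.
tree_lnL = sum(log(site_l) for site_l in site_likelihoods)

print(tree_likelihood)
print(f"lnL = {tree_lnL:.2f}")
```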

Long-branch attraction

Sequence at root: AGACTGATCGAATCGATTAG

Sequence at 1: ATACGGACAGAACGGTTAAG
Sequence at 2: AGACTGATCGATTCGATTAG
Sequence at 3: AGAATGATCGATTCGATTAG
Sequence at 4: CGAATGATCGAATGGACTTG

[Figure: simulation tree for taxa 1 to 4, with changes marked as true synapomorphy, back mutation or parallel change, shown alongside the parsimony reconstruction]

[Figure: maximum likelihood tree and maximum parsimony tree for the chloroplast genome sequences of Goremykin (Molec. Biol. Evol., 2005)]

Under the correct model, maximum likelihood will account for parallel changes and back mutations

Major programs supporting maximum likelihood analysis

• PHYLIP (http://evolution.gs.washington.edu/phylip/software.html)

• PAUP* (http://paup.csit.fsu.edu)

• PHYML (http://atgc.lirmm.fr/phyml/)

• PAML (http://abacus.gene.ucl.ac.uk/software/paml.html)

• TREE-PUZZLE (http://www.tree-puzzle.de)

• DAMBE (http://aix1.uottawa.ca/~xxia/software/software.htm)

General time-reversible (GTR+I+Γ) likelihood parameters

Base frequencies: πA πG πC πT (3 free parameters)

Substitution rates: A-C, A-G, A-T, C-G, C-T, G-T (5 free parameters)

Proportion of invariant sites: I (1 free parameter)

Shape of the Γ distribution: (1 free parameter)

Branch-lengths: (2n-3 free parameters on unrooted trees of n taxa)
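The parameter count grows with the number of taxa through the branch-lengths. A small sketch of the tally listed above (the function name is an illustrative choice):

```python
def gtr_i_gamma_free_parameters(n_taxa):
    """Count free parameters for GTR+I+Γ on an unrooted tree of n_taxa sequences."""
    base_frequencies = 3      # four frequencies constrained to sum to 1
    substitution_rates = 5    # six rates, one fixed as the reference
    invariant_sites = 1       # proportion of invariant sites, I
    gamma_shape = 1           # shape of the Γ distribution
    branch_lengths = 2 * n_taxa - 3
    return (base_frequencies + substitution_rates + invariant_sites
            + gamma_shape + branch_lengths)

for n in (4, 8, 16):
    print(n, "taxa:", gtr_i_gamma_free_parameters(n), "free parameters")
```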

Substitution model categories

Unequal base frequencies:
GTR: 6 substitution types
TrN: 3 substitution types (transversion, 2 transitions)
HKY85 / F84: 2 substitution types (transversion, transition)
F81: single substitution type

Equal base frequencies:
SYM: 6 substitution types
TrN: 3 substitution types (transition, 2 transversions)
K2P: 2 substitution types (transversion, transition)
JC: single substitution type

Note: there are also models for codon and amino acid data

Which is the most appropriate model?

Too few parameters can lead to inaccuracy, convergence upon the wrong tree (inconsistency)

Too many parameters can reduce statistical power, the ability to reject an hypothesis

Likelihood ratio test (LRT)

The test statistic (δ) is 2(lnLmodel_1 minus lnLmodel_2)

δ is compared to a χ² distribution critical value, where the degrees of freedom is the difference in the number of free parameters being estimated between the two models.

H0: models 1 and 2 explain the data equally well

[Figure: null distribution of δ, with a critical value separating "Accept H0" from "Reject H0"]
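A likelihood ratio test can be sketched with the standard library alone. The example values are the –lnL scores from the JC versus F81 comparison in the Modeltest output in this lecture (df = 3). The helper `chi2_sf` is an illustrative implementation of the χ² upper-tail probability for whole-number degrees of freedom; in practice a statistics library would normally supply it.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df."""
    y = x / 2.0
    # Closed forms for df = 2 (even case) or df = 1 (odd case), then step df up by 2.
    if df % 2 == 0:
        a, q = 1.0, exp(-y)
    else:
        a, q = 0.5, erfc(sqrt(y))
    while a < df / 2.0:
        q += y ** a * exp(-y) / gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(neg_lnL_null, neg_lnL_alt, df):
    """Return (delta, p) where delta = 2(lnL_alt - lnL_null)."""
    delta = 2.0 * (neg_lnL_null - neg_lnL_alt)
    return delta, chi2_sf(delta, df)

delta, p = likelihood_ratio_test(28513.7910, 28409.5176, df=3)
print(f"delta = {delta:.4f}, p = {p:.3g}")
```

H0 is rejected when p falls below the chosen significance level, which is equivalent to δ exceeding the critical value.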

The Akaike Information Criterion (AIC)

AIC for each model = –2lnL + (2 × the number of free parameters)

Choose the model with the lower AIC

• Can be compared between non-nested models

• Does not assume a χ² distribution

• May tend to over-parameterization more than LRT
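A minimal AIC comparison, using the –lnL scores for JC and F81 from the Modeltest output in this lecture. One assumption: only substitution-model parameters are counted (0 for JC, 3 for F81), because the branch-length parameters are the same for both models and so do not change which AIC is lower.

```python
def aic(neg_lnL, n_free_parameters):
    """AIC = -2 lnL + 2k, written in terms of the -lnL score."""
    return 2.0 * neg_lnL + 2.0 * n_free_parameters

candidates = {
    "JC":  aic(28513.7910, 0),
    "F81": aic(28409.5176, 3),
}

# Choose the model with the lower AIC.
best = min(candidates, key=candidates.get)
for name, score in candidates.items():
    print(f"{name}: AIC = {score:.4f}")
print("preferred model:", best)
```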

How well does the model reflect the substitution process?

Parametric bootstrap: compares the observed sequence data with data predicted from the model (observed vs. expected site pattern frequencies)

1 AGCA
2 AGAT
3 TGAT
4 TGCT

The test statistic (δ) = likelihood ratio between the multinomial likelihood T(X) and the standard substitution model likelihood

$T(X) = \sum_{i=1}^{n} N_i \ln(N_i) - N \ln(N)$

i is the ith unique site pattern, Ni is the number of times that pattern appears, n is the number of unique site patterns and N is the total number of sites.

1. Calculate the test statistic δO for the observed data

2. Simulate many pseudoreplicate datasets using the ML topology, branch lengths and model parameters

3. Calculate the test statistic δi for each of the pseudoreplicates

4. If δO is greater than (e.g.) 95% of the ranked values of δi then the null hypothesis is rejected
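The multinomial term T(X) depends only on how often each site pattern occurs, so it can be computed directly from an alignment. A sketch using the four-sequence example above (the function name is illustrative):

```python
from collections import Counter
from math import log

def multinomial_log_likelihood(sequences):
    """T(X) = sum_i N_i ln(N_i) - N ln(N) over the unique site patterns."""
    patterns = Counter(zip(*sequences))   # one tuple of bases per alignment column
    n_sites = sum(patterns.values())      # N, the total number of sites
    return sum(n_i * log(n_i) for n_i in patterns.values()) - n_sites * log(n_sites)

alignment = ["AGCA", "AGAT", "TGAT", "TGCT"]
print(f"T(X) = {multinomial_log_likelihood(alignment):.3f}")
```

In this tiny alignment every column is a different pattern, so each N_i is 1 and T(X) reduces to -N ln(N). The substitution-model likelihood needed to complete δ comes from the ML analysis itself and is not computed here.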

[Figure: histogram of δi for the pseudoreplicates (x-axis 105 to 145, y-axis frequency), with δO = 126.7 and p = 0.317]

Maximum likelihood analysis is computationally expensive

So for ML on large datasets it is not feasible to use non-parametric bootstrapping, i.e. to reconstruct the tree from many resamplings of the observed data (draw, with replacement, n nucleotide sites, where n = sequence length)
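The resampling step itself is cheap; the cost lies in repeating the tree search on every pseudoreplicate. A sketch of the resampling only, with an arbitrary seed and replicate count, reusing the Oak, Ash and Maple sequences from earlier in the lecture as stand-in data:

```python
import random

def bootstrap_alignment(sequences, rng):
    """Draw n sites with replacement, where n is the sequence length."""
    n = len(sequences[0])
    chosen = [rng.randrange(n) for _ in range(n)]
    # Apply the same column choices to every sequence so that sites stay aligned.
    return ["".join(seq[i] for i in chosen) for seq in sequences]

rng = random.Random(1)   # fixed seed so the sketch is repeatable
alignment = ["ATGACCGCTGCCAG", "ACGCTCGCCATCAG", "ATGCTCGCTACCGG"]
pseudoreplicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(pseudoreplicates[0])
```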

Time (t) for one ML (GTR+I+Γ) heuristic search on an X-taxon alignment, 3425 nt in length (using a Pentium 4 processor)

Taxa (X)   Parameters estimated   Parameters fixed
5          46s                    0.3s
6          4m 37s                 1.3s
7          15m 58s                2.5s
8          39m 16s                5.5s

Accounting for stochastic (sampling) error

Flip 2 coins 100 times each: does coin A give more tails than coin B?

Can the difference in likelihood between two hypotheses be explained by sampling error?

Compare with a null distribution

[Figure: null distribution, x-axis: proportion of tails]

Choosing just a finite number of nucleotide sites (e.g. 500) also has the problem of sampling error

Null distribution for the likelihood ratio statistic

Here δ = lnLT1 minus lnLT2

[Figure: null distribution of δ, with critical values separating "Accept H0" from "Reject H0"]

Maximum likelihood Hypothesis testing

If comparing a small number of alternative trees, ML hypothesis tests allow full parameter optimisation

Shimodaira-Hasegawa (SH) test

Kishino-Hasegawa (KH) test

Approximately unbiased (AU) test

Swofford, Olsen, Waddell, Hillis (SOWH) test

Winning sites tests

Parametric bootstrap test

Kishino-Hasegawa test

H0: T1 and T2 explain the data equally well; i.e. δ = 0

[Figure: two alternative topologies for taxa 1 to 4, T1 and T2]

1. Calculate the likelihood ratio statistic δ between T1 and T2

2. Bootstrap the data (or site likelihoods) to generate pseudoreplicates

3. Optimise ML on each pseudoreplicate for T1 and T2

4. Calculate δi for each pseudoreplicate

5. Centre the distribution by subtracting the mean of δi from each value of δi

6. If δ lies outside the 2.5% to 97.5% range of the ranked distribution of δi, then H0 is rejected
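The steps of the KH test can be sketched using the site-likelihood shortcut from step 2. Assumptions to note: the per-site log-likelihoods below are randomly generated placeholders, not real data; step 3 (re-optimising on each pseudoreplicate) is skipped, because resampling site log-likelihoods reuses the values from the original fit; and the two-sided p-value is computed as the share of centred δi at least as extreme as δ.

```python
import random

def kh_test(site_lnL_t1, site_lnL_t2, n_replicates=1000, seed=0):
    """Two-sided KH test by resampling per-site log-likelihood differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(site_lnL_t1, site_lnL_t2)]
    n = len(diffs)
    delta = sum(diffs)   # step 1: delta = lnL(T1) - lnL(T2)

    # Steps 2 and 4: resample sites with replacement and sum each pseudoreplicate.
    deltas = [sum(rng.choice(diffs) for _ in range(n)) for _ in range(n_replicates)]

    # Step 5: centre the distribution by subtracting its mean.
    mean = sum(deltas) / n_replicates
    centred = [d - mean for d in deltas]

    # Step 6: share of centred values at least as extreme as the observed delta.
    p = sum(abs(d) >= abs(delta) for d in centred) / n_replicates
    return delta, p

# Placeholder site log-likelihoods for two trees over 500 sites.
demo_rng = random.Random(1)
t1 = [-demo_rng.uniform(1.0, 3.0) for _ in range(500)]
t2 = [x + demo_rng.gauss(0.0, 0.2) for x in t1]
delta, p = kh_test(t1, t2)
print(f"delta = {delta:.3f}, p = {p:.3f}")
```

With a 5% two-sided test, H0 is rejected when p < 0.05, which corresponds to δ lying outside the 2.5% to 97.5% range of the centred distribution.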

Shimodaira-Hasegawa test

Corrects for comparing multiple topologies

H0: That all topologies are equally good explanations of the data

HA: That some topologies do not explain the data as well as others

Approximately unbiased test

Generates pseudoreplicates that differ in length from the original dataset, in order to explore site-pattern space. This allows more accurate correction for comparing multiple topologies

H0 and HA as above

SOWH test

H0: That T1 is the true tree

HA: That some other tree is the true tree

[Figure: the hypothesised topology T1 and the maximum likelihood topology TML, each for taxa 1 to 4]

1. Optimise T1 and TML on the original data and calculate the likelihood ratio δ

2. Simulate many datasets using T1 (topology, branch-lengths, substitution parameters)

3. For each dataset (i), optimise the likelihood for T1 to give L1i and find TML to give LMLi

4. Calculate δi for each L1i, LMLi pair

5. If δ is greater than 95% of the ranked values of δi, reject H0
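The final ranking step of the SOWH test is a one-sided comparison of the observed δ against the simulated δi. A sketch of that step only; the δi values here are made-up placeholders, since producing real ones requires simulating under T1 and re-optimising both T1 and TML on every dataset.

```python
def sowh_p_value(delta_observed, simulated_deltas):
    """Share of simulated delta_i values at least as large as the observed delta."""
    exceed = sum(d >= delta_observed for d in simulated_deltas)
    return exceed / len(simulated_deltas)

# Hypothetical delta_i values standing in for the output of steps 2 to 4.
simulated = [0.4, 1.1, 0.0, 2.3, 0.7, 0.2, 1.8, 0.9, 0.5, 3.1]
p = sowh_p_value(2.0, simulated)
print(p, "reject H0" if p < 0.05 else "do not reject H0")
```

Rejecting when p < 0.05 is the same as rejecting when δ is greater than 95% of the ranked δi values. A real analysis would use many more than ten simulated datasets.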

Which hypothesis test to use?

If just pairwise hypothesis testing and neither tree is a priori known to be the ML tree, then use the KH test

If comparing many topologies simultaneously and the curvature of site-pattern space can be defined, then use the AU test; otherwise, use the SH test (which is very conservative, and whose power to reject H0 is dependent on the number of topologies tested)

SOWH tests the full phylogenetic model (topology, branch-lengths, substitution parameters), so can be difficult to interpret when the object is specifically to compare topologies. If the model is misspecified it will not be conservative enough.

A maximum likelihood analysis: What are the phylogenetic affinities of turtles?

Turtles and many early reptilian groups

Squamates: e.g. lizards/snakes

Some marine reptiles, derived from diapsids

Mammals

Turtle placement hypotheses

H1: Anapsida
H2: Diapsida
H3: Archosauria

[Figure: tree of Amphibia (outgroup), Mammalia, Squamata, Aves and Crocodilia, showing the three hypothesised placements of turtles]

Complete mitochondrial genome RNA sequences: 3,110 nucleotides

Modeltest (Posada and Crandall, Bioinformatics, 1998)

Hierarchical Likelihood Ratio Tests (hLRTs)

Equal base frequencies
Null model = JC             -lnL0 = 28513.7910
Alternative model = F81     -lnL1 = 28409.5176
2(lnL1-lnL0) = 208.5469     df = 3
P-value = <0.000001

Optimized model: on LRT → TrN+I+Γ; on AIC → GTR+I+Γ

Base frequencies = (0.3546 0.2105 0.1780 0.2569)
Rate matrix = (1.0000 5.3176 1.0000 1.0000 8.7021 1.0000)
Rates=gamma
Pinvar=0.2284
Shape=0.9845

ML (TrN+I+Γ) heuristic search

[Figure: ML tree of Dog, 3-toed Sloth, Kangaroo, Opossum, Platypus, Echidna, Skink, Iguana, Green Turtle, Painted Turtle, Alligator, Caiman, Cassowary, Penguin, Caecilian and Salamander, with clades labelled Archosauria, Diapsida, Mammalia and Amphibia. Scale bar: 0.05 substitutions/site]

[Figure: the three candidate topologies (Tree 1, Tree 2, Tree 3) relating Birds, Crocodilia, Turtles, Squamates, Mammalia and Amphibia, with non-parametric bootstrap support values (88, 98)]

        Tree1     Tree2     Tree3
-lnL    +36.1     +11.7     <best>
KH      0.002     0.044     --
SH      0.003     0.153     --
AU      0.001     0.054     --
SOWH    <0.001    <0.001    --

• Turtles are not a remnant of early anapsid reptiles, instead they group within Diapsida

• The anapsid condition must be a reversal in turtles, as has occurred convergently with other armoured diapsid groups

• Within Diapsida, turtles are likely sister to Archosauria (inc. birds and crocodiles); this explains the missing 80 Ma fossil record of turtles (prior to 230 Ma)

Ankylosaurus
