Bioinformática: Inferência filogenética WHY - rcastilho.ptrcastilho.pt/BI2017/Main_files/BIOINFO_2017.pdf · Uses of phylogenies: Co-evolution • Compare divergence patterns in

Bioinformática: Inferência filogenética

WHY DO WE CARE?

Rita Castilho, [email protected]

What for?

Molecular systematics

Evolution

Ecology

Forensics

Medicine (virus evolution, vaccines, drug development)

Uses of phylogenies: Systematics

• Similar organisms are grouped together

• Clades share common evolutionary history

• Phylogenetic classification names clades

Source: Inoue, J.G., Miya, M., Tsukamoto, K., Nishida, M. 2003. Basal actinopterygian relationships: a mitogenomic perspective on the phylogeny of the “ancient fish”. Molecular Phylogenetics and Evolution, 26: 110-120.

Pryer et al. 2001

Uses of phylogenies: Character evolution

What did the ancestral Darwin's Finch eat?

Example of correlated character evolution

Granivore Insectivore Folivore

MP

Schluter et al. 1997

Uses of phylogenies: Ecology

• Study the evolution of ecological interaction and behavior

• Why might two related species have a different ecology?

• e.g. social vs. solitary, drought tolerant vs. mesophytic, parasitic vs. free living, etc.

• What are the causes of these differences?

• Is the environment causing these differences?

• Can we infer which condition is ancestral?

Examples of phylogenetic ecology

Evolutionary ecology of mate choice in swordtail fish (genus Xiphophorus)

Morris et al. 2003

Uses of phylogenies: Co-evolution

• Compare divergence patterns in two groups of tightly linked organisms (e.g. hosts and parasites or plants and obligate pollinators)

• Look at how similar the two phylogenies are

• Look at host switching

• Evolutionary arms races

• Traits in one group track traits in another group

• e.g. toxin production and resistance in prey/predator or plant/herbivore systems, floral tube and proboscis length in pollination systems

Example of host-parasite phylogeny

Uses of phylogenies: Phylogenetic geography

• Sometimes called historical biogeography or phylogeography

• Map the phylogeny with geographical ranges of populations or species

• Understand geographic origin and spread of species

• Look at similarities between unrelated organisms

• Understand repeated patterns in distributions

• e.g. identifying glacial refugia

Example of phylogeography

Independent sites of pig domestication

Larson et al. 2005

Example of phylogeography

The highly diversified genus Conus includes more than 500 species of venomous marine snails

All endemic Cape Verde Conus:

• are vermivorous (prey on polychaete annelids)

• are nonplanktonic lecithotrophic

Example of phylogeography

[Figure: phylogeny of Cape Verde Conus, with clades of “small” shelled species and “large” shelled species]

Divergence time estimates indicate a double colonization of the archipelago: the ancestors of “large” and “small” shelled species arrived around 16.5 and 4.6 MYA, respectively

Main cladogenetic events in Cape Verde Conus are consistent with geological dates and eustatic sea-level changes (Haq et al., 1987)

[Figure: timeline marking 16.5 MYA and 4.6 MYA, with sea-level changes of -80 m (10.5 MYA), -50 m (5.5 MYA) and -30 m (3.8 MYA)]

2,048 bp - mitochondrial genes (12S rRNA, trnaV, 16S rRNA, cob)

ML

Uses of phylogenies: Estimating Divergence Times

• Estimate when a group of organisms originated

• Uses information about phylogeny and rates of evolutionary change to place timescales on tree

• Needs calibration with fossils

• Combined with mapping characters, correlate historical events with character evolution

• e.g. Radiation of flowering plants in the Cretaceous

Example of timescales on phylogenies

Timing the evolution of sociality in sweat bees to a warm period in geologic history

Brady et al. 2006


Uses of phylogenies: Medicine

• Learn about the origin of diseases

• Look for disease resistance mechanisms in other hosts to identify treatment and therapy in humans

Multiple origins of HIV from SIV (Simian Immunodeficiency Virus)

From: Understanding Evolution. HIV: the ultimate evolver. http://evolution.berkeley.edu/evolibrary/article/0_0_0/medicine_04

Severe acute respiratory syndrome

Example of disease phylogeny

Wendong et al. 2005

Methicillin-resistant Staphylococcus aureus

[Figure legend: Asia, Europe, South America, Australasia, North America]

Example of disease phylogeny

Harris et al. 2010

Example of medical forensics

• A dentist who was infected with HIV was suspected of infecting some of his patients in the course of treatment

• HIV evolves very quickly (about 10^-3 substitutions per site per year)

• Possible to trace the history of infections among individuals by conducting a phylogenetic analysis of HIV sequences

• Samples were taken from dentist, patients, and other infected individuals in the community

• Study found 5 patients had been infected by the dentist

Source: Ou et al. 1992. Molecular epidemiology of HIV transmission in a dental practice. Science, 256: 1165-1171.

Example 2

Phylogeny and molecular evolution = determining the common origin (ancestor)

What are phylogenies for?

What is the most recent common ancestor of tetrapods?

[Figure: alternative trees placing Latimeria (coelacanth), Protopterus (lungfish) and a teleost relative to the tetrapods]

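[Figure: phylogeny of lungfishes, amphibians, mammals, lizards and snakes, crocodiles, ostriches, and hawks and other birds. The clades Tetrapods, Amniotes and Birds are marked, together with the homologous traits shared by all groups to the right of each mark (tetrapod limbs, amnion, feathers). Numbered nodes 1 to 6 mark the common ancestor of the lineages to their right.]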

Three main assumptions for phylogenetic inference

➊ All organisms have a common ancestor

➋ Evolution can be displayed in a bifurcating pattern. There are exceptions, e.g. lateral gene transfer

Clark and Messer, 2012. Science

➌ Change in characteristics happens through time

[Figure: orbit migration in Trachinatus, Amphistium/Heteronectes, Psettodes and Citharus: unmigrated orbit, orbit eclipses dorsal midline, migrated orbit]

How to build phylogenetic trees

1. Select sequences

2. Align sequences

3. Choose model and method; build tree

4. Evaluate tree (Good: go on to step 5. Needs improvement: revisit the earlier steps)

5. Interpret phylogeny

Estimating Genetic Differences

[Figure: differences between sequences (0 to 1.5) plotted against time (0 to 75), with one curve for the expected differences and one for the observed differences]

If all nucleotides are equally likely, the observed difference would plateau at 0.75

Simply counting differences underestimates distances.

It fails to account for multiple hits

[Figure: three pairs of lineages descending from the same nucleotide in the past]

• One substitution happened - one substitution is visible

• Two substitutions happened - only one substitution is visible

• Two substitutions happened - no visible substitution


Page RDM, Holmes EC (1998) Molecular Evolution: a phylogenetic approach Blackwell Science, Oxford.
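The plateau at 0.75 can be checked numerically. The sketch below assumes the Jukes-Cantor model (equal base frequencies, equal rates), under which the expected observed proportion of differing sites after a true distance d is p = 3/4 (1 - e^(-4d/3)). The function name is illustrative and not part of the slides.

```python
import math

def expected_observed_difference(d):
    """Expected proportion of differing sites after d substitutions per site.

    Assumes the Jukes-Cantor model (all substitutions equally likely).
    """
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# The observed difference lags further and further behind the true distance.
for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    p = expected_observed_difference(d)
    print(f"true distance {d:4.1f} -> observed difference {p:.3f}")
```

However large the true distance becomes, the observed difference never exceeds 0.75, which is why raw counts underestimate long distances.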

Models of evolution

Models of nucleotide substitution allow for the calculation of probabilities of specific base changes along a branch.

They include different parameters that aim to describe distinct aspects of the process of nucleotide substitution.


Models of evolution

Impact of models: 3 sequences

http://artedi.ebc.uu.se/course/X3-2004/Phylogeny/Exercises/nj.html

Sequence 1: AGC
Sequence 2: AAC
Sequence 3: ACC

Sequences 1 and 2 differ at 1 out of 3 positions = 1/3
Sequences 1 and 3 differ at 1 out of 3 positions = 1/3
Sequences 2 and 3 differ at 1 out of 3 positions = 1/3

     1       2       3
1    -
2    0.333   -
3    0.333   0.333   -

JC69 model (Jukes-Cantor, 1969)

http://www.bioinf.manchester.ac.uk/resources/phase/manual

$$d = -\frac{3}{4}\ln\left[1 - \frac{4P}{3}\right]$$

where P is the proportion of nucleotides that are different in the two sequences (the observed differences above) and ln is the natural log function. To calculate the JC distances from the observed differences above:

$$d = -\frac{3}{4}\ln\left[1 - \frac{4(1/3)}{3}\right] \approx 0.44$$

AGC AAC ACC

Observed differences

     1       2       3
1    -
2    0.333   -
3    0.333   0.333   -

JC69 distances

     1       2       3
1    -
2    0.44    -
3    0.44    0.44    -
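As a minimal sketch, the JC69 correction can be written as a small function. The function name and the range check are illustrative additions, not part of the slides.

```python
import math

def jc69_distance(p):
    """Jukes-Cantor (1969) distance from the observed proportion of differences p."""
    if not 0.0 <= p < 0.75:
        # The logarithm is undefined once p reaches the 0.75 plateau.
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# One difference in three positions, as in the AGC / AAC / ACC example.
print(round(jc69_distance(1 / 3), 3))
```

The corrected distance is larger than the raw 0.333 because the model allows for substitutions that are no longer visible.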


Models of evolution

K80 model (Kimura, 1980) or Kimura 2P

Kimura's Two Parameter model (K2P) incorporates the observation that the rate of transitions per site (a) may differ from the rate of transversions (b), giving a total rate of substitutions per site of (a + 2b) (there are three possible substitutions: one transition and two transversions). The transition:transversion ratio a/b is often represented by the letter kappa (k).

In the K2P model the number of nucleotide substitutions per site is given by:

$$d = \frac{1}{2}\ln\left[\frac{1}{1 - 2P - Q}\right] + \frac{1}{4}\ln\left[\frac{1}{1 - 2Q}\right]$$

where P and Q are the proportional differences between the two sequences due to transitions and transversions, respectively.

AGC AAC ACC
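The K2P formula can be sketched in code as follows. This assumes two aligned sequences of equal length containing only A, C, G and T; the function and constant names are illustrative.

```python
import math

PURINES = {"A", "G"}  # C and T are pyrimidines

def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (same length)")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # A change within purines or within pyrimidines is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    P = transitions / len(seq1)
    Q = transversions / len(seq1)
    return 0.5 * math.log(1 / (1 - 2 * P - Q)) + 0.25 * math.log(1 / (1 - 2 * Q))

print(round(k2p_distance("AGC", "AAC"), 3))  # one transition (G/A)
print(round(k2p_distance("AGC", "ACC"), 3))  # one transversion (G/C)
```

A single transition and a single transversion give different distances here, which the JC69 model cannot distinguish.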

K80 model (Kimura, 1980) or Kimura 2P

Sequences 1 and 2 (AGC, AAC) differ by one transition
Sequences 1 and 3 (AGC, ACC) differ by one transversion
Sequences 2 and 3 (AAC, ACC) differ by one transversion

Observed differences

     1       2       3
1    -
2    0.333   -
3    0.333   0.333   -

Jukes-Cantor model

     1       2       3
1    -
2    0.441   -
3    0.441   0.441   -

Kimura 2P

     1       2       3
1    -
2    0.549   -
3    0.477   0.477   -
Note how applying different models gives different distances

Estimating Genetic Differences

[Figure: differences between sequences plotted against time, marking the observed difference (0.333), the JC distance (0.441) and the K2P distances (0.477-0.549)]

Molecular Clock

The molecular clock hypothesis proposes that, for any given protein, the rate of molecular evolution is approximately constant over time in all lineages.

Molecules reflect evolutionary divergence

[Figure: A molecular clock for cytochrome c. Amino acid substitutions (per 100 residues) in cytochrome c plotted against time since divergence (millions of years) for these comparisons: yeast vs mould; angiosperms vs animals; insects vs vertebrates; fish vs land vertebrates; amphibians vs birds and mammals; birds vs mammals; mammals vs reptiles; birds vs reptiles]

Bioinformática: Inferência filogenética

HOW DO WE DO IT?

ASSUMPTION: LIFE IS MONOPHYLETIC

[Figure: an ancestor giving rise to descendant 1 and descendant 2 along a time axis]

Reading trees

[Figure: a tree with tips A, B, C and D, repeated with each part highlighted in turn]

• “Root” (ancestral node or root of the tree): common ancestor of organisms in the phylogeny

• Internal branch (branches or lineages): common ancestor of a subset of species in the tree

• “Node” (internal nodes or points of divergence; they represent hypothetical ancestors of the taxa): point of divergence of two species

• “Leaf” (terminal nodes): terminal branch leading to a species

• Clade: group of species descended from a common ancestor

[Figure: three trees of taxa A to E: a star phylogeny (no resolution, a polytomy), a partially resolved phylogeny, and a fully resolved phylogeny (bifurcations only)]

Phylogenetic inference resolves the association order of lineages

Phylogenies = Evolutionary relationships

((A,(B,C)),(D,E)) = phylogeny

[Figure: tree with tips Taxon A, Taxon B, Taxon C, Taxon E and Taxon D]

• B and C are closest to each other; their sister clade is A

• A, B and C form a clade; its sister clade is D and E

This dimension (branch length) can:
• be proportional to genetic distance (differences) = phylogram or additive tree;
• be proportional to time = ultrametric tree;
• have no scale whatsoever.

If there were a temporal or genetic scale, then D and E would be the most closely related taxa and the ones that diverged most recently

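All of these rearrangements show the same evolutionary relationships between the taxa

[Figure: the same tree of taxa A, B, C and D drawn several times with its branches rotated around the nodes]

Mobiles

[Figure: the three possible unrooted trees for the four taxa A, B, C and D, labeled Tree 1, Tree 2 and Tree 3]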

Phylogenetic tree building (or inference) methods are aimed at discovering which of the possible unrooted trees is "correct".

We would like this to be the “true” biological tree — that is, one that accurately represents the evolutionary history of the taxa. However, we must settle for discovering the computationally correct or optimal tree for the phylogenetic method of choice.

The number of unrooted trees increases in a greater than exponential manner with the number of taxa

# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
...
30            8.69 x 10^36

(2N - 5)!! = # unrooted trees for N taxa

[Figure: unrooted trees for three, four, five and six taxa (A to F)]

An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees

[Figure: unrooted tree 1 (taxa A, B, C, D) with its five branches numbered 1 to 5 as possible roots, and the corresponding rooted trees 1 to 5]

These trees show five different evolutionary relationships among the taxa!

Each unrooted tree theoretically can be rooted anywhere along any of its branches

[Figure: unrooted trees for four, five and six taxa with the possible root positions on their branches]

Taxa   Unrooted trees   x roots   Rooted trees
3      1                3         3
4      3                5         15
5      15               7         105
6      105              9         945
7      945              11        10,395
8      10,395           13        135,135
9      135,135          15        2,027,025
30     8.69 x 10^36     57        4.95 x 10^38

For 10 sequences there are more than 34 million rooted trees

For 20 sequences there are about 8,200,794,532,637,891,559,000 trees.

In a recent study of 135 human mtDNA sequences there were potentially 2.113 x 10^267 trees.

This number is larger than the number of known particles in the universe!
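The tree counts above can be reproduced with a short sketch. It uses the (2N - 5)!! formula for unrooted trees and multiplies by the 2N - 3 branches on which a root can be placed (the “x roots” column). The function names are illustrative.

```python
def double_factorial(n):
    """Product n * (n - 2) * (n - 4) * ... down to 1 (returns 1 for n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def unrooted_trees(n_taxa):
    """Number of bifurcating unrooted trees for n_taxa >= 3: (2N - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)

def rooted_trees(n_taxa):
    """Each unrooted tree has 2N - 3 branches, and each branch can carry the root."""
    return unrooted_trees(n_taxa) * (2 * n_taxa - 3)

for n in (3, 4, 5, 6, 7, 8, 9, 10, 20, 30):
    print(n, unrooted_trees(n), rooted_trees(n))
```

Python integers have arbitrary precision, so the counts stay exact even for 30 taxa.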

[Figure: mid-point rooting and outgroup rooting]

[Figure: cladogram of species A to K with Grouping 1, Grouping 2 and Grouping 3 marked]

Monophyletic. In this tree, grouping 1, consisting of the seven species B–H, is a monophyletic group, or clade. A monophyletic group is made up of an ancestral species (species B in this case) and all of its descendant species. Only monophyletic groups qualify as legitimate taxa derived from cladistics.

Paraphyletic. Grouping 2 does not meet the cladistic criterion: It is paraphyletic, which means that it consists of an ancestor (A in this case) and some, but not all, of that ancestor’s descendants. (Grouping 2 includes the descendants I, J, and K, but excludes B–H, which also descended from A.)

Polyphyletic. Grouping 3 also fails the cladistic test. It is polyphyletic, which means that it lacks the common ancestor (A) of the species in the group. Furthermore, a valid taxon that includes the extant species G, H, J, and K would necessarily also contain D and E, which are also descended from A.

Phylogenetic Methods

Distance:
• Tree based on pairwise distances between sequences
• Evolutionary models applied to pairwise distances to account for multiple substitutions per site and rate heterogeneity among sites

Parsimony:
• Minimize the number of substitutions
• Assumes sites are independent
• Assumes 1 substitution per site

Maximum Likelihood:
• Maximize probability of sequences given tree
• Evolutionary models applied to each position to account for multiple substitutions per site and rate heterogeneity among sites
• Gives single tree with highest likelihood
• Assumes sites are independent

Bayesian:
• Maximize posterior probability of tree given sequences
• Evolutionary models applied to each position to account for multiple substitutions per site and rate heterogeneity among sites
• Integrates over all trees
• Assumes sites are independent

Molecular phylogenetic tree building methods

These are mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available today, each having strengths and weaknesses. Most can be classified as follows:

                COMPUTATIONAL METHOD
DATA TYPE       Clustering algorithm        Optimality criterion
Characters      -                           PARSIMONY, MAXIMUM LIKELIHOOD
Distances       UPGMA, NEIGHBOR-JOINING     MINIMUM EVOLUTION, LEAST SQUARES

Methods of reconstructing phylogenies (evolutionary trees)

Distance matrix methods. Tree that best predicts the entries in a table of pairwise distances among species.

Parsimony methods. Tree that allows evolution of the sequences with the fewest changes is preferred.

Maximum likelihood. Tree that has highest probability that the observed data would evolve.

Distance method

UPGMA (Unweighted Pair Group Method with Arithmetic Mean)

Distance-based methods: Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.

Seq   sites
1     T T A T T A A
2     A A T T T A A
3     A A A A A T A
4     A A A A A A T

Distances

     1   2   3   4
1    -
2    3   -
3    5   4   -
4    5   4   2   -
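The transformation from sequences to pairwise distances can be sketched as a simple count of differing sites (uncorrected, with no substitution model). The variable and function names are illustrative.

```python
from itertools import combinations

# The four 7-site sequences from the table above.
sequences = {1: "TTATTAA", 2: "AATTTAA", 3: "AAAAATA", 4: "AAAAAAT"}

def hamming(a, b):
    """Number of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned (same length)")
    return sum(x != y for x, y in zip(a, b))

distances = {
    (i, j): hamming(sequences[i], sequences[j])
    for i, j in combinations(sorted(sequences), 2)
}

for pair, d in distances.items():
    print(pair, d)
```

Once the matrix is built, the tree-building step uses only these numbers and no longer looks at the individual sites.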

Main methods of molecular phylogeny: UPGMA

Construction of a distance tree using clustering with the Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

From http://www.icp.ucl.ac.be/~opperd/private/upgma.html

First, construct a distance matrix:

A   GCTTGTCCGTTACGAT
B   ACTTGTCTGTTACGAT
C   ACCTGTCCGAAACGAT
D   ACTTGACCGTTTCCTT
E   AGATGACCGTTTCGAT
F   ACTACACCCTTATGAG

     A   B   C   D   E   F
A    -
B    2   -
C    4   4   -
D    6   6   6   -
E    6   6   6   4   -
F    8   8   8   8   8   -

First round

Choose the most similar pair (A and B, distance 2), cluster them together and calculate the new distance matrix.

dist(A,B),C = (distAC + distBC) / 2 = 4
dist(A,B),D = (distAD + distBD) / 2 = 6
dist(A,B),E = (distAE + distBE) / 2 = 6
dist(A,B),F = (distAF + distBF) / 2 = 8

       A,B   C   D   E   F
A,B    -
C      4     -
D      6     6   -
E      6     6   4   -
F      8     8   8   8   -

Second round

The most similar pair is now D and E (distance 4).

dist(D,E),(A,B) = (distD(AB) + distE(AB)) / 2 = 6
dist(D,E),C = (distDC + distEC) / 2 = 6
dist(D,E),F = (distDF + distEF) / 2 = 8

       A,B   C   D,E   F
A,B    -
C      4     -
D,E    6     6   -
F      8     8   8     -

Third round

Cluster (A,B) with C (distance 4).

dist(A,B,C),DE = (distABDE + distCDE) / 2 = 6
dist(A,B,C),F = (distABF + distCF) / 2 = 8

         A,B,C   D,E   F
A,B,C    -
D,E      6       -
F        8       8     -

Fourth round

Cluster (A,B,C) with (D,E) (distance 6).

                (A,B,C)(D,E)   F
(A,B,C)(D,E)    -
F               8              -

Fifth round

Finally, join F (distance 8) to the cluster (A,B,C)(D,E).

Note that this method identifies the root of the tree
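The clustering rounds can be sketched in code. This is a minimal UPGMA that uses the standard size-weighted average when merging clusters; on this matrix it gives the same numbers as the simple averages shown in the rounds. Function and variable names are illustrative. Ties are broken alphabetically, so (A,B) joins C before D joins E, which does not change the resulting tree.

```python
def upgma(names, dist):
    """Return the list of merges as (cluster label, node height) tuples.

    dist maps a pair of labels to their distance; the pair order is irrelevant.
    """
    sizes = {name: 1 for name in names}              # label -> number of leaves
    d = {frozenset(k): v for k, v in dist.items()}
    merges = []
    while len(sizes) > 1:
        # Closest pair; ties are broken alphabetically for reproducibility.
        pair = min(d, key=lambda p: (d[p], sorted(p)))
        a, b = sorted(pair)
        new = f"({a},{b})"
        height = d[pair] / 2                          # node sits halfway between the clusters
        for c in sizes:
            if c in pair:
                continue
            dac = d.pop(frozenset((a, c)))
            dbc = d.pop(frozenset((b, c)))
            # Arithmetic mean over all leaves of the merged cluster.
            d[frozenset((new, c))] = (sizes[a] * dac + sizes[b] * dbc) / (sizes[a] + sizes[b])
        del d[pair]
        sizes[new] = sizes.pop(a) + sizes.pop(b)
        merges.append((new, height))
    return merges

names = list("ABCDEF")
rows = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
dist = {(row, names[i]): v for row, vals in rows.items() for i, v in enumerate(vals)}

merges = upgma(names, dist)
for label, height in merges:
    print(label, height)
```

The last merge carries the whole tree in nested-parenthesis form, and its height is the position of the root.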


UPGMA fails when rates of evolution are not constant

A tree in which the evolutionary rates are not equal

From http://www.icp.ucl.ac.be/~opperd/private/upgma.html

     A   B   C   D   E
B    5
C    4   7
D    7  10   7
E    6   9   6   5
F    8  11   8   9   8

Distance method: NJ

Main methods of molecular phylogeny: NJ

The neighbor-joining method of Saitou and Nei (1987) is especially useful for making a tree with a large number of taxa.

Begin by placing all the taxa in a star-like structure.

Making trees using neighbor-joining

Shortest pairs are chosen to be neighbors and then joined in distance matrix as one OTU.


Tree-building methods: Neighbor-joining

Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.


Tree-building methods: Neighbor joining

Define the distance from X to Y by

dXY = 1/2(d1Y + d2Y – d12)


At each step, the neighbor-joining method joins the two closest sub-trees that are not already joined. It is based on the minimum evolution principle. One of the important concepts in the NJ method is that of neighbors, defined as two taxa that are connected by a single node in an unrooted tree.

[Figure: neighbors A and B connected through Node 1; star tree joining A, B, C, D, E and F]

     A   B   C   D   E
B    5
C    4   7
D    7  10   7
E    6   9   6   5
F    8  11   8   9   8

We have in total 6 OTUs (N=6).

Step 1: We calculate the net divergence r(i) for each OTU from all other OTUs

r(A) = 5+4+7+6+8 = 30
r(B) = 42
r(C) = 32
r(D) = 38
r(E) = 34
r(F) = 44

Step 2: Now we calculate a new distance matrix using for each pair of OTUs the formula:

M(ij) = d(ij) - [r(i) + r(j)]/(N-2)

or in the case of the pair A,B:

M(AB) = d(AB) - [r(A) + r(B)]/(N-2) = 5 - [30 + 42]/(6-2) = -13
M(AC) = 4 - [30 + 32]/(6-2) = -11.5
M(AD) = 7 - [30 + 38]/(6-2) = -10
etc.

     A       B       C       D       E
B    -13
C    -11.5   -11.5
D    -10     -10     -10.5
E    -10     -10     -10.5   -13
F    -10.5   -10.5   -11     -11.5   -11.5

Step 3: Now we choose as neighbors those two OTUs for which M(ij) is the smallest. These are A and B, and D and E (both -13). Let's take A and B as neighbors and form a new node called U (joining A and B).

Now we calculate the branch length from the internal node U to the external OTUs A and B.

S(AU) = d(AB)/2 + [r(A) - r(B)]/[2(N-2)] = 5/2 + [30 - 42]/[2(6-2)] = 1
S(BU) = d(AB) - S(AU) = 5 - 1 = 4

[Figure: node U joined to A (branch length 1) and B (branch length 4)]

Step 4: Now we define new distances from U to each other terminal node:

d(CU) = [d(AC) + d(BC) - d(AB)] / 2 = 3
d(DU) = [d(AD) + d(BD) - d(AB)] / 2 = 6
d(EU) = [d(AE) + d(BE) - d(AB)] / 2 = 5
d(FU) = [d(AF) + d(BF) - d(AB)] / 2 = 7

And we create a new distance matrix:

     U   C   D   E
C    3
D    6   7
E    5   6   5
F    7   8   9   8

Step 5: Now N is N-1 = 5, and the entire procedure is repeated starting at step 1

[Figure: the resulting NJ tree with its branch lengths, shown next to the UPGMA tree (with ROOT) for the same data]
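Steps 1 to 4 can be sketched for a single round of neighbor joining. The code below covers only the first join, not the full iteration, and all function and variable names are illustrative. The tie between the pairs A,B and D,E is broken alphabetically, which matches the choice made in the walk-through.

```python
from itertools import combinations

def nj_first_step(names, d):
    """One neighbor-joining round: returns r, M, the joined pair,
    the two branch lengths and the distances from the new node U."""
    n = len(names)

    def dist(a, b):
        return 0 if a == b else d[frozenset((a, b))]

    # Step 1: net divergence of each OTU.
    r = {a: sum(dist(a, b) for b in names) for a in names}
    # Step 2: rate-corrected matrix M.
    m = {frozenset((a, b)): dist(a, b) - (r[a] + r[b]) / (n - 2)
         for a, b in combinations(names, 2)}
    # Step 3: neighbors are the pair with the smallest M (alphabetical tie-break).
    pair = min(m, key=lambda p: (m[p], sorted(p)))
    a, b = sorted(pair)
    s_a = dist(a, b) / 2 + (r[a] - r[b]) / (2 * (n - 2))
    s_b = dist(a, b) - s_a
    # Step 4: distances from the new node U to the remaining OTUs.
    new_d = {c: (dist(a, c) + dist(b, c) - dist(a, b)) / 2
             for c in names if c not in pair}
    return r, m, (a, b), (s_a, s_b), new_d

names = list("ABCDEF")
rows = {"B": [5], "C": [4, 7], "D": [7, 10, 7], "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
d = {frozenset((row, names[i])): v for row, vals in rows.items() for i, v in enumerate(vals)}

r, m, pair, branches, new_d = nj_first_step(names, d)
print(r)
print(pair, branches)
print(new_d)
```

Repeating the same function on the reduced matrix (with U in place of A and B) would carry out step 5.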

Comparison of UPGMA and NJ

Distance methods: evolutionary distances (number of substitutions) are computed for all pairs of taxa.

UPGMA (unweighted pair group method with arithmetic means)
• Sequential clustering algorithm: pairs of taxa are clustered in order of decreasing similarity
• Assumes an equal rate of substitution in all lineages; the tree is therefore always rooted, since the taxon that has accumulated more substitutions is taken to be older
• If the substitution rates differ among taxa, the tree may be wrong

Neighbor joining
• Finds the shortest (minimum evolution) tree by finding neighbors that minimize the total length of the tree. Shortest pairs are chosen to be neighbors and then joined in the distance matrix as one OTU
• Does not assume that the molecular clock is constant for the sequences in the tree. If there are unequal substitution rates, the tree is more accurate than the UPGMA tree

Parsimony


Main methods of molecular phylogeny: Maximum parsimony

William of Ockham (or Occam) was a 14th-century English logician and Franciscan friar whose name is given to the principle that, when trying to choose between multiple competing theories, the simplest one is probably the best.

This principle is known as Ockham's razor.

Parsimony:
• Minimize the number of substitutions
• Assumes sites are independent
• Assumes <1 substitution per site

Species   1   2   3   4
Data      A   G   A   G

[Figure: alternative trees for species 1 (A), 2 (G), 3 (A) and 4 (G). In a tree whose two internal nodes are both reconstructed as (A or G), 2 changes are needed. In a tree whose internal nodes are reconstructed as (A) and (G), only 1 change is needed, so that tree is more parsimonious.]

Fitch (1971) Systematic Zoology 20:406-416
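The counting of changes can be sketched with the Fitch (1971) small-parsimony procedure for a single site. Trees are written as nested tuples of species numbers. Which grouping corresponds to which numbered tree in the figures is an assumption made here for illustration, as are the function and variable names.

```python
def fitch(tree, states):
    """Return (possible states at this node, minimum number of changes below it)."""
    if not isinstance(tree, tuple):                 # a leaf: its observed state, no changes
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:                                      # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

states = {1: "A", 2: "G", 3: "A", 4: "G"}

for tree in (((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))):
    node_states, changes = fitch(tree, states)
    print(tree, "changes:", changes)
```

Grouping the two species with A together and the two species with G together needs a single change, while the other two groupings need two.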


Parsimony methods

Optimality criterion: The ‘most-parsimonious’ tree is the one that requires the fewest evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.

Advantages:
• Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’).
• Can be used on molecular and non-molecular (e.g., morphological) data.
• Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy)
• Can be used for character (can infer the exact substitutions) and rate analysis
• Can be used to infer the sequences of the extinct (hypothetical) ancestors

Disadvantages:
• Not based on statistical properties
• Can be fooled by high levels of homoplasy (‘same’ events)


• Parsimony seeks solutions that minimize the amount of change required to explain the data (underestimates superimposed changes)

• ML attempts to estimate the actual amount of change (by specifying the evolutionary model that will account for the data with the highest likelihood)

• Methods that incorporate models of evolutionary change can make more efficient use of the data

Main methods of molecular phylogeny: Maximum likelihood

Maximum likelihood (ML) methods

Optimality criterion: ML methods evaluate phylogenetic hypotheses in terms of the probability that a proposed model of the evolutionary process and the proposed unrooted tree would give rise to the observed data. The tree found to have the highest ML value is considered to be the preferred tree.

Advantages:
• Are based on an explicit model of evolution.
• Usually the most ‘consistent’ of the methods available.
• Can be used for character (can infer the exact substitutions) and rate analysis.
• Can be used to infer the sequences of the extinct (hypothetical) ancestors.
• Can help account for branch-length effects.

Disadvantages:
• Are based on an explicit model of evolution.
• Are not as simple and intuitive as many other methods.
• Are computationally very intense (limits number of taxa and length of sequence).
• Slooooow!!!
• Violations of the assumed model can lead to incorrect trees.


[Figure: three dice with 6 faces, 8 faces and 12 faces]

Idea from Gavin Naylor

Roll the dice. How many points?

14 POINTS

To score 14 we need to use two dice. Which pair of dice is most likely to produce that result?

This is equivalent to asking: which tree is most likely to have yielded these sequences?

How many ways of obtaining the score “14” are there for each pair?

Pair of dice   Ways of scoring 14                       Number of ways
6 + 8          6+8                                      1
6 + 12         2+12, 3+11, 4+10, 5+9, 6+8               5
8 + 12         2+12, 3+11, 4+10, 5+9, 6+8, 7+7, 8+6     7

Probability of each combination?

Pair of dice   Probability of any single outcome
6 + 8          1/6 x 1/8 = 1/48
6 + 12         1/6 x 1/12 = 1/72
8 + 12         1/8 x 1/12 = 1/96

Main methods of molecular phylogeny: Maximum likelihood

Now multiply the number of ways of obtaining the score “14” by the probability of any single outcome to get the likelihood.

Pair of dice   Likelihood
6 + 8          1/48 x 1 = 0.0208
6 + 12         1/72 x 5 = 0.0694
8 + 12         1/96 x 7 = 0.0729

Notice that none of the likelihoods is very “likely”, but (8+12) is more likely than the other two
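The dice reasoning can be sketched by brute-force enumeration. The function names are illustrative; exact fractions are used so that the counts and probabilities stay visible.

```python
from fractions import Fraction
from itertools import combinations

def ways_to_score(total, faces_a, faces_b):
    """Number of (a, b) outcomes with a + b == total for dice with the given numbers of faces."""
    return sum(1 for a in range(1, faces_a + 1) if 1 <= total - a <= faces_b)

def likelihood(total, faces_a, faces_b):
    """Ways of obtaining the score times the probability of any single outcome."""
    return Fraction(ways_to_score(total, faces_a, faces_b), faces_a * faces_b)

dice = (6, 8, 12)
results = {pair: likelihood(14, *pair) for pair in combinations(dice, 2)}

for pair, value in results.items():
    print(pair, ways_to_score(14, *pair), value, round(float(value), 4))

best = max(results, key=results.get)
print("most likely pair:", best)
```

The pair of dice plays the role of the tree, and the score of 14 plays the role of the observed sequences.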



HOW DO WE CONVERT THIS RATIONALE TO MAXIMUM LIKELIHOOD ESTIMATIONS?

1. Calculate the likelihood for each site on a specific tree.

The likelihood of one position of the alignment, in this case position 5, is equal to the sum over all possible reconstructions at nodes 5 and 6.

The likelihood of the tree is the product of the individual likelihoods of all sites of the alignment.

The log-likelihood of the tree is the sum of the logarithms of the likelihoods of each site.

2. Sum up the (log) L values for all sites on the tree.

3. Compare the L value for all possible trees.

4. Choose the tree with the highest L value.
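Steps 1 to 4 can be sketched for four sequences. The code below makes several assumptions that are not in the slides: the JC69 model, equal base frequencies, and the same hypothetical branch length t on every branch. The tree is ((1,2),(3,4)), with internal node 5 joining tips 1 and 2 and internal node 6 joining tips 3 and 4. A real ML program would also optimize branch lengths and model parameters.

```python
import math
from itertools import product

BASES = "ACGT"

def jc69_prob(x, y, t):
    """Probability that base x ends up as base y along a branch of length t (JC69)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

def site_likelihood(tips, t=0.1):
    """Likelihood of one alignment column: sum over all reconstructions at nodes 5 and 6."""
    s1, s2, s3, s4 = tips
    total = 0.0
    for x5, x6 in product(BASES, repeat=2):
        total += (0.25                                # base frequency at node 5
                  * jc69_prob(x5, s1, t) * jc69_prob(x5, s2, t)
                  * jc69_prob(x5, x6, t)              # internal branch between nodes 5 and 6
                  * jc69_prob(x6, s3, t) * jc69_prob(x6, s4, t))
    return total

def tree_log_likelihood(alignment, t=0.1):
    """Sum of the per-site log-likelihoods; rows are ordered as tips 1, 2, 3, 4."""
    return sum(math.log(site_likelihood(column, t)) for column in zip(*alignment))

alignment = ["AAAA", "AAAA", "GGGG", "GGGG"]          # toy data, not from the slides
tree_a = tree_log_likelihood(alignment)               # groups rows (1,2) and (3,4)
tree_b = tree_log_likelihood([alignment[0], alignment[2], alignment[1], alignment[3]])
print("tree ((1,2),(3,4)):", tree_a)
print("tree ((1,3),(2,4)):", tree_b)
```

Reordering the rows is a shortcut for evaluating a different topology with the same function; the tree with the higher log-likelihood is preferred.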
