Whole Genome based Phylogeny - GoSeqIt

Whole Genome based Phylogeny

Johanne Ahrenfeldt PhD student

DTU Bioinformatics

Short about me

Johanne Ahrenfeldt

•  PhD student at DTU Bioinformatics –  Whole Genome based Phylogeny

•  Graduate Engineer in Systems Biology and Bioinformatics from Technical University of Denmark

•  Working in the CGE project since 2012 – started as a student helper

Overview

•  What is Phylogeny
•  SNP methods
–  CSI Phylogeny
•  Nucleotide Differences
–  NDtree
•  Controlled Evolution study
•  Good advice

What is phylogeny?

•  Early phylogeny
–  Classification
–  Based on phenotypes

•  Current phylogeny
–  Based on genotypes
–  DNA mutations as basis for evolution

Classification

Carl Linnaeus 1707-1778

Hierarchical system: Kingdom, Phylum, Class, Order, Family, Genus, Species

Classification depicted as a tree



Species Genus Family Order Class


Molecular Basis for Variation: DNA Mutation

DNA mutations as basis for evolution

What are phylogenetic trees

•  Phylogenetic trees are a visual representation of the genetic relationship between species

•  Think of them as family trees

•  Phylogeny can also be represented by distance matrices

What are phylogenetic trees

•  Trees were traditionally made using aligned sequences of single genes or proteins

•  Whole genome data can be used to create trees based on
–  SNP calling
–  K-mer overlap
–  Alignment of genomes

What is a SNP

•  A Single Nucleotide Polymorphism (SNP) is a DNA sequence variation occurring commonly within a population (e.g. in at least 1% of its members), in which a single nucleotide (A, T, C or G) in the genome (or other shared sequence) differs between members of a biological species or between paired chromosomes.

How does it work

Strain A  ATTCAGTAGT
Strain B  ATGCAGTTGA
Strain C  ATGCAATTGT
Strain D  ATCCATTAGC

Construct distance matrix


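[Figure: empty 4 × 4 matrix with rows and columns A, B, C and D, to be filled in with the number of differences between each pair of strains]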

Make Tree


    A  B  C  D
A   0  3  3  3
B   3  0  2  4
C   3  2  0  4
D   3  4  4  0
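The matrix above can be reproduced by counting, for every pair of strains, the positions at which the two sequences differ. The following is a minimal Python sketch of that counting step, using the four example strains from the slide; the function names `count_differences` and `distance_matrix` are illustrative and do not come from any of the tools presented here.

```python
from itertools import combinations

# The four example strains from the slide (equal length, already aligned).
strains = {
    "A": "ATTCAGTAGT",
    "B": "ATGCAGTTGA",
    "C": "ATGCAATTGT",
    "D": "ATCCATTAGC",
}


def count_differences(seq1, seq2):
    """Count positions where two aligned sequences of equal length differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def distance_matrix(seqs):
    """Return a symmetric nested dict of pairwise difference counts."""
    names = list(seqs)
    matrix = {n: {m: 0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = count_differences(seqs[n], seqs[m])
        matrix[n][m] = matrix[m][n] = d
    return matrix


matrix = distance_matrix(strains)
print("    " + "  ".join(matrix))
for name, row in matrix.items():
    print(name + "   " + "  ".join(str(v) for v in row.values()))
```

Real SNP pipelines first have to decide which positions are trustworthy before counting; this sketch assumes that step has already been done.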

[Figure: tree with tips B, D, C and A; the labelled branches each have length 1]

How to read phylogenetic trees


that would have been classified as Homo sapiens were born. However, rainbow trout and humans are contemporary species, meaning that the lineages of which they are currently terminal nodes have been evolving for exactly the same amount of time since their divergence from a distant common ancestor. As a result, any notion that one of these lineages is “more evolved” or that one has had more time to accumulate differences is flawed.

Misconception #8: Backwards Time Axes

Among the common misconceptions identified by Meir et al. (2007) was the tendency for many students to misread the time axis on evolutionary trees. Many students interpreted the location of the terminal nodes as indicating time, for example by reading from left to right or from the leftmost tip to the root. In Fig. 17a, for example, many students read time as proceeding from birds (oldest) to the root W (youngest) or from birds (oldest) to kangaroos (youngest). Neither is correct, as time extends from the root to the terminal nodes, all of which are contemporary. This misinterpretation may have been exacerbated by the fact that the tree used in the quiz placed mammals, which many students assume to be the most “advanced” and hence most recent group, alone on the less diverse branch at the far right of an unbalanced, ladderized tree (unfortunately, a tendency to place humans or some other preferred taxon at the top or right of every tree appears to be an unshakable habit among many phylogeneticists, although there is no objective reason for doing so). As indicated in Fig. 5, even on cladograms, in which the lengths of the branches are not

Fig. 16 The lineages leading to contemporary species have all been evolving for exactly the same amount of time. Rates of morphological change may vary among lineages, but the amount of time that separates two living lineages from their common ancestor does not. This figure shows the relationships among a sample of vertebrate lineages, all of which have been evolving for exactly the same amount of time, even if some lineages have undergone more change or more branching than others or if some taxonomically identifiable subsets of those lineages (e.g., teleost fishes) arose earlier than others (e.g., mammals). It is therefore a fallacy to describe one modern species as “more evolved” than another. Note, however, that this is a cladogram rather than an ultrametric tree, such that one cannot assume that any or all of G, H, E, F, C, and B are equal, only that the total amount of time between root and tip is the same along each of the lineages

Fig. 17 The number of intervening nodes does not indicate overall relatedness between lineages. The tree in a is the same in topology as the one used in the study of Meir et al. (2007), which showed that many readers have a tendency to misread the directionality of time on phylogenies and to count nodes when asked to determine evolutionary relatedness among species. Confusion may arise in this particular case because many people maintain the erroneous assumption that mammals are the most “advanced” and therefore must be the youngest group. More generally, because the tree is unbalanced, students may tend to consider birds and mammals (separated by four internal nodes on this tree, Z, Y, X, and W) as more distantly related than turtles and mammals (separated by two internal nodes, X and W). However, this is simply an artifact of the species chosen for inclusion on the tree. All species descended from ancestor X are equally related to kangaroos, with which they all share the same last common ancestor, W. To demonstrate this, b illustrates the same tree with different patterns for each branch, which are then spliced together in c to reveal the identical total distance from the common ancestor W to all of the terminal nodes


T. Ryan Gregory. Understanding Evolutionary Trees. Evo Edu Outreach (2008) 1:121–137 DOI 10.1007/s12052-008-0035-x

How to read phylogenetic trees


T. Ryan Gregory. Understanding Evolutionary Trees. Evo Edu Outreach (2008) 1:121–137 DOI 10.1007/s12052-008-0035-x

with their shared ancestor represent a clade (amniotes) in which the first two clades are nested. Adding frogs and the ancestor linking them to the aforementioned species creates a yet larger clade (tetrapods). Adding fishes and the common ancestor of all species on this tree creates the final and largest clade (vertebrates). Because frogs can be included in a clade with humans before fishes can (in other words, because frogs and humans share a common ancestor that is not shared with fishes), frogs are more closely related to humans than to fishes. Indeed, frogs and humans are exactly equally related to fishes through this common ancestor (recall that two cousins are equally related to a third, more distant relative).

A more rapid approach is to mentally rotate a few internal nodes with no effect on the topology of the tree, as shown in Fig. 11b. In this modified tree, humans are still sister to cats and birds are sister to lizards, frogs are then sister to amniotes, and fishes are the outgroup to the tetrapods. This second tree is identical in topology and is therefore equally accurate as the first tree. However, it

Fig. 11 The order of terminal nodes is meaningless. One of the most common misconceptions about evolutionary trees is that the order of the terminal nodes provides information about their relatedness. Only branching order (i.e., the sequence of internal nodes) provides this information; because all internal nodes can be rotated without affecting the topology (Fig. 6), the order of the tips is meaningless. Nevertheless, there is a strong tendency for readers to take the tree in a as indicating that frogs are more closely related to fishes than humans are. They are not: both frogs and humans (and birds and lizards and cats) are equally closely related to fishes because as tetrapods they share a common ancestor to the exclusion of bony fishes. On the other hand, humans and cats are more closely related to each other than either is to any of the other species depicted because they share a recent common ancestor to the exclusion of the other species. The tree in b exhibits an identical topology to the one in a and is therefore equally valid. In this case, the same misinterpretation of “reading across the tips” would lead to the erroneous conclusion that birds are more closely related to fishes than cats are or that humans are more closely related to frogs than to lizards and birds. Because they share a common ancestor as amniotes, birds, cats, lizards, and humans are all equally related to frogs. It is good practice to rotate a few internal nodes mentally when first examining a tree to dispel misinterpretations based on reading the order of tips

Fig. 12 Evolutionary trends cannot be identified by reading across the tips. In addition to resulting in incorrect interpretations of relatedness (Fig. 11), reading across the tips can engender a false impression of evolutionary trends. For example, many readers confronted with the tree in a might be tempted to infer an evolutionary trend toward increased body size in snail species over time (or, in Fig. 11a, an increase in complexity or intelligence over time). Unfortunately, misinterpretations such as this can be found even in the primary scientific literature. Once again, this can be corrected simply by rotating a few internal nodes, as has been done in b, in which the topology is the same but where the supposed trend is no longer apparent. c shows evidence of a real evolutionary trend toward increased body size. The important consideration is internal branching: In this case, there is information about ancestral states (e.g., from fossils), and it is evident that in every branching event, the two descendant species have been larger than their shared ancestor. Despite this being a clear evolutionary trend, there is no pattern evident across the terminal nodes. Thus, reading across the tips can create apparent trends where there are none and can mask real trends that are strongly supported by historical information


What is phylogeny used for

•  Classify taxonomy – The classic use

•  Outbreak detection – Increasing with WGS data

What is phylogeny used for

•  Cholera outbreak in Haiti 2010
•  Listeria outbreak 2014

Whole-genome Sequencing Used to Investigate a Nationwide Outbreak of Listeriosis Caused by Ready-to-eat Delicatessen Meat, Denmark, 2014. Kvistholm Jensen et al. Clin Infect Dis. (2016) 63 (1): 64-70. doi: 10.1093/cid/ciw192

Case story

•  Vibrio cholerae outbreak in Haiti followed the 2010 earthquake

•  Rumors said that the outbreak may have come from Nepal, travelling along with UN soldiers from Nepal

•  No proof had been given of this until the Hendriksen et al. paper in 2011

Population Genetics of Vibrio cholerae from Nepal in 2010: Evidence on the Origin of the Haitian Outbreak. Hendriksen et al. 23 August 2011 mBio vol. 2 no. 4 e00157-11. doi: 10.1128/mBio.00157-11

Case story

•  Data
–  24 recent V. cholerae strains from Nepal
–  10 previously sequenced V. cholerae isolates, including 3 from the Haitian outbreak
•  Analysis
–  Antimicrobial susceptibility testing
–  PFGE (pulsed-field gel electrophoresis) to analyze for genetic relatedness
–  Whole genome sequencing, SNP identification and phylogenetic analysis

Case story - Results

[Figure: colour-coded resistance profile; legend: Susceptible, Decreased susceptibility, Resistant]

Nepalese strains (Hendriksen et al. 2011): Tetracycline, Ciprofloxacin, Trimethoprim, Sulfamethoxazole, Nalidixic acid

Haitian outbreak strains (Centers for Disease Control and Prevention, 2010): Tetracycline, Ciprofloxacin, Trimethoprim, Sulfamethoxazole, Nalidixic acid


•  Pulsed-field gel electrophoresis (PFGE)
–  Nepalese isolates divided in 4 groups
–  Most common Haitian type in same group as four Nepalese strains


FIG 1 Genetic relationships among V. cholerae isolates from Nepal and Haiti. A single maximum parsimony tree was reconstructed using 752 SNPs from 34 whole-genome sequences. There were 184 parsimony-informative SNPs, of which 6 were homoplastic, resulting in a CI of 0.97 (excluding uninformative characters). The branch lengths are labeled in red, and for branches affected by homoplasy, minimum and maximum branch lengths are designated. Members of SNP genotypic group V (16) are indicated. SNP differences among the three most closely related Nepali groups and the Haitian group are shown and characterized in Table S1 in the supplemental material.

TABLE 1 Different point mutations observed among the three sequenced isolates from the Haiti outbreak and the three most closely related isolates from Nepal (a)

Nucleotide (amino acid) in:

Chromosome  Position  Reference  Haitian isolate              Nepalese isolate
                      strain     1786     1792     1798       14       25       26
I           2787016   C (Gly)    C (Gly)  C (Gly)  C (Gly)    T (Arg)  T (Arg)  T (Arg)
I           1090536   T (Ile)    T (Ile)  T (Ile)  T (Ile)    T (Ile)  T (Ile)  G (Ser)
II          962762    C (Ala)    C (Ala)  C (Ala)  C (Ala)    T (Ala)  C (Ala)  C (Ala)

(a) The reference strain is Vibrio cholerae O1 biovar El Tor strain N16961 (Bangladesh 1971). The NCBI reference sequences or accession numbers are NC_002505 for chromosome I and NC_002506 for chromosome II.


10 minutes break!

snpTree

•  First online webserver for constructing phylogenetic trees based on whole genome sequencing

snpTree - a web-server to identify and construct SNP trees from whole genome sequence data. Leekitcharoenphon P, Kaas RS, Thomsen MC, Friis C, Rasmussen S, Aarestrup FM. BMC Genomics. 2012;13 Suppl 7:S6.

snpTree flow

CSI Phylogeny

https://cge.cbs.dtu.dk/services/CSIPhylogeny/

•  SNP identification same as snpTree
•  Strict sorting of SNPs
–  Depth
–  Relative depth
–  Distance between SNPs
–  SNP quality
–  Read mapping quality

Rolf S. Kaas, Pimlapas Leekitcharoenphon, Frank M. Aarestrup, Ole Lund. Solving the Problem of Comparing Whole Bacterial Genomes across Different Sequencing Platforms. PLoS ONE 2014; 9(8): e104984.

CSI Phylogeny

•  Requires all SNPs to be significant
–  Z-score higher than 1.96 for all SNPs

$Z = \frac{X - Y}{\sqrt{X + Y}}$

•  X is the number of reads with the most common nucleotide at that position, and Y the number of reads with any other nucleotide.
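As a worked illustration of the Z-score, here are two hypothetical read counts (the numbers are made up for the example and are not taken from any dataset). With 18 reads supporting the most common nucleotide and 2 reads supporting something else, the position passes the 1.96 threshold; with 6 against 2 it does not.

```latex
% Hypothetical counts: X = 18, Y = 2
Z = \frac{18 - 2}{\sqrt{18 + 2}} = \frac{16}{\sqrt{20}} \approx 3.58 > 1.96
\quad \text{(significant)}

% Hypothetical counts: X = 6, Y = 2
Z = \frac{6 - 2}{\sqrt{6 + 2}} = \frac{4}{\sqrt{8}} \approx 1.41 < 1.96
\quad \text{(not significant)}
```

The second case shows that a clear majority is not enough at low depth: the same 3:1 ratio at higher coverage would give a larger Z.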

CSI Phylogeny

Output

Tree built by the FastTree algorithm, in Newick format
•  Branch lengths are substitutions per site at the variable sites

Matrix of pairwise SNP counts in text (.txt) format
•  Diagonal SNP matrix

CSI Phylogeny

NDtree

https://cge.cbs.dtu.dk/services/NDtree/

Nucleotide calling
•  A different approach, where the main distinction is not whether a SNP should be called or not, but whether or not there is solid evidence for the nucleotide at the given position.

Real-Time Whole-Genome Sequencing for Routine Typing, Surveillance, and Outbreak Detection of Verotoxigenic Escherichia coli. Joensen KG, Scheutz F, Lund O, Hasman H, Kaas RS, Nielsen EM, Aarestrup FM. J Clin Microbiol. 2014 May;52(5):1501-10.

NDtree

Simple mapping approach
•  Cuts all reads into K-mers
•  Maps all K-mers to reference genome
•  Makes ungapped consensus sequences of equal length

Mapping

[Figure: K-mers mapped to the reference genome and combined into a consensus sequence; the resulting ungapped sequences of the reference genome and Genomes 1 to 6 all have the same length]
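The mapping idea can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not NDtree's actual implementation: here each read is anchored by its K-mers that match the reference exactly and uniquely, the read is then laid onto the reference without gaps at the start position most K-mers agree on, and the most frequent base per position gives the consensus. The function names, the voting scheme, the `N` symbol for uncovered positions and the tiny reference sequence are all hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return all overlapping K-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def index_reference(ref, k):
    """Map every K-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos, kmer in enumerate(kmers(ref, k)):
        index[kmer].append(pos)
    return index


def place_read(read, index, k):
    """Vote for the read's start position using uniquely matching K-mers."""
    votes = Counter()
    for offset, kmer in enumerate(kmers(read, k)):
        positions = index.get(kmer, [])
        if len(positions) == 1:  # ignore K-mers that are absent or repeated
            votes[positions[0] - offset] += 1
    return votes.most_common(1)[0][0] if votes else None


def pileup(reads, ref, k):
    """Count the bases observed at every reference position (no gaps)."""
    index = index_reference(ref, k)
    counts = [Counter() for _ in ref]
    for read in reads:
        start = place_read(read, index, k)
        if start is None:
            continue  # read could not be anchored
        for i, base in enumerate(read):
            if 0 <= start + i < len(ref):
                counts[start + i][base] += 1
    return counts


def consensus(counts):
    """Most common base per position; 'N' where no read was placed."""
    return "".join(c.most_common(1)[0][0] if c else "N" for c in counts)


reference = "ACGTTGCAAGCTTAGGCATC"
sample = "ACGTTGCAAGTTTAGGCATC"  # differs from the reference at one position
reads = [sample[0:12], sample[4:16], sample[8:20]]
result = consensus(pileup(reads, reference, k=5))
print(result)
```

Because reads are laid down without gaps, the consensus always has exactly the length of the reference, which is what makes position-by-position comparison between genomes possible.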

NDtree Nucleotide calling

–  When all reads have been mapped, the significance of the base call at each position is evaluated by calculating the number of reads X having the most common nucleotide at that position, and the number of reads Y supporting other nucleotides.

A Z-score is calculated and compared with a threshold:

$Z = \frac{X - Y}{\sqrt{X + Y}}$

•  Z > 1.96 (or 3.29)
•  >90% of reads supporting the same base
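The two criteria can be combined into a small base-calling function. The sketch below assumes that both criteria must hold for a base to be called and uses `N` as a hypothetical marker for positions without solid evidence; the function names and default values are illustrative only.

```python
from math import sqrt


def z_score(x, y):
    """Z = (X - Y) / sqrt(X + Y) for X majority reads and Y other reads."""
    return (x - y) / sqrt(x + y)


def call_base(counts, z_threshold=1.96, min_fraction=0.9):
    """Return the majority base if the evidence is solid, otherwise 'N'.

    counts maps each observed nucleotide to its number of reads.
    Assumption: both the Z-score and the >90% criterion must be met.
    """
    total = sum(counts.values())
    if total == 0:
        return "N"  # no reads at this position
    base, x = max(counts.items(), key=lambda item: item[1])
    y = total - x
    if z_score(x, y) > z_threshold and x / total > min_fraction:
        return base
    return "N"


print(call_base({"A": 20, "C": 1}))   # deep coverage, clear majority
print(call_base({"A": 3}))            # unanimous but too shallow
print(call_base({"A": 30, "G": 10}))  # significant Z, but only 75% agreement
```

The stricter threshold is selected with `call_base(counts, z_threshold=3.29)`.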

NDtree

Count nucleotide differences
–  Method 1: Each pair of sequences was compared and the number of nucleotide differences in positions called in all sequences was counted.
•  More accurate (Z=1.96 is used as threshold)
–  Method 2: Each pair of sequences was compared and the number of nucleotide differences in positions called in both sequences was counted.
•  More robust (Z=3.29 is used as threshold)

Method 1 – all called
[Figure: Significant positions in Genome 1; Significant positions in Genome 2; Significant positions in Genome 3; Positions used for phylogeny]

Method 2 – pairwise significance
[Figure: Significant positions in Genome 1; Significant positions in Genome 2; Significant positions in Genome 3; Positions used between 1 and 2; Positions used between 1 and 3; Positions used between 2 and 3]
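The difference between the two methods is only in which positions are compared. The sketch below illustrates this with three short, made-up consensus sequences in which `N` marks a position that was not called (the `N` convention and the function names are assumptions for the example).

```python
def called_in_all(sequences):
    """Positions at which every sequence has a called base (Method 1)."""
    return [i for i, column in enumerate(zip(*sequences)) if "N" not in column]


def differences_method1(seq_a, seq_b, all_sequences):
    """Count differences only at positions called in ALL sequences."""
    return sum(seq_a[i] != seq_b[i] for i in called_in_all(all_sequences))


def differences_method2(seq_a, seq_b):
    """Count differences at positions called in BOTH sequences of the pair."""
    return sum(a != b for a, b in zip(seq_a, seq_b) if a != "N" and b != "N")


# Hypothetical ungapped consensus sequences of equal length.
genome1 = "TCGTNACGTA"
genome2 = "ACCTAACNTA"
genome3 = "NCGTAACGTT"
genomes = [genome1, genome2, genome3]

print(differences_method1(genome1, genome2, genomes))
print(differences_method2(genome1, genome2))
```

In this example the first position differs between genome 1 and genome 2 but is uncalled in genome 3, so Method 1 skips it while Method 2 counts it. With many genomes, Method 1 can lose a large share of positions, since every uncalled base in any genome removes that position for all pairs.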

NDtree

Uses two different algorithms to make two different trees
•  UPGMA
•  Neighbor Joining

Both algorithms are part of the PHYLIP Neighbor program package and make trees from distance matrices

UPGMA vs. Neighbor Joining

•  UPGMA works when samples have been taken at the same time, because it assumes that all tips are equally distant from the root

•  Neighbor Joining is better when samples have been taken at different times
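To make the UPGMA assumption concrete, here is a compact sketch of the algorithm applied to the small A, B, C, D distance matrix from the earlier example. It is a teaching sketch, not the PHYLIP implementation: the function name, the cluster naming and the way ties between equal distances are broken (first one found) are choices made for this example.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a Newick tree from a distance matrix with UPGMA.

    Repeatedly joins the two closest clusters; the new node is placed at
    half their distance, so every tip ends up equally far from the root.
    """
    # cluster key -> (Newick text, number of tips, height of the node)
    clusters = {n: (n, 1, 0.0) for n in names}
    d = {frozenset(p): float(dist[p[0]][p[1]]) for p in combinations(names, 2)}
    while len(clusters) > 1:
        pair = min(d, key=d.get)  # closest pair; ties: first one found
        a, b = sorted(pair)
        nw_a, size_a, h_a = clusters.pop(a)
        nw_b, size_b, h_b = clusters.pop(b)
        height = d.pop(pair) / 2
        merged = f"({nw_a}:{height - h_a:g},{nw_b}:{height - h_b:g})"
        key = f"{a}+{b}"
        for other in clusters:
            d_a = d.pop(frozenset((a, other)))
            d_b = d.pop(frozenset((b, other)))
            # average distance, weighted by the number of tips in each cluster
            d[frozenset((key, other))] = (size_a * d_a + size_b * d_b) / (size_a + size_b)
        clusters[key] = (merged, size_a + size_b, height)
    (newick, _, _), = clusters.values()
    return newick + ";"


distances = {
    "A": {"A": 0, "B": 3, "C": 3, "D": 3},
    "B": {"A": 3, "B": 0, "C": 2, "D": 4},
    "C": {"A": 3, "B": 2, "C": 0, "D": 4},
    "D": {"A": 3, "B": 4, "C": 4, "D": 0},
}
tree = upgma(["A", "B", "C", "D"], distances)
print(tree)
```

B and C are joined first because they have the smallest distance (2), each with a branch of length 1. Neighbor Joining drops the equal-height constraint, which is why it copes better with samples taken at different times.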

NDtree

Output
•  distance.txt: Distance matrix - tab separated
•  dist.mat: Distance matrix - PHYLIP format
•  tree.nj.newick: Neighbor Joining tree - Newick format
–  Branch lengths are the number of Nucleotide Differences
•  tree.upgma.newick: UPGMA tree - Newick format
–  Branch lengths are the number of Nucleotide Differences

Controlled Evolution study

[Figure: experimental design over Day 1, Day 2, Day 3 … Day 8, starting from single colonies of CSH114 and ending with 128x. Steps shown: choose colony, grow for 8 h; plate out, grow for 16 h; choose colonies, grow for 8 h]

For each 8 hour culture a sample was saved for DNA sequencing

J. Ahrenfeldt, C. Skaarup, H. Hasman, A. G. Pedersen, F. M. Aarestrup and O. Lund. Bacterial whole genome-based phylogeny: construction of a new benchmarking dataset and assessment of some existing methods. BMC Genomics (2017) 18:19

Naming the descendants

Day 1: S
Day 2: S1, S2
Day 3: S11, S12, S21, S22
Day 4: S111, S112, S121, S122, S211, S212, S221, S222
Day 5: S1111, S1112, S1121, S1122, S1211, S1212, S1221, S1222, S2111, S2112, S2121, S2122, S2211, S2212, S2221, S2222
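The naming scheme encodes the true tree: each sample gives rise to two descendants, named by appending 1 or 2 to the parent's name. A short sketch of that rule (the function names are illustrative):

```python
def descendants_by_day(root="S", days=5):
    """Return one list of sample names per day, starting with the root."""
    generations = [[root]]
    for _ in range(days - 1):
        previous = generations[-1]
        generations.append([parent + suffix for parent in previous for suffix in "12"])
    return generations


def parent_of(name):
    """The parent's name is the sample name without its last digit."""
    return name[:-1] if len(name) > 1 else None


generations = descendants_by_day()
for day, names in enumerate(generations, start=1):
    print(f"Day {day}: {', '.join(names)}")
```

Because the ancestry can be read directly from the names, a tree inferred by any method can be checked against the known history, for example `parent_of("S2122")` gives `S212`.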

Mutations

Phylogenetic tree using NDtree (UPGMA)

[Figure: UPGMA tree of the samples from the controlled evolution study (S, S1, S2, … S2221); scale bar 2.0]

Phylogenetic tree using NDtree (Neighbor Joining)

[Figure: Neighbor Joining tree of the samples from the controlled evolution study (S, S1, S2, … S2221)]

UPGMA vs. Neighbor Joining

•  UPGMA works when samples have been taken at the same time

•  Neighbor Joining is better when samples have been taken at different times

CSI Phylogeny – Default settings

[Figure: CSI Phylogeny tree; tips labelled with underscore-separated lineage paths such as 1_1, 1_2_1_2 and f_1_1_2_1_2; scale bar 0.2]

[Figure: CSI Phylogeny tree of the samples from the controlled evolution study (S, S1, S2, … S2221); scale bar 0.2]

CSI Phylogeny – Pruning disabled

So… What should I use when?

CSI Phylogeny
•  Has very good statistics and a good graphical overview.
•  Advantageous to use when you expect the differences between the isolates to be larger than 5-10 mutations.
•  Is faster

NDtree
•  Is able to find very small differences.
•  Does not take recombination into consideration.
•  Works best on raw reads. If given assembled genomes, it simulates reads.

Choosing a reference genome

For comparison of very closely related isolates, a better level of detail is given by using a closely related reference genome.

What defines an outbreak

•  We can’t tell for certain
•  It depends on the species
•  But a rule of thumb is:
–  Within 10 SNPs it is definitely an outbreak
–  Within 30 SNPs it might be an outbreak
–  Above 60 SNPs it is most likely not an outbreak
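The rule of thumb can be written down as a small helper for screening a distance matrix. This is a sketch only: it reads "within" as "less than or equal to", and since the slide says nothing about distances between 31 and 60 SNPs, that range is reported as not covered. As stated above, the right cut-offs depend on the species.

```python
def outbreak_assessment(snp_distance):
    """Translate a pairwise SNP distance into the rule-of-thumb wording."""
    if snp_distance < 0:
        raise ValueError("SNP distance cannot be negative")
    if snp_distance <= 10:
        return "definitely an outbreak"
    if snp_distance <= 30:
        return "might be an outbreak"
    if snp_distance > 60:
        return "most likely not an outbreak"
    return "not covered by the rule of thumb"


for distance in (2, 25, 45, 120):
    print(distance, "SNPs:", outbreak_assessment(distance))
```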

Thank you for listening

•  Questions?
