Whole Genome based Phylogeny Johanne Ahrenfeldt PhD student DTU Bioinformatics
Briefly about me
Johanne Ahrenfeldt
• PhD student at DTU Bioinformatics – Whole Genome based Phylogeny
• Graduate Engineer in Systems Biology and Bioinformatics from Technical University of Denmark
• Working in the CGE project since 2012 – started as a student helper
Overview
• What is Phylogeny
• SNP methods
– CSI Phylogeny
• Nucleotide Differences
– NDtree
• Controlled Evolution study
• Good advice
What is phylogeny?
• Early phylogeny
– Classification
– Based on phenotypes
• Current phylogeny
– Based on genotypes
– DNA mutations as basis for evolution
Classification
Carl Linnaeus 1707-1778
Hierarchical system: Kingdom, Phylum, Class, Order, Family, Genus, Species
Classification depicted as a tree
Species Genus Family Order Class
Molecular Basis for Variation: DNA Mutation
DNA mutations as basis for evolution
What are phylogenetic trees
• Phylogenetic trees are a visual representation of the genetic relationship between species
• Think of them as family trees
• Phylogeny can also be represented by distance matrices
What are phylogenetic trees
• Trees were traditionally made using aligned sequences of single genes or proteins
• Whole genome data can be used to create trees based on
– SNP calling
– K-mer overlap
– Alignment of genomes
What is a SNP
• A Single Nucleotide Polymorphism (SNP) is a DNA sequence variation occurring commonly within a population (e.g. 1%), in which a single nucleotide (A, T, C or G) in the genome (or other shared sequence) differs between members of a biological species or between paired chromosomes.
How does it work
Strain A ATTCAGTAGT
Strain B ATGCAGTTGA
Strain C ATGCAATTGT
Strain D ATCCATTAGC
Construct distance matrix
Count the positions at which each pair of strains differs and enter the counts in a 4 × 4 matrix (rows and columns A, B, C, D).
Make Tree
Distance matrix (number of differing positions):
   A  B  C  D
A  0  3  3  3
B  3  0  2  4
C  3  2  0  4
D  3  4  4  0
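The distance matrix above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the four sequences are already aligned and of equal length; the function names `count_differences` and `distance_matrix` are made up for illustration and do not come from any of the tools discussed here.

```python
from itertools import combinations

# The four example strains from the slide.
strains = {
    "A": "ATTCAGTAGT",
    "B": "ATGCAGTTGA",
    "C": "ATGCAATTGT",
    "D": "ATCCATTAGC",
}

def count_differences(seq1, seq2):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))

def distance_matrix(seqs):
    """Symmetric matrix of pairwise differences, stored as a dict of dicts."""
    names = list(seqs)
    matrix = {n: {m: 0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = count_differences(seqs[n], seqs[m])
        matrix[n][m] = matrix[m][n] = d
    return matrix

matrix = distance_matrix(strains)
print("  " + " ".join(matrix))
for name, row in matrix.items():
    print(name, " ".join(str(v) for v in row.values()))
```

Strains B and C differ at only two positions, so they are the closest pair and end up next to each other in the tree.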
[Figure: tree with tips A, B, C and D; branches labeled with length 1]
How to read phylogenetic trees
that would have been classified as Homo sapiens were born. However, rainbow trout and humans are contemporary species, meaning that the lineages of which they are currently terminal nodes have been evolving for exactly the same amount of time since their divergence from a distant common ancestor. As a result, any notion that one of these lineages is “more evolved” or that one has had more time to accumulate differences is flawed.
Misconception #8: Backwards Time Axes
Among the common misconceptions identified by Meir et al. (2007) was the tendency for many students to misread the time axis on evolutionary trees. Many students interpreted the location of the terminal nodes as indicating time, for example by reading from left to right or from the leftmost tip to the root. In Fig. 17a, for example, many students read time as proceeding from birds (oldest) to the root W (youngest) or from birds (oldest) to kangaroos (youngest). Neither is correct, as time extends from the root to the terminal nodes, all of which are contemporary. This misinterpretation may have been exacerbated by the fact that the tree used in the quiz placed mammals, which many students assume to be the most “advanced” and hence most recent group, alone on the less diverse branch at the far right of an unbalanced, ladderized tree (unfortunately, a tendency to place humans or some other preferred taxon at the top or right of every tree appears to be an unshakable habit among many phylogeneticists, although there is no objective reason for doing so). As indicated in Fig. 5, even on cladograms, in which the lengths of the branches are not
Fig. 16 The lineages leading to contemporary species have all been evolving for exactly the same amount of time. Rates of morphological change may vary among lineages, but the amount of time that separates two living lineages from their common ancestor does not. This figure shows the relationships among a sample of vertebrate lineages, all of which have been evolving for exactly the same amount of time, even if some lineages have undergone more change or more branching than others or if some taxonomically identifiable subsets of those lineage (e.g., teleost fishes) arose earlier than others (e.g., mammals). It is therefore a fallacy to describe one modern species as “more evolved” than another. Note, however, that this is a cladogram rather than an ultrametric tree, such that one cannot assume that any or all of G, H, E, F, C, and B are equal, only that the total amount of time between root and tip is the same along each of the lineages.
Fig. 17 The number of intervening nodes does not indicate overall relatedness between lineages. The tree in a is the same in topology as the one used in the study of Meir et al. (2007), which showed that many readers have a tendency misread the directionality of time on phylogenies and to count nodes when asked to determine evolutionary relatedness among species. Confusion may arise in this particular case because many people maintain the erroneous assumption that mammals are the most “advanced” and therefore must be the youngest group. More generally, because the tree is unbalanced, students may tend to consider birds and mammals (separated by four internal nodes on this tree, Z, Y, X, and W) as more distantly related than turtles and mammals (separated by two internal nodes, X and W). However, this is simply an artifact of the species chosen for inclusion on the tree. All species descended from ancestor X are equally related to kangaroos, with which they all share the same last common ancestor, W. To demonstrate this, b illustrates the same tree with different patterns for each branch, which are then spliced together in c to reveal the identical total distance from the common ancestor W to all of the terminal nodes.
T. Ryan Gregory. Understanding Evolutionary Trees. Evo Edu Outreach (2008) 1:121–137. DOI 10.1007/s12052-008-0035-x
How to read phylogenetic trees
with their shared ancestor represent a clade (amniotes) in which the first two clades are nested. Adding frogs and the ancestor linking them to the aforementioned species creates a yet larger clade (tetrapods). Adding fishes and the common ancestor of all species on this tree creates the final and largest clade (vertebrates). Because frogs can be included in a clade with humans before fishes can (in other words, because frogs and humans share a common ancestor that is not shared with fishes), frogs are more closely related to humans than to fishes. Indeed, frogs and humans are exactly equally related to fishes through this common ancestor (recall that two cousins are equally related to a third, more distant relative).
A more rapid approach is to mentally rotate a few internal nodes with no effect on the topology of the tree, as shown in Fig. 11b. In this modified tree, humans are still sister to cats and birds are sister to lizards, frogs are then sister to amniotes, and fishes are the outgroup to the tetrapods. This second tree is identical in topology and is therefore equally accurate as the first tree. However, it
Fig. 11 The order of terminal nodes is meaningless. One of the most common misconceptions about evolutionary trees is that the order of the terminal nodes provides information about their relatedness. Only branching order (i.e., the sequence of internal nodes) provides this information; because all internal nodes can be rotated without affecting the topology (Fig. 6), the order of the tips is meaningless. Nevertheless, there is a strong tendency for readers to take the tree in a as indicating that frogs are more closely related to fishes than humans are. They are not: both frogs and humans (and birds and lizards and cats) are equally closely related to fishes because as tetrapods they share a common ancestor to the exclusion of bony fishes. On the other hand, humans and cats are more closely related to each other than either is to any of the other species depicted because they share a recent common ancestor to the exclusion of the other species. The tree in b exhibits an identical topology to the one in a and is therefore equally valid. In this case, the same misinterpretation of “reading across the tips” would lead to the erroneous conclusion that birds are more closely related to fishes than cats are or that humans are more closely related to frogs than to lizards and birds. Because they share a common ancestor as amniotes, birds, cats, lizards, and humans are all equally related to frogs. It is good practice to rotate a few internal nodes mentally when first examining a tree to dispel misinterpretations based on reading the order of tips.
Fig. 12 Evolutionary trends cannot be identified by reading across the tips. In addition to resulting in incorrect interpretations of relatedness (Fig. 11), reading across the tips can engender a false impression of evolutionary trends. For example, many readers confronted with the tree in a might be tempted to infer an evolutionary trend toward increased body size in snail species over time (or, in Fig. 11a, an increase in complexity or intelligence over time). Unfortunately, misinterpretations such as this can be found even in the primary scientific literature. Once again, this can be corrected simply by rotating a few internal nodes, as has been done in b, in which the topology is the same but where the supposed trend is no longer apparent. c shows evidence of a real evolutionary trend toward increased body size. The important consideration is internal branching: In this case, there is information about ancestral states (e.g., from fossils), and it is evident that in every branching event, the two descendant species have been larger than their shared ancestor. Despite this being a clear evolutionary trend, there is no pattern evident across the terminal nodes. Thus, reading across the tips can create apparent trends where there are none and can mask real trends that are strongly supported by historical information.
What is phylogeny used for
• Classify taxonomy – The classic use
• Outbreak detection – Increasing with WGS data
What is phylogeny used for
• Cholera outbreak in Haiti 2010
• Listeria outbreak 2014
Whole-genome Sequencing Used to Investigate a Nationwide Outbreak of Listeriosis Caused by Ready-to-eat Delicatessen Meat, Denmark, 2014. Kvistholm Jensen et al. Clin Infect Dis. (2016) 63 (1): 64-70. doi: 10.1093/cid/ciw192
Case story
• Vibrio cholerae outbreak in Haiti followed the 2010 earthquake
• Rumors said that the outbreak may have come from Nepal, travelling along with UN soldiers from Nepal
• No proof had been given of this until the Hendriksen et al. paper in 2011
Population Genetics of Vibrio cholerae from Nepal in 2010: Evidence on the Origin of the Haitian Outbreak. Hendriksen et al. 23 August 2011 mBio vol. 2 no. 4 e00157-11. doi: 10.1128/mBio.00157-11
Case story
• Data
– 24 recent V. cholerae strains from Nepal
– 10 previously sequenced V. cholerae isolates, including 3 from the Haitian outbreak
• Analysis
– Antimicrobial susceptibility testing
– PFGE (pulsed-field gel electrophoresis) to analyze for genetic relatedness
– Whole genome sequencing, SNP identification and phylogenetic analysis
Case story - Results
Resistance profile (legend: Susceptible, Decreased susceptibility, Resistant)
Nepalese strains (Hendriksen et al. 2011): Tetracycline, Ciprofloxacin, Trimethoprim, Sulfamethoxazole, Nalidixic acid
Haitian outbreak strains (Centers for Disease Control and Prevention, 2010): Tetracycline, Ciprofloxacin, Trimethoprim, Sulfamethoxazole, Nalidixic acid
Case story - Results
• Pulsed-field gel electrophoresis (PFGE)
– Nepalese isolates divided in 4 groups
– Most common Haitian type in same group as four Nepalese strains
Case story - Results
FIG 1 Genetic relationships among V. cholerae isolates from Nepal and Haiti. A single maximum parsimony tree was reconstructed using 752 SNPs from 34 whole-genome sequences. There were 184 parsimony-informative SNPs, of which 6 were homoplastic, resulting in a CI of 0.97 (excluding uninformative characters). The branch lengths are labeled in red, and for branches affected by homoplasy, minimum and maximum branch lengths are designated. Members of SNP genotypic group V (16) are indicated. SNP differences among the three most closely related Nepali groups and the Haitian group are shown and characterized in Table S1 in the supplemental material.
TABLE 1 Different point mutations observed among the three sequenced isolates from the Haiti outbreak and the three most closely related isolates from Nepal (a)
Nucleotide (amino acid) in the reference strain, the Haitian isolates (1786, 1792, 1798) and the Nepalese isolates (14, 25, 26):

Chromosome  Position  Reference  Haitian 1786  Haitian 1792  Haitian 1798  Nepalese 14  Nepalese 25  Nepalese 26
I           2787016   C (Gly)    C (Gly)       C (Gly)       C (Gly)       T (Arg)      T (Arg)      T (Arg)
I           1090536   T (Ile)    T (Ile)       T (Ile)       T (Ile)       T (Ile)      T (Ile)      G (Ser)
II          962762    C (Ala)    C (Ala)       C (Ala)       C (Ala)       T (Ala)      C (Ala)      C (Ala)

(a) The reference strain is Vibrio cholerae O1 biovar El Tor strain N16961 (Bangladesh 1971). The NCBI reference sequences or accession numbers are NC_002505 for chromosome I and NC_002506 for chromosome II.
10 minutes break!
snpTree
• First online webserver for constructing phylogenetic trees based on whole genome sequencing
snpTree - a web-server to identify and construct SNP trees from whole genome sequence data. Leekitcharoenphon P, Kaas RS, Thomsen MC, Friis C, Rasmussen S, Aarestrup FM. BMC Genomics. 2012;13 Suppl 7:S6.
snpTree flow
CSI Phylogeny
https://cge.cbs.dtu.dk/services/CSIPhylogeny/
• SNP identification same as snpTree
• Strict sorting of SNPs
– Depth
– Relative depth
– Distance between SNPs
– SNP quality
– Read mapping quality
Rolf S. Kaas, Pimlapas Leekitcharoenphon, Frank M. Aarestrup, Ole Lund. Solving the Problem of Comparing Whole Bacterial Genomes across Different Sequencing Platforms. PLoS ONE 2014; 9(8): e104984.
CSI Phylogeny
• Requires all SNPs to be significant
– Z-score higher than 1.96 for all SNPs
• X is the number of reads with the most common nucleotide at that position, and Y the number of reads with any other nucleotide.
$Z = \frac{X - Y}{\sqrt{X + Y}}$
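As a small illustration of the Z-score, the sketch below computes Z from the two read counts and compares it with the 1.96 threshold. The function names and the read counts in the examples are hypothetical; this is not CSI Phylogeny's own code.

```python
import math

def z_score(x, y):
    """Z = (X - Y) / sqrt(X + Y).

    x: reads carrying the most common nucleotide at the position
    y: reads carrying any other nucleotide
    """
    if x + y == 0:
        raise ValueError("no reads cover this position")
    return (x - y) / math.sqrt(x + y)

def is_significant(x, y, threshold=1.96):
    """True when the Z-score is higher than the threshold."""
    return z_score(x, y) > threshold

# Hypothetical read counts at three positions.
print(z_score(4, 0), is_significant(4, 0))   # 4 / sqrt(4) = 2.0
print(z_score(9, 7), is_significant(9, 7))   # 2 / sqrt(16) = 0.5
print(is_significant(3, 0))                  # 3 / sqrt(3) is about 1.73
```

With no disagreeing reads, at least four reads are needed to pass 1.96, so the Z-score acts as a combined depth and agreement filter.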
CSI Phylogeny
Output
Tree built by the FastTree algorithm, in Newick format
• Branch lengths are substitutions per site at the variable sites
Matrix of pairwise SNP counts in text (.txt) format
• Diagonal SNP matrix
CSI Phylogeny
NDtree
https://cge.cbs.dtu.dk/services/NDtree/
Nucleotide calling
• A different approach, where the main distinction is not whether a SNP should be called, but whether there is solid evidence for the nucleotide at a given position.
Real-Time Whole-Genome Sequencing for Routine Typing, Surveillance, and Outbreak Detection of Verotoxigenic Escherichia coli. Joensen KG, Scheutz F, Lund O, Hasman H, Kaas RS, Nielsen EM, Aarestrup FM. J Clin Microbiol. 2014 May;52(5):1501-10.
NDtree
Simple mapping approach
• Cuts all reads into K-mers
• Maps all K-mers to the reference genome
• Makes ungapped consensus sequences of equal length
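The first step, cutting reads into K-mers, can be sketched as follows. The value of K, the use of overlapping K-mers with a step of one, and the function name `cut_into_kmers` are assumptions made for illustration; the slides do not state how NDtree chooses them.

```python
def cut_into_kmers(read, k, step=1):
    """Return all K-mers of length k from a read, moving `step` bases at a time."""
    if k <= 0 or step <= 0:
        raise ValueError("k and step must be positive")
    return [read[i:i + k] for i in range(0, len(read) - k + 1, step)]

# A toy read; real reads are much longer and K is typically larger.
read = "ATTCAGTAGT"
kmers = cut_into_kmers(read, k=4)
print(kmers)
```

A read of length L yields L - k + 1 overlapping K-mers, and a read shorter than k yields none.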
Mapping
[Figure: K-mers mapped to the reference genome, giving a consensus sequence; the reference genome is shown together with Genomes 1 to 6]
NDtree Nucleotide calling
– When all reads have been mapped, the significance of the base call at each position is evaluated by calculating the number of reads X having the most common nucleotide at that position, and the number of reads Y supporting other nucleotides.
A Z-score is calculated:
$Z = \frac{X - Y}{\sqrt{X + Y}}$
A base is called when
• Z > 1.96 (or 3.29)
• >90% of the reads support the same base
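The calling rule can be sketched as a single function that takes the bases observed at one position. This is an illustrative sketch, not NDtree's implementation: the function name `call_base`, the requirement that both criteria hold together, and the use of "N" to mark a position without solid evidence are assumptions.

```python
import math
from collections import Counter

def call_base(bases, z_threshold=1.96, min_fraction=0.9):
    """Return the consensus base at one position, or "N" if the evidence is weak.

    bases: the nucleotides observed at this position, one per mapped read.
    """
    counts = Counter(bases)
    if not counts:
        return "N"                         # no coverage at all
    base, x = counts.most_common(1)[0]     # X: reads with the most common base
    total = sum(counts.values())
    y = total - x                          # Y: reads with any other base
    z = (x - y) / math.sqrt(total)
    if z > z_threshold and x / total > min_fraction:
        return base
    return "N"

print(call_base("A" * 10))                    # Z about 3.16, 100% agreement
print(call_base("A" * 30 + "T" * 5))          # Z about 4.23, but only 86% agreement
print(call_base("A" * 10, z_threshold=3.29))  # fails the stricter threshold
```

The second example shows why the 90% rule matters: with deep coverage the Z-score alone can be high even when a noticeable minority of reads disagree.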
NDtree
Count nucleotide differences
– Method 1: Each pair of sequences was compared and the number of nucleotide differences in positions called in all sequences was counted.
• More accurate (Z=1.96 is used as threshold)
– Method 2: Each pair of sequences was compared and the number of nucleotide differences in positions called in both sequences was counted.
• More robust (Z=3.29 is used as threshold)
Method 1 – all called
[Figure legend: Significant positions in Genome 1; Significant positions in Genome 2; Significant positions in Genome 3; Positions used for phylogeny]
Method 2 – pairwise significance
[Figure legend: Significant positions in Genome 1; Significant positions in Genome 2; Significant positions in Genome 3; Positions used between 1 and 2; Positions used between 1 and 3; Positions used between 2 and 3]
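The difference between the two counting methods is easiest to see on a toy example. The sketch below assumes equal-length consensus sequences in which positions without solid evidence are marked "N"; that marker, the genome names and the sequences are hypothetical.

```python
from itertools import combinations

UNCALLED = "N"  # assumed marker for a position that was not called

def differences_all_called(seqs):
    """Method 1: compare only positions that are called in every sequence."""
    names = list(seqs)
    length = len(seqs[names[0]])
    usable = [i for i in range(length)
              if all(seqs[n][i] != UNCALLED for n in names)]
    return {(a, b): sum(seqs[a][i] != seqs[b][i] for i in usable)
            for a, b in combinations(names, 2)}

def differences_pairwise_called(seqs):
    """Method 2: for each pair, compare positions called in both sequences."""
    return {(a, b): sum(x != y for x, y in zip(seqs[a], seqs[b])
                        if x != UNCALLED and y != UNCALLED)
            for a, b in combinations(seqs, 2)}

# Hypothetical consensus sequences for three genomes.
consensus = {
    "G1": "ATTCAGTAGT",
    "G2": "ATGCANTTGA",
    "G3": "NTGCAATTGT",
}
method1 = differences_all_called(consensus)
method2 = differences_pairwise_called(consensus)
print(method1)
print(method2)
```

Here the difference between G1 and G3 at the sixth position is dropped by Method 1, because G2 has no call there, but kept by Method 2.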
NDtree
Uses two different algorithms to make two different trees
• UPGMA
• Neighbor Joining
Both algorithms are part of the PHYLIP Neighbor program package and make trees from distance matrices
UPGMA vs. Neighbor Joining
• UPGMA works when samples have been taken at the same time
• Neighbor Joining is better when samples have been taken at different times
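To show how a tree is made from a distance matrix, here is a minimal pure-Python sketch of UPGMA that returns a Newick string. It is not the PHYLIP implementation; the function name `upgma` and the tie-breaking rule (the first closest pair in input order wins) are choices made for this illustration. The input is the four-strain distance matrix from the earlier example.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a UPGMA tree from a square distance matrix; return Newick text."""
    # cluster id -> (Newick text, number of tips, height above the tips)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    d = {frozenset((i, j)): float(dist[i][j])
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        # Closest pair of clusters; on ties the first pair found is used.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        na, size_a, height_a = clusters.pop(a)
        nb, size_b, height_b = clusters.pop(b)
        height = d[frozenset((a, b))] / 2
        newick = f"({na}:{height - height_a:g},{nb}:{height - height_b:g})"
        # Distance to every remaining cluster: size-weighted average.
        for c in clusters:
            d[frozenset((next_id, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[next_id] = (newick, size_a + size_b, height)
        next_id += 1
    newick, _, _ = next(iter(clusters.values()))
    return newick + ";"

names = ["A", "B", "C", "D"]
dist = [
    [0, 3, 3, 3],
    [3, 0, 2, 4],
    [3, 2, 0, 4],
    [3, 4, 4, 0],
]
tree = upgma(names, dist)
print(tree)
```

UPGMA places every tip at the same distance from the root, which is why it suits samples taken at the same time. Neighbor Joining does not make that assumption. Note that this toy matrix contains ties, so the grouping of A and D depends on the tie-breaking rule.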
NDtree
Output
• distance.txt: Distance matrix - tab separated
• dist.mat: Distance matrix - PHYLIP format
• tree.nj.newick: Neighbor Joining tree - Newick format
– Branch lengths are the number of nucleotide differences
• tree.upgma.newick: UPGMA tree - Newick format
– Branch lengths are the number of nucleotide differences
Controlled Evolution study
[Figure: experimental design, Day 1, Day 2, Day 3, …, Day 8]
Single colonies of CSH114
Choose colony, grow for 8 h
Plate out, grow for 16 h
Choose colonies, grow for 8 h
Plate out, grow for 16 h
Choose colonies, grow for 8 h
128x
For each 8 hour culture a sample was saved for DNA sequencing
J. Ahrenfeldt, C. Skaarup, H. Hasman, A. G. Pedersen, F. M. Aarestrup and O. Lund. Bacterial whole genome-based phylogeny: construction of a new benchmarking dataset and assessment of some existing methods. BMC Genomics (2017) 18:19
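Naming the descendants
[Figure: binary tree of descendants, Day 1 to Day 5. Each descendant is named by appending 1 or 2 to the name of its parent.]
Day 1: S
Day 2: S1, S2
Day 3: S11, S12, S21, S22
Day 4: S111, S112, S121, S122, S211, S212, S221, S222
Day 5: S1111, S1112, S1121, S1122, S1211, S1212, S1221, S1222, S2111, S2112, S2121, S2122, S2211, S2212, S2221, S2222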
Mutations
Phylogenetic tree using NDtree (UPGMA)
[Figure: UPGMA tree of the samples S to S2221; scale bar 2.0]
Phylogenetic tree using NDtree (Neighbor Joining)
[Figure: Neighbor Joining tree of the samples]
UPGMA vs. Neighbor Joining
• UPGMA works when samples have been taken at the same time
• Neighbor Joining is better when samples have been taken at different times
CSI Phylogeny – Default settings
[Figures: two CSI Phylogeny trees of the samples, each with scale bar 0.2]
CSI Phylogeny – Pruning disabled
So… What should I use when?
CSI Phylogeny
• Has very good statistics and a good graphical overview.
• Advantageous to use when you expect the differences between the isolates to be larger than 5-10 mutations.
• Is faster
NDtree
• Is able to find very small differences.
• Does not take recombination into consideration.
• Works best on raw reads. If given assembled genomes, it simulates reads.
Choosing a reference genome
For comparison of very closely related isolates, a better level of detail is given by using a closely related reference genome.
What defines an outbreak
• We can’t tell for certain
• It depends on the species
• But a rule of thumb is:
– Within 10 SNPs it is definitely an outbreak
– Within 30 SNPs it might be an outbreak
– Above 60 SNPs it is most likely not an outbreak
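The rule of thumb can be written down as a small helper for screening pairwise SNP distances. This is only a sketch: the function name `outbreak_assessment` is hypothetical, "within" is read as "less than or equal to", and distances from 31 to 60 SNPs are reported as not covered because the rule says nothing about them. As noted above, the right thresholds depend on the species.

```python
def outbreak_assessment(snp_distance):
    """Translate a pairwise SNP distance into the rule-of-thumb wording."""
    if snp_distance < 0:
        raise ValueError("SNP distance cannot be negative")
    if snp_distance <= 10:
        return "definitely an outbreak"
    if snp_distance <= 30:
        return "might be an outbreak"
    if snp_distance > 60:
        return "most likely not an outbreak"
    return "not covered by the rule of thumb"

for distance in (3, 25, 45, 120):
    print(distance, "SNPs:", outbreak_assessment(distance))
```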
Thank you for listening
• Questions?