Phylogeny - based on whole genome data Johanne Ahrenfeldt Research Assistant

Dec 21, 2015

Page 1:

Phylogeny - based on whole genome data

Johanne Ahrenfeldt Research Assistant

Page 2:

Overview

• What is phylogeny and what can it be used for

• Single Nucleotide Polymorphism (SNP) methods: snpTree and CSI Phylogeny

• Nucleotide differences: NDtree

• Controlled Evolution study

• What services for which data

Page 3:

What are phylogenetic trees

• Trees are traditionally made using aligned sequences of single genes or proteins

• Whole genome data may be used to create trees based on

– SNP calling

– K-mer overlap

Page 4:

What is a SNP

• A Single Nucleotide Polymorphism (SNP) is a DNA sequence variation that occurs commonly within a population (e.g. 1%), in which a single nucleotide (A, T, C or G) in the genome (or other shared sequence) differs between members of a biological species or between paired chromosomes.

Page 5:

What is phylogeny used for

• Taxonomic classification – the classic use

• Outbreak detection – increasingly common with WGS data

Page 6:

How does it work

Strain A ATTCAGTA
Strain B ATGCAGTC
Strain C ATGCAATC
Strain D ATTCAGTC

Page 7:

Construct distance matrix

Strain A ATTCAGTA
Strain B ATGCAGTC
Strain C ATGCAATC
Strain D ATTCAGTC

  A B C D
A 0 2 3 1
B 2 0 1 1
C 3 1 0 2
D 1 1 2 0
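Each entry in the matrix is the number of positions at which two strains differ. A minimal Python sketch of that count, using the four example sequences from the slide (the function and variable names are illustrative only):

```python
# Example sequences from the slide.
strains = {
    "A": "ATTCAGTA",
    "B": "ATGCAGTC",
    "C": "ATGCAATC",
    "D": "ATTCAGTC",
}


def hamming(seq1, seq2):
    """Count the positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def distance_matrix(seqs):
    """Return a nested dict: matrix[x][y] = number of differences."""
    names = sorted(seqs)
    return {x: {y: hamming(seqs[x], seqs[y]) for y in names} for x in names}


matrix = distance_matrix(strains)
for name, row in matrix.items():
    print(name, *row.values())
```

The printed rows match the matrix on the slide, for example 0 2 3 1 for Strain A.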

Page 8:

Make Tree

Strain A ATTCAGTA
Strain B ATGCAGTC
Strain C ATGCAATC
Strain D ATTCAGTC

  A B C D
A 0 2 3 1
B 2 0 1 1
C 3 1 0 2
D 1 1 2 0

[Figure: tree with leaves A, D, B and C]

Page 9:

snpTree

• First online webserver for constructing phylogenetic trees based on whole genome sequencing

Leekitcharoenphon P, Kaas RS, Thomsen MC, Friis C, Rasmussen S, Aarestrup FM. snpTree - a web-server to identify and construct SNP trees from whole genome sequence data. BMC Genomics. 2012;13 Suppl 7:S6.

Page 10:

snpTree flow

Page 11:

snpTree output

Page 12:

CSI Phylogeny

• Finds SNPs in the same manner as snpTree

• Stricter filtering of SNPs

• Requires all SNPs to be significant

– Z-score higher than 1.96 for all SNPs, with $Z = \frac{X - Y}{\sqrt{X + Y}}$

• X is the number of reads with the most common nucleotide at that position, and Y is the number of reads with any other nucleotide.

Page 13:

NDtree

• A different approach: the main question is not whether a SNP should be called, but whether there is solid evidence for which nucleotide to call at each position.

Page 14:

NDtree

• When all reads had been mapped, the significance of the base call at each position was evaluated by calculating the number of reads X having the most common nucleotide at that position, and the number of reads Y supporting other nucleotides.

• The Z-score was calculated and compared with a threshold:

$Z = \frac{X - Y}{\sqrt{X + Y}} > 1.96$ (or 3.29)

• >90% of reads supporting the same base
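The two criteria can be sketched in Python as follows. This assumes the Z-score has the form Z = (X - Y) / sqrt(X + Y), with X and Y as defined above. The function names, default values and the handling of positions with no reads are illustrative assumptions, not the NDtree implementation.

```python
import math


def z_score(x, y):
    """Z-score for X reads of the most common base and Y reads of other bases.

    Assumed form: Z = (X - Y) / sqrt(X + Y).
    """
    return (x - y) / math.sqrt(x + y)


def base_is_called(x, y, z_threshold=1.96, min_fraction=0.9):
    """Return True if the position passes both the Z-score and the 90% rule."""
    depth = x + y
    if depth == 0:
        # Assumption: a position with no mapped reads is never called.
        return False
    return z_score(x, y) > z_threshold and x / depth > min_fraction


# Ten agreeing reads pass at Z > 1.96 but not at the stricter Z > 3.29.
print(base_is_called(10, 0))
print(base_is_called(10, 0, z_threshold=3.29))
# Fifty reads against ten give a high Z-score but fail the 90% rule.
print(base_is_called(50, 10))
```

Under these assumptions, at least four agreeing reads are needed to pass Z > 1.96 when no read disagrees, since 3 / sqrt(3) is about 1.73 and 4 / sqrt(4) is 2.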

Page 15:

NDtree

• Count nucleotide differences

– Method 1: Each pair of sequences was compared and the number of nucleotide differences in positions called in all sequences was counted.

• More accurate (Z=1.96 is used as threshold)

– Method 2: Each pair of sequences was compared and the number of nucleotide differences in positions called in both sequences was counted.

• More robust (Z=3.29 is used as threshold)
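The difference between the two counting methods can be illustrated with a short Python sketch. It assumes that positions without a confident base call are marked with "N" (a placeholder chosen for this example) and that all sequences have the same length. The sequences and function name are made up for illustration.

```python
from itertools import combinations

UNCALLED = "N"  # assumed marker for a position with no confident base call


def count_differences(seqs, method):
    """Count pairwise nucleotide differences.

    method 1: only positions called in ALL sequences are compared.
    method 2: positions called in BOTH sequences of the pair are compared.
    """
    length = len(next(iter(seqs.values())))
    called_in_all = [
        all(seq[i] != UNCALLED for seq in seqs.values()) for i in range(length)
    ]
    counts = {}
    for n1, n2 in combinations(sorted(seqs), 2):
        s1, s2 = seqs[n1], seqs[n2]
        diffs = 0
        for i in range(length):
            if method == 1:
                usable = called_in_all[i]
            else:
                usable = s1[i] != UNCALLED and s2[i] != UNCALLED
            if usable and s1[i] != s2[i]:
                diffs += 1
        counts[(n1, n2)] = diffs
    return counts


# Hypothetical sequences: strain C has no call at the second position.
example = {"A": "ATTCAGTA", "B": "AGGCAGTC", "C": "ANGCAATC"}
method1 = count_differences(example, method=1)
method2 = count_differences(example, method=2)
print(method1[("A", "B")], method2[("A", "B")])  # prints: 2 3
```

In this example, Method 1 discards the second position for every pair because strain C is uncalled there, while Method 2 still uses it when comparing A with B.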

Page 16:

NDtree

• A matrix with these numbers was given as input to a UPGMA algorithm implemented in the neighbor program.

• Simplest method:

– First cluster the closest strains

• Merge those strains to one point

– Distance of the cluster to another strain is the average of the distances from each member of the cluster to that strain

– Then merge the second closest

– Repeat until all strains are clustered
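The clustering steps above can be sketched in Python. This is a minimal illustration of average-linkage (UPGMA) clustering on the four-strain example matrix from the earlier slides, not the neighbor program itself. It returns only the tree topology as nested tuples and breaks ties by taking the first closest pair it encounters.

```python
from itertools import combinations

NAMES = ["A", "B", "C", "D"]
DIST = {
    "A": {"A": 0, "B": 2, "C": 3, "D": 1},
    "B": {"A": 2, "B": 0, "C": 1, "D": 1},
    "C": {"A": 3, "B": 1, "C": 0, "D": 2},
    "D": {"A": 1, "B": 1, "C": 2, "D": 0},
}


def average_distance(members1, members2, dist):
    """Mean distance over all pairs with one strain from each cluster."""
    pairs = [(a, b) for a in members1 for b in members2]
    return sum(dist[a][b] for a, b in pairs) / len(pairs)


def upgma(names, dist):
    """Return the tree topology as nested tuples of strain names."""
    # Each cluster is (subtree, tuple of member strains).
    clusters = [(name, (name,)) for name in names]
    while len(clusters) > 1:
        # Find the closest pair of clusters (first one wins on ties).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_distance(
                clusters[p[0]][1], clusters[p[1]][1], dist
            ),
        )
        subtree = (clusters[i][0], clusters[j][0])
        members = clusters[i][1] + clusters[j][1]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((subtree, members))
    return clusters[0][0]


tree = upgma(NAMES, DIST)
print(tree)  # (('A', 'D'), ('B', 'C'))
```

Several pairs in this matrix tie at distance 1, so a different tie-breaking rule could give a different first merge.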

Page 17:

Controlled Evolution study

This study was performed as part of my Master's thesis

Page 18:

Naming the descendants

Page 19:

Mutations

Page 20:

Phylogenetic tree using NDtree (UPGMA)

Page 21:

Phylogenetic tree using Neighbor Joining

Page 22:

UPGMA vs. Neighbor Joining

• UPGMA works well when samples have been taken at the same time

• Neighbor joining is better when samples have been taken at different times
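One way to see the difference is to compare the first pair each method would join when some samples have accumulated more changes than others. The sketch below uses made-up distances for a true tree ((A,B),(C,D)) in which B and D sit on long branches. UPGMA joins the pair with the smallest raw distance, while neighbor joining uses the standard criterion Q(i,j) = (n - 2) d(i,j) - r_i - r_j, where r_i is the sum of distances from i to all other samples. The data and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical distances consistent with the true tree ((A,B),(C,D)),
# where B and D are on long branches (for example, sampled much later).
NAMES = ["A", "B", "C", "D"]
D = {
    ("A", "B"): 6, ("A", "C"): 3, ("A", "D"): 7,
    ("B", "C"): 7, ("B", "D"): 11, ("C", "D"): 6,
}


def d(i, j):
    """Symmetric lookup in the distance table."""
    return 0 if i == j else D[tuple(sorted((i, j)))]


def first_pairs_by_distance():
    """Pairs with the smallest raw distance (the first UPGMA merge)."""
    best = min(d(i, j) for i, j in combinations(NAMES, 2))
    return [p for p in combinations(NAMES, 2) if d(*p) == best]


def first_pairs_by_nj():
    """Pairs minimising the neighbor-joining Q criterion."""
    n = len(NAMES)
    r = {i: sum(d(i, k) for k in NAMES) for i in NAMES}
    q = {
        (i, j): (n - 2) * d(i, j) - r[i] - r[j]
        for i, j in combinations(NAMES, 2)
    }
    best = min(q.values())
    return [p for p, value in q.items() if value == best]


print(first_pairs_by_distance())  # [('A', 'C')]
print(first_pairs_by_nj())        # [('A', 'B'), ('C', 'D')]
```

With these distances, the smallest raw distance groups A with C, which are not neighbors in the true tree, while the Q criterion selects the true neighbor pairs.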

Page 23:

CSI Phylogeny

Page 24:

So… What should I use when?

• CSI Phylogeny is the better choice when you expect the samples to differ by more than 5-10 mutations

• NDtree, on the other hand, is able to find smaller differences than that, but may not be strict enough to handle very large differences.