PHYLOGENETIC TREES Introduction to Computational Biology CIS 786 With Dr. Barry Cohen Tuesday, May 7, 2001 Paul Wood Yanchun Song Chaowei Sun
Page 1

PHYLOGENETIC TREES

Introduction to Computational Biology

CIS 786

With Dr. Barry Cohen

Tuesday, May 7, 2001

Paul Wood

Yanchun Song

Chaowei Sun

Page 2

Introduction

Paul Wood

Yanchun Song

Chaowei Sun

Page 3

What is a Phylogenetic Tree?

• Phylogenetic trees are representations of the similarity or dissimilarity among living things, both existing and extinct, across a set of characteristics or features.

• The similarity of molecular and physical systems provides compelling evidence that all life on earth arose from a common ancestor.

Carl R. Woese, Interpreting the universal phylogenetic tree, Proc. Natl. Acad. Sci. USA, Vol. 97, Issue 15, 8392-8396, July 18, 2000
http://www.pnas.org/cgi/content/full/97/15/8392

Page 4

Why do we study Phylogenetic Trees?

…because humans need to… fill in blanks…

• Shall I _____ thee to a summer's day?
– W. Shakespeare, Sonnet 18

• There is a _____ between Homer and Hesiod, between Æschylus and Euripides…
– P. Shelley, Prometheus Unbound

• Life all around me… All in the loom, and oh what _____! Woodlands, meadows,…
– E. L. Masters, Spoon River Anthology

• If the foolish call them “flowers” / Need the wiser tell? // If the savants “_____” them / It is just as well.
– E. Dickinson, Part 1: Life, XCIV

…and understand in our own language…

COMPARE, SIMILARITY, PATTERNS, CLASSIFY

Page 5

What are some applications of “phylogenetic” trees?

Computational Linguistics
• Manning, Christopher D. and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, Massachusetts, 1999. http://www.aclweb.org/archive/fsnlp-ch1.pdf

Archaeological Statistics
• Archaeological Statistics: Brief Bibliography http://ad.trafficmp.com/tmpad/banner/itrack.asp?rv=3.0&id=16&nojs=1

Broad Historical and Technical Overview
• Discriminant Analysis and Clustering, Panel on Discriminant Analysis, Classification, and Clustering, Committee on Applied and Theoretical Statistics, Board on Mathematical Sciences, Commission on Physical Sciences, Mathematics, and Resources, National Research Council, National Academy Press, Washington, D.C., 1988. http://www.ulib.org/webRoot/Books/National_Academy_Press_Books/discrim_analysis/discr001.htm

Page 6

Phylogenetic trees are used to study locations, migrations, lives, health & cultures of populations.

[Figure labels: Velda, Helena, Tara, Katrina, Ursula, Xenia, Jasmine]

http://www.oxfordancestors.com/daughters.html

Page 7

Phylogenetic trees are used to study physical & genetic variability and the evolution of species.

http://www.oxfordancestors.com/daughters.html

Page 8

Which areas of the genome provide mutation data to create phylogenetic trees?

• Y-Chromosome

• Mitochondrial Control Region

• Autosomes

Page 9

How do we get data for computational biology?

STEP 1: Eukaryotic Biochemical Protocol is… kind of like washing greasy dishes!

[Figure: extraction protocol diagram. Labels: Homogenize; Detergent (Sodium Dodecyl Sulphate, SDS); Phenol; Genetic Material; Insoluble Protein; Remove Upper Phase; Cesium Chloride; Spin 40 hrs @ 40,000 RPM; Concentration gradient (Low Weight, Medium Weight, High Weight); RNA; Cs]

Page 10

How do we get sequence data?

STEP 2: Cut up DNA using one of “two” methods… &

STEP 3: Label fragments using one of “two” methods…

2 a: Sanger (Dideoxy)

2 b: Maxam-Gilbert

[Figure: sequencing workflow diagram. Labels: RNA; Cs; EtOH; Restriction Enzymes; ~ 4 Reactions (twice); 32Phosphate; Gel Electrophoresis; Autoradiography; Fluorescent Dye; Gel Electrophoresis; Fluorescence Spectroscopy; 3a; 3b]

Page 11

What is the rate of evolutionary change… or… how many mutations can we expect?

• Estimates vary depending upon assessment method and location within the genome

• “…134 independent mtDNA lineages spanning 327 generations found ~2.5 mutations per site per million years.”

– Parsons TJ, Muniec DS, Sullivan K, Woodyatt N, Alliston-Greiner R, Wilson MR, Berry DL, Holland KA, Weedn VW, Gill P, Holland MM. A high observed substitution rate in the human mitochondrial DNA control region. Nat Genet 1997 Apr; 15(4):363-8. Armed Forces DNA Identification Laboratory, Armed Forces Institute of Pathology, Rockville, Maryland 20850, USA. http://www.mhrc.net/mitochondria.htm

– M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt (1978). A model of evolutionary change in proteins. In: Atlas of Protein Sequence and Structure, M. O. Dayhoff (Ed.), National Biomedical Research Foundation, Vol. 5, Suppl. 3, chapter 22, 345-352.

Page 12

What do sequence data and input files typically look like?

PHYLIP INPUT FILE (SEQUENCE)

263 282
1 AY053096 cacgggagct …variable region... 282
2 AY053097 cacgggagct …variable region... 282
3 AY053098 cacgggagct …variable region... 282
.
263

MEGA INPUT FILE (SEQUENCE)

!Domain=Data property=Coding CodonStart=1;
#W._Pygmy_(1)_{African} TTC TTT CAT GGG
#W._Pygmy_(6)_{African} ... ... ... ...
#Kung_(7)_{African} ... .C. ... ... .T.
#Kung_(9)_{African} ... ... ... ... ...
#Kung_(10)_{African} ... ... ... ... ...
#Kung_(13)_{African} ... ... .G. ... ...

DISTANCE MATRIX

     A  B  C  D  E
  B  2
  C  4  4
  D  6  6  6
  E  6  6  6  4
  F  8  8  8  8  8
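A lower-triangular distance matrix like the one on this slide can be read into a symmetric lookup table before any tree building. The following is a minimal sketch, assuming whitespace-separated text with a header row of column labels; the function name `parse_lower_triangle` is hypothetical and is not part of PHYLIP or MEGA.

```python
def parse_lower_triangle(text):
    """Parse a lower-triangular distance table into a symmetric dict of dicts.

    Assumed layout (hypothetical helper, not a PHYLIP/MEGA reader):
    first non-empty line = column labels, each later line = row label + values.
    """
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    columns = lines[0]
    dist = {}
    for row in lines[1:]:
        label, values = row[0], [float(v) for v in row[1:]]
        # Pair each value with the column label it sits under.
        for col, value in zip(columns, values):
            dist.setdefault(label, {})[col] = value
            dist.setdefault(col, {})[label] = value
    for name in dist:
        dist[name][name] = 0.0  # distance of a taxon to itself
    return dist


MATRIX = """
   A  B  C  D  E
B  2
C  4  4
D  6  6  6
E  6  6  6  4
F  8  8  8  8  8
"""

dist = parse_lower_triangle(MATRIX)
print(dist["A"]["B"], dist["E"]["D"], dist["F"]["A"])
```

Storing both `dist[a][b]` and `dist[b][a]` costs a little memory but keeps later lookups free of ordering checks.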

Page 13

What are some of the major classifications of algorithms & software applications?

Count of Software Applications by Type and Platform

Type                              Unix/Source Code  DOS  Windows  Mac  VMS
General-purpose                                  6    5        5    3    3
Parsimony                                       12   12       13    5    3
Distance matrix                                 27   21       20   15    4
Compute distances                               22   16       17   14    6
Maximum likelihood                              23    5       13   14    5
Quartets methods                                 7    5        0    4    1
Artificial Intelligence                          1    0        0    0    0
Invariants                                       2    2        2    2    2
Tree rearrangement                               4    2        3    5    1
Recombination                                    9    2        1    2    0
Bootstrapping and other measures                16   15        9   11    2
Clocks, dating, and stratigraphy                10    2        6   10    0

PHYLIP, PAUP & MEGA are represented across most categories. PHYLIP is the most widely distributed and used. PAUP is most frequently cited in publications. MEGA has a nice GUI and is user friendly.

http://evolution.genetics.washington.edu/phylip/software.html

Page 14

Yanchun Song

Page 15

Two Types of Data

• Distance-based:
– The input is a matrix of distances between the species (e.g., the alignment score between them or the fraction of residues they agree on).

• Character-based:
– Examine each character (e.g., a base in a specific position in the DNA) separately.

Page 16

Pairwise Distance

• Model of Jukes and Cantor
– Each base in the DNA sequence has an equal chance of mutating, and when it does, it is replaced by some other nucleotide uniformly.

• Distance $d_{ij}$:
– The fraction $f$ of sites $u$ where residues $x^i_u$ and $x^j_u$ differ (presupposing an alignment of the two sequences).

T. H. Jukes and C. Cantor, Evolution of protein molecules, in Mammalian Protein Metabolism, pages 21-132, Academic Press, New York, 1969
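The fraction f can be computed directly from two aligned sequences. The sketch below also applies the standard Jukes-Cantor correction, d = -(3/4) ln(1 - 4f/3), which the slide does not state and is added here as an assumption about what the model is used for. The function names and the gap-skipping rule are illustrative choices.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction f of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns where either sequence has a gap ('-') are skipped.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(f):
    """Jukes-Cantor corrected distance for an observed difference fraction f."""
    if f >= 0.75:
        return math.inf  # saturated: the correction is undefined at or above 3/4
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


f = p_distance("ACGTACGTAC", "ACGTACGTTT")  # 2 of 10 sites differ
d = jukes_cantor(f)
print(f, d)
```

The corrected distance is always at least f, because some sites may have mutated more than once and the raw fraction undercounts them.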

Page 17

How to Make a Tree?

• Clustering methods:
– UPGMA
– Neighbor-joining

• Parsimony

Page 18

Clustering Method: UPGMA

• UPGMA: Unweighted Pair Group Method with Arithmetic Mean

• Distance $D_{i,j}$ between two clusters of species $C_i$ and $C_j$:

$$D_{i,j} = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$$

where $d(p, q)$ is the distance function between species, $n_i = |C_i|$ and $n_j = |C_j|$.

http://www.math.tau.ac.il/~rshamir/algmb/00/scribe00/html/lec08/node21.html

Page 19

Algorithm

• Initialization:
– Initialize n clusters with the given species, one species per cluster.
– Size of each cluster: $n_i \leftarrow 1$; assign a leaf for each species.

• Iteration:
– Find the i and j with minimal $D_{i,j}$.
– Create a new cluster (ij), which has $n_{(ij)} = n_i + n_j$ members.
– Connect i and j to the new node (ij), placed at height $D_{i,j}/2$ (for two leaves, each branch has length $D_{i,j}/2$).
– Compute the distance from (ij) to all other clusters as a weighted average of the distances from its components:

$$D_{(ij),k} = \frac{n_i}{n_i + n_j} D_{i,k} + \frac{n_j}{n_i + n_j} D_{j,k}$$

– Replace the columns and rows of clusters i and j in D with cluster (ij), with $D_{(ij),k}$ computed as above.

• Termination:
– Repeat until there is only one cluster left.
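The UPGMA steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by PHYLIP or MEGA: the function name `upgma`, the nested-tuple tree and the label format for merged clusters are all assumptions, and ties are broken by iteration order. The input is the six-taxon matrix from the "What do sequence data and input files typically look like?" slide.

```python
from itertools import combinations


def upgma(names, dist):
    """UPGMA sketch: returns (nested-tuple tree, list of (i, j, node height)).

    dist maps an unordered pair (a, b) of names to their distance.
    """
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    # cluster label -> (subtree, number of members)
    clusters = {name: (name, 1) for name in names}
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with minimal distance.
        i, j = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        dij = d[frozenset((i, j))]
        tree_i, ni = clusters.pop(i)
        tree_j, nj = clusters.pop(j)
        new = "(" + i + "," + j + ")"
        # Distance to every other cluster: size-weighted average.
        for k in clusters:
            dik, djk = d[frozenset((i, k))], d[frozenset((j, k))]
            d[frozenset((new, k))] = (ni * dik + nj * djk) / (ni + nj)
        clusters[new] = ((tree_i, tree_j), ni + nj)
        merges.append((i, j, dij / 2))  # new node sits at height D_ij / 2
    tree, _ = next(iter(clusters.values()))
    return tree, merges


names = ["A", "B", "C", "D", "E", "F"]
rows = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
dist = {(r, names[c]): v for r, vals in rows.items() for c, v in enumerate(vals)}

tree, merges = upgma(names, dist)
for i, j, height in merges:
    print(i, "+", j, "at height", height)
```

Because two pairs tie at distance 4 after the first merge, the order of the middle merges depends on the tie-breaking rule, but the node heights are the same either way.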

Page 20

UPGMA Example

     A  B  C  D  E
  B  2
  C  4  4
  D  6  6  6
  E  6  6  6  4
  F  8  8  8  8  8

http://www.icp.ucl.ac.be/~opperd/private/upgma.html

Page 21

UPGMA Example (cont’d)

     A  B  C  D  E
  B  2
  C  4  4
  D  6  6  6
  E  6  6  6  4
  F  8  8  8  8  8

     A,B  C  D  E
  C    4
  D    6  6
  E    6  6  4
  F    8  8  8  8

$$D_{(ij),k} = \frac{n_i}{n_i + n_j} D_{i,k} + \frac{n_j}{n_i + n_j} D_{j,k}$$

$D_{(A,B),C} = (D_{A,C} + D_{B,C}) / 2 = 4$
$D_{(A,B),D} = (D_{A,D} + D_{B,D}) / 2 = 6$
$D_{(A,B),E} = (D_{A,E} + D_{B,E}) / 2 = 6$
$D_{(A,B),F} = (D_{A,F} + D_{B,F}) / 2 = 8$

http://www.icp.ucl.ac.be/~opperd/private/upgma.html

Page 22

UPGMA Example (cont’d)

       A,B  C  D,E
  C      4
  D,E    6  6
  F      8  8  8

       AB,C  D,E
  D,E     6
  F       8    8

       ABC,DE
  F       8

http://www.icp.ucl.ac.be/~opperd/private/upgma.html
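The entries of the reduced matrices can be checked against the weighted-average update. The worked steps below assume the merge order the matrices show (D with E, then (A,B) with C, then (AB,C) with (D,E)), with cluster sizes as the weights:

```latex
\begin{align*}
D_{(D,E),(A,B)} &= \tfrac{1}{2} D_{D,(A,B)} + \tfrac{1}{2} D_{E,(A,B)} = \tfrac{1}{2}(6) + \tfrac{1}{2}(6) = 6 \\
D_{(AB,C),(D,E)} &= \tfrac{2}{3} D_{(A,B),(D,E)} + \tfrac{1}{3} D_{C,(D,E)} = \tfrac{2}{3}(6) + \tfrac{1}{3}(6) = 6 \\
D_{(ABC,DE),F} &= \tfrac{3}{5} D_{(AB,C),F} + \tfrac{2}{5} D_{(D,E),F} = \tfrac{3}{5}(8) + \tfrac{2}{5}(8) = 8
\end{align*}
```

With each new node placed at half the merge distance, the nodes sit at heights 1 (A,B), 2 (D,E), 2 (AB,C), 3 (ABC,DE) and 4 (the root joining F).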

Page 23

Additivity

• Given a tree, its edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them.

Page 24

Additivity

[Figure: leaves i and j joined at node k, which is connected to node m]

$D_{im} = D_{ik} + D_{km}$

$D_{jm} = D_{jk} + D_{km}$

$D_{ij} = D_{ik} + D_{jk}$

$$D_{km} = \frac{1}{2}\left(D_{im} + D_{jm} - D_{ij}\right)$$
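The three additive relations on this slide combine to give the distance from the internal node k to m using leaf-to-leaf distances only. Spelled out as a short derivation:

```latex
\begin{align*}
D_{im} + D_{jm} - D_{ij} &= (D_{ik} + D_{km}) + (D_{jk} + D_{km}) - (D_{ik} + D_{jk}) \\
&= 2 D_{km} \\
\Rightarrow \quad D_{km} &= \tfrac{1}{2}\left(D_{im} + D_{jm} - D_{ij}\right)
\end{align*}
```

This is the same update that neighbor-joining uses to compute distances from a newly created node to the remaining nodes.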

Page 25

The idea of Neighbor-joining

• Distance of i from the rest of the tree:

$$u_i = \sum_{k \neq i} \frac{D_{i,k}}{n - 2}$$

• To find neighboring nodes i and j:

min($D_{i,j} - (u_i + u_j)$)

$$D_{i,(ij)} = \frac{1}{2}\left(D_{i,j} + u_i - u_j\right)$$

$$D_{j,(ij)} = \frac{1}{2}\left(D_{i,j} + u_j - u_i\right)$$

[Figure: tree with leaves i, j, m, n and internal nodes k, l; branch lengths 0.1, 0.1, 0.1, 0.4, 0.4]

R. Durbin, et al, Additivity and neighbour-joining, Biological Sequence Analysis, p. 169-173, Cambridge Univ. Press, 1999.
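A small numeric check shows why the criterion subtracts u_i + u_j rather than simply taking the closest pair. The tree below is an assumed reading of the slide's figure: leaves i and j attach to node k by branches of 0.1 and 0.4, leaves m and n attach to node l by 0.1 and 0.4, and k and l are 0.1 apart. Which leaf carries which length is a guess made for illustration.

```python
from itertools import combinations

# Leaf-to-leaf path lengths on the assumed tree (hypothetical labelling).
D = {
    ("i", "j"): 0.5, ("i", "m"): 0.3, ("i", "n"): 0.6,
    ("j", "m"): 0.6, ("j", "n"): 0.9, ("m", "n"): 0.5,
}
leaves = ["i", "j", "m", "n"]


def dist(a, b):
    """Symmetric lookup into D."""
    return D[(a, b)] if (a, b) in D else D[(b, a)]


def u(a):
    """u_a = sum over k != a of D[a, k] / (n - 2)."""
    return sum(dist(a, k) for k in leaves if k != a) / (len(leaves) - 2)


pairs = list(combinations(leaves, 2))
closest = min(pairs, key=lambda p: dist(*p))
nj_pick = min(pairs, key=lambda p: dist(*p) - (u(p[0]) + u(p[1])))

print("smallest raw distance:", closest)
print("smallest D - (u_i + u_j):", nj_pick)
```

On this tree the closest pair by raw distance is (i, m), which are not neighbors, while the corrected criterion selects one of the true neighbor pairs, (i, j) or (m, n), which tie.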

Page 26

Algorithm: Neighbor-Joining

• Initialization:
– Define T to be the set of leaf nodes, one for each given sequence, and put n = |T|.

• Iteration:
– For each species i, compute $u_i = \sum_{k \neq i} \frac{D_{i,k}}{n - 2}$.
– Choose a pair i, j in T for which $D_{i,j} - (u_i + u_j)$ is minimal.
– Join i and j to a new cluster k = (ij). Calculate the branch lengths from i and j to the new node k as: $D_{i,k} = \frac{1}{2}(D_{i,j} + u_i - u_j)$, $D_{j,k} = \frac{1}{2}(D_{i,j} + u_j - u_i)$
– Compute the distances between k and each other cluster: $D_{k,m} = \frac{1}{2}(D_{i,m} + D_{j,m} - D_{i,j})$, $m \in T$
– Remove i and j from T and add k.

• Termination:
– When T consists of only two nodes i and j, connect the remaining nodes by a branch of length $D_{i,j}$.
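The neighbor-joining steps on this slide translate almost line for line into code. The following is a minimal sketch under stated assumptions: the function name `neighbor_joining`, the edge-list output and the label format for joined nodes are illustrative, and n is taken to be the current size of T in each iteration. The test input is a hypothetical four-leaf additive tree with branch lengths 0.1, 0.4, 0.1, 0.4 and an internal branch of 0.1.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """NJ sketch: returns a list of (node_a, node_b, branch_length) edges."""
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    nodes = list(names)  # the set T
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # u_a: summed distance from a to all other nodes, divided by n - 2.
        u = {a: sum(d[frozenset((a, k))] for k in nodes if k != a) / (n - 2)
             for a in nodes}
        # Pick the pair minimising D_ij - (u_i + u_j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - (u[p[0]] + u[p[1]]))
        dij = d[frozenset((i, j))]
        new = "(" + i + "," + j + ")"
        edges.append((i, new, 0.5 * (dij + u[i] - u[j])))
        edges.append((j, new, 0.5 * (dij + u[j] - u[i])))
        # Remove i and j from T, then add the new node with updated distances.
        nodes = [m for m in nodes if m not in (i, j)]
        for m in nodes:
            dim, djm = d[frozenset((i, m))], d[frozenset((j, m))]
            d[frozenset((new, m))] = 0.5 * (dim + djm - dij)
        nodes.append(new)
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))  # termination step
    return edges


D = {
    ("i", "j"): 0.5, ("i", "m"): 0.3, ("i", "n"): 0.6,
    ("j", "m"): 0.6, ("j", "n"): 0.9, ("m", "n"): 0.5,
}
edges = neighbor_joining(["i", "j", "m", "n"], D)
for a, b, length in edges:
    print(a, "--", b, round(length, 3))
```

Since the input distances are additive, the recovered branch lengths match the tree they were generated from, whichever of the tied pairs is joined first.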

Page 27

Chaowei Sun

Page 28

MEGA 2

• Molecular Evolutionary Genetics Analysis

• Provides tools for exploring and analyzing DNA and protein sequences from evolutionary perspectives

Page 29

History of MEGA

• MEGA 1
– DOS-Based

• MEGA 2
– User-friendly interface
– Windows
– Macintosh
– Sun Workstation
– Linux

Page 30

Input

• Character Sequence
– DNA/RNA
– Protein

• Distance Matrix

• Import data from other formats, PHYLIP, XML, etc.

Page 31

Character Sequence

Page 32

Distance Matrix

Page 33

Methods and Algorithms

• Methods for constructing phylogenetic trees from molecular data:

1. UPGMA Method

2. Neighbor-Joining (NJ) Method

3. Minimum Evolution (ME) Method

4. Maximum Parsimony (MP) Method

Page 34

Unweighted Pair Group Method with Arithmetic Mean - UPGMA

• Assumes a constant rate of evolution

• Sequential clustering method

• Produces a rooted tree

• Edge lengths represent time as measured by a molecular clock

Page 35

Neighbor-Joining - NJ

• No assumption of a constant rate of evolution

• Finds neighbors sequentially that may minimize the total length of the tree

• Produces an unrooted tree

• Root: can be placed at the midpoint of the longest route connecting two taxa in the tree

Page 36

Minimum Evolution - ME

• Finds a topology with the smallest sum of branch lengths

• Time-consuming: the sum of branch lengths has to be evaluated for all topologies
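To see why evaluating every topology is time-consuming, it helps to count them. The number of distinct unrooted binary trees on n labelled taxa is the standard result (2n - 5)!! = 3 × 5 × … × (2n - 5); this formula is not on the slide and is added here as background. The function name is illustrative.

```python
def unrooted_topologies(n):
    """Number of unrooted binary tree topologies on n labelled taxa: (2n - 5)!!"""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, "taxa:", unrooted_topologies(n), "topologies")
```

The count grows faster than exponentially, which is why practical programs search a subset of topologies rather than enumerating all of them.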

Page 37

Maximum Parsimony - MP

• Finds a topology that requires the smallest number of changes (substitutions)

• For each topology, sums up the total number of substitutions

Output - UPGMA

Branch lengthare equal

- constant rate

Page 39

Unrooted Tree - NJ

Root

Page 40

Output - NJ

Branch length is proportional to distance

Page 41

Output - ME

Page 42

Comparison

                          Computational Method
              Optimality criterion     Clustering algorithm
Characters    Parsimony
Distance      Minimum Evolution        UPGMA, Neighbor-Joining

Page 43

Comparison – Cont’d

• UPGMA, Neighbor-Joining
- Fast, suitable for large datasets: polynomial time (O(n²) for UPGMA with an efficient implementation, O(n³) for standard Neighbor-Joining)
- Depends upon the order in which we add sequences to the tree

• Minimum Evolution, Maximum Parsimony
- Time-consuming: finding the optimal tree is NP-hard in general
- Use an explicit function relating the trees to the data

Page 44

Thank you and enjoy the finals…