Phylogenetic Tree Construction using Pathway Analysis Bioengineering 190C Project By: Harry Choi Nick Lin Gabe Kwong Li Yan Christina Yau

Mar 27, 2015


Phylogenetic Tree Construction using Pathway Analysis

Bioengineering 190C Project

By: Harry Choi, Nick Lin, Gabe Kwong, Li Yan, Christina Yau


Background

Traditional Approach
  Comparison of single orthologs between organisms
  Distance matrix generation from similarity scores
  Hierarchical Clustering
  Tree Construction

Disadvantages
  Sensitive to choice of gene for comparison
  Possible inconsistency of trees generated


Our Approach

N annotated organisms are to be clustered
A reference organism is chosen
A pathway in the reference organism is chosen
A pool of orthologs in the N organisms is generated by BLAST
Analysis of the ortholog pool generates a vector representing each organism
Distance calculation from vectors
Hierarchical Clustering
Tree Construction


Rationale for Approach

Pathway takes into account multiple genes
Individual differences between genes not directly taken into account
All genes considered are related to each other in cellular function
Better conservation of actual function than sequence identities
Better consistency in trees generated


Program Design

Modular design divided into the following portions:
  BLAST
  Analysis of BLAST results
  Hierarchical Clustering
  Tree Construction

Design allows for reuse of components in different applications with minor changes

Design allows individual subroutines to be used recursively to generate desired results with minimal changes


Step 1: BLAST

Conserved proteins from the Wnt pathway will be used as an example


Wnt pathway

Wnt proteins form a family of highly conserved secreted signaling molecules that regulate cell-to-cell interactions during embryogenesis. Wnt genes and Wnt signaling are also implicated in cancer. The Wnt pathway is found in many organisms, such as Drosophila, Caenorhabditis elegans, Xenopus, chicken, mouse, zebrafish, and human.


Wnt pathway (cont)

Choose the 6 most conserved proteins from this pathway as seed proteins:
  Wnt
  Frizzled
  Dsh
  Apc
  Axin
  Tcf

(Roel Nusse, 2002)


5 organisms

Drosophila: 54455 sequences
Mouse: 77143 sequences
C. elegans: 62256 sequences
Zebrafish: 3069 sequences
Xenopus: 5174 sequences


Strategy

Seed protein (pr1) from Organism 1 (O1) is BLASTed against the 4 other organisms.

Secondary seed proteins (pr2, …, pr9) are each BLASTed against their respective 4 other organisms.

[Figure: tables with columns O1 to O5 listing the proteins (pr1 to pr26) found in each organism after each round of BLAST.]
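The two-round strategy above can be sketched as follows. This is only an illustration: `expand_seeds`, `fake_search`, and the `max_evalue` cutoff are hypothetical names and values that do not appear in the slides, and the search function stands in for a real BLAST run.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]        # (hit protein id, e-value)
Row = Tuple[str, str, float]   # (query id, hit id, e-value)


def expand_seeds(
    seed: str,
    seed_org: str,
    organisms: List[str],
    search: Callable[[str, str], List[Hit]],
    max_evalue: float = 1e-20,  # hypothetical cutoff, not taken from the slides
) -> List[Row]:
    """Round 1: seed vs. every other organism. Round 2: each hit vs. its other organisms."""
    rows: List[Row] = []
    secondary: List[Tuple[str, str]] = []  # (protein id, organism it was found in)

    for org in organisms:
        if org == seed_org:
            continue
        for hit, evalue in search(seed, org):
            if evalue <= max_evalue:
                rows.append((seed, hit, evalue))
                secondary.append((hit, org))

    for protein, home in secondary:
        for org in organisms:
            if org == home:
                continue
            for hit, evalue in search(protein, org):
                if evalue <= max_evalue:
                    rows.append((protein, hit, evalue))
    return rows


def fake_search(query: str, organism: str) -> List[Hit]:
    """Stand-in for a real BLAST call: returns one made-up hit per organism."""
    return [(f"{query}_{organism}_1", 1e-50)]


rows = expand_seeds("pr1", "O1", ["O1", "O2", "O3"], fake_search)
print(len(rows))  # 2 first-round rows plus 4 second-round rows
```

Passing the search step in as a function keeps the expansion logic independent of how BLAST is actually invoked.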


Example input seed sequence for BLAST

>gi|85190|pir||A29650 wingless (wg) protein precursor - fruit fly (Drosophila melanogaster)

MDISYIFVICLMALCSGGSSLSQVEGKQKSGRGRGSMWWGIAKVGEPNNITPIMYMDPAIHSTLRRKQRRLVRDNPGVLGALVKGANLAISECQHQFRNRRWNCSTRNFSRGKNLFGKIVDRGCRETSFIYAITSAAVTHSIARACSEGTIESCTCDYSHQSRSPQANHQAGSVAGVRDWEWGGCSDNIGFGFKFSREFVDTGERGRNLREKMNLHNNEAGRAHVQAEMRQECKCHGMSGSCTVKTCWMRLANFRVIGDNLKARFDGATRVQVTNSLRATNALAPVSPNAAGSNSVGSNGLIIPQSGLVYGEEEERMLNDHMPDILLENSHPISKIHHPNMPSPNSLPQAGQRGGRNGRRQGRKHNRYHFQLNPHNPEHKPPGSKDLVYLEPSPSFCEKNLRQGILGTHGRQCNETSLGVDGCGLMCCGRGYRRDEVVVVERCACTFHWCCEVKCKLCRTKKVIYTCL

(FASTA format: the header line starts with “>”)
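A minimal sketch of reading such a FASTA record is shown below. The function name `parse_fasta` is an assumption for illustration, and the example sequence is only the first residues of the wingless sequence, wrapped over two lines.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs; a header is any line starting with '>'."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">gi|85190|pir||A29650 wingless (wg) protein precursor\nMDISYIFVIC\nLMALCSGGSS\n"
records = parse_fasta(example)
print(records[0][1])  # MDISYIFVICLMALCSGGSS
```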


Example output file from BLAST: 15 secondary seed proteins

wg_85190 wg_celegans_7508752 1.70e-41
wg_85190 wg_celegans_3880389 1.70e-41
wg_85190 wg_celegans_17539494 1.70e-41
wg_85190 wg_zebrafish_103816 1.20e-80
wg_85190 wg_zebrafish_833600 1.20e-80
wg_85190 wg_zebrafish_18859559 1.20e-80
wg_85190 wg_zebrafish_139740 1.20e-80
wg_85190 wg_xenopus_65236 1.40e-76
wg_85190 wg_xenopus_69039 1.40e-76
wg_85190 wg_xenopus_139748 1.40e-76
wg_85190 wg_mouse_293671 2.50e-78
wg_85190 wg_mouse_387388 2.50e-78
wg_85190 wg_mouse_69037 2.50e-78
wg_85190 wg_mouse_13529431 2.50e-78
wg_85190 wg_mouse_139744 2.50e-78


Example output file from BLAST (cont)

wg_celegans_7508752 wg_celegans_7508752_drosophila_6537292 1.30e-90
wg_celegans_7508752 wg_celegans_7508752_drosophila_12018324 1.30e-90
wg_celegans_7508752 wg_celegans_7508752_xenopus_422628 1.10e-96
wg_celegans_7508752 wg_celegans_7508752_xenopus_313268 1.10e-96
wg_celegans_7508752 wg_celegans_7508752_xenopus_465484 1.10e-96
wg_celegans_7508752 wg_celegans_7508752_mouse_202406 2.40e-96
wg_celegans_7508752 wg_celegans_7508752_mouse_227507 2.40e-96
wg_celegans_7508752 wg_celegans_7508752_mouse_111253 2.40e-96
wg_celegans_7508752 wg_celegans_7508752_mouse_14789729 2.40e-96
wg_celegans_7508752 wg_celegans_7508752_mouse_6678599 2.40e-96
wg_celegans_7508752 wg_celegans_7508752_mouse_14424475 2.40e-96
wg_celegans_7508752 wg_celegans_7508752_zebrafish_1256778 2.30e-94
wg_celegans_7508752 wg_celegans_7508752_zebrafish_18859567 2.30e-94
wg_celegans_7508752 wg_celegans_7508752_zebrafish_2501662 2.30e-94
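Output in this three-column form (query, hit, e-value) could be read back with a sketch like the one below. It assumes, based only on the names shown in the examples, that each hit identifier ends in `_<organism>_<number>`; `parse_hits` is a hypothetical helper, not part of the project's code.

```python
from typing import Dict, List


def parse_hits(text: str) -> List[Dict[str, object]]:
    """Parse whitespace-separated 'query hit evalue' lines into dictionaries."""
    hits: List[Dict[str, object]] = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        query, hit, evalue = parts
        pieces = hit.rsplit("_", 2)  # assumed naming: <prefix>_<organism>_<number>
        if len(pieces) < 3:
            continue
        hits.append({
            "query": query,
            "hit": hit,
            "organism": pieces[1],
            "number": pieces[2],
            "evalue": float(evalue),
        })
    return hits


sample = (
    "wg_85190 wg_celegans_7508752 1.70e-41\n"
    "wg_celegans_7508752 wg_celegans_7508752_xenopus_422628 1.10e-96\n"
)
hits = parse_hits(sample)
print(hits[1]["organism"], hits[1]["evalue"])  # xenopus 1.1e-96
```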


Step 2: Analysis of BLAST Results

i.e., Metric Determination


Metric Determination

Common algorithms used to calculate a distance metric from similarity scores include (1 - %Identity) and $S = e^{-d/2}$ (Shepard 1987).
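As a short worked step, the Shepard relation can be inverted to obtain a distance from a similarity score. This assumes the similarity $S$ lies in $(0, 1]$, which the slides do not state explicitly.

```latex
S = e^{-d/2}
\;\Longrightarrow\;
\ln S = -\frac{d}{2}
\;\Longrightarrow\;
d = -2 \ln S
```

Under that assumption, $S = 1$ gives $d = 0$, and smaller similarities give larger distances.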

A different algorithm is used for this project.


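Rules the Metric Must Satisfy

The distance between a gene and itself must be zero: $D_{ii} = 0$.
Commutative property: $D_{ij} = D_{ji}$.
Triangle inequality: $D_{ij} + D_{ik} \geq D_{jk}$.

[Figure: triangle with vertices i, j, k and sides labeled $D_{ij}$, $D_{ik}$, $D_{jk}$.]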
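These three rules can be checked mechanically for any candidate distance matrix. The sketch below is illustrative; `is_metric` and the tolerance `tol` are assumptions, not part of the project's code.

```python
from itertools import permutations
from typing import List


def is_metric(D: List[List[float]], tol: float = 1e-9) -> bool:
    """Check zero self-distance, symmetry, and the triangle inequality."""
    n = len(D)
    for i in range(n):
        if abs(D[i][i]) > tol:              # D_ii = 0
            return False
        for j in range(n):
            if abs(D[i][j] - D[j][i]) > tol:  # D_ij = D_ji
                return False
    for i, j, k in permutations(range(n), 3):
        if D[i][j] + D[i][k] < D[j][k] - tol:  # D_ij + D_ik >= D_jk
            return False
    return True


good = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
bad = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]  # 1 + 1 < 5 breaks the triangle inequality
print(is_metric(good), is_metric(bad))  # True False
```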


Our Algorithm

Determine the unique gene pool from all the organisms that meet the threshold for a particular gene in the pathway.

[Figure: BLAST hits (g1, g2, g3, g4) for Wg-Drosophila and Celegans_17531491 are checked one at a time against the gene pool. "Is g1 unique?" Yes, so g1 enters the pool. "Is g2 unique?" No.]
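One way the unique-pool step might look is sketched below. The slides do not say how uniqueness is decided or what threshold is used, so this sketch assumes that two hits are the same gene when they share the trailing `<organism>_<number>` part of their identifier, and it uses a hypothetical e-value cutoff `max_evalue`.

```python
from typing import List, Tuple

Row = Tuple[str, str, float]  # (query id, hit id, e-value)


def build_gene_pool(rows: List[Row], max_evalue: float = 1e-20) -> List[Tuple[str, str]]:
    """Collect each (organism, number) pair once, keeping first-seen order."""
    pool: List[Tuple[str, str]] = []
    seen = set()
    for _query, hit, evalue in rows:
        if evalue > max_evalue:          # hit does not meet the (assumed) threshold
            continue
        pieces = hit.rsplit("_", 2)      # assumed naming: <prefix>_<organism>_<number>
        if len(pieces) < 3:
            continue
        key = (pieces[1], pieces[2])
        if key not in seen:              # "Is this gene unique?"
            seen.add(key)
            pool.append(key)
    return pool


rows = [
    ("wg_85190", "wg_mouse_293671", 2.5e-78),
    ("wg_celegans_7508752", "wg_celegans_7508752_mouse_293671", 2.4e-96),  # same gene again
    ("wg_85190", "wg_mouse_999", 1e-3),                                     # fails the cutoff
]
pool = build_gene_pool(rows)
print(pool)  # [('mouse', '293671')]
```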


Gene Vectors

Gene pool of entire Wnt pathway (rows) against organisms (columns):

        Drosophila   Mouse   Zebrafish   ...
g1      1            0       0
g2      0            1       0
g3      0            1       0
...     ...          ...     ...
gn      1            0       1

Homolog of gn found in Zebrafish (entry 1).
No homolog of gn found in Mouse (entry 0).
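A presence/absence vector per organism can be built from the gene pool as in the sketch below. The helper `gene_vectors` and the `found_in` mapping are assumptions for illustration, and the 0/1 values follow one reading of the columns on the slide.

```python
from typing import Dict, List, Set


def gene_vectors(
    pool: List[str],
    organisms: List[str],
    found_in: Dict[str, Set[str]],
) -> Dict[str, List[int]]:
    """For each organism, 1 where a homolog of the pool gene was found, else 0."""
    return {
        org: [1 if org in found_in.get(gene, set()) else 0 for gene in pool]
        for org in organisms
    }


pool = ["g1", "g2", "g3", "gn"]
found_in = {
    "g1": {"Drosophila"},
    "g2": {"Mouse"},
    "g3": {"Mouse"},
    "gn": {"Drosophila", "Zebrafish"},
}
vectors = gene_vectors(pool, ["Drosophila", "Mouse", "Zebrafish"], found_in)
print(vectors["Zebrafish"])  # [0, 0, 0, 1]
```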


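Euclidean Distance

Vectors are in n-dimensional space, where n is the number of genes in the pool
Determine the Euclidean distance by taking the square root of the sum of the squared differences.

$D_{ij} = \sqrt{(D_{i1}-D_{j1})^2 + \dots + (D_{in}-D_{jn})^2}$

$= \sqrt{(1-0)^2 + (1-1)^2 + (0-1)^2 + \dots}$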
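The same calculation in code, as a small sketch: `euclidean` is a hypothetical helper, and the two vectors are just the first three components used in the worked line above.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Square root of the sum of squared component differences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# (1-0)^2 + (1-1)^2 + (0-1)^2 = 2
print(euclidean([1, 1, 0], [0, 1, 1]))
```

For 0/1 vectors the squared distance is simply the number of genes present in one organism but not the other.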


Distance Matrix

        O1     O2     O3     ...    On
O1      0
O2      D21    0
O3      D31    D32    0
...     ...    ...    ...    ...
On      Dn1    ...    ...    ...    0

Since Euclidean distances commute (Dij = Dji), the matrix is symmetric and only its lower triangle needs to be filled in.
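Storing only the lower triangle might look like the sketch below. The function name `lower_triangle` and the example vectors are illustrative assumptions, not the project's data.

```python
import math
from typing import Dict, List, Tuple


def lower_triangle(vectors: Dict[str, List[int]]) -> Tuple[List[str], List[List[float]]]:
    """Row i holds the distances from organism i to organisms 0..i (diagonal included)."""
    names = list(vectors)
    rows: List[List[float]] = []
    for i, a in enumerate(names):
        row = []
        for j in range(i + 1):
            b = names[j]
            row.append(math.sqrt(sum((x - y) ** 2 for x, y in zip(vectors[a], vectors[b]))))
        rows.append(row)
    return names, rows


names, rows = lower_triangle({
    "Drosophila": [1, 0, 0, 1],
    "Mouse": [0, 1, 1, 0],
    "Zebrafish": [0, 0, 0, 1],
})
for name, row in zip(names, rows):
    print(name, row)
```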


Step 3: Hierarchical Clustering


Hierarchical Clustering

There are two types of clustering:
  Successive Fusions (Agglomerative Clustering)
  Separation (Divisive Clustering)


Hierarchical Clustering

In this project, an agglomerative clustering algorithm has been employed.

Idea: The most similar objects are grouped first. These groups are then merged according to their similarities, until all are fused into one single cluster.


Hierarchical Clustering

Input of the clustering program:

Any N x N triangular matrix containing the pairwise distances between the organisms

$D = \{d_{jk}\}$


Hierarchical Clustering

Feed the N x N matrix as the input and the clustering method will output an (N-1) x (N-1) matrix

In this case, with 5 organisms, it will be a 4 x 4 matrix.


Hierarchical Clustering

New distances are determined between the new group and each of the remaining organisms


Hierarchical Clustering

Continue with the clustering until all the organisms are fused into one cluster

$d_{(135)2} = \min\{d_{(35)2}, d_{12}\} = \min\{7, 9\} = 7$

$d_{(135)4} = \min\{d_{(35)4}, d_{14}\} = \min\{8, 6\} = 6$

$d_{(135)(24)} = \min\{d_{(135)2}, d_{(135)4}\} = \min\{7, 6\} = 6$
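The minimum rule used in these updates corresponds to single-linkage agglomerative clustering, which can be sketched as below. The full 5 x 5 matrix is not given in the slides; the one used here is an assumed example chosen so that the intermediate values (7, 9, 8, 6) agree with the numbers above, and `single_linkage` is a hypothetical function name.

```python
from typing import Dict, List, Tuple


def single_linkage(labels: List[str], D: List[List[float]]) -> List[Tuple[Tuple[str, ...], float]]:
    """Repeatedly merge the two closest clusters; new distances use the minimum rule."""
    clusters: Dict[int, Tuple[str, ...]] = {i: (labels[i],) for i in range(len(labels))}
    # Keys are (larger id, smaller id) so each pair is stored once.
    dist = {(i, j): D[i][j] for i in range(len(labels)) for j in range(i)}
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        (a, b), d = min(dist.items(), key=lambda item: item[1])
        merged = tuple(sorted(clusters.pop(a) + clusters.pop(b)))
        new = {}
        for c in clusters:  # distance from the merged group to each remaining cluster
            d_ac = dist[(max(a, c), min(a, c))]
            d_bc = dist[(max(b, c), min(b, c))]
            new[(next_id, c)] = min(d_ac, d_bc)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new)
        clusters[next_id] = merged
        merges.append((merged, d))
        next_id += 1
    return merges


# Assumed example matrix for organisms 1..5 (not taken from the slides).
D = [
    [0, 9, 3, 6, 11],
    [9, 0, 7, 5, 10],
    [3, 7, 0, 9, 2],
    [6, 5, 9, 0, 8],
    [11, 10, 2, 8, 0],
]
merges = single_linkage(["1", "2", "3", "4", "5"], D)
for members, d in merges:
    print(members, d)
```

With N organisms the loop runs N-1 times, and each iteration reports the merged group together with the distance at which it was formed.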


Hierarchical Clustering

The outputs of each run:
  The names of the organisms that are grouped together
  The distance between the two organisms

After N-1 iterations, the outputs are saved to a file and used to draw the phylogenetic tree.


Step 4: Phylogenetic Tree Construction


Tree Construction

Sample Input

Flat file of clusters and distances

e.g. sample1.txt
A B 4.5
B C 5.2
E D 5.8
C E 12.4

Or e.g. sample2.txt
A B 4.5
A B C 5.2
E D 5.8
A B C E D 12.4


Tree Construction

Sample Input (continued)

Requirements for input file:
  Each line must represent one cluster
  First entries are leaves in the cluster
  Last entry is the distance
  No more than two new leaves can be added in a cluster
  Each entry must be delimited by a tab

Flexibility
  File can have all leaves in the cluster or a new leaf and any leaf from previous clusters
  Subroutine can be reused to generate a tree from any file by modifying one line of code
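The input requirements listed above can be enforced by a small reader such as the sketch below. It is written in Python for illustration rather than in the project's own language, and `read_clusters` is a hypothetical name.

```python
from typing import List, Tuple


def read_clusters(text: str) -> List[Tuple[List[str], float]]:
    """Parse tab-delimited lines: leaves first, distance last, at most two new leaves per line."""
    clusters: List[Tuple[List[str], float]] = []
    seen = set()
    for lineno, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {lineno}: expected leaves followed by a distance")
        *leaves, last = fields
        distance = float(last)  # raises ValueError if the last entry is not a number
        new_leaves = [leaf for leaf in leaves if leaf not in seen]
        if len(new_leaves) > 2:
            raise ValueError(f"line {lineno}: more than two new leaves")
        seen.update(new_leaves)
        clusters.append((leaves, distance))
    return clusters


sample1 = "A\tB\t4.5\nB\tC\t5.2\nE\tD\t5.8\nC\tE\t12.4\n"
sample2 = "A\tB\t4.5\nA\tB\tC\t5.2\nE\tD\t5.8\nA\tB\tC\tE\tD\t12.4\n"
print(read_clusters(sample1))
print(read_clusters(sample2))
```

Both sample layouts pass the same checks, since each line introduces at most two leaves that have not appeared before.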


Tree Construction

Method

Subclass intree
  Read in file line by line
  Add new leaves to Vector leaves
  Array elts tracks the number of leaves added to the vector in each cluster
  Array d tracks the distances between elements in the cluster

Subclass treed
  Convert distances to pixels
  Draw tree and leaves in JFrame
  Draw scale of distance in JFrame
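The bookkeeping described for intree, and the distance-to-pixel step of treed, might be sketched as follows. This is a Python illustration of the idea, not the project's Java code; the names `leaves`, `elts`, and `d` mirror the slide, while `collect_leaves`, `to_pixels`, the linear scaling, and the 620-pixel width are assumptions.

```python
from typing import List, Tuple


def collect_leaves(clusters: List[Tuple[List[str], float]]) -> Tuple[List[str], List[int], List[float]]:
    """Return leaves in first-seen order, new leaves per cluster, and cluster distances."""
    leaves: List[str] = []
    elts: List[int] = []
    d: List[float] = []
    for members, distance in clusters:
        added = 0
        for leaf in members:
            if leaf not in leaves:
                leaves.append(leaf)
                added += 1
        elts.append(added)
        d.append(distance)
    return leaves, elts, d


def to_pixels(distances: List[float], width_px: int) -> List[int]:
    """Scale distances linearly so the largest one spans width_px pixels."""
    if not distances or max(distances) <= 0:
        raise ValueError("need at least one positive distance")
    scale = width_px / max(distances)
    return [round(x * scale) for x in distances]


clusters = [(["A", "B"], 4.5), (["B", "C"], 5.2), (["E", "D"], 5.8), (["C", "E"], 12.4)]
leaves, elts, d = collect_leaves(clusters)
print(leaves, elts)
print(to_pixels(d, 620))
```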


Tree Construction

Sample Output

[Figure: tree with leaves A, B, C, D, E and a distance scale with ticks from 1 to 8.]