Phylogenetic Tree Construction using Pathway Analysis
Bioengineering 190C Project
By: Harry Choi, Nick Lin, Gabe Kwong, Li Yan, Christina Yau
Mar 27, 2015
Background
Traditional Approach
  Comparison of single orthologs between organisms
  Distance matrix generation from similarity scores
  Hierarchical clustering
  Tree construction
Disadvantages
  Sensitive to the choice of gene for comparison
  Possible inconsistency among the trees generated
Our Approach
  N annotated organisms are to be clustered
  A reference organism is chosen
  A pathway in the reference organism is chosen
  A pool of orthologs in the N organisms is generated by BLAST
  Analysis of the ortholog pool generates a vector representing each organism
  Distance calculation from the vectors
  Hierarchical clustering
  Tree construction
Rationale for Approach
  A pathway takes multiple genes into account
  Individual differences between genes are not directly taken into account
  All genes considered are related to each other in cellular function
  Better conservation of actual function than of sequence identity
  Better consistency in the trees generated
Program Design
Modular design divided into the following portions:
  BLAST
  Analysis of BLAST results
  Hierarchical clustering
  Tree construction
The design allows components to be reused in different applications with minor changes
The design allows individual subroutines to be used recursively to generate the desired results with minimal changes
Step 1: BLAST
Conserved proteins from the Wnt pathway are used as the example
Wnt pathway
Wnt proteins form a family of highly conserved secreted signaling molecules that regulate cell-to-cell interactions during embryogenesis. Wnt genes and Wnt signaling are also implicated in cancer. The Wnt pathway is found in many organisms, including Drosophila, Caenorhabditis elegans, Xenopus, chicken, mouse, zebrafish, and human.
Wnt pathway (cont.)
Choose the 6 most conserved proteins from this pathway as seed proteins:
  Wnt
  Frizzled
  Dsh
  Apc
  Axin
  Tcf
(Roel Nusse, 2002)
5 organisms:
  Drosophila: 54455 sequences
  Mouse: 77143 sequences
  C. elegans: 62256 sequences
  Zebrafish: 3069 sequences
  Xenopus: 5174 sequences
Strategy
A seed protein (pr1) from Organism 1 (O1) is BLASTed against the 4 other organisms.
The secondary seed proteins (pr2, ..., pr9) are each BLASTed against their respective 4 other organisms.
[Figure: five grids with columns O1, O2, O3, O4, O5, populated with protein labels pr1 through pr26]
Example input seed sequence for BLAST
>gi|85190|pir||A29650 wingless (wg) protein precursor - fruit fly (Drosophila melanogaster)
MDISYIFVICLMALCSGGSSLSQVEGKQKSGRGRGSMWWGIAKVGEPNNITPIMYMDPAIHSTLRRKQRRLVRDNPGVLGALVKGANLAISECQHQFRNRRWNCSTRNFSRGKNLFGKIVDRGCRETSFIYAITSAAVTHSIARACSEGTIESCTCDYSHQSRSPQANHQAGSVAGVRDWEWGGCSDNIGFGFKFSREFVDTGERGRNLREKMNLHNNEAGRAHVQAEMRQECKCHGMSGSCTVKTCWMRLANFRVIGDNLKARFDGATRVQVTNSLRATNALAPVSPNAAGSNSVGSNGLIIPQSGLVYGEEEERMLNDHMPDILLENSHPISKIHHPNMPSPNSLPQAGQRGGRNGRRQGRKHNRYHFQLNPHNPEHKPPGSKDLVYLEPSPSFCEKNLRQGILGTHGRQCNETSLGVDGCGLMCCGRGYRRDEVVVVERCACTFHWCCEVKCKLCRTKKVIYTCL
(FASTA format: the header line starts with ">")
Example output file from BLAST: 15 secondary seed proteins
wg_85190 wg_celegans_7508752 1.70e-41
wg_85190 wg_celegans_3880389 1.70e-41
wg_85190 wg_celegans_17539494 1.70e-41
wg_85190 wg_zebrafish_103816 1.20e-80
wg_85190 wg_zebrafish_833600 1.20e-80
wg_85190 wg_zebrafish_18859559 1.20e-80
wg_85190 wg_zebrafish_139740 1.20e-80
wg_85190 wg_xenopus_65236 1.40e-76
wg_85190 wg_xenopus_69039 1.40e-76
wg_85190 wg_xenopus_139748 1.40e-76
wg_85190 wg_mouse_293671 2.50e-78
wg_85190 wg_mouse_387388 2.50e-78
wg_85190 wg_mouse_69037 2.50e-78
wg_85190 wg_mouse_13529431 2.50e-78
wg_85190 wg_mouse_139744 2.50e-78
Example output file from BLAST (cont.)
wg_celegans_7508752 wg_celegans_7508752_drosophila_6537292 1.30e-90
wg_celegans_7508752 wg_celegans_7508752_drosophila_12018324 1.30e-90
wg_celegans_7508752 wg_celegans_7508752_xenopus_422628 1.10e-96
wg_celegans_7508752 wg_celegans_7508752_xenopus_313268 1.10e-96
wg_celegans_7508752 wg_celegans_7508752_xenopus_465484 1.10e-96
wg_celegans_7508752 wg_celegans_7508752_mouse_202406 2.40e-96
wg_celegans_7508752 wg_celegans_7508752_mouse_227507 2.40e-96
wg_celegans_7508752 wg_celegans_7508752_mouse_111253 2.40e-96
wg_celegans_7508752 wg_celegans_7508752_mouse_14789729 2.40e-96
wg_celegans_7508752 wg_celegans_7508752_mouse_6678599 2.40e-96
wg_celegans_7508752 wg_celegans_7508752_mouse_14424475 2.40e-96
wg_celegans_7508752 wg_celegans_7508752_zebrafish_1256778 2.30e-94
wg_celegans_7508752 wg_celegans_7508752_zebrafish_18859567 2.30e-94
wg_celegans_7508752 wg_celegans_7508752_zebrafish_2501662 2.30e-94
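Output of this shape (query, hit, numeric score) is straightforward to load for the later steps. The following is a minimal sketch in Python, assuming one whitespace-separated record per line; the function name parse_blast_rows is hypothetical and is not taken from the project's code.

```python
from typing import List, Tuple


def parse_blast_rows(text: str) -> List[Tuple[str, str, float]]:
    """Parse 'query hit score' records, one per line, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore empty lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
        query, hit, score = fields
        rows.append((query, hit, float(score)))
    return rows


# Three records copied from the example output file.
sample = """\
wg_85190 wg_celegans_7508752 1.70e-41
wg_85190 wg_zebrafish_103816 1.20e-80
wg_85190 wg_mouse_293671 2.50e-78
"""
rows = parse_blast_rows(sample)
```

Each record becomes a (query, hit, score) tuple, so hits can be grouped by organism or filtered by score in the analysis step.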
Step 2: Analysis of BLAST Results
i.e., Metric Determination
Metric Determination
Common algorithms used to calculate a distance metric from similarity scores include (1 - %Identity) and $S = e^{-d/2}$ (Shepard 1987).
A different algorithm is used for this project.
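For reference, the exponential relation can be inverted to recover a distance from a similarity score. This is a small worked step, assuming S is a similarity in (0, 1] and d the corresponding distance:

```latex
\[
S = e^{-d/2}
\quad\Longrightarrow\quad
d = -2 \ln S ,
\qquad
S = 1 \;\Rightarrow\; d = 0 .
\]
```

Under this assumption, identical sequences (S = 1) are at distance zero and the distance grows as the similarity falls.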
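Rules a Metric Must Satisfy
  The distance between a gene and itself must be zero: $D_{ii} = 0$.
  Symmetry (commutative property): $D_{ij} = D_{ji}$.
  Triangle inequality: $D_{ij} + D_{ik} \ge D_{jk}$.
[Figure: triangle with vertices i, j, k and sides $D_{ij}$, $D_{ik}$, $D_{jk}$]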
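The three rules can be checked mechanically on any candidate distance matrix. The sketch below is illustrative only: the function name is_metric, the tolerance, and the two small test matrices are assumptions made for this example, not part of the project.

```python
from itertools import permutations
from typing import List


def is_metric(D: List[List[float]], tol: float = 1e-9) -> bool:
    """Return True if the square matrix D satisfies the three metric rules."""
    n = len(D)
    # Rule 1: the distance between an item and itself is zero.
    for i in range(n):
        if abs(D[i][i]) > tol:
            return False
    # Rule 2: symmetry, D[i][j] == D[j][i].
    for i in range(n):
        for j in range(n):
            if abs(D[i][j] - D[j][i]) > tol:
                return False
    # Rule 3: triangle inequality, D[i][j] + D[i][k] >= D[j][k].
    for i, j, k in permutations(range(n), 3):
        if D[i][j] + D[i][k] < D[j][k] - tol:
            return False
    return True


good = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
bad = [[0, 1, 1], [1, 0, 5], [1, 5, 0]]  # 1 + 1 < 5 breaks the triangle inequality
```

Here is_metric(good) holds, while is_metric(bad) fails on the third rule.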
Our Algorithm
Determine the unique gene pool from all the organisms that meet the threshold for a particular gene in the pathway.
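[Figure: flowchart of gene pool construction. Candidate genes (Wg-Drosophila, Celegans_17531491, g1 to g4) are tested one at a time against the gene pool: "Is g1 unique?" Yes, g1; "Is g2 unique?" No.]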
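One way to picture the uniqueness test is as a filter followed by de-duplication. The sketch below assumes that each hit carries a numeric score where smaller means a stronger match, and that the threshold is an upper bound on that score; both are assumptions, as are the function name build_gene_pool and the toy hits.

```python
from typing import Iterable, List, Tuple


def build_gene_pool(hits: Iterable[Tuple[str, float]], threshold: float) -> List[str]:
    """Collect each gene that meets the threshold exactly once, in first-seen order."""
    pool: List[str] = []
    seen = set()
    for gene_id, score in hits:
        if score > threshold:
            continue  # hit does not meet the (assumed) score threshold
        if gene_id not in seen:  # "Is this gene unique?"
            seen.add(gene_id)
            pool.append(gene_id)
    return pool


# Toy hits: g2 and g1 appear twice, g3 fails the threshold.
hits = [("g1", 1e-80), ("g2", 1e-50), ("g2", 1e-60),
        ("g3", 1e-5), ("g4", 1e-70), ("g1", 1e-90)]
pool = build_gene_pool(hits, threshold=1e-10)
```

With these toy values the pool contains g1, g2, and g4, each listed once.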
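Gene Vectors
Gene pool of the entire Wnt pathway: g1, g2, g3, ..., gn
Each organism is represented by a vector with one entry per gene in the pool (1 = homolog found, 0 = no homolog found):
Drosophila   1 0 0 ... 1
Mouse        0 1 1 ... 0
Zebrafish    0 0 0 ... 1
...
A homolog of gn is found in Zebrafish (last entry 1).
No homolog of gn is found in Mouse (last entry 0).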
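The presence/absence encoding can be sketched as follows. The gene names, the per-organism homolog sets, and the function name gene_vectors are made up for illustration; only the idea of one 0/1 entry per gene in the pool comes from the slide.

```python
from typing import Dict, List, Set


def gene_vectors(pool: List[str], homologs: Dict[str, Set[str]]) -> Dict[str, List[int]]:
    """Map each organism to a 0/1 vector: 1 if a homolog of the pool gene was found."""
    return {
        organism: [1 if gene in found else 0 for gene in pool]
        for organism, found in homologs.items()
    }


# Toy gene pool and toy homolog sets (assumed values).
pool = ["g1", "g2", "g3", "g4"]
homologs = {
    "Drosophila": {"g1", "g4"},
    "Mouse": {"g2", "g3"},
    "Zebrafish": {"g4"},
}
vectors = gene_vectors(pool, homologs)
```

Every vector has the same length as the gene pool, which is what allows organisms to be compared entry by entry in the distance step.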
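Euclidean Distance
Vectors are in n-dimensional space (one dimension per gene in the pool)
Determine the Euclidean distance by taking the square root of the sum of the squared differences:
$$D_{ij} = \sqrt{(D_{i1}-D_{j1})^2 + \dots + (D_{in}-D_{jn})^2} = \sqrt{(1-0)^2 + (1-1)^2 + (0-1)^2 + \dots}$$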
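As a quick check of the formula, the sketch below computes the distance between two short vectors whose first three entries match the terms in the example, (1, 1, 0) and (0, 1, 1). Truncating the vectors to three entries and the function name euclidean are assumptions made for illustration.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Square root of the sum of squared entry-wise differences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# (1-0)^2 + (1-1)^2 + (0-1)^2 = 2, so the distance is sqrt(2).
d = euclidean([1, 1, 0], [0, 1, 1])
```

For 0/1 vectors the sum under the root simply counts the genes present in one organism but not the other.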
Distance Matrix
      O1    O2    O3    ...   On
O1    0
O2    D21   0
O3    D31   D32   0
...   ...   ...   ...   ...
On    Dn1   ...   ...   ...   0
Since Euclidean distances are symmetric (Dij = Dji), only the lower triangle of the matrix is needed.
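A lower-triangular matrix of this form can be built directly from the organism vectors. The sketch below uses math.dist from the Python standard library (Python 3.8 or later); the function name lower_triangular_distances and the toy vectors are assumptions for illustration.

```python
import math
from typing import Dict, List, Tuple


def lower_triangular_distances(
    vectors: Dict[str, List[int]],
) -> Tuple[List[str], List[List[float]]]:
    """Return organism names and rows [D_i1, ..., D_i(i-1), 0] of the lower triangle."""
    names = list(vectors)
    rows = []
    for i, name_i in enumerate(names):
        # Distances to all earlier organisms, then the zero on the diagonal.
        row = [math.dist(vectors[name_i], vectors[names[j]]) for j in range(i)]
        rows.append(row + [0.0])
    return names, rows


# Toy 0/1 vectors (assumed values).
vectors = {
    "Drosophila": [1, 0, 0, 1],
    "Mouse": [0, 1, 1, 0],
    "Zebrafish": [0, 0, 0, 1],
}
names, rows = lower_triangular_distances(vectors)
```

Row i holds i + 1 values, so N organisms need N(N+1)/2 stored entries instead of N squared.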
Step 3: Hierarchical Clustering
Hierarchical Clustering
There are two types of clustering:
  Successive fusions (agglomerative clustering)
  Separation (divisive clustering)
In this project, an agglomerative clustering algorithm is employed.
Idea: the most similar objects are grouped first. These groups are then merged according to their similarities until all are fused into a single cluster.
Input of the clustering program: any N x N triangular matrix D = {d_jk} containing the pairwise distances between the organisms.
Each clustering step takes the N x N matrix as input and outputs an (N-1) x (N-1) matrix. In this case (5 organisms), the result is a 4 x 4 matrix.
New distances are determined between the new group and each of the remaining organisms.
Continue clustering until all the organisms are fused into one cluster:
$$d_{(135)2} = \min\{d_{(35)2}, d_{12}\} = \min\{7, 9\} = 7$$
$$d_{(135)4} = \min\{d_{(35)4}, d_{14}\} = \min\{8, 6\} = 6$$
$$d_{(135)(24)} = \min\{d_{(135)2}, d_{(135)4}\} = \min\{7, 6\} = 6$$
The outputs of each run:
  The names of the organisms that are grouped together
  The distance between the two organisms (or groups) being merged
After N-1 iterations, the outputs are saved to a file and used to draw the phylogenetic tree.
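The minimum rule used in the worked example corresponds to single-linkage agglomerative clustering. The sketch below is a small illustration, not the project's implementation: the function name single_linkage is hypothetical, and the 5-organism distance table is assumed, with entries chosen so that the run reproduces the values shown above (d(35)2 = 7, d12 = 9, d(35)4 = 8, d14 = 6).

```python
from itertools import combinations
from typing import Dict, List, Tuple

Cluster = Tuple[str, ...]


def single_linkage(
    names: List[str], pair_dist: Dict[Tuple[str, str], float]
) -> List[Tuple[Cluster, Cluster, float]]:
    """Merge the two closest clusters N-1 times; return (cluster, cluster, distance)."""

    def leaf_d(a: str, b: str) -> float:
        # Symmetric lookup of the original pairwise distances.
        return pair_dist[(a, b)] if (a, b) in pair_dist else pair_dist[(b, a)]

    def cluster_d(x: Cluster, y: Cluster) -> float:
        # Single linkage: minimum distance over all pairs of leaves.
        return min(leaf_d(a, b) for a in x for b in y)

    clusters: List[Cluster] = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda p: cluster_d(*p))
        merges.append((x, y, cluster_d(x, y)))
        clusters = [c for c in clusters if c not in (x, y)] + [tuple(sorted(x + y))]
    return merges


# Assumed distances between organisms 1..5 (illustrative values only).
pair_dist = {
    ("1", "2"): 9, ("1", "3"): 3, ("1", "4"): 6, ("1", "5"): 11,
    ("2", "3"): 7, ("2", "4"): 5, ("2", "5"): 10,
    ("3", "4"): 9, ("3", "5"): 2, ("4", "5"): 8,
}
merges = single_linkage(["1", "2", "3", "4", "5"], pair_dist)
```

Each entry of merges pairs the groups that were fused with the distance at which they fused, which is the information the tree-drawing step needs.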
Step 4: Phylogenetic Tree Construction
Tree Construction Sample Input
Flat file of clusters and distances, e.g. sample1.txt:
A B 4.5
B C 5.2
E D 5.8
C E 12.4
Or, e.g. sample2.txt:
A B 4.5
A B C 5.2
E D 5.8
A B C E D 12.4
Tree Construction Sample Input (continued)
Requirements for the input file:
  Each line must represent one cluster
  The first entries are the leaves in the cluster
  The last entry is the distance
  No more than two new leaves can be added in a cluster
  Each entry must be delimited by a tab
Flexibility:
  A line can list all leaves in the cluster, or a new leaf and any leaf from a previous cluster
  The subroutine can be reused to generate a tree from any file by modifying one line of code
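The input rules can be expressed as a small reader that validates each line as it goes. This is a sketch in Python rather than the project's own reader; the function name parse_clusters is hypothetical, and the sample text is sample2.txt written with explicit tab delimiters.

```python
from typing import List, Tuple


def parse_clusters(text: str) -> List[Tuple[List[str], float]]:
    """Read tab-delimited lines of 'leaf ... leaf distance', one cluster per line."""
    clusters = []
    seen = set()
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        *leaves, dist = line.split("\t")
        # Enforce the rule that a cluster adds at most two new leaves.
        new_leaves = [leaf for leaf in leaves if leaf not in seen]
        if len(new_leaves) > 2:
            raise ValueError(f"line {line_no}: more than two new leaves {new_leaves}")
        seen.update(leaves)
        clusters.append((leaves, float(dist)))
    return clusters


sample2 = "A\tB\t4.5\nA\tB\tC\t5.2\nE\tD\t5.8\nA\tB\tC\tE\tD\t12.4\n"
clusters = parse_clusters(sample2)
```

The same reader accepts the sample1.txt layout, since a line such as "B C 5.2" also introduces only one new leaf.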
Tree Construction Method
Subclass intree
  Read in the file line by line
  Add new leaves to the Vector leaves
  Array elts tracks the number of leaves added to the vector in each cluster
  Array d tracks the distances between elements in the cluster
Subclass treed
  Convert distances to pixels
  Draw the tree and leaves in a JFrame
  Draw the distance scale in the JFrame
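The distance-to-pixel conversion is, at its simplest, a linear scaling. The sketch below shows only that arithmetic, in Python rather than in the project's drawing code; the function name distance_to_pixels, the choice of scaling by the largest distance, and the 620-pixel width are all assumptions.

```python
def distance_to_pixels(distance: float, max_distance: float, width_px: int) -> int:
    """Scale a distance linearly so that max_distance spans width_px pixels."""
    if max_distance <= 0:
        raise ValueError("max_distance must be positive")
    return round(distance / max_distance * width_px)


# With the largest distance in sample1.txt (12.4) mapped to an assumed 620 pixels,
# the first cluster distance (4.5) lands at 225 pixels.
x_first = distance_to_pixels(4.5, 12.4, 620)
x_last = distance_to_pixels(12.4, 12.4, 620)
```

The same scale factor can be reused to place the tick marks of the distance scale.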
Tree Construction Sample Output
[Figure: tree with leaves A, B, C, D, E and a distance scale marked 1 to 8]